diff mbox series

[v1] s390/virtio_ccw: don't allocate/assign airqs for non-existing queues

Message ID 20250402203621.940090-1-david@redhat.com (mailing list archive)
State New
Headers show
Series [v1] s390/virtio_ccw: don't allocate/assign airqs for non-existing queues | expand

Commit Message

David Hildenbrand April 2, 2025, 8:36 p.m. UTC
If we finds a vq without a name in our input array in
virtio_ccw_find_vqs(), we treat it as "non-existing" and set the vq pointer
to NULL; we will not call virtio_ccw_setup_vq() to allocate/setup a vq.

Consequently, we create only a queue if it actually exists (name != NULL)
and assign an incremental queue index to each such existing queue.

However, in virtio_ccw_register_adapter_ind()->get_airq_indicator() we
will not ignore these "non-existing queues", but instead assign an airq
indicator to them.

Besides never releasing them in virtio_ccw_drop_indicators() (because
there is no virtqueue), the bigger issue seems to be that there will be a
disagreement between the device and the Linux guest about the airq
indicator to be used for notifying a queue, because the indicator bit
for adapter I/O interrupt is derived from the queue index.

The virtio spec states under "Setting Up Two-Stage Queue Indicators":

	... indicator contains the guest address of an area wherein the
	indicators for the devices are contained, starting at bit_nr, one
	bit per virtqueue of the device.

And further in "Notification via Adapter I/O Interrupts":

	For notifying the driver of virtqueue buffers, the device sets the
	bit in the guest-provided indicator area at the corresponding
	offset.

For example, QEMU uses in virtio_ccw_notify() the queue index (passed as
"vector") to select the relevant indicator bit. If a queue does not exist,
it does not have a corresponding indicator bit assigned, because it
effectively doesn't have a queue index.

Using a virtio-balloon-ccw device under QEMU with free-page-hinting
disabled ("free-page-hint=off") but free-page-reporting enabled
("free-page-reporting=on") will result in free page reporting
not working as expected: in the virtio_balloon driver, we'll be stuck
forever in virtballoon_free_page_report()->wait_event(), because the
waitqueue will not be woken up as the notification from the device is
lost: it would use the wrong indicator bit.

Free page reporting stops working and we get splats (when configured to
detect hung wqs) like:

 INFO: task kworker/1:3:463 blocked for more than 61 seconds.
       Not tainted 6.14.0 #4
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kworker/1:3 [...]
 Workqueue: events page_reporting_process
 Call Trace:
  [<000002f404e6dfb2>] __schedule+0x402/0x1640
  [<000002f404e6f22e>] schedule+0x3e/0xe0
  [<000002f3846a88fa>] virtballoon_free_page_report+0xaa/0x110 [virtio_balloon]
  [<000002f40435c8a4>] page_reporting_process+0x2e4/0x740
  [<000002f403fd3ee2>] process_one_work+0x1c2/0x400
  [<000002f403fd4b96>] worker_thread+0x296/0x420
  [<000002f403fe10b4>] kthread+0x124/0x290
  [<000002f403f4e0dc>] __ret_from_fork+0x3c/0x60
  [<000002f404e77272>] ret_from_fork+0xa/0x38

There was recently a discussion [1] whether the "holes" should be
treated differently again, effectively assigning also non-existing
queues a queue index: that should also fix the issue, but requires other
workarounds to not break existing setups.

Let's fix it without affecting existing setups for now by properly ignoring
the non-existing queues, so the indicator bits will match the queue
indexes.

[1] https://lore.kernel.org/all/cover.1720611677.git.mst@redhat.com/

Fixes: a229989d975e ("virtio: don't allocate vqs when names[i] = NULL")
Reported-by: Chandra Merla <cmerla@redhat.com>
Cc: <Stable@vger.kernel.org>
Cc: Cornelia Huck <cohuck@redhat.com>
Cc: Thomas Huth <thuth@redhat.com>
Cc: Halil Pasic <pasic@linux.ibm.com>
Cc: Eric Farman <farman@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/s390/virtio/virtio_ccw.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

Comments

Thomas Huth April 3, 2025, 9:44 a.m. UTC | #1
On 02/04/2025 22.36, David Hildenbrand wrote:
> If we finds a vq without a name in our input array in
> virtio_ccw_find_vqs(), we treat it as "non-existing" and set the vq pointer
> to NULL; we will not call virtio_ccw_setup_vq() to allocate/setup a vq.
> 
> Consequently, we create only a queue if it actually exists (name != NULL)
> and assign an incremental queue index to each such existing queue.
> 
> However, in virtio_ccw_register_adapter_ind()->get_airq_indicator() we
> will not ignore these "non-existing queues", but instead assign an airq
> indicator to them.
> 
> Besides never releasing them in virtio_ccw_drop_indicators() (because
> there is no virtqueue), the bigger issue seems to be that there will be a
> disagreement between the device and the Linux guest about the airq
> indicator to be used for notifying a queue, because the indicator bit
> for adapter I/O interrupt is derived from the queue index.
> 
> The virtio spec states under "Setting Up Two-Stage Queue Indicators":
> 
> 	... indicator contains the guest address of an area wherein the
> 	indicators for the devices are contained, starting at bit_nr, one
> 	bit per virtqueue of the device.
> 
> And further in "Notification via Adapter I/O Interrupts":
> 
> 	For notifying the driver of virtqueue buffers, the device sets the
> 	bit in the guest-provided indicator area at the corresponding
> 	offset.
> 
> For example, QEMU uses in virtio_ccw_notify() the queue index (passed as
> "vector") to select the relevant indicator bit. If a queue does not exist,
> it does not have a corresponding indicator bit assigned, because it
> effectively doesn't have a queue index.
> 
> Using a virtio-balloon-ccw device under QEMU with free-page-hinting
> disabled ("free-page-hint=off") but free-page-reporting enabled
> ("free-page-reporting=on") will result in free page reporting
> not working as expected: in the virtio_balloon driver, we'll be stuck
> forever in virtballoon_free_page_report()->wait_event(), because the
> waitqueue will not be woken up as the notification from the device is
> lost: it would use the wrong indicator bit.
> 
> Free page reporting stops working and we get splats (when configured to
> detect hung wqs) like:
> 
>   INFO: task kworker/1:3:463 blocked for more than 61 seconds.
>         Not tainted 6.14.0 #4
>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>   task:kworker/1:3 [...]
>   Workqueue: events page_reporting_process
>   Call Trace:
>    [<000002f404e6dfb2>] __schedule+0x402/0x1640
>    [<000002f404e6f22e>] schedule+0x3e/0xe0
>    [<000002f3846a88fa>] virtballoon_free_page_report+0xaa/0x110 [virtio_balloon]
>    [<000002f40435c8a4>] page_reporting_process+0x2e4/0x740
>    [<000002f403fd3ee2>] process_one_work+0x1c2/0x400
>    [<000002f403fd4b96>] worker_thread+0x296/0x420
>    [<000002f403fe10b4>] kthread+0x124/0x290
>    [<000002f403f4e0dc>] __ret_from_fork+0x3c/0x60
>    [<000002f404e77272>] ret_from_fork+0xa/0x38
> 
> There was recently a discussion [1] whether the "holes" should be
> treated differently again, effectively assigning also non-existing
> queues a queue index: that should also fix the issue, but requires other
> workarounds to not break existing setups.
> 
> Let's fix it without affecting existing setups for now by properly ignoring
> the non-existing queues, so the indicator bits will match the queue
> indexes.
> 
> [1] https://lore.kernel.org/all/cover.1720611677.git.mst@redhat.com/
> 
> Fixes: a229989d975e ("virtio: don't allocate vqs when names[i] = NULL")
> Reported-by: Chandra Merla <cmerla@redhat.com>
> Cc: <Stable@vger.kernel.org>
> Cc: Cornelia Huck <cohuck@redhat.com>
> Cc: Thomas Huth <thuth@redhat.com>
> Cc: Halil Pasic <pasic@linux.ibm.com>
> Cc: Eric Farman <farman@linux.ibm.com>
> Cc: Heiko Carstens <hca@linux.ibm.com>
> Cc: Vasily Gorbik <gor@linux.ibm.com>
> Cc: Alexander Gordeev <agordeev@linux.ibm.com>
> Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
> Cc: Sven Schnelle <svens@linux.ibm.com>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>   drivers/s390/virtio/virtio_ccw.c | 16 ++++++++++++----
>   1 file changed, 12 insertions(+), 4 deletions(-)

Thanks for tackling this, David!

I tested the patch and the problems are gone now, indeed!

Tested-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Cornelia Huck April 3, 2025, 12:45 p.m. UTC | #2
On Wed, Apr 02 2025, David Hildenbrand <david@redhat.com> wrote:

> If we finds a vq without a name in our input array in
> virtio_ccw_find_vqs(), we treat it as "non-existing" and set the vq pointer
> to NULL; we will not call virtio_ccw_setup_vq() to allocate/setup a vq.
>
> Consequently, we create only a queue if it actually exists (name != NULL)
> and assign an incremental queue index to each such existing queue.
>
> However, in virtio_ccw_register_adapter_ind()->get_airq_indicator() we
> will not ignore these "non-existing queues", but instead assign an airq
> indicator to them.
>
> Besides never releasing them in virtio_ccw_drop_indicators() (because
> there is no virtqueue), the bigger issue seems to be that there will be a
> disagreement between the device and the Linux guest about the airq
> indicator to be used for notifying a queue, because the indicator bit
> for adapter I/O interrupt is derived from the queue index.
>
> The virtio spec states under "Setting Up Two-Stage Queue Indicators":
>
> 	... indicator contains the guest address of an area wherein the
> 	indicators for the devices are contained, starting at bit_nr, one
> 	bit per virtqueue of the device.
>
> And further in "Notification via Adapter I/O Interrupts":
>
> 	For notifying the driver of virtqueue buffers, the device sets the
> 	bit in the guest-provided indicator area at the corresponding
> 	offset.
>
> For example, QEMU uses in virtio_ccw_notify() the queue index (passed as
> "vector") to select the relevant indicator bit. If a queue does not exist,
> it does not have a corresponding indicator bit assigned, because it
> effectively doesn't have a queue index.
>
> Using a virtio-balloon-ccw device under QEMU with free-page-hinting
> disabled ("free-page-hint=off") but free-page-reporting enabled
> ("free-page-reporting=on") will result in free page reporting
> not working as expected: in the virtio_balloon driver, we'll be stuck
> forever in virtballoon_free_page_report()->wait_event(), because the
> waitqueue will not be woken up as the notification from the device is
> lost: it would use the wrong indicator bit.
>
> Free page reporting stops working and we get splats (when configured to
> detect hung wqs) like:
>
>  INFO: task kworker/1:3:463 blocked for more than 61 seconds.
>        Not tainted 6.14.0 #4
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>  task:kworker/1:3 [...]
>  Workqueue: events page_reporting_process
>  Call Trace:
>   [<000002f404e6dfb2>] __schedule+0x402/0x1640
>   [<000002f404e6f22e>] schedule+0x3e/0xe0
>   [<000002f3846a88fa>] virtballoon_free_page_report+0xaa/0x110 [virtio_balloon]
>   [<000002f40435c8a4>] page_reporting_process+0x2e4/0x740
>   [<000002f403fd3ee2>] process_one_work+0x1c2/0x400
>   [<000002f403fd4b96>] worker_thread+0x296/0x420
>   [<000002f403fe10b4>] kthread+0x124/0x290
>   [<000002f403f4e0dc>] __ret_from_fork+0x3c/0x60
>   [<000002f404e77272>] ret_from_fork+0xa/0x38
>
> There was recently a discussion [1] whether the "holes" should be
> treated differently again, effectively assigning also non-existing
> queues a queue index: that should also fix the issue, but requires other
> workarounds to not break existing setups.
>
> Let's fix it without affecting existing setups for now by properly ignoring
> the non-existing queues, so the indicator bits will match the queue
> indexes.
>
> [1] https://lore.kernel.org/all/cover.1720611677.git.mst@redhat.com/
>
> Fixes: a229989d975e ("virtio: don't allocate vqs when names[i] = NULL")
> Reported-by: Chandra Merla <cmerla@redhat.com>
> Cc: <Stable@vger.kernel.org>
> Cc: Cornelia Huck <cohuck@redhat.com>
> Cc: Thomas Huth <thuth@redhat.com>
> Cc: Halil Pasic <pasic@linux.ibm.com>
> Cc: Eric Farman <farman@linux.ibm.com>
> Cc: Heiko Carstens <hca@linux.ibm.com>
> Cc: Vasily Gorbik <gor@linux.ibm.com>
> Cc: Alexander Gordeev <agordeev@linux.ibm.com>
> Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
> Cc: Sven Schnelle <svens@linux.ibm.com>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  drivers/s390/virtio/virtio_ccw.c | 16 ++++++++++++----
>  1 file changed, 12 insertions(+), 4 deletions(-)

Reviewed-by: Cornelia Huck <cohuck@redhat.com>

[I assume that one of the IBM folks can simply pick this up?]
Michael S. Tsirkin April 3, 2025, 12:57 p.m. UTC | #3
On Wed, Apr 02, 2025 at 10:36:21PM +0200, David Hildenbrand wrote:
> If we finds a vq without a name in our input array in
> virtio_ccw_find_vqs(), we treat it as "non-existing" and set the vq pointer
> to NULL; we will not call virtio_ccw_setup_vq() to allocate/setup a vq.
> 
> Consequently, we create only a queue if it actually exists (name != NULL)
> and assign an incremental queue index to each such existing queue.
> 
> However, in virtio_ccw_register_adapter_ind()->get_airq_indicator() we
> will not ignore these "non-existing queues", but instead assign an airq
> indicator to them.
> 
> Besides never releasing them in virtio_ccw_drop_indicators() (because
> there is no virtqueue), the bigger issue seems to be that there will be a
> disagreement between the device and the Linux guest about the airq
> indicator to be used for notifying a queue, because the indicator bit
> for adapter I/O interrupt is derived from the queue index.
> 
> The virtio spec states under "Setting Up Two-Stage Queue Indicators":
> 
> 	... indicator contains the guest address of an area wherein the
> 	indicators for the devices are contained, starting at bit_nr, one
> 	bit per virtqueue of the device.
> 
> And further in "Notification via Adapter I/O Interrupts":
> 
> 	For notifying the driver of virtqueue buffers, the device sets the
> 	bit in the guest-provided indicator area at the corresponding
> 	offset.
> 
> For example, QEMU uses in virtio_ccw_notify() the queue index (passed as
> "vector") to select the relevant indicator bit. If a queue does not exist,
> it does not have a corresponding indicator bit assigned, because it
> effectively doesn't have a queue index.
> 
> Using a virtio-balloon-ccw device under QEMU with free-page-hinting
> disabled ("free-page-hint=off") but free-page-reporting enabled
> ("free-page-reporting=on") will result in free page reporting
> not working as expected: in the virtio_balloon driver, we'll be stuck
> forever in virtballoon_free_page_report()->wait_event(), because the
> waitqueue will not be woken up as the notification from the device is
> lost: it would use the wrong indicator bit.
> 
> Free page reporting stops working and we get splats (when configured to
> detect hung wqs) like:
> 
>  INFO: task kworker/1:3:463 blocked for more than 61 seconds.
>        Not tainted 6.14.0 #4
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>  task:kworker/1:3 [...]
>  Workqueue: events page_reporting_process
>  Call Trace:
>   [<000002f404e6dfb2>] __schedule+0x402/0x1640
>   [<000002f404e6f22e>] schedule+0x3e/0xe0
>   [<000002f3846a88fa>] virtballoon_free_page_report+0xaa/0x110 [virtio_balloon]
>   [<000002f40435c8a4>] page_reporting_process+0x2e4/0x740
>   [<000002f403fd3ee2>] process_one_work+0x1c2/0x400
>   [<000002f403fd4b96>] worker_thread+0x296/0x420
>   [<000002f403fe10b4>] kthread+0x124/0x290
>   [<000002f403f4e0dc>] __ret_from_fork+0x3c/0x60
>   [<000002f404e77272>] ret_from_fork+0xa/0x38
> 
> There was recently a discussion [1] whether the "holes" should be
> treated differently again, effectively assigning also non-existing
> queues a queue index: that should also fix the issue, but requires other
> workarounds to not break existing setups.
> 
> Let's fix it without affecting existing setups for now by properly ignoring
> the non-existing queues, so the indicator bits will match the queue
> indexes.
> 
> [1] https://lore.kernel.org/all/cover.1720611677.git.mst@redhat.com/
> 
> Fixes: a229989d975e ("virtio: don't allocate vqs when names[i] = NULL")
> Reported-by: Chandra Merla <cmerla@redhat.com>
> Cc: <Stable@vger.kernel.org>
> Cc: Cornelia Huck <cohuck@redhat.com>
> Cc: Thomas Huth <thuth@redhat.com>
> Cc: Halil Pasic <pasic@linux.ibm.com>
> Cc: Eric Farman <farman@linux.ibm.com>
> Cc: Heiko Carstens <hca@linux.ibm.com>
> Cc: Vasily Gorbik <gor@linux.ibm.com>
> Cc: Alexander Gordeev <agordeev@linux.ibm.com>
> Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
> Cc: Sven Schnelle <svens@linux.ibm.com>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: Michael S. Tsirkin <mst@redhat.com>


feel free to merge.

> ---
>  drivers/s390/virtio/virtio_ccw.c | 16 ++++++++++++----
>  1 file changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/s390/virtio/virtio_ccw.c b/drivers/s390/virtio/virtio_ccw.c
> index 21fa7ac849e5c..4904b831c0a75 100644
> --- a/drivers/s390/virtio/virtio_ccw.c
> +++ b/drivers/s390/virtio/virtio_ccw.c
> @@ -302,11 +302,17 @@ static struct airq_info *new_airq_info(int index)
>  static unsigned long *get_airq_indicator(struct virtqueue *vqs[], int nvqs,
>  					 u64 *first, void **airq_info)
>  {
> -	int i, j;
> +	int i, j, queue_idx, highest_queue_idx = -1;
>  	struct airq_info *info;
>  	unsigned long *indicator_addr = NULL;
>  	unsigned long bit, flags;
>  
> +	/* Array entries without an actual queue pointer must be ignored. */
> +	for (i = 0; i < nvqs; i++) {
> +		if (vqs[i])
> +			highest_queue_idx++;
> +	}
> +
>  	for (i = 0; i < MAX_AIRQ_AREAS && !indicator_addr; i++) {
>  		mutex_lock(&airq_areas_lock);
>  		if (!airq_areas[i])
> @@ -316,7 +322,7 @@ static unsigned long *get_airq_indicator(struct virtqueue *vqs[], int nvqs,
>  		if (!info)
>  			return NULL;
>  		write_lock_irqsave(&info->lock, flags);
> -		bit = airq_iv_alloc(info->aiv, nvqs);
> +		bit = airq_iv_alloc(info->aiv, highest_queue_idx + 1);
>  		if (bit == -1UL) {
>  			/* Not enough vacancies. */
>  			write_unlock_irqrestore(&info->lock, flags);
> @@ -325,8 +331,10 @@ static unsigned long *get_airq_indicator(struct virtqueue *vqs[], int nvqs,
>  		*first = bit;
>  		*airq_info = info;
>  		indicator_addr = info->aiv->vector;
> -		for (j = 0; j < nvqs; j++) {
> -			airq_iv_set_ptr(info->aiv, bit + j,
> +		for (j = 0, queue_idx = 0; j < nvqs; j++) {
> +			if (!vqs[j])
> +				continue;
> +			airq_iv_set_ptr(info->aiv, bit + queue_idx++,
>  					(unsigned long)vqs[j]);
>  		}
>  		write_unlock_irqrestore(&info->lock, flags);
> -- 
> 2.48.1
Christian Borntraeger April 3, 2025, 1:12 p.m. UTC | #4
Am 02.04.25 um 22:36 schrieb David Hildenbrand:
[...]>
> Fixes: a229989d975e ("virtio: don't allocate vqs when names[i] = NULL")
> Reported-by: Chandra Merla <cmerla@redhat.com>
> Cc: <Stable@vger.kernel.org>
> Cc: Cornelia Huck <cohuck@redhat.com>
> Cc: Thomas Huth <thuth@redhat.com>
> Cc: Halil Pasic <pasic@linux.ibm.com>
> Cc: Eric Farman <farman@linux.ibm.com>
> Cc: Heiko Carstens <hca@linux.ibm.com>
> Cc: Vasily Gorbik <gor@linux.ibm.com>
> Cc: Alexander Gordeev <agordeev@linux.ibm.com>
> Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
> Cc: Sven Schnelle <svens@linux.ibm.com>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Wei Wang <wei.w.wang@intel.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>

I will apply this to our internal s390 tree and let it sit for a
day to get CI coverage.

Alexander, Vasily, Heiko,
please then schedule for the s390/fixes branch if there is no CI fallout.
Halil Pasic April 3, 2025, 2:18 p.m. UTC | #5
On Wed,  2 Apr 2025 22:36:21 +0200
David Hildenbrand <david@redhat.com> wrote:

> If we finds a vq without a name in our input array in
> virtio_ccw_find_vqs(), we treat it as "non-existing" and set the vq pointer
> to NULL; we will not call virtio_ccw_setup_vq() to allocate/setup a vq.
> 
> Consequently, we create only a queue if it actually exists (name != NULL)
> and assign an incremental queue index to each such existing queue.

First and foremost: thank you for addressing this! I have to admit, I'm
still plagued by some cognitive dissonance here. Please bear with me.

For starters the commit message of a229989d975e ("virtio: don't
allocate vqs when names[i] = NULL") goes like this:
"""
    virtio: don't allocate vqs when names[i] = NULL
    
    Some vqs may not need to be allocated when their related feature bits
    are disabled. So callers may pass in such vqs with "names = NULL".
    Then we skip such vq allocations.
"""

In my reading it does not talk about "non-existent" queues, but queues
that do not need to be allocated. This could make sense for something
like virtio-net where controlq 2N is with N being max_virtqueue_pairs.

I guess for the guest it could make sense to not set up some of the
queues initially, but those, I guess would be perfectly existent queues
spec-wise and we would expect the index of controlq being 2N. And the
queues that don't get set up initially can get set up later. At least
this is my naive understanding at the moment.

Now apparently there is a different case where queues may or may not
exist, but we would, for some reason like to have the non-existent
queues in the array, because for an other set of features negotiated
those queues would actually exist and occupy and index. Frankly
I don't fully comprehend it at the moment, but I will have another look
at the code and at the spec.

So lookign at the spec for virtio-ballon I see:



5.5.2 Virtqueues

0
    inflateq 
1
    deflateq 
2
    statsq 
3
    free_page_vq 
4
    reporting_vq

statsq only exists if VIRTIO_BALLOON_F_STATS_VQ is set.

free_page_vq only exists if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set.

reporting_vq only exists if VIRTIO_BALLOON_F_PAGE_REPORTING is set. 

Which is IMHO weird.  I used to think about the number in front of the
name as the virtqueue index. But based on this patch I'm wondering if
that is compatible with the approach of this patch.

What does for example mean if we have VIRTIO_BALLOON_F_STATS_VQ not
offered, VIRTIO_BALLOON_F_FREE_PAGE_HINT offered but not negotiated
and VIRTIO_BALLOON_F_PAGE_REPORTING negotiated.

One reading of the things is that statq is does not exist for sure,
free_page_vq is a little tricky because "is set" is not precise enough,
and reporting_vq exists for sure. And in your reading of the spec, if
I understood you correctly and we assume that free_page_vq does not
exist, it has index 2. But I from the top of my head, I don't know why
interpreting the spec like it reporting_vq has index 4 and indexes 2
and 3 are not mapped to existing-queues would be considered wrong.

And even if we do want reportig_vq to have index 2, the virtio-balloon
code could still give us an array where reportig_vq is at index 2. Why
not?

Sorry this ended up being a very long rant. the bottom line is that, I
lack conceptual clarity on where the problem exactly is and how it needs
to be addressed. I would like to understand this properly before moving
forward.

[..]
> 
> There was recently a discussion [1] whether the "holes" should be
> treated differently again, effectively assigning also non-existing
> queues a queue index: that should also fix the issue, but requires other
> workarounds to not break existing setups.
> 

Sorry I have to have a look at that discussion. Maybe it will answer
some my questions.

> Let's fix it without affecting existing setups for now by properly ignoring
> the non-existing queues, so the indicator bits will match the queue
> indexes.

Just one question. My understanding is that the crux is that Linux
and QEMU (or the driver and the device) disagree at which index
reporting_vq is actually sitting. Is that right?

Regards,
Halil
David Hildenbrand April 3, 2025, 2:28 p.m. UTC | #6
On 03.04.25 16:18, Halil Pasic wrote:
> On Wed,  2 Apr 2025 22:36:21 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>> If we finds a vq without a name in our input array in
>> virtio_ccw_find_vqs(), we treat it as "non-existing" and set the vq pointer
>> to NULL; we will not call virtio_ccw_setup_vq() to allocate/setup a vq.
>>
>> Consequently, we create only a queue if it actually exists (name != NULL)
>> and assign an incremental queue index to each such existing queue.
> 
> First and foremost: thank you for addressing this! I have to admit, I'm
> still plagued by some cognitive dissonance here. Please bear with me.
> 
> For starters the commit message of a229989d975e ("virtio: don't
> allocate vqs when names[i] = NULL") goes like this:
> """
>      virtio: don't allocate vqs when names[i] = NULL
>      
>      Some vqs may not need to be allocated when their related feature bits
>      are disabled. So callers may pass in such vqs with "names = NULL".
>      Then we skip such vq allocations.
> """
> 
> In my reading it does not talk about "non-existent" queues, but queues
> that do not need to be allocated. This could make sense for something
> like virtio-net where controlq 2N is with N being max_virtqueue_pairs.
> 
> I guess for the guest it could make sense to not set up some of the
> queues initially, but those, I guess would be perfectly existent queues
> spec-wise and we would expect the index of controlq being 2N. And the
> queues that don't get set up initially can get set up later. At least
> this is my naive understanding at the moment.
> 
> Now apparently there is a different case where queues may or may not
> exist, but we would, for some reason like to have the non-existent
> queues in the array, because for an other set of features negotiated
> those queues would actually exist and occupy and index. Frankly
> I don't fully comprehend it at the moment, but I will have another look
> at the code and at the spec.
> 
> So lookign at the spec for virtio-ballon I see:
> 
> 
> 
> 5.5.2 Virtqueues
> 
> 0
>      inflateq
> 1
>      deflateq
> 2
>      statsq
> 3
>      free_page_vq
> 4
>      reporting_vq
> 
> statsq only exists if VIRTIO_BALLOON_F_STATS_VQ is set.
> 
> free_page_vq only exists if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set.
> 
> reporting_vq only exists if VIRTIO_BALLOON_F_PAGE_REPORTING is set.
> 
> Which is IMHO weird.  I used to think about the number in front of the
> name as the virtqueue index. But based on this patch I'm wondering if
> that is compatible with the approach of this patch.
> 
> What does for example mean if we have VIRTIO_BALLOON_F_STATS_VQ not
> offered, VIRTIO_BALLOON_F_FREE_PAGE_HINT offered but not negotiated
> and VIRTIO_BALLOON_F_PAGE_REPORTING negotiated.
> 
> One reading of the things is that statq is does not exist for sure,
> free_page_vq is a little tricky because "is set" is not precise enough,
> and reporting_vq exists for sure. And in your reading of the spec, if
> I understood you correctly and we assume that free_page_vq does not
> exist, it has index 2. But I from the top of my head, I don't know why
> interpreting the spec like it reporting_vq has index 4 and indexes 2
> and 3 are not mapped to existing-queues would be considered wrong.
> 
> And even if we do want reportig_vq to have index 2, the virtio-balloon
> code could still give us an array where reportig_vq is at index 2. Why
> not?
> 
> Sorry this ended up being a very long rant. the bottom line is that, I
> lack conceptual clarity on where the problem exactly is and how it needs
> to be addressed. I would like to understand this properly before moving
> forward.
> 

I would suggest you take a look at [1] I added below, and the disconnect 
between the spec and what Linux + QEMU actually implemented.

In reality (with QEMU), reporting_vq sits either on index 3 or 4, 
depending on the existence of VIRTIO_BALLOON_F_FREE_PAGE_HINT.


> [..]
>>
>> There was recently a discussion [1] whether the "holes" should be
>> treated differently again, effectively assigning also non-existing
>> queues a queue index: that should also fix the issue, but requires other
>> workarounds to not break existing setups.
>>
> 
> Sorry I have to have a look at that discussion. Maybe it will answer
> some my questions.

Yes, I think so.

> 
>> Let's fix it without affecting existing setups for now by properly ignoring
>> the non-existing queues, so the indicator bits will match the queue
>> indexes.
> 
> Just one question. My understanding is that the crux is that Linux
> and QEMU (or the driver and the device) disagree at which index
> reporting_vq is actually sitting. Is that right?

I thought I made it clear: this is only about the airq indicator bit. 
That's where both disagree.

Not the actual queue index (see above).
Michael S. Tsirkin April 3, 2025, 2:35 p.m. UTC | #7
On Thu, Apr 03, 2025 at 04:18:36PM +0200, Halil Pasic wrote:
> On Wed,  2 Apr 2025 22:36:21 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
> > If we finds a vq without a name in our input array in
> > virtio_ccw_find_vqs(), we treat it as "non-existing" and set the vq pointer
> > to NULL; we will not call virtio_ccw_setup_vq() to allocate/setup a vq.
> > 
> > Consequently, we create only a queue if it actually exists (name != NULL)
> > and assign an incremental queue index to each such existing queue.
> 
> First and foremost: thank you for addressing this! I have to admit, I'm
> still plagued by some cognitive dissonance here. Please bear with me.
> 
> For starters the commit message of a229989d975e ("virtio: don't
> allocate vqs when names[i] = NULL") goes like this:
> """
>     virtio: don't allocate vqs when names[i] = NULL
>     
>     Some vqs may not need to be allocated when their related feature bits
>     are disabled. So callers may pass in such vqs with "names = NULL".
>     Then we skip such vq allocations.
> """
> 
> In my reading it does not talk about "non-existent" queues, but queues
> that do not need to be allocated. This could make sense for something
> like virtio-net where controlq 2N is with N being max_virtqueue_pairs.
> 
> I guess for the guest it could make sense to not set up some of the
> queues initially, but those, I guess would be perfectly existent queues
> spec-wise and we would expect the index of controlq being 2N. And the
> queues that don't get set up initially can get set up later. At least
> this is my naive understanding at the moment.
> 
> Now apparently there is a different case where queues may or may not
> exist, but we would, for some reason like to have the non-existent
> queues in the array, because for an other set of features negotiated
> those queues would actually exist and occupy and index. Frankly
> I don't fully comprehend it at the moment, but I will have another look
> at the code and at the spec.
> 
> So lookign at the spec for virtio-ballon I see:
> 
> 
> 
> 5.5.2 Virtqueues
> 
> 0
>     inflateq 
> 1
>     deflateq 
> 2
>     statsq 
> 3
>     free_page_vq 
> 4
>     reporting_vq

Indeed. Unfortunately, no one at all implemented this properly as
per spec :(.

Balloon is the worst offender here but other devices are broken
too in some configurations.


Given it has been like this for many years I'm inclined in this
instance to fix the spec not the implementations.




> statsq only exists if VIRTIO_BALLOON_F_STATS_VQ is set.
> 
> free_page_vq only exists if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set.
> 
> reporting_vq only exists if VIRTIO_BALLOON_F_PAGE_REPORTING is set. 
> 
> Which is IMHO weird.  I used to think about the number in front of the
> name as the virtqueue index. But based on this patch I'm wondering if
> that is compatible with the approach of this patch.
> 
> What does for example mean if we have VIRTIO_BALLOON_F_STATS_VQ not
> offered, VIRTIO_BALLOON_F_FREE_PAGE_HINT offered but not negotiated
> and VIRTIO_BALLOON_F_PAGE_REPORTING negotiated.
> 
> One reading of the things is that statq is does not exist for sure,
> free_page_vq is a little tricky because "is set" is not precise enough,
> and reporting_vq exists for sure. And in your reading of the spec, if
> I understood you correctly and we assume that free_page_vq does not
> exist, it has index 2. But I from the top of my head, I don't know why
> interpreting the spec like it reporting_vq has index 4 and indexes 2
> and 3 are not mapped to existing-queues would be considered wrong.
> 
> And even if we do want reportig_vq to have index 2, the virtio-balloon
> code could still give us an array where reportig_vq is at index 2. Why
> not?
> 
> Sorry this ended up being a very long rant. the bottom line is that, I
> lack conceptual clarity on where the problem exactly is and how it needs
> to be addressed. I would like to understand this properly before moving
> forward.
> 
> [..]
> > 
> > There was recently a discussion [1] whether the "holes" should be
> > treated differently again, effectively assigning also non-existing
> > queues a queue index: that should also fix the issue, but requires other
> > workarounds to not break existing setups.


Yep. I tried that, and it is basically a terrible mess :(.


> > 
> 
> Sorry I have to have a look at that discussion. Maybe it will answer
> some my questions.
> 
> > Let's fix it without affecting existing setups for now by properly ignoring
> > the non-existing queues, so the indicator bits will match the queue
> > indexes.
> 
> Just one question. My understanding is that the crux is that Linux
> and QEMU (or the driver and the device) disagree at which index
> reporting_vq is actually sitting. Is that right?
> 
> Regards,


> Halil
Halil Pasic April 4, 2025, 4:02 a.m. UTC | #8
On Thu, 3 Apr 2025 10:35:33 -0400
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Thu, Apr 03, 2025 at 04:18:36PM +0200, Halil Pasic wrote:
> > On Wed,  2 Apr 2025 22:36:21 +0200
[..]
> > 
> > 5.5.2 Virtqueues
> > 
> > 0
> >     inflateq 
> > 1
> >     deflateq 
> > 2
> >     statsq 
> > 3
> >     free_page_vq 
> > 4
> >     reporting_vq  
> 
> Indeed. Unfortunately, no one at all implemented this properly as
> per spec :(.
> 
> Balloon is the worst offender here but other devices are broken
> too in some configurations.
> 
> 
> Given it has been like this for many years I'm inclined in this
> instance to fix the spec not the implementations.
> 

I understand! For me at least knowing if we are going to change the
spec or the implementations is pretty important. It would probably
make sense to spin a patch for the spec, just for the unlikely case that
somebody did end up implementing this by the spec and wants to protest.

I think, a consequence of this design is that all queues need to be
created and allocated at initialization time.

I mean in the spec we have something like 
"""
5.1.6.5.6.1 Driver Requirements: Automatic receive steering in multiqueue mode
The driver MUST configure the virtqueues before enabling them with the VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET command. 
"""

For example one could want to hotplug 2 more vCPUs and still have a
queue-pair per cpu (and a controlq). Those 2 extra queue-pairs could
in theory be allocated on-demand instead of having to allocate memory
for the rings for all the queues corresponding to the maxed out setup.
I've had a look at the Linux virtio-net and it does allocate all the
queues up front.

Also with this design, I believe we would effectively prohibit "holes".
Now if holes are prohibited, IMHO it does not make a whole lot of
sense for the virtio_find_vqs() to support holes. Because the holes
and a fair amount of complexity. Of course that would make sense as a
cleanup on top.

Regards,
Halil
Halil Pasic April 4, 2025, 4:36 a.m. UTC | #9
On Thu, 3 Apr 2025 16:28:31 +0200
David Hildenbrand <david@redhat.com> wrote:

> > Sorry I have to have a look at that discussion. Maybe it will answer
> > some my questions.  
> 
> Yes, I think so.
> 
> >   
> >> Let's fix it without affecting existing setups for now by properly
> >> ignoring the non-existing queues, so the indicator bits will match
> >> the queue indexes.  
> > 
> > Just one question. My understanding is that the crux is that Linux
> > and QEMU (or the driver and the device) disagree at which index
> > reporting_vq is actually sitting. Is that right?  
> 
> I thought I made it clear: this is only about the airq indicator bit. 
> That's where both disagree.
> 
> Not the actual queue index (see above).

I did some more research including having a look at that discussion. Let
me try to sum up how did we end up here.

Before commit a229989d975e ("virtio: don't allocate vqs when names[i] =
NULL") the kernel behavior used to be in spec, but QEMU and possibly
other hypervisor were out of spec and things did not work.

Possibly because of the complexity of fixing the hypervisor(s) commit
a229989d975e ("virtio: don't allocate vqs when names[i] = NULL") opted
for changing the guest side so that it does not fit the spec but fits
the hypervisor(s). It unfortunately also broke notifiers (for the with
holes) scenario for virtio-ccw only.

Now we had another look at this, and have concluded that fixing the
hypervisor(s) and fixing the kernel, and making sure that the fixed
kernel can tolerate the old broken hypervisor(s) is way to complicated
if possible at all. So we decided to give the spec a reality check and
fix the notifier bit assignment for virtio-ccw which is broken beyond
doubt if we accept that the correct virtqueue index is the one that the
hypervisor(s) use and not the one that the spec says they should use.

With the spec fixed, the whole notion of "holes" will be something that
does not make sense any more. With that the merit of the kernel interface
virtio_find_vqs() supporting "holes" is quite questionable. Now we need
it because the drivers within the Linux kernel still think of the queues
in terms of the current spec, i.e. they try to have the "holes" as
mandated by the spec, and the duty of making it work with the broken
device implementations falls to the transports.

Under the assumption that the spec is indeed going to be fixed:

Reviewed-by: Halil Pasic <pasic@linux.ibm.com>

Regards,
Halil
Michael S. Tsirkin April 4, 2025, 5:33 a.m. UTC | #10
On Fri, Apr 04, 2025 at 06:02:04AM +0200, Halil Pasic wrote:
> On Thu, 3 Apr 2025 10:35:33 -0400
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Thu, Apr 03, 2025 at 04:18:36PM +0200, Halil Pasic wrote:
> > > On Wed,  2 Apr 2025 22:36:21 +0200
> [..]
> > > 
> > > 5.5.2 Virtqueues
> > > 
> > > 0
> > >     inflateq 
> > > 1
> > >     deflateq 
> > > 2
> > >     statsq 
> > > 3
> > >     free_page_vq 
> > > 4
> > >     reporting_vq  
> > 
> > Indeed. Unfortunately, no one at all implemented this properly as
> > per spec :(.
> > 
> > Balloon is the worst offender here but other devices are broken
> > too in some configurations.
> > 
> > 
> > Given it has been like this for many years I'm inclined in this
> > instance to fix the spec not the implementations.
> > 
> 
> I understand! For me at least knowing if we are going to change the
> spec or the implementations is pretty important. It would probably
> make sense to spin a patch for the spec, just for the unlikely case that
> somebody did end up implementing this by the spec and wants to protest.
> 
> I think, a consequence of this design is that all queues need to be
> created and allocated at initialization time.

Why? after feature negotiation.

> I mean in the spec we have something like 
> """
> 5.1.6.5.6.1 Driver Requirements: Automatic receive steering in multiqueue mode
> The driver MUST configure the virtqueues before enabling them with the VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET command. 
> """
> 
> For example one could want to hotplug 2 more vCPUs and still have a
> queue-pair per cpu (and a controlq). Those 2 extra queue-pairs could
> in theory be allocated on-demand instead of having to allocate memory
> for the rings for all the queues corresponding to the maxed out setup.
> I've had a look at the Linux virtio-net and it does allocate all the
> queues up front.
> 
> Also with this design, I believe we would effectively prohibit "holes".

For existing devices at least.

> Now if holes are prohibited, IMHO it does not make a whole lot of
> sense for the virtio_find_vqs() to support holes. Because the holes
> and a fair amount of complexity. Of course that would make sense as a
> cleanup on top.
> 
> Regards,
> Halil
David Hildenbrand April 4, 2025, 10 a.m. UTC | #11
On 04.04.25 06:36, Halil Pasic wrote:
> On Thu, 3 Apr 2025 16:28:31 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>>> Sorry I have to have a look at that discussion. Maybe it will answer
>>> some my questions.
>>
>> Yes, I think so.
>>
>>>    
>>>> Let's fix it without affecting existing setups for now by properly
>>>> ignoring the non-existing queues, so the indicator bits will match
>>>> the queue indexes.
>>>
>>> Just one question. My understanding is that the crux is that Linux
>>> and QEMU (or the driver and the device) disagree at which index
>>> reporting_vq is actually sitting. Is that right?
>>
>> I thought I made it clear: this is only about the airq indicator bit.
>> That's where both disagree.
>>
>> Not the actual queue index (see above).
> 
> I did some more research including having a look at that discussion. Let
> me try to sum up how did we end up here.

Let me add some more details after digging as well:

> 
> Before commit a229989d975e ("virtio: don't allocate vqs when names[i] =
> NULL") the kernel behavior used to be in spec, but QEMU and possibly
> other hypervisor were out of spec and things did not work.

It all started with VIRTIO_BALLOON_F_FREE_PAGE_HINT. Before that,
we only had the single optional VIRTIO_BALLOON_F_STATS_VQ queue at the very
end. So there was no possibility for holes "in-between".

In the Linux driver, we created the stats queue only if the feature bit
VIRTIO_BALLOON_F_STATS_VQ was actually around:

	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);

That changed with VIRTIO_BALLOON_F_FREE_PAGE_HINT, because we would
unconditionally create 4 queues. QEMU always supported the first 3 queues
unconditionally, but old QEMU did obviously not support the (new)
VIRTIO_BALLOON_F_FREE_PAGE_HINT queue.

390x didn't particularly like getting queried for non-existing
queues. [1] So the fix was not for a hypervisor that was out of spec, but
because quering non-existing queues didn't work.

The fix implied that if VIRTIO_BALLOON_F_STATS_VQ is missing, suddenly the queue
index of VIRTIO_BALLOON_F_FREE_PAGE_HINT changed as well.

Again, as QEMU always implemented the 3 first queues unconditionally, this was
not a problem.

[1] https://lore.kernel.org/all/c6746307-fae5-7652-af8d-19f560fc31d9@de.ibm.com/#t

> 
> Possibly because of the complexity of fixing the hypervisor(s) commit
> a229989d975e ("virtio: don't allocate vqs when names[i] = NULL") opted
> for changing the guest side so that it does not fit the spec but fits
> the hypervisor(s). It unfortunately also broke notifiers (for the with
> holes) scenario for virtio-ccw only.

Yes, it broke the notifiers.

But note that everything was in spec at that point, because we only documented
"free_page_vq == 3" in the spec *2 years later*, in 2020:

commit 38448268eba0c105200d131c3f7f660129a4d673
Author: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Date:   Tue Aug 25 07:45:02 2020 -0700

     content: Document balloon feature free page hints
     
     Free page hints allow the balloon driver to provide information on what
     pages are not currently in use so that we can avoid the cost of copying
     them in migration scenarios. Add a feature description for free page hints
     describing basic functioning and requirements.
     
At that point, what we documented in the spec *did not match reality* in
Linux. QEMU was fully compatible, because VIRTIO_BALLOON_F_STATS_VQ is
unconditionally set.


QEMU and Linux kept using that queue index assignment model, and the spec
was wrong (out of sync?) at that point. The spec got more wrong with

commit d917d4a8d552c003e046b0e3b1b529d98f7e695b
Author: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Date:   Tue Aug 25 07:45:17 2020 -0700

     content: Document balloon feature free page reporting
     
     Free page reporting is a feature that allows the guest to proactively
     report unused pages to the host. By making use of this feature is is
     possible to reduce the overall memory footprint of the guest in cases where
     some significant portion of the memory is idle. Add documentation for the
     free page reporting feature describing the functionality and requirements.

Where we documented VIRTIO_BALLOON_F_REPORTING after the changes were added to
QEMU+Linux implementation, so the spec did not reflect reality.

I'll note also cloud-hypervisor [2] today follows that model.

In particular, it *only* supports VIRTIO_BALLOON_F_REPORTING, turning
the queue index of VIRTIO_BALLOON_F_REPORTING into *2* instead of documented
in the spec to be *4*.

So in reality, we can see VIRTIO_BALLOON_F_REPORTING to be either 2/3/4, depending
on the availability of the other two features/queues.

[2] https://github.com/cloud-hypervisor/cloud-hypervisor/blob/main/virtio-devices/src/balloon.rs


> 
> Now we had another look at this, and have concluded that fixing the
> hypervisor(s) and fixing the kernel, and making sure that the fixed
> kernel can tolerate the old broken hypervisor(s) is way to complicated
> if possible at all. So we decided to give the spec a reality check and
> fix the notifier bit assignment for virtio-ccw which is broken beyond
> doubt if we accept that the correct virtqueue index is the one that the
> hypervisor(s) use and not the one that the spec says they should use.

In case of virtio-balloon, it's unfortunate that it went that way, but the
spec simply did not / does not reflect reality when it was added to the spec.

> 
> With the spec fixed, the whole notion of "holes" will be something that
> does not make sense any more. With that the merit of the kernel interface
> virtio_find_vqs() supporting "holes" is quite questionable. Now we need
> it because the drivers within the Linux kernel still think of the queues
> in terms of the current spec, i.e. they try to have the "holes" as
> mandated by the spec, and the duty of making it work with the broken
> device implementations falls to the transports.
> 

Right, the "holes" only exist in the input array.

> Under the assumption that the spec is indeed going to be fixed:
> 
> Reviewed-by: Halil Pasic <pasic@linux.ibm.com>

Thanks!
David Hildenbrand April 4, 2025, 10:55 a.m. UTC | #12
On 04.04.25 12:00, David Hildenbrand wrote:
> On 04.04.25 06:36, Halil Pasic wrote:
>> On Thu, 3 Apr 2025 16:28:31 +0200
>> David Hildenbrand <david@redhat.com> wrote:
>>
>>>> Sorry I have to have a look at that discussion. Maybe it will answer
>>>> some my questions.
>>>
>>> Yes, I think so.
>>>
>>>>     
>>>>> Let's fix it without affecting existing setups for now by properly
>>>>> ignoring the non-existing queues, so the indicator bits will match
>>>>> the queue indexes.
>>>>
>>>> Just one question. My understanding is that the crux is that Linux
>>>> and QEMU (or the driver and the device) disagree at which index
>>>> reporting_vq is actually sitting. Is that right?
>>>
>>> I thought I made it clear: this is only about the airq indicator bit.
>>> That's where both disagree.
>>>
>>> Not the actual queue index (see above).
>>
>> I did some more research including having a look at that discussion. Let
>> me try to sum up how did we end up here.
> 
> Let me add some more details after digging as well:
> 
>>
>> Before commit a229989d975e ("virtio: don't allocate vqs when names[i] =
>> NULL") the kernel behavior used to be in spec, but QEMU and possibly
>> other hypervisor were out of spec and things did not work.
> 
> It all started with VIRTIO_BALLOON_F_FREE_PAGE_HINT. Before that,
> we only had the single optional VIRTIO_BALLOON_F_STATS_VQ queue at the very
> end. So there was no possibility for holes "in-between".
> 
> In the Linux driver, we created the stats queue only if the feature bit
> VIRTIO_BALLOON_F_STATS_VQ was actually around:
> 
> 	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
> 	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
> 
> That changed with VIRTIO_BALLOON_F_FREE_PAGE_HINT, because we would
> unconditionally create 4 queues. QEMU always supported the first 3 queues
> unconditionally, but old QEMU did obviously not support the (new)
> VIRTIO_BALLOON_F_FREE_PAGE_HINT queue.
> 
> 390x didn't particularly like getting queried for non-existing
> queues. [1] So the fix was not for a hypervisor that was out of spec, but
> because quering non-existing queues didn't work.
> 
> The fix implied that if VIRTIO_BALLOON_F_STATS_VQ is missing, suddenly the queue
> index of VIRTIO_BALLOON_F_FREE_PAGE_HINT changed as well.
> 
> Again, as QEMU always implemented the 3 first queues unconditionally, this was
> not a problem.
> 
> [1] https://lore.kernel.org/all/c6746307-fae5-7652-af8d-19f560fc31d9@de.ibm.com/#t
> 
>>
>> Possibly because of the complexity of fixing the hypervisor(s) commit
>> a229989d975e ("virtio: don't allocate vqs when names[i] = NULL") opted
>> for changing the guest side so that it does not fit the spec but fits
>> the hypervisor(s). It unfortunately also broke notifiers (for the with
>> holes) scenario for virtio-ccw only.
> 
> Yes, it broke the notifiers.
> 
> But note that everything was in spec at that point, because we only documented
> "free_page_vq == 3" in the spec *2 years later*, in 2020:
> 
> commit 38448268eba0c105200d131c3f7f660129a4d673
> Author: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> Date:   Tue Aug 25 07:45:02 2020 -0700
> 
>       content: Document balloon feature free page hints
>       
>       Free page hints allow the balloon driver to provide information on what
>       pages are not currently in use so that we can avoid the cost of copying
>       them in migration scenarios. Add a feature description for free page hints
>       describing basic functioning and requirements.
>       
> At that point, what we documented in the spec *did not match reality* in
> Linux. QEMU was fully compatible, because VIRTIO_BALLOON_F_STATS_VQ is
> unconditionally set.
> 
> 
> QEMU and Linux kept using that queue index assignment model, and the spec
> was wrong (out of sync?) at that point. The spec got more wrong with
> 
> commit d917d4a8d552c003e046b0e3b1b529d98f7e695b
> Author: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> Date:   Tue Aug 25 07:45:17 2020 -0700
> 
>       content: Document balloon feature free page reporting
>       
>       Free page reporting is a feature that allows the guest to proactively
>       report unused pages to the host. By making use of this feature is is
>       possible to reduce the overall memory footprint of the guest in cases where
>       some significant portion of the memory is idle. Add documentation for the
>       free page reporting feature describing the functionality and requirements.
> 
> Where we documented VIRTIO_BALLOON_F_REPORTING after the changes were added to
> QEMU+Linux implementation, so the spec did not reflect reality.
> 
> I'll note also cloud-hypervisor [2] today follows that model.
> 
> In particular, it *only* supports VIRTIO_BALLOON_F_REPORTING, turning
> the queue index of VIRTIO_BALLOON_F_REPORTING into *2* instead of documented
> in the spec to be *4*.
> 
> So in reality, we can see VIRTIO_BALLOON_F_REPORTING to be either 2/3/4, depending
> on the availability of the other two features/queues.
> 
> [2] https://github.com/cloud-hypervisor/cloud-hypervisor/blob/main/virtio-devices/src/balloon.rs
> 
> 
>>
>> Now we had another look at this, and have concluded that fixing the
>> hypervisor(s) and fixing the kernel, and making sure that the fixed
>> kernel can tolerate the old broken hypervisor(s) is way to complicated
>> if possible at all. So we decided to give the spec a reality check and
>> fix the notifier bit assignment for virtio-ccw which is broken beyond
>> doubt if we accept that the correct virtqueue index is the one that the
>> hypervisor(s) use and not the one that the spec says they should use.
> 
> In case of virtio-balloon, it's unfortunate that it went that way, but the
> spec simply did not / does not reflect reality when it was added to the spec.
> 
>>
>> With the spec fixed, the whole notion of "holes" will be something that
>> does not make sense any more. With that the merit of the kernel interface
>> virtio_find_vqs() supporting "holes" is quite questionable. Now we need
>> it because the drivers within the Linux kernel still think of the queues
>> in terms of the current spec, i.e. they try to have the "holes" as
>> mandated by the spec, and the duty of making it work with the broken
>> device implementations falls to the transports.
>>
> 
> Right, the "holes" only exist in the input array.
> 
>> Under the assumption that the spec is indeed going to be fixed:

For virito-balloon, we should probably do the following:

 From 38e340c2bb53c2a7cc7c675f5dfdd44ecf7701d9 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Fri, 4 Apr 2025 12:53:16 +0200
Subject: [PATCH] virtio-balloon: Fix queue index assignment for
  non-existing queues

Signed-off-by: David Hildenbrand <david@redhat.com>
---
  device-types/balloon/description.tex | 22 ++++++++++++++++------
  1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/device-types/balloon/description.tex b/device-types/balloon/description.tex
index a1d9603..a7396ff 100644
--- a/device-types/balloon/description.tex
+++ b/device-types/balloon/description.tex
@@ -16,6 +16,21 @@ \subsection{Device ID}\label{sec:Device Types / Memory Balloon Device / Device I
    5
  
  \subsection{Virtqueues}\label{sec:Device Types / Memory Balloon Device / Virtqueues}
+
+\begin{description}
+\item[inflateq] Exists unconditionally.
+\item[deflateq] Exists unconditionally.
+\item[statsq] Only exists if VIRTIO_BALLOON_F_STATS_VQ is set.
+\item[free_page_vq] Only exists if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set.
+\item[reporting_vq] Only exists if VIRTIO_BALLOON_F_PAGE_REPORTING is set.
+\end{description}
+
+\begin{note}
+Virtqueue indexes are assigned sequentially for existing queues, starting
+with index 0; consequently, if a virtqueue does not exist, it does not get
+an index assigned. Assuming all virtqueues exist for a device, the indexes
+are:
+
  \begin{description}
  \item[0] inflateq
  \item[1] deflateq
@@ -23,12 +38,7 @@ \subsection{Virtqueues}\label{sec:Device Types / Memory Balloon Device / Virtque
  \item[3] free_page_vq
  \item[4] reporting_vq
  \end{description}
-
-  statsq only exists if VIRTIO_BALLOON_F_STATS_VQ is set.
-
-  free_page_vq only exists if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set.
-
-  reporting_vq only exists if VIRTIO_BALLOON_F_PAGE_REPORTING is set.
+\end{note}
  
  \subsection{Feature bits}\label{sec:Device Types / Memory Balloon Device / Feature bits}
  \begin{description}
Halil Pasic April 4, 2025, 12:05 p.m. UTC | #13
On Fri, 4 Apr 2025 01:33:28 -0400
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> > 
> > I think, a consequence of this design is that all queues need to be
> > created and allocated at initialization time.  
> 
> Why? after feature negotiation.

What I mean is, with this change having queues that exist but are not
set up before the device becomes operational is not viable any more.

Let me use the virtio-net example again. I assume by the current spec
it would be OK to have max_virtqueue_pairs quite big e.g. 64, just in
case the guest ends up having many vCPUs hotplugged. But start out with
2 vcpus, 2 queue pairs and initially doing VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET with 2.

And then grow the guest to 8 vcpus, 'discover' virtqueues
2(I-1) receiveqI, 2(I-1)+1 transmitqI for 2 < I < 9 (I is a natural number)
and do another VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET with 8.

Please notice that the controlq would sit at index 128 (64*2) all along. That
is in the old world. In the new world we don't do holes, so we need to
allocate all the virtqueues up to controlq up-front. To avoid having
holes. Or any queue-pairs that are discoverd after the initial vq discovery
would need to have an index larger than controlq has.

Regards,
Halil
Halil Pasic April 4, 2025, 1:36 p.m. UTC | #14
On Fri, 4 Apr 2025 12:55:09 +0200
David Hildenbrand <david@redhat.com> wrote:

> For virito-balloon, we should probably do the following:
> 
>  From 38e340c2bb53c2a7cc7c675f5dfdd44ecf7701d9 Mon Sep 17 00:00:00 2001
> From: David Hildenbrand <david@redhat.com>
> Date: Fri, 4 Apr 2025 12:53:16 +0200
> Subject: [PATCH] virtio-balloon: Fix queue index assignment for
>   non-existing queues
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>   device-types/balloon/description.tex | 22 ++++++++++++++++------
>   1 file changed, 16 insertions(+), 6 deletions(-)
> 
> diff --git a/device-types/balloon/description.tex b/device-types/balloon/description.tex
> index a1d9603..a7396ff 100644
> --- a/device-types/balloon/description.tex
> +++ b/device-types/balloon/description.tex
> @@ -16,6 +16,21 @@ \subsection{Device ID}\label{sec:Device Types / Memory Balloon Device / Device I
>     5
>   
>   \subsection{Virtqueues}\label{sec:Device Types / Memory Balloon Device / Virtqueues}
> +
> +\begin{description}
> +\item[inflateq] Exists unconditionally.
> +\item[deflateq] Exists unconditionally.
> +\item[statsq] Only exists if VIRTIO_BALLOON_F_STATS_VQ is set.
> +\item[free_page_vq] Only exists if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set.
> +\item[reporting_vq] Only exists if VIRTIO_BALLOON_F_PAGE_REPORTING is set.

s/is set/is negotiated/?

I think we should stick to "feature is offered" and "feature is
negotiated".

> +\end{description}
> +
> +\begin{note}
> +Virtqueue indexes are assigned sequentially for existing queues, starting
> +with index 0; consequently, if a virtqueue does not exist, it does not get
> +an index assigned. Assuming all virtqueues exist for a device, the indexes
> +are:
> +
>   \begin{description}
>   \item[0] inflateq
>   \item[1] deflateq
> @@ -23,12 +38,7 @@ \subsection{Virtqueues}\label{sec:Device Types / Memory Balloon Device / Virtque
>   \item[3] free_page_vq
>   \item[4] reporting_vq
>   \end{description}
> -
> -  statsq only exists if VIRTIO_BALLOON_F_STATS_VQ is set.
> -
> -  free_page_vq only exists if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set.
> -
> -  reporting_vq only exists if VIRTIO_BALLOON_F_PAGE_REPORTING is set.
> +\end{note}
>   
>   \subsection{Feature bits}\label{sec:Device Types / Memory Balloon Device / Feature bits}
>   \begin{description}

Sounds good to me! But I'm still a little confused by the "holes". What
confuses me is that i can think of at least 2 distinct types of "holes":
1) Holes that can be filled later. The queue conceptually exists, but
   there is no need to back it with any resources for now because it is 
   dormant (it can be seen a hole in comparison to queues that need to
  materialize -- vring, notifiers, ...)
2) Holes that can not be filled without resetting the device: i.e. if
   certain features are not negotiated, then a queue X does not exist,
   but subsequent queues retain their index.

Can we have both kinds or was/will be 1) and/or 2) never a thing?

This patch would make sure that neither 1) nor 2) applies to
virtio-balloon, which is good. But we are talking about a
transport fix here, and I would like to eventually make sure
that the abstractions make sense.

That being said, I think we should proceed with this patch, because I
don't think Linux uses type 1) holes.

Regards,
Halil
David Hildenbrand April 4, 2025, 1:48 p.m. UTC | #15
On 04.04.25 15:36, Halil Pasic wrote:
> On Fri, 4 Apr 2025 12:55:09 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>> For virito-balloon, we should probably do the following:
>>
>>   From 38e340c2bb53c2a7cc7c675f5dfdd44ecf7701d9 Mon Sep 17 00:00:00 2001
>> From: David Hildenbrand <david@redhat.com>
>> Date: Fri, 4 Apr 2025 12:53:16 +0200
>> Subject: [PATCH] virtio-balloon: Fix queue index assignment for
>>    non-existing queues
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>    device-types/balloon/description.tex | 22 ++++++++++++++++------
>>    1 file changed, 16 insertions(+), 6 deletions(-)
>>
>> diff --git a/device-types/balloon/description.tex b/device-types/balloon/description.tex
>> index a1d9603..a7396ff 100644
>> --- a/device-types/balloon/description.tex
>> +++ b/device-types/balloon/description.tex
>> @@ -16,6 +16,21 @@ \subsection{Device ID}\label{sec:Device Types / Memory Balloon Device / Device I
>>      5
>>    
>>    \subsection{Virtqueues}\label{sec:Device Types / Memory Balloon Device / Virtqueues}
>> +
>> +\begin{description}
>> +\item[inflateq] Exists unconditionally.
>> +\item[deflateq] Exists unconditionally.
>> +\item[statsq] Only exists if VIRTIO_BALLOON_F_STATS_VQ is set.
>> +\item[free_page_vq] Only exists if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set.
>> +\item[reporting_vq] Only exists if VIRTIO_BALLOON_F_PAGE_REPORTING is set.
> 
> s/is set/is negotiated/?
> 
> I think we should stick to "feature is offered" and "feature is
> negotiated".
> 
>> +\end{description}
>> +
>> +\begin{note}
>> +Virtqueue indexes are assigned sequentially for existing queues, starting
>> +with index 0; consequently, if a virtqueue does not exist, it does not get
>> +an index assigned. Assuming all virtqueues exist for a device, the indexes
>> +are:
>> +
>>    \begin{description}
>>    \item[0] inflateq
>>    \item[1] deflateq
>> @@ -23,12 +38,7 @@ \subsection{Virtqueues}\label{sec:Device Types / Memory Balloon Device / Virtque
>>    \item[3] free_page_vq
>>    \item[4] reporting_vq
>>    \end{description}
>> -
>> -  statsq only exists if VIRTIO_BALLOON_F_STATS_VQ is set.
>> -
>> -  free_page_vq only exists if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set.
>> -
>> -  reporting_vq only exists if VIRTIO_BALLOON_F_PAGE_REPORTING is set.
>> +\end{note}
>>    
>>    \subsection{Feature bits}\label{sec:Device Types / Memory Balloon Device / Feature bits}
>>    \begin{description}
> 
> Sounds good to me! But I'm still a little confused by the "holes". What
> confuses me is that i can think of at least 2 distinct types of "holes":
> 1) Holes that can be filled later. The queue conceptually exists, but
>     there is no need to back it with any resources for now because it is
>     dormant (it can be seen a hole in comparison to queues that need to
>    materialize -- vring, notifiers, ...)
> 2) Holes that can not be filled without resetting the device: i.e. if
>     certain features are not negotiated, then a queue X does not exist,
>     but subsequent queues retain their index.

I think it is not about "negotiated", that might be the wrong terminology.

E.g., in QEMU virtio_balloon_device_realize() we define the virtqueues 
(virtio_add_queue()) if virtio_has_feature(s->host_features).

That is, it's independent of a feature negotiation (IIUC), it's static 
for the device --  "host_features"


Is that really "negotiated" or is it "the device offers the feature X" ?
Halil Pasic April 4, 2025, 2 p.m. UTC | #16
On Fri, 4 Apr 2025 15:48:49 +0200
David Hildenbrand <david@redhat.com> wrote:

> > Sounds good to me! But I'm still a little confused by the "holes".
> > What confuses me is that i can think of at least 2 distinct types of
> > "holes": 1) Holes that can be filled later. The queue conceptually
> > exists, but there is no need to back it with any resources for now
> > because it is dormant (it can be seen a hole in comparison to queues
> > that need to materialize -- vring, notifiers, ...)
> > 2) Holes that can not be filled without resetting the device: i.e. if
> >     certain features are not negotiated, then a queue X does not
> > exist, but subsequent queues retain their index.  
> 
> I think it is not about "negotiated", that might be the wrong
> terminology.
> 
> E.g., in QEMU virtio_balloon_device_realize() we define the virtqueues 
> (virtio_add_queue()) if virtio_has_feature(s->host_features).
> 
> That is, it's independent of a feature negotiation (IIUC), it's static 
> for the device --  "host_features"
> 
> 
> Is that really "negotiated" or is it "the device offers the feature X"
> ?

It is offered. And this is precisely why I'm so keen on having a precise
wording here.

Usually for compatibility one needs negotiated. Because the feature
negotiation is mostly about compatibility. I.e. the driver should be
able to say, hey I don't know about that feature, and get compatible
behavior. If for example VIRTIO_BALLOON_F_FREE_PAGE_HINT and
VIRTIO_BALLOON_F_PAGE_REPORTING are both offered but only
VIRTIO_BALLOON_F_PAGE_REPORTING is negotiated. That would make reporting_vq
jump to +1 compared to the case where VIRTIO_BALLOON_F_FREE_PAGE_HINT is
not offered. Which is IMHO no good, because for the features that the
driver is going to reject in most of the cases it should not matter if
it was offered or not.

@MST: Please correct me if I'm wrong!

Regards,
Halil
David Hildenbrand April 4, 2025, 2:17 p.m. UTC | #17
On 04.04.25 16:00, Halil Pasic wrote:
> On Fri, 4 Apr 2025 15:48:49 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>>> Sounds good to me! But I'm still a little confused by the "holes".
>>> What confuses me is that i can think of at least 2 distinct types of
>>> "holes": 1) Holes that can be filled later. The queue conceptually
>>> exists, but there is no need to back it with any resources for now
>>> because it is dormant (it can be seen a hole in comparison to queues
>>> that need to materialize -- vring, notifiers, ...)
>>> 2) Holes that can not be filled without resetting the device: i.e. if
>>>      certain features are not negotiated, then a queue X does not
>>> exist, but subsequent queues retain their index.
>>
>> I think it is not about "negotiated", that might be the wrong
>> terminology.
>>
>> E.g., in QEMU virtio_balloon_device_realize() we define the virtqueues
>> (virtio_add_queue()) if virtio_has_feature(s->host_features).
>>
>> That is, it's independent of a feature negotiation (IIUC), it's static
>> for the device --  "host_features"
>>
>>
>> Is that really "negotiated" or is it "the device offers the feature X"
>> ?
> 
> It is offered. And this is precisely why I'm so keen on having a precise
> wording here.

Yes, me too. The current phrasing in the spec is not clear.

Linux similarly checks 
virtio_has_feature()->virtio_check_driver_offered_feature().

> 
> Usually for compatibility one needs negotiated. Because the feature
> negotiation is mostly about compatibility. I.e. the driver should be
> able to say, hey I don't know about that feature, and get compatible
> behavior. If for example VIRTIO_BALLOON_F_FREE_PAGE_HINT and
> VIRTIO_BALLOON_F_PAGE_REPORTING are both offered but only
> VIRTIO_BALLOON_F_PAGE_REPORTING is negotiated. That would make reporting_vq
> jump to +1 compared to the case where VIRTIO_BALLOON_F_FREE_PAGE_HINT is
> not offered. Which is IMHO no good, because for the features that the
> driver is going to reject in most of the cases it should not matter if
> it was offered or not.

Yes. The key part is that we may only add new features to the tail of 
our feature list; maybe we should document that as well.

I agree that a driver that implements VIRTIO_BALLOON_F_PAGE_REPORTING 
*must* be aware that VIRTIO_BALLOON_F_FREE_PAGE_HINT exists. So queue 
existence is not about feature negotiation but about features being 
offered from the device.

... which is a bit the same behavior as with fixed-assigned numbers a 
bit. VIRTIO_BALLOON_F_PAGE_REPORTING was documented as "4" because 
VIRTIO_BALLOON_F_FREE_PAGE_HINT was documented to be "3" -- IOW, it 
already existed in the spec.

Not perfect, but AFAIKS, not horrible.

(as Linux supports all these features, it's easy. A driver that only 
supports some features has to calculate the queue index manually based 
on the offered features)

> 
> @MST: Please correct me if I'm wrong!
> 
> Regards,
> Halil
>
Halil Pasic April 4, 2025, 3:39 p.m. UTC | #18
On Fri, 4 Apr 2025 16:17:14 +0200
David Hildenbrand <david@redhat.com> wrote:

> > It is offered. And this is precisely why I'm so keen on having a
> > precise wording here.  
> 
> Yes, me too. The current phrasing in the spec is not clear.
> 
> Linux similarly checks 
> virtio_has_feature()->virtio_check_driver_offered_feature().

Careful, that is a *driver* offered and not a *device* offered! 

We basically mandate that one can only check for a feature F with
virtio_has_feature() such that it is either in drv->feature_table or in
drv->feature_table_legacy.

AFAICT *device_features* obtained via dev->config->get_features(dev)
isn't even saved but is only used for binary and-ing it with the
driver_features to obtain the negotiated features.

That basically means that if I was, for the sake of fun do

--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -1197,7 +1197,6 @@ static unsigned int features[] = {
        VIRTIO_BALLOON_F_MUST_TELL_HOST,
        VIRTIO_BALLOON_F_STATS_VQ,
        VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
-       VIRTIO_BALLOON_F_FREE_PAGE_HINT,
        VIRTIO_BALLOON_F_PAGE_POISON,
        VIRTIO_BALLOON_F_REPORTING,
 };

I would end up with virtio_check_driver_offered_feature() calling
BUG().

That basically means that Linux mandates implementing all previous
features regardless whether does are supposed to be optional ones or
not. Namely if you put the feature into drv->feature_table it will
get negotiated. 

Which is not nice IMHO.

> 
> > 
> > Usually for compatibility one needs negotiated. Because the feature
> > negotiation is mostly about compatibility. I.e. the driver should be
> > able to say, hey I don't know about that feature, and get compatible
> > behavior. If for example VIRTIO_BALLOON_F_FREE_PAGE_HINT and
> > VIRTIO_BALLOON_F_PAGE_REPORTING are both offered but only
> > VIRTIO_BALLOON_F_PAGE_REPORTING is negotiated. That would make
> > reporting_vq jump to +1 compared to the case where
> > VIRTIO_BALLOON_F_FREE_PAGE_HINT is not offered. Which is IMHO no
> > good, because for the features that the driver is going to reject in
> > most of the cases it should not matter if it was offered or not.  
> 
> Yes. The key part is that we may only add new features to the tail of 
> our feature list; maybe we should document that as well.
> 
> I agree that a driver that implements VIRTIO_BALLOON_F_PAGE_REPORTING 
> *must* be aware that VIRTIO_BALLOON_F_FREE_PAGE_HINT exists. So queue 
> existence is not about feature negotiation but about features being 
> offered from the device.
> 
> ... which is a bit the same behavior as with fixed-assigned numbers a 
> bit. VIRTIO_BALLOON_F_PAGE_REPORTING was documented as "4" because 
> VIRTIO_BALLOON_F_FREE_PAGE_HINT was documented to be "3" -- IOW, it 
> already existed in the spec.

I don't agree with the comparison.  One obviously needs to avoid fatal
collisions when extending the spec, and has to consider prior art.

But ideally not implemented  or fenced optional features A should have no
impact to implemented optional or not optional features B -- unless the
features are actually interdependent, but then the spec would prohibit
the combo of having B but not A. And IMHO exactly this would have been
the advantage of fixed-assigned numbers: you may not care if the other
queueues exist or not. 

Also like cloud-hypervisor has decided that they are going only to
support VIRTIO_BALLOON_F_REPORTING some weird OS could in theory
decide that they only care about VIRTIO_BALLOON_F_REPORTING. In that
setting having to look at VIRTIO_BALLOON_F_STATS_VQ and 
VIRTIO_BALLOON_F_FREE_PAGE_HINT are offered is weird. But that is all water
under the bridge. We have to respect what is out there in the field.

> 
> Not perfect, but AFAIKS, not horrible.

It is like it is. QEMU does queue exist if the corresponding feature
is offered by the device, and that is what we have to live with.

Yes, I agree we should make the spec reflect reality!

> 
> (as Linux supports all these features, it's easy. A driver that only 
> supports some features has to calculate the queue index manually based 
> on the offered features)

As I've tried to explain above, not implementing/accepting optional
features and then implementing/accepting a newer feature is problematic
with the current code. Supporting some features would work only as
supporting all features up to X.

Regards,
Halil
David Hildenbrand April 4, 2025, 4:49 p.m. UTC | #19
On 04.04.25 17:39, Halil Pasic wrote:
> On Fri, 4 Apr 2025 16:17:14 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>>> It is offered. And this is precisely why I'm so keen on having a
>>> precise wording here.
>>
>> Yes, me too. The current phrasing in the spec is not clear.
>>
>> Linux similarly checks
>> virtio_has_feature()->virtio_check_driver_offered_feature().
> 
> Careful, that is a *driver* offered and not a *device* offered!

Right, I was pointing at the usage of the term "offered". 
virtio_check_driver_offered_feature(). (but was also confused about that 
function)

virtio_has_feature() is clearer: "helper to determine if this device has 
this feature."

The way it's currently implemented is that it's essentially "device has 
this feature and we know about it (->feature_table)"

> 
> We basically mandate that one can only check for a feature F with
> virtio_has_feature() such that it is either in drv->feature_table or in
> drv->feature_table_legacy.
> 
> AFAICT *device_features* obtained via dev->config->get_features(dev)
> isn't even saved but is only used for binary and-ing it with the
> driver_features to obtain the negotiated features.
> 
> That basically means that if I was, for the sake of fun do
> 
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -1197,7 +1197,6 @@ static unsigned int features[] = {
>          VIRTIO_BALLOON_F_MUST_TELL_HOST,
>          VIRTIO_BALLOON_F_STATS_VQ,
>          VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> -       VIRTIO_BALLOON_F_FREE_PAGE_HINT,
>          VIRTIO_BALLOON_F_PAGE_POISON,
>          VIRTIO_BALLOON_F_REPORTING,
>   };
> 
> I would end up with virtio_check_driver_offered_feature() calling
> BUG().
> 

Right.

> That basically means that Linux mandates implementing all previous
> features regardless whether does are supposed to be optional ones or
> not. Namely if you put the feature into drv->feature_table it will
> get negotiated.
> 
> Which is not nice IMHO.

I think the validate() callbacks allows for fixing that up.

Like us unconditionally clearing VIRTIO_F_ACCESS_PLATFORM (I know, 
that's a transport feature and a bit different for this reason).

... not that I think the current way of achieving that is nice :)

> 
>>
>>>
>>> Usually for compatibility one needs negotiated. Because the feature
>>> negotiation is mostly about compatibility. I.e. the driver should be
>>> able to say, hey I don't know about that feature, and get compatible
>>> behavior. If for example VIRTIO_BALLOON_F_FREE_PAGE_HINT and
>>> VIRTIO_BALLOON_F_PAGE_REPORTING are both offered but only
>>> VIRTIO_BALLOON_F_PAGE_REPORTING is negotiated. That would make
>>> reporting_vq jump to +1 compared to the case where
>>> VIRTIO_BALLOON_F_FREE_PAGE_HINT is not offered. Which is IMHO no
>>> good, because for the features that the driver is going to reject in
>>> most of the cases it should not matter if it was offered or not.
>>
>> Yes. The key part is that we may only add new features to the tail of
>> our feature list; maybe we should document that as well.
>>
>> I agree that a driver that implements VIRTIO_BALLOON_F_PAGE_REPORTING
>> *must* be aware that VIRTIO_BALLOON_F_FREE_PAGE_HINT exists. So queue
>> existence is not about feature negotiation but about features being
>> offered from the device.
>>
>> ... which is a bit the same behavior as with fixed-assigned numbers a
>> bit. VIRTIO_BALLOON_F_PAGE_REPORTING was documented as "4" because
>> VIRTIO_BALLOON_F_FREE_PAGE_HINT was documented to be "3" -- IOW, it
>> already existed in the spec.
> 
> I don't agree with the comparison.  One obviously needs to avoid fatal
> collisions when extending the spec, and has to consider prior art.

Agreed. But IMHO it's similar to two out-of-spec driver starting to use 
"queue index 5" in a fix-assigned world. It cannot work.

> 
> But ideally not implemented  or fenced optional features A should have no
> impact to implemented optional or not optional features B -- unless the
> features are actually interdependent, but then the spec would prohibit
> the combo of having B but not A. And IMHO exactly this would have been
> the advantage of fixed-assigned numbers: you may not care if the other
> queueues exist or not.
> 
> Also like cloud-hypervisor has decided that they are going only to
> support VIRTIO_BALLOON_F_REPORTING some weird OS could in theory
> decide that they only care about VIRTIO_BALLOON_F_REPORTING. In that
> setting having to look at VIRTIO_BALLOON_F_STATS_VQ and
> VIRTIO_BALLOON_F_FREE_PAGE_HINT are offered is weird. But that is all water
> under the bridge. We have to respect what is out there in the field.

Yes, they would have to do the math based on offered features. 
Definitely not nice, but as you say, that ship has sailed.

[...]

>>
>> (as Linux supports all these features, it's easy. A driver that only
>> supports some features has to calculate the queue index manually based
>> on the offered features)
> 
> As I've tried to explain above, not implementing/accepting optional
> features and then implementing/accepting a newer feature is problematic
> with the current code. Supporting some features would work only as
> supporting all features up to X.

See above regarding validate().

Again, doesn't win a beauty contest ... I'll send an improved 
virtio-spec update next week, thanks!
David Hildenbrand April 4, 2025, 5:36 p.m. UTC | #20
On 04.04.25 18:49, David Hildenbrand wrote:
> On 04.04.25 17:39, Halil Pasic wrote:
>> On Fri, 4 Apr 2025 16:17:14 +0200
>> David Hildenbrand <david@redhat.com> wrote:
>>
>>>> It is offered. And this is precisely why I'm so keen on having a
>>>> precise wording here.
>>>
>>> Yes, me too. The current phrasing in the spec is not clear.
>>>
>>> Linux similarly checks
>>> virtio_has_feature()->virtio_check_driver_offered_feature().
>>
>> Careful, that is a *driver* offered and not a *device* offered!
> 
> Right, I was pointing at the usage of the term "offered".
> virtio_check_driver_offered_feature(). (but was also confused about that
> function)
> 
> virtio_has_feature() is clearer: "helper to determine if this device has
> this feature."
> 
> The way it's currently implemented is that it's essentially "device has
> this feature and we know about it (->feature_table)"
> 
>>
>> We basically mandate that one can only check for a feature F with
>> virtio_has_feature() such that it is either in drv->feature_table or in
>> drv->feature_table_legacy.
>>
>> AFAICT *device_features* obtained via dev->config->get_features(dev)
>> isn't even saved but is only used for binary and-ing it with the
>> driver_features to obtain the negotiated features.
>>
>> That basically means that if I was, for the sake of fun do
>>
>> --- a/drivers/virtio/virtio_balloon.c
>> +++ b/drivers/virtio/virtio_balloon.c
>> @@ -1197,7 +1197,6 @@ static unsigned int features[] = {
>>           VIRTIO_BALLOON_F_MUST_TELL_HOST,
>>           VIRTIO_BALLOON_F_STATS_VQ,
>>           VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
>> -       VIRTIO_BALLOON_F_FREE_PAGE_HINT,
>>           VIRTIO_BALLOON_F_PAGE_POISON,
>>           VIRTIO_BALLOON_F_REPORTING,
>>    };
>>
>> I would end up with virtio_check_driver_offered_feature() calling
>> BUG().
>>
> 
> Right.
> 
>> That basically means that Linux mandates implementing all previous
>> features regardless whether does are supposed to be optional ones or
>> not. Namely if you put the feature into drv->feature_table it will
>> get negotiated.
>>
>> Which is not nice IMHO.
> 
> I think the validate() callbacks allows for fixing that up.
> 
> Like us unconditionally clearing VIRTIO_F_ACCESS_PLATFORM (I know,
> that's a transport feature and a bit different for this reason).
> 
> ... not that I think the current way of achieving that is nice :)

Thinking again, that won't work, because it would also make 
virtio_has_feature() say that the device does not have that feature.

So yeah, virtio_has_feature() is confusing and the documentation does 
not quite match.

Would need a change/cleanup to handle such features that we don't 
implement but still want to check if they are offered.
Michael S. Tsirkin April 6, 2025, 3:40 p.m. UTC | #21
On Fri, Apr 04, 2025 at 12:55:09PM +0200, David Hildenbrand wrote:
> On 04.04.25 12:00, David Hildenbrand wrote:
> > On 04.04.25 06:36, Halil Pasic wrote:
> > > On Thu, 3 Apr 2025 16:28:31 +0200
> > > David Hildenbrand <david@redhat.com> wrote:
> > > 
> > > > > Sorry I have to have a look at that discussion. Maybe it will answer
> > > > > some my questions.
> > > > 
> > > > Yes, I think so.
> > > > 
> > > > > > Let's fix it without affecting existing setups for now by properly
> > > > > > ignoring the non-existing queues, so the indicator bits will match
> > > > > > the queue indexes.
> > > > > 
> > > > > Just one question. My understanding is that the crux is that Linux
> > > > > and QEMU (or the driver and the device) disagree at which index
> > > > > reporting_vq is actually sitting. Is that right?
> > > > 
> > > > I thought I made it clear: this is only about the airq indicator bit.
> > > > That's where both disagree.
> > > > 
> > > > Not the actual queue index (see above).
> > > 
> > > I did some more research including having a look at that discussion. Let
> > > me try to sum up how did we end up here.
> > 
> > Let me add some more details after digging as well:
> > 
> > > 
> > > Before commit a229989d975e ("virtio: don't allocate vqs when names[i] =
> > > NULL") the kernel behavior used to be in spec, but QEMU and possibly
> > > other hypervisor were out of spec and things did not work.
> > 
> > It all started with VIRTIO_BALLOON_F_FREE_PAGE_HINT. Before that,
> > we only had the single optional VIRTIO_BALLOON_F_STATS_VQ queue at the very
> > end. So there was no possibility for holes "in-between".
> > 
> > In the Linux driver, we created the stats queue only if the feature bit
> > VIRTIO_BALLOON_F_STATS_VQ was actually around:
> > 
> > 	nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2;
> > 	err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL);
> > 
> > That changed with VIRTIO_BALLOON_F_FREE_PAGE_HINT, because we would
> > unconditionally create 4 queues. QEMU always supported the first 3 queues
> > unconditionally, but old QEMU did obviously not support the (new)
> > VIRTIO_BALLOON_F_FREE_PAGE_HINT queue.
> > 
> > 390x didn't particularly like getting queried for non-existing
> > queues. [1] So the fix was not for a hypervisor that was out of spec, but
> > because quering non-existing queues didn't work.
> > 
> > The fix implied that if VIRTIO_BALLOON_F_STATS_VQ is missing, suddenly the queue
> > index of VIRTIO_BALLOON_F_FREE_PAGE_HINT changed as well.
> > 
> > Again, as QEMU always implemented the 3 first queues unconditionally, this was
> > not a problem.
> > 
> > [1] https://lore.kernel.org/all/c6746307-fae5-7652-af8d-19f560fc31d9@de.ibm.com/#t
> > 
> > > 
> > > Possibly because of the complexity of fixing the hypervisor(s) commit
> > > a229989d975e ("virtio: don't allocate vqs when names[i] = NULL") opted
> > > for changing the guest side so that it does not fit the spec but fits
> > > the hypervisor(s). It unfortunately also broke notifiers (for the with
> > > holes) scenario for virtio-ccw only.
> > 
> > Yes, it broke the notifiers.
> > 
> > But note that everything was in spec at that point, because we only documented
> > "free_page_vq == 3" in the spec *2 years later*, in 2020:
> > 
> > commit 38448268eba0c105200d131c3f7f660129a4d673
> > Author: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > Date:   Tue Aug 25 07:45:02 2020 -0700
> > 
> >       content: Document balloon feature free page hints
> >       Free page hints allow the balloon driver to provide information on what
> >       pages are not currently in use so that we can avoid the cost of copying
> >       them in migration scenarios. Add a feature description for free page hints
> >       describing basic functioning and requirements.
> > At that point, what we documented in the spec *did not match reality* in
> > Linux. QEMU was fully compatible, because VIRTIO_BALLOON_F_STATS_VQ is
> > unconditionally set.
> > 
> > 
> > QEMU and Linux kept using that queue index assignment model, and the spec
> > was wrong (out of sync?) at that point. The spec got more wrong with
> > 
> > commit d917d4a8d552c003e046b0e3b1b529d98f7e695b
> > Author: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> > Date:   Tue Aug 25 07:45:17 2020 -0700
> > 
> >       content: Document balloon feature free page reporting
> >       Free page reporting is a feature that allows the guest to proactively
> >       report unused pages to the host. By making use of this feature is is
> >       possible to reduce the overall memory footprint of the guest in cases where
> >       some significant portion of the memory is idle. Add documentation for the
> >       free page reporting feature describing the functionality and requirements.
> > 
> > Where we documented VIRTIO_BALLOON_F_REPORTING after the changes were added to
> > QEMU+Linux implementation, so the spec did not reflect reality.
> > 
> > I'll note also cloud-hypervisor [2] today follows that model.
> > 
> > In particular, it *only* supports VIRTIO_BALLOON_F_REPORTING, turning
> > the queue index of VIRTIO_BALLOON_F_REPORTING into *2* instead of documented
> > in the spec to be *4*.
> > 
> > So in reality, we can see VIRTIO_BALLOON_F_REPORTING to be either 2/3/4, depending
> > on the availability of the other two features/queues.
> > 
> > [2] https://github.com/cloud-hypervisor/cloud-hypervisor/blob/main/virtio-devices/src/balloon.rs
> > 
> > 
> > > 
> > > Now we had another look at this, and have concluded that fixing the
> > > hypervisor(s) and fixing the kernel, and making sure that the fixed
> > > kernel can tolerate the old broken hypervisor(s) is way to complicated
> > > if possible at all. So we decided to give the spec a reality check and
> > > fix the notifier bit assignment for virtio-ccw which is broken beyond
> > > doubt if we accept that the correct virtqueue index is the one that the
> > > hypervisor(s) use and not the one that the spec says they should use.
> > 
> > In case of virtio-balloon, it's unfortunate that it went that way, but the
> > spec simply did not / does not reflect reality when it was added to the spec.
> > 
> > > 
> > > With the spec fixed, the whole notion of "holes" will be something that
> > > does not make sense any more. With that the merit of the kernel interface
> > > virtio_find_vqs() supporting "holes" is quite questionable. Now we need
> > > it because the drivers within the Linux kernel still think of the queues
> > > in terms of the current spec, i.e. they try to have the "holes" as
> > > mandated by the spec, and the duty of making it work with the broken
> > > device implementations falls to the transports.
> > > 
> > 
> > Right, the "holes" only exist in the input array.
> > 
> > > Under the assumption that the spec is indeed going to be fixed:
> 
> For virito-balloon, we should probably do the following:
> 
> From 38e340c2bb53c2a7cc7c675f5dfdd44ecf7701d9 Mon Sep 17 00:00:00 2001
> From: David Hildenbrand <david@redhat.com>
> Date: Fri, 4 Apr 2025 12:53:16 +0200
> Subject: [PATCH] virtio-balloon: Fix queue index assignment for
>  non-existing queues
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  device-types/balloon/description.tex | 22 ++++++++++++++++------
>  1 file changed, 16 insertions(+), 6 deletions(-)
> 
> diff --git a/device-types/balloon/description.tex b/device-types/balloon/description.tex
> index a1d9603..a7396ff 100644
> --- a/device-types/balloon/description.tex
> +++ b/device-types/balloon/description.tex
> @@ -16,6 +16,21 @@ \subsection{Device ID}\label{sec:Device Types / Memory Balloon Device / Device I
>    5
>  \subsection{Virtqueues}\label{sec:Device Types / Memory Balloon Device / Virtqueues}
> +
> +\begin{description}
> +\item[inflateq] Exists unconditionally.
> +\item[deflateq] Exists unconditionally.
> +\item[statsq] Only exists if VIRTIO_BALLOON_F_STATS_VQ is set.
> +\item[free_page_vq] Only exists if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set.
> +\item[reporting_vq] Only exists if VIRTIO_BALLOON_F_PAGE_REPORTING is set.
> +\end{description}
> +
> +\begin{note}
> +Virtqueue indexes are assigned sequentially for existing queues, starting
> +with index 0; consequently, if a virtqueue does not exist, it does not get
> +an index assigned. Assuming all virtqueues exist for a device, the indexes
> +are:
> +
>  \begin{description}
>  \item[0] inflateq
>  \item[1] deflateq
> @@ -23,12 +38,7 @@ \subsection{Virtqueues}\label{sec:Device Types / Memory Balloon Device / Virtque
>  \item[3] free_page_vq
>  \item[4] reporting_vq
>  \end{description}
> -
> -  statsq only exists if VIRTIO_BALLOON_F_STATS_VQ is set.
> -
> -  free_page_vq only exists if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set.
> -
> -  reporting_vq only exists if VIRTIO_BALLOON_F_PAGE_REPORTING is set.
> +\end{note}
>  \subsection{Feature bits}\label{sec:Device Types / Memory Balloon Device / Feature bits}
>  \begin{description}
> -- 
> 2.48.1
> 
> 
> If something along these lines sounds reasonable, I can send a proper patch to the
> proper audience.


Indeed, but do we want to add a note about previous spec versions
saying something different? Maybe, with a hint how devices following
old spec can be detected?



> -- 
> Cheers,
> 
> David / dhildenb
Michael S. Tsirkin April 6, 2025, 6:42 p.m. UTC | #22
On Fri, Apr 04, 2025 at 03:48:49PM +0200, David Hildenbrand wrote:
> On 04.04.25 15:36, Halil Pasic wrote:
> > On Fri, 4 Apr 2025 12:55:09 +0200
> > David Hildenbrand <david@redhat.com> wrote:
> > 
> > > For virito-balloon, we should probably do the following:
> > > 
> > >   From 38e340c2bb53c2a7cc7c675f5dfdd44ecf7701d9 Mon Sep 17 00:00:00 2001
> > > From: David Hildenbrand <david@redhat.com>
> > > Date: Fri, 4 Apr 2025 12:53:16 +0200
> > > Subject: [PATCH] virtio-balloon: Fix queue index assignment for
> > >    non-existing queues
> > > 
> > > Signed-off-by: David Hildenbrand <david@redhat.com>
> > > ---
> > >    device-types/balloon/description.tex | 22 ++++++++++++++++------
> > >    1 file changed, 16 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/device-types/balloon/description.tex b/device-types/balloon/description.tex
> > > index a1d9603..a7396ff 100644
> > > --- a/device-types/balloon/description.tex
> > > +++ b/device-types/balloon/description.tex
> > > @@ -16,6 +16,21 @@ \subsection{Device ID}\label{sec:Device Types / Memory Balloon Device / Device I
> > >      5
> > >    \subsection{Virtqueues}\label{sec:Device Types / Memory Balloon Device / Virtqueues}
> > > +
> > > +\begin{description}
> > > +\item[inflateq] Exists unconditionally.
> > > +\item[deflateq] Exists unconditionally.
> > > +\item[statsq] Only exists if VIRTIO_BALLOON_F_STATS_VQ is set.
> > > +\item[free_page_vq] Only exists if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set.
> > > +\item[reporting_vq] Only exists if VIRTIO_BALLOON_F_PAGE_REPORTING is set.
> > 
> > s/is set/is negotiated/?
> > 
> > I think we should stick to "feature is offered" and "feature is
> > negotiated".
> > 
> > > +\end{description}
> > > +
> > > +\begin{note}
> > > +Virtqueue indexes are assigned sequentially for existing queues, starting
> > > +with index 0; consequently, if a virtqueue does not exist, it does not get
> > > +an index assigned. Assuming all virtqueues exist for a device, the indexes
> > > +are:
> > > +
> > >    \begin{description}
> > >    \item[0] inflateq
> > >    \item[1] deflateq
> > > @@ -23,12 +38,7 @@ \subsection{Virtqueues}\label{sec:Device Types / Memory Balloon Device / Virtque
> > >    \item[3] free_page_vq
> > >    \item[4] reporting_vq
> > >    \end{description}
> > > -
> > > -  statsq only exists if VIRTIO_BALLOON_F_STATS_VQ is set.
> > > -
> > > -  free_page_vq only exists if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set.
> > > -
> > > -  reporting_vq only exists if VIRTIO_BALLOON_F_PAGE_REPORTING is set.
> > > +\end{note}
> > >    \subsection{Feature bits}\label{sec:Device Types / Memory Balloon Device / Feature bits}
> > >    \begin{description}
> > 
> > Sounds good to me! But I'm still a little confused by the "holes". What
> > confuses me is that i can think of at least 2 distinct types of "holes":
> > 1) Holes that can be filled later. The queue conceptually exists, but
> >     there is no need to back it with any resources for now because it is
> >     dormant (it can be seen a hole in comparison to queues that need to
> >    materialize -- vring, notifiers, ...)
> > 2) Holes that can not be filled without resetting the device: i.e. if
> >     certain features are not negotiated, then a queue X does not exist,
> >     but subsequent queues retain their index.
> 
> I think it is not about "negotiated", that might be the wrong terminology.
> 
> E.g., in QEMU virtio_balloon_device_realize() we define the virtqueues
> (virtio_add_queue()) if virtio_has_feature(s->host_features).
> 
> That is, it's independent of a feature negotiation (IIUC), it's static for
> the device --  "host_features"


No no that is a bad idea. Breaks forward compatibility.

Oh my. I did not realize. It is really broken hopelessly.

Because, note, the guest looks at the guest features :)


Now I am beginning to think we should leave the spec alone
and fix the drivers ... Ugh ....




> 
> Is that really "negotiated" or is it "the device offers the feature X" ?
> 
> -- 
> Cheers,
> 
> David / dhildenb
David Hildenbrand April 7, 2025, 7:18 a.m. UTC | #23
On 06.04.25 20:42, Michael S. Tsirkin wrote:
> On Fri, Apr 04, 2025 at 03:48:49PM +0200, David Hildenbrand wrote:
>> On 04.04.25 15:36, Halil Pasic wrote:
>>> On Fri, 4 Apr 2025 12:55:09 +0200
>>> David Hildenbrand <david@redhat.com> wrote:
>>>
>>>> For virito-balloon, we should probably do the following:
>>>>
>>>>    From 38e340c2bb53c2a7cc7c675f5dfdd44ecf7701d9 Mon Sep 17 00:00:00 2001
>>>> From: David Hildenbrand <david@redhat.com>
>>>> Date: Fri, 4 Apr 2025 12:53:16 +0200
>>>> Subject: [PATCH] virtio-balloon: Fix queue index assignment for
>>>>     non-existing queues
>>>>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>> ---
>>>>     device-types/balloon/description.tex | 22 ++++++++++++++++------
>>>>     1 file changed, 16 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/device-types/balloon/description.tex b/device-types/balloon/description.tex
>>>> index a1d9603..a7396ff 100644
>>>> --- a/device-types/balloon/description.tex
>>>> +++ b/device-types/balloon/description.tex
>>>> @@ -16,6 +16,21 @@ \subsection{Device ID}\label{sec:Device Types / Memory Balloon Device / Device I
>>>>       5
>>>>     \subsection{Virtqueues}\label{sec:Device Types / Memory Balloon Device / Virtqueues}
>>>> +
>>>> +\begin{description}
>>>> +\item[inflateq] Exists unconditionally.
>>>> +\item[deflateq] Exists unconditionally.
>>>> +\item[statsq] Only exists if VIRTIO_BALLOON_F_STATS_VQ is set.
>>>> +\item[free_page_vq] Only exists if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set.
>>>> +\item[reporting_vq] Only exists if VIRTIO_BALLOON_F_PAGE_REPORTING is set.
>>>
>>> s/is set/is negotiated/?
>>>
>>> I think we should stick to "feature is offered" and "feature is
>>> negotiated".
>>>
>>>> +\end{description}
>>>> +
>>>> +\begin{note}
>>>> +Virtqueue indexes are assigned sequentially for existing queues, starting
>>>> +with index 0; consequently, if a virtqueue does not exist, it does not get
>>>> +an index assigned. Assuming all virtqueues exist for a device, the indexes
>>>> +are:
>>>> +
>>>>     \begin{description}
>>>>     \item[0] inflateq
>>>>     \item[1] deflateq
>>>> @@ -23,12 +38,7 @@ \subsection{Virtqueues}\label{sec:Device Types / Memory Balloon Device / Virtque
>>>>     \item[3] free_page_vq
>>>>     \item[4] reporting_vq
>>>>     \end{description}
>>>> -
>>>> -  statsq only exists if VIRTIO_BALLOON_F_STATS_VQ is set.
>>>> -
>>>> -  free_page_vq only exists if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set.
>>>> -
>>>> -  reporting_vq only exists if VIRTIO_BALLOON_F_PAGE_REPORTING is set.
>>>> +\end{note}
>>>>     \subsection{Feature bits}\label{sec:Device Types / Memory Balloon Device / Feature bits}
>>>>     \begin{description}
>>>
>>> Sounds good to me! But I'm still a little confused by the "holes". What
>>> confuses me is that i can think of at least 2 distinct types of "holes":
>>> 1) Holes that can be filled later. The queue conceptually exists, but
>>>      there is no need to back it with any resources for now because it is
>>>      dormant (it can be seen a hole in comparison to queues that need to
>>>     materialize -- vring, notifiers, ...)
>>> 2) Holes that can not be filled without resetting the device: i.e. if
>>>      certain features are not negotiated, then a queue X does not exist,
>>>      but subsequent queues retain their index.
>>
>> I think it is not about "negotiated", that might be the wrong terminology.
>>
>> E.g., in QEMU virtio_balloon_device_realize() we define the virtqueues
>> (virtio_add_queue()) if virtio_has_feature(s->host_features).
>>
>> That is, it's independent of a feature negotiation (IIUC), it's static for
>> the device --  "host_features"
> 
> 
> No no that is a bad idea. Breaks forward compatibility.
> 
> Oh my. I did not realize. It is really broken hopelessly.
> 
> Because, note, the guest looks at the guest features :)

Can you elaborate why?

statsq = 2

free_page_vq = statsq + host_offered_feat(VIRTIO_BALLOON_F_STATS_VQ)

reporting_vq = free_page_vq + 
host_offered_feat(VIRTIO_BALLOON_F_FREE_PAGE_HINT)


Independent of any upcoming features. And if a new feature defines a new 
virtqueue

new_vq = reporting_vq +  host_offered_feat(VIRTIO_BALLOON_F_PAGE_REPORTING)

We only have to make sure in the spec that these calculations always hold.

Querying of the host offered features already happens as part of 
determining the actual guest usable feature (driver_offered & host_offered).


> Now I am beginning to think we should leave the spec alone
> and fix the drivers ... Ugh ....

We could always say that starting with feature X, queue indexes are 
fixed again. E.g., VIRTIO_BALLOON_F_X would have it's virtqueue fixed at 
index 5, independent of the other (older) features where the virtqueue 
indexes are determined like today.

Won't make the implementation easier, though, I'm afraid.

(I also thought about a way to query the virtqueue index for a feature, 
but that's probably overengineering)
Michael S. Tsirkin April 7, 2025, 7:52 a.m. UTC | #24
On Fri, Apr 04, 2025 at 05:39:10PM +0200, Halil Pasic wrote:
> > 
> > Not perfect, but AFAIKS, not horrible.
> 
> It is like it is. QEMU does queue exist if the corresponding feature
> is offered by the device, and that is what we have to live with.

I don't think we can live with this properly though.
It means a guest that does not know about some features
does not know where to find things.

So now, I am inclined to add linux code to work with current qemu and
with spec compliant one, and add qemu code to work with current linux
and spec compliant one.

Document the bug in the spec, maybe, in a non conformance section.
David Hildenbrand April 7, 2025, 8:17 a.m. UTC | #25
On 07.04.25 09:52, Michael S. Tsirkin wrote:
> On Fri, Apr 04, 2025 at 05:39:10PM +0200, Halil Pasic wrote:
>>>
>>> Not perfect, but AFAIKS, not horrible.
>>
>> It is like it is. QEMU does queue exist if the corresponding feature
>> is offered by the device, and that is what we have to live with.
> 
> I don't think we can live with this properly though.
> It means a guest that does not know about some features
> does not know where to find things.

Please describe a real scenario, I'm missing the point.

Whoever adds new feat_X *must be aware* about all previous features, 
otherwise we'd be reusing feature bits and everything falls to pieces.

> 
> So now, I am inclined to add linux code to work with current qemu and
> with spec compliant one, and add qemu code to work with current linux
> and spec compliant one.
> 
> Document the bug in the spec, maybe, in a non conformance section.

I'm afraid this results in a lot of churn without really making things 
better.

IMHO, documenting things how they actually behave, and maybe moving 
towards fixed queue indexes for new features is the low hanging fruit.

As raised, it's not just qemu+linux, it's *at least* also cloud-hypervisor.
Michael S. Tsirkin April 7, 2025, 8:34 a.m. UTC | #26
On Mon, Apr 07, 2025 at 10:17:10AM +0200, David Hildenbrand wrote:
> On 07.04.25 09:52, Michael S. Tsirkin wrote:
> > On Fri, Apr 04, 2025 at 05:39:10PM +0200, Halil Pasic wrote:
> > > > 
> > > > Not perfect, but AFAIKS, not horrible.
> > > 
> > > It is like it is. QEMU does queue exist if the corresponding feature
> > > is offered by the device, and that is what we have to live with.
> > 
> > I don't think we can live with this properly though.
> > It means a guest that does not know about some features
> > does not know where to find things.
> 
> Please describe a real scenario, I'm missing the point.


OK so.

Device has VIRTIO_BALLOON_F_FREE_PAGE_HINT and VIRTIO_BALLOON_F_REPORTING
Driver only knows about VIRTIO_BALLOON_F_REPORTING so
it does not know what does VIRTIO_BALLOON_F_FREE_PAGE_HINT do.
How does it know which vq to use for reporting?
It will try to use the free page hint one.



> Whoever adds new feat_X *must be aware* about all previous features,
> otherwise we'd be reusing feature bits and everything falls to pieces.


The knowledge is supposed be limited to which feature bit to use.



> > 
> > So now, I am inclined to add linux code to work with current qemu and
> > with spec compliant one, and add qemu code to work with current linux
> > and spec compliant one.
> > 
> > Document the bug in the spec, maybe, in a non conformance section.
> 
> I'm afraid this results in a lot of churn without really making things
> better.

> IMHO, documenting things how they actually behave, and maybe moving towards
> fixed queue indexes for new features is the low hanging fruit.

I worry about how to we ensure that?
If old code is messed up people will just keep propagating that.
I would like to fix old code so that new code is correct.

> 
> As raised, it's not just qemu+linux, it's *at least* also cloud-hypervisor.
> 
> -- 
> Cheers,
> 
> David / dhildenb

There's a slippery slope here in that people will come to us
with buggy devices and ask to change the spec.
David Hildenbrand April 7, 2025, 8:38 a.m. UTC | #27
On 07.04.25 10:17, David Hildenbrand wrote:
> On 07.04.25 09:52, Michael S. Tsirkin wrote:
>> On Fri, Apr 04, 2025 at 05:39:10PM +0200, Halil Pasic wrote:
>>>>
>>>> Not perfect, but AFAIKS, not horrible.
>>>
>>> It is like it is. QEMU does queue exist if the corresponding feature
>>> is offered by the device, and that is what we have to live with.
>>
>> I don't think we can live with this properly though.
>> It means a guest that does not know about some features
>> does not know where to find things.
> 
> Please describe a real scenario, I'm missing the point.
> 
> Whoever adds new feat_X *must be aware* about all previous features,
> otherwise we'd be reusing feature bits and everything falls to pieces.
> 
>>
>> So now, I am inclined to add linux code to work with current qemu and
>> with spec compliant one, and add qemu code to work with current linux
>> and spec compliant one.
>>
>> Document the bug in the spec, maybe, in a non conformance section.
> 
> I'm afraid this results in a lot of churn without really making things
> better.
> 
> IMHO, documenting things how they actually behave, and maybe moving
> towards fixed queue indexes for new features is the low hanging fruit.
> 
> As raised, it's not just qemu+linux, it's *at least* also cloud-hypervisor.

I'm digging for other virtio-balloon implementations.


virtio-win: 
https://github.com/virtio-win/kvm-guest-drivers-windows/blob/master/Balloon/sys/balloon.c

-> Does not support hinting/reporting -> no problem


libkrun: 
https://github.com/containers/libkrun/blob/main/src/devices/src/virtio/balloon/device.rs

-> Hard-codes queue indexes but always seems to offer all features
  -> Offers VIRTIO_BALLOON_F_FREE_PAGE_HINT and VIRTIO_BALLOON_F_STATS_VQ
     even though it doesn't seem to implement them (device-triggered, so
     nothing to do probably?)
  -> Actually seems to implements VIRTIO_BALLOON_F_REPORTING


crossvm: 
https://github.com/google/crosvm/blob/main/devices/src/virtio/balloon.rs

-> Hard-codes queue numbers; does *not* offer/implement
    VIRTIO_BALLOON_F_STATS_VQ but does offer VIRTIO_BALLOON_F_STATS_VQ
    and VIRTIO_BALLOON_F_DEFLATE_ON_OOM.

-> Implements something that is not in the virtio-spec

const VIRTIO_BALLOON_F_WS_REPORTING: u32 = 8; // Working Set Reporting 
virtqueues

and

const WS_DATA_VQ: usize = 5;
const WS_OP_VQ: usize = 6;


IIUC, Linux inside cross-vm might actually be problematic? They would 
disagree on the virtqueue for free-page-reporting


Maybe I am missing something, it's a mess ...
Michael S. Tsirkin April 7, 2025, 8:41 a.m. UTC | #28
On Fri, Apr 04, 2025 at 05:39:10PM +0200, Halil Pasic wrote:
> That basically means that if I was, for the sake of fun do
> 
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -1197,7 +1197,6 @@ static unsigned int features[] = {
>         VIRTIO_BALLOON_F_MUST_TELL_HOST,
>         VIRTIO_BALLOON_F_STATS_VQ,
>         VIRTIO_BALLOON_F_DEFLATE_ON_OOM,
> -       VIRTIO_BALLOON_F_FREE_PAGE_HINT,
>         VIRTIO_BALLOON_F_PAGE_POISON,
>         VIRTIO_BALLOON_F_REPORTING,
>  };
> 
> I would end up with virtio_check_driver_offered_feature() calling
> BUG().


I mean, yes, this is exactly to catch drivers that use
features without negotiating them first.
David Hildenbrand April 7, 2025, 8:44 a.m. UTC | #29
On 07.04.25 10:34, Michael S. Tsirkin wrote:
> On Mon, Apr 07, 2025 at 10:17:10AM +0200, David Hildenbrand wrote:
>> On 07.04.25 09:52, Michael S. Tsirkin wrote:
>>> On Fri, Apr 04, 2025 at 05:39:10PM +0200, Halil Pasic wrote:
>>>>>
>>>>> Not perfect, but AFAIKS, not horrible.
>>>>
>>>> It is like it is. QEMU does queue exist if the corresponding feature
>>>> is offered by the device, and that is what we have to live with.
>>>
>>> I don't think we can live with this properly though.
>>> It means a guest that does not know about some features
>>> does not know where to find things.
>>
>> Please describe a real scenario, I'm missing the point.
> 
> 
> OK so.
> 
> Device has VIRTIO_BALLOON_F_FREE_PAGE_HINT and VIRTIO_BALLOON_F_REPORTING
> Driver only knows about VIRTIO_BALLOON_F_REPORTING so
> it does not know what does VIRTIO_BALLOON_F_FREE_PAGE_HINT do.
> How does it know which vq to use for reporting?
> It will try to use the free page hint one.

"only knows" -- VIRTIO_BALLOON_F_FREE_PAGE_HINT was proposed + specified 
in the spec.

So I think this is not a very good example.

> 
> 
> 
>> Whoever adds new feat_X *must be aware* about all previous features,
>> otherwise we'd be reusing feature bits and everything falls to pieces.
> 
> 
> The knowledge is supposed be limited to which feature bit to use.

I think we also have to know which virtqueue bits can be used, right?

I mean, I agree that it's all nasty ...

>>>
>>> So now, I am inclined to add linux code to work with current qemu and
>>> with spec compliant one, and add qemu code to work with current linux
>>> and spec compliant one.
>>>
>>> Document the bug in the spec, maybe, in a non conformance section.
>>
>> I'm afraid this results in a lot of churn without really making things
>> better.
> 
>> IMHO, documenting things how they actually behave, and maybe moving towards
>> fixed queue indexes for new features is the low hanging fruit.
> 
> I worry about how to we ensure that?
> If old code is messed up people will just keep propagating that.
> I would like to fix old code so that new code is correct.
> 
>>
>> As raised, it's not just qemu+linux, it's *at least* also cloud-hypervisor.
>>
>> -- 
>> Cheers,
>>
>> David / dhildenb
> 
> There's a slippery slope here in that people will come to us
> with buggy devices and ask to change the spec.

I yet have to fully digest cross-vm: 
https://github.com/google/crosvm/blob/main/devices/src/virtio/balloon.rs

and how it would free-page-reporting work with upstream Linux ... :(
Michael S. Tsirkin April 7, 2025, 8:44 a.m. UTC | #30
Wow great job digging through all these hypervisors!

On Mon, Apr 07, 2025 at 10:38:59AM +0200, David Hildenbrand wrote:
> crossvm:
> https://github.com/google/crosvm/blob/main/devices/src/virtio/balloon.rs
> 
> -> Hard-codes queue numbers; does *not* offer/implement
>    VIRTIO_BALLOON_F_STATS_VQ but does offer VIRTIO_BALLOON_F_STATS_VQ
>    and VIRTIO_BALLOON_F_DEFLATE_ON_OOM.
> 
> -> Implements something that is not in the virtio-spec
> 
> const VIRTIO_BALLOON_F_WS_REPORTING: u32 = 8; // Working Set Reporting
> virtqueues
> 
> and
> 
> const WS_DATA_VQ: usize = 5;
> const WS_OP_VQ: usize = 6;
> 
> 
> IIUC, Linux inside cross-vm might actually be problematic? They would
> disagree on the virtqueue for free-page-reporting


That's why things must be tied to negotiated features, not to offered
ones.


> 
> Maybe I am missing something, it's a mess ...

Indeed. Let's fix the implementations while keeping work arounds for
existing ones.
Michael S. Tsirkin April 7, 2025, 8:49 a.m. UTC | #31
On Mon, Apr 07, 2025 at 10:44:21AM +0200, David Hildenbrand wrote:
> > 
> > 
> > 
> > > Whoever adds new feat_X *must be aware* about all previous features,
> > > otherwise we'd be reusing feature bits and everything falls to pieces.
> > 
> > 
> > The knowledge is supposed be limited to which feature bit to use.
> 
> I think we also have to know which virtqueue bits can be used, right?
> 

what are virtqueue bits? vq number?
David Hildenbrand April 7, 2025, 8:50 a.m. UTC | #32
On 07.04.25 10:44, Michael S. Tsirkin wrote:
> Wow great job digging through all these hypervisors!
> 
> On Mon, Apr 07, 2025 at 10:38:59AM +0200, David Hildenbrand wrote:
>> crossvm:
>> https://github.com/google/crosvm/blob/main/devices/src/virtio/balloon.rs
>>
>> -> Hard-codes queue numbers; does *not* offer/implement
>>     VIRTIO_BALLOON_F_STATS_VQ but does offer VIRTIO_BALLOON_F_STATS_VQ
>>     and VIRTIO_BALLOON_F_DEFLATE_ON_OOM.
>>
>> -> Implements something that is not in the virtio-spec
>>
>> const VIRTIO_BALLOON_F_WS_REPORTING: u32 = 8; // Working Set Reporting
>> virtqueues
>>
>> and
>>
>> const WS_DATA_VQ: usize = 5;
>> const WS_OP_VQ: usize = 6;
>>
>>
>> IIUC, Linux inside cross-vm might actually be problematic? They would
>> disagree on the virtqueue for free-page-reporting
> 
> 
> That's why things must be tied to negotiated features, not to offered
> ones.

cross-vm also has this weird comment:

"
const VIRTIO_BALLOON_F_PAGE_REPORTING: u32 = 5; // Page reporting virtqueue
                                                 // TODO(b/273973298): 
this should maybe be bit 6? to be changed later
"

Not sure why that should be bit 6, the spec says 5 ...

So maybe whatever they run inside the VM is also out of spec ... Really 
hard to tell.
David Hildenbrand April 7, 2025, 8:54 a.m. UTC | #33
On 07.04.25 10:49, Michael S. Tsirkin wrote:
> On Mon, Apr 07, 2025 at 10:44:21AM +0200, David Hildenbrand wrote:
>>>
>>>
>>>
>>>> Whoever adds new feat_X *must be aware* about all previous features,
>>>> otherwise we'd be reusing feature bits and everything falls to pieces.
>>>
>>>
>>> The knowledge is supposed be limited to which feature bit to use.
>>
>> I think we also have to know which virtqueue bits can be used, right?
>>
> 
> what are virtqueue bits? vq number?

Yes, sorry.

Assume cross-vm as an example. It would make use of virtqueue indexes 
5+6 with their VIRTIO_BALLOON_F_WS_REPORTING.

So whatever feature another device implements couldn't use this feature 
bit or these virtqueue indexes.

(as long the other device never intends to implement 
VIRTIO_BALLOON_F_WS_REPORTING, the virtqueue indexes could be reused. 
But the spec will also be a mess, because virtqueue indexes could also 
have duplicate meanings ... ugh)
Michael S. Tsirkin April 7, 2025, 8:54 a.m. UTC | #34
On Mon, Apr 07, 2025 at 09:18:21AM +0200, David Hildenbrand wrote:
> > Now I am beginning to think we should leave the spec alone
> > and fix the drivers ... Ugh ....
> 
> We could always say that starting with feature X, queue indexes are fixed
> again. E.g., VIRTIO_BALLOON_F_X would have it's virtqueue fixed at index 5,
> independent of the other (older) features where the virtqueue indexes are
> determined like today.
> 
> Won't make the implementation easier, though, I'm afraid.
> 
> (I also thought about a way to query the virtqueue index for a feature, but
> that's probably overengineering)

The best contract we have is the spec. Sometimes it is hopelessly broken
and we have to fix it, but not in this case.

Let's do a theoretical excercise, assuming we want to fix the drivers,
but we also want to have workarounds in place in qemu and in
drivers to support existing ones. How would we go about it?



Maybe we want a feature bit BALLOON_FIXED and ask everyone
to negotiate it?  But if we go this way, we really need to fix
the 48 bit limitation too.
Michael S. Tsirkin April 7, 2025, 8:58 a.m. UTC | #35
On Mon, Apr 07, 2025 at 10:54:00AM +0200, David Hildenbrand wrote:
> On 07.04.25 10:49, Michael S. Tsirkin wrote:
> > On Mon, Apr 07, 2025 at 10:44:21AM +0200, David Hildenbrand wrote:
> > > > 
> > > > 
> > > > 
> > > > > Whoever adds new feat_X *must be aware* about all previous features,
> > > > > otherwise we'd be reusing feature bits and everything falls to pieces.
> > > > 
> > > > 
> > > > The knowledge is supposed be limited to which feature bit to use.
> > > 
> > > I think we also have to know which virtqueue bits can be used, right?
> > > 
> > 
> > what are virtqueue bits? vq number?
> 
> Yes, sorry.

I got confused myself, it's vq index actually now, we made the spec
consistent with that terminology. used to be number/index
interchangeably.

> Assume cross-vm as an example. It would make use of virtqueue indexes 5+6
> with their VIRTIO_BALLOON_F_WS_REPORTING.


crossvm guys really should have reserved the feature bit even if they
did not bother specifying it. Let's reserve it now at least?


> So whatever feature another device implements couldn't use this feature bit
> or these virtqueue indexes.
> 
> (as long the other device never intends to implement
> VIRTIO_BALLOON_F_WS_REPORTING, the virtqueue indexes could be reused. But
> the spec will also be a mess, because virtqueue indexes could also have
> duplicate meanings ... ugh)

what do they do with vq indices btw?



> -- 
> Cheers,
> 
> David / dhildenb
David Hildenbrand April 7, 2025, 9:08 a.m. UTC | #36
On 07.04.25 10:54, Michael S. Tsirkin wrote:
> On Mon, Apr 07, 2025 at 09:18:21AM +0200, David Hildenbrand wrote:
>>> Now I am beginning to think we should leave the spec alone
>>> and fix the drivers ... Ugh ....
>>
>> We could always say that starting with feature X, queue indexes are fixed
>> again. E.g., VIRTIO_BALLOON_F_X would have it's virtqueue fixed at index 5,
>> independent of the other (older) features where the virtqueue indexes are
>> determined like today.
>>
>> Won't make the implementation easier, though, I'm afraid.
>>
>> (I also thought about a way to query the virtqueue index for a feature, but
>> that's probably overengineering)
> 
> The best contract we have is the spec. Sometimes it is hopelessly broken
> and we have to fix it, but not in this case.
> 
> Let's do a theoretical excercise, assuming we want to fix the drivers,
> but we also want to have workarounds in place in qemu and in
> drivers to support existing ones. How would we go about it?

QEMU could likely be changed to always offer 
VIRTIO_BALLOON_F_FREE_PAGE_HINT, but not actually use it unless enabled 
for QEMU. That should work, because all action is initiated by the device.

That way, all virtqueue indexes would always be according to the spec.

We'll likely need compat machine handling ....


Regarding Linux, I'll have to think about it further ...

> Maybe we want a feature bit BALLOON_FIXED and ask everyone
> to negotiate it?  But if we go this way, we really need to fix
> the 48 bit limitation too.

I was thinking about the same, but it's all a mess ...
David Hildenbrand April 7, 2025, 9:11 a.m. UTC | #37
On 07.04.25 10:58, Michael S. Tsirkin wrote:
> On Mon, Apr 07, 2025 at 10:54:00AM +0200, David Hildenbrand wrote:
>> On 07.04.25 10:49, Michael S. Tsirkin wrote:
>>> On Mon, Apr 07, 2025 at 10:44:21AM +0200, David Hildenbrand wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Whoever adds new feat_X *must be aware* about all previous features,
>>>>>> otherwise we'd be reusing feature bits and everything falls to pieces.
>>>>>
>>>>>
>>>>> The knowledge is supposed be limited to which feature bit to use.
>>>>
>>>> I think we also have to know which virtqueue bits can be used, right?
>>>>
>>>
>>> what are virtqueue bits? vq number?
>>
>> Yes, sorry.
> 
> I got confused myself, it's vq index actually now, we made the spec
> consistent with that terminology. used to be number/index
> interchangeably.
> 
>> Assume cross-vm as an example. It would make use of virtqueue indexes 5+6
>> with their VIRTIO_BALLOON_F_WS_REPORTING.
> 
> 
> crossvm guys really should have reserved the feature bit even if they
> did not bother specifying it. Let's reserve it now at least?

Along with the virtqueue indices, right?

Note that there was

https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg02503.html

and

https://groups.oasis-open.org/communities/community-home/digestviewer/viewthread?GroupId=3973&MessageKey=afb07613-f56c-4d40-8981-2fad1c723998&CommunityKey=2f26be99-3aa1-48f6-93a5-018dce262226&hlmlt=VT

But it only was RFC, and as the QEMU implementation didn't materialize, 
nobody seemed to care ...

> 
> 
>> So whatever feature another device implements couldn't use this feature bit
>> or these virtqueue indexes.
>>
>> (as long the other device never intends to implement
>> VIRTIO_BALLOON_F_WS_REPORTING, the virtqueue indexes could be reused. But
>> the spec will also be a mess, because virtqueue indexes could also have
>> duplicate meanings ... ugh)
> 
> what do they do with vq indices btw?

See above links, they use the two for "s_vq and notification_vq".
David Hildenbrand April 7, 2025, 9:13 a.m. UTC | #38
On 07.04.25 11:11, David Hildenbrand wrote:
> On 07.04.25 10:58, Michael S. Tsirkin wrote:
>> On Mon, Apr 07, 2025 at 10:54:00AM +0200, David Hildenbrand wrote:
>>> On 07.04.25 10:49, Michael S. Tsirkin wrote:
>>>> On Mon, Apr 07, 2025 at 10:44:21AM +0200, David Hildenbrand wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Whoever adds new feat_X *must be aware* about all previous features,
>>>>>>> otherwise we'd be reusing feature bits and everything falls to pieces.
>>>>>>
>>>>>>
>>>>>> The knowledge is supposed be limited to which feature bit to use.
>>>>>
>>>>> I think we also have to know which virtqueue bits can be used, right?
>>>>>
>>>>
>>>> what are virtqueue bits? vq number?
>>>
>>> Yes, sorry.
>>
>> I got confused myself, it's vq index actually now, we made the spec
>> consistent with that terminology. used to be number/index
>> interchangeably.
>>
>>> Assume cross-vm as an example. It would make use of virtqueue indexes 5+6
>>> with their VIRTIO_BALLOON_F_WS_REPORTING.
>>
>>
>> crossvm guys really should have reserved the feature bit even if they
>> did not bother specifying it. Let's reserve it now at least?
> 
> Along with the virtqueue indices, right?
> 
> Note that there was
> 
> https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg02503.html
> 
> and
> 
> https://groups.oasis-open.org/communities/community-home/digestviewer/viewthread?GroupId=3973&MessageKey=afb07613-f56c-4d40-8981-2fad1c723998&CommunityKey=2f26be99-3aa1-48f6-93a5-018dce262226&hlmlt=VT
> 
> But it only was RFC, and as the QEMU implementation didn't materialize,
> nobody seemed to care ...

Heh, but that one said:

+\item[ VIRTIO_BALLOON_F_WS_REPORTING(6) ] The device has support for
Working Set

Which does not seem to reflect reality ...
David Hildenbrand April 7, 2025, 9:22 a.m. UTC | #39
On 07.04.25 10:44, Michael S. Tsirkin wrote:
> Wow great job digging through all these hypervisors!

There is more ... :(

aloith: 
https://github.com/google/alioth/blob/main/alioth/src/virtio/dev/balloon.rs

It uses the incremental vq index assignment like QEMU.

impl VirtioMio for Balloon {
     fn activate<'a, 'm, Q: VirtQueue<'m>, S: IrqSender>(
         &mut self,
         feature: u64,
         _active_mio: &mut ActiveMio<'a, 'm, Q, S>,
     ) -> Result<()> {
         let feature = BalloonFeature::from_bits_retain(feature);
         self.queues[0] = BalloonQueue::Inflate;
         self.queues[1] = BalloonQueue::Deflate;
         let mut index = 2;
         if feature.contains(BalloonFeature::STATS_VQ) {
             self.queues[index] = BalloonQueue::Stats;
             index += 1;
         }
         if feature.contains(BalloonFeature::FREE_PAGE_HINT) {
             self.queues[index] = BalloonQueue::FreePage;
             index += 1;
         }
         if feature.contains(BalloonFeature::PAGE_REPORTING) {
             self.queues[index] = BalloonQueue::Reporting;
         }
         Ok(())
     }


I'll dig some more, but this is getting out of hand :D
Michael S. Tsirkin April 7, 2025, 9:37 a.m. UTC | #40
On Mon, Apr 07, 2025 at 11:11:34AM +0200, David Hildenbrand wrote:
> On 07.04.25 10:58, Michael S. Tsirkin wrote:
> > On Mon, Apr 07, 2025 at 10:54:00AM +0200, David Hildenbrand wrote:
> > > On 07.04.25 10:49, Michael S. Tsirkin wrote:
> > > > On Mon, Apr 07, 2025 at 10:44:21AM +0200, David Hildenbrand wrote:
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > > Whoever adds new feat_X *must be aware* about all previous features,
> > > > > > > otherwise we'd be reusing feature bits and everything falls to pieces.
> > > > > > 
> > > > > > 
> > > > > > The knowledge is supposed be limited to which feature bit to use.
> > > > > 
> > > > > I think we also have to know which virtqueue bits can be used, right?
> > > > > 
> > > > 
> > > > what are virtqueue bits? vq number?
> > > 
> > > Yes, sorry.
> > 
> > I got confused myself, it's vq index actually now, we made the spec
> > consistent with that terminology. used to be number/index
> > interchangeably.
> > 
> > > Assume cross-vm as an example. It would make use of virtqueue indexes 5+6
> > > with their VIRTIO_BALLOON_F_WS_REPORTING.
> > 
> > 
> > crossvm guys really should have reserved the feature bit even if they
> > did not bother specifying it. Let's reserve it now at least?
> 
> Along with the virtqueue indices, right?

Well ... as long as the implementation is careful to check that feature
is negotiated, reusing vq index at least causes no trouble for others.


> Note that there was
> 
> https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg02503.html
> 
> and
> 
> https://groups.oasis-open.org/communities/community-home/digestviewer/viewthread?GroupId=3973&MessageKey=afb07613-f56c-4d40-8981-2fad1c723998&CommunityKey=2f26be99-3aa1-48f6-93a5-018dce262226&hlmlt=VT
> 
> But it only was RFC, and as the QEMU implementation didn't materialize,
> nobody seemed to care ...

Thanks! I will try poke the author again.


> > 
> > 
> > > So whatever feature another device implements couldn't use this feature bit
> > > or these virtqueue indexes.
> > > 
> > > (as long the other device never intends to implement
> > > VIRTIO_BALLOON_F_WS_REPORTING, the virtqueue indexes could be reused. But
> > > the spec will also be a mess, because virtqueue indexes could also have
> > > duplicate meanings ... ugh)
> > 
> > what do they do with vq indices btw?
> 
> See above links, they use the two for "s_vq and notification_vq".
> 
> -- 
> Cheers,
> 
> David / dhildenb
Halil Pasic April 7, 2025, 1:12 p.m. UTC | #41
On Mon, 7 Apr 2025 04:34:29 -0400
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Mon, Apr 07, 2025 at 10:17:10AM +0200, David Hildenbrand wrote:
> > On 07.04.25 09:52, Michael S. Tsirkin wrote:  
> > > On Fri, Apr 04, 2025 at 05:39:10PM +0200, Halil Pasic wrote:  
> > > > > 
> > > > > Not perfect, but AFAIKS, not horrible.  
> > > > 
> > > > It is like it is. QEMU does queue exist if the corresponding feature
> > > > is offered by the device, and that is what we have to live with.  
> > > 
> > > I don't think we can live with this properly though.
> > > It means a guest that does not know about some features
> > > does not know where to find things.  
> > 
> > Please describe a real scenario, I'm missing the point.  
> 
> 
> OK so.
> 
> Device has VIRTIO_BALLOON_F_FREE_PAGE_HINT and VIRTIO_BALLOON_F_REPORTING
> Driver only knows about VIRTIO_BALLOON_F_REPORTING so
> it does not know what does VIRTIO_BALLOON_F_FREE_PAGE_HINT do.
> How does it know which vq to use for reporting?
> It will try to use the free page hint one.

First, sorry for not catching up again with the discussion earlier.

I think David's point is based on the assumption that by the time feature
with the feature bit N+1 is specified and allocates a queue Q, all
queues with indexes smaller than Q are allocated and possibly associated
with features that were previously specified (and probably have feature
bits smaller than N+1). 

I.e. that we can mandate, even if you don't want to care about other
optional features, you have to, because we say so, for the matter of
virtqueue existence. And anything in the future, you don't have to care
about because the queue index associated with future features is larger
than Q, so it does not affect our position.

I think that argument can fall a part if:
* future features reference optional queues defined in the past
* somebody managed to introduce a limbo where a feature is reserved, and
  they can not decide if they want a queue or not, or make the existence
  of the queue depend on something else than a feature bit.

Frankly I don't think the risks are huge, but it would undoubtedly make
the spec more ugly.

> 
> 
> 
> > Whoever adds new feat_X *must be aware* about all previous features,
> > otherwise we'd be reusing feature bits and everything falls to pieces.  
> 
> 
> The knowledge is supposed be limited to which feature bit to use.
> 

I do agree! This is why I brought this question up. Creating exceptions
from that rule would be very ugly IMHO. But I would not say it is
impossible.

> 
> 
> > > 
> > > So now, I am inclined to add linux code to work with current qemu and
> > > with spec compliant one, and add qemu code to work with current linux
> > > and spec compliant one.
> > > 
> > > Document the bug in the spec, maybe, in a non conformance section.  
> > 
> > I'm afraid this results in a lot of churn without really making things
> > better.  
> 
> > IMHO, documenting things how they actually behave, and maybe moving towards
> > fixed queue indexes for new features is the low hanging fruit.  
> 
> I worry about how to we ensure that?
> If old code is messed up people will just keep propagating that.
> I would like to fix old code so that new code is correct.
> 
> > 
> > As raised, it's not just qemu+linux, it's *at least* also cloud-hypervisor.
> > 
> > -- 
> > Cheers,
> > 
> > David / dhildenb  
> 
> There's a slippery slope here in that people will come to us
> with buggy devices and ask to change the spec.
> 

I agree! IMHO all we have are bad options. To decide for myself which is
less ugly I would love to see both.

I agree making the spec bend to the fact that implementations are buggy
does not seem right from the spec perspective. But if making things work
is not practical without sacrificing the sanctity of the spec, I am
willing to swallow the bitter pill and bend the spec.

If you don't mind try to keep me in the loop, even if I'm not able to
be as responsive as I would like to be. I'm happy fixing the code is
going to get another round of consideration.

Regards,
Halil
David Hildenbrand April 7, 2025, 1:13 p.m. UTC | #42
On 07.04.25 11:13, David Hildenbrand wrote:
> On 07.04.25 11:11, David Hildenbrand wrote:
>> On 07.04.25 10:58, Michael S. Tsirkin wrote:
>>> On Mon, Apr 07, 2025 at 10:54:00AM +0200, David Hildenbrand wrote:
>>>> On 07.04.25 10:49, Michael S. Tsirkin wrote:
>>>>> On Mon, Apr 07, 2025 at 10:44:21AM +0200, David Hildenbrand wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Whoever adds new feat_X *must be aware* about all previous features,
>>>>>>>> otherwise we'd be reusing feature bits and everything falls to pieces.
>>>>>>>
>>>>>>>
>>>>>>> The knowledge is supposed be limited to which feature bit to use.
>>>>>>
>>>>>> I think we also have to know which virtqueue bits can be used, right?
>>>>>>
>>>>>
>>>>> what are virtqueue bits? vq number?
>>>>
>>>> Yes, sorry.
>>>
>>> I got confused myself, it's vq index actually now, we made the spec
>>> consistent with that terminology. used to be number/index
>>> interchangeably.
>>>
>>>> Assume cross-vm as an example. It would make use of virtqueue indexes 5+6
>>>> with their VIRTIO_BALLOON_F_WS_REPORTING.
>>>
>>>
>>> crossvm guys really should have reserved the feature bit even if they
>>> did not bother specifying it. Let's reserve it now at least?
>>
>> Along with the virtqueue indices, right?
>>
>> Note that there was
>>
>> https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg02503.html
>>
>> and
>>
>> https://groups.oasis-open.org/communities/community-home/digestviewer/viewthread?GroupId=3973&MessageKey=afb07613-f56c-4d40-8981-2fad1c723998&CommunityKey=2f26be99-3aa1-48f6-93a5-018dce262226&hlmlt=VT
>>
>> But it only was RFC, and as the QEMU implementation didn't materialize,
>> nobody seemed to care ...
> 
> Heh, but that one said:
> 
> +\item[ VIRTIO_BALLOON_F_WS_REPORTING(6) ] The device has support for
> Working Set
> 
> Which does not seem to reflect reality ...
> 

I dug a bit more into cross-vm, because that one seems to be the only
one out there that does not behave like everybody else I found (maybe good,
maybe bad :) ).


1) There was temporarily even another feature (VIRTIO_BALLOON_F_EVENTS_VQ)
and another queue.

It got removed from cross-vm in:

commit 9ba634b82b55ba762dc8724676b2cf9419460145
Author: Daniel Verkamp <dverkamp@chromium.org>
Date:   Thu Jul 11 11:29:52 2024 -0700

     devices: virtio-balloon: remove event queue support
     
     VIRTIO_BALLOON_F_EVENTS_VQ was part of a proposed virtio spec change.
     
     It is not currently supported by upstream Linux, so removing this should
     have no effect except for guest kernels that had CHROMIUM patches
     applied.
     
     The virtqueue indexes for the ws-related queues are decremented to fill
     the hole left by the removal of the event VQ; these are non-standard as
     well, so they do not have virtqueue indexes assigned in the virtio spec,
     but the proposed spec extension did actually use vq indexes 5 and 6.
     
     BUG=b:214864326


2) cross-vm is aware of the upstream Linux driver

They thought your fix would go upstream; it didn't.

commit a2fa119e759d0238a42ff15a9aff0dfd122afebd
Author: Daniel Verkamp <dverkamp@chromium.org>
Date:   Wed Jul 10 16:16:28 2024 -0700

     devices: virtio-balloon: warn about queue index mismatches
     
     The Linux kernel virtio-balloon driver spec non-compliance related to
     queue numbering is being fixed; add some diagnostics to our device that
     help to check if everything is working as expected.
     
     <https://lore.kernel.org/virtualization/CACGkMEsg0+vpav1Fo8JF1isq4Ef8t4_CFN1scyztDO8bXzRLBQ@mail.gmail.com/T/>
     
     Additionally, replace the num_expected_queues() function with per-queue
     checking to avoid the need for the duplicate feature checks and queue
     count calculation; each pop_queue() call will be checked using the `?`
     operator and return a more useful error message if a particular queue is
     missing.
     
     BUG=None
     TEST=crosvm run --balloon-page-reporting ...


IIRC, in that commit they switched to the "spec" behavior.

That's when they started hard-coding the queue indexes.

CCing Daniel. All Linux versions should be incompatible with cross-vmm regarding free page reporting.
How is that handled?
David Hildenbrand April 7, 2025, 1:17 p.m. UTC | #43
On 07.04.25 15:12, Halil Pasic wrote:
> On Mon, 7 Apr 2025 04:34:29 -0400
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
>> On Mon, Apr 07, 2025 at 10:17:10AM +0200, David Hildenbrand wrote:
>>> On 07.04.25 09:52, Michael S. Tsirkin wrote:
>>>> On Fri, Apr 04, 2025 at 05:39:10PM +0200, Halil Pasic wrote:
>>>>>>
>>>>>> Not perfect, but AFAIKS, not horrible.
>>>>>
>>>>> It is like it is. QEMU does queue exist if the corresponding feature
>>>>> is offered by the device, and that is what we have to live with.
>>>>
>>>> I don't think we can live with this properly though.
>>>> It means a guest that does not know about some features
>>>> does not know where to find things.
>>>
>>> Please describe a real scenario, I'm missing the point.
>>
>>
>> OK so.
>>
>> Device has VIRTIO_BALLOON_F_FREE_PAGE_HINT and VIRTIO_BALLOON_F_REPORTING
>> Driver only knows about VIRTIO_BALLOON_F_REPORTING so
>> it does not know what does VIRTIO_BALLOON_F_FREE_PAGE_HINT do.
>> How does it know which vq to use for reporting?
>> It will try to use the free page hint one.
> 
> First, sorry for not catching up again with the discussion earlier.
> 
> I think David's point is based on the assumption that by the time feature
> with the feature bit N+1 is specified and allocates a queue Q, all
> queues with indexes smaller than Q are allocated and possibly associated
> with features that were previously specified (and probably have feature
> bits smaller than N+1).
> 
> I.e. that we can mandate, even if you don't want to care about other
> optional features, you have to, because we say so, for the matter of
> virtqueue existence. And anything in the future, you don't have to care
> about because the queue index associated with future features is larger
> than Q, so it does not affect our position.
> 
> I think that argument can fall a part if:
> * future features reference optional queues defined in the past
> * somebody managed to introduce a limbo where a feature is reserved, and
>    they can not decide if they want a queue or not, or make the existence
>    of the queue depend on something else than a feature bit.

Staring at the cross-vmm, including the adding+removing of features and 
queues that are not in the spec, I am wondering if (in a world with 
fixed virtqueues)

1) Feature bits must be reserved before used.

2) Queue indices must be reserved before used.

It all smells like a problem similar to device IDs ...
Cornelia Huck April 7, 2025, 1:28 p.m. UTC | #44
On Mon, Apr 07 2025, David Hildenbrand <david@redhat.com> wrote:

> On 07.04.25 15:12, Halil Pasic wrote:
>> On Mon, 7 Apr 2025 04:34:29 -0400
>> "Michael S. Tsirkin" <mst@redhat.com> wrote:
>> 
>>> On Mon, Apr 07, 2025 at 10:17:10AM +0200, David Hildenbrand wrote:
>>>> On 07.04.25 09:52, Michael S. Tsirkin wrote:
>>>>> On Fri, Apr 04, 2025 at 05:39:10PM +0200, Halil Pasic wrote:
>>>>>>>
>>>>>>> Not perfect, but AFAIKS, not horrible.
>>>>>>
>>>>>> It is like it is. QEMU does queue exist if the corresponding feature
>>>>>> is offered by the device, and that is what we have to live with.
>>>>>
>>>>> I don't think we can live with this properly though.
>>>>> It means a guest that does not know about some features
>>>>> does not know where to find things.
>>>>
>>>> Please describe a real scenario, I'm missing the point.
>>>
>>>
>>> OK so.
>>>
>>> Device has VIRTIO_BALLOON_F_FREE_PAGE_HINT and VIRTIO_BALLOON_F_REPORTING
>>> Driver only knows about VIRTIO_BALLOON_F_REPORTING so
>>> it does not know what does VIRTIO_BALLOON_F_FREE_PAGE_HINT do.
>>> How does it know which vq to use for reporting?
>>> It will try to use the free page hint one.
>> 
>> First, sorry for not catching up again with the discussion earlier.
>> 
>> I think David's point is based on the assumption that by the time feature
>> with the feature bit N+1 is specified and allocates a queue Q, all
>> queues with indexes smaller than Q are allocated and possibly associated
>> with features that were previously specified (and probably have feature
>> bits smaller than N+1).
>> 
>> I.e. that we can mandate, even if you don't want to care about other
>> optional features, you have to, because we say so, for the matter of
>> virtqueue existence. And anything in the future, you don't have to care
>> about because the queue index associated with future features is larger
>> than Q, so it does not affect our position.
>> 
>> I think that argument can fall a part if:
>> * future features reference optional queues defined in the past
>> * somebody managed to introduce a limbo where a feature is reserved, and
>>    they can not decide if they want a queue or not, or make the existence
>>    of the queue depend on something else than a feature bit.
>
> Staring at the cross-vmm, including the adding+removing of features and 
> queues that are not in the spec, I am wondering if (in a world with 
> fixed virtqueues)
>
> 1) Feature bits must be reserved before used.
>
> 2) Queue indices must be reserved before used.
>
> It all smells like a problem similar to device IDs ...

Indeed, we need a rule "reserve a feature bit/queue index before using
it, even if you do not plan to spec it properly".
Michael S. Tsirkin April 7, 2025, 1:32 p.m. UTC | #45
On Mon, Apr 07, 2025 at 03:28:13PM +0200, Cornelia Huck wrote:
> On Mon, Apr 07 2025, David Hildenbrand <david@redhat.com> wrote:
> 
> > On 07.04.25 15:12, Halil Pasic wrote:
> >> On Mon, 7 Apr 2025 04:34:29 -0400
> >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >> 
> >>> On Mon, Apr 07, 2025 at 10:17:10AM +0200, David Hildenbrand wrote:
> >>>> On 07.04.25 09:52, Michael S. Tsirkin wrote:
> >>>>> On Fri, Apr 04, 2025 at 05:39:10PM +0200, Halil Pasic wrote:
> >>>>>>>
> >>>>>>> Not perfect, but AFAIKS, not horrible.
> >>>>>>
> >>>>>> It is like it is. QEMU does queue exist if the corresponding feature
> >>>>>> is offered by the device, and that is what we have to live with.
> >>>>>
> >>>>> I don't think we can live with this properly though.
> >>>>> It means a guest that does not know about some features
> >>>>> does not know where to find things.
> >>>>
> >>>> Please describe a real scenario, I'm missing the point.
> >>>
> >>>
> >>> OK so.
> >>>
> >>> Device has VIRTIO_BALLOON_F_FREE_PAGE_HINT and VIRTIO_BALLOON_F_REPORTING
> >>> Driver only knows about VIRTIO_BALLOON_F_REPORTING so
> >>> it does not know what does VIRTIO_BALLOON_F_FREE_PAGE_HINT do.
> >>> How does it know which vq to use for reporting?
> >>> It will try to use the free page hint one.
> >> 
> >> First, sorry for not catching up again with the discussion earlier.
> >> 
> >> I think David's point is based on the assumption that by the time feature
> >> with the feature bit N+1 is specified and allocates a queue Q, all
> >> queues with indexes smaller than Q are allocated and possibly associated
> >> with features that were previously specified (and probably have feature
> >> bits smaller than N+1).
> >> 
> >> I.e. that we can mandate, even if you don't want to care about other
> >> optional features, you have to, because we say so, for the matter of
> >> virtqueue existence. And anything in the future, you don't have to care
> >> about because the queue index associated with future features is larger
> >> than Q, so it does not affect our position.
> >> 
> >> I think that argument can fall a part if:
> >> * future features reference optional queues defined in the past
> >> * somebody managed to introduce a limbo where a feature is reserved, and
> >>    they can not decide if they want a queue or not, or make the existence
> >>    of the queue depend on something else than a feature bit.
> >
> > Staring at the cross-vmm, including the adding+removing of features and 
> > queues that are not in the spec, I am wondering if (in a world with 
> > fixed virtqueues)
> >
> > 1) Feature bits must be reserved before used.
> >
> > 2) Queue indices must be reserved before used.
> >
> > It all smells like a problem similar to device IDs ...
> 
> Indeed, we need a rule "reserve a feature bit/queue index before using
> it, even if you do not plan to spec it properly".


Reserving feature bits is something I do my best to advocate for
in all presentations I do.
Halil Pasic April 7, 2025, 5:26 p.m. UTC | #46
On Mon, 07 Apr 2025 15:28:13 +0200
Cornelia Huck <cohuck@redhat.com> wrote:

> > Staring at the cross-vmm, including the adding+removing of features and 
> > queues that are not in the spec, I am wondering if (in a world with 
> > fixed virtqueues)
> >
> > 1) Feature bits must be reserved before used.
> >
> > 2) Queue indices must be reserved before used.
> >
> > It all smells like a problem similar to device IDs ...  
> 
> Indeed, we need a rule "reserve a feature bit/queue index before using
> it, even if you do not plan to spec it properly".

What definition of usage do you guys have in mind?  Would an RFC patch
constitute usage.

I think reserving/allocating an identifier of this type before relying on
it for anything remotely serious is very basic common sense.

Frankly I would go even further and advocate for the following rule: we
don't accept anything virtio into Linux unless it is reasonably/properly
spec-ed. My train of thought is following: if a virtio thing gains
traction with Linux it has a fair chance of becoming a de-facto standard.

Consider our thinking on this one. Despite the fact that what is spec-ed
is obviously nicer, we almost decided to change the spec to fit what is
implemented and fielded out there. And IMHO for good reason. For any
rule we come up with, I think one of the most crucial questions is, who
is going to enforce it if anybody. The Linux (and probably also QEMU)
virtio maintainers are in my opinion the most reasonable point of
enforcement. Another thing to consider. After the code is in and things
work, I speculate that the motivation for writing a proper spec may
wane. I hope we do strive for consistency between the spec and the
implementations we are talking about. Having and eye on the spec while
looking at and trying to understand the code suits my workflow better
at least than the other way around. And licensing-wise getting the spec
merged first is probably the better option. 

Regards,
Halil
Daniel Verkamp April 7, 2025, 5:39 p.m. UTC | #47
On Mon, Apr 7, 2025 at 6:13 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 07.04.25 11:13, David Hildenbrand wrote:
> > On 07.04.25 11:11, David Hildenbrand wrote:
> >> On 07.04.25 10:58, Michael S. Tsirkin wrote:
> >>> On Mon, Apr 07, 2025 at 10:54:00AM +0200, David Hildenbrand wrote:
> >>>> On 07.04.25 10:49, Michael S. Tsirkin wrote:
> >>>>> On Mon, Apr 07, 2025 at 10:44:21AM +0200, David Hildenbrand wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> Whoever adds new feat_X *must be aware* about all previous features,
> >>>>>>>> otherwise we'd be reusing feature bits and everything falls to pieces.
> >>>>>>>
> >>>>>>>
> >>>>>>> The knowledge is supposed be limited to which feature bit to use.
> >>>>>>
> >>>>>> I think we also have to know which virtqueue bits can be used, right?
> >>>>>>
> >>>>>
> >>>>> what are virtqueue bits? vq number?
> >>>>
> >>>> Yes, sorry.
> >>>
> >>> I got confused myself, it's vq index actually now, we made the spec
> >>> consistent with that terminology. used to be number/index
> >>> interchangeably.
> >>>
> >>>> Assume cross-vm as an example. It would make use of virtqueue indexes 5+6
> >>>> with their VIRTIO_BALLOON_F_WS_REPORTING.
> >>>
> >>>
> >>> crossvm guys really should have reserved the feature bit even if they
> >>> did not bother specifying it. Let's reserve it now at least?
> >>
> >> Along with the virtqueue indices, right?
> >>
> >> Note that there was
> >>
> >> https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg02503.html
> >>
> >> and
> >>
> >> https://groups.oasis-open.org/communities/community-home/digestviewer/viewthread?GroupId=3973&MessageKey=afb07613-f56c-4d40-8981-2fad1c723998&CommunityKey=2f26be99-3aa1-48f6-93a5-018dce262226&hlmlt=VT
> >>
> >> But it only was RFC, and as the QEMU implementation didn't materialize,
> >> nobody seemed to care ...
> >
> > Heh, but that one said:
> >
> > +\item[ VIRTIO_BALLOON_F_WS_REPORTING(6) ] The device has support for
> > Working Set
> >
> > Which does not seem to reflect reality ...

Please feel free to disregard these features and reuse their bits and
queue indexes; as far as I know, they are not actually enabled
anywhere currently and the corresponding guest patches were only
applied to some (no-longer-used) ChromeOS kernel trees, so the
compatibility impact should be minimal. I will also try to clean up
the leftover bits on the crosvm side just to clear things up.

> I dug a bit more into cross-vm, because that one seems to be the only
> one out there that does not behave like everybody else I found (maybe good,
> maybe bad :) ).
>
>
> 1) There was temporarily even another feature (VIRTIO_BALLOON_F_EVENTS_VQ)
> and another queue.
>
> It got removed from cross-vm in:
>
> commit 9ba634b82b55ba762dc8724676b2cf9419460145
> Author: Daniel Verkamp <dverkamp@chromium.org>
> Date:   Thu Jul 11 11:29:52 2024 -0700
>
>      devices: virtio-balloon: remove event queue support
>
>      VIRTIO_BALLOON_F_EVENTS_VQ was part of a proposed virtio spec change.
>
>      It is not currently supported by upstream Linux, so removing this should
>      have no effect except for guest kernels that had CHROMIUM patches
>      applied.
>
>      The virtqueue indexes for the ws-related queues are decremented to fill
>      the hole left by the removal of the event VQ; these are non-standard as
>      well, so they do not have virtqueue indexes assigned in the virtio spec,
>      but the proposed spec extension did actually use vq indexes 5 and 6.
>
>      BUG=b:214864326
>
>
> 2) cross-vm is aware of the upstream Linux driver
>
> They thought your fix would go upstream; it didn't.
>
> commit a2fa119e759d0238a42ff15a9aff0dfd122afebd
> Author: Daniel Verkamp <dverkamp@chromium.org>
> Date:   Wed Jul 10 16:16:28 2024 -0700
>
>      devices: virtio-balloon: warn about queue index mismatches
>
>      The Linux kernel virtio-balloon driver spec non-compliance related to
>      queue numbering is being fixed; add some diagnostics to our device that
>      help to check if everything is working as expected.
>
>      <https://lore.kernel.org/virtualization/CACGkMEsg0+vpav1Fo8JF1isq4Ef8t4_CFN1scyztDO8bXzRLBQ@mail.gmail.com/T/>
>
>      Additionally, replace the num_expected_queues() function with per-queue
>      checking to avoid the need for the duplicate feature checks and queue
>      count calculation; each pop_queue() call will be checked using the `?`
>      operator and return a more useful error message if a particular queue is
>      missing.
>
>      BUG=None
>      TEST=crosvm run --balloon-page-reporting ...
>
>
> IIRC, in that commit they switched to the "spec" behavior.
>
> That's when they started hard-coding the queue indexes.
>
> CCing Daniel. All Linux versions should be incompatible with cross-vmm regarding free page reporting.
> How is that handled?

In practice, it only works because nobody calls crosvm with
--balloon-page-reporting (it's off by default), so the balloon device
does not advertise the VIRTIO_BALLOON_F_PAGE_REPORTING feature.

(I just went searching now, and it does seem like there is actually
one user in Android that does try to enable page reporting[1], which
I'll have to look into...)

In my opinion, it makes the most sense to keep the spec as it is and
change QEMU and the kernel to match, but obviously that's not trivial
to do in a way that doesn't break existing devices and drivers.

[1]: https://cs.android.com/android/platform/superproject/main/+/main:packages/modules/Virtualization/android/virtmgr/src/crosvm.rs;l=1079;drc=caff2827914c0918260ff76cd55f6e3dc6323d51
David Hildenbrand April 7, 2025, 6:47 p.m. UTC | #48
>>> Heh, but that one said:
>>>
>>> +\item[ VIRTIO_BALLOON_F_WS_REPORTING(6) ] The device has support for
>>> Working Set
>>>
>>> Which does not seem to reflect reality ...
> 
> Please feel free to disregard these features and reuse their bits and
> queue indexes; as far as I know, they are not actually enabled
> anywhere currently and the corresponding guest patches were only
> applied to some (no-longer-used) ChromeOS kernel trees, so the
> compatibility impact should be minimal. I will also try to clean up
> the leftover bits on the crosvm side just to clear things up.

Thanks for your reply, and thanks for clarifying+cleaning it up.

> 
>> I dug a bit more into cross-vm, because that one seems to be the only
>> one out there that does not behave like everybody else I found (maybe good,
>> maybe bad :) ).
>>
>>
>> 1) There was temporarily even another feature (VIRTIO_BALLOON_F_EVENTS_VQ)
>> and another queue.
>>
>> It got removed from cross-vm in:
>>
>> commit 9ba634b82b55ba762dc8724676b2cf9419460145
>> Author: Daniel Verkamp <dverkamp@chromium.org>
>> Date:   Thu Jul 11 11:29:52 2024 -0700
>>
>>       devices: virtio-balloon: remove event queue support
>>
>>       VIRTIO_BALLOON_F_EVENTS_VQ was part of a proposed virtio spec change.
>>
>>       It is not currently supported by upstream Linux, so removing this should
>>       have no effect except for guest kernels that had CHROMIUM patches
>>       applied.
>>
>>       The virtqueue indexes for the ws-related queues are decremented to fill
>>       the hole left by the removal of the event VQ; these are non-standard as
>>       well, so they do not have virtqueue indexes assigned in the virtio spec,
>>       but the proposed spec extension did actually use vq indexes 5 and 6.
>>
>>       BUG=b:214864326
>>
>>
>> 2) cross-vm is aware of the upstream Linux driver
>>
>> They thought your fix would go upstream; it didn't.
>>
>> commit a2fa119e759d0238a42ff15a9aff0dfd122afebd
>> Author: Daniel Verkamp <dverkamp@chromium.org>
>> Date:   Wed Jul 10 16:16:28 2024 -0700
>>
>>       devices: virtio-balloon: warn about queue index mismatches
>>
>>       The Linux kernel virtio-balloon driver spec non-compliance related to
>>       queue numbering is being fixed; add some diagnostics to our device that
>>       help to check if everything is working as expected.
>>
>>       <https://lore.kernel.org/virtualization/CACGkMEsg0+vpav1Fo8JF1isq4Ef8t4_CFN1scyztDO8bXzRLBQ@mail.gmail.com/T/>
>>
>>       Additionally, replace the num_expected_queues() function with per-queue
>>       checking to avoid the need for the duplicate feature checks and queue
>>       count calculation; each pop_queue() call will be checked using the `?`
>>       operator and return a more useful error message if a particular queue is
>>       missing.
>>
>>       BUG=None
>>       TEST=crosvm run --balloon-page-reporting ...
>>
>>
>> IIRC, in that commit they switched to the "spec" behavior.
>>
>> That's when they started hard-coding the queue indexes.
>>
>> CCing Daniel. All Linux versions should be incompatible with cross-vmm regarding free page reporting.
>> How is that handled?
> 
> In practice, it only works because nobody calls crosvm with
> --balloon-page-reporting (it's off by default), so the balloon device
> does not advertise the VIRTIO_BALLOON_F_PAGE_REPORTING feature.
> 
> (I just went searching now, and it does seem like there is actually
> one user in Android that does try to enable page reporting[1], which
> I'll have to look into...)
> 
> In my opinion, it makes the most sense to keep the spec as it is and
> change QEMU and the kernel to match, but obviously that's not trivial
> to do in a way that doesn't break existing devices and drivers.

If only it would be limited to QEMU and Linux ... :)

Out of curiosity, assuming we'd make the spec match the current 
QEMU/Linux implementation at least for the 3 involved features only, 
would there be a way to adjust crossvm without any disruption?

I still have the feeling that it will be rather hard to get that all 
implementations match the spec ... For new features+queues it will be 
easy to force the usage of fixed virtqueue numbers, but for 
free-page-hinting and reporting, it's a mess :(
Daniel Verkamp April 7, 2025, 9:09 p.m. UTC | #49
On Mon, Apr 7, 2025 at 11:47 AM David Hildenbrand <david@redhat.com> wrote:
>
> >>> Heh, but that one said:
> >>>
> >>> +\item[ VIRTIO_BALLOON_F_WS_REPORTING(6) ] The device has support for
> >>> Working Set
> >>>
> >>> Which does not seem to reflect reality ...
> >
> > Please feel free to disregard these features and reuse their bits and
> > queue indexes; as far as I know, they are not actually enabled
> > anywhere currently and the corresponding guest patches were only
> > applied to some (no-longer-used) ChromeOS kernel trees, so the
> > compatibility impact should be minimal. I will also try to clean up
> > the leftover bits on the crosvm side just to clear things up.
>
> Thanks for your reply, and thanks for clarifying+cleaning it up.
>
[...]
> >> IIRC, in that commit they switched to the "spec" behavior.
> >>
> >> That's when they started hard-coding the queue indexes.
> >>
> >> CCing Daniel. All Linux versions should be incompatible with cross-vmm regarding free page reporting.
> >> How is that handled?
> >
> > In practice, it only works because nobody calls crosvm with
> > --balloon-page-reporting (it's off by default), so the balloon device
> > does not advertise the VIRTIO_BALLOON_F_PAGE_REPORTING feature.
> >
> > (I just went searching now, and it does seem like there is actually
> > one user in Android that does try to enable page reporting[1], which
> > I'll have to look into...)
> >
> > In my opinion, it makes the most sense to keep the spec as it is and
> > change QEMU and the kernel to match, but obviously that's not trivial
> > to do in a way that doesn't break existing devices and drivers.
>
> If only it would be limited to QEMU and Linux ... :)
>
> Out of curiosity, assuming we'd make the spec match the current
> QEMU/Linux implementation at least for the 3 involved features only,
> would there be a way to adjust crossvm without any disruption?
>
> I still have the feeling that it will be rather hard to get that all
> implementations match the spec ... For new features+queues it will be
> easy to force the usage of fixed virtqueue numbers, but for
> free-page-hinting and reporting, it's a mess :(

If the spec is changed, we can certainly update crosvm to match it; I
think this only really affects a few devices (balloon and technically
filesystem, but see below), only affects features that are generally
not turned on, and in many cases, the guest kernel is updated
simultaneously with the crosvm binary. I'm not opposed to changing the
spec to match reality, although that feels like a bad move from a
spec-integrity perspective.

Regardless of the chosen path, I think the spec should be clarified -
the meaning of "queue only exists if <feature> is set" leaves the
reader with too many questions:
- What does "if <feature> is set" mean? If it's advertised by the
device? If it's acked by the driver? (To me, "set" definitely hints at
the latter, but it should be explicit.)
- What does it mean for a virtqueue to "exist"? Does that queue index
disappear from the numbering if it does not exist, sliding all of the
later queues down? If so, the spec should really not have the static
queue numbers listed for the later queues, since they are only correct
if all previous feature-dependent queues were also "set", whatever
that means.

The way crosvm interpreted this was:
- "if <feature> is set" means "if the device advertised <feature>
*and* driver acknowledged <feature>"
- "queue only exists" means "if <feature> was not acked, the
corresponding virtqueue cannot be enabled by the driver" (attempting
to set queue_enable = 1 has no effect).
- Any later virtqueues are unaffected and still have the same queue indexes.

The way QEMU interpeted this (I think, just skimming the code and
working from memory here):
- "if <feature> is set" means "if the device advertised <feature>" (it
is checking host_features, not guest_features)
- "queue only exists" means "if <feature> was not offered by the
device, all later virtqueues are shifted down one index"

---

The spec for the filesystem device has a similar issue to the balloon device:
- Queue 0 (hiprio) is always available regardless of features.
- Queue 1 (notification queue) has a note that "The notification queue
only exists if VIRTIO_FS_F_NOTIFICATION is set."
- Queue 2..n are supposed to be the request queues per the numbering
in the spec.

This is how it has been specified since virtio 1.2 when the fs device
was originally added. However, the Linux driver (and probably all
existing device implementations - at least virtiofsd and crosvm's fs
device) don't support VIRTIO_FS_F_NOTIFICATION and use queue 1 as a
request queue, which matches the QEMU/Linux interpretation but means
the spec doesn't match reality again.

Thanks,
-- Daniel
Michael S. Tsirkin April 7, 2025, 9:20 p.m. UTC | #50
On Mon, Apr 07, 2025 at 08:47:05PM +0200, David Hildenbrand wrote:
> > In my opinion, it makes the most sense to keep the spec as it is and
> > change QEMU and the kernel to match, but obviously that's not trivial
> > to do in a way that doesn't break existing devices and drivers.
> 
> If only it would be limited to QEMU and Linux ... :)
> 
> Out of curiosity, assuming we'd make the spec match the current QEMU/Linux
> implementation at least for the 3 involved features only, would there be a
> way to adjust crossvm without any disruption?
> 
> I still have the feeling that it will be rather hard to get that all
> implementations match the spec ... For new features+queues it will be easy
> to force the usage of fixed virtqueue numbers, but for free-page-hinting and
> reporting, it's a mess :(


Still thinking about a way to fix drivers... We can discuss this
theoretically, maybe?
David Hildenbrand April 9, 2025, 10:46 a.m. UTC | #51
On 07.04.25 23:20, Michael S. Tsirkin wrote:
> On Mon, Apr 07, 2025 at 08:47:05PM +0200, David Hildenbrand wrote:
>>> In my opinion, it makes the most sense to keep the spec as it is and
>>> change QEMU and the kernel to match, but obviously that's not trivial
>>> to do in a way that doesn't break existing devices and drivers.
>>
>> If only it would be limited to QEMU and Linux ... :)
>>
>> Out of curiosity, assuming we'd make the spec match the current QEMU/Linux
>> implementation at least for the 3 involved features only, would there be a
>> way to adjust crossvm without any disruption?
>>
>> I still have the feeling that it will be rather hard to get that all
>> implementations match the spec ... For new features+queues it will be easy
>> to force the usage of fixed virtqueue numbers, but for free-page-hinting and
>> reporting, it's a mess :(
> 
> 
> Still thinking about a way to fix drivers... We can discuss this
> theoretically, maybe?

Yes, absolutely. I took the time to do some more digging; regarding 
drivers only Linux seems to be problematic.

virtio-win, FreeBSD, NetBSD and OpenBSD and don't seem to support 
problematic features (free page hinting, free page reporting) in their 
virtio-balloon implementations.

So from the known drivers, only Linux is applicable.

reporting_vq is either at idx 4/3/2
free_page_vq is either at idx 3/2
statsq is at idx2 (only relevant if the feature is offered)

So if we could test for the existence of a virtqueue at an idx easily, 
we could test from highest-to-smallest idx.

But I recall that testing for the existance of a virtqueue on s390x 
resulted in the problem/deadlock in the first place ...
Michael S. Tsirkin April 9, 2025, 10:56 a.m. UTC | #52
On Wed, Apr 09, 2025 at 12:46:41PM +0200, David Hildenbrand wrote:
> On 07.04.25 23:20, Michael S. Tsirkin wrote:
> > On Mon, Apr 07, 2025 at 08:47:05PM +0200, David Hildenbrand wrote:
> > > > In my opinion, it makes the most sense to keep the spec as it is and
> > > > change QEMU and the kernel to match, but obviously that's not trivial
> > > > to do in a way that doesn't break existing devices and drivers.
> > > 
> > > If only it would be limited to QEMU and Linux ... :)
> > > 
> > > Out of curiosity, assuming we'd make the spec match the current QEMU/Linux
> > > implementation at least for the 3 involved features only, would there be a
> > > way to adjust crossvm without any disruption?
> > > 
> > > I still have the feeling that it will be rather hard to get that all
> > > implementations match the spec ... For new features+queues it will be easy
> > > to force the usage of fixed virtqueue numbers, but for free-page-hinting and
> > > reporting, it's a mess :(
> > 
> > 
> > Still thinking about a way to fix drivers... We can discuss this
> > theoretically, maybe?
> 
> Yes, absolutely. I took the time to do some more digging; regarding drivers
> only Linux seems to be problematic.
> 
> virtio-win, FreeBSD, NetBSD and OpenBSD and don't seem to support
> problematic features (free page hinting, free page reporting) in their
> virtio-balloon implementations.
> 
> So from the known drivers, only Linux is applicable.
> 
> reporting_vq is either at idx 4/3/2
> free_page_vq is either at idx 3/2
> statsq is at idx2 (only relevant if the feature is offered)
> 
> So if we could test for the existence of a virtqueue at an idx easily, we
> could test from highest-to-smallest idx.
> 
> But I recall that testing for the existance of a virtqueue on s390x resulted
> in the problem/deadlock in the first place ...
> 
> -- 
> Cheers,
> 
> David / dhildenb

So let's talk about a new feature bit?

Since vqs are probed after feature negotiation, it looks like
we could have a feature bit trigger sane behaviour, right?

I kind of dislike it that we have a feature bit for bugs though.
What would be a minimal new feature to add so it does not
feel wrong?

Maybe it's in the field of psychology though ...
David Hildenbrand April 9, 2025, 11:02 a.m. UTC | #53
On 07.04.25 23:09, Daniel Verkamp wrote:
> On Mon, Apr 7, 2025 at 11:47 AM David Hildenbrand <david@redhat.com> wrote:
>>
>>>>> Heh, but that one said:
>>>>>
>>>>> +\item[ VIRTIO_BALLOON_F_WS_REPORTING(6) ] The device has support for
>>>>> Working Set
>>>>>
>>>>> Which does not seem to reflect reality ...
>>>
>>> Please feel free to disregard these features and reuse their bits and
>>> queue indexes; as far as I know, they are not actually enabled
>>> anywhere currently and the corresponding guest patches were only
>>> applied to some (no-longer-used) ChromeOS kernel trees, so the
>>> compatibility impact should be minimal. I will also try to clean up
>>> the leftover bits on the crosvm side just to clear things up.
>>
>> Thanks for your reply, and thanks for clarifying+cleaning it up.
>>
> [...]
>>>> IIRC, in that commit they switched to the "spec" behavior.
>>>>
>>>> That's when they started hard-coding the queue indexes.
>>>>
>>>> CCing Daniel. All Linux versions should be incompatible with cross-vmm regarding free page reporting.
>>>> How is that handled?
>>>
>>> In practice, it only works because nobody calls crosvm with
>>> --balloon-page-reporting (it's off by default), so the balloon device
>>> does not advertise the VIRTIO_BALLOON_F_PAGE_REPORTING feature.
>>>
>>> (I just went searching now, and it does seem like there is actually
>>> one user in Android that does try to enable page reporting[1], which
>>> I'll have to look into...)
>>>
>>> In my opinion, it makes the most sense to keep the spec as it is and
>>> change QEMU and the kernel to match, but obviously that's not trivial
>>> to do in a way that doesn't break existing devices and drivers.
>>
>> If only it would be limited to QEMU and Linux ... :)
>>
>> Out of curiosity, assuming we'd make the spec match the current
>> QEMU/Linux implementation at least for the 3 involved features only,
>> would there be a way to adjust crossvm without any disruption?
>>
>> I still have the feeling that it will be rather hard to get that all
>> implementations match the spec ... For new features+queues it will be
>> easy to force the usage of fixed virtqueue numbers, but for
>> free-page-hinting and reporting, it's a mess :(
> 
> If the spec is changed, we can certainly update crosvm to match it; I
> think this only really affects a few devices (balloon and technically
> filesystem, but see below), only affects features that are generally
> not turned on, and in many cases, the guest kernel is updated
> simultaneously with the crosvm binary. I'm not opposed to changing the
> spec to match reality, although that feels like a bad move from a
> spec-integrity perspective.

Right. We didn't pay attention that the spec would reflect reality, and 
the reality was a bad decision :)

> 
> Regardless of the chosen path, I think the spec should be clarified -
> the meaning of "queue only exists if <feature> is set" leaves the
> reader with too many questions:

Right, that's what we've been discussing.

> - What does "if <feature> is set" mean? If it's advertised by the
> device? If it's acked by the driver? (To me, "set" definitely hints at
> the latter, but it should be explicit.)

Currently it's "feature is offered by the device".

> - What does it mean for a virtqueue to "exist"? Does that queue index
> disappear from the numbering if it does not exist, sliding all of the
> later queues down?

Currently it's like that, yes.

> If so, the spec should really not have the static
> queue numbers listed for the later queues, since they are only correct
> if all previous feature-dependent queues were also "set", whatever
> that means.

Yes, that's also what we've been discussing. And that started restarted 
the whole "can we fix device/drivers instead" :)

> 
> The way crosvm interpreted this was:
> - "if <feature> is set" means "if the device advertised <feature>
> *and* driver acknowledged <feature>"
> - "queue only exists" means "if <feature> was not acked, the
> corresponding virtqueue cannot be enabled by the driver" (attempting
> to set queue_enable = 1 has no effect).
> - Any later virtqueues are unaffected and still have the same queue indexes.

Yes, that matches my understanding.

> 
> The way QEMU interpeted this (I think, just skimming the code and
> working from memory here):
> - "if <feature> is set" means "if the device advertised <feature>" (it
> is checking host_features, not guest_features)

Right, "offered features".

> - "queue only exists" means "if <feature> was not offered by the
> device, all later virtqueues are shifted down one index"

Exactly.

> 
> ---
> 
> The spec for the filesystem device has a similar issue to the balloon device:
> - Queue 0 (hiprio) is always available regardless of features.
> - Queue 1 (notification queue) has a note that "The notification queue
> only exists if VIRTIO_FS_F_NOTIFICATION is set."
> - Queue 2..n are supposed to be the request queues per the numbering
> in the spec.
> 
> This is how it has been specified since virtio 1.2 when the fs device
> was originally added. However, the Linux driver (and probably all
> existing device implementations - at least virtiofsd and crosvm's fs
> device) don't support VIRTIO_FS_F_NOTIFICATION and use queue 1 as a
> request queue, which matches the QEMU/Linux interpretation but means
> the spec doesn't match reality again.

Yes, these are the two known cases we have to sort out.

Thanks for the information on crossvm, and that whatever we decide to 
do, adjusting crossvm should not be a big problem (at least for now :) ).
David Hildenbrand April 9, 2025, 11:12 a.m. UTC | #54
On 09.04.25 12:56, Michael S. Tsirkin wrote:
> On Wed, Apr 09, 2025 at 12:46:41PM +0200, David Hildenbrand wrote:
>> On 07.04.25 23:20, Michael S. Tsirkin wrote:
>>> On Mon, Apr 07, 2025 at 08:47:05PM +0200, David Hildenbrand wrote:
>>>>> In my opinion, it makes the most sense to keep the spec as it is and
>>>>> change QEMU and the kernel to match, but obviously that's not trivial
>>>>> to do in a way that doesn't break existing devices and drivers.
>>>>
>>>> If only it would be limited to QEMU and Linux ... :)
>>>>
>>>> Out of curiosity, assuming we'd make the spec match the current QEMU/Linux
>>>> implementation at least for the 3 involved features only, would there be a
>>>> way to adjust crossvm without any disruption?
>>>>
>>>> I still have the feeling that it will be rather hard to get that all
>>>> implementations match the spec ... For new features+queues it will be easy
>>>> to force the usage of fixed virtqueue numbers, but for free-page-hinting and
>>>> reporting, it's a mess :(
>>>
>>>
>>> Still thinking about a way to fix drivers... We can discuss this
>>> theoretically, maybe?
>>
>> Yes, absolutely. I took the time to do some more digging; regarding drivers
>> only Linux seems to be problematic.
>>
>> virtio-win, FreeBSD, NetBSD and OpenBSD and don't seem to support
>> problematic features (free page hinting, free page reporting) in their
>> virtio-balloon implementations.
>>
>> So from the known drivers, only Linux is applicable.
>>
>> reporting_vq is either at idx 4/3/2
>> free_page_vq is either at idx 3/2
>> statsq is at idx2 (only relevant if the feature is offered)
>>
>> So if we could test for the existence of a virtqueue at an idx easily, we
>> could test from highest-to-smallest idx.
>>
>> But I recall that testing for the existance of a virtqueue on s390x resulted
>> in the problem/deadlock in the first place ...
>>
>> -- 
>> Cheers,
>>
>> David / dhildenb
> 
> So let's talk about a new feature bit?

Are you thinking about a new feature that switches between "fixed queue 
indices" and "compressed queue indices", whereby the latter would be the 
legacy default and we would expect all devices to switch to the new 
fixed-queue-indices layout?

We could make all new features require "fixed-queue-indices".

> 
> Since vqs are probed after feature negotiation, it looks like
> we could have a feature bit trigger sane behaviour, right?

In the Linux driver, yes. In QEMU (devices), we add the queues when 
realizing, so we'd need some mechanism to adjust the queue indices based 
on feature negotiation I guess?

For virtio-balloon it might be doable to simply always create+indicate 
free-page hinting to resolve the issue easily.

For virtio-fs it might not be that easy.

> 
> I kind of dislike it that we have a feature bit for bugs though.
> What would be a minimal new feature to add so it does not
> feel wrong?

Probably as above: fixed vs. compressed virtqueue indices?
Michael S. Tsirkin April 9, 2025, 12:07 p.m. UTC | #55
On Wed, Apr 09, 2025 at 01:12:19PM +0200, David Hildenbrand wrote:
> On 09.04.25 12:56, Michael S. Tsirkin wrote:
> > On Wed, Apr 09, 2025 at 12:46:41PM +0200, David Hildenbrand wrote:
> > > On 07.04.25 23:20, Michael S. Tsirkin wrote:
> > > > On Mon, Apr 07, 2025 at 08:47:05PM +0200, David Hildenbrand wrote:
> > > > > > In my opinion, it makes the most sense to keep the spec as it is and
> > > > > > change QEMU and the kernel to match, but obviously that's not trivial
> > > > > > to do in a way that doesn't break existing devices and drivers.
> > > > > 
> > > > > If only it would be limited to QEMU and Linux ... :)
> > > > > 
> > > > > Out of curiosity, assuming we'd make the spec match the current QEMU/Linux
> > > > > implementation at least for the 3 involved features only, would there be a
> > > > > way to adjust crossvm without any disruption?
> > > > > 
> > > > > I still have the feeling that it will be rather hard to get that all
> > > > > implementations match the spec ... For new features+queues it will be easy
> > > > > to force the usage of fixed virtqueue numbers, but for free-page-hinting and
> > > > > reporting, it's a mess :(
> > > > 
> > > > 
> > > > Still thinking about a way to fix drivers... We can discuss this
> > > > theoretically, maybe?
> > > 
> > > Yes, absolutely. I took the time to do some more digging; regarding drivers
> > > only Linux seems to be problematic.
> > > 
> > > virtio-win, FreeBSD, NetBSD and OpenBSD and don't seem to support
> > > problematic features (free page hinting, free page reporting) in their
> > > virtio-balloon implementations.
> > > 
> > > So from the known drivers, only Linux is applicable.
> > > 
> > > reporting_vq is either at idx 4/3/2
> > > free_page_vq is either at idx 3/2
> > > statsq is at idx2 (only relevant if the feature is offered)
> > > 
> > > So if we could test for the existence of a virtqueue at an idx easily, we
> > > could test from highest-to-smallest idx.
> > > 
> > > But I recall that testing for the existance of a virtqueue on s390x resulted
> > > in the problem/deadlock in the first place ...
> > > 
> > > -- 
> > > Cheers,
> > > 
> > > David / dhildenb
> > 
> > So let's talk about a new feature bit?
> 
> Are you thinking about a new feature that switches between "fixed queue
> indices" and "compressed queue indices", whereby the latter would be the
> legacy default and we would expect all devices to switch to the new
> fixed-queue-indices layout?
> 
> We could make all new features require "fixed-queue-indices".

I see two ways:
1. we make driver behave correctly with in spec and out of spec devices
   and we make qemu behave correctly with in spec and out of spec devices
2. a new feature bit

I prefer 1, and when we add a new feature we can also
document that it should be in spec if negotiated.

My question is if 1 is practical.





> > 
> > Since vqs are probed after feature negotiation, it looks like
> > we could have a feature bit trigger sane behaviour, right?
> 
> In the Linux driver, yes. In QEMU (devices), we add the queues when
> realizing, so we'd need some mechanism to adjust the queue indices based on
> feature negotiation I guess?

Well we can add queues later, nothing prevents that.


> For virtio-balloon it might be doable to simply always create+indicate
> free-page hinting to resolve the issue easily.


OK, so
- for devices, we suggest that basically VIRTIO_BALLOON_F_REPORTING
  only created with VIRTIO_BALLOON_F_FREE_PAGE_HINT and 
  VIRTIO_BALLOON_F_FREE_PAGE_HINT only created with VIRTIO_BALLOON_F_STATS_VQ

I got that.


Now, for drivers.

If the dependency is satisfied as above, no difference.

What should drivers do if not?



I think the thing to do would be to first probe spec compliant
vq numbers? If not there, try with the non compliant version?


However,  you wrote:
> > > But I recall that testing for the existance of a virtqueue on s390x resulted
> > > in the problem/deadlock in the first place ...

I think the deadlock was if trying to *use* a non-existent virtqueue?

This is qemu code:

    case CCW_CMD_READ_VQ_CONF:
        if (check_len) {
            if (ccw.count != sizeof(vq_config)) {
                ret = -EINVAL;
                break;
            }
        } else if (ccw.count < sizeof(vq_config)) {
            /* Can't execute command. */
            ret = -EINVAL;
            break;
        }
        if (!ccw.cda) {
            ret = -EFAULT;
        } else {
            ret = ccw_dstream_read(&sch->cds, vq_config.index);
            if (ret) {
                break;
            }
            vq_config.index = be16_to_cpu(vq_config.index);
            if (vq_config.index >= VIRTIO_QUEUE_MAX) {
                ret = -EINVAL;
                break;
            }
            vq_config.num_max = virtio_queue_get_num(vdev,
                                                     vq_config.index);
            vq_config.num_max = cpu_to_be16(vq_config.num_max);
            ret = ccw_dstream_write(&sch->cds, vq_config.num_max);
            if (!ret) {
                sch->curr_status.scsw.count = ccw.count - sizeof(vq_config);
            }
        }

and

            
int virtio_queue_get_num(VirtIODevice *vdev, int n)
{               
    return vdev->vq[n].vring.num;
}           
            


it seems to happily return vq size with no issues?




> For virtio-fs it might not be that easy.

virtio fs? But it has no features?

> > 
> > I kind of dislike it that we have a feature bit for bugs though.
> > What would be a minimal new feature to add so it does not
> > feel wrong?
> 
> Probably as above: fixed vs. compressed virtqueue indices?
> 
> -- 
> Cheers,
> 
> David / dhildenb
David Hildenbrand April 9, 2025, 12:24 p.m. UTC | #56
On 09.04.25 14:07, Michael S. Tsirkin wrote:
> On Wed, Apr 09, 2025 at 01:12:19PM +0200, David Hildenbrand wrote:
>> On 09.04.25 12:56, Michael S. Tsirkin wrote:
>>> On Wed, Apr 09, 2025 at 12:46:41PM +0200, David Hildenbrand wrote:
>>>> On 07.04.25 23:20, Michael S. Tsirkin wrote:
>>>>> On Mon, Apr 07, 2025 at 08:47:05PM +0200, David Hildenbrand wrote:
>>>>>>> In my opinion, it makes the most sense to keep the spec as it is and
>>>>>>> change QEMU and the kernel to match, but obviously that's not trivial
>>>>>>> to do in a way that doesn't break existing devices and drivers.
>>>>>>
>>>>>> If only it would be limited to QEMU and Linux ... :)
>>>>>>
>>>>>> Out of curiosity, assuming we'd make the spec match the current QEMU/Linux
>>>>>> implementation at least for the 3 involved features only, would there be a
>>>>>> way to adjust crossvm without any disruption?
>>>>>>
>>>>>> I still have the feeling that it will be rather hard to get that all
>>>>>> implementations match the spec ... For new features+queues it will be easy
>>>>>> to force the usage of fixed virtqueue numbers, but for free-page-hinting and
>>>>>> reporting, it's a mess :(
>>>>>
>>>>>
>>>>> Still thinking about a way to fix drivers... We can discuss this
>>>>> theoretically, maybe?
>>>>
>>>> Yes, absolutely. I took the time to do some more digging; regarding drivers
>>>> only Linux seems to be problematic.
>>>>
>>>> virtio-win, FreeBSD, NetBSD and OpenBSD and don't seem to support
>>>> problematic features (free page hinting, free page reporting) in their
>>>> virtio-balloon implementations.
>>>>
>>>> So from the known drivers, only Linux is applicable.
>>>>
>>>> reporting_vq is either at idx 4/3/2
>>>> free_page_vq is either at idx 3/2
>>>> statsq is at idx2 (only relevant if the feature is offered)
>>>>
>>>> So if we could test for the existence of a virtqueue at an idx easily, we
>>>> could test from highest-to-smallest idx.
>>>>
>>>> But I recall that testing for the existance of a virtqueue on s390x resulted
>>>> in the problem/deadlock in the first place ...
>>>>
>>>> -- 
>>>> Cheers,
>>>>
>>>> David / dhildenb
>>>
>>> So let's talk about a new feature bit?
>>
>> Are you thinking about a new feature that switches between "fixed queue
>> indices" and "compressed queue indices", whereby the latter would be the
>> legacy default and we would expect all devices to switch to the new
>> fixed-queue-indices layout?
>>
>> We could make all new features require "fixed-queue-indices".
> 
> I see two ways:
> 1. we make driver behave correctly with in spec and out of spec devices
>     and we make qemu behave correctly with in spec and out of spec devices
> 2. a new feature bit
> 
> I prefer 1, and when we add a new feature we can also
> document that it should be in spec if negotiated.
> 
> My question is if 1 is practical.

AFAIKT, 1) implies:

virtio-balloon:

a) Driver

As mentioned above, we'd need a reliable way to test for the existence 
of a virtqueue, so we can e.g., test for reporting_vq idx 4 -> 3 -> 2

With that we'd be able to support compressed+fixed at the same time.

Q: Is it possible/feasible?

b) Device: virtio-balloon: we can fake existence of STAT and 
FREE_PAGE_HINTING easily, such that the compressed layout corresponds to 
the fixed layout easily.

Q: alternatives? We could try creating multiple queues for the same 
feature, but it's going to be a mess I'm afraid ...


virtio-fs:

a) Driver

Linux does not even implement VIRTIO_FS_F_NOTIFICATION or respect 
VIRTIO_FS_F_NOTIFICATION when calculating queue indices, ...

b) Device

Same applies to virtiofsd ...

Q: Did anybody actually implement VIRTIO_FS_F_NOTIFICATION ever? If not, 
can we just remove it from the spec completely and resolve the issue 
that way?
Michael S. Tsirkin April 9, 2025, 4:08 p.m. UTC | #57
On Wed, Apr 09, 2025 at 02:24:32PM +0200, David Hildenbrand wrote:
> On 09.04.25 14:07, Michael S. Tsirkin wrote:
> > On Wed, Apr 09, 2025 at 01:12:19PM +0200, David Hildenbrand wrote:
> > > On 09.04.25 12:56, Michael S. Tsirkin wrote:
> > > > On Wed, Apr 09, 2025 at 12:46:41PM +0200, David Hildenbrand wrote:
> > > > > On 07.04.25 23:20, Michael S. Tsirkin wrote:
> > > > > > On Mon, Apr 07, 2025 at 08:47:05PM +0200, David Hildenbrand wrote:
> > > > > > > > In my opinion, it makes the most sense to keep the spec as it is and
> > > > > > > > change QEMU and the kernel to match, but obviously that's not trivial
> > > > > > > > to do in a way that doesn't break existing devices and drivers.
> > > > > > > 
> > > > > > > If only it would be limited to QEMU and Linux ... :)
> > > > > > > 
> > > > > > > Out of curiosity, assuming we'd make the spec match the current QEMU/Linux
> > > > > > > implementation at least for the 3 involved features only, would there be a
> > > > > > > way to adjust crossvm without any disruption?
> > > > > > > 
> > > > > > > I still have the feeling that it will be rather hard to get that all
> > > > > > > implementations match the spec ... For new features+queues it will be easy
> > > > > > > to force the usage of fixed virtqueue numbers, but for free-page-hinting and
> > > > > > > reporting, it's a mess :(
> > > > > > 
> > > > > > 
> > > > > > Still thinking about a way to fix drivers... We can discuss this
> > > > > > theoretically, maybe?
> > > > > 
> > > > > Yes, absolutely. I took the time to do some more digging; regarding drivers
> > > > > only Linux seems to be problematic.
> > > > > 
> > > > > virtio-win, FreeBSD, NetBSD and OpenBSD and don't seem to support
> > > > > problematic features (free page hinting, free page reporting) in their
> > > > > virtio-balloon implementations.
> > > > > 
> > > > > So from the known drivers, only Linux is applicable.
> > > > > 
> > > > > reporting_vq is either at idx 4/3/2
> > > > > free_page_vq is either at idx 3/2
> > > > > statsq is at idx2 (only relevant if the feature is offered)
> > > > > 
> > > > > So if we could test for the existence of a virtqueue at an idx easily, we
> > > > > could test from highest-to-smallest idx.
> > > > > 
> > > > > But I recall that testing for the existance of a virtqueue on s390x resulted
> > > > > in the problem/deadlock in the first place ...
> > > > > 
> > > > > -- 
> > > > > Cheers,
> > > > > 
> > > > > David / dhildenb
> > > > 
> > > > So let's talk about a new feature bit?
> > > 
> > > Are you thinking about a new feature that switches between "fixed queue
> > > indices" and "compressed queue indices", whereby the latter would be the
> > > legacy default and we would expect all devices to switch to the new
> > > fixed-queue-indices layout?
> > > 
> > > We could make all new features require "fixed-queue-indices".
> > 
> > I see two ways:
> > 1. we make driver behave correctly with in spec and out of spec devices
> >     and we make qemu behave correctly with in spec and out of spec devices
> > 2. a new feature bit
> > 
> > I prefer 1, and when we add a new feature we can also
> > document that it should be in spec if negotiated.
> > 
> > My question is if 1 is practical.
> 
> AFAIKT, 1) implies:
> 
> virtio-balloon:
> 
> a) Driver
> 
> As mentioned above, we'd need a reliable way to test for the existence of a
> virtqueue, so we can e.g., test for reporting_vq idx 4 -> 3 -> 2
> 
> With that we'd be able to support compressed+fixed at the same time.
> 
> Q: Is it possible/feasible?
> 
> b) Device: virtio-balloon: we can fake existence of STAT and
> FREE_PAGE_HINTING easily, such that the compressed layout corresponds to the
> fixed layout easily.
> 
> Q: alternatives? We could try creating multiple queues for the same feature,
> but it's going to be a mess I'm afraid ...
> 
> 
> virtio-fs:
> 
> a) Driver
> 
> Linux does not even implement VIRTIO_FS_F_NOTIFICATION or respect
> VIRTIO_FS_F_NOTIFICATION when calculating queue indices, ...
> 
> b) Device
> 
> Same applies to virtiofsd ...
> 
> Q: Did anybody actually implement VIRTIO_FS_F_NOTIFICATION ever? If not, can
> we just remove it from the spec completely and resolve the issue that way?

Donnu. Vivek?

Or we can check for queue number 1+num_request_queues maybe?
If that exists then it is spec compliant?


> -- 
> Cheers,
> 
> David / dhildenb
diff mbox series

Patch

diff --git a/drivers/s390/virtio/virtio_ccw.c b/drivers/s390/virtio/virtio_ccw.c
index 21fa7ac849e5c..4904b831c0a75 100644
--- a/drivers/s390/virtio/virtio_ccw.c
+++ b/drivers/s390/virtio/virtio_ccw.c
@@ -302,11 +302,17 @@  static struct airq_info *new_airq_info(int index)
 static unsigned long *get_airq_indicator(struct virtqueue *vqs[], int nvqs,
 					 u64 *first, void **airq_info)
 {
-	int i, j;
+	int i, j, queue_idx, highest_queue_idx = -1;
 	struct airq_info *info;
 	unsigned long *indicator_addr = NULL;
 	unsigned long bit, flags;
 
+	/* Array entries without an actual queue pointer must be ignored. */
+	for (i = 0; i < nvqs; i++) {
+		if (vqs[i])
+			highest_queue_idx++;
+	}
+
 	for (i = 0; i < MAX_AIRQ_AREAS && !indicator_addr; i++) {
 		mutex_lock(&airq_areas_lock);
 		if (!airq_areas[i])
@@ -316,7 +322,7 @@  static unsigned long *get_airq_indicator(struct virtqueue *vqs[], int nvqs,
 		if (!info)
 			return NULL;
 		write_lock_irqsave(&info->lock, flags);
-		bit = airq_iv_alloc(info->aiv, nvqs);
+		bit = airq_iv_alloc(info->aiv, highest_queue_idx + 1);
 		if (bit == -1UL) {
 			/* Not enough vacancies. */
 			write_unlock_irqrestore(&info->lock, flags);
@@ -325,8 +331,10 @@  static unsigned long *get_airq_indicator(struct virtqueue *vqs[], int nvqs,
 		*first = bit;
 		*airq_info = info;
 		indicator_addr = info->aiv->vector;
-		for (j = 0; j < nvqs; j++) {
-			airq_iv_set_ptr(info->aiv, bit + j,
+		for (j = 0, queue_idx = 0; j < nvqs; j++) {
+			if (!vqs[j])
+				continue;
+			airq_iv_set_ptr(info->aiv, bit + queue_idx++,
 					(unsigned long)vqs[j]);
 		}
 		write_unlock_irqrestore(&info->lock, flags);