Message ID | 20250402203621.940090-1-david@redhat.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v1] s390/virtio_ccw: don't allocate/assign airqs for non-existing queues | expand |
On 02/04/2025 22.36, David Hildenbrand wrote: > If we finds a vq without a name in our input array in > virtio_ccw_find_vqs(), we treat it as "non-existing" and set the vq pointer > to NULL; we will not call virtio_ccw_setup_vq() to allocate/setup a vq. > > Consequently, we create only a queue if it actually exists (name != NULL) > and assign an incremental queue index to each such existing queue. > > However, in virtio_ccw_register_adapter_ind()->get_airq_indicator() we > will not ignore these "non-existing queues", but instead assign an airq > indicator to them. > > Besides never releasing them in virtio_ccw_drop_indicators() (because > there is no virtqueue), the bigger issue seems to be that there will be a > disagreement between the device and the Linux guest about the airq > indicator to be used for notifying a queue, because the indicator bit > for adapter I/O interrupt is derived from the queue index. > > The virtio spec states under "Setting Up Two-Stage Queue Indicators": > > ... indicator contains the guest address of an area wherein the > indicators for the devices are contained, starting at bit_nr, one > bit per virtqueue of the device. > > And further in "Notification via Adapter I/O Interrupts": > > For notifying the driver of virtqueue buffers, the device sets the > bit in the guest-provided indicator area at the corresponding > offset. > > For example, QEMU uses in virtio_ccw_notify() the queue index (passed as > "vector") to select the relevant indicator bit. If a queue does not exist, > it does not have a corresponding indicator bit assigned, because it > effectively doesn't have a queue index. > > Using a virtio-balloon-ccw device under QEMU with free-page-hinting > disabled ("free-page-hint=off") but free-page-reporting enabled > ("free-page-reporting=on") will result in free page reporting > not working as expected: in the virtio_balloon driver, we'll be stuck > forever in virtballoon_free_page_report()->wait_event(), because the > waitqueue will not be woken up as the notification from the device is > lost: it would use the wrong indicator bit. > > Free page reporting stops working and we get splats (when configured to > detect hung wqs) like: > > INFO: task kworker/1:3:463 blocked for more than 61 seconds. > Not tainted 6.14.0 #4 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > task:kworker/1:3 [...] > Workqueue: events page_reporting_process > Call Trace: > [<000002f404e6dfb2>] __schedule+0x402/0x1640 > [<000002f404e6f22e>] schedule+0x3e/0xe0 > [<000002f3846a88fa>] virtballoon_free_page_report+0xaa/0x110 [virtio_balloon] > [<000002f40435c8a4>] page_reporting_process+0x2e4/0x740 > [<000002f403fd3ee2>] process_one_work+0x1c2/0x400 > [<000002f403fd4b96>] worker_thread+0x296/0x420 > [<000002f403fe10b4>] kthread+0x124/0x290 > [<000002f403f4e0dc>] __ret_from_fork+0x3c/0x60 > [<000002f404e77272>] ret_from_fork+0xa/0x38 > > There was recently a discussion [1] whether the "holes" should be > treated differently again, effectively assigning also non-existing > queues a queue index: that should also fix the issue, but requires other > workarounds to not break existing setups. > > Let's fix it without affecting existing setups for now by properly ignoring > the non-existing queues, so the indicator bits will match the queue > indexes. > > [1] https://lore.kernel.org/all/cover.1720611677.git.mst@redhat.com/ > > Fixes: a229989d975e ("virtio: don't allocate vqs when names[i] = NULL") > Reported-by: Chandra Merla <cmerla@redhat.com> > Cc: <Stable@vger.kernel.org> > Cc: Cornelia Huck <cohuck@redhat.com> > Cc: Thomas Huth <thuth@redhat.com> > Cc: Halil Pasic <pasic@linux.ibm.com> > Cc: Eric Farman <farman@linux.ibm.com> > Cc: Heiko Carstens <hca@linux.ibm.com> > Cc: Vasily Gorbik <gor@linux.ibm.com> > Cc: Alexander Gordeev <agordeev@linux.ibm.com> > Cc: Christian Borntraeger <borntraeger@linux.ibm.com> > Cc: Sven Schnelle <svens@linux.ibm.com> > Cc: "Michael S. Tsirkin" <mst@redhat.com> > Cc: Wei Wang <wei.w.wang@intel.com> > Signed-off-by: David Hildenbrand <david@redhat.com> > --- > drivers/s390/virtio/virtio_ccw.c | 16 ++++++++++++---- > 1 file changed, 12 insertions(+), 4 deletions(-) Thanks for tackling this, David! I tested the patch and the problems are gone now, indeed! Tested-by: Thomas Huth <thuth@redhat.com> Reviewed-by: Thomas Huth <thuth@redhat.com>
On Wed, Apr 02 2025, David Hildenbrand <david@redhat.com> wrote: > If we finds a vq without a name in our input array in > virtio_ccw_find_vqs(), we treat it as "non-existing" and set the vq pointer > to NULL; we will not call virtio_ccw_setup_vq() to allocate/setup a vq. > > Consequently, we create only a queue if it actually exists (name != NULL) > and assign an incremental queue index to each such existing queue. > > However, in virtio_ccw_register_adapter_ind()->get_airq_indicator() we > will not ignore these "non-existing queues", but instead assign an airq > indicator to them. > > Besides never releasing them in virtio_ccw_drop_indicators() (because > there is no virtqueue), the bigger issue seems to be that there will be a > disagreement between the device and the Linux guest about the airq > indicator to be used for notifying a queue, because the indicator bit > for adapter I/O interrupt is derived from the queue index. > > The virtio spec states under "Setting Up Two-Stage Queue Indicators": > > ... indicator contains the guest address of an area wherein the > indicators for the devices are contained, starting at bit_nr, one > bit per virtqueue of the device. > > And further in "Notification via Adapter I/O Interrupts": > > For notifying the driver of virtqueue buffers, the device sets the > bit in the guest-provided indicator area at the corresponding > offset. > > For example, QEMU uses in virtio_ccw_notify() the queue index (passed as > "vector") to select the relevant indicator bit. If a queue does not exist, > it does not have a corresponding indicator bit assigned, because it > effectively doesn't have a queue index. > > Using a virtio-balloon-ccw device under QEMU with free-page-hinting > disabled ("free-page-hint=off") but free-page-reporting enabled > ("free-page-reporting=on") will result in free page reporting > not working as expected: in the virtio_balloon driver, we'll be stuck > forever in virtballoon_free_page_report()->wait_event(), because the > waitqueue will not be woken up as the notification from the device is > lost: it would use the wrong indicator bit. > > Free page reporting stops working and we get splats (when configured to > detect hung wqs) like: > > INFO: task kworker/1:3:463 blocked for more than 61 seconds. > Not tainted 6.14.0 #4 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > task:kworker/1:3 [...] > Workqueue: events page_reporting_process > Call Trace: > [<000002f404e6dfb2>] __schedule+0x402/0x1640 > [<000002f404e6f22e>] schedule+0x3e/0xe0 > [<000002f3846a88fa>] virtballoon_free_page_report+0xaa/0x110 [virtio_balloon] > [<000002f40435c8a4>] page_reporting_process+0x2e4/0x740 > [<000002f403fd3ee2>] process_one_work+0x1c2/0x400 > [<000002f403fd4b96>] worker_thread+0x296/0x420 > [<000002f403fe10b4>] kthread+0x124/0x290 > [<000002f403f4e0dc>] __ret_from_fork+0x3c/0x60 > [<000002f404e77272>] ret_from_fork+0xa/0x38 > > There was recently a discussion [1] whether the "holes" should be > treated differently again, effectively assigning also non-existing > queues a queue index: that should also fix the issue, but requires other > workarounds to not break existing setups. > > Let's fix it without affecting existing setups for now by properly ignoring > the non-existing queues, so the indicator bits will match the queue > indexes. > > [1] https://lore.kernel.org/all/cover.1720611677.git.mst@redhat.com/ > > Fixes: a229989d975e ("virtio: don't allocate vqs when names[i] = NULL") > Reported-by: Chandra Merla <cmerla@redhat.com> > Cc: <Stable@vger.kernel.org> > Cc: Cornelia Huck <cohuck@redhat.com> > Cc: Thomas Huth <thuth@redhat.com> > Cc: Halil Pasic <pasic@linux.ibm.com> > Cc: Eric Farman <farman@linux.ibm.com> > Cc: Heiko Carstens <hca@linux.ibm.com> > Cc: Vasily Gorbik <gor@linux.ibm.com> > Cc: Alexander Gordeev <agordeev@linux.ibm.com> > Cc: Christian Borntraeger <borntraeger@linux.ibm.com> > Cc: Sven Schnelle <svens@linux.ibm.com> > Cc: "Michael S. Tsirkin" <mst@redhat.com> > Cc: Wei Wang <wei.w.wang@intel.com> > Signed-off-by: David Hildenbrand <david@redhat.com> > --- > drivers/s390/virtio/virtio_ccw.c | 16 ++++++++++++---- > 1 file changed, 12 insertions(+), 4 deletions(-) Reviewed-by: Cornelia Huck <cohuck@redhat.com> [I assume that one of the IBM folks can simply pick this up?]
On Wed, Apr 02, 2025 at 10:36:21PM +0200, David Hildenbrand wrote: > If we finds a vq without a name in our input array in > virtio_ccw_find_vqs(), we treat it as "non-existing" and set the vq pointer > to NULL; we will not call virtio_ccw_setup_vq() to allocate/setup a vq. > > Consequently, we create only a queue if it actually exists (name != NULL) > and assign an incremental queue index to each such existing queue. > > However, in virtio_ccw_register_adapter_ind()->get_airq_indicator() we > will not ignore these "non-existing queues", but instead assign an airq > indicator to them. > > Besides never releasing them in virtio_ccw_drop_indicators() (because > there is no virtqueue), the bigger issue seems to be that there will be a > disagreement between the device and the Linux guest about the airq > indicator to be used for notifying a queue, because the indicator bit > for adapter I/O interrupt is derived from the queue index. > > The virtio spec states under "Setting Up Two-Stage Queue Indicators": > > ... indicator contains the guest address of an area wherein the > indicators for the devices are contained, starting at bit_nr, one > bit per virtqueue of the device. > > And further in "Notification via Adapter I/O Interrupts": > > For notifying the driver of virtqueue buffers, the device sets the > bit in the guest-provided indicator area at the corresponding > offset. > > For example, QEMU uses in virtio_ccw_notify() the queue index (passed as > "vector") to select the relevant indicator bit. If a queue does not exist, > it does not have a corresponding indicator bit assigned, because it > effectively doesn't have a queue index. > > Using a virtio-balloon-ccw device under QEMU with free-page-hinting > disabled ("free-page-hint=off") but free-page-reporting enabled > ("free-page-reporting=on") will result in free page reporting > not working as expected: in the virtio_balloon driver, we'll be stuck > forever in virtballoon_free_page_report()->wait_event(), because the > waitqueue will not be woken up as the notification from the device is > lost: it would use the wrong indicator bit. > > Free page reporting stops working and we get splats (when configured to > detect hung wqs) like: > > INFO: task kworker/1:3:463 blocked for more than 61 seconds. > Not tainted 6.14.0 #4 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > task:kworker/1:3 [...] > Workqueue: events page_reporting_process > Call Trace: > [<000002f404e6dfb2>] __schedule+0x402/0x1640 > [<000002f404e6f22e>] schedule+0x3e/0xe0 > [<000002f3846a88fa>] virtballoon_free_page_report+0xaa/0x110 [virtio_balloon] > [<000002f40435c8a4>] page_reporting_process+0x2e4/0x740 > [<000002f403fd3ee2>] process_one_work+0x1c2/0x400 > [<000002f403fd4b96>] worker_thread+0x296/0x420 > [<000002f403fe10b4>] kthread+0x124/0x290 > [<000002f403f4e0dc>] __ret_from_fork+0x3c/0x60 > [<000002f404e77272>] ret_from_fork+0xa/0x38 > > There was recently a discussion [1] whether the "holes" should be > treated differently again, effectively assigning also non-existing > queues a queue index: that should also fix the issue, but requires other > workarounds to not break existing setups. > > Let's fix it without affecting existing setups for now by properly ignoring > the non-existing queues, so the indicator bits will match the queue > indexes. > > [1] https://lore.kernel.org/all/cover.1720611677.git.mst@redhat.com/ > > Fixes: a229989d975e ("virtio: don't allocate vqs when names[i] = NULL") > Reported-by: Chandra Merla <cmerla@redhat.com> > Cc: <Stable@vger.kernel.org> > Cc: Cornelia Huck <cohuck@redhat.com> > Cc: Thomas Huth <thuth@redhat.com> > Cc: Halil Pasic <pasic@linux.ibm.com> > Cc: Eric Farman <farman@linux.ibm.com> > Cc: Heiko Carstens <hca@linux.ibm.com> > Cc: Vasily Gorbik <gor@linux.ibm.com> > Cc: Alexander Gordeev <agordeev@linux.ibm.com> > Cc: Christian Borntraeger <borntraeger@linux.ibm.com> > Cc: Sven Schnelle <svens@linux.ibm.com> > Cc: "Michael S. Tsirkin" <mst@redhat.com> > Cc: Wei Wang <wei.w.wang@intel.com> > Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> feel free to merge. > --- > drivers/s390/virtio/virtio_ccw.c | 16 ++++++++++++---- > 1 file changed, 12 insertions(+), 4 deletions(-) > > diff --git a/drivers/s390/virtio/virtio_ccw.c b/drivers/s390/virtio/virtio_ccw.c > index 21fa7ac849e5c..4904b831c0a75 100644 > --- a/drivers/s390/virtio/virtio_ccw.c > +++ b/drivers/s390/virtio/virtio_ccw.c > @@ -302,11 +302,17 @@ static struct airq_info *new_airq_info(int index) > static unsigned long *get_airq_indicator(struct virtqueue *vqs[], int nvqs, > u64 *first, void **airq_info) > { > - int i, j; > + int i, j, queue_idx, highest_queue_idx = -1; > struct airq_info *info; > unsigned long *indicator_addr = NULL; > unsigned long bit, flags; > > + /* Array entries without an actual queue pointer must be ignored. */ > + for (i = 0; i < nvqs; i++) { > + if (vqs[i]) > + highest_queue_idx++; > + } > + > for (i = 0; i < MAX_AIRQ_AREAS && !indicator_addr; i++) { > mutex_lock(&airq_areas_lock); > if (!airq_areas[i]) > @@ -316,7 +322,7 @@ static unsigned long *get_airq_indicator(struct virtqueue *vqs[], int nvqs, > if (!info) > return NULL; > write_lock_irqsave(&info->lock, flags); > - bit = airq_iv_alloc(info->aiv, nvqs); > + bit = airq_iv_alloc(info->aiv, highest_queue_idx + 1); > if (bit == -1UL) { > /* Not enough vacancies. */ > write_unlock_irqrestore(&info->lock, flags); > @@ -325,8 +331,10 @@ static unsigned long *get_airq_indicator(struct virtqueue *vqs[], int nvqs, > *first = bit; > *airq_info = info; > indicator_addr = info->aiv->vector; > - for (j = 0; j < nvqs; j++) { > - airq_iv_set_ptr(info->aiv, bit + j, > + for (j = 0, queue_idx = 0; j < nvqs; j++) { > + if (!vqs[j]) > + continue; > + airq_iv_set_ptr(info->aiv, bit + queue_idx++, > (unsigned long)vqs[j]); > } > write_unlock_irqrestore(&info->lock, flags); > -- > 2.48.1
Am 02.04.25 um 22:36 schrieb David Hildenbrand: [...]> > Fixes: a229989d975e ("virtio: don't allocate vqs when names[i] = NULL") > Reported-by: Chandra Merla <cmerla@redhat.com> > Cc: <Stable@vger.kernel.org> > Cc: Cornelia Huck <cohuck@redhat.com> > Cc: Thomas Huth <thuth@redhat.com> > Cc: Halil Pasic <pasic@linux.ibm.com> > Cc: Eric Farman <farman@linux.ibm.com> > Cc: Heiko Carstens <hca@linux.ibm.com> > Cc: Vasily Gorbik <gor@linux.ibm.com> > Cc: Alexander Gordeev <agordeev@linux.ibm.com> > Cc: Christian Borntraeger <borntraeger@linux.ibm.com> > Cc: Sven Schnelle <svens@linux.ibm.com> > Cc: "Michael S. Tsirkin" <mst@redhat.com> > Cc: Wei Wang <wei.w.wang@intel.com> > Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com> I will apply this to our internal s390 tree and let it sit for a day to get CI coverage. Alexander, Vasily, Heiko, please then schedule for the s390/fixes branch if there is no CI fallout.
On Wed, 2 Apr 2025 22:36:21 +0200 David Hildenbrand <david@redhat.com> wrote: > If we finds a vq without a name in our input array in > virtio_ccw_find_vqs(), we treat it as "non-existing" and set the vq pointer > to NULL; we will not call virtio_ccw_setup_vq() to allocate/setup a vq. > > Consequently, we create only a queue if it actually exists (name != NULL) > and assign an incremental queue index to each such existing queue. First and foremost: thank you for addressing this! I have to admit, I'm still plagued by some cognitive dissonance here. Please bear with me. For starters the commit message of a229989d975e ("virtio: don't allocate vqs when names[i] = NULL") goes like this: """ virtio: don't allocate vqs when names[i] = NULL Some vqs may not need to be allocated when their related feature bits are disabled. So callers may pass in such vqs with "names = NULL". Then we skip such vq allocations. """ In my reading it does not talk about "non-existent" queues, but queues that do not need to be allocated. This could make sense for something like virtio-net where controlq 2N is with N being max_virtqueue_pairs. I guess for the guest it could make sense to not set up some of the queues initially, but those, I guess would be perfectly existent queues spec-wise and we would expect the index of controlq being 2N. And the queues that don't get set up initially can get set up later. At least this is my naive understanding at the moment. Now apparently there is a different case where queues may or may not exist, but we would, for some reason like to have the non-existent queues in the array, because for an other set of features negotiated those queues would actually exist and occupy and index. Frankly I don't fully comprehend it at the moment, but I will have another look at the code and at the spec. So lookign at the spec for virtio-ballon I see: 5.5.2 Virtqueues 0 inflateq 1 deflateq 2 statsq 3 free_page_vq 4 reporting_vq statsq only exists if VIRTIO_BALLOON_F_STATS_VQ is set. free_page_vq only exists if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set. reporting_vq only exists if VIRTIO_BALLOON_F_PAGE_REPORTING is set. Which is IMHO weird. I used to think about the number in front of the name as the virtqueue index. But based on this patch I'm wondering if that is compatible with the approach of this patch. What does for example mean if we have VIRTIO_BALLOON_F_STATS_VQ not offered, VIRTIO_BALLOON_F_FREE_PAGE_HINT offered but not negotiated and VIRTIO_BALLOON_F_PAGE_REPORTING negotiated. One reading of the things is that statq is does not exist for sure, free_page_vq is a little tricky because "is set" is not precise enough, and reporting_vq exists for sure. And in your reading of the spec, if I understood you correctly and we assume that free_page_vq does not exist, it has index 2. But I from the top of my head, I don't know why interpreting the spec like it reporting_vq has index 4 and indexes 2 and 3 are not mapped to existing-queues would be considered wrong. And even if we do want reportig_vq to have index 2, the virtio-balloon code could still give us an array where reportig_vq is at index 2. Why not? Sorry this ended up being a very long rant. the bottom line is that, I lack conceptual clarity on where the problem exactly is and how it needs to be addressed. I would like to understand this properly before moving forward. [..] > > There was recently a discussion [1] whether the "holes" should be > treated differently again, effectively assigning also non-existing > queues a queue index: that should also fix the issue, but requires other > workarounds to not break existing setups. > Sorry I have to have a look at that discussion. Maybe it will answer some my questions. > Let's fix it without affecting existing setups for now by properly ignoring > the non-existing queues, so the indicator bits will match the queue > indexes. Just one question. My understanding is that the crux is that Linux and QEMU (or the driver and the device) disagree at which index reporting_vq is actually sitting. Is that right? Regards, Halil
On 03.04.25 16:18, Halil Pasic wrote: > On Wed, 2 Apr 2025 22:36:21 +0200 > David Hildenbrand <david@redhat.com> wrote: > >> If we finds a vq without a name in our input array in >> virtio_ccw_find_vqs(), we treat it as "non-existing" and set the vq pointer >> to NULL; we will not call virtio_ccw_setup_vq() to allocate/setup a vq. >> >> Consequently, we create only a queue if it actually exists (name != NULL) >> and assign an incremental queue index to each such existing queue. > > First and foremost: thank you for addressing this! I have to admit, I'm > still plagued by some cognitive dissonance here. Please bear with me. > > For starters the commit message of a229989d975e ("virtio: don't > allocate vqs when names[i] = NULL") goes like this: > """ > virtio: don't allocate vqs when names[i] = NULL > > Some vqs may not need to be allocated when their related feature bits > are disabled. So callers may pass in such vqs with "names = NULL". > Then we skip such vq allocations. > """ > > In my reading it does not talk about "non-existent" queues, but queues > that do not need to be allocated. This could make sense for something > like virtio-net where controlq 2N is with N being max_virtqueue_pairs. > > I guess for the guest it could make sense to not set up some of the > queues initially, but those, I guess would be perfectly existent queues > spec-wise and we would expect the index of controlq being 2N. And the > queues that don't get set up initially can get set up later. At least > this is my naive understanding at the moment. > > Now apparently there is a different case where queues may or may not > exist, but we would, for some reason like to have the non-existent > queues in the array, because for an other set of features negotiated > those queues would actually exist and occupy and index. Frankly > I don't fully comprehend it at the moment, but I will have another look > at the code and at the spec. > > So lookign at the spec for virtio-ballon I see: > > > > 5.5.2 Virtqueues > > 0 > inflateq > 1 > deflateq > 2 > statsq > 3 > free_page_vq > 4 > reporting_vq > > statsq only exists if VIRTIO_BALLOON_F_STATS_VQ is set. > > free_page_vq only exists if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set. > > reporting_vq only exists if VIRTIO_BALLOON_F_PAGE_REPORTING is set. > > Which is IMHO weird. I used to think about the number in front of the > name as the virtqueue index. But based on this patch I'm wondering if > that is compatible with the approach of this patch. > > What does for example mean if we have VIRTIO_BALLOON_F_STATS_VQ not > offered, VIRTIO_BALLOON_F_FREE_PAGE_HINT offered but not negotiated > and VIRTIO_BALLOON_F_PAGE_REPORTING negotiated. > > One reading of the things is that statq is does not exist for sure, > free_page_vq is a little tricky because "is set" is not precise enough, > and reporting_vq exists for sure. And in your reading of the spec, if > I understood you correctly and we assume that free_page_vq does not > exist, it has index 2. But I from the top of my head, I don't know why > interpreting the spec like it reporting_vq has index 4 and indexes 2 > and 3 are not mapped to existing-queues would be considered wrong. > > And even if we do want reportig_vq to have index 2, the virtio-balloon > code could still give us an array where reportig_vq is at index 2. Why > not? > > Sorry this ended up being a very long rant. the bottom line is that, I > lack conceptual clarity on where the problem exactly is and how it needs > to be addressed. I would like to understand this properly before moving > forward. > I would suggest you take a look at [1] I added below, and the disconnect between the spec and what Linux + QEMU actually implemented. In reality (with QEMU), reporting_vq sits either on index 3 or 4, depending on the existence of VIRTIO_BALLOON_F_FREE_PAGE_HINT. > [..] >> >> There was recently a discussion [1] whether the "holes" should be >> treated differently again, effectively assigning also non-existing >> queues a queue index: that should also fix the issue, but requires other >> workarounds to not break existing setups. >> > > Sorry I have to have a look at that discussion. Maybe it will answer > some my questions. Yes, I think so. > >> Let's fix it without affecting existing setups for now by properly ignoring >> the non-existing queues, so the indicator bits will match the queue >> indexes. > > Just one question. My understanding is that the crux is that Linux > and QEMU (or the driver and the device) disagree at which index > reporting_vq is actually sitting. Is that right? I thought I made it clear: this is only about the airq indicator bit. That's where both disagree. Not the actual queue index (see above).
On Thu, Apr 03, 2025 at 04:18:36PM +0200, Halil Pasic wrote: > On Wed, 2 Apr 2025 22:36:21 +0200 > David Hildenbrand <david@redhat.com> wrote: > > > If we finds a vq without a name in our input array in > > virtio_ccw_find_vqs(), we treat it as "non-existing" and set the vq pointer > > to NULL; we will not call virtio_ccw_setup_vq() to allocate/setup a vq. > > > > Consequently, we create only a queue if it actually exists (name != NULL) > > and assign an incremental queue index to each such existing queue. > > First and foremost: thank you for addressing this! I have to admit, I'm > still plagued by some cognitive dissonance here. Please bear with me. > > For starters the commit message of a229989d975e ("virtio: don't > allocate vqs when names[i] = NULL") goes like this: > """ > virtio: don't allocate vqs when names[i] = NULL > > Some vqs may not need to be allocated when their related feature bits > are disabled. So callers may pass in such vqs with "names = NULL". > Then we skip such vq allocations. > """ > > In my reading it does not talk about "non-existent" queues, but queues > that do not need to be allocated. This could make sense for something > like virtio-net where controlq 2N is with N being max_virtqueue_pairs. > > I guess for the guest it could make sense to not set up some of the > queues initially, but those, I guess would be perfectly existent queues > spec-wise and we would expect the index of controlq being 2N. And the > queues that don't get set up initially can get set up later. At least > this is my naive understanding at the moment. > > Now apparently there is a different case where queues may or may not > exist, but we would, for some reason like to have the non-existent > queues in the array, because for an other set of features negotiated > those queues would actually exist and occupy and index. Frankly > I don't fully comprehend it at the moment, but I will have another look > at the code and at the spec. > > So lookign at the spec for virtio-ballon I see: > > > > 5.5.2 Virtqueues > > 0 > inflateq > 1 > deflateq > 2 > statsq > 3 > free_page_vq > 4 > reporting_vq Indeed. Unfortunately, no one at all implemented this properly as per spec :(. Balloon is the worst offender here but other devices are broken too in some configurations. Given it has been like this for many years I'm inclined in this instance to fix the spec not the implementations. > statsq only exists if VIRTIO_BALLOON_F_STATS_VQ is set. > > free_page_vq only exists if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set. > > reporting_vq only exists if VIRTIO_BALLOON_F_PAGE_REPORTING is set. > > Which is IMHO weird. I used to think about the number in front of the > name as the virtqueue index. But based on this patch I'm wondering if > that is compatible with the approach of this patch. > > What does for example mean if we have VIRTIO_BALLOON_F_STATS_VQ not > offered, VIRTIO_BALLOON_F_FREE_PAGE_HINT offered but not negotiated > and VIRTIO_BALLOON_F_PAGE_REPORTING negotiated. > > One reading of the things is that statq is does not exist for sure, > free_page_vq is a little tricky because "is set" is not precise enough, > and reporting_vq exists for sure. And in your reading of the spec, if > I understood you correctly and we assume that free_page_vq does not > exist, it has index 2. But I from the top of my head, I don't know why > interpreting the spec like it reporting_vq has index 4 and indexes 2 > and 3 are not mapped to existing-queues would be considered wrong. > > And even if we do want reportig_vq to have index 2, the virtio-balloon > code could still give us an array where reportig_vq is at index 2. Why > not? > > Sorry this ended up being a very long rant. the bottom line is that, I > lack conceptual clarity on where the problem exactly is and how it needs > to be addressed. I would like to understand this properly before moving > forward. > > [..] > > > > There was recently a discussion [1] whether the "holes" should be > > treated differently again, effectively assigning also non-existing > > queues a queue index: that should also fix the issue, but requires other > > workarounds to not break existing setups. Yep. I tried that, and it is basically a terrible mess :(. > > > > Sorry I have to have a look at that discussion. Maybe it will answer > some my questions. > > > Let's fix it without affecting existing setups for now by properly ignoring > > the non-existing queues, so the indicator bits will match the queue > > indexes. > > Just one question. My understanding is that the crux is that Linux > and QEMU (or the driver and the device) disagree at which index > reporting_vq is actually sitting. Is that right? > > Regards, > Halil
On Thu, 3 Apr 2025 10:35:33 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote: > On Thu, Apr 03, 2025 at 04:18:36PM +0200, Halil Pasic wrote: > > On Wed, 2 Apr 2025 22:36:21 +0200 [..] > > > > 5.5.2 Virtqueues > > > > 0 > > inflateq > > 1 > > deflateq > > 2 > > statsq > > 3 > > free_page_vq > > 4 > > reporting_vq > > Indeed. Unfortunately, no one at all implemented this properly as > per spec :(. > > Balloon is the worst offender here but other devices are broken > too in some configurations. > > > Given it has been like this for many years I'm inclined in this > instance to fix the spec not the implementations. > I understand! For me at least knowing if we are going to change the spec or the implementations is pretty important. It would probably make sense to spin a patch for the spec, just for the unlikely case that somebody did end up implementing this by the spec and wants to protest. I think, a consequence of this design is that all queues need to be created and allocated at initialization time. I mean in the spec we have something like """ 5.1.6.5.6.1 Driver Requirements: Automatic receive steering in multiqueue mode The driver MUST configure the virtqueues before enabling them with the VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET command. """ For example one could want to hotplug 2 more vCPUs and still have a queue-pair per cpu (and a controlq). Those 2 extra queue-pairs could in theory be allocated on-demand instead of having to allocate memory for the rings for all the queues corresponding to the maxed out setup. I've had a look at the Linux virtio-net and it does allocate all the queues up front. Also with this design, I believe we would effectively prohibit "holes". Now if holes are prohibited, IMHO it does not make a whole lot of sense for the virtio_find_vqs() to support holes. Because the holes and a fair amount of complexity. Of course that would make sense as a cleanup on top. Regards, Halil
On Thu, 3 Apr 2025 16:28:31 +0200 David Hildenbrand <david@redhat.com> wrote: > > Sorry I have to have a look at that discussion. Maybe it will answer > > some my questions. > > Yes, I think so. > > > > >> Let's fix it without affecting existing setups for now by properly > >> ignoring the non-existing queues, so the indicator bits will match > >> the queue indexes. > > > > Just one question. My understanding is that the crux is that Linux > > and QEMU (or the driver and the device) disagree at which index > > reporting_vq is actually sitting. Is that right? > > I thought I made it clear: this is only about the airq indicator bit. > That's where both disagree. > > Not the actual queue index (see above). I did some more research including having a look at that discussion. Let me try to sum up how did we end up here. Before commit a229989d975e ("virtio: don't allocate vqs when names[i] = NULL") the kernel behavior used to be in spec, but QEMU and possibly other hypervisor were out of spec and things did not work. Possibly because of the complexity of fixing the hypervisor(s) commit a229989d975e ("virtio: don't allocate vqs when names[i] = NULL") opted for changing the guest side so that it does not fit the spec but fits the hypervisor(s). It unfortunately also broke notifiers (for the with holes) scenario for virtio-ccw only. Now we had another look at this, and have concluded that fixing the hypervisor(s) and fixing the kernel, and making sure that the fixed kernel can tolerate the old broken hypervisor(s) is way to complicated if possible at all. So we decided to give the spec a reality check and fix the notifier bit assignment for virtio-ccw which is broken beyond doubt if we accept that the correct virtqueue index is the one that the hypervisor(s) use and not the one that the spec says they should use. With the spec fixed, the whole notion of "holes" will be something that does not make sense any more. With that the merit of the kernel interface virtio_find_vqs() supporting "holes" is quite questionable. Now we need it because the drivers within the Linux kernel still think of the queues in terms of the current spec, i.e. they try to have the "holes" as mandated by the spec, and the duty of making it work with the broken device implementations falls to the transports. Under the assumption that the spec is indeed going to be fixed: Reviewed-by: Halil Pasic <pasic@linux.ibm.com> Regards, Halil
On Fri, Apr 04, 2025 at 06:02:04AM +0200, Halil Pasic wrote: > On Thu, 3 Apr 2025 10:35:33 -0400 > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > On Thu, Apr 03, 2025 at 04:18:36PM +0200, Halil Pasic wrote: > > > On Wed, 2 Apr 2025 22:36:21 +0200 > [..] > > > > > > 5.5.2 Virtqueues > > > > > > 0 > > > inflateq > > > 1 > > > deflateq > > > 2 > > > statsq > > > 3 > > > free_page_vq > > > 4 > > > reporting_vq > > > > Indeed. Unfortunately, no one at all implemented this properly as > > per spec :(. > > > > Balloon is the worst offender here but other devices are broken > > too in some configurations. > > > > > > Given it has been like this for many years I'm inclined in this > > instance to fix the spec not the implementations. > > > > I understand! For me at least knowing if we are going to change the > spec or the implementations is pretty important. It would probably > make sense to spin a patch for the spec, just for the unlikely case that > somebody did end up implementing this by the spec and wants to protest. > > I think, a consequence of this design is that all queues need to be > created and allocated at initialization time. Why? after feature negotiation. > I mean in the spec we have something like > """ > 5.1.6.5.6.1 Driver Requirements: Automatic receive steering in multiqueue mode > The driver MUST configure the virtqueues before enabling them with the VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET command. > """ > > For example one could want to hotplug 2 more vCPUs and still have a > queue-pair per cpu (and a controlq). Those 2 extra queue-pairs could > in theory be allocated on-demand instead of having to allocate memory > for the rings for all the queues corresponding to the maxed out setup. > I've had a look at the Linux virtio-net and it does allocate all the > queues up front. > > Also with this design, I believe we would effectively prohibit "holes". For existing devices at least. > Now if holes are prohibited, IMHO it does not make a whole lot of > sense for the virtio_find_vqs() to support holes. Because the holes > and a fair amount of complexity. Of course that would make sense as a > cleanup on top. > > Regards, > Halil
On 04.04.25 06:36, Halil Pasic wrote: > On Thu, 3 Apr 2025 16:28:31 +0200 > David Hildenbrand <david@redhat.com> wrote: > >>> Sorry I have to have a look at that discussion. Maybe it will answer >>> some my questions. >> >> Yes, I think so. >> >>> >>>> Let's fix it without affecting existing setups for now by properly >>>> ignoring the non-existing queues, so the indicator bits will match >>>> the queue indexes. >>> >>> Just one question. My understanding is that the crux is that Linux >>> and QEMU (or the driver and the device) disagree at which index >>> reporting_vq is actually sitting. Is that right? >> >> I thought I made it clear: this is only about the airq indicator bit. >> That's where both disagree. >> >> Not the actual queue index (see above). > > I did some more research including having a look at that discussion. Let > me try to sum up how did we end up here. Let me add some more details after digging as well: > > Before commit a229989d975e ("virtio: don't allocate vqs when names[i] = > NULL") the kernel behavior used to be in spec, but QEMU and possibly > other hypervisor were out of spec and things did not work. It all started with VIRTIO_BALLOON_F_FREE_PAGE_HINT. Before that, we only had the single optional VIRTIO_BALLOON_F_STATS_VQ queue at the very end. So there was no possibility for holes "in-between". In the Linux driver, we created the stats queue only if the feature bit VIRTIO_BALLOON_F_STATS_VQ was actually around: nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2; err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL); That changed with VIRTIO_BALLOON_F_FREE_PAGE_HINT, because we would unconditionally create 4 queues. QEMU always supported the first 3 queues unconditionally, but old QEMU did obviously not support the (new) VIRTIO_BALLOON_F_FREE_PAGE_HINT queue. 390x didn't particularly like getting queried for non-existing queues. [1] So the fix was not for a hypervisor that was out of spec, but because quering non-existing queues didn't work. The fix implied that if VIRTIO_BALLOON_F_STATS_VQ is missing, suddenly the queue index of VIRTIO_BALLOON_F_FREE_PAGE_HINT changed as well. Again, as QEMU always implemented the 3 first queues unconditionally, this was not a problem. [1] https://lore.kernel.org/all/c6746307-fae5-7652-af8d-19f560fc31d9@de.ibm.com/#t > > Possibly because of the complexity of fixing the hypervisor(s) commit > a229989d975e ("virtio: don't allocate vqs when names[i] = NULL") opted > for changing the guest side so that it does not fit the spec but fits > the hypervisor(s). It unfortunately also broke notifiers (for the with > holes) scenario for virtio-ccw only. Yes, it broke the notifiers. But note that everything was in spec at that point, because we only documented "free_page_vq == 3" in the spec *2 years later*, in 2020: commit 38448268eba0c105200d131c3f7f660129a4d673 Author: Alexander Duyck <alexander.h.duyck@linux.intel.com> Date: Tue Aug 25 07:45:02 2020 -0700 content: Document balloon feature free page hints Free page hints allow the balloon driver to provide information on what pages are not currently in use so that we can avoid the cost of copying them in migration scenarios. Add a feature description for free page hints describing basic functioning and requirements. At that point, what we documented in the spec *did not match reality* in Linux. QEMU was fully compatible, because VIRTIO_BALLOON_F_STATS_VQ is unconditionally set. QEMU and Linux kept using that queue index assignment model, and the spec was wrong (out of sync?) at that point. The spec got more wrong with commit d917d4a8d552c003e046b0e3b1b529d98f7e695b Author: Alexander Duyck <alexander.h.duyck@linux.intel.com> Date: Tue Aug 25 07:45:17 2020 -0700 content: Document balloon feature free page reporting Free page reporting is a feature that allows the guest to proactively report unused pages to the host. By making use of this feature is is possible to reduce the overall memory footprint of the guest in cases where some significant portion of the memory is idle. Add documentation for the free page reporting feature describing the functionality and requirements. Where we documented VIRTIO_BALLOON_F_REPORTING after the changes were added to QEMU+Linux implementation, so the spec did not reflect reality. I'll note also cloud-hypervisor [2] today follows that model. In particular, it *only* supports VIRTIO_BALLOON_F_REPORTING, turning the queue index of VIRTIO_BALLOON_F_REPORTING into *2* instead of documented in the spec to be *4*. So in reality, we can see VIRTIO_BALLOON_F_REPORTING to be either 2/3/4, depending on the availability of the other two features/queues. [2] https://github.com/cloud-hypervisor/cloud-hypervisor/blob/main/virtio-devices/src/balloon.rs > > Now we had another look at this, and have concluded that fixing the > hypervisor(s) and fixing the kernel, and making sure that the fixed > kernel can tolerate the old broken hypervisor(s) is way to complicated > if possible at all. So we decided to give the spec a reality check and > fix the notifier bit assignment for virtio-ccw which is broken beyond > doubt if we accept that the correct virtqueue index is the one that the > hypervisor(s) use and not the one that the spec says they should use. In case of virtio-balloon, it's unfortunate that it went that way, but the spec simply did not / does not reflect reality when it was added to the spec. > > With the spec fixed, the whole notion of "holes" will be something that > does not make sense any more. With that the merit of the kernel interface > virtio_find_vqs() supporting "holes" is quite questionable. Now we need > it because the drivers within the Linux kernel still think of the queues > in terms of the current spec, i.e. they try to have the "holes" as > mandated by the spec, and the duty of making it work with the broken > device implementations falls to the transports. > Right, the "holes" only exist in the input array. > Under the assumption that the spec is indeed going to be fixed: > > Reviewed-by: Halil Pasic <pasic@linux.ibm.com> Thanks!
On 04.04.25 12:00, David Hildenbrand wrote: > On 04.04.25 06:36, Halil Pasic wrote: >> On Thu, 3 Apr 2025 16:28:31 +0200 >> David Hildenbrand <david@redhat.com> wrote: >> >>>> Sorry I have to have a look at that discussion. Maybe it will answer >>>> some my questions. >>> >>> Yes, I think so. >>> >>>> >>>>> Let's fix it without affecting existing setups for now by properly >>>>> ignoring the non-existing queues, so the indicator bits will match >>>>> the queue indexes. >>>> >>>> Just one question. My understanding is that the crux is that Linux >>>> and QEMU (or the driver and the device) disagree at which index >>>> reporting_vq is actually sitting. Is that right? >>> >>> I thought I made it clear: this is only about the airq indicator bit. >>> That's where both disagree. >>> >>> Not the actual queue index (see above). >> >> I did some more research including having a look at that discussion. Let >> me try to sum up how did we end up here. > > Let me add some more details after digging as well: > >> >> Before commit a229989d975e ("virtio: don't allocate vqs when names[i] = >> NULL") the kernel behavior used to be in spec, but QEMU and possibly >> other hypervisor were out of spec and things did not work. > > It all started with VIRTIO_BALLOON_F_FREE_PAGE_HINT. Before that, > we only had the single optional VIRTIO_BALLOON_F_STATS_VQ queue at the very > end. So there was no possibility for holes "in-between". > > In the Linux driver, we created the stats queue only if the feature bit > VIRTIO_BALLOON_F_STATS_VQ was actually around: > > nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2; > err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL); > > That changed with VIRTIO_BALLOON_F_FREE_PAGE_HINT, because we would > unconditionally create 4 queues. QEMU always supported the first 3 queues > unconditionally, but old QEMU did obviously not support the (new) > VIRTIO_BALLOON_F_FREE_PAGE_HINT queue. > > 390x didn't particularly like getting queried for non-existing > queues. [1] So the fix was not for a hypervisor that was out of spec, but > because quering non-existing queues didn't work. > > The fix implied that if VIRTIO_BALLOON_F_STATS_VQ is missing, suddenly the queue > index of VIRTIO_BALLOON_F_FREE_PAGE_HINT changed as well. > > Again, as QEMU always implemented the 3 first queues unconditionally, this was > not a problem. > > [1] https://lore.kernel.org/all/c6746307-fae5-7652-af8d-19f560fc31d9@de.ibm.com/#t > >> >> Possibly because of the complexity of fixing the hypervisor(s) commit >> a229989d975e ("virtio: don't allocate vqs when names[i] = NULL") opted >> for changing the guest side so that it does not fit the spec but fits >> the hypervisor(s). It unfortunately also broke notifiers (for the with >> holes) scenario for virtio-ccw only. > > Yes, it broke the notifiers. > > But note that everything was in spec at that point, because we only documented > "free_page_vq == 3" in the spec *2 years later*, in 2020: > > commit 38448268eba0c105200d131c3f7f660129a4d673 > Author: Alexander Duyck <alexander.h.duyck@linux.intel.com> > Date: Tue Aug 25 07:45:02 2020 -0700 > > content: Document balloon feature free page hints > > Free page hints allow the balloon driver to provide information on what > pages are not currently in use so that we can avoid the cost of copying > them in migration scenarios. Add a feature description for free page hints > describing basic functioning and requirements. > > At that point, what we documented in the spec *did not match reality* in > Linux. QEMU was fully compatible, because VIRTIO_BALLOON_F_STATS_VQ is > unconditionally set. > > > QEMU and Linux kept using that queue index assignment model, and the spec > was wrong (out of sync?) at that point. The spec got more wrong with > > commit d917d4a8d552c003e046b0e3b1b529d98f7e695b > Author: Alexander Duyck <alexander.h.duyck@linux.intel.com> > Date: Tue Aug 25 07:45:17 2020 -0700 > > content: Document balloon feature free page reporting > > Free page reporting is a feature that allows the guest to proactively > report unused pages to the host. By making use of this feature is is > possible to reduce the overall memory footprint of the guest in cases where > some significant portion of the memory is idle. Add documentation for the > free page reporting feature describing the functionality and requirements. > > Where we documented VIRTIO_BALLOON_F_REPORTING after the changes were added to > QEMU+Linux implementation, so the spec did not reflect reality. > > I'll note also cloud-hypervisor [2] today follows that model. > > In particular, it *only* supports VIRTIO_BALLOON_F_REPORTING, turning > the queue index of VIRTIO_BALLOON_F_REPORTING into *2* instead of documented > in the spec to be *4*. > > So in reality, we can see VIRTIO_BALLOON_F_REPORTING to be either 2/3/4, depending > on the availability of the other two features/queues. > > [2] https://github.com/cloud-hypervisor/cloud-hypervisor/blob/main/virtio-devices/src/balloon.rs > > >> >> Now we had another look at this, and have concluded that fixing the >> hypervisor(s) and fixing the kernel, and making sure that the fixed >> kernel can tolerate the old broken hypervisor(s) is way to complicated >> if possible at all. So we decided to give the spec a reality check and >> fix the notifier bit assignment for virtio-ccw which is broken beyond >> doubt if we accept that the correct virtqueue index is the one that the >> hypervisor(s) use and not the one that the spec says they should use. > > In case of virtio-balloon, it's unfortunate that it went that way, but the > spec simply did not / does not reflect reality when it was added to the spec. > >> >> With the spec fixed, the whole notion of "holes" will be something that >> does not make sense any more. With that the merit of the kernel interface >> virtio_find_vqs() supporting "holes" is quite questionable. Now we need >> it because the drivers within the Linux kernel still think of the queues >> in terms of the current spec, i.e. they try to have the "holes" as >> mandated by the spec, and the duty of making it work with the broken >> device implementations falls to the transports. >> > > Right, the "holes" only exist in the input array. > >> Under the assumption that the spec is indeed going to be fixed: For virito-balloon, we should probably do the following: From 38e340c2bb53c2a7cc7c675f5dfdd44ecf7701d9 Mon Sep 17 00:00:00 2001 From: David Hildenbrand <david@redhat.com> Date: Fri, 4 Apr 2025 12:53:16 +0200 Subject: [PATCH] virtio-balloon: Fix queue index assignment for non-existing queues Signed-off-by: David Hildenbrand <david@redhat.com> --- device-types/balloon/description.tex | 22 ++++++++++++++++------ 1 file changed, 16 insertions(+), 6 deletions(-) diff --git a/device-types/balloon/description.tex b/device-types/balloon/description.tex index a1d9603..a7396ff 100644 --- a/device-types/balloon/description.tex +++ b/device-types/balloon/description.tex @@ -16,6 +16,21 @@ \subsection{Device ID}\label{sec:Device Types / Memory Balloon Device / Device I 5 \subsection{Virtqueues}\label{sec:Device Types / Memory Balloon Device / Virtqueues} + +\begin{description} +\item[inflateq] Exists unconditionally. +\item[deflateq] Exists unconditionally. +\item[statsq] Only exists if VIRTIO_BALLOON_F_STATS_VQ is set. +\item[free_page_vq] Only exists if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set. +\item[reporting_vq] Only exists if VIRTIO_BALLOON_F_PAGE_REPORTING is set. +\end{description} + +\begin{note} +Virtqueue indexes are assigned sequentially for existing queues, starting +with index 0; consequently, if a virtqueue does not exist, it does not get +an index assigned. Assuming all virtqueues exist for a device, the indexes +are: + \begin{description} \item[0] inflateq \item[1] deflateq @@ -23,12 +38,7 @@ \subsection{Virtqueues}\label{sec:Device Types / Memory Balloon Device / Virtque \item[3] free_page_vq \item[4] reporting_vq \end{description} - - statsq only exists if VIRTIO_BALLOON_F_STATS_VQ is set. - - free_page_vq only exists if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set. - - reporting_vq only exists if VIRTIO_BALLOON_F_PAGE_REPORTING is set. +\end{note} \subsection{Feature bits}\label{sec:Device Types / Memory Balloon Device / Feature bits} \begin{description}
On Fri, 4 Apr 2025 01:33:28 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > > I think, a consequence of this design is that all queues need to be > > created and allocated at initialization time. > > Why? after feature negotiation. What I mean is, with this change having queues that exist but are not set up before the device becomes operational is not viable any more. Let me use the virtio-net example again. I assume by the current spec it would be OK to have max_virtqueue_pairs quite big e.g. 64, just in case the guest ends up having many vCPUs hotplugged. But start out with 2 vcpus, 2 queue pairs and initially doing VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET with 2. And then grow the guest to 8 vcpus, 'discover' virtqueues 2(I-1) receiveqI, 2(I-1)+1 transmitqI for 2 < I < 9 (I is a natural number) and do another VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET with 8. Please notice that the controlq would sit at index 128 (64*2) all along. That is in the old world. In the new world we don't do holes, so we need to allocate all the virtqueues up to controlq up-front. To avoid having holes. Or any queue-pairs that are discoverd after the initial vq discovery would need to have an index larger than controlq has. Regards, Halil
On Fri, 4 Apr 2025 12:55:09 +0200 David Hildenbrand <david@redhat.com> wrote: > For virito-balloon, we should probably do the following: > > From 38e340c2bb53c2a7cc7c675f5dfdd44ecf7701d9 Mon Sep 17 00:00:00 2001 > From: David Hildenbrand <david@redhat.com> > Date: Fri, 4 Apr 2025 12:53:16 +0200 > Subject: [PATCH] virtio-balloon: Fix queue index assignment for > non-existing queues > > Signed-off-by: David Hildenbrand <david@redhat.com> > --- > device-types/balloon/description.tex | 22 ++++++++++++++++------ > 1 file changed, 16 insertions(+), 6 deletions(-) > > diff --git a/device-types/balloon/description.tex b/device-types/balloon/description.tex > index a1d9603..a7396ff 100644 > --- a/device-types/balloon/description.tex > +++ b/device-types/balloon/description.tex > @@ -16,6 +16,21 @@ \subsection{Device ID}\label{sec:Device Types / Memory Balloon Device / Device I > 5 > > \subsection{Virtqueues}\label{sec:Device Types / Memory Balloon Device / Virtqueues} > + > +\begin{description} > +\item[inflateq] Exists unconditionally. > +\item[deflateq] Exists unconditionally. > +\item[statsq] Only exists if VIRTIO_BALLOON_F_STATS_VQ is set. > +\item[free_page_vq] Only exists if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set. > +\item[reporting_vq] Only exists if VIRTIO_BALLOON_F_PAGE_REPORTING is set. s/is set/is negotiated/? I think we should stick to "feature is offered" and "feature is negotiated". > +\end{description} > + > +\begin{note} > +Virtqueue indexes are assigned sequentially for existing queues, starting > +with index 0; consequently, if a virtqueue does not exist, it does not get > +an index assigned. Assuming all virtqueues exist for a device, the indexes > +are: > + > \begin{description} > \item[0] inflateq > \item[1] deflateq > @@ -23,12 +38,7 @@ \subsection{Virtqueues}\label{sec:Device Types / Memory Balloon Device / Virtque > \item[3] free_page_vq > \item[4] reporting_vq > \end{description} > - > - statsq only exists if VIRTIO_BALLOON_F_STATS_VQ is set. > - > - free_page_vq only exists if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set. > - > - reporting_vq only exists if VIRTIO_BALLOON_F_PAGE_REPORTING is set. > +\end{note} > > \subsection{Feature bits}\label{sec:Device Types / Memory Balloon Device / Feature bits} > \begin{description} Sounds good to me! But I'm still a little confused by the "holes". What confuses me is that i can think of at least 2 distinct types of "holes": 1) Holes that can be filled later. The queue conceptually exists, but there is no need to back it with any resources for now because it is dormant (it can be seen a hole in comparison to queues that need to materialize -- vring, notifiers, ...) 2) Holes that can not be filled without resetting the device: i.e. if certain features are not negotiated, then a queue X does not exist, but subsequent queues retain their index. Can we have both kinds or was/will be 1) and/or 2) never a thing? This patch would make sure that neither 1) nor 2) applies to virtio-balloon, which is good. But we are talking about a transport fix here, and I would like to eventually make sure that the abstractions make sense. That being said, I think we should proceed with this patch, because I don't think Linux uses type 1) holes. Regards, Halil
On 04.04.25 15:36, Halil Pasic wrote: > On Fri, 4 Apr 2025 12:55:09 +0200 > David Hildenbrand <david@redhat.com> wrote: > >> For virito-balloon, we should probably do the following: >> >> From 38e340c2bb53c2a7cc7c675f5dfdd44ecf7701d9 Mon Sep 17 00:00:00 2001 >> From: David Hildenbrand <david@redhat.com> >> Date: Fri, 4 Apr 2025 12:53:16 +0200 >> Subject: [PATCH] virtio-balloon: Fix queue index assignment for >> non-existing queues >> >> Signed-off-by: David Hildenbrand <david@redhat.com> >> --- >> device-types/balloon/description.tex | 22 ++++++++++++++++------ >> 1 file changed, 16 insertions(+), 6 deletions(-) >> >> diff --git a/device-types/balloon/description.tex b/device-types/balloon/description.tex >> index a1d9603..a7396ff 100644 >> --- a/device-types/balloon/description.tex >> +++ b/device-types/balloon/description.tex >> @@ -16,6 +16,21 @@ \subsection{Device ID}\label{sec:Device Types / Memory Balloon Device / Device I >> 5 >> >> \subsection{Virtqueues}\label{sec:Device Types / Memory Balloon Device / Virtqueues} >> + >> +\begin{description} >> +\item[inflateq] Exists unconditionally. >> +\item[deflateq] Exists unconditionally. >> +\item[statsq] Only exists if VIRTIO_BALLOON_F_STATS_VQ is set. >> +\item[free_page_vq] Only exists if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set. >> +\item[reporting_vq] Only exists if VIRTIO_BALLOON_F_PAGE_REPORTING is set. > > s/is set/is negotiated/? > > I think we should stick to "feature is offered" and "feature is > negotiated". > >> +\end{description} >> + >> +\begin{note} >> +Virtqueue indexes are assigned sequentially for existing queues, starting >> +with index 0; consequently, if a virtqueue does not exist, it does not get >> +an index assigned. Assuming all virtqueues exist for a device, the indexes >> +are: >> + >> \begin{description} >> \item[0] inflateq >> \item[1] deflateq >> @@ -23,12 +38,7 @@ \subsection{Virtqueues}\label{sec:Device Types / Memory Balloon Device / Virtque >> \item[3] free_page_vq >> \item[4] reporting_vq >> \end{description} >> - >> - statsq only exists if VIRTIO_BALLOON_F_STATS_VQ is set. >> - >> - free_page_vq only exists if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set. >> - >> - reporting_vq only exists if VIRTIO_BALLOON_F_PAGE_REPORTING is set. >> +\end{note} >> >> \subsection{Feature bits}\label{sec:Device Types / Memory Balloon Device / Feature bits} >> \begin{description} > > Sounds good to me! But I'm still a little confused by the "holes". What > confuses me is that i can think of at least 2 distinct types of "holes": > 1) Holes that can be filled later. The queue conceptually exists, but > there is no need to back it with any resources for now because it is > dormant (it can be seen a hole in comparison to queues that need to > materialize -- vring, notifiers, ...) > 2) Holes that can not be filled without resetting the device: i.e. if > certain features are not negotiated, then a queue X does not exist, > but subsequent queues retain their index. I think it is not about "negotiated", that might be the wrong terminology. E.g., in QEMU virtio_balloon_device_realize() we define the virtqueues (virtio_add_queue()) if virtio_has_feature(s->host_features). That is, it's independent of a feature negotiation (IIUC), it's static for the device -- "host_features" Is that really "negotiated" or is it "the device offers the feature X" ?
On Fri, 4 Apr 2025 15:48:49 +0200 David Hildenbrand <david@redhat.com> wrote: > > Sounds good to me! But I'm still a little confused by the "holes". > > What confuses me is that i can think of at least 2 distinct types of > > "holes": 1) Holes that can be filled later. The queue conceptually > > exists, but there is no need to back it with any resources for now > > because it is dormant (it can be seen a hole in comparison to queues > > that need to materialize -- vring, notifiers, ...) > > 2) Holes that can not be filled without resetting the device: i.e. if > > certain features are not negotiated, then a queue X does not > > exist, but subsequent queues retain their index. > > I think it is not about "negotiated", that might be the wrong > terminology. > > E.g., in QEMU virtio_balloon_device_realize() we define the virtqueues > (virtio_add_queue()) if virtio_has_feature(s->host_features). > > That is, it's independent of a feature negotiation (IIUC), it's static > for the device -- "host_features" > > > Is that really "negotiated" or is it "the device offers the feature X" > ? It is offered. And this is precisely why I'm so keen on having a precise wording here. Usually for compatibility one needs negotiated. Because the feature negotiation is mostly about compatibility. I.e. the driver should be able to say, hey I don't know about that feature, and get compatible behavior. If for example VIRTIO_BALLOON_F_FREE_PAGE_HINT and VIRTIO_BALLOON_F_PAGE_REPORTING are both offered but only VIRTIO_BALLOON_F_PAGE_REPORTING is negotiated. That would make reporting_vq jump to +1 compared to the case where VIRTIO_BALLOON_F_FREE_PAGE_HINT is not offered. Which is IMHO no good, because for the features that the driver is going to reject in most of the cases it should not matter if it was offered or not. @MST: Please correct me if I'm wrong! Regards, Halil
On 04.04.25 16:00, Halil Pasic wrote: > On Fri, 4 Apr 2025 15:48:49 +0200 > David Hildenbrand <david@redhat.com> wrote: > >>> Sounds good to me! But I'm still a little confused by the "holes". >>> What confuses me is that i can think of at least 2 distinct types of >>> "holes": 1) Holes that can be filled later. The queue conceptually >>> exists, but there is no need to back it with any resources for now >>> because it is dormant (it can be seen a hole in comparison to queues >>> that need to materialize -- vring, notifiers, ...) >>> 2) Holes that can not be filled without resetting the device: i.e. if >>> certain features are not negotiated, then a queue X does not >>> exist, but subsequent queues retain their index. >> >> I think it is not about "negotiated", that might be the wrong >> terminology. >> >> E.g., in QEMU virtio_balloon_device_realize() we define the virtqueues >> (virtio_add_queue()) if virtio_has_feature(s->host_features). >> >> That is, it's independent of a feature negotiation (IIUC), it's static >> for the device -- "host_features" >> >> >> Is that really "negotiated" or is it "the device offers the feature X" >> ? > > It is offered. And this is precisely why I'm so keen on having a precise > wording here. Yes, me too. The current phrasing in the spec is not clear. Linux similarly checks virtio_has_feature()->virtio_check_driver_offered_feature(). > > Usually for compatibility one needs negotiated. Because the feature > negotiation is mostly about compatibility. I.e. the driver should be > able to say, hey I don't know about that feature, and get compatible > behavior. If for example VIRTIO_BALLOON_F_FREE_PAGE_HINT and > VIRTIO_BALLOON_F_PAGE_REPORTING are both offered but only > VIRTIO_BALLOON_F_PAGE_REPORTING is negotiated. That would make reporting_vq > jump to +1 compared to the case where VIRTIO_BALLOON_F_FREE_PAGE_HINT is > not offered. Which is IMHO no good, because for the features that the > driver is going to reject in most of the cases it should not matter if > it was offered or not. Yes. The key part is that we may only add new features to the tail of our feature list; maybe we should document that as well. I agree that a driver that implements VIRTIO_BALLOON_F_PAGE_REPORTING *must* be aware that VIRTIO_BALLOON_F_FREE_PAGE_HINT exists. So queue existence is not about feature negotiation but about features being offered from the device. ... which is a bit the same behavior as with fixed-assigned numbers a bit. VIRTIO_BALLOON_F_PAGE_REPORTING was documented as "4" because VIRTIO_BALLOON_F_FREE_PAGE_HINT was documented to be "3" -- IOW, it already existed in the spec. Not perfect, but AFAIKS, not horrible. (as Linux supports all these features, it's easy. A driver that only supports some features has to calculate the queue index manually based on the offered features) > > @MST: Please correct me if I'm wrong! > > Regards, > Halil >
On Fri, 4 Apr 2025 16:17:14 +0200 David Hildenbrand <david@redhat.com> wrote: > > It is offered. And this is precisely why I'm so keen on having a > > precise wording here. > > Yes, me too. The current phrasing in the spec is not clear. > > Linux similarly checks > virtio_has_feature()->virtio_check_driver_offered_feature(). Careful, that is a *driver* offered and not a *device* offered! We basically mandate that one can only check for a feature F with virtio_has_feature() such that it is either in drv->feature_table or in drv->feature_table_legacy. AFAICT *device_features* obtained via dev->config->get_features(dev) isn't even saved but is only used for binary and-ing it with the driver_features to obtain the negotiated features. That basically means that if I was, for the sake of fun do --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -1197,7 +1197,6 @@ static unsigned int features[] = { VIRTIO_BALLOON_F_MUST_TELL_HOST, VIRTIO_BALLOON_F_STATS_VQ, VIRTIO_BALLOON_F_DEFLATE_ON_OOM, - VIRTIO_BALLOON_F_FREE_PAGE_HINT, VIRTIO_BALLOON_F_PAGE_POISON, VIRTIO_BALLOON_F_REPORTING, }; I would end up with virtio_check_driver_offered_feature() calling BUG(). That basically means that Linux mandates implementing all previous features regardless whether does are supposed to be optional ones or not. Namely if you put the feature into drv->feature_table it will get negotiated. Which is not nice IMHO. > > > > > Usually for compatibility one needs negotiated. Because the feature > > negotiation is mostly about compatibility. I.e. the driver should be > > able to say, hey I don't know about that feature, and get compatible > > behavior. If for example VIRTIO_BALLOON_F_FREE_PAGE_HINT and > > VIRTIO_BALLOON_F_PAGE_REPORTING are both offered but only > > VIRTIO_BALLOON_F_PAGE_REPORTING is negotiated. That would make > > reporting_vq jump to +1 compared to the case where > > VIRTIO_BALLOON_F_FREE_PAGE_HINT is not offered. Which is IMHO no > > good, because for the features that the driver is going to reject in > > most of the cases it should not matter if it was offered or not. > > Yes. The key part is that we may only add new features to the tail of > our feature list; maybe we should document that as well. > > I agree that a driver that implements VIRTIO_BALLOON_F_PAGE_REPORTING > *must* be aware that VIRTIO_BALLOON_F_FREE_PAGE_HINT exists. So queue > existence is not about feature negotiation but about features being > offered from the device. > > ... which is a bit the same behavior as with fixed-assigned numbers a > bit. VIRTIO_BALLOON_F_PAGE_REPORTING was documented as "4" because > VIRTIO_BALLOON_F_FREE_PAGE_HINT was documented to be "3" -- IOW, it > already existed in the spec. I don't agree with the comparison. One obviously needs to avoid fatal collisions when extending the spec, and has to consider prior art. But ideally not implemented or fenced optional features A should have no impact to implemented optional or not optional features B -- unless the features are actually interdependent, but then the spec would prohibit the combo of having B but not A. And IMHO exactly this would have been the advantage of fixed-assigned numbers: you may not care if the other queueues exist or not. Also like cloud-hypervisor has decided that they are going only to support VIRTIO_BALLOON_F_REPORTING some weird OS could in theory decide that they only care about VIRTIO_BALLOON_F_REPORTING. In that setting having to look at VIRTIO_BALLOON_F_STATS_VQ and VIRTIO_BALLOON_F_FREE_PAGE_HINT are offered is weird. But that is all water under the bridge. We have to respect what is out there in the field. > > Not perfect, but AFAIKS, not horrible. It is like it is. QEMU does queue exist if the corresponding feature is offered by the device, and that is what we have to live with. Yes, I agree we should make the spec reflect reality! > > (as Linux supports all these features, it's easy. A driver that only > supports some features has to calculate the queue index manually based > on the offered features) As I've tried to explain above, not implementing/accepting optional features and then implementing/accepting a newer feature is problematic with the current code. Supporting some features would work only as supporting all features up to X. Regards, Halil
On 04.04.25 17:39, Halil Pasic wrote: > On Fri, 4 Apr 2025 16:17:14 +0200 > David Hildenbrand <david@redhat.com> wrote: > >>> It is offered. And this is precisely why I'm so keen on having a >>> precise wording here. >> >> Yes, me too. The current phrasing in the spec is not clear. >> >> Linux similarly checks >> virtio_has_feature()->virtio_check_driver_offered_feature(). > > Careful, that is a *driver* offered and not a *device* offered! Right, I was pointing at the usage of the term "offered". virtio_check_driver_offered_feature(). (but was also confused about that function) virtio_has_feature() is clearer: "helper to determine if this device has this feature." The way it's currently implemented is that it's essentially "device has this feature and we know about it (->feature_table)" > > We basically mandate that one can only check for a feature F with > virtio_has_feature() such that it is either in drv->feature_table or in > drv->feature_table_legacy. > > AFAICT *device_features* obtained via dev->config->get_features(dev) > isn't even saved but is only used for binary and-ing it with the > driver_features to obtain the negotiated features. > > That basically means that if I was, for the sake of fun do > > --- a/drivers/virtio/virtio_balloon.c > +++ b/drivers/virtio/virtio_balloon.c > @@ -1197,7 +1197,6 @@ static unsigned int features[] = { > VIRTIO_BALLOON_F_MUST_TELL_HOST, > VIRTIO_BALLOON_F_STATS_VQ, > VIRTIO_BALLOON_F_DEFLATE_ON_OOM, > - VIRTIO_BALLOON_F_FREE_PAGE_HINT, > VIRTIO_BALLOON_F_PAGE_POISON, > VIRTIO_BALLOON_F_REPORTING, > }; > > I would end up with virtio_check_driver_offered_feature() calling > BUG(). > Right. > That basically means that Linux mandates implementing all previous > features regardless whether does are supposed to be optional ones or > not. Namely if you put the feature into drv->feature_table it will > get negotiated. > > Which is not nice IMHO. I think the validate() callbacks allows for fixing that up. Like us unconditionally clearing VIRTIO_F_ACCESS_PLATFORM (I know, that's a transport feature and a bit different for this reason). ... not that I think the current way of achieving that is nice :) > >> >>> >>> Usually for compatibility one needs negotiated. Because the feature >>> negotiation is mostly about compatibility. I.e. the driver should be >>> able to say, hey I don't know about that feature, and get compatible >>> behavior. If for example VIRTIO_BALLOON_F_FREE_PAGE_HINT and >>> VIRTIO_BALLOON_F_PAGE_REPORTING are both offered but only >>> VIRTIO_BALLOON_F_PAGE_REPORTING is negotiated. That would make >>> reporting_vq jump to +1 compared to the case where >>> VIRTIO_BALLOON_F_FREE_PAGE_HINT is not offered. Which is IMHO no >>> good, because for the features that the driver is going to reject in >>> most of the cases it should not matter if it was offered or not. >> >> Yes. The key part is that we may only add new features to the tail of >> our feature list; maybe we should document that as well. >> >> I agree that a driver that implements VIRTIO_BALLOON_F_PAGE_REPORTING >> *must* be aware that VIRTIO_BALLOON_F_FREE_PAGE_HINT exists. So queue >> existence is not about feature negotiation but about features being >> offered from the device. >> >> ... which is a bit the same behavior as with fixed-assigned numbers a >> bit. VIRTIO_BALLOON_F_PAGE_REPORTING was documented as "4" because >> VIRTIO_BALLOON_F_FREE_PAGE_HINT was documented to be "3" -- IOW, it >> already existed in the spec. > > I don't agree with the comparison. One obviously needs to avoid fatal > collisions when extending the spec, and has to consider prior art. Agreed. But IMHO it's similar to two out-of-spec driver starting to use "queue index 5" in a fix-assigned world. It cannot work. > > But ideally not implemented or fenced optional features A should have no > impact to implemented optional or not optional features B -- unless the > features are actually interdependent, but then the spec would prohibit > the combo of having B but not A. And IMHO exactly this would have been > the advantage of fixed-assigned numbers: you may not care if the other > queueues exist or not. > > Also like cloud-hypervisor has decided that they are going only to > support VIRTIO_BALLOON_F_REPORTING some weird OS could in theory > decide that they only care about VIRTIO_BALLOON_F_REPORTING. In that > setting having to look at VIRTIO_BALLOON_F_STATS_VQ and > VIRTIO_BALLOON_F_FREE_PAGE_HINT are offered is weird. But that is all water > under the bridge. We have to respect what is out there in the field. Yes, they would have to do the math based on offered features. Definitely not nice, but as you say, that ship has sailed. [...] >> >> (as Linux supports all these features, it's easy. A driver that only >> supports some features has to calculate the queue index manually based >> on the offered features) > > As I've tried to explain above, not implementing/accepting optional > features and then implementing/accepting a newer feature is problematic > with the current code. Supporting some features would work only as > supporting all features up to X. See above regarding validate(). Again, doesn't win a beauty contest ... I'll send an improved virtio-spec update next week, thanks!
On 04.04.25 18:49, David Hildenbrand wrote: > On 04.04.25 17:39, Halil Pasic wrote: >> On Fri, 4 Apr 2025 16:17:14 +0200 >> David Hildenbrand <david@redhat.com> wrote: >> >>>> It is offered. And this is precisely why I'm so keen on having a >>>> precise wording here. >>> >>> Yes, me too. The current phrasing in the spec is not clear. >>> >>> Linux similarly checks >>> virtio_has_feature()->virtio_check_driver_offered_feature(). >> >> Careful, that is a *driver* offered and not a *device* offered! > > Right, I was pointing at the usage of the term "offered". > virtio_check_driver_offered_feature(). (but was also confused about that > function) > > virtio_has_feature() is clearer: "helper to determine if this device has > this feature." > > The way it's currently implemented is that it's essentially "device has > this feature and we know about it (->feature_table)" > >> >> We basically mandate that one can only check for a feature F with >> virtio_has_feature() such that it is either in drv->feature_table or in >> drv->feature_table_legacy. >> >> AFAICT *device_features* obtained via dev->config->get_features(dev) >> isn't even saved but is only used for binary and-ing it with the >> driver_features to obtain the negotiated features. >> >> That basically means that if I was, for the sake of fun do >> >> --- a/drivers/virtio/virtio_balloon.c >> +++ b/drivers/virtio/virtio_balloon.c >> @@ -1197,7 +1197,6 @@ static unsigned int features[] = { >> VIRTIO_BALLOON_F_MUST_TELL_HOST, >> VIRTIO_BALLOON_F_STATS_VQ, >> VIRTIO_BALLOON_F_DEFLATE_ON_OOM, >> - VIRTIO_BALLOON_F_FREE_PAGE_HINT, >> VIRTIO_BALLOON_F_PAGE_POISON, >> VIRTIO_BALLOON_F_REPORTING, >> }; >> >> I would end up with virtio_check_driver_offered_feature() calling >> BUG(). >> > > Right. > >> That basically means that Linux mandates implementing all previous >> features regardless whether does are supposed to be optional ones or >> not. Namely if you put the feature into drv->feature_table it will >> get negotiated. >> >> Which is not nice IMHO. > > I think the validate() callbacks allows for fixing that up. > > Like us unconditionally clearing VIRTIO_F_ACCESS_PLATFORM (I know, > that's a transport feature and a bit different for this reason). > > ... not that I think the current way of achieving that is nice :) Thinking again, that won't work, because it would also make virtio_has_feature() say that the device does not have that feature. So yeah, virtio_has_feature() is confusing and the documentation does not quite match. Would need a change/cleanup to handle such features that we don't implement but still want to check if they are offered.
On Fri, Apr 04, 2025 at 12:55:09PM +0200, David Hildenbrand wrote: > On 04.04.25 12:00, David Hildenbrand wrote: > > On 04.04.25 06:36, Halil Pasic wrote: > > > On Thu, 3 Apr 2025 16:28:31 +0200 > > > David Hildenbrand <david@redhat.com> wrote: > > > > > > > > Sorry I have to have a look at that discussion. Maybe it will answer > > > > > some my questions. > > > > > > > > Yes, I think so. > > > > > > > > > > Let's fix it without affecting existing setups for now by properly > > > > > > ignoring the non-existing queues, so the indicator bits will match > > > > > > the queue indexes. > > > > > > > > > > Just one question. My understanding is that the crux is that Linux > > > > > and QEMU (or the driver and the device) disagree at which index > > > > > reporting_vq is actually sitting. Is that right? > > > > > > > > I thought I made it clear: this is only about the airq indicator bit. > > > > That's where both disagree. > > > > > > > > Not the actual queue index (see above). > > > > > > I did some more research including having a look at that discussion. Let > > > me try to sum up how did we end up here. > > > > Let me add some more details after digging as well: > > > > > > > > Before commit a229989d975e ("virtio: don't allocate vqs when names[i] = > > > NULL") the kernel behavior used to be in spec, but QEMU and possibly > > > other hypervisor were out of spec and things did not work. > > > > It all started with VIRTIO_BALLOON_F_FREE_PAGE_HINT. Before that, > > we only had the single optional VIRTIO_BALLOON_F_STATS_VQ queue at the very > > end. So there was no possibility for holes "in-between". > > > > In the Linux driver, we created the stats queue only if the feature bit > > VIRTIO_BALLOON_F_STATS_VQ was actually around: > > > > nvqs = virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_STATS_VQ) ? 3 : 2; > > err = virtio_find_vqs(vb->vdev, nvqs, vqs, callbacks, names, NULL); > > > > That changed with VIRTIO_BALLOON_F_FREE_PAGE_HINT, because we would > > unconditionally create 4 queues. QEMU always supported the first 3 queues > > unconditionally, but old QEMU did obviously not support the (new) > > VIRTIO_BALLOON_F_FREE_PAGE_HINT queue. > > > > 390x didn't particularly like getting queried for non-existing > > queues. [1] So the fix was not for a hypervisor that was out of spec, but > > because quering non-existing queues didn't work. > > > > The fix implied that if VIRTIO_BALLOON_F_STATS_VQ is missing, suddenly the queue > > index of VIRTIO_BALLOON_F_FREE_PAGE_HINT changed as well. > > > > Again, as QEMU always implemented the 3 first queues unconditionally, this was > > not a problem. > > > > [1] https://lore.kernel.org/all/c6746307-fae5-7652-af8d-19f560fc31d9@de.ibm.com/#t > > > > > > > > Possibly because of the complexity of fixing the hypervisor(s) commit > > > a229989d975e ("virtio: don't allocate vqs when names[i] = NULL") opted > > > for changing the guest side so that it does not fit the spec but fits > > > the hypervisor(s). It unfortunately also broke notifiers (for the with > > > holes) scenario for virtio-ccw only. > > > > Yes, it broke the notifiers. > > > > But note that everything was in spec at that point, because we only documented > > "free_page_vq == 3" in the spec *2 years later*, in 2020: > > > > commit 38448268eba0c105200d131c3f7f660129a4d673 > > Author: Alexander Duyck <alexander.h.duyck@linux.intel.com> > > Date: Tue Aug 25 07:45:02 2020 -0700 > > > > content: Document balloon feature free page hints > > Free page hints allow the balloon driver to provide information on what > > pages are not currently in use so that we can avoid the cost of copying > > them in migration scenarios. Add a feature description for free page hints > > describing basic functioning and requirements. > > At that point, what we documented in the spec *did not match reality* in > > Linux. QEMU was fully compatible, because VIRTIO_BALLOON_F_STATS_VQ is > > unconditionally set. > > > > > > QEMU and Linux kept using that queue index assignment model, and the spec > > was wrong (out of sync?) at that point. The spec got more wrong with > > > > commit d917d4a8d552c003e046b0e3b1b529d98f7e695b > > Author: Alexander Duyck <alexander.h.duyck@linux.intel.com> > > Date: Tue Aug 25 07:45:17 2020 -0700 > > > > content: Document balloon feature free page reporting > > Free page reporting is a feature that allows the guest to proactively > > report unused pages to the host. By making use of this feature is is > > possible to reduce the overall memory footprint of the guest in cases where > > some significant portion of the memory is idle. Add documentation for the > > free page reporting feature describing the functionality and requirements. > > > > Where we documented VIRTIO_BALLOON_F_REPORTING after the changes were added to > > QEMU+Linux implementation, so the spec did not reflect reality. > > > > I'll note also cloud-hypervisor [2] today follows that model. > > > > In particular, it *only* supports VIRTIO_BALLOON_F_REPORTING, turning > > the queue index of VIRTIO_BALLOON_F_REPORTING into *2* instead of documented > > in the spec to be *4*. > > > > So in reality, we can see VIRTIO_BALLOON_F_REPORTING to be either 2/3/4, depending > > on the availability of the other two features/queues. > > > > [2] https://github.com/cloud-hypervisor/cloud-hypervisor/blob/main/virtio-devices/src/balloon.rs > > > > > > > > > > Now we had another look at this, and have concluded that fixing the > > > hypervisor(s) and fixing the kernel, and making sure that the fixed > > > kernel can tolerate the old broken hypervisor(s) is way to complicated > > > if possible at all. So we decided to give the spec a reality check and > > > fix the notifier bit assignment for virtio-ccw which is broken beyond > > > doubt if we accept that the correct virtqueue index is the one that the > > > hypervisor(s) use and not the one that the spec says they should use. > > > > In case of virtio-balloon, it's unfortunate that it went that way, but the > > spec simply did not / does not reflect reality when it was added to the spec. > > > > > > > > With the spec fixed, the whole notion of "holes" will be something that > > > does not make sense any more. With that the merit of the kernel interface > > > virtio_find_vqs() supporting "holes" is quite questionable. Now we need > > > it because the drivers within the Linux kernel still think of the queues > > > in terms of the current spec, i.e. they try to have the "holes" as > > > mandated by the spec, and the duty of making it work with the broken > > > device implementations falls to the transports. > > > > > > > Right, the "holes" only exist in the input array. > > > > > Under the assumption that the spec is indeed going to be fixed: > > For virito-balloon, we should probably do the following: > > From 38e340c2bb53c2a7cc7c675f5dfdd44ecf7701d9 Mon Sep 17 00:00:00 2001 > From: David Hildenbrand <david@redhat.com> > Date: Fri, 4 Apr 2025 12:53:16 +0200 > Subject: [PATCH] virtio-balloon: Fix queue index assignment for > non-existing queues > > Signed-off-by: David Hildenbrand <david@redhat.com> > --- > device-types/balloon/description.tex | 22 ++++++++++++++++------ > 1 file changed, 16 insertions(+), 6 deletions(-) > > diff --git a/device-types/balloon/description.tex b/device-types/balloon/description.tex > index a1d9603..a7396ff 100644 > --- a/device-types/balloon/description.tex > +++ b/device-types/balloon/description.tex > @@ -16,6 +16,21 @@ \subsection{Device ID}\label{sec:Device Types / Memory Balloon Device / Device I > 5 > \subsection{Virtqueues}\label{sec:Device Types / Memory Balloon Device / Virtqueues} > + > +\begin{description} > +\item[inflateq] Exists unconditionally. > +\item[deflateq] Exists unconditionally. > +\item[statsq] Only exists if VIRTIO_BALLOON_F_STATS_VQ is set. > +\item[free_page_vq] Only exists if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set. > +\item[reporting_vq] Only exists if VIRTIO_BALLOON_F_PAGE_REPORTING is set. > +\end{description} > + > +\begin{note} > +Virtqueue indexes are assigned sequentially for existing queues, starting > +with index 0; consequently, if a virtqueue does not exist, it does not get > +an index assigned. Assuming all virtqueues exist for a device, the indexes > +are: > + > \begin{description} > \item[0] inflateq > \item[1] deflateq > @@ -23,12 +38,7 @@ \subsection{Virtqueues}\label{sec:Device Types / Memory Balloon Device / Virtque > \item[3] free_page_vq > \item[4] reporting_vq > \end{description} > - > - statsq only exists if VIRTIO_BALLOON_F_STATS_VQ is set. > - > - free_page_vq only exists if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set. > - > - reporting_vq only exists if VIRTIO_BALLOON_F_PAGE_REPORTING is set. > +\end{note} > \subsection{Feature bits}\label{sec:Device Types / Memory Balloon Device / Feature bits} > \begin{description} > -- > 2.48.1 > > > If something along these lines sounds reasonable, I can send a proper patch to the > proper audience. Indeed, but do we want to add a note about previous spec versions saying something different? Maybe, with a hint how devices following old spec can be detected? > -- > Cheers, > > David / dhildenb
On Fri, Apr 04, 2025 at 03:48:49PM +0200, David Hildenbrand wrote: > On 04.04.25 15:36, Halil Pasic wrote: > > On Fri, 4 Apr 2025 12:55:09 +0200 > > David Hildenbrand <david@redhat.com> wrote: > > > > > For virito-balloon, we should probably do the following: > > > > > > From 38e340c2bb53c2a7cc7c675f5dfdd44ecf7701d9 Mon Sep 17 00:00:00 2001 > > > From: David Hildenbrand <david@redhat.com> > > > Date: Fri, 4 Apr 2025 12:53:16 +0200 > > > Subject: [PATCH] virtio-balloon: Fix queue index assignment for > > > non-existing queues > > > > > > Signed-off-by: David Hildenbrand <david@redhat.com> > > > --- > > > device-types/balloon/description.tex | 22 ++++++++++++++++------ > > > 1 file changed, 16 insertions(+), 6 deletions(-) > > > > > > diff --git a/device-types/balloon/description.tex b/device-types/balloon/description.tex > > > index a1d9603..a7396ff 100644 > > > --- a/device-types/balloon/description.tex > > > +++ b/device-types/balloon/description.tex > > > @@ -16,6 +16,21 @@ \subsection{Device ID}\label{sec:Device Types / Memory Balloon Device / Device I > > > 5 > > > \subsection{Virtqueues}\label{sec:Device Types / Memory Balloon Device / Virtqueues} > > > + > > > +\begin{description} > > > +\item[inflateq] Exists unconditionally. > > > +\item[deflateq] Exists unconditionally. > > > +\item[statsq] Only exists if VIRTIO_BALLOON_F_STATS_VQ is set. > > > +\item[free_page_vq] Only exists if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set. > > > +\item[reporting_vq] Only exists if VIRTIO_BALLOON_F_PAGE_REPORTING is set. > > > > s/is set/is negotiated/? > > > > I think we should stick to "feature is offered" and "feature is > > negotiated". > > > > > +\end{description} > > > + > > > +\begin{note} > > > +Virtqueue indexes are assigned sequentially for existing queues, starting > > > +with index 0; consequently, if a virtqueue does not exist, it does not get > > > +an index assigned. Assuming all virtqueues exist for a device, the indexes > > > +are: > > > + > > > \begin{description} > > > \item[0] inflateq > > > \item[1] deflateq > > > @@ -23,12 +38,7 @@ \subsection{Virtqueues}\label{sec:Device Types / Memory Balloon Device / Virtque > > > \item[3] free_page_vq > > > \item[4] reporting_vq > > > \end{description} > > > - > > > - statsq only exists if VIRTIO_BALLOON_F_STATS_VQ is set. > > > - > > > - free_page_vq only exists if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set. > > > - > > > - reporting_vq only exists if VIRTIO_BALLOON_F_PAGE_REPORTING is set. > > > +\end{note} > > > \subsection{Feature bits}\label{sec:Device Types / Memory Balloon Device / Feature bits} > > > \begin{description} > > > > Sounds good to me! But I'm still a little confused by the "holes". What > > confuses me is that i can think of at least 2 distinct types of "holes": > > 1) Holes that can be filled later. The queue conceptually exists, but > > there is no need to back it with any resources for now because it is > > dormant (it can be seen a hole in comparison to queues that need to > > materialize -- vring, notifiers, ...) > > 2) Holes that can not be filled without resetting the device: i.e. if > > certain features are not negotiated, then a queue X does not exist, > > but subsequent queues retain their index. > > I think it is not about "negotiated", that might be the wrong terminology. > > E.g., in QEMU virtio_balloon_device_realize() we define the virtqueues > (virtio_add_queue()) if virtio_has_feature(s->host_features). > > That is, it's independent of a feature negotiation (IIUC), it's static for > the device -- "host_features" No no that is a bad idea. Breaks forward compatibility. Oh my. I did not realize. It is really broken hopelessly. Because, note, the guest looks at the guest features :) Now I am beginning to think we should leave the spec alone and fix the drivers ... Ugh .... > > Is that really "negotiated" or is it "the device offers the feature X" ? > > -- > Cheers, > > David / dhildenb
On 06.04.25 20:42, Michael S. Tsirkin wrote: > On Fri, Apr 04, 2025 at 03:48:49PM +0200, David Hildenbrand wrote: >> On 04.04.25 15:36, Halil Pasic wrote: >>> On Fri, 4 Apr 2025 12:55:09 +0200 >>> David Hildenbrand <david@redhat.com> wrote: >>> >>>> For virito-balloon, we should probably do the following: >>>> >>>> From 38e340c2bb53c2a7cc7c675f5dfdd44ecf7701d9 Mon Sep 17 00:00:00 2001 >>>> From: David Hildenbrand <david@redhat.com> >>>> Date: Fri, 4 Apr 2025 12:53:16 +0200 >>>> Subject: [PATCH] virtio-balloon: Fix queue index assignment for >>>> non-existing queues >>>> >>>> Signed-off-by: David Hildenbrand <david@redhat.com> >>>> --- >>>> device-types/balloon/description.tex | 22 ++++++++++++++++------ >>>> 1 file changed, 16 insertions(+), 6 deletions(-) >>>> >>>> diff --git a/device-types/balloon/description.tex b/device-types/balloon/description.tex >>>> index a1d9603..a7396ff 100644 >>>> --- a/device-types/balloon/description.tex >>>> +++ b/device-types/balloon/description.tex >>>> @@ -16,6 +16,21 @@ \subsection{Device ID}\label{sec:Device Types / Memory Balloon Device / Device I >>>> 5 >>>> \subsection{Virtqueues}\label{sec:Device Types / Memory Balloon Device / Virtqueues} >>>> + >>>> +\begin{description} >>>> +\item[inflateq] Exists unconditionally. >>>> +\item[deflateq] Exists unconditionally. >>>> +\item[statsq] Only exists if VIRTIO_BALLOON_F_STATS_VQ is set. >>>> +\item[free_page_vq] Only exists if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set. >>>> +\item[reporting_vq] Only exists if VIRTIO_BALLOON_F_PAGE_REPORTING is set. >>> >>> s/is set/is negotiated/? >>> >>> I think we should stick to "feature is offered" and "feature is >>> negotiated". >>> >>>> +\end{description} >>>> + >>>> +\begin{note} >>>> +Virtqueue indexes are assigned sequentially for existing queues, starting >>>> +with index 0; consequently, if a virtqueue does not exist, it does not get >>>> +an index assigned. Assuming all virtqueues exist for a device, the indexes >>>> +are: >>>> + >>>> \begin{description} >>>> \item[0] inflateq >>>> \item[1] deflateq >>>> @@ -23,12 +38,7 @@ \subsection{Virtqueues}\label{sec:Device Types / Memory Balloon Device / Virtque >>>> \item[3] free_page_vq >>>> \item[4] reporting_vq >>>> \end{description} >>>> - >>>> - statsq only exists if VIRTIO_BALLOON_F_STATS_VQ is set. >>>> - >>>> - free_page_vq only exists if VIRTIO_BALLOON_F_FREE_PAGE_HINT is set. >>>> - >>>> - reporting_vq only exists if VIRTIO_BALLOON_F_PAGE_REPORTING is set. >>>> +\end{note} >>>> \subsection{Feature bits}\label{sec:Device Types / Memory Balloon Device / Feature bits} >>>> \begin{description} >>> >>> Sounds good to me! But I'm still a little confused by the "holes". What >>> confuses me is that i can think of at least 2 distinct types of "holes": >>> 1) Holes that can be filled later. The queue conceptually exists, but >>> there is no need to back it with any resources for now because it is >>> dormant (it can be seen a hole in comparison to queues that need to >>> materialize -- vring, notifiers, ...) >>> 2) Holes that can not be filled without resetting the device: i.e. if >>> certain features are not negotiated, then a queue X does not exist, >>> but subsequent queues retain their index. >> >> I think it is not about "negotiated", that might be the wrong terminology. >> >> E.g., in QEMU virtio_balloon_device_realize() we define the virtqueues >> (virtio_add_queue()) if virtio_has_feature(s->host_features). >> >> That is, it's independent of a feature negotiation (IIUC), it's static for >> the device -- "host_features" > > > No no that is a bad idea. Breaks forward compatibility. > > Oh my. I did not realize. It is really broken hopelessly. > > Because, note, the guest looks at the guest features :) Can you elaborate why? statsq = 2 free_page_vq = statsq + host_offered_feat(VIRTIO_BALLOON_F_STATS_VQ) reporting_vq = free_page_vq + host_offered_feat(VIRTIO_BALLOON_F_FREE_PAGE_HINT) Independent of any upcoming features. And if a new feature defines a new virtqueue new_vq = reporting_vq + host_offered_feat(VIRTIO_BALLOON_F_PAGE_REPORTING) We only have to make sure in the spec that these calculations always hold. Querying of the host offered features already happens as part of determining the actual guest usable feature (driver_offered & host_offered). > Now I am beginning to think we should leave the spec alone > and fix the drivers ... Ugh .... We could always say that starting with feature X, queue indexes are fixed again. E.g., VIRTIO_BALLOON_F_X would have it's virtqueue fixed at index 5, independent of the other (older) features where the virtqueue indexes are determined like today. Won't make the implementation easier, though, I'm afraid. (I also thought about a way to query the virtqueue index for a feature, but that's probably overengineering)
On Fri, Apr 04, 2025 at 05:39:10PM +0200, Halil Pasic wrote: > > > > Not perfect, but AFAIKS, not horrible. > > It is like it is. QEMU does queue exist if the corresponding feature > is offered by the device, and that is what we have to live with. I don't think we can live with this properly though. It means a guest that does not know about some features does not know where to find things. So now, I am inclined to add linux code to work with current qemu and with spec compliant one, and add qemu code to work with current linux and spec compliant one. Document the bug in the spec, maybe, in a non conformance section.
On 07.04.25 09:52, Michael S. Tsirkin wrote: > On Fri, Apr 04, 2025 at 05:39:10PM +0200, Halil Pasic wrote: >>> >>> Not perfect, but AFAIKS, not horrible. >> >> It is like it is. QEMU does queue exist if the corresponding feature >> is offered by the device, and that is what we have to live with. > > I don't think we can live with this properly though. > It means a guest that does not know about some features > does not know where to find things. Please describe a real scenario, I'm missing the point. Whoever adds new feat_X *must be aware* about all previous features, otherwise we'd be reusing feature bits and everything falls to pieces. > > So now, I am inclined to add linux code to work with current qemu and > with spec compliant one, and add qemu code to work with current linux > and spec compliant one. > > Document the bug in the spec, maybe, in a non conformance section. I'm afraid this results in a lot of churn without really making things better. IMHO, documenting things how they actually behave, and maybe moving towards fixed queue indexes for new features is the low hanging fruit. As raised, it's not just qemu+linux, it's *at least* also cloud-hypervisor.
On Mon, Apr 07, 2025 at 10:17:10AM +0200, David Hildenbrand wrote: > On 07.04.25 09:52, Michael S. Tsirkin wrote: > > On Fri, Apr 04, 2025 at 05:39:10PM +0200, Halil Pasic wrote: > > > > > > > > Not perfect, but AFAIKS, not horrible. > > > > > > It is like it is. QEMU does queue exist if the corresponding feature > > > is offered by the device, and that is what we have to live with. > > > > I don't think we can live with this properly though. > > It means a guest that does not know about some features > > does not know where to find things. > > Please describe a real scenario, I'm missing the point. OK so. Device has VIRTIO_BALLOON_F_FREE_PAGE_HINT and VIRTIO_BALLOON_F_REPORTING Driver only knows about VIRTIO_BALLOON_F_REPORTING so it does not know what does VIRTIO_BALLOON_F_FREE_PAGE_HINT do. How does it know which vq to use for reporting? It will try to use the free page hint one. > Whoever adds new feat_X *must be aware* about all previous features, > otherwise we'd be reusing feature bits and everything falls to pieces. The knowledge is supposed be limited to which feature bit to use. > > > > So now, I am inclined to add linux code to work with current qemu and > > with spec compliant one, and add qemu code to work with current linux > > and spec compliant one. > > > > Document the bug in the spec, maybe, in a non conformance section. > > I'm afraid this results in a lot of churn without really making things > better. > IMHO, documenting things how they actually behave, and maybe moving towards > fixed queue indexes for new features is the low hanging fruit. I worry about how to we ensure that? If old code is messed up people will just keep propagating that. I would like to fix old code so that new code is correct. > > As raised, it's not just qemu+linux, it's *at least* also cloud-hypervisor. > > -- > Cheers, > > David / dhildenb There's a slippery slope here in that people will come to us with buggy devices and ask to change the spec.
On 07.04.25 10:17, David Hildenbrand wrote: > On 07.04.25 09:52, Michael S. Tsirkin wrote: >> On Fri, Apr 04, 2025 at 05:39:10PM +0200, Halil Pasic wrote: >>>> >>>> Not perfect, but AFAIKS, not horrible. >>> >>> It is like it is. QEMU does queue exist if the corresponding feature >>> is offered by the device, and that is what we have to live with. >> >> I don't think we can live with this properly though. >> It means a guest that does not know about some features >> does not know where to find things. > > Please describe a real scenario, I'm missing the point. > > Whoever adds new feat_X *must be aware* about all previous features, > otherwise we'd be reusing feature bits and everything falls to pieces. > >> >> So now, I am inclined to add linux code to work with current qemu and >> with spec compliant one, and add qemu code to work with current linux >> and spec compliant one. >> >> Document the bug in the spec, maybe, in a non conformance section. > > I'm afraid this results in a lot of churn without really making things > better. > > IMHO, documenting things how they actually behave, and maybe moving > towards fixed queue indexes for new features is the low hanging fruit. > > As raised, it's not just qemu+linux, it's *at least* also cloud-hypervisor. I'm digging for other virtio-balloon implementations. virtio-win: https://github.com/virtio-win/kvm-guest-drivers-windows/blob/master/Balloon/sys/balloon.c -> Does not support hinting/reporting -> no problem libkrun: https://github.com/containers/libkrun/blob/main/src/devices/src/virtio/balloon/device.rs -> Hard-codes queue indexes but always seems to offer all features -> Offers VIRTIO_BALLOON_F_FREE_PAGE_HINT and VIRTIO_BALLOON_F_STATS_VQ even though it doesn't seem to implement them (device-triggered, so nothing to do probably?) -> Actually seems to implements VIRTIO_BALLOON_F_REPORTING crossvm: https://github.com/google/crosvm/blob/main/devices/src/virtio/balloon.rs -> Hard-codes queue numbers; does *not* offer/implement VIRTIO_BALLOON_F_STATS_VQ but does offer VIRTIO_BALLOON_F_STATS_VQ and VIRTIO_BALLOON_F_DEFLATE_ON_OOM. -> Implements something that is not in the virtio-spec const VIRTIO_BALLOON_F_WS_REPORTING: u32 = 8; // Working Set Reporting virtqueues and const WS_DATA_VQ: usize = 5; const WS_OP_VQ: usize = 6; IIUC, Linux inside cross-vm might actually be problematic? They would disagree on the virtqueue for free-page-reporting Maybe I am missing something, it's a mess ...
On Fri, Apr 04, 2025 at 05:39:10PM +0200, Halil Pasic wrote: > That basically means that if I was, for the sake of fun do > > --- a/drivers/virtio/virtio_balloon.c > +++ b/drivers/virtio/virtio_balloon.c > @@ -1197,7 +1197,6 @@ static unsigned int features[] = { > VIRTIO_BALLOON_F_MUST_TELL_HOST, > VIRTIO_BALLOON_F_STATS_VQ, > VIRTIO_BALLOON_F_DEFLATE_ON_OOM, > - VIRTIO_BALLOON_F_FREE_PAGE_HINT, > VIRTIO_BALLOON_F_PAGE_POISON, > VIRTIO_BALLOON_F_REPORTING, > }; > > I would end up with virtio_check_driver_offered_feature() calling > BUG(). I mean, yes, this is exactly to catch drivers that use features without negotiating them first.
On 07.04.25 10:34, Michael S. Tsirkin wrote: > On Mon, Apr 07, 2025 at 10:17:10AM +0200, David Hildenbrand wrote: >> On 07.04.25 09:52, Michael S. Tsirkin wrote: >>> On Fri, Apr 04, 2025 at 05:39:10PM +0200, Halil Pasic wrote: >>>>> >>>>> Not perfect, but AFAIKS, not horrible. >>>> >>>> It is like it is. QEMU does queue exist if the corresponding feature >>>> is offered by the device, and that is what we have to live with. >>> >>> I don't think we can live with this properly though. >>> It means a guest that does not know about some features >>> does not know where to find things. >> >> Please describe a real scenario, I'm missing the point. > > > OK so. > > Device has VIRTIO_BALLOON_F_FREE_PAGE_HINT and VIRTIO_BALLOON_F_REPORTING > Driver only knows about VIRTIO_BALLOON_F_REPORTING so > it does not know what does VIRTIO_BALLOON_F_FREE_PAGE_HINT do. > How does it know which vq to use for reporting? > It will try to use the free page hint one. "only knows" -- VIRTIO_BALLOON_F_FREE_PAGE_HINT was proposed + specified in the spec. So I think this is not a very good example. > > > >> Whoever adds new feat_X *must be aware* about all previous features, >> otherwise we'd be reusing feature bits and everything falls to pieces. > > > The knowledge is supposed be limited to which feature bit to use. I think we also have to know which virtqueue bits can be used, right? I mean, I agree that it's all nasty ... >>> >>> So now, I am inclined to add linux code to work with current qemu and >>> with spec compliant one, and add qemu code to work with current linux >>> and spec compliant one. >>> >>> Document the bug in the spec, maybe, in a non conformance section. >> >> I'm afraid this results in a lot of churn without really making things >> better. > >> IMHO, documenting things how they actually behave, and maybe moving towards >> fixed queue indexes for new features is the low hanging fruit. > > I worry about how to we ensure that? > If old code is messed up people will just keep propagating that. > I would like to fix old code so that new code is correct. > >> >> As raised, it's not just qemu+linux, it's *at least* also cloud-hypervisor. >> >> -- >> Cheers, >> >> David / dhildenb > > There's a slippery slope here in that people will come to us > with buggy devices and ask to change the spec. I yet have to fully digest cross-vm: https://github.com/google/crosvm/blob/main/devices/src/virtio/balloon.rs and how it would free-page-reporting work with upstream Linux ... :(
Wow great job digging through all these hypervisors! On Mon, Apr 07, 2025 at 10:38:59AM +0200, David Hildenbrand wrote: > crossvm: > https://github.com/google/crosvm/blob/main/devices/src/virtio/balloon.rs > > -> Hard-codes queue numbers; does *not* offer/implement > VIRTIO_BALLOON_F_STATS_VQ but does offer VIRTIO_BALLOON_F_STATS_VQ > and VIRTIO_BALLOON_F_DEFLATE_ON_OOM. > > -> Implements something that is not in the virtio-spec > > const VIRTIO_BALLOON_F_WS_REPORTING: u32 = 8; // Working Set Reporting > virtqueues > > and > > const WS_DATA_VQ: usize = 5; > const WS_OP_VQ: usize = 6; > > > IIUC, Linux inside cross-vm might actually be problematic? They would > disagree on the virtqueue for free-page-reporting That's why things must be tied to negotiated features, not to offered ones. > > Maybe I am missing something, it's a mess ... Indeed. Let's fix the implementations while keeping work arounds for existing ones.
On Mon, Apr 07, 2025 at 10:44:21AM +0200, David Hildenbrand wrote: > > > > > > > > > Whoever adds new feat_X *must be aware* about all previous features, > > > otherwise we'd be reusing feature bits and everything falls to pieces. > > > > > > The knowledge is supposed be limited to which feature bit to use. > > I think we also have to know which virtqueue bits can be used, right? > what are virtqueue bits? vq number?
On 07.04.25 10:44, Michael S. Tsirkin wrote: > Wow great job digging through all these hypervisors! > > On Mon, Apr 07, 2025 at 10:38:59AM +0200, David Hildenbrand wrote: >> crossvm: >> https://github.com/google/crosvm/blob/main/devices/src/virtio/balloon.rs >> >> -> Hard-codes queue numbers; does *not* offer/implement >> VIRTIO_BALLOON_F_STATS_VQ but does offer VIRTIO_BALLOON_F_STATS_VQ >> and VIRTIO_BALLOON_F_DEFLATE_ON_OOM. >> >> -> Implements something that is not in the virtio-spec >> >> const VIRTIO_BALLOON_F_WS_REPORTING: u32 = 8; // Working Set Reporting >> virtqueues >> >> and >> >> const WS_DATA_VQ: usize = 5; >> const WS_OP_VQ: usize = 6; >> >> >> IIUC, Linux inside cross-vm might actually be problematic? They would >> disagree on the virtqueue for free-page-reporting > > > That's why things must be tied to negotiated features, not to offered > ones. cross-vm also has this weird comment: " const VIRTIO_BALLOON_F_PAGE_REPORTING: u32 = 5; // Page reporting virtqueue // TODO(b/273973298): this should maybe be bit 6? to be changed later " Not sure why that should be bit 6, the spec says 5 ... So maybe whatever they run inside the VM is also out of spec ... Really hard to tell.
On 07.04.25 10:49, Michael S. Tsirkin wrote: > On Mon, Apr 07, 2025 at 10:44:21AM +0200, David Hildenbrand wrote: >>> >>> >>> >>>> Whoever adds new feat_X *must be aware* about all previous features, >>>> otherwise we'd be reusing feature bits and everything falls to pieces. >>> >>> >>> The knowledge is supposed be limited to which feature bit to use. >> >> I think we also have to know which virtqueue bits can be used, right? >> > > what are virtqueue bits? vq number? Yes, sorry. Assume cross-vm as an example. It would make use of virtqueue indexes 5+6 with their VIRTIO_BALLOON_F_WS_REPORTING. So whatever feature another device implements couldn't use this feature bit or these virtqueue indexes. (as long the other device never intends to implement VIRTIO_BALLOON_F_WS_REPORTING, the virtqueue indexes could be reused. But the spec will also be a mess, because virtqueue indexes could also have duplicate meanings ... ugh)
On Mon, Apr 07, 2025 at 09:18:21AM +0200, David Hildenbrand wrote: > > Now I am beginning to think we should leave the spec alone > > and fix the drivers ... Ugh .... > > We could always say that starting with feature X, queue indexes are fixed > again. E.g., VIRTIO_BALLOON_F_X would have it's virtqueue fixed at index 5, > independent of the other (older) features where the virtqueue indexes are > determined like today. > > Won't make the implementation easier, though, I'm afraid. > > (I also thought about a way to query the virtqueue index for a feature, but > that's probably overengineering) The best contract we have is the spec. Sometimes it is hopelessly broken and we have to fix it, but not in this case. Let's do a theoretical excercise, assuming we want to fix the drivers, but we also want to have workarounds in place in qemu and in drivers to support existing ones. How would we go about it? Maybe we want a feature bit BALLOON_FIXED and ask everyone to negotiate it? But if we go this way, we really need to fix the 48 bit limitation too.
On Mon, Apr 07, 2025 at 10:54:00AM +0200, David Hildenbrand wrote: > On 07.04.25 10:49, Michael S. Tsirkin wrote: > > On Mon, Apr 07, 2025 at 10:44:21AM +0200, David Hildenbrand wrote: > > > > > > > > > > > > > > > > > Whoever adds new feat_X *must be aware* about all previous features, > > > > > otherwise we'd be reusing feature bits and everything falls to pieces. > > > > > > > > > > > > The knowledge is supposed be limited to which feature bit to use. > > > > > > I think we also have to know which virtqueue bits can be used, right? > > > > > > > what are virtqueue bits? vq number? > > Yes, sorry. I got confused myself, it's vq index actually now, we made the spec consistent with that terminology. used to be number/index interchangeably. > Assume cross-vm as an example. It would make use of virtqueue indexes 5+6 > with their VIRTIO_BALLOON_F_WS_REPORTING. crossvm guys really should have reserved the feature bit even if they did not bother specifying it. Let's reserve it now at least? > So whatever feature another device implements couldn't use this feature bit > or these virtqueue indexes. > > (as long the other device never intends to implement > VIRTIO_BALLOON_F_WS_REPORTING, the virtqueue indexes could be reused. But > the spec will also be a mess, because virtqueue indexes could also have > duplicate meanings ... ugh) what do they do with vq indices btw? > -- > Cheers, > > David / dhildenb
On 07.04.25 10:54, Michael S. Tsirkin wrote: > On Mon, Apr 07, 2025 at 09:18:21AM +0200, David Hildenbrand wrote: >>> Now I am beginning to think we should leave the spec alone >>> and fix the drivers ... Ugh .... >> >> We could always say that starting with feature X, queue indexes are fixed >> again. E.g., VIRTIO_BALLOON_F_X would have it's virtqueue fixed at index 5, >> independent of the other (older) features where the virtqueue indexes are >> determined like today. >> >> Won't make the implementation easier, though, I'm afraid. >> >> (I also thought about a way to query the virtqueue index for a feature, but >> that's probably overengineering) > > The best contract we have is the spec. Sometimes it is hopelessly broken > and we have to fix it, but not in this case. > > Let's do a theoretical excercise, assuming we want to fix the drivers, > but we also want to have workarounds in place in qemu and in > drivers to support existing ones. How would we go about it? QEMU could likely be changed to always offer VIRTIO_BALLOON_F_FREE_PAGE_HINT, but not actually use it unless enabled for QEMU. That should work, because all action is initiated by the device. That way, all virtqueue indexes would always be according to the spec. We'll likely need compat machine handling .... Regarding Linux, I'll have to think about it further ... > Maybe we want a feature bit BALLOON_FIXED and ask everyone > to negotiate it? But if we go this way, we really need to fix > the 48 bit limitation too. I was thinking about the same, but it's all a mess ...
On 07.04.25 10:58, Michael S. Tsirkin wrote: > On Mon, Apr 07, 2025 at 10:54:00AM +0200, David Hildenbrand wrote: >> On 07.04.25 10:49, Michael S. Tsirkin wrote: >>> On Mon, Apr 07, 2025 at 10:44:21AM +0200, David Hildenbrand wrote: >>>>> >>>>> >>>>> >>>>>> Whoever adds new feat_X *must be aware* about all previous features, >>>>>> otherwise we'd be reusing feature bits and everything falls to pieces. >>>>> >>>>> >>>>> The knowledge is supposed be limited to which feature bit to use. >>>> >>>> I think we also have to know which virtqueue bits can be used, right? >>>> >>> >>> what are virtqueue bits? vq number? >> >> Yes, sorry. > > I got confused myself, it's vq index actually now, we made the spec > consistent with that terminology. used to be number/index > interchangeably. > >> Assume cross-vm as an example. It would make use of virtqueue indexes 5+6 >> with their VIRTIO_BALLOON_F_WS_REPORTING. > > > crossvm guys really should have reserved the feature bit even if they > did not bother specifying it. Let's reserve it now at least? Along with the virtqueue indices, right? Note that there was https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg02503.html and https://groups.oasis-open.org/communities/community-home/digestviewer/viewthread?GroupId=3973&MessageKey=afb07613-f56c-4d40-8981-2fad1c723998&CommunityKey=2f26be99-3aa1-48f6-93a5-018dce262226&hlmlt=VT But it only was RFC, and as the QEMU implementation didn't materialize, nobody seemed to care ... > > >> So whatever feature another device implements couldn't use this feature bit >> or these virtqueue indexes. >> >> (as long the other device never intends to implement >> VIRTIO_BALLOON_F_WS_REPORTING, the virtqueue indexes could be reused. But >> the spec will also be a mess, because virtqueue indexes could also have >> duplicate meanings ... ugh) > > what do they do with vq indices btw? See above links, they use the two for "s_vq and notification_vq".
On 07.04.25 11:11, David Hildenbrand wrote: > On 07.04.25 10:58, Michael S. Tsirkin wrote: >> On Mon, Apr 07, 2025 at 10:54:00AM +0200, David Hildenbrand wrote: >>> On 07.04.25 10:49, Michael S. Tsirkin wrote: >>>> On Mon, Apr 07, 2025 at 10:44:21AM +0200, David Hildenbrand wrote: >>>>>> >>>>>> >>>>>> >>>>>>> Whoever adds new feat_X *must be aware* about all previous features, >>>>>>> otherwise we'd be reusing feature bits and everything falls to pieces. >>>>>> >>>>>> >>>>>> The knowledge is supposed be limited to which feature bit to use. >>>>> >>>>> I think we also have to know which virtqueue bits can be used, right? >>>>> >>>> >>>> what are virtqueue bits? vq number? >>> >>> Yes, sorry. >> >> I got confused myself, it's vq index actually now, we made the spec >> consistent with that terminology. used to be number/index >> interchangeably. >> >>> Assume cross-vm as an example. It would make use of virtqueue indexes 5+6 >>> with their VIRTIO_BALLOON_F_WS_REPORTING. >> >> >> crossvm guys really should have reserved the feature bit even if they >> did not bother specifying it. Let's reserve it now at least? > > Along with the virtqueue indices, right? > > Note that there was > > https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg02503.html > > and > > https://groups.oasis-open.org/communities/community-home/digestviewer/viewthread?GroupId=3973&MessageKey=afb07613-f56c-4d40-8981-2fad1c723998&CommunityKey=2f26be99-3aa1-48f6-93a5-018dce262226&hlmlt=VT > > But it only was RFC, and as the QEMU implementation didn't materialize, > nobody seemed to care ... Heh, but that one said: +\item[ VIRTIO_BALLOON_F_WS_REPORTING(6) ] The device has support for Working Set Which does not seem to reflect reality ...
On 07.04.25 10:44, Michael S. Tsirkin wrote:
> Wow great job digging through all these hypervisors!
There is more ... :(
aloith:
https://github.com/google/alioth/blob/main/alioth/src/virtio/dev/balloon.rs
It uses the incremental vq index assignment like QEMU.
impl VirtioMio for Balloon {
fn activate<'a, 'm, Q: VirtQueue<'m>, S: IrqSender>(
&mut self,
feature: u64,
_active_mio: &mut ActiveMio<'a, 'm, Q, S>,
) -> Result<()> {
let feature = BalloonFeature::from_bits_retain(feature);
self.queues[0] = BalloonQueue::Inflate;
self.queues[1] = BalloonQueue::Deflate;
let mut index = 2;
if feature.contains(BalloonFeature::STATS_VQ) {
self.queues[index] = BalloonQueue::Stats;
index += 1;
}
if feature.contains(BalloonFeature::FREE_PAGE_HINT) {
self.queues[index] = BalloonQueue::FreePage;
index += 1;
}
if feature.contains(BalloonFeature::PAGE_REPORTING) {
self.queues[index] = BalloonQueue::Reporting;
}
Ok(())
}
I'll dig some more, but this is getting out of hand :D
On Mon, Apr 07, 2025 at 11:11:34AM +0200, David Hildenbrand wrote: > On 07.04.25 10:58, Michael S. Tsirkin wrote: > > On Mon, Apr 07, 2025 at 10:54:00AM +0200, David Hildenbrand wrote: > > > On 07.04.25 10:49, Michael S. Tsirkin wrote: > > > > On Mon, Apr 07, 2025 at 10:44:21AM +0200, David Hildenbrand wrote: > > > > > > > > > > > > > > > > > > > > > > > > > Whoever adds new feat_X *must be aware* about all previous features, > > > > > > > otherwise we'd be reusing feature bits and everything falls to pieces. > > > > > > > > > > > > > > > > > > The knowledge is supposed be limited to which feature bit to use. > > > > > > > > > > I think we also have to know which virtqueue bits can be used, right? > > > > > > > > > > > > > what are virtqueue bits? vq number? > > > > > > Yes, sorry. > > > > I got confused myself, it's vq index actually now, we made the spec > > consistent with that terminology. used to be number/index > > interchangeably. > > > > > Assume cross-vm as an example. It would make use of virtqueue indexes 5+6 > > > with their VIRTIO_BALLOON_F_WS_REPORTING. > > > > > > crossvm guys really should have reserved the feature bit even if they > > did not bother specifying it. Let's reserve it now at least? > > Along with the virtqueue indices, right? Well ... as long as the implementation is careful to check that feature is negotiated, reusing vq index at least causes no trouble for others. > Note that there was > > https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg02503.html > > and > > https://groups.oasis-open.org/communities/community-home/digestviewer/viewthread?GroupId=3973&MessageKey=afb07613-f56c-4d40-8981-2fad1c723998&CommunityKey=2f26be99-3aa1-48f6-93a5-018dce262226&hlmlt=VT > > But it only was RFC, and as the QEMU implementation didn't materialize, > nobody seemed to care ... Thanks! I will try poke the author again. > > > > > > > So whatever feature another device implements couldn't use this feature bit > > > or these virtqueue indexes. > > > > > > (as long the other device never intends to implement > > > VIRTIO_BALLOON_F_WS_REPORTING, the virtqueue indexes could be reused. But > > > the spec will also be a mess, because virtqueue indexes could also have > > > duplicate meanings ... ugh) > > > > what do they do with vq indices btw? > > See above links, they use the two for "s_vq and notification_vq". > > -- > Cheers, > > David / dhildenb
On Mon, 7 Apr 2025 04:34:29 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote: > On Mon, Apr 07, 2025 at 10:17:10AM +0200, David Hildenbrand wrote: > > On 07.04.25 09:52, Michael S. Tsirkin wrote: > > > On Fri, Apr 04, 2025 at 05:39:10PM +0200, Halil Pasic wrote: > > > > > > > > > > Not perfect, but AFAIKS, not horrible. > > > > > > > > It is like it is. QEMU does queue exist if the corresponding feature > > > > is offered by the device, and that is what we have to live with. > > > > > > I don't think we can live with this properly though. > > > It means a guest that does not know about some features > > > does not know where to find things. > > > > Please describe a real scenario, I'm missing the point. > > > OK so. > > Device has VIRTIO_BALLOON_F_FREE_PAGE_HINT and VIRTIO_BALLOON_F_REPORTING > Driver only knows about VIRTIO_BALLOON_F_REPORTING so > it does not know what does VIRTIO_BALLOON_F_FREE_PAGE_HINT do. > How does it know which vq to use for reporting? > It will try to use the free page hint one. First, sorry for not catching up again with the discussion earlier. I think David's point is based on the assumption that by the time feature with the feature bit N+1 is specified and allocates a queue Q, all queues with indexes smaller than Q are allocated and possibly associated with features that were previously specified (and probably have feature bits smaller than N+1). I.e. that we can mandate, even if you don't want to care about other optional features, you have to, because we say so, for the matter of virtqueue existence. And anything in the future, you don't have to care about because the queue index associated with future features is larger than Q, so it does not affect our position. I think that argument can fall a part if: * future features reference optional queues defined in the past * somebody managed to introduce a limbo where a feature is reserved, and they can not decide if they want a queue or not, or make the existence of the queue depend on something else than a feature bit. Frankly I don't think the risks are huge, but it would undoubtedly make the spec more ugly. > > > > > Whoever adds new feat_X *must be aware* about all previous features, > > otherwise we'd be reusing feature bits and everything falls to pieces. > > > The knowledge is supposed be limited to which feature bit to use. > I do agree! This is why I brought this question up. Creating exceptions from that rule would be very ugly IMHO. But I would not say it is impossible. > > > > > > > > So now, I am inclined to add linux code to work with current qemu and > > > with spec compliant one, and add qemu code to work with current linux > > > and spec compliant one. > > > > > > Document the bug in the spec, maybe, in a non conformance section. > > > > I'm afraid this results in a lot of churn without really making things > > better. > > > IMHO, documenting things how they actually behave, and maybe moving towards > > fixed queue indexes for new features is the low hanging fruit. > > I worry about how to we ensure that? > If old code is messed up people will just keep propagating that. > I would like to fix old code so that new code is correct. > > > > > As raised, it's not just qemu+linux, it's *at least* also cloud-hypervisor. > > > > -- > > Cheers, > > > > David / dhildenb > > There's a slippery slope here in that people will come to us > with buggy devices and ask to change the spec. > I agree! IMHO all we have are bad options. To decide for myself which is less ugly I would love to see both. I agree making the spec bend to the fact that implementations are buggy does not seem right from the spec perspective. But if making things work is not practical without sacrificing the sanctity of the spec, I am willing to swallow the bitter pill and bend the spec. If you don't mind try to keep me in the loop, even if I'm not able to be as responsive as I would like to be. I'm happy fixing the code is going to get another round of consideration. Regards, Halil
On 07.04.25 11:13, David Hildenbrand wrote: > On 07.04.25 11:11, David Hildenbrand wrote: >> On 07.04.25 10:58, Michael S. Tsirkin wrote: >>> On Mon, Apr 07, 2025 at 10:54:00AM +0200, David Hildenbrand wrote: >>>> On 07.04.25 10:49, Michael S. Tsirkin wrote: >>>>> On Mon, Apr 07, 2025 at 10:44:21AM +0200, David Hildenbrand wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>>> Whoever adds new feat_X *must be aware* about all previous features, >>>>>>>> otherwise we'd be reusing feature bits and everything falls to pieces. >>>>>>> >>>>>>> >>>>>>> The knowledge is supposed be limited to which feature bit to use. >>>>>> >>>>>> I think we also have to know which virtqueue bits can be used, right? >>>>>> >>>>> >>>>> what are virtqueue bits? vq number? >>>> >>>> Yes, sorry. >>> >>> I got confused myself, it's vq index actually now, we made the spec >>> consistent with that terminology. used to be number/index >>> interchangeably. >>> >>>> Assume cross-vm as an example. It would make use of virtqueue indexes 5+6 >>>> with their VIRTIO_BALLOON_F_WS_REPORTING. >>> >>> >>> crossvm guys really should have reserved the feature bit even if they >>> did not bother specifying it. Let's reserve it now at least? >> >> Along with the virtqueue indices, right? >> >> Note that there was >> >> https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg02503.html >> >> and >> >> https://groups.oasis-open.org/communities/community-home/digestviewer/viewthread?GroupId=3973&MessageKey=afb07613-f56c-4d40-8981-2fad1c723998&CommunityKey=2f26be99-3aa1-48f6-93a5-018dce262226&hlmlt=VT >> >> But it only was RFC, and as the QEMU implementation didn't materialize, >> nobody seemed to care ... > > Heh, but that one said: > > +\item[ VIRTIO_BALLOON_F_WS_REPORTING(6) ] The device has support for > Working Set > > Which does not seem to reflect reality ... > I dug a bit more into cross-vm, because that one seems to be the only one out there that does not behave like everybody else I found (maybe good, maybe bad :) ). 1) There was temporarily even another feature (VIRTIO_BALLOON_F_EVENTS_VQ) and another queue. It got removed from cross-vm in: commit 9ba634b82b55ba762dc8724676b2cf9419460145 Author: Daniel Verkamp <dverkamp@chromium.org> Date: Thu Jul 11 11:29:52 2024 -0700 devices: virtio-balloon: remove event queue support VIRTIO_BALLOON_F_EVENTS_VQ was part of a proposed virtio spec change. It is not currently supported by upstream Linux, so removing this should have no effect except for guest kernels that had CHROMIUM patches applied. The virtqueue indexes for the ws-related queues are decremented to fill the hole left by the removal of the event VQ; these are non-standard as well, so they do not have virtqueue indexes assigned in the virtio spec, but the proposed spec extension did actually use vq indexes 5 and 6. BUG=b:214864326 2) cross-vm is aware of the upstream Linux driver They thought your fix would go upstream; it didn't. commit a2fa119e759d0238a42ff15a9aff0dfd122afebd Author: Daniel Verkamp <dverkamp@chromium.org> Date: Wed Jul 10 16:16:28 2024 -0700 devices: virtio-balloon: warn about queue index mismatches The Linux kernel virtio-balloon driver spec non-compliance related to queue numbering is being fixed; add some diagnostics to our device that help to check if everything is working as expected. <https://lore.kernel.org/virtualization/CACGkMEsg0+vpav1Fo8JF1isq4Ef8t4_CFN1scyztDO8bXzRLBQ@mail.gmail.com/T/> Additionally, replace the num_expected_queues() function with per-queue checking to avoid the need for the duplicate feature checks and queue count calculation; each pop_queue() call will be checked using the `?` operator and return a more useful error message if a particular queue is missing. BUG=None TEST=crosvm run --balloon-page-reporting ... IIRC, in that commit they switched to the "spec" behavior. That's when they started hard-coding the queue indexes. CCing Daniel. All Linux versions should be incompatible with cross-vmm regarding free page reporting. How is that handled?
On 07.04.25 15:12, Halil Pasic wrote: > On Mon, 7 Apr 2025 04:34:29 -0400 > "Michael S. Tsirkin" <mst@redhat.com> wrote: > >> On Mon, Apr 07, 2025 at 10:17:10AM +0200, David Hildenbrand wrote: >>> On 07.04.25 09:52, Michael S. Tsirkin wrote: >>>> On Fri, Apr 04, 2025 at 05:39:10PM +0200, Halil Pasic wrote: >>>>>> >>>>>> Not perfect, but AFAIKS, not horrible. >>>>> >>>>> It is like it is. QEMU does queue exist if the corresponding feature >>>>> is offered by the device, and that is what we have to live with. >>>> >>>> I don't think we can live with this properly though. >>>> It means a guest that does not know about some features >>>> does not know where to find things. >>> >>> Please describe a real scenario, I'm missing the point. >> >> >> OK so. >> >> Device has VIRTIO_BALLOON_F_FREE_PAGE_HINT and VIRTIO_BALLOON_F_REPORTING >> Driver only knows about VIRTIO_BALLOON_F_REPORTING so >> it does not know what does VIRTIO_BALLOON_F_FREE_PAGE_HINT do. >> How does it know which vq to use for reporting? >> It will try to use the free page hint one. > > First, sorry for not catching up again with the discussion earlier. > > I think David's point is based on the assumption that by the time feature > with the feature bit N+1 is specified and allocates a queue Q, all > queues with indexes smaller than Q are allocated and possibly associated > with features that were previously specified (and probably have feature > bits smaller than N+1). > > I.e. that we can mandate, even if you don't want to care about other > optional features, you have to, because we say so, for the matter of > virtqueue existence. And anything in the future, you don't have to care > about because the queue index associated with future features is larger > than Q, so it does not affect our position. > > I think that argument can fall a part if: > * future features reference optional queues defined in the past > * somebody managed to introduce a limbo where a feature is reserved, and > they can not decide if they want a queue or not, or make the existence > of the queue depend on something else than a feature bit. Staring at the cross-vmm, including the adding+removing of features and queues that are not in the spec, I am wondering if (in a world with fixed virtqueues) 1) Feature bits must be reserved before used. 2) Queue indices must be reserved before used. It all smells like a problem similar to device IDs ...
On Mon, Apr 07 2025, David Hildenbrand <david@redhat.com> wrote: > On 07.04.25 15:12, Halil Pasic wrote: >> On Mon, 7 Apr 2025 04:34:29 -0400 >> "Michael S. Tsirkin" <mst@redhat.com> wrote: >> >>> On Mon, Apr 07, 2025 at 10:17:10AM +0200, David Hildenbrand wrote: >>>> On 07.04.25 09:52, Michael S. Tsirkin wrote: >>>>> On Fri, Apr 04, 2025 at 05:39:10PM +0200, Halil Pasic wrote: >>>>>>> >>>>>>> Not perfect, but AFAIKS, not horrible. >>>>>> >>>>>> It is like it is. QEMU does queue exist if the corresponding feature >>>>>> is offered by the device, and that is what we have to live with. >>>>> >>>>> I don't think we can live with this properly though. >>>>> It means a guest that does not know about some features >>>>> does not know where to find things. >>>> >>>> Please describe a real scenario, I'm missing the point. >>> >>> >>> OK so. >>> >>> Device has VIRTIO_BALLOON_F_FREE_PAGE_HINT and VIRTIO_BALLOON_F_REPORTING >>> Driver only knows about VIRTIO_BALLOON_F_REPORTING so >>> it does not know what does VIRTIO_BALLOON_F_FREE_PAGE_HINT do. >>> How does it know which vq to use for reporting? >>> It will try to use the free page hint one. >> >> First, sorry for not catching up again with the discussion earlier. >> >> I think David's point is based on the assumption that by the time feature >> with the feature bit N+1 is specified and allocates a queue Q, all >> queues with indexes smaller than Q are allocated and possibly associated >> with features that were previously specified (and probably have feature >> bits smaller than N+1). >> >> I.e. that we can mandate, even if you don't want to care about other >> optional features, you have to, because we say so, for the matter of >> virtqueue existence. And anything in the future, you don't have to care >> about because the queue index associated with future features is larger >> than Q, so it does not affect our position. >> >> I think that argument can fall a part if: >> * future features reference optional queues defined in the past >> * somebody managed to introduce a limbo where a feature is reserved, and >> they can not decide if they want a queue or not, or make the existence >> of the queue depend on something else than a feature bit. > > Staring at the cross-vmm, including the adding+removing of features and > queues that are not in the spec, I am wondering if (in a world with > fixed virtqueues) > > 1) Feature bits must be reserved before used. > > 2) Queue indices must be reserved before used. > > It all smells like a problem similar to device IDs ... Indeed, we need a rule "reserve a feature bit/queue index before using it, even if you do not plan to spec it properly".
On Mon, Apr 07, 2025 at 03:28:13PM +0200, Cornelia Huck wrote: > On Mon, Apr 07 2025, David Hildenbrand <david@redhat.com> wrote: > > > On 07.04.25 15:12, Halil Pasic wrote: > >> On Mon, 7 Apr 2025 04:34:29 -0400 > >> "Michael S. Tsirkin" <mst@redhat.com> wrote: > >> > >>> On Mon, Apr 07, 2025 at 10:17:10AM +0200, David Hildenbrand wrote: > >>>> On 07.04.25 09:52, Michael S. Tsirkin wrote: > >>>>> On Fri, Apr 04, 2025 at 05:39:10PM +0200, Halil Pasic wrote: > >>>>>>> > >>>>>>> Not perfect, but AFAIKS, not horrible. > >>>>>> > >>>>>> It is like it is. QEMU does queue exist if the corresponding feature > >>>>>> is offered by the device, and that is what we have to live with. > >>>>> > >>>>> I don't think we can live with this properly though. > >>>>> It means a guest that does not know about some features > >>>>> does not know where to find things. > >>>> > >>>> Please describe a real scenario, I'm missing the point. > >>> > >>> > >>> OK so. > >>> > >>> Device has VIRTIO_BALLOON_F_FREE_PAGE_HINT and VIRTIO_BALLOON_F_REPORTING > >>> Driver only knows about VIRTIO_BALLOON_F_REPORTING so > >>> it does not know what does VIRTIO_BALLOON_F_FREE_PAGE_HINT do. > >>> How does it know which vq to use for reporting? > >>> It will try to use the free page hint one. > >> > >> First, sorry for not catching up again with the discussion earlier. > >> > >> I think David's point is based on the assumption that by the time feature > >> with the feature bit N+1 is specified and allocates a queue Q, all > >> queues with indexes smaller than Q are allocated and possibly associated > >> with features that were previously specified (and probably have feature > >> bits smaller than N+1). > >> > >> I.e. that we can mandate, even if you don't want to care about other > >> optional features, you have to, because we say so, for the matter of > >> virtqueue existence. And anything in the future, you don't have to care > >> about because the queue index associated with future features is larger > >> than Q, so it does not affect our position. > >> > >> I think that argument can fall a part if: > >> * future features reference optional queues defined in the past > >> * somebody managed to introduce a limbo where a feature is reserved, and > >> they can not decide if they want a queue or not, or make the existence > >> of the queue depend on something else than a feature bit. > > > > Staring at the cross-vmm, including the adding+removing of features and > > queues that are not in the spec, I am wondering if (in a world with > > fixed virtqueues) > > > > 1) Feature bits must be reserved before used. > > > > 2) Queue indices must be reserved before used. > > > > It all smells like a problem similar to device IDs ... > > Indeed, we need a rule "reserve a feature bit/queue index before using > it, even if you do not plan to spec it properly". Reserving feature bits is something I do my best to advocate for in all presentations I do.
On Mon, 07 Apr 2025 15:28:13 +0200 Cornelia Huck <cohuck@redhat.com> wrote: > > Staring at the cross-vmm, including the adding+removing of features and > > queues that are not in the spec, I am wondering if (in a world with > > fixed virtqueues) > > > > 1) Feature bits must be reserved before used. > > > > 2) Queue indices must be reserved before used. > > > > It all smells like a problem similar to device IDs ... > > Indeed, we need a rule "reserve a feature bit/queue index before using > it, even if you do not plan to spec it properly". What definition of usage do you guys have in mind? Would an RFC patch constitute usage. I think reserving/allocating an identifier of this type before relying on it for anything remotely serious is very basic common sense. Frankly I would go even further and advocate for the following rule: we don't accept anything virtio into Linux unless it is reasonably/properly spec-ed. My train of thought is following: if a virtio thing gains traction with Linux it has a fair chance of becoming a de-facto standard. Consider our thinking on this one. Despite the fact that what is spec-ed is obviously nicer, we almost decided to change the spec to fit what is implemented and fielded out there. And IMHO for good reason. For any rule we come up with, I think one of the most crucial questions is, who is going to enforce it if anybody. The Linux (and probably also QEMU) virtio maintainers are in my opinion the most reasonable point of enforcement. Another thing to consider. After the code is in and things work, I speculate that the motivation for writing a proper spec may wane. I hope we do strive for consistency between the spec and the implementations we are talking about. Having and eye on the spec while looking at and trying to understand the code suits my workflow better at least than the other way around. And licensing-wise getting the spec merged first is probably the better option. Regards, Halil
On Mon, Apr 7, 2025 at 6:13 AM David Hildenbrand <david@redhat.com> wrote: > > On 07.04.25 11:13, David Hildenbrand wrote: > > On 07.04.25 11:11, David Hildenbrand wrote: > >> On 07.04.25 10:58, Michael S. Tsirkin wrote: > >>> On Mon, Apr 07, 2025 at 10:54:00AM +0200, David Hildenbrand wrote: > >>>> On 07.04.25 10:49, Michael S. Tsirkin wrote: > >>>>> On Mon, Apr 07, 2025 at 10:44:21AM +0200, David Hildenbrand wrote: > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>>> Whoever adds new feat_X *must be aware* about all previous features, > >>>>>>>> otherwise we'd be reusing feature bits and everything falls to pieces. > >>>>>>> > >>>>>>> > >>>>>>> The knowledge is supposed be limited to which feature bit to use. > >>>>>> > >>>>>> I think we also have to know which virtqueue bits can be used, right? > >>>>>> > >>>>> > >>>>> what are virtqueue bits? vq number? > >>>> > >>>> Yes, sorry. > >>> > >>> I got confused myself, it's vq index actually now, we made the spec > >>> consistent with that terminology. used to be number/index > >>> interchangeably. > >>> > >>>> Assume cross-vm as an example. It would make use of virtqueue indexes 5+6 > >>>> with their VIRTIO_BALLOON_F_WS_REPORTING. > >>> > >>> > >>> crossvm guys really should have reserved the feature bit even if they > >>> did not bother specifying it. Let's reserve it now at least? > >> > >> Along with the virtqueue indices, right? > >> > >> Note that there was > >> > >> https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg02503.html > >> > >> and > >> > >> https://groups.oasis-open.org/communities/community-home/digestviewer/viewthread?GroupId=3973&MessageKey=afb07613-f56c-4d40-8981-2fad1c723998&CommunityKey=2f26be99-3aa1-48f6-93a5-018dce262226&hlmlt=VT > >> > >> But it only was RFC, and as the QEMU implementation didn't materialize, > >> nobody seemed to care ... > > > > Heh, but that one said: > > > > +\item[ VIRTIO_BALLOON_F_WS_REPORTING(6) ] The device has support for > > Working Set > > > > Which does not seem to reflect reality ... Please feel free to disregard these features and reuse their bits and queue indexes; as far as I know, they are not actually enabled anywhere currently and the corresponding guest patches were only applied to some (no-longer-used) ChromeOS kernel trees, so the compatibility impact should be minimal. I will also try to clean up the leftover bits on the crosvm side just to clear things up. > I dug a bit more into cross-vm, because that one seems to be the only > one out there that does not behave like everybody else I found (maybe good, > maybe bad :) ). > > > 1) There was temporarily even another feature (VIRTIO_BALLOON_F_EVENTS_VQ) > and another queue. > > It got removed from cross-vm in: > > commit 9ba634b82b55ba762dc8724676b2cf9419460145 > Author: Daniel Verkamp <dverkamp@chromium.org> > Date: Thu Jul 11 11:29:52 2024 -0700 > > devices: virtio-balloon: remove event queue support > > VIRTIO_BALLOON_F_EVENTS_VQ was part of a proposed virtio spec change. > > It is not currently supported by upstream Linux, so removing this should > have no effect except for guest kernels that had CHROMIUM patches > applied. > > The virtqueue indexes for the ws-related queues are decremented to fill > the hole left by the removal of the event VQ; these are non-standard as > well, so they do not have virtqueue indexes assigned in the virtio spec, > but the proposed spec extension did actually use vq indexes 5 and 6. > > BUG=b:214864326 > > > 2) cross-vm is aware of the upstream Linux driver > > They thought your fix would go upstream; it didn't. > > commit a2fa119e759d0238a42ff15a9aff0dfd122afebd > Author: Daniel Verkamp <dverkamp@chromium.org> > Date: Wed Jul 10 16:16:28 2024 -0700 > > devices: virtio-balloon: warn about queue index mismatches > > The Linux kernel virtio-balloon driver spec non-compliance related to > queue numbering is being fixed; add some diagnostics to our device that > help to check if everything is working as expected. > > <https://lore.kernel.org/virtualization/CACGkMEsg0+vpav1Fo8JF1isq4Ef8t4_CFN1scyztDO8bXzRLBQ@mail.gmail.com/T/> > > Additionally, replace the num_expected_queues() function with per-queue > checking to avoid the need for the duplicate feature checks and queue > count calculation; each pop_queue() call will be checked using the `?` > operator and return a more useful error message if a particular queue is > missing. > > BUG=None > TEST=crosvm run --balloon-page-reporting ... > > > IIRC, in that commit they switched to the "spec" behavior. > > That's when they started hard-coding the queue indexes. > > CCing Daniel. All Linux versions should be incompatible with cross-vmm regarding free page reporting. > How is that handled? In practice, it only works because nobody calls crosvm with --balloon-page-reporting (it's off by default), so the balloon device does not advertise the VIRTIO_BALLOON_F_PAGE_REPORTING feature. (I just went searching now, and it does seem like there is actually one user in Android that does try to enable page reporting[1], which I'll have to look into...) In my opinion, it makes the most sense to keep the spec as it is and change QEMU and the kernel to match, but obviously that's not trivial to do in a way that doesn't break existing devices and drivers. [1]: https://cs.android.com/android/platform/superproject/main/+/main:packages/modules/Virtualization/android/virtmgr/src/crosvm.rs;l=1079;drc=caff2827914c0918260ff76cd55f6e3dc6323d51
>>> Heh, but that one said: >>> >>> +\item[ VIRTIO_BALLOON_F_WS_REPORTING(6) ] The device has support for >>> Working Set >>> >>> Which does not seem to reflect reality ... > > Please feel free to disregard these features and reuse their bits and > queue indexes; as far as I know, they are not actually enabled > anywhere currently and the corresponding guest patches were only > applied to some (no-longer-used) ChromeOS kernel trees, so the > compatibility impact should be minimal. I will also try to clean up > the leftover bits on the crosvm side just to clear things up. Thanks for your reply, and thanks for clarifying+cleaning it up. > >> I dug a bit more into cross-vm, because that one seems to be the only >> one out there that does not behave like everybody else I found (maybe good, >> maybe bad :) ). >> >> >> 1) There was temporarily even another feature (VIRTIO_BALLOON_F_EVENTS_VQ) >> and another queue. >> >> It got removed from cross-vm in: >> >> commit 9ba634b82b55ba762dc8724676b2cf9419460145 >> Author: Daniel Verkamp <dverkamp@chromium.org> >> Date: Thu Jul 11 11:29:52 2024 -0700 >> >> devices: virtio-balloon: remove event queue support >> >> VIRTIO_BALLOON_F_EVENTS_VQ was part of a proposed virtio spec change. >> >> It is not currently supported by upstream Linux, so removing this should >> have no effect except for guest kernels that had CHROMIUM patches >> applied. >> >> The virtqueue indexes for the ws-related queues are decremented to fill >> the hole left by the removal of the event VQ; these are non-standard as >> well, so they do not have virtqueue indexes assigned in the virtio spec, >> but the proposed spec extension did actually use vq indexes 5 and 6. >> >> BUG=b:214864326 >> >> >> 2) cross-vm is aware of the upstream Linux driver >> >> They thought your fix would go upstream; it didn't. >> >> commit a2fa119e759d0238a42ff15a9aff0dfd122afebd >> Author: Daniel Verkamp <dverkamp@chromium.org> >> Date: Wed Jul 10 16:16:28 2024 -0700 >> >> devices: virtio-balloon: warn about queue index mismatches >> >> The Linux kernel virtio-balloon driver spec non-compliance related to >> queue numbering is being fixed; add some diagnostics to our device that >> help to check if everything is working as expected. >> >> <https://lore.kernel.org/virtualization/CACGkMEsg0+vpav1Fo8JF1isq4Ef8t4_CFN1scyztDO8bXzRLBQ@mail.gmail.com/T/> >> >> Additionally, replace the num_expected_queues() function with per-queue >> checking to avoid the need for the duplicate feature checks and queue >> count calculation; each pop_queue() call will be checked using the `?` >> operator and return a more useful error message if a particular queue is >> missing. >> >> BUG=None >> TEST=crosvm run --balloon-page-reporting ... >> >> >> IIRC, in that commit they switched to the "spec" behavior. >> >> That's when they started hard-coding the queue indexes. >> >> CCing Daniel. All Linux versions should be incompatible with cross-vmm regarding free page reporting. >> How is that handled? > > In practice, it only works because nobody calls crosvm with > --balloon-page-reporting (it's off by default), so the balloon device > does not advertise the VIRTIO_BALLOON_F_PAGE_REPORTING feature. > > (I just went searching now, and it does seem like there is actually > one user in Android that does try to enable page reporting[1], which > I'll have to look into...) > > In my opinion, it makes the most sense to keep the spec as it is and > change QEMU and the kernel to match, but obviously that's not trivial > to do in a way that doesn't break existing devices and drivers. If only it would be limited to QEMU and Linux ... :) Out of curiosity, assuming we'd make the spec match the current QEMU/Linux implementation at least for the 3 involved features only, would there be a way to adjust crossvm without any disruption? I still have the feeling that it will be rather hard to get that all implementations match the spec ... For new features+queues it will be easy to force the usage of fixed virtqueue numbers, but for free-page-hinting and reporting, it's a mess :(
On Mon, Apr 7, 2025 at 11:47 AM David Hildenbrand <david@redhat.com> wrote: > > >>> Heh, but that one said: > >>> > >>> +\item[ VIRTIO_BALLOON_F_WS_REPORTING(6) ] The device has support for > >>> Working Set > >>> > >>> Which does not seem to reflect reality ... > > > > Please feel free to disregard these features and reuse their bits and > > queue indexes; as far as I know, they are not actually enabled > > anywhere currently and the corresponding guest patches were only > > applied to some (no-longer-used) ChromeOS kernel trees, so the > > compatibility impact should be minimal. I will also try to clean up > > the leftover bits on the crosvm side just to clear things up. > > Thanks for your reply, and thanks for clarifying+cleaning it up. > [...] > >> IIRC, in that commit they switched to the "spec" behavior. > >> > >> That's when they started hard-coding the queue indexes. > >> > >> CCing Daniel. All Linux versions should be incompatible with cross-vmm regarding free page reporting. > >> How is that handled? > > > > In practice, it only works because nobody calls crosvm with > > --balloon-page-reporting (it's off by default), so the balloon device > > does not advertise the VIRTIO_BALLOON_F_PAGE_REPORTING feature. > > > > (I just went searching now, and it does seem like there is actually > > one user in Android that does try to enable page reporting[1], which > > I'll have to look into...) > > > > In my opinion, it makes the most sense to keep the spec as it is and > > change QEMU and the kernel to match, but obviously that's not trivial > > to do in a way that doesn't break existing devices and drivers. > > If only it would be limited to QEMU and Linux ... :) > > Out of curiosity, assuming we'd make the spec match the current > QEMU/Linux implementation at least for the 3 involved features only, > would there be a way to adjust crossvm without any disruption? > > I still have the feeling that it will be rather hard to get that all > implementations match the spec ... For new features+queues it will be > easy to force the usage of fixed virtqueue numbers, but for > free-page-hinting and reporting, it's a mess :( If the spec is changed, we can certainly update crosvm to match it; I think this only really affects a few devices (balloon and technically filesystem, but see below), only affects features that are generally not turned on, and in many cases, the guest kernel is updated simultaneously with the crosvm binary. I'm not opposed to changing the spec to match reality, although that feels like a bad move from a spec-integrity perspective. Regardless of the chosen path, I think the spec should be clarified - the meaning of "queue only exists if <feature> is set" leaves the reader with too many questions: - What does "if <feature> is set" mean? If it's advertised by the device? If it's acked by the driver? (To me, "set" definitely hints at the latter, but it should be explicit.) - What does it mean for a virtqueue to "exist"? Does that queue index disappear from the numbering if it does not exist, sliding all of the later queues down? If so, the spec should really not have the static queue numbers listed for the later queues, since they are only correct if all previous feature-dependent queues were also "set", whatever that means. The way crosvm interpreted this was: - "if <feature> is set" means "if the device advertised <feature> *and* driver acknowledged <feature>" - "queue only exists" means "if <feature> was not acked, the corresponding virtqueue cannot be enabled by the driver" (attempting to set queue_enable = 1 has no effect). - Any later virtqueues are unaffected and still have the same queue indexes. The way QEMU interpeted this (I think, just skimming the code and working from memory here): - "if <feature> is set" means "if the device advertised <feature>" (it is checking host_features, not guest_features) - "queue only exists" means "if <feature> was not offered by the device, all later virtqueues are shifted down one index" --- The spec for the filesystem device has a similar issue to the balloon device: - Queue 0 (hiprio) is always available regardless of features. - Queue 1 (notification queue) has a note that "The notification queue only exists if VIRTIO_FS_F_NOTIFICATION is set." - Queue 2..n are supposed to be the request queues per the numbering in the spec. This is how it has been specified since virtio 1.2 when the fs device was originally added. However, the Linux driver (and probably all existing device implementations - at least virtiofsd and crosvm's fs device) don't support VIRTIO_FS_F_NOTIFICATION and use queue 1 as a request queue, which matches the QEMU/Linux interpretation but means the spec doesn't match reality again. Thanks, -- Daniel
On Mon, Apr 07, 2025 at 08:47:05PM +0200, David Hildenbrand wrote: > > In my opinion, it makes the most sense to keep the spec as it is and > > change QEMU and the kernel to match, but obviously that's not trivial > > to do in a way that doesn't break existing devices and drivers. > > If only it would be limited to QEMU and Linux ... :) > > Out of curiosity, assuming we'd make the spec match the current QEMU/Linux > implementation at least for the 3 involved features only, would there be a > way to adjust crossvm without any disruption? > > I still have the feeling that it will be rather hard to get that all > implementations match the spec ... For new features+queues it will be easy > to force the usage of fixed virtqueue numbers, but for free-page-hinting and > reporting, it's a mess :( Still thinking about a way to fix drivers... We can discuss this theoretically, maybe?
On 07.04.25 23:20, Michael S. Tsirkin wrote: > On Mon, Apr 07, 2025 at 08:47:05PM +0200, David Hildenbrand wrote: >>> In my opinion, it makes the most sense to keep the spec as it is and >>> change QEMU and the kernel to match, but obviously that's not trivial >>> to do in a way that doesn't break existing devices and drivers. >> >> If only it would be limited to QEMU and Linux ... :) >> >> Out of curiosity, assuming we'd make the spec match the current QEMU/Linux >> implementation at least for the 3 involved features only, would there be a >> way to adjust crossvm without any disruption? >> >> I still have the feeling that it will be rather hard to get that all >> implementations match the spec ... For new features+queues it will be easy >> to force the usage of fixed virtqueue numbers, but for free-page-hinting and >> reporting, it's a mess :( > > > Still thinking about a way to fix drivers... We can discuss this > theoretically, maybe? Yes, absolutely. I took the time to do some more digging; regarding drivers only Linux seems to be problematic. virtio-win, FreeBSD, NetBSD and OpenBSD and don't seem to support problematic features (free page hinting, free page reporting) in their virtio-balloon implementations. So from the known drivers, only Linux is applicable. reporting_vq is either at idx 4/3/2 free_page_vq is either at idx 3/2 statsq is at idx2 (only relevant if the feature is offered) So if we could test for the existence of a virtqueue at an idx easily, we could test from highest-to-smallest idx. But I recall that testing for the existance of a virtqueue on s390x resulted in the problem/deadlock in the first place ...
On Wed, Apr 09, 2025 at 12:46:41PM +0200, David Hildenbrand wrote: > On 07.04.25 23:20, Michael S. Tsirkin wrote: > > On Mon, Apr 07, 2025 at 08:47:05PM +0200, David Hildenbrand wrote: > > > > In my opinion, it makes the most sense to keep the spec as it is and > > > > change QEMU and the kernel to match, but obviously that's not trivial > > > > to do in a way that doesn't break existing devices and drivers. > > > > > > If only it would be limited to QEMU and Linux ... :) > > > > > > Out of curiosity, assuming we'd make the spec match the current QEMU/Linux > > > implementation at least for the 3 involved features only, would there be a > > > way to adjust crossvm without any disruption? > > > > > > I still have the feeling that it will be rather hard to get that all > > > implementations match the spec ... For new features+queues it will be easy > > > to force the usage of fixed virtqueue numbers, but for free-page-hinting and > > > reporting, it's a mess :( > > > > > > Still thinking about a way to fix drivers... We can discuss this > > theoretically, maybe? > > Yes, absolutely. I took the time to do some more digging; regarding drivers > only Linux seems to be problematic. > > virtio-win, FreeBSD, NetBSD and OpenBSD and don't seem to support > problematic features (free page hinting, free page reporting) in their > virtio-balloon implementations. > > So from the known drivers, only Linux is applicable. > > reporting_vq is either at idx 4/3/2 > free_page_vq is either at idx 3/2 > statsq is at idx2 (only relevant if the feature is offered) > > So if we could test for the existence of a virtqueue at an idx easily, we > could test from highest-to-smallest idx. > > But I recall that testing for the existance of a virtqueue on s390x resulted > in the problem/deadlock in the first place ... > > -- > Cheers, > > David / dhildenb So let's talk about a new feature bit? Since vqs are probed after feature negotiation, it looks like we could have a feature bit trigger sane behaviour, right? I kind of dislike it that we have a feature bit for bugs though. What would be a minimal new feature to add so it does not feel wrong? Maybe it's in the field of psychology though ...
On 07.04.25 23:09, Daniel Verkamp wrote: > On Mon, Apr 7, 2025 at 11:47 AM David Hildenbrand <david@redhat.com> wrote: >> >>>>> Heh, but that one said: >>>>> >>>>> +\item[ VIRTIO_BALLOON_F_WS_REPORTING(6) ] The device has support for >>>>> Working Set >>>>> >>>>> Which does not seem to reflect reality ... >>> >>> Please feel free to disregard these features and reuse their bits and >>> queue indexes; as far as I know, they are not actually enabled >>> anywhere currently and the corresponding guest patches were only >>> applied to some (no-longer-used) ChromeOS kernel trees, so the >>> compatibility impact should be minimal. I will also try to clean up >>> the leftover bits on the crosvm side just to clear things up. >> >> Thanks for your reply, and thanks for clarifying+cleaning it up. >> > [...] >>>> IIRC, in that commit they switched to the "spec" behavior. >>>> >>>> That's when they started hard-coding the queue indexes. >>>> >>>> CCing Daniel. All Linux versions should be incompatible with cross-vmm regarding free page reporting. >>>> How is that handled? >>> >>> In practice, it only works because nobody calls crosvm with >>> --balloon-page-reporting (it's off by default), so the balloon device >>> does not advertise the VIRTIO_BALLOON_F_PAGE_REPORTING feature. >>> >>> (I just went searching now, and it does seem like there is actually >>> one user in Android that does try to enable page reporting[1], which >>> I'll have to look into...) >>> >>> In my opinion, it makes the most sense to keep the spec as it is and >>> change QEMU and the kernel to match, but obviously that's not trivial >>> to do in a way that doesn't break existing devices and drivers. >> >> If only it would be limited to QEMU and Linux ... :) >> >> Out of curiosity, assuming we'd make the spec match the current >> QEMU/Linux implementation at least for the 3 involved features only, >> would there be a way to adjust crossvm without any disruption? >> >> I still have the feeling that it will be rather hard to get that all >> implementations match the spec ... For new features+queues it will be >> easy to force the usage of fixed virtqueue numbers, but for >> free-page-hinting and reporting, it's a mess :( > > If the spec is changed, we can certainly update crosvm to match it; I > think this only really affects a few devices (balloon and technically > filesystem, but see below), only affects features that are generally > not turned on, and in many cases, the guest kernel is updated > simultaneously with the crosvm binary. I'm not opposed to changing the > spec to match reality, although that feels like a bad move from a > spec-integrity perspective. Right. We didn't pay attention that the spec would reflect reality, and the reality was a bad decision :) > > Regardless of the chosen path, I think the spec should be clarified - > the meaning of "queue only exists if <feature> is set" leaves the > reader with too many questions: Right, that's what we've been discussing. > - What does "if <feature> is set" mean? If it's advertised by the > device? If it's acked by the driver? (To me, "set" definitely hints at > the latter, but it should be explicit.) Currently it's "feature is offered by the device". > - What does it mean for a virtqueue to "exist"? Does that queue index > disappear from the numbering if it does not exist, sliding all of the > later queues down? Currently it's like that, yes. > If so, the spec should really not have the static > queue numbers listed for the later queues, since they are only correct > if all previous feature-dependent queues were also "set", whatever > that means. Yes, that's also what we've been discussing. And that started restarted the whole "can we fix device/drivers instead" :) > > The way crosvm interpreted this was: > - "if <feature> is set" means "if the device advertised <feature> > *and* driver acknowledged <feature>" > - "queue only exists" means "if <feature> was not acked, the > corresponding virtqueue cannot be enabled by the driver" (attempting > to set queue_enable = 1 has no effect). > - Any later virtqueues are unaffected and still have the same queue indexes. Yes, that matches my understanding. > > The way QEMU interpeted this (I think, just skimming the code and > working from memory here): > - "if <feature> is set" means "if the device advertised <feature>" (it > is checking host_features, not guest_features) Right, "offered features". > - "queue only exists" means "if <feature> was not offered by the > device, all later virtqueues are shifted down one index" Exactly. > > --- > > The spec for the filesystem device has a similar issue to the balloon device: > - Queue 0 (hiprio) is always available regardless of features. > - Queue 1 (notification queue) has a note that "The notification queue > only exists if VIRTIO_FS_F_NOTIFICATION is set." > - Queue 2..n are supposed to be the request queues per the numbering > in the spec. > > This is how it has been specified since virtio 1.2 when the fs device > was originally added. However, the Linux driver (and probably all > existing device implementations - at least virtiofsd and crosvm's fs > device) don't support VIRTIO_FS_F_NOTIFICATION and use queue 1 as a > request queue, which matches the QEMU/Linux interpretation but means > the spec doesn't match reality again. Yes, these are the two known cases we have to sort out. Thanks for the information on crossvm, and that whatever we decide to do, adjusting crossvm should not be a big problem (at least for now :) ).
On 09.04.25 12:56, Michael S. Tsirkin wrote: > On Wed, Apr 09, 2025 at 12:46:41PM +0200, David Hildenbrand wrote: >> On 07.04.25 23:20, Michael S. Tsirkin wrote: >>> On Mon, Apr 07, 2025 at 08:47:05PM +0200, David Hildenbrand wrote: >>>>> In my opinion, it makes the most sense to keep the spec as it is and >>>>> change QEMU and the kernel to match, but obviously that's not trivial >>>>> to do in a way that doesn't break existing devices and drivers. >>>> >>>> If only it would be limited to QEMU and Linux ... :) >>>> >>>> Out of curiosity, assuming we'd make the spec match the current QEMU/Linux >>>> implementation at least for the 3 involved features only, would there be a >>>> way to adjust crossvm without any disruption? >>>> >>>> I still have the feeling that it will be rather hard to get that all >>>> implementations match the spec ... For new features+queues it will be easy >>>> to force the usage of fixed virtqueue numbers, but for free-page-hinting and >>>> reporting, it's a mess :( >>> >>> >>> Still thinking about a way to fix drivers... We can discuss this >>> theoretically, maybe? >> >> Yes, absolutely. I took the time to do some more digging; regarding drivers >> only Linux seems to be problematic. >> >> virtio-win, FreeBSD, NetBSD and OpenBSD and don't seem to support >> problematic features (free page hinting, free page reporting) in their >> virtio-balloon implementations. >> >> So from the known drivers, only Linux is applicable. >> >> reporting_vq is either at idx 4/3/2 >> free_page_vq is either at idx 3/2 >> statsq is at idx2 (only relevant if the feature is offered) >> >> So if we could test for the existence of a virtqueue at an idx easily, we >> could test from highest-to-smallest idx. >> >> But I recall that testing for the existance of a virtqueue on s390x resulted >> in the problem/deadlock in the first place ... >> >> -- >> Cheers, >> >> David / dhildenb > > So let's talk about a new feature bit? Are you thinking about a new feature that switches between "fixed queue indices" and "compressed queue indices", whereby the latter would be the legacy default and we would expect all devices to switch to the new fixed-queue-indices layout? We could make all new features require "fixed-queue-indices". > > Since vqs are probed after feature negotiation, it looks like > we could have a feature bit trigger sane behaviour, right? In the Linux driver, yes. In QEMU (devices), we add the queues when realizing, so we'd need some mechanism to adjust the queue indices based on feature negotiation I guess? For virtio-balloon it might be doable to simply always create+indicate free-page hinting to resolve the issue easily. For virtio-fs it might not be that easy. > > I kind of dislike it that we have a feature bit for bugs though. > What would be a minimal new feature to add so it does not > feel wrong? Probably as above: fixed vs. compressed virtqueue indices?
On Wed, Apr 09, 2025 at 01:12:19PM +0200, David Hildenbrand wrote: > On 09.04.25 12:56, Michael S. Tsirkin wrote: > > On Wed, Apr 09, 2025 at 12:46:41PM +0200, David Hildenbrand wrote: > > > On 07.04.25 23:20, Michael S. Tsirkin wrote: > > > > On Mon, Apr 07, 2025 at 08:47:05PM +0200, David Hildenbrand wrote: > > > > > > In my opinion, it makes the most sense to keep the spec as it is and > > > > > > change QEMU and the kernel to match, but obviously that's not trivial > > > > > > to do in a way that doesn't break existing devices and drivers. > > > > > > > > > > If only it would be limited to QEMU and Linux ... :) > > > > > > > > > > Out of curiosity, assuming we'd make the spec match the current QEMU/Linux > > > > > implementation at least for the 3 involved features only, would there be a > > > > > way to adjust crossvm without any disruption? > > > > > > > > > > I still have the feeling that it will be rather hard to get that all > > > > > implementations match the spec ... For new features+queues it will be easy > > > > > to force the usage of fixed virtqueue numbers, but for free-page-hinting and > > > > > reporting, it's a mess :( > > > > > > > > > > > > Still thinking about a way to fix drivers... We can discuss this > > > > theoretically, maybe? > > > > > > Yes, absolutely. I took the time to do some more digging; regarding drivers > > > only Linux seems to be problematic. > > > > > > virtio-win, FreeBSD, NetBSD and OpenBSD and don't seem to support > > > problematic features (free page hinting, free page reporting) in their > > > virtio-balloon implementations. > > > > > > So from the known drivers, only Linux is applicable. > > > > > > reporting_vq is either at idx 4/3/2 > > > free_page_vq is either at idx 3/2 > > > statsq is at idx2 (only relevant if the feature is offered) > > > > > > So if we could test for the existence of a virtqueue at an idx easily, we > > > could test from highest-to-smallest idx. > > > > > > But I recall that testing for the existance of a virtqueue on s390x resulted > > > in the problem/deadlock in the first place ... > > > > > > -- > > > Cheers, > > > > > > David / dhildenb > > > > So let's talk about a new feature bit? > > Are you thinking about a new feature that switches between "fixed queue > indices" and "compressed queue indices", whereby the latter would be the > legacy default and we would expect all devices to switch to the new > fixed-queue-indices layout? > > We could make all new features require "fixed-queue-indices". I see two ways: 1. we make driver behave correctly with in spec and out of spec devices and we make qemu behave correctly with in spec and out of spec devices 2. a new feature bit I prefer 1, and when we add a new feature we can also document that it should be in spec if negotiated. My question is if 1 is practical. > > > > Since vqs are probed after feature negotiation, it looks like > > we could have a feature bit trigger sane behaviour, right? > > In the Linux driver, yes. In QEMU (devices), we add the queues when > realizing, so we'd need some mechanism to adjust the queue indices based on > feature negotiation I guess? Well we can add queues later, nothing prevents that. > For virtio-balloon it might be doable to simply always create+indicate > free-page hinting to resolve the issue easily. OK, so - for devices, we suggest that basically VIRTIO_BALLOON_F_REPORTING only created with VIRTIO_BALLOON_F_FREE_PAGE_HINT and VIRTIO_BALLOON_F_FREE_PAGE_HINT only created with VIRTIO_BALLOON_F_STATS_VQ I got that. Now, for drivers. If the dependency is satisfied as above, no difference. What should drivers do if not? I think the thing to do would be to first probe spec compliant vq numbers? If not there, try with the non compliant version? However, you wrote: > > > But I recall that testing for the existance of a virtqueue on s390x resulted > > > in the problem/deadlock in the first place ... I think the deadlock was if trying to *use* a non-existent virtqueue? This is qemu code: case CCW_CMD_READ_VQ_CONF: if (check_len) { if (ccw.count != sizeof(vq_config)) { ret = -EINVAL; break; } } else if (ccw.count < sizeof(vq_config)) { /* Can't execute command. */ ret = -EINVAL; break; } if (!ccw.cda) { ret = -EFAULT; } else { ret = ccw_dstream_read(&sch->cds, vq_config.index); if (ret) { break; } vq_config.index = be16_to_cpu(vq_config.index); if (vq_config.index >= VIRTIO_QUEUE_MAX) { ret = -EINVAL; break; } vq_config.num_max = virtio_queue_get_num(vdev, vq_config.index); vq_config.num_max = cpu_to_be16(vq_config.num_max); ret = ccw_dstream_write(&sch->cds, vq_config.num_max); if (!ret) { sch->curr_status.scsw.count = ccw.count - sizeof(vq_config); } } and int virtio_queue_get_num(VirtIODevice *vdev, int n) { return vdev->vq[n].vring.num; } it seems to happily return vq size with no issues? > For virtio-fs it might not be that easy. virtio fs? But it has no features? > > > > I kind of dislike it that we have a feature bit for bugs though. > > What would be a minimal new feature to add so it does not > > feel wrong? > > Probably as above: fixed vs. compressed virtqueue indices? > > -- > Cheers, > > David / dhildenb
On 09.04.25 14:07, Michael S. Tsirkin wrote: > On Wed, Apr 09, 2025 at 01:12:19PM +0200, David Hildenbrand wrote: >> On 09.04.25 12:56, Michael S. Tsirkin wrote: >>> On Wed, Apr 09, 2025 at 12:46:41PM +0200, David Hildenbrand wrote: >>>> On 07.04.25 23:20, Michael S. Tsirkin wrote: >>>>> On Mon, Apr 07, 2025 at 08:47:05PM +0200, David Hildenbrand wrote: >>>>>>> In my opinion, it makes the most sense to keep the spec as it is and >>>>>>> change QEMU and the kernel to match, but obviously that's not trivial >>>>>>> to do in a way that doesn't break existing devices and drivers. >>>>>> >>>>>> If only it would be limited to QEMU and Linux ... :) >>>>>> >>>>>> Out of curiosity, assuming we'd make the spec match the current QEMU/Linux >>>>>> implementation at least for the 3 involved features only, would there be a >>>>>> way to adjust crossvm without any disruption? >>>>>> >>>>>> I still have the feeling that it will be rather hard to get that all >>>>>> implementations match the spec ... For new features+queues it will be easy >>>>>> to force the usage of fixed virtqueue numbers, but for free-page-hinting and >>>>>> reporting, it's a mess :( >>>>> >>>>> >>>>> Still thinking about a way to fix drivers... We can discuss this >>>>> theoretically, maybe? >>>> >>>> Yes, absolutely. I took the time to do some more digging; regarding drivers >>>> only Linux seems to be problematic. >>>> >>>> virtio-win, FreeBSD, NetBSD and OpenBSD and don't seem to support >>>> problematic features (free page hinting, free page reporting) in their >>>> virtio-balloon implementations. >>>> >>>> So from the known drivers, only Linux is applicable. >>>> >>>> reporting_vq is either at idx 4/3/2 >>>> free_page_vq is either at idx 3/2 >>>> statsq is at idx2 (only relevant if the feature is offered) >>>> >>>> So if we could test for the existence of a virtqueue at an idx easily, we >>>> could test from highest-to-smallest idx. >>>> >>>> But I recall that testing for the existance of a virtqueue on s390x resulted >>>> in the problem/deadlock in the first place ... >>>> >>>> -- >>>> Cheers, >>>> >>>> David / dhildenb >>> >>> So let's talk about a new feature bit? >> >> Are you thinking about a new feature that switches between "fixed queue >> indices" and "compressed queue indices", whereby the latter would be the >> legacy default and we would expect all devices to switch to the new >> fixed-queue-indices layout? >> >> We could make all new features require "fixed-queue-indices". > > I see two ways: > 1. we make driver behave correctly with in spec and out of spec devices > and we make qemu behave correctly with in spec and out of spec devices > 2. a new feature bit > > I prefer 1, and when we add a new feature we can also > document that it should be in spec if negotiated. > > My question is if 1 is practical. AFAIKT, 1) implies: virtio-balloon: a) Driver As mentioned above, we'd need a reliable way to test for the existence of a virtqueue, so we can e.g., test for reporting_vq idx 4 -> 3 -> 2 With that we'd be able to support compressed+fixed at the same time. Q: Is it possible/feasible? b) Device: virtio-balloon: we can fake existence of STAT and FREE_PAGE_HINTING easily, such that the compressed layout corresponds to the fixed layout easily. Q: alternatives? We could try creating multiple queues for the same feature, but it's going to be a mess I'm afraid ... virtio-fs: a) Driver Linux does not even implement VIRTIO_FS_F_NOTIFICATION or respect VIRTIO_FS_F_NOTIFICATION when calculating queue indices, ... b) Device Same applies to virtiofsd ... Q: Did anybody actually implement VIRTIO_FS_F_NOTIFICATION ever? If not, can we just remove it from the spec completely and resolve the issue that way?
On Wed, Apr 09, 2025 at 02:24:32PM +0200, David Hildenbrand wrote: > On 09.04.25 14:07, Michael S. Tsirkin wrote: > > On Wed, Apr 09, 2025 at 01:12:19PM +0200, David Hildenbrand wrote: > > > On 09.04.25 12:56, Michael S. Tsirkin wrote: > > > > On Wed, Apr 09, 2025 at 12:46:41PM +0200, David Hildenbrand wrote: > > > > > On 07.04.25 23:20, Michael S. Tsirkin wrote: > > > > > > On Mon, Apr 07, 2025 at 08:47:05PM +0200, David Hildenbrand wrote: > > > > > > > > In my opinion, it makes the most sense to keep the spec as it is and > > > > > > > > change QEMU and the kernel to match, but obviously that's not trivial > > > > > > > > to do in a way that doesn't break existing devices and drivers. > > > > > > > > > > > > > > If only it would be limited to QEMU and Linux ... :) > > > > > > > > > > > > > > Out of curiosity, assuming we'd make the spec match the current QEMU/Linux > > > > > > > implementation at least for the 3 involved features only, would there be a > > > > > > > way to adjust crossvm without any disruption? > > > > > > > > > > > > > > I still have the feeling that it will be rather hard to get that all > > > > > > > implementations match the spec ... For new features+queues it will be easy > > > > > > > to force the usage of fixed virtqueue numbers, but for free-page-hinting and > > > > > > > reporting, it's a mess :( > > > > > > > > > > > > > > > > > > Still thinking about a way to fix drivers... We can discuss this > > > > > > theoretically, maybe? > > > > > > > > > > Yes, absolutely. I took the time to do some more digging; regarding drivers > > > > > only Linux seems to be problematic. > > > > > > > > > > virtio-win, FreeBSD, NetBSD and OpenBSD and don't seem to support > > > > > problematic features (free page hinting, free page reporting) in their > > > > > virtio-balloon implementations. > > > > > > > > > > So from the known drivers, only Linux is applicable. > > > > > > > > > > reporting_vq is either at idx 4/3/2 > > > > > free_page_vq is either at idx 3/2 > > > > > statsq is at idx2 (only relevant if the feature is offered) > > > > > > > > > > So if we could test for the existence of a virtqueue at an idx easily, we > > > > > could test from highest-to-smallest idx. > > > > > > > > > > But I recall that testing for the existance of a virtqueue on s390x resulted > > > > > in the problem/deadlock in the first place ... > > > > > > > > > > -- > > > > > Cheers, > > > > > > > > > > David / dhildenb > > > > > > > > So let's talk about a new feature bit? > > > > > > Are you thinking about a new feature that switches between "fixed queue > > > indices" and "compressed queue indices", whereby the latter would be the > > > legacy default and we would expect all devices to switch to the new > > > fixed-queue-indices layout? > > > > > > We could make all new features require "fixed-queue-indices". > > > > I see two ways: > > 1. we make driver behave correctly with in spec and out of spec devices > > and we make qemu behave correctly with in spec and out of spec devices > > 2. a new feature bit > > > > I prefer 1, and when we add a new feature we can also > > document that it should be in spec if negotiated. > > > > My question is if 1 is practical. > > AFAIKT, 1) implies: > > virtio-balloon: > > a) Driver > > As mentioned above, we'd need a reliable way to test for the existence of a > virtqueue, so we can e.g., test for reporting_vq idx 4 -> 3 -> 2 > > With that we'd be able to support compressed+fixed at the same time. > > Q: Is it possible/feasible? > > b) Device: virtio-balloon: we can fake existence of STAT and > FREE_PAGE_HINTING easily, such that the compressed layout corresponds to the > fixed layout easily. > > Q: alternatives? We could try creating multiple queues for the same feature, > but it's going to be a mess I'm afraid ... > > > virtio-fs: > > a) Driver > > Linux does not even implement VIRTIO_FS_F_NOTIFICATION or respect > VIRTIO_FS_F_NOTIFICATION when calculating queue indices, ... > > b) Device > > Same applies to virtiofsd ... > > Q: Did anybody actually implement VIRTIO_FS_F_NOTIFICATION ever? If not, can > we just remove it from the spec completely and resolve the issue that way? Donnu. Vivek? Or we can check for queue number 1+num_request_queues maybe? If that exists then it is spec compliant? > -- > Cheers, > > David / dhildenb
diff --git a/drivers/s390/virtio/virtio_ccw.c b/drivers/s390/virtio/virtio_ccw.c index 21fa7ac849e5c..4904b831c0a75 100644 --- a/drivers/s390/virtio/virtio_ccw.c +++ b/drivers/s390/virtio/virtio_ccw.c @@ -302,11 +302,17 @@ static struct airq_info *new_airq_info(int index) static unsigned long *get_airq_indicator(struct virtqueue *vqs[], int nvqs, u64 *first, void **airq_info) { - int i, j; + int i, j, queue_idx, highest_queue_idx = -1; struct airq_info *info; unsigned long *indicator_addr = NULL; unsigned long bit, flags; + /* Array entries without an actual queue pointer must be ignored. */ + for (i = 0; i < nvqs; i++) { + if (vqs[i]) + highest_queue_idx++; + } + for (i = 0; i < MAX_AIRQ_AREAS && !indicator_addr; i++) { mutex_lock(&airq_areas_lock); if (!airq_areas[i]) @@ -316,7 +322,7 @@ static unsigned long *get_airq_indicator(struct virtqueue *vqs[], int nvqs, if (!info) return NULL; write_lock_irqsave(&info->lock, flags); - bit = airq_iv_alloc(info->aiv, nvqs); + bit = airq_iv_alloc(info->aiv, highest_queue_idx + 1); if (bit == -1UL) { /* Not enough vacancies. */ write_unlock_irqrestore(&info->lock, flags); @@ -325,8 +331,10 @@ static unsigned long *get_airq_indicator(struct virtqueue *vqs[], int nvqs, *first = bit; *airq_info = info; indicator_addr = info->aiv->vector; - for (j = 0; j < nvqs; j++) { - airq_iv_set_ptr(info->aiv, bit + j, + for (j = 0, queue_idx = 0; j < nvqs; j++) { + if (!vqs[j]) + continue; + airq_iv_set_ptr(info->aiv, bit + queue_idx++, (unsigned long)vqs[j]); } write_unlock_irqrestore(&info->lock, flags);
If we finds a vq without a name in our input array in virtio_ccw_find_vqs(), we treat it as "non-existing" and set the vq pointer to NULL; we will not call virtio_ccw_setup_vq() to allocate/setup a vq. Consequently, we create only a queue if it actually exists (name != NULL) and assign an incremental queue index to each such existing queue. However, in virtio_ccw_register_adapter_ind()->get_airq_indicator() we will not ignore these "non-existing queues", but instead assign an airq indicator to them. Besides never releasing them in virtio_ccw_drop_indicators() (because there is no virtqueue), the bigger issue seems to be that there will be a disagreement between the device and the Linux guest about the airq indicator to be used for notifying a queue, because the indicator bit for adapter I/O interrupt is derived from the queue index. The virtio spec states under "Setting Up Two-Stage Queue Indicators": ... indicator contains the guest address of an area wherein the indicators for the devices are contained, starting at bit_nr, one bit per virtqueue of the device. And further in "Notification via Adapter I/O Interrupts": For notifying the driver of virtqueue buffers, the device sets the bit in the guest-provided indicator area at the corresponding offset. For example, QEMU uses in virtio_ccw_notify() the queue index (passed as "vector") to select the relevant indicator bit. If a queue does not exist, it does not have a corresponding indicator bit assigned, because it effectively doesn't have a queue index. Using a virtio-balloon-ccw device under QEMU with free-page-hinting disabled ("free-page-hint=off") but free-page-reporting enabled ("free-page-reporting=on") will result in free page reporting not working as expected: in the virtio_balloon driver, we'll be stuck forever in virtballoon_free_page_report()->wait_event(), because the waitqueue will not be woken up as the notification from the device is lost: it would use the wrong indicator bit. Free page reporting stops working and we get splats (when configured to detect hung wqs) like: INFO: task kworker/1:3:463 blocked for more than 61 seconds. Not tainted 6.14.0 #4 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/1:3 [...] Workqueue: events page_reporting_process Call Trace: [<000002f404e6dfb2>] __schedule+0x402/0x1640 [<000002f404e6f22e>] schedule+0x3e/0xe0 [<000002f3846a88fa>] virtballoon_free_page_report+0xaa/0x110 [virtio_balloon] [<000002f40435c8a4>] page_reporting_process+0x2e4/0x740 [<000002f403fd3ee2>] process_one_work+0x1c2/0x400 [<000002f403fd4b96>] worker_thread+0x296/0x420 [<000002f403fe10b4>] kthread+0x124/0x290 [<000002f403f4e0dc>] __ret_from_fork+0x3c/0x60 [<000002f404e77272>] ret_from_fork+0xa/0x38 There was recently a discussion [1] whether the "holes" should be treated differently again, effectively assigning also non-existing queues a queue index: that should also fix the issue, but requires other workarounds to not break existing setups. Let's fix it without affecting existing setups for now by properly ignoring the non-existing queues, so the indicator bits will match the queue indexes. [1] https://lore.kernel.org/all/cover.1720611677.git.mst@redhat.com/ Fixes: a229989d975e ("virtio: don't allocate vqs when names[i] = NULL") Reported-by: Chandra Merla <cmerla@redhat.com> Cc: <Stable@vger.kernel.org> Cc: Cornelia Huck <cohuck@redhat.com> Cc: Thomas Huth <thuth@redhat.com> Cc: Halil Pasic <pasic@linux.ibm.com> Cc: Eric Farman <farman@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Wei Wang <wei.w.wang@intel.com> Signed-off-by: David Hildenbrand <david@redhat.com> --- drivers/s390/virtio/virtio_ccw.c | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-)