[v2,1/3] x86/msi: passthrough all MSI-X vector ctrl writes to device model

Message ID	20230325024924.882883-1-marmarek@invisiblethingslab.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <xen-devel-bounces@lists.xenproject.org> Errors-To: xen-devel-bounces@lists.xenproject.org Precedence: list Sender: "Xen-devel" <xen-devel-bounces@lists.xenproject.org> Feedback-ID: i1568416f:Fastmail From: =?utf-8?q?Marek_Marczykowski-G=C3=B3recki?= <marmarek@invisiblethingslab.com> To: xen-devel@lists.xenproject.org Cc: Jason Andryuk <jandryuk@gmail.com>, =?utf-8?q?Marek_Marczykowski-G=C3=B3?= =?utf-8?q?recki?= <marmarek@invisiblethingslab.com>, Jan Beulich <jbeulich@suse.com>, Andrew Cooper <andrew.cooper3@citrix.com>, =?utf-8?q?Roger_Pau_Monn=C3=A9?= <roger.pau@citrix.com>, Wei Liu <wl@xen.org> Subject: [PATCH v2 1/3] x86/msi: passthrough all MSI-X vector ctrl writes to device model Date: Sat, 25 Mar 2023 03:49:22 +0100 Message-Id: <20230325024924.882883-1-marmarek@invisiblethingslab.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit
Series	[v2,1/3] x86/msi: passthrough all MSI-X vector ctrl writes to device model \| expand [v2,1/3] x86/msi: passthrough all MSI-X vector ctrl writes to device model [v2,2/3] x86/hvm: Allow writes to registers on the same page as MSI-X table [v2,3/3] x86/msi: clear initial MSI-X state on boot

Marek Marczykowski-Górecki March 25, 2023, 2:49 a.m. UTC

QEMU needs to know whether clearing maskbit of a vector is really
clearing, or was already cleared before. Currently Xen sends only
clearing that bit to the device model, but not setting it, so QEMU
cannot detect it. Because of that, QEMU is working this around by
checking via /dev/mem, but that isn't the proper approach.

Give all necessary information to QEMU by passing all ctrl writes,
including masking a vector. This does include forwarding also writes
that did not change the value, but as tested on both Linux (6.1.12) and
Windows (10 pro), they don't do excessive writes of unchanged values
(Windows seems to clear maskbit in some cases twice, but not more).

Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
---
v2:
 - passthrough quad writes to emulator too (Jan)
 - (ab)use len==0 for write len=4 completion (Jan), but add descriptive
   #define for this magic value

This behavior change needs to be surfaced to the device model somehow,
so it knows whether it can rely on it. I'm open for suggestions.
---
 xen/arch/x86/hvm/vmsi.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

Roger Pau Monné March 27, 2023, 10:12 a.m. UTC | #1

On Sat, Mar 25, 2023 at 03:49:22AM +0100, Marek Marczykowski-Górecki wrote:
> QEMU needs to know whether clearing maskbit of a vector is really
> clearing, or was already cleared before. Currently Xen sends only
> clearing that bit to the device model, but not setting it, so QEMU
> cannot detect it. Because of that, QEMU is working this around by
> checking via /dev/mem, but that isn't the proper approach.
> 
> Give all necessary information to QEMU by passing all ctrl writes,
> including masking a vector. This does include forwarding also writes
> that did not change the value, but as tested on both Linux (6.1.12) and
> Windows (10 pro), they don't do excessive writes of unchanged values
> (Windows seems to clear maskbit in some cases twice, but not more).

Since we passthrough all the accesses to the device model, is the
handling in Xen still required?  It might be worth to also expose any
interfaces needed to the device model so all the functionality done by
the msixtbl_mmio_ops hooks could be done by QEMU, since we end up
passing the accesses anyway.

> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> ---
> v2:
>  - passthrough quad writes to emulator too (Jan)
>  - (ab)use len==0 for write len=4 completion (Jan), but add descriptive
>    #define for this magic value
> 
> This behavior change needs to be surfaced to the device model somehow,
> so it knows whether it can rely on it. I'm open for suggestions.

Maybe exposed in XEN_DMOP_get_ioreq_server_info?

But I wonder whether it shouldn't be the other way arround, the device
model tells Xen it doesn't need to handle any MSI-X accesses because
QEMU will take care of it, likely using a new flag in
XEN_DMOP_create_ioreq_server or maybe in XEN_DOMCTL_bind_pt_irq as
part of the gflags, but then we would need to assert that the flag is
passed for all MSI-X interrupts bound from that device to the same
domain.

Thanks, Roger.

Marek Marczykowski-Górecki March 27, 2023, 10:26 a.m. UTC | #2

On Mon, Mar 27, 2023 at 12:12:29PM +0200, Roger Pau Monné wrote:
> On Sat, Mar 25, 2023 at 03:49:22AM +0100, Marek Marczykowski-Górecki wrote:
> > QEMU needs to know whether clearing maskbit of a vector is really
> > clearing, or was already cleared before. Currently Xen sends only
> > clearing that bit to the device model, but not setting it, so QEMU
> > cannot detect it. Because of that, QEMU is working this around by
> > checking via /dev/mem, but that isn't the proper approach.
> > 
> > Give all necessary information to QEMU by passing all ctrl writes,
> > including masking a vector. This does include forwarding also writes
> > that did not change the value, but as tested on both Linux (6.1.12) and
> > Windows (10 pro), they don't do excessive writes of unchanged values
> > (Windows seems to clear maskbit in some cases twice, but not more).
> 
> Since we passthrough all the accesses to the device model, is the
> handling in Xen still required?  It might be worth to also expose any
> interfaces needed to the device model so all the functionality done by
> the msixtbl_mmio_ops hooks could be done by QEMU, since we end up
> passing the accesses anyway.

This was discussed on v1 already. Such QEMU would need to be able to do
the actual write. If it's running in stubdomain, it would hit the exact
issue again (page mapped R/O to it). In fact, that might be an issue for
dom0 too (I haven't checked).
I guess that could use my subpage RO feature I just posted then, but it
would still mean intercepting the write twice (not a performance issue
really here, but rather convoluted handling in total).

> > Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> > ---
> > v2:
> >  - passthrough quad writes to emulator too (Jan)
> >  - (ab)use len==0 for write len=4 completion (Jan), but add descriptive
> >    #define for this magic value
> > 
> > This behavior change needs to be surfaced to the device model somehow,
> > so it knows whether it can rely on it. I'm open for suggestions.
> 
> Maybe exposed in XEN_DMOP_get_ioreq_server_info?
> 
> But I wonder whether it shouldn't be the other way arround, the device
> model tells Xen it doesn't need to handle any MSI-X accesses because
> QEMU will take care of it, likely using a new flag in
> XEN_DMOP_create_ioreq_server or maybe in XEN_DOMCTL_bind_pt_irq as
> part of the gflags, but then we would need to assert that the flag is
> passed for all MSI-X interrupts bound from that device to the same
> domain.

Is is safe thing to do? I mean, doesn't Xen need to guard access to
MSI-X configuration to assure its safety, especially if no interrupt
remapping is there? It probably doesn't matter for qemu in dom0 case,
but both with deprivileged qemu and stubdom, it might matter.

Roger Pau Monné March 27, 2023, 10:51 a.m. UTC | #3

On Mon, Mar 27, 2023 at 12:26:05PM +0200, Marek Marczykowski-Górecki wrote:
> On Mon, Mar 27, 2023 at 12:12:29PM +0200, Roger Pau Monné wrote:
> > On Sat, Mar 25, 2023 at 03:49:22AM +0100, Marek Marczykowski-Górecki wrote:
> > > QEMU needs to know whether clearing maskbit of a vector is really
> > > clearing, or was already cleared before. Currently Xen sends only
> > > clearing that bit to the device model, but not setting it, so QEMU
> > > cannot detect it. Because of that, QEMU is working this around by
> > > checking via /dev/mem, but that isn't the proper approach.
> > > 
> > > Give all necessary information to QEMU by passing all ctrl writes,
> > > including masking a vector. This does include forwarding also writes
> > > that did not change the value, but as tested on both Linux (6.1.12) and
> > > Windows (10 pro), they don't do excessive writes of unchanged values
> > > (Windows seems to clear maskbit in some cases twice, but not more).
> > 
> > Since we passthrough all the accesses to the device model, is the
> > handling in Xen still required?  It might be worth to also expose any
> > interfaces needed to the device model so all the functionality done by
> > the msixtbl_mmio_ops hooks could be done by QEMU, since we end up
> > passing the accesses anyway.
> 
> This was discussed on v1 already. Such QEMU would need to be able to do
> the actual write. If it's running in stubdomain, it would hit the exact
> issue again (page mapped R/O to it). In fact, that might be an issue for
> dom0 too (I haven't checked).

Oh, sorry, likely missed that discussion, as I don't recall this.

Maybe we need an hypercall for QEMU to notify the masking/unmasking to
Xen?  As any change on the other fields is already handled by QEMU.

> I guess that could use my subpage RO feature I just posted then, but it
> would still mean intercepting the write twice (not a performance issue
> really here, but rather convoluted handling in total).

Yes, that does seem way too convoluted.

> > > Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> > > ---
> > > v2:
> > >  - passthrough quad writes to emulator too (Jan)
> > >  - (ab)use len==0 for write len=4 completion (Jan), but add descriptive
> > >    #define for this magic value
> > > 
> > > This behavior change needs to be surfaced to the device model somehow,
> > > so it knows whether it can rely on it. I'm open for suggestions.
> > 
> > Maybe exposed in XEN_DMOP_get_ioreq_server_info?
> > 
> > But I wonder whether it shouldn't be the other way arround, the device
> > model tells Xen it doesn't need to handle any MSI-X accesses because
> > QEMU will take care of it, likely using a new flag in
> > XEN_DMOP_create_ioreq_server or maybe in XEN_DOMCTL_bind_pt_irq as
> > part of the gflags, but then we would need to assert that the flag is
> > passed for all MSI-X interrupts bound from that device to the same
> > domain.
> 
> Is is safe thing to do? I mean, doesn't Xen need to guard access to
> MSI-X configuration to assure its safety, especially if no interrupt
> remapping is there? It probably doesn't matter for qemu in dom0 case,
> but both with deprivileged qemu and stubdom, it might matter.

Right - QEMU shouldn't write directly to the MSI-X table using
/dev/mem, but instead use an hypercall to notify Xen of the
{un,}masking of the MSI-X table entry.  I think that would allow us to
safely get rid of the extra logic in Xen to deal with MSI-X table
accesses.

Thanks, Roger.

Marek Marczykowski-Górecki March 27, 2023, 11:34 a.m. UTC | #4

On Mon, Mar 27, 2023 at 12:51:09PM +0200, Roger Pau Monné wrote:
> On Mon, Mar 27, 2023 at 12:26:05PM +0200, Marek Marczykowski-Górecki wrote:
> > On Mon, Mar 27, 2023 at 12:12:29PM +0200, Roger Pau Monné wrote:
> > > On Sat, Mar 25, 2023 at 03:49:22AM +0100, Marek Marczykowski-Górecki wrote:
> > > > QEMU needs to know whether clearing maskbit of a vector is really
> > > > clearing, or was already cleared before. Currently Xen sends only
> > > > clearing that bit to the device model, but not setting it, so QEMU
> > > > cannot detect it. Because of that, QEMU is working this around by
> > > > checking via /dev/mem, but that isn't the proper approach.
> > > > 
> > > > Give all necessary information to QEMU by passing all ctrl writes,
> > > > including masking a vector. This does include forwarding also writes
> > > > that did not change the value, but as tested on both Linux (6.1.12) and
> > > > Windows (10 pro), they don't do excessive writes of unchanged values
> > > > (Windows seems to clear maskbit in some cases twice, but not more).
> > > 
> > > Since we passthrough all the accesses to the device model, is the
> > > handling in Xen still required?  It might be worth to also expose any
> > > interfaces needed to the device model so all the functionality done by
> > > the msixtbl_mmio_ops hooks could be done by QEMU, since we end up
> > > passing the accesses anyway.
> > 
> > This was discussed on v1 already. Such QEMU would need to be able to do
> > the actual write. If it's running in stubdomain, it would hit the exact
> > issue again (page mapped R/O to it). In fact, that might be an issue for
> > dom0 too (I haven't checked).
> 
> Oh, sorry, likely missed that discussion, as I don't recall this.
> 
> Maybe we need an hypercall for QEMU to notify the masking/unmasking to
> Xen?  As any change on the other fields is already handled by QEMU.
> 
> > I guess that could use my subpage RO feature I just posted then, but it
> > would still mean intercepting the write twice (not a performance issue
> > really here, but rather convoluted handling in total).
> 
> Yes, that does seem way too convoluted.
> 
> > > > Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> > > > ---
> > > > v2:
> > > >  - passthrough quad writes to emulator too (Jan)
> > > >  - (ab)use len==0 for write len=4 completion (Jan), but add descriptive
> > > >    #define for this magic value
> > > > 
> > > > This behavior change needs to be surfaced to the device model somehow,
> > > > so it knows whether it can rely on it. I'm open for suggestions.
> > > 
> > > Maybe exposed in XEN_DMOP_get_ioreq_server_info?

Make flags IN/OUT parameter (and not reuse the same bits)? Or introduce
new field?

> > > 
> > > But I wonder whether it shouldn't be the other way arround, the device
> > > model tells Xen it doesn't need to handle any MSI-X accesses because
> > > QEMU will take care of it, likely using a new flag in
> > > XEN_DMOP_create_ioreq_server or maybe in XEN_DOMCTL_bind_pt_irq as
> > > part of the gflags, but then we would need to assert that the flag is
> > > passed for all MSI-X interrupts bound from that device to the same
> > > domain.
> > 
> > Is is safe thing to do? I mean, doesn't Xen need to guard access to
> > MSI-X configuration to assure its safety, especially if no interrupt
> > remapping is there? It probably doesn't matter for qemu in dom0 case,
> > but both with deprivileged qemu and stubdom, it might matter.
> 
> Right - QEMU shouldn't write directly to the MSI-X table using
> /dev/mem, but instead use an hypercall to notify Xen of the
> {un,}masking of the MSI-X table entry.  I think that would allow us to
> safely get rid of the extra logic in Xen to deal with MSI-X table
> accesses.

But the purpose of this series is to give guest (or QEMU) more write
access to the MSI-X page, not less. If it wouldn't be this Intel AX
wifi, indeed we could translate everything to hypercalls in QEMU and not
worry about special handlers in Xen at all. But unfortunately, we do
need to handle writes to the same page outside of the MSI-X structures
and QEMU can't be trusted with properly filtering them (and otherwise
given full write access to the page). 

So, while I agree translating {un,}masking individual vectors to
hypercalls could simplify MSI-X handling in general, I don't think it
helps in this particular case. That said, such simplification would
involve:
1. Exposing the capability in Xen to the qemu
(XEN_DMOP_get_ioreq_server_info sounds reasonable).
2. QEMU notifying Xen it will handle masking too, somehow.
3. QEMU using xc_domain_update_msi_irq and XEN_DOMCTL_VMSI_X86_UNMASKED
to update Xen about the mask state too.
4. Xen no longer interpreting writes to mask bit, but still intercepting
them to passthorugh those outside of MSI-X structures (the other patch
in the series). But the handler would still need to stay, to keep
working with older QEMU versions.

Roger Pau Monné March 27, 2023, 1:29 p.m. UTC | #5

On Mon, Mar 27, 2023 at 01:34:23PM +0200, Marek Marczykowski-Górecki wrote:
> On Mon, Mar 27, 2023 at 12:51:09PM +0200, Roger Pau Monné wrote:
> > On Mon, Mar 27, 2023 at 12:26:05PM +0200, Marek Marczykowski-Górecki wrote:
> > > On Mon, Mar 27, 2023 at 12:12:29PM +0200, Roger Pau Monné wrote:
> > > > On Sat, Mar 25, 2023 at 03:49:22AM +0100, Marek Marczykowski-Górecki wrote:
> > > > > QEMU needs to know whether clearing maskbit of a vector is really
> > > > > clearing, or was already cleared before. Currently Xen sends only
> > > > > clearing that bit to the device model, but not setting it, so QEMU
> > > > > cannot detect it. Because of that, QEMU is working this around by
> > > > > checking via /dev/mem, but that isn't the proper approach.
> > > > > 
> > > > > Give all necessary information to QEMU by passing all ctrl writes,
> > > > > including masking a vector. This does include forwarding also writes
> > > > > that did not change the value, but as tested on both Linux (6.1.12) and
> > > > > Windows (10 pro), they don't do excessive writes of unchanged values
> > > > > (Windows seems to clear maskbit in some cases twice, but not more).
> > > > 
> > > > Since we passthrough all the accesses to the device model, is the
> > > > handling in Xen still required?  It might be worth to also expose any
> > > > interfaces needed to the device model so all the functionality done by
> > > > the msixtbl_mmio_ops hooks could be done by QEMU, since we end up
> > > > passing the accesses anyway.
> > > 
> > > This was discussed on v1 already. Such QEMU would need to be able to do
> > > the actual write. If it's running in stubdomain, it would hit the exact
> > > issue again (page mapped R/O to it). In fact, that might be an issue for
> > > dom0 too (I haven't checked).
> > 
> > Oh, sorry, likely missed that discussion, as I don't recall this.
> > 
> > Maybe we need an hypercall for QEMU to notify the masking/unmasking to
> > Xen?  As any change on the other fields is already handled by QEMU.
> > 
> > > I guess that could use my subpage RO feature I just posted then, but it
> > > would still mean intercepting the write twice (not a performance issue
> > > really here, but rather convoluted handling in total).
> > 
> > Yes, that does seem way too convoluted.
> > 
> > > > > Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> > > > > ---
> > > > > v2:
> > > > >  - passthrough quad writes to emulator too (Jan)
> > > > >  - (ab)use len==0 for write len=4 completion (Jan), but add descriptive
> > > > >    #define for this magic value
> > > > > 
> > > > > This behavior change needs to be surfaced to the device model somehow,
> > > > > so it knows whether it can rely on it. I'm open for suggestions.
> > > > 
> > > > Maybe exposed in XEN_DMOP_get_ioreq_server_info?
> 
> Make flags IN/OUT parameter (and not reuse the same bits)? Or introduce
> new field?

I think it would be fine to make it IN/OUT, but see below.

> > > > 
> > > > But I wonder whether it shouldn't be the other way arround, the device
> > > > model tells Xen it doesn't need to handle any MSI-X accesses because
> > > > QEMU will take care of it, likely using a new flag in
> > > > XEN_DMOP_create_ioreq_server or maybe in XEN_DOMCTL_bind_pt_irq as
> > > > part of the gflags, but then we would need to assert that the flag is
> > > > passed for all MSI-X interrupts bound from that device to the same
> > > > domain.
> > > 
> > > Is is safe thing to do? I mean, doesn't Xen need to guard access to
> > > MSI-X configuration to assure its safety, especially if no interrupt
> > > remapping is there? It probably doesn't matter for qemu in dom0 case,
> > > but both with deprivileged qemu and stubdom, it might matter.
> > 
> > Right - QEMU shouldn't write directly to the MSI-X table using
> > /dev/mem, but instead use an hypercall to notify Xen of the
> > {un,}masking of the MSI-X table entry.  I think that would allow us to
> > safely get rid of the extra logic in Xen to deal with MSI-X table
> > accesses.
> 
> But the purpose of this series is to give guest (or QEMU) more write
> access to the MSI-X page, not less.

Right, but there are two independent issues here: one is the
propagation of the MSIX entry mask state to the device model, the
other is allowing guest accesses to MMIO regions adjacent to the MSIX
table.

> If it wouldn't be this Intel AX
> wifi, indeed we could translate everything to hypercalls in QEMU and not
> worry about special handlers in Xen at all. But unfortunately, we do
> need to handle writes to the same page outside of the MSI-X structures
> and QEMU can't be trusted with properly filtering them (and otherwise
> given full write access to the page).

Indeed, but IMO it would be helpful if we could avoid this split
handling of MSIX entries, where Xen handles entry mask/unmask, and
QEMU handles entry setup.  It makes the handling logic very
complicated, and more likely to be buggy (as you have probably
discovered).

Having QEMU always handle accesses to the MSI-X table would make
things simpler, and we could get rid of a huge amount of logic and
entry tracking in msixtbl_mmio_ops.

Then, we would only need to detect where an access falls into the same
page as the MSI-X (or PBA() tables, but outside of those, and forward
it to the underlying hardware, but that's a fairly simple piece of
logic, and completely detached from all the MSI-X entry tracking that
Xen currently does.

> So, while I agree translating {un,}masking individual vectors to
> hypercalls could simplify MSI-X handling in general, I don't think it
> helps in this particular case. That said, such simplification would
> involve:
> 1. Exposing the capability in Xen to the qemu
> (XEN_DMOP_get_ioreq_server_info sounds reasonable).
> 2. QEMU notifying Xen it will handle masking too, somehow.

I think it's possible we could get away with adding a new flag bit to
xen_domctl_bind_pt_irq, like: XEN_DOMCTL_VMSI_X86_MASK_HANDLING that
would tell Xen that QEMU will handle the mask bit for this entry.

QEMU using this flag should be prepared to handle the mask bit, but if
Xen doesn't know the flag it will keep processing the mask bit.

> 3. QEMU using xc_domain_update_msi_irq and XEN_DOMCTL_VMSI_X86_UNMASKED
> to update Xen about the mask state too.
> 4. Xen no longer interpreting writes to mask bit, but still intercepting
> them to passthorugh those outside of MSI-X structures (the other patch
> in the series). But the handler would still need to stay, to keep
> working with older QEMU versions.

Xen would need to intercept writes to the page(s) containing the MSI-X
table in any case, but the logic is much simpler if it just needs to
decide whether the accesses fall inside of the table region, and thus
needs to be forwarded to the device model, or fails outside of it and
needs to be propagated to the real address.

While true that we won't be able to remove the code that partially
handles MSIX entries for guests in Xen, it would be unused for newer
versions of QEMU, hopefully making the handling far more consistent as
the logic will be entirely in QEMU rather than split between Xen and
QEMU.

Thanks, Roger.

Marek Marczykowski-Górecki March 27, 2023, 2:20 p.m. UTC | #6

On Mon, Mar 27, 2023 at 03:29:55PM +0200, Roger Pau Monné wrote:
> On Mon, Mar 27, 2023 at 01:34:23PM +0200, Marek Marczykowski-Górecki wrote:
> > On Mon, Mar 27, 2023 at 12:51:09PM +0200, Roger Pau Monné wrote:
> > > On Mon, Mar 27, 2023 at 12:26:05PM +0200, Marek Marczykowski-Górecki wrote:
> > > > On Mon, Mar 27, 2023 at 12:12:29PM +0200, Roger Pau Monné wrote:
> > > > > On Sat, Mar 25, 2023 at 03:49:22AM +0100, Marek Marczykowski-Górecki wrote:
> > > > > > QEMU needs to know whether clearing maskbit of a vector is really
> > > > > > clearing, or was already cleared before. Currently Xen sends only
> > > > > > clearing that bit to the device model, but not setting it, so QEMU
> > > > > > cannot detect it. Because of that, QEMU is working this around by
> > > > > > checking via /dev/mem, but that isn't the proper approach.
> > > > > > 
> > > > > > Give all necessary information to QEMU by passing all ctrl writes,
> > > > > > including masking a vector. This does include forwarding also writes
> > > > > > that did not change the value, but as tested on both Linux (6.1.12) and
> > > > > > Windows (10 pro), they don't do excessive writes of unchanged values
> > > > > > (Windows seems to clear maskbit in some cases twice, but not more).
> > > > > 
> > > > > Since we passthrough all the accesses to the device model, is the
> > > > > handling in Xen still required?  It might be worth to also expose any
> > > > > interfaces needed to the device model so all the functionality done by
> > > > > the msixtbl_mmio_ops hooks could be done by QEMU, since we end up
> > > > > passing the accesses anyway.
> > > > 
> > > > This was discussed on v1 already. Such QEMU would need to be able to do
> > > > the actual write. If it's running in stubdomain, it would hit the exact
> > > > issue again (page mapped R/O to it). In fact, that might be an issue for
> > > > dom0 too (I haven't checked).
> > > 
> > > Oh, sorry, likely missed that discussion, as I don't recall this.
> > > 
> > > Maybe we need an hypercall for QEMU to notify the masking/unmasking to
> > > Xen?  As any change on the other fields is already handled by QEMU.
> > > 
> > > > I guess that could use my subpage RO feature I just posted then, but it
> > > > would still mean intercepting the write twice (not a performance issue
> > > > really here, but rather convoluted handling in total).
> > > 
> > > Yes, that does seem way too convoluted.
> > > 
> > > > > > Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> > > > > > ---
> > > > > > v2:
> > > > > >  - passthrough quad writes to emulator too (Jan)
> > > > > >  - (ab)use len==0 for write len=4 completion (Jan), but add descriptive
> > > > > >    #define for this magic value
> > > > > > 
> > > > > > This behavior change needs to be surfaced to the device model somehow,
> > > > > > so it knows whether it can rely on it. I'm open for suggestions.
> > > > > 
> > > > > Maybe exposed in XEN_DMOP_get_ioreq_server_info?
> > 
> > Make flags IN/OUT parameter (and not reuse the same bits)? Or introduce
> > new field?
> 
> I think it would be fine to make it IN/OUT, but see below.
> 
> > > > > 
> > > > > But I wonder whether it shouldn't be the other way arround, the device
> > > > > model tells Xen it doesn't need to handle any MSI-X accesses because
> > > > > QEMU will take care of it, likely using a new flag in
> > > > > XEN_DMOP_create_ioreq_server or maybe in XEN_DOMCTL_bind_pt_irq as
> > > > > part of the gflags, but then we would need to assert that the flag is
> > > > > passed for all MSI-X interrupts bound from that device to the same
> > > > > domain.
> > > > 
> > > > Is is safe thing to do? I mean, doesn't Xen need to guard access to
> > > > MSI-X configuration to assure its safety, especially if no interrupt
> > > > remapping is there? It probably doesn't matter for qemu in dom0 case,
> > > > but both with deprivileged qemu and stubdom, it might matter.
> > > 
> > > Right - QEMU shouldn't write directly to the MSI-X table using
> > > /dev/mem, but instead use an hypercall to notify Xen of the
> > > {un,}masking of the MSI-X table entry.  I think that would allow us to
> > > safely get rid of the extra logic in Xen to deal with MSI-X table
> > > accesses.
> > 
> > But the purpose of this series is to give guest (or QEMU) more write
> > access to the MSI-X page, not less.
> 
> Right, but there are two independent issues here: one is the
> propagation of the MSIX entry mask state to the device model, the
> other is allowing guest accesses to MMIO regions adjacent to the MSIX
> table.
> 
> > If it wouldn't be this Intel AX
> > wifi, indeed we could translate everything to hypercalls in QEMU and not
> > worry about special handlers in Xen at all. But unfortunately, we do
> > need to handle writes to the same page outside of the MSI-X structures
> > and QEMU can't be trusted with properly filtering them (and otherwise
> > given full write access to the page).
> 
> Indeed, but IMO it would be helpful if we could avoid this split
> handling of MSIX entries, where Xen handles entry mask/unmask, and
> QEMU handles entry setup.  It makes the handling logic very
> complicated, and more likely to be buggy (as you have probably
> discovered).
> 
> Having QEMU always handle accesses to the MSI-X table would make
> things simpler, and we could get rid of a huge amount of logic and
> entry tracking in msixtbl_mmio_ops.
> 
> Then, we would only need to detect where an access falls into the same
> page as the MSI-X (or PBA() tables, but outside of those, and forward
> it to the underlying hardware, but that's a fairly simple piece of
> logic, and completely detached from all the MSI-X entry tracking that
> Xen currently does.
> 
> > So, while I agree translating {un,}masking individual vectors to
> > hypercalls could simplify MSI-X handling in general, I don't think it
> > helps in this particular case. That said, such simplification would
> > involve:
> > 1. Exposing the capability in Xen to the qemu
> > (XEN_DMOP_get_ioreq_server_info sounds reasonable).
> > 2. QEMU notifying Xen it will handle masking too, somehow.
> 
> I think it's possible we could get away with adding a new flag bit to
> xen_domctl_bind_pt_irq, like: XEN_DOMCTL_VMSI_X86_MASK_HANDLING that
> would tell Xen that QEMU will handle the mask bit for this entry.

Technically, for Xen to not care about those writes, it would need to
observe this flag on all vectors, including those not mapped yet. In
practice though, I think it might be okay to say device model should set
XEN_DOMCTL_VMSI_X86_MASK_HANDLING flag consistently (either on none of
them, or all of them), and Xen can rely on it (if one vector has
XEN_DOMCTL_VMSI_X86_MASK_HANDLING, then assume all of them will have
it).

> QEMU using this flag should be prepared to handle the mask bit, but if
> Xen doesn't know the flag it will keep processing the mask bit.
> 
> > 3. QEMU using xc_domain_update_msi_irq and XEN_DOMCTL_VMSI_X86_UNMASKED
> > to update Xen about the mask state too.
> > 4. Xen no longer interpreting writes to mask bit, but still intercepting
> > them to passthorugh those outside of MSI-X structures (the other patch
> > in the series). But the handler would still need to stay, to keep
> > working with older QEMU versions.
> 
> Xen would need to intercept writes to the page(s) containing the MSI-X
> table in any case, but the logic is much simpler if it just needs to
> decide whether the accesses fall inside of the table region, and thus
> needs to be forwarded to the device model, or fails outside of it and
> needs to be propagated to the real address.
> 
> While true that we won't be able to remove the code that partially
> handles MSIX entries for guests in Xen, it would be unused for newer
> versions of QEMU, hopefully making the handling far more consistent as
> the logic will be entirely in QEMU rather than split between Xen and
> QEMU.

In fact, it was easier for me to register a separate ioreq server for
writes outside of the MSI-X table. But I'm afraid the current one would
need to stay registered (just not accepting writes).

Roger Pau Monné March 27, 2023, 2:37 p.m. UTC | #7

On Mon, Mar 27, 2023 at 04:20:45PM +0200, Marek Marczykowski-Górecki wrote:
> On Mon, Mar 27, 2023 at 03:29:55PM +0200, Roger Pau Monné wrote:
> > On Mon, Mar 27, 2023 at 01:34:23PM +0200, Marek Marczykowski-Górecki wrote:
> > > On Mon, Mar 27, 2023 at 12:51:09PM +0200, Roger Pau Monné wrote:
> > > > On Mon, Mar 27, 2023 at 12:26:05PM +0200, Marek Marczykowski-Górecki wrote:
> > > > > On Mon, Mar 27, 2023 at 12:12:29PM +0200, Roger Pau Monné wrote:
> > > > > > On Sat, Mar 25, 2023 at 03:49:22AM +0100, Marek Marczykowski-Górecki wrote:
> > > > > > > QEMU needs to know whether clearing maskbit of a vector is really
> > > > > > > clearing, or was already cleared before. Currently Xen sends only
> > > > > > > clearing that bit to the device model, but not setting it, so QEMU
> > > > > > > cannot detect it. Because of that, QEMU is working this around by
> > > > > > > checking via /dev/mem, but that isn't the proper approach.
> > > > > > > 
> > > > > > > Give all necessary information to QEMU by passing all ctrl writes,
> > > > > > > including masking a vector. This does include forwarding also writes
> > > > > > > that did not change the value, but as tested on both Linux (6.1.12) and
> > > > > > > Windows (10 pro), they don't do excessive writes of unchanged values
> > > > > > > (Windows seems to clear maskbit in some cases twice, but not more).
> > > > > > 
> > > > > > Since we passthrough all the accesses to the device model, is the
> > > > > > handling in Xen still required?  It might be worth to also expose any
> > > > > > interfaces needed to the device model so all the functionality done by
> > > > > > the msixtbl_mmio_ops hooks could be done by QEMU, since we end up
> > > > > > passing the accesses anyway.
> > > > > 
> > > > > This was discussed on v1 already. Such QEMU would need to be able to do
> > > > > the actual write. If it's running in stubdomain, it would hit the exact
> > > > > issue again (page mapped R/O to it). In fact, that might be an issue for
> > > > > dom0 too (I haven't checked).
> > > > 
> > > > Oh, sorry, likely missed that discussion, as I don't recall this.
> > > > 
> > > > Maybe we need an hypercall for QEMU to notify the masking/unmasking to
> > > > Xen?  As any change on the other fields is already handled by QEMU.
> > > > 
> > > > > I guess that could use my subpage RO feature I just posted then, but it
> > > > > would still mean intercepting the write twice (not a performance issue
> > > > > really here, but rather convoluted handling in total).
> > > > 
> > > > Yes, that does seem way too convoluted.
> > > > 
> > > > > > > Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
> > > > > > > ---
> > > > > > > v2:
> > > > > > >  - passthrough quad writes to emulator too (Jan)
> > > > > > >  - (ab)use len==0 for write len=4 completion (Jan), but add descriptive
> > > > > > >    #define for this magic value
> > > > > > > 
> > > > > > > This behavior change needs to be surfaced to the device model somehow,
> > > > > > > so it knows whether it can rely on it. I'm open for suggestions.
> > > > > > 
> > > > > > Maybe exposed in XEN_DMOP_get_ioreq_server_info?
> > > 
> > > Make flags IN/OUT parameter (and not reuse the same bits)? Or introduce
> > > new field?
> > 
> > I think it would be fine to make it IN/OUT, but see below.
> > 
> > > > > > 
> > > > > > But I wonder whether it shouldn't be the other way arround, the device
> > > > > > model tells Xen it doesn't need to handle any MSI-X accesses because
> > > > > > QEMU will take care of it, likely using a new flag in
> > > > > > XEN_DMOP_create_ioreq_server or maybe in XEN_DOMCTL_bind_pt_irq as
> > > > > > part of the gflags, but then we would need to assert that the flag is
> > > > > > passed for all MSI-X interrupts bound from that device to the same
> > > > > > domain.
> > > > > 
> > > > > Is is safe thing to do? I mean, doesn't Xen need to guard access to
> > > > > MSI-X configuration to assure its safety, especially if no interrupt
> > > > > remapping is there? It probably doesn't matter for qemu in dom0 case,
> > > > > but both with deprivileged qemu and stubdom, it might matter.
> > > > 
> > > > Right - QEMU shouldn't write directly to the MSI-X table using
> > > > /dev/mem, but instead use an hypercall to notify Xen of the
> > > > {un,}masking of the MSI-X table entry.  I think that would allow us to
> > > > safely get rid of the extra logic in Xen to deal with MSI-X table
> > > > accesses.
> > > 
> > > But the purpose of this series is to give guest (or QEMU) more write
> > > access to the MSI-X page, not less.
> > 
> > Right, but there are two independent issues here: one is the
> > propagation of the MSIX entry mask state to the device model, the
> > other is allowing guest accesses to MMIO regions adjacent to the MSIX
> > table.
> > 
> > > If it wouldn't be this Intel AX
> > > wifi, indeed we could translate everything to hypercalls in QEMU and not
> > > worry about special handlers in Xen at all. But unfortunately, we do
> > > need to handle writes to the same page outside of the MSI-X structures
> > > and QEMU can't be trusted with properly filtering them (and otherwise
> > > given full write access to the page).
> > 
> > Indeed, but IMO it would be helpful if we could avoid this split
> > handling of MSIX entries, where Xen handles entry mask/unmask, and
> > QEMU handles entry setup.  It makes the handling logic very
> > complicated, and more likely to be buggy (as you have probably
> > discovered).
> > 
> > Having QEMU always handle accesses to the MSI-X table would make
> > things simpler, and we could get rid of a huge amount of logic and
> > entry tracking in msixtbl_mmio_ops.
> > 
> > Then, we would only need to detect where an access falls into the same
> > page as the MSI-X (or PBA() tables, but outside of those, and forward
> > it to the underlying hardware, but that's a fairly simple piece of
> > logic, and completely detached from all the MSI-X entry tracking that
> > Xen currently does.
> > 
> > > So, while I agree translating {un,}masking individual vectors to
> > > hypercalls could simplify MSI-X handling in general, I don't think it
> > > helps in this particular case. That said, such simplification would
> > > involve:
> > > 1. Exposing the capability in Xen to the qemu
> > > (XEN_DMOP_get_ioreq_server_info sounds reasonable).
> > > 2. QEMU notifying Xen it will handle masking too, somehow.
> > 
> > I think it's possible we could get away with adding a new flag bit to
> > xen_domctl_bind_pt_irq, like: XEN_DOMCTL_VMSI_X86_MASK_HANDLING that
> > would tell Xen that QEMU will handle the mask bit for this entry.
> 
> Technically, for Xen to not care about those writes, it would need to
> observe this flag on all vectors, including those not mapped yet. In
> practice though, I think it might be okay to say device model should set
> XEN_DOMCTL_VMSI_X86_MASK_HANDLING flag consistently (either on none of
> them, or all of them), and Xen can rely on it (if one vector has
> XEN_DOMCTL_VMSI_X86_MASK_HANDLING, then assume all of them will have
> it).

I agree.  I would just return -EINVAL if the flag is not consistent
across vectors on the same device.

> > QEMU using this flag should be prepared to handle the mask bit, but if
> > Xen doesn't know the flag it will keep processing the mask bit.
> > 
> > > 3. QEMU using xc_domain_update_msi_irq and XEN_DOMCTL_VMSI_X86_UNMASKED
> > > to update Xen about the mask state too.
> > > 4. Xen no longer interpreting writes to mask bit, but still intercepting
> > > them to passthorugh those outside of MSI-X structures (the other patch
> > > in the series). But the handler would still need to stay, to keep
> > > working with older QEMU versions.
> > 
> > Xen would need to intercept writes to the page(s) containing the MSI-X
> > table in any case, but the logic is much simpler if it just needs to
> > decide whether the accesses fall inside of the table region, and thus
> > needs to be forwarded to the device model, or fails outside of it and
> > needs to be propagated to the real address.
> > 
> > While true that we won't be able to remove the code that partially
> > handles MSIX entries for guests in Xen, it would be unused for newer
> > versions of QEMU, hopefully making the handling far more consistent as
> > the logic will be entirely in QEMU rather than split between Xen and
> > QEMU.
> 
> In fact, it was easier for me to register a separate ioreq server for
> writes outside of the MSI-X table. But I'm afraid the current one would
> need to stay registered (just not accepting writes).

(I assume in the paragraph above you should use hvm_io_handler rather
than ioreq server, as ioreqs are only for emulation running outside of
Xen)

The handle is currently registered when a device with MSIX is assigned
to an HVM domain.  I think it's fine to leave as-is as long as the
handler doesn't accept any accesses if QEMU does handle the mask bit.
We can always see later about not registering it.

Thanks, Roger.

Jan Beulich March 27, 2023, 3:32 p.m. UTC | #8

On 27.03.2023 12:12, Roger Pau Monné wrote:
> On Sat, Mar 25, 2023 at 03:49:22AM +0100, Marek Marczykowski-Górecki wrote:
>> QEMU needs to know whether clearing maskbit of a vector is really
>> clearing, or was already cleared before. Currently Xen sends only
>> clearing that bit to the device model, but not setting it, so QEMU
>> cannot detect it. Because of that, QEMU is working this around by
>> checking via /dev/mem, but that isn't the proper approach.
>>
>> Give all necessary information to QEMU by passing all ctrl writes,
>> including masking a vector. This does include forwarding also writes
>> that did not change the value, but as tested on both Linux (6.1.12) and
>> Windows (10 pro), they don't do excessive writes of unchanged values
>> (Windows seems to clear maskbit in some cases twice, but not more).
>
> Since we passthrough all the accesses to the device model, is the
> handling in Xen still required?

"All accesses" isn't really correct; aiui even after this change it's only
"all writes". We still "accelerate" reading from the first 3 table entries
(whether or not that's [still] useful in this shape is a separate question).
Plus we need to invoke guest_mask_msi_irq() as necessary, and I don't think
we should make ourselves dependent upon qemu communicating the necessary
info back to us, when that's necessary for the correctness of Xen's internal
interrupt handling. (That's further leaving aside the performance aspect of
handing off to qemu just for it to pass data back to us.)

Jan

>  It might be worth to also expose any
> interfaces needed to the device model so all the functionality done by
> the msixtbl_mmio_ops hooks could be done by QEMU, since we end up
> passing the accesses anyway.
>
>> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
>> ---
>> v2:
>>  - passthrough quad writes to emulator too (Jan)
>>  - (ab)use len==0 for write len=4 completion (Jan), but add descriptive
>>    #define for this magic value
>>
>> This behavior change needs to be surfaced to the device model somehow,
>> so it knows whether it can rely on it. I'm open for suggestions.
>
> Maybe exposed in XEN_DMOP_get_ioreq_server_info?
>
> But I wonder whether it shouldn't be the other way arround, the device
> model tells Xen it doesn't need to handle any MSI-X accesses because
> QEMU will take care of it, likely using a new flag in
> XEN_DMOP_create_ioreq_server or maybe in XEN_DOMCTL_bind_pt_irq as
> part of the gflags, but then we would need to assert that the flag is
> passed for all MSI-X interrupts bound from that device to the same
> domain.
>
> Thanks, Roger.

Roger Pau Monné March 27, 2023, 3:42 p.m. UTC | #9

On Mon, Mar 27, 2023 at 05:32:01PM +0200, Jan Beulich wrote:
> On 27.03.2023 12:12, Roger Pau Monné wrote:
> > On Sat, Mar 25, 2023 at 03:49:22AM +0100, Marek Marczykowski-Górecki wrote:
> >> QEMU needs to know whether clearing maskbit of a vector is really
> >> clearing, or was already cleared before. Currently Xen sends only
> >> clearing that bit to the device model, but not setting it, so QEMU
> >> cannot detect it. Because of that, QEMU is working this around by
> >> checking via /dev/mem, but that isn't the proper approach.
> >>
> >> Give all necessary information to QEMU by passing all ctrl writes,
> >> including masking a vector. This does include forwarding also writes
> >> that did not change the value, but as tested on both Linux (6.1.12) and
> >> Windows (10 pro), they don't do excessive writes of unchanged values
> >> (Windows seems to clear maskbit in some cases twice, but not more).
> >
> > Since we passthrough all the accesses to the device model, is the
> > handling in Xen still required?
> 
> "All accesses" isn't really correct; aiui even after this change it's only
> "all writes". We still "accelerate" reading from the first 3 table entries
> (whether or not that's [still] useful in this shape is a separate question).
> Plus we need to invoke guest_mask_msi_irq() as necessary, and I don't think
> we should make ourselves dependent upon qemu communicating the necessary
> info back to us, when that's necessary for the correctness of Xen's internal
> interrupt handling. (That's further leaving aside the performance aspect of
> handing off to qemu just for it to pass data back to us.)

The call to guest_mask_msi_irq() is a result of a guest action, I'm
failing to see how that's necessary for the correctness of Xen's
internal interrupt handling.  In my proposed model QEMU won't be
allowed direct access to the physical MSIX entry mask bits, and would
instead use an hypercall to mask/unmask MSIX entries (and thus
guest_mask_msi_irq() would be called as appropriate).

Thanks, Roger.

[v2,1/3] x86/msi: passthrough all MSI-X vector ctrl writes to device model

Commit Message

Comments

Patch