
[RFC] fixup! virtio: convert to use DMA api

Message ID 1460979793-6621-1-git-send-email-mst@redhat.com (mailing list archive)
State New, archived

Commit Message

Michael S. Tsirkin April 18, 2016, 11:47 a.m. UTC
This adds a flag to enable/disable bypassing the IOMMU by
virtio devices.

This is on top of patch
http://article.gmane.org/gmane.comp.emulators.qemu/403467
virtio: convert to use DMA api

Tested with patchset
http://article.gmane.org/gmane.linux.kernel.virtualization/27545
virtio-pci: iommu support

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

---
 include/hw/virtio/virtio-access.h              | 3 ++-
 include/hw/virtio/virtio.h                     | 6 +++++-
 include/standard-headers/linux/virtio_config.h | 8 ++++++++
 3 files changed, 15 insertions(+), 2 deletions(-)
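For orientation, here is a rough sketch of the kind of helper the virtio-access.h change implies on the QEMU side. The helper name and the dma_as field are assumptions for illustration only, not the contents of the actual patch:

#include "exec/memory.h"       /* AddressSpace, address_space_memory */
#include "hw/virtio/virtio.h"  /* VirtIODevice, virtio_host_has_feature() */

/* Illustrative only: pick the address space the device model uses for
 * virtqueue DMA, keyed off the new feature bit. */
static inline AddressSpace *virtio_dma_as(VirtIODevice *vdev)
{
    if (virtio_host_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM)) {
        return vdev->dma_as;          /* assumed field: translate through the IOMMU */
    }
    return &address_space_memory;     /* bypass: addresses are guest-physical */
}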

Comments

David Woodhouse April 18, 2016, 11:58 a.m. UTC | #1
On Mon, 2016-04-18 at 14:47 +0300, Michael S. Tsirkin wrote:
> This adds a flag to enable/disable bypassing the IOMMU by
> virtio devices.

I'm still deeply unhappy with having this kind of hack in the virtio
code at all, as you know. Drivers should just use the DMA API and if
the *platform* wants to make it a no-op for a specific device, then it
can.

Remember, this isn't just virtio either. Don't we have *precisely* the
same issue with assigned PCI devices on a system with an emulated Intel
IOMMU? The assigned PCI devices aren't covered by the emulated IOMMU,
and the platform needs to know to bypass *those* too.

Now, we've had this conversation, and we accepted the hack in virtio
for now until the platforms (especially SPARC and Power IIRC) can get
their act together and make their DMA API implementations not broken.

But now you're adding this hack to the public API where we have to
support it for ever. Please, can't we avoid that?
Michael S. Tsirkin April 18, 2016, 1:12 p.m. UTC | #2
On Mon, Apr 18, 2016 at 07:58:37AM -0400, David Woodhouse wrote:
> On Mon, 2016-04-18 at 14:47 +0300, Michael S. Tsirkin wrote:
> > This adds a flag to enable/disable bypassing the IOMMU by
> > virtio devices.
> 
> I'm still deeply unhappy with having this kind of hack in the virtio
> code at all, as you know. Drivers should just use the DMA API and if
> the *platform* wants to make it a no-op for a specific device, then it
> can.
> 
> Remember, this isn't just virtio either. Don't we have *precisely* the
> same issue with assigned PCI devices on a system with an emulated Intel
> IOMMU? The assigned PCI devices aren't covered by the emulated IOMMU,
> and the platform needs to know to bypass *those* too.
> 
> Now, we've had this conversation, and we accepted the hack in virtio
> for now until the platforms (especially SPARC and Power IIRC) can get
> their act together and make their DMA API implementations not broken.
> 
> But now you're adding this hack to the public API where we have to
> support it for ever. Please, can't we avoid that?

I'm not sure I understand the issue.  The public API is not about how
the driver works.  It doesn't say "don't use DMA API" anywhere, does it?
It's about telling device whether to obey the IOMMU and
about discovering whether a device is in fact under the IOMMU.

Once DMA API allows bypassing IOMMU per device we'll be
able to drop the ugly hack from virtio drivers, simply keying it
off the given flag.
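As a purely illustrative sketch of that keying on the guest side (modelled on the vring_use_dma_api() check from the DMA API conversion series; the exact logic shown is an assumption, not the posted patches):

#include <linux/virtio_config.h>	/* virtio_has_feature() */

/* Sketch only: use the DMA API when the device declares that it honours
 * the platform IOMMU; otherwise keep handing it guest-physical addresses. */
static bool vring_use_dma_api(struct virtio_device *vdev)
{
	if (virtio_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM))
		return true;	/* device obeys the IOMMU, dma_map_*() is meaningful */

	return false;		/* legacy behaviour: bypass the DMA API */
}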


> -- 
> dwmw2
> 
> 


David Woodhouse April 18, 2016, 2:03 p.m. UTC | #3
On Mon, 2016-04-18 at 16:12 +0300, Michael S. Tsirkin wrote:
> I'm not sure I understand the issue.  The public API is not about how
> the driver works.  It doesn't say "don't use DMA API" anywhere, does it?
> It's about telling device whether to obey the IOMMU and
> about discovering whether a device is in fact under the IOMMU.

Apologies, I was wrongly reading this as a kernel patch.

After a brief struggle with "telling device whether to obey the IOMMU",
which is obviously completely impossible from the guest kernel, I
realise my mistake :)

So... on x86 how does this get reflected in the DMAR tables that the
guest BIOS presents to the guest kernel, so that the guest kernel
*knows* which devices are behind which IOMMU?

(And are you fixing the case of assigned PCI devices, which aren't
behind any IOMMU, at the same time as you answer that? :)
Michael S. Tsirkin April 18, 2016, 2:23 p.m. UTC | #4
On Mon, Apr 18, 2016 at 10:03:52AM -0400, David Woodhouse wrote:
> On Mon, 2016-04-18 at 16:12 +0300, Michael S. Tsirkin wrote:
> > I'm not sure I understand the issue.  The public API is not about how
> > the driver works.  It doesn't say "don't use DMA API" anywhere, does it?
> > It's about telling device whether to obey the IOMMU and
> > about discovering whether a device is in fact under the IOMMU.
> 
> Apologies, I was wrongly reading this as a kernel patch.
> 
> After a brief struggle with "telling device whether to obey the IOMMU",
> which is obviously completely impossible from the guest kernel, I
> realise my mistake :)
> 
> So... on x86 how does this get reflected in the DMAR tables that the
> guest BIOS presents to the guest kernel, so that the guest kernel
> *knows* which devices are behind which IOMMU?

This patch doesn't change DMAR tables, it creates a way for virtio
device to tell guest "I obey what DMAR tables tell you, you can stop
doing hacks".

And as PPC guys seem adamant that platform tools there are no good for
that purpose, there's another bit that says "ignore what platform tells
you, I'm not a real device - I'm part of hypervisor and I bypass the
IOMMU".


> (And are you fixing the case of assigned PCI devices, which aren't
> behind any IOMMU, at the same time as you answer that? :)

No - Aviv B.D. has patches on list to fix that.
David Woodhouse April 18, 2016, 3:22 p.m. UTC | #5
On Mon, 2016-04-18 at 17:23 +0300, Michael S. Tsirkin wrote:
> 
> This patch doesn't change DMAR tables, it creates a way for virtio
> device to tell guest "I obey what DMAR tables tell you, you can stop
> doing hacks".
> 
> And as PPC guys seem adamant that platform tools there are no good for
> that purpose, there's another bit that says "ignore what platform tells
> you, I'm not a real device - I'm part of hypervisor and I bypass the
> IOMMU".

...

+/* Request IOMMU passthrough (if available)
+ * Without VIRTIO_F_IOMMU_PLATFORM: bypass the IOMMU even if enabled.
+ * With VIRTIO_F_IOMMU_PLATFORM: suggest disabling IOMMU.
+ */
+#define VIRTIO_F_IOMMU_PASSTHROUGH     33
+
+/* Do not bypass the IOMMU (if configured) */
+#define VIRTIO_F_IOMMU_PLATFORM                34

OK... let's see if I can reconcile those descriptions coherently.

Setting (only) VIRTIO_F_IOMMU_PASSTHROUGH indicates to the guest that
its own operating system's IOMMU code is expected to be broken, and
that the virtio driver should eschew the DMA API? And that the guest OS
cannot further assign the affected device to any of *its* nested
guests? Not that the broken IOMMU code in said guest OS will know the
latter, of course.

With VIRTIO_F_IOMMU_PLATFORM set, VIRTIO_F_IOMMU_PASSTHROUGH is just a
*hint*, suggesting that the guest OS should *request* a passthrough
mapping from the IOMMU? Via a driver↔IOMMU API which doesn't yet exist
in Linux, since we only have 'iommu=pt' on the command line for that?

And having *neither* of those bits set is the status quo, which means
that your OS code might well be broken and need you to eschew the DMA
API, but maybe not.


-- 
dwmw2
Michael S. Tsirkin April 18, 2016, 3:30 p.m. UTC | #6
On Mon, Apr 18, 2016 at 11:22:03AM -0400, David Woodhouse wrote:
> On Mon, 2016-04-18 at 17:23 +0300, Michael S. Tsirkin wrote:
> > 
> > This patch doesn't change DMAR tables, it creates a way for virtio
> > device to tell guest "I obey what DMAR tables tell you, you can stop
> > doing hacks".
> > 
> > And as PPC guys seem adamant that platform tools there are no good for
> > that purpose, there's another bit that says "ignore what platform tells
> > you, I'm not a real device - I'm part of hypervisor and I bypass the
> > IOMMU".
> 
> ...
> 
> +/* Request IOMMU passthrough (if available)
> + * Without VIRTIO_F_IOMMU_PLATFORM: bypass the IOMMU even if enabled.
> + * With VIRTIO_F_IOMMU_PLATFORM: suggest disabling IOMMU.
> + */
> +#define VIRTIO_F_IOMMU_PASSTHROUGH     33
> +
> +/* Do not bypass the IOMMU (if configured) */
> +#define VIRTIO_F_IOMMU_PLATFORM                34
> 
> OK... let's see if I can reconcile those descriptions coherently.
> 
> Setting (only) VIRTIO_F_IOMMU_PASSTHROUGH indicates to the guest that
> its own operating system's IOMMU code is expected to be broken, and
> that the virtio driver should eschew the DMA API?

No - it tells guest that e.g. the ACPI tables (or whatever the
equivalent is) do not match reality with respect to this device
since IOMMU is ignored by hypervisor.
Hypervisor has no idea what the guest IOMMU code does - hopefully
it is not actually broken.

> And that the guest OS
> cannot further assign the affected device to any of *its* nested
> guests? Not that the broken IOMMU code in said guest OS will know the
> latter, of course.
> 
> With VIRTIO_F_IOMMU_PLATFORM set, VIRTIO_F_IOMMU_PASSTHROUGH is just a
> *hint*, suggesting that the guest OS should *request* a passthrough
> mapping from the IOMMU?

Right. But it'll work correctly if you don't.

> Via a driver↔IOMMU API which doesn't yet exist
> in Linux, since we only have 'iommu=pt' on the command line for that?
> 
> And having *neither* of those bits set is the status quo, which means
> that your OS code might well be broken and need you to eschew the DMA
> API, but maybe not.


The status quo is that the IOMMU might well be bypassed
and then you need to program physical addresses into the device,
but maybe not. If DMA API does not give you physical addresses, you
need to bypass it, but hypervisor does not know or care.


> 
> -- 
> dwmw2
> 
> 


David Woodhouse April 18, 2016, 3:51 p.m. UTC | #7
On Mon, 2016-04-18 at 18:30 +0300, Michael S. Tsirkin wrote:
> 
> > Setting (only) VIRTIO_F_IOMMU_PASSTHROUGH indicates to the guest that
> > its own operating system's IOMMU code is expected to be broken, and
> > that the virtio driver should eschew the DMA API?
> 
> No - it tells guest that e.g. the ACPI tables (or whatever the
> equivalent is) do not match reality with respect to this device
> since IOMMU is ignored by hypervisor.
> Hypervisor has no idea what the guest IOMMU code does - hopefully
> it is not actually broken.

OK, that makes sense — thanks.

So where the platform *does* have a way to coherently tell the guest
that some devices are behind an IOMMU and some aren't, we should never
see VIRTIO_F_IOMMU_PASSTHROUGH && !VIRTIO_F_IOMMU_PLATFORM. (Except
perhaps temporarily on x86 until we *do* fix the DMAR tables to tell
the truth; qv.)

This should *only* be a crutch for platforms which cannot properly
convey that information from the hypervisor to the guest. It should be
clearly documented "thou shalt not use this unless you've first
attempted to fix the broken platform to get it right for itself".

And if we look at it as such... does it make more sense for this to be
a more *generic* qemu↔guest interface? That way the software hacks can
live in the OS IOMMU code where they belong, and prevent assignment to
nested guests for example. And can cover cases like assigned PCI
devices in existing qemu/x86 which need the same treatment.

Put another way: if we're going to add code to the guest OS to look at
this information, why can't we add that code in the guest's IOMMU
support instead, to look at an out-of-band qemu-specific "ignore IOMMU
for these devices" list instead?

> The status quo is that the IOMMU might well be bypassed
> and then you need to program physical addresses into the device,
> but maybe not. If DMA API does not give you physical addresses, you
> need to bypass it, but hypervisor does not know or care.

Right. The status quo is that qemu doesn't provide correct information
about IOMMU topology to guests, and they have to have heuristics to
work out whether to eschew the IOMMU for a given device or not. This is
true for virtio and assigned PCI devices alike.

Furthermore, some platforms don't *have* a standard way for qemu to
'tell the truth' to the guests, and that's where the real fun comes in.
But still, I'd like to see a generic solution for that lack instead of
a virtio-specific hack.
Michael S. Tsirkin April 18, 2016, 4:27 p.m. UTC | #8
On Mon, Apr 18, 2016 at 11:51:41AM -0400, David Woodhouse wrote:
> On Mon, 2016-04-18 at 18:30 +0300, Michael S. Tsirkin wrote:
> > 
> > > Setting (only) VIRTIO_F_IOMMU_PASSTHROUGH indicates to the guest that
> > > its own operating system's IOMMU code is expected to be broken, and
> > > that the virtio driver should eschew the DMA API?
> > 
> > No - it tells guest that e.g. the ACPI tables (or whatever the
> > equivalent is) do not match reality with respect to this device
> > since IOMMU is ignored by hypervisor.
> > Hypervisor has no idea what the guest IOMMU code does - hopefully
> > it is not actually broken.
> 
> OK, that makes sense — thanks.
> 
> So where the platform *does* have a way to coherently tell the guest
> that some devices are behind an IOMMU and some aren't, we should never
> see VIRTIO_F_IOMMU_PASSTHROUGH && !VIRTIO_F_IOMMU_PLATFORM. (Except
> perhaps temporarily on x86 until we *do* fix the DMAR tables to tell
> the truth; qv.)
> 
> This should *only* be a crutch for platforms which cannot properly
> convey that information from the hypervisor to the guest. It should be
> clearly documented "thou shalt not use this unless you've first
> attempted to fix the broken platform to get it right for itself".
> 
> And if we look at it as such... does it make more sense for this to be
> a more *generic* qemu↔guest interface? That way the software hacks can
> live in the OS IOMMU code where they belong, and prevent assignment to
> nested guests for example. And can cover cases like assigned PCI
> devices in existing qemu/x86 which need the same treatment.
>
> Put another way: if we're going to add code to the guest OS to look at
> this information, why can't we add that code in the guest's IOMMU
> support instead, to look at an out-of-band qemu-specific "ignore IOMMU
> for these devices" list instead?

I balk at adding more hacks to a broken system. My goals are
merely to
- make things work correctly with an IOMMU and new guests,
  so people can use userspace drivers with virtio devices
- prevent security risks when guest kernel mistakenly thinks
  it's protected by an IOMMU, but in fact isn't
- avoid breaking any working configurations

Looking at guest code, it looks like virtio was always
bypassing the IOMMU even if configured, but no other
guest driver did.

This makes me think the problem where guest drivers
ignore the IOMMU is virtio specific
and so a virtio specific solution seems cleaner.

The problem for assigned devices is IMHO different: they bypass
the guest IOMMU too but no guest driver knows about this,
so guests do not work. Seems cleaner to fix QEMU to make
existing guests work.


> > The status quo is that the IOMMU might well be bypassed
> > and then you need to program physical addresses into the device,
> > but maybe not. If DMA API does not give you physical addresses, you
> > need to bypass it, but hypervisor does not know or care.
> 
> Right. The status quo is that qemu doesn't provide correct information
> about IOMMU topology to guests, and they have to have heuristics to
> work out whether to eschew the IOMMU for a given device or not. This is
> true for virtio and assigned PCI devices alike.

True but I think we should fix QEMU to shadow IOMMU
page tables for assigned devices. This seems rather
possible with VT-D, and there are patches already on list.

It looks like this will fix all legacy guests which is
much nicer than what you suggest which will only help new guests.

> Furthermore, some platforms don't *have* a standard way for qemu to
> 'tell the truth' to the guests, and that's where the real fun comes in.
> But still, I'd like to see a generic solution for that lack instead of
> a virtio-specific hack.

But the issue is not just these holes.  E.g. with VT-D it is only easy
to emulate because there's a "caching mode" hook. It is fundamentally
paravirtualization.  So a completely generic solution would be a
paravirtualized IOMMU interface, replacing VT-D for VMs. It might be
justified if many platforms have hard to emulate interfaces.



> -- 
> dwmw2
> 
> 


David Woodhouse April 18, 2016, 6:29 p.m. UTC | #9
On Mon, 2016-04-18 at 19:27 +0300, Michael S. Tsirkin wrote:
> I balk at adding more hacks to a broken system. My goals are
> merely to
> - make things work correctly with an IOMMU and new guests,
>   so people can use userspace drivers with virtio devices
> - prevent security risks when guest kernel mistakenly thinks
>   it's protected by an IOMMU, but in fact isn't
> - avoid breaking any working configurations

AFAICT the VIRTIO_F_IOMMU_PASSTHROUGH thing seems orthogonal to this.
That's just an optimisation, for telling an OS "you don't really need
to bother with the IOMMU, even though it works".

There are two main reasons why an operating system might want to use
the IOMMU via the DMA API for native drivers: 
 - To protect against driver bugs triggering rogue DMA.
 - To protect against hardware (or firmware) bugs.

With virtio, the first reason still exists. But the second is moot
because the device is part of the hypervisor and if the hypervisor is
untrustworthy then you're screwed anyway... but then again, in SoC
devices you could replace 'hypervisor' with 'chip' and the same is
true, isn't it? Is there *really* anything virtio-specific here?

Sure, I want my *external* network device on a PCIe card with software-
loadable firmware to be behind an IOMMU because I don't trust it as far
as I can throw it. But for on-SoC devices surely the situation is
*just* the same as devices provided by a hypervisor?

And some people want that external network device to use passthrough
anyway, for performance reasons.

On the whole, there are *plenty* of reasons why we might want to have a
passthrough mapping on a per-device basis, and I really struggle to
find justification for having this 'hint' in a virtio-specific way.

And it's complicating the discussion of the *actual* fix we're looking
at.

> Looking at guest code, it looks like virtio was always
> bypassing the IOMMU even if configured, but no other
> guest driver did.
> 
> This makes me think the problem where guest drivers
> ignore the IOMMU is virtio specific
> and so a virtio specific solution seems cleaner.
> 
> The problem for assigned devices is IMHO different: they bypass
> the guest IOMMU too but no guest driver knows about this,
> so guests do not work. Seems cleaner to fix QEMU to make
> existing guests work.

I certainly agree that it's better to fix QEMU. Whether devices are
behind an IOMMU or not, the DMAR tables we expose to a guest should
tell the truth.

Part of the issue here is virtio-specific; part isn't.

Basically, we have a conjunction of two separate bugs which happened to
work (for virtio) — the IOMMU support in QEMU wasn't working for virtio
(and assigned) devices even though it theoretically *should* have been,
and the virtio drivers weren't using the DMA API as they theoretically
should have been.

So there were corner cases like assigned PCI devices, and real hardware
implementations of virtio stuff (and perhaps virtio devices being
assigned to nested guests) which didn't work. But for the *common* use
case, one bug cancelled out the other.

Now we want to fix both bugs, and of course that involves carefully
coordinating both fixes.

I *like* your idea of a flag from the hypervisor which essentially says
"trust me, I'm telling the truth now".

But don't think that wants to be virtio-specific, because we actually
want it to cover *all* the corner cases, not just the common case which
*happened* to work before due to the alignment of the two previous
bugs.

An updated guest OS can look for this flag (in its generic IOMMU code)
and can apply a heuristic of its own to work out which devices *aren't*
behind the IOMMU, if the flag isn't present. And it can get that right
even for assigned devices, so that new kernels can run happily even on
today's QEMU instances. And the virtio driver in new kernels should
just use the DMA API and expect it to work. Just as the various drivers
for assigned PCI devices do.

The other interesting case for compatibility is old kernels running in
a new QEMU. And for that case, things are likely to break if you
suddenly start putting the virtio devices behind an IOMMU. There's
nothing you can do on ARM and Power to stop that breakage, since they
don't *have* a way to tell legacy guests that certain devices aren't
translated. So I suspect you probably can't enable virtio-behind-IOMMU
in QEMU *ever* for those platforms as the default behaviour.

For x86, you *can* enable virtio-behind-IOMMU if your DMAR tables tell
the truth, and even legacy kernels ought to cope with that.
FSVO 'ought to' where I suspect some of them will actually crash with a
NULL pointer dereference if there's no "catch-all" DMAR unit in the
tables, which puts it back into the same camp as ARM and Power.


> True but I think we should fix QEMU to shadow IOMMU
> page tables for assigned devices. This seems rather
> possible with VT-D, and there are patches already on list.
> 
> It looks like this will fix all legacy guests which is
> much nicer than what you suggest which will only help new guests.

Yes, we should do that. And in the short term we should at *least* fix
the DMAR tables to tell the truth.

> > 
> > Furthermore, some platforms don't *have* a standard way for qemu to
> > 'tell the truth' to the guests, and that's where the real fun comes in.
> > But still, I'd like to see a generic solution for that lack instead of
> > a virtio-specific hack.
> But the issue is not just these holes.  E.g. with VT-D it is only easy
> to emulate because there's a "caching mode" hook. It is fundamentally
> paravirtualization.  So a completely generic solution would be a
> paravirtualized IOMMU interface, replacing VT-D for VMs. It might be
> justified if many platforms have hard to emulate interfaces.

Hm, I'm not sure I understand the point here.

Either there is a way for the hypervisor to expose an IOMMU to a guest
(be it full hardware virt, or paravirt). Or there isn't.

If there is, it doesn't matter *how* it's done. And if there isn't, the
whole discussion is moot anyway.
Andy Lutomirski April 18, 2016, 7:24 p.m. UTC | #10
On Mon, Apr 18, 2016 at 11:29 AM, David Woodhouse <dwmw2@infradead.org> wrote:
> For x86, you *can* enable virtio-behind-IOMMU if your DMAR tables tell
> the truth, and even legacy kernels ought to cope with that.
> FSVO 'ought to' where I suspect some of them will actually crash with a
> NULL pointer dereference if there's no "catch-all" DMAR unit in the
> tables, which puts it back into the same camp as ARM and Power.

I think x86 may get a bit of a free pass here.  AFAIK the QEMU IOMMU
implementation on x86 has always been "experimental", so it just might
be okay to change it in a way that causes some older kernels to OOPS.

--Andy
Michael S. Tsirkin April 19, 2016, 9:13 a.m. UTC | #11
On Mon, Apr 18, 2016 at 02:29:33PM -0400, David Woodhouse wrote:
> On Mon, 2016-04-18 at 19:27 +0300, Michael S. Tsirkin wrote:
> > I balk at adding more hacks to a broken system. My goals are
> > merely to
> > - make things work correctly with an IOMMU and new guests,
> >   so people can use userspace drivers with virtio devices
> > - prevent security risks when guest kernel mistakenly thinks
> >   it's protected by an IOMMU, but in fact isn't
> > - avoid breaking any working configurations
> 
> AFAICT the VIRTIO_F_IOMMU_PASSTHROUGH thing seems orthogonal to this.
> That's just an optimisation, for telling an OS "you don't really need
> to bother with the IOMMU, even though it works".
> 
> There are two main reasons why an operating system might want to use
> the IOMMU via the DMA API for native drivers: 
>  - To protect against driver bugs triggering rogue DMA.
>  - To protect against hardware (or firmware) bugs.
> 
> With virtio, the first reason still exists. But the second is moot
> because the device is part of the hypervisor and if the hypervisor is
> untrustworthy then you're screwed anyway... but then again, in SoC
> devices you could replace 'hypervisor' with 'chip' and the same is
> true, isn't it? Is there *really* anything virtio-specific here?
>
> Sure, I want my *external* network device on a PCIe card with software-
> loadable firmware to be behind an IOMMU because I don't trust it as far
> as I can throw it. But for on-SoC devices surely the situation is
> *just* the same as devices provided by a hypervisor?

Depends on how SoC is designed I guess.  At the moment specifically QEMU
runs everything in a single memory space so an IOMMU table lookup does
not offer any extra protection. That's not a must, one could come
up with modular hypervisor designs - it's just what we have ATM.


> And some people want that external network device to use passthrough
> anyway, for performance reasons.

That's a policy decision though.

> On the whole, there are *plenty* of reasons why we might want to have a
> passthrough mapping on a per-device basis,

That's true. And driver security also might differ, for example maybe I
trust a distro-supplied driver more than an out of tree one.  Or maybe I
trust a distro-supplied userspace driver more than a closed-source one.
And maybe I trust devices from same vendor as my chip more than a 3rd
party one.  So one can generalize this even further, think about device
and driver security/trust level as an integer and platform protection as an
integer.

If platform IOMMU offers you extra protection over trusting the device
(trust < protection) it improves your security to use platform to limit
the device. If trust >= protection it just adds overhead without
increasing the security.

> and I really struggle to
> find justification for having this 'hint' in a virtio-specific way.

It's a way. No system seems to expose this information in a more generic
way at the moment, and it's portable. Would you like to push for some
kind of standardization of such a hint? I would be interested
to hear about that.


> And it's complicating the discussion of the *actual* fix we're looking
> at.

I guess you are right in that we should split this part out.
What I wanted is really the combination
PASSTHROUGH && !PLATFORM so that we can say "ok we don't
need to guess, this device actually bypasses the IOMMU".

And I thought it's a nice idea to use PASSTHROUGH && PLATFORM
as a hint since it seemed to be unused.
But maybe the best thing to do for now is to say
- hosts should not set PASSTHROUGH && PLATFORM
- guests should ignore PASSTHROUGH if PLATFORM is set

and then we can come back to this optimization idea later
if it's appropriate.

So yes I think we need the two bits but no we don't need to
mix the hint discussion in here.
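Put as a sketch (helper name and shape are illustrative assumptions, not the actual patches), the guest-side reading of those two rules would be roughly:

#include <linux/virtio_config.h>	/* virtio_has_feature() */

/* PLATFORM wins; PASSTHROUGH is only consulted when PLATFORM is absent. */
static bool virtio_device_bypasses_iommu(struct virtio_device *vdev)
{
	if (virtio_has_feature(vdev, VIRTIO_F_IOMMU_PLATFORM))
		return false;	/* obey the IOMMU, ignore PASSTHROUGH if also set */

	if (virtio_has_feature(vdev, VIRTIO_F_IOMMU_PASSTHROUGH))
		return true;	/* device states outright that it bypasses the IOMMU */

	return true;		/* neither bit: legacy device, assume it bypasses */
}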

> > Looking at guest code, it looks like virtio was always
> > bypassing the IOMMU even if configured, but no other
> > guest driver did.
> > 
> > This makes me think the problem where guest drivers
> > ignore the IOMMU is virtio specific
> > and so a virtio specific solution seems cleaner.
> > 
> > The problem for assigned devices is IMHO different: they bypass
> > the guest IOMMU too but no guest driver knows about this,
> > so guests do not work. Seems cleaner to fix QEMU to make
> > existing guests work.
> 
> I certainly agree that it's better to fix QEMU. Whether devices are
> behind an IOMMU or not, the DMAR tables we expose to a guest should
> tell the truth.
> 
> Part of the issue here is virtio-specific; part isn't.
> 
> Basically, we have a conjunction of two separate bugs which happened to
> work (for virtio) — the IOMMU support in QEMU wasn't working for virtio
> (and assigned) devices even though it theoretically *should* have been,
> and the virtio drivers weren't using the DMA API as they theoretically
> should have been.
> 
> So there were corner cases like assigned PCI devices, and real hardware
> implementations of virtio stuff (and perhaps virtio devices being
> assigned to nested guests) which didn't work. But for the *common* use
> case, one bug cancelled out the other.
> 
> Now we want to fix both bugs, and of course that involves carefully
> coordinating both fixes.
> 
> I *like* your idea of a flag from the hypervisor which essentially says
> "trust me, I'm telling the truth now".
> 
> But don't think that wants to be virtio-specific, because we actually
> want it to cover *all* the corner cases, not just the common case which
> *happened* to work before due to the alignment of the two previous
> bugs.

I guess we differ here. I care about fixing bugs and not breaking
working setups but I see little value in working around
existing bugs if they can be fixed at their source.

Building a generic mechanism to report which devices bypass the IOMMU
isn't trivial because there's no simple generic way to address
an arbitrary device from hypervisor. For example, DMAR tables
commonly use bus numbers for that but these are guest (bios) assigned.
So if we used bus numbers we'd have to ask bios to build a custom
ACPI table and stick bus numbers there.

> An updated guest OS can look for this flag (in its generic IOMMU code)
> and can apply a heuristic of its own to work out which devices *aren't*
> behind the IOMMU, if the flag isn't present. And it can get that right
> even for assigned devices, so that new kernels can run happily even on
> today's QEMU instances.

With iommu enabled? Point is, I don't really care about that.
At this point only a very small number of devices work with this
IOMMU at all. I expect that we'll fix assigned devices very soon.

> And the virtio driver in new kernels should
> just use the DMA API and expect it to work. Just as the various drivers
> for assigned PCI devices do.

Absolutely but that's a separate discussion.

> The other interesting case for compatibility is old kernels running in
> a new QEMU. And for that case, things are likely to break if you
> suddenly start putting the virtio devices behind an IOMMU. There's
> nothing you can do on ARM and Power to stop that breakage, since they
> don't *have* a way to tell legacy guests that certain devices aren't
> translated. So I suspect you probably can't enable virtio-behind-IOMMU
> in QEMU *ever* for those platforms as the default behaviour.
> 
> For x86, you *can* enable virtio-behind-IOMMU if your DMAR tables tell
> the truth, and even legacy kernels ought to cope with that.

I don't see how in that legacy kernels bypassed the DMA API.
To me it looks like we either use physical addresses that they give us
or they don't work at all (at least without iommu=pt),
since the VT-D spec says:
	DMA requests processed through root-entries with present field
	Clear result in translation-fault.

So I suspect the IOMMU_PLATFORM flag would have to stay off
by default for a while.


> FSVO 'ought to' where I suspect some of them will actually crash with a
> NULL pointer dereference if there's no "catch-all" DMAR unit in the
> tables, which puts it back into the same camp as ARM and Power.

Right.  That would also be an issue.

> 
> > True but I think we should fix QEMU to shadow IOMMU
> > page tables for assigned devices. This seems rather
> > possible with VT-D, and there are patches already on list.
> > 
> > It looks like this will fix all legacy guests which is
> > much nicer than what you suggest which will only help new guests.
> 
> Yes, we should do that. And in the short term we should at *least* fix
> the DMAR tables to tell the truth.

Right. However, the way timing happens to work, we are out of time to
fix it in 2.6 and we are highly likely to have the proper VFIO fix in
2.7.  So I'm not sure there's space for a short term fix.

> > > 
> > > Furthermore, some platforms don't *have* a standard way for qemu to
> > > 'tell the truth' to the guests, and that's where the real fun comes in.
> > > But still, I'd like to see a generic solution for that lack instead of
> > > a virtio-specific hack.
> > But the issue is not just these holes.  E.g. with VT-D it is only easy
> > to emulate because there's a "caching mode" hook. It is fundamentally
> > paravirtualization.  So a completely generic solution would be a
> > paravirtualized IOMMU interface, replacing VT-D for VMs. It might be
> > justified if many platforms have hard to emulate interfaces.
> 
> Hm, I'm not sure I understand the point here.
> 
> Either there is a way for the hypervisor to expose an IOMMU to a guest
> (be it full hardware virt, or paravirt). Or there isn't.
> 
> If there is, it doesn't matter *how* it's done.

Well it does matter for people doing it :)

> And if there isn't, the
> whole discussion is moot anyway.

Point was that we can always build a paravirt interface
if it does not exist, but it's easier to maintain
if it's minimal, being as close to emulating hardware
as we can.

> -- 
> dwmw2
> 
> 


Michael S. Tsirkin April 19, 2016, 10:27 a.m. UTC | #12
On Mon, Apr 18, 2016 at 12:24:15PM -0700, Andy Lutomirski wrote:
> On Mon, Apr 18, 2016 at 11:29 AM, David Woodhouse <dwmw2@infradead.org> wrote:
> > For x86, you *can* enable virtio-behind-IOMMU if your DMAR tables tell
> > the truth, and even legacy kernels ought to cope with that.
> > FSVO 'ought to' where I suspect some of them will actually crash with a
> > NULL pointer dereference if there's no "catch-all" DMAR unit in the
> > tables, which puts it back into the same camp as ARM and Power.
> 
> I think x86 may get a bit of a free pass here.  AFAIK the QEMU IOMMU
> implementation on x86 has always been "experimental", so it just might
> be okay to change it in a way that causes some older kernels to OOPS.
> 
> --Andy

Since it's experimental, it might be OK to change *guest kernels*
such that they oops on old QEMU.
But guest kernels were not experimental - so we need a QEMU mode that
makes them work fine. The more functionality is available in this QEMU
mode, the better, because it's going to be the default for a while. For
the same reason, it is preferable to also have new kernels not crash in
this mode.
Alex Williamson April 19, 2016, 2:44 p.m. UTC | #13
On Tue, 19 Apr 2016 12:13:29 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Mon, Apr 18, 2016 at 02:29:33PM -0400, David Woodhouse wrote:
> > On Mon, 2016-04-18 at 19:27 +0300, Michael S. Tsirkin wrote:  
> > > I balk at adding more hacks to a broken system. My goals are
> > > merely to
> > > - make things work correctly with an IOMMU and new guests,
> > >   so people can use userspace drivers with virtio devices
> > > - prevent security risks when guest kernel mistakenly thinks
> > >   it's protected by an IOMMU, but in fact isn't
> > > - avoid breaking any working configurations  
> > 
> > AFAICT the VIRTIO_F_IOMMU_PASSTHROUGH thing seems orthogonal to this.
> > That's just an optimisation, for telling an OS "you don't really need
> > to bother with the IOMMU, even though it works".
> > 
> > There are two main reasons why an operating system might want to use
> > the IOMMU via the DMA API for native drivers: 
> >  - To protect against driver bugs triggering rogue DMA.
> >  - To protect against hardware (or firmware) bugs.
> > 
> > With virtio, the first reason still exists. But the second is moot
> > because the device is part of the hypervisor and if the hypervisor is
> > untrustworthy then you're screwed anyway... but then again, in SoC
> > devices you could replace 'hypervisor' with 'chip' and the same is
> > true, isn't it? Is there *really* anything virtio-specific here?
> >
> > Sure, I want my *external* network device on a PCIe card with software-
> > loadable firmware to be behind an IOMMU because I don't trust it as far
> > as I can throw it. But for on-SoC devices surely the situation is
> > *just* the same as devices provided by a hypervisor?  
> 
> Depends on how SoC is designed I guess.  At the moment specifically QEMU
> runs everything in a single memory space so an IOMMU table lookup does
> not offer any extra protection. That's not a must, one could come
> up with modular hypervisor designs - it's just what we have ATM.
> 
> 
> > And some people want that external network device to use passthrough
> > anyway, for performance reasons.  
> 
> That's a policy decision though.
> 
> > On the whole, there are *plenty* of reasons why we might want to have a
> > passthrough mapping on a per-device basis,  
> 
> That's true. And driver security also might differ, for example maybe I
> trust a distro-supplied driver more than an out of tree one.  Or maybe I
> trust a distro-supplied userspace driver more than a closed-source one.
> And maybe I trust devices from same vendor as my chip more than a 3rd
> party one.  So one can generalize this even further, think about device
> and driver security/trust level as an integer and platform protection as an
> integer.
> 
> If platform IOMMU offers you extra protection over trusting the device
> (trust < protection) it improves your security to use platform to limit
> the device. If trust >= protection it just adds overhead without
> increasing the security.
> 
> > and I really struggle to
> > find justification for having this 'hint' in a virtio-specific way.  
> 
> It's a way. No system seems to expose this information in a more generic
> way at the moment, and it's portable. Would you like to push for some
> kind of standardization of such a hint? I would be interested
> to hear about that.
> 
> 
> > And it's complicating the discussion of the *actual* fix we're looking
> > at.  
> 
> I guess you are right in that we should split this part out.
> What I wanted is really the combination
> PASSTHROUGH && !PLATFORM so that we can say "ok we don't
> need to guess, this device actually bypasses the IOMMU".
> 
> And I thought it's a nice idea to use PASSTHROUGH && PLATFORM
> as a hint since it seemed to be unused.
> But maybe the best thing to do for now is to say
> - hosts should not set PASSTHROUGH && PLATFORM
> - guests should ignore PASSTHROUGH if PLATFORM is set
> 
> and then we can come back to this optimization idea later
> if it's appropriate.
> 
> So yes I think we need the two bits but no we don't need to
> mix the hint discussion in here.
> 
> > > Looking at guest code, it looks like virtio was always
> > > bypassing the IOMMU even if configured, but no other
> > > guest driver did.
> > > 
> > > This makes me think the problem where guest drivers
> > > ignore the IOMMU is virtio specific
> > > and so a virtio specific solution seems cleaner.
> > > 
> > > The problem for assigned devices is IMHO different: they bypass
> > > the guest IOMMU too but no guest driver knows about this,
> > > so guests do not work. Seems cleaner to fix QEMU to make
> > > existing guests work.  
> > 
> > I certainly agree that it's better to fix QEMU. Whether devices are
> > behind an IOMMU or not, the DMAR tables we expose to a guest should
> > tell the truth.
> > 
> > Part of the issue here is virtio-specific; part isn't.
> > 
> > Basically, we have a conjunction of two separate bugs which happened to
> > work (for virtio) — the IOMMU support in QEMU wasn't working for virtio
> > (and assigned) devices even though it theoretically *should* have been,
> > and the virtio drivers weren't using the DMA API as they theoretically
> > should have been.
> > 
> > So there were corner cases like assigned PCI devices, and real hardware
> > implementations of virtio stuff (and perhaps virtio devices being
> > assigned to nested guests) which didn't work. But for the *common* use
> > case, one bug cancelled out the other.
> > 
> > Now we want to fix both bugs, and of course that involves carefully
> > coordinating both fixes.
> > 
> > I *like* your idea of a flag from the hypervisor which essentially says
> > "trust me, I'm telling the truth now".
> > 
> > But don't think that wants to be virtio-specific, because we actually
> > want it to cover *all* the corner cases, not just the common case which
> > *happened* to work before due to the alignment of the two previous
> > bugs.  
> 
> I guess we differ here. I care about fixing bugs and not breaking
> working setups but I see little value in working around
> existing bugs if they can be fixed at their source.
> 
> Building a generic mechanism to report which devices bypass the IOMMU
> isn't trivial because there's no simple generic way to address
> an arbitrary device from hypervisor. For example, DMAR tables
> commonly use bus numbers for that but these are guest (bios) assigned.
> So if we used bus numbers we'd have to ask bios to build a custom
> ACPI table and stick bus numbers there.

This is incorrect: the DMAR table specifically uses device paths in
order to avoid the issue with guest assigned bus numbers.  The only
bus number used is the starting bus number, which is generally
provided by the platform anyway.  Excluding devices isn't necessarily
easy with DMAR though, we don't get to be lazy and use the
INCLUDE_PCI_ALL flag.  Hotplug is also an issue, we either need to
hot-add devices into slots where there's already the correct DMAR
coverage (or lack of coverage) to represent the inclusion or exclusion
or enable dynamic table support.  And really it seems like dynamic
tables are the only possible way DMAR could support replacing a
device that obeys the IOMMU with one that does not at the same address,
or vice versa.

For any sort of sane implementation, it probably comes down to fully
enumerating root bus devices in the DMAR and creating PCI sub-hierarchy
entries for certain subordinate buses, leaving others undefined.
Devices making use of the IOMMU could only be attached behind those
sub-hierarchies and devices not making use of the IOMMU would be
downstream of bridges not covered.  The management stack would need to
know where to place devices.
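For reference, the rough shape of the ACPI structures in question (field layout as in Linux's include/acpi/actbl2.h, shown only to illustrate the path-versus-bus-number point; u8/u16 are the kernel's fixed-width types from <linux/types.h>):

struct acpi_dmar_device_scope {
	u8  entry_type;		/* PCI endpoint, PCI sub-hierarchy, IOAPIC, HPET, ... */
	u8  length;
	u16 reserved;
	u8  enumeration_id;
	u8  bus;		/* starting bus number, the only bus number in the entry */
};

/* Followed by a variable-length path of (device, function) hops, so the
 * entry stays valid even though subordinate bus numbers are guest-assigned. */
struct acpi_dmar_pci_path {
	u8  device;
	u8  function;
};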
 
> > An updated guest OS can look for this flag (in its generic IOMMU code)
> > and can apply a heuristic of its own to work out which devices *aren't*
> > behind the IOMMU, if the flag isn't present. And it can get that right
> > even for assigned devices, so that new kernels can run happily even on
> > today's QEMU instances.  
> 
> With iommu enabled? Point is, I don't really care about that.
> At this point only a very small number of devices work with this
> IOMMU at all. I expect that we'll fix assigned devices very soon.
> 
> > And the virtio driver in new kernels should
> > just use the DMA API and expect it to work. Just as the various drivers
> > for assigned PCI devices do.  
> 
> Absolutely but that's a separate discussion.
> 
> > The other interesting case for compatibility is old kernels running in
> > a new QEMU. And for that case, things are likely to break if you
> > suddenly start putting the virtio devices behind an IOMMU. There's
> > nothing you can do on ARM and Power to stop that breakage, since they
> > don't *have* a way to tell legacy guests that certain devices aren't
> > translated. So I suspect you probably can't enable virtio-behind-IOMMU
> > in QEMU *ever* for those platforms as the default behaviour.
> > 
> > For x86, you *can* enable virtio-behind-IOMMU if your DMAR tables tell
> > the truth, and even legacy kernels ought to cope with that.  
> 
> I don't see how in that legacy kernels bypassed the DMA API.

It's a matter of excluding the device from being explicitly covered by
the DMAR AIUI.  This is theoretically possible, but I wonder if it
actually works for all kernels.

> To me it looks like we either use physical addresses that they give us
> or they don't work at all (at least without iommu=pt),
> since the VT-D spec says:
> 	DMA requests processed through root-entries with present field
> 	Clear result in translation-fault.
> 
> So I suspect the IOMMU_PLATFORM flag would have to stay off
> by default for a while.
> 
> 
> > FSVO 'ought to' where I suspect some of them will actually crash with a
> > NULL pointer dereference if there's no "catch-all" DMAR unit in the
> > tables, which puts it back into the same camp as ARM and Power.  
> 
> Right.  That would also be an issue.
> 
> >   
> > > True but I think we should fix QEMU to shadow IOMMU
> > > page tables for assigned devices. This seems rather
> > > possible with VT-D, and there are patches already on list.
> > > 
> > > It looks like this will fix all legacy guests which is
> > > much nicer than what you suggest which will only help new guests.  
> > 
> > Yes, we should do that. And in the short term we should at *least* fix
> > the DMAR tables to tell the truth.  
> 
> Right. However, the way timing happens to work, we are out of time to
> fix it in 2.6 and we are highly likely to have the proper VFIO fix in
> 2.7.  So I'm not sure there's space for a short term fix.

Note that vfio already works with IOMMUs on power, the issue I believe
we're talking about for assigned devices bypassing the guest IOMMU is
limited to the QEMU VT-d implementation failing to do the proper
notifies.  Legacy KVM device assignment of course has no idea about the
IOMMU because it piggybacks on KVM memory slot mapping instead of
operating within the QEMU Memory API like vfio does.

The issues I believe we're going to hit with vfio assigned devices and
QEMU VT-d are that 1) the vfio IOMMU interface is not designed for the
frequency of mapping that a DMA API managed guest device will generate,
2) we have accounting issues for locked pages since each device will
run in a separate IOMMU domain, accounted separately, and 3) we don't
have a way to expose host grouping to the VM so trying to assign
multiple devices from the same group is likely to fail.  We'd almost
need to put all of the devices within a group behind a conventional PCI
bridge in the VM to get them into the same address space, but I suspect
QEMU VT-d doesn't take that aliasing into account.

> > > > 
> > > > Furthermore, some platforms don't *have* a standard way for qemu to
> > > > 'tell the truth' to the guests, and that's where the real fun comes in.
> > > > But still, I'd like to see a generic solution for that lack instead of
> > > > a virtio-specific hack.  
> > > But the issue is not just these holes.  E.g. with VT-D it is only easy
> > > to emulate because there's a "caching mode" hook. It is fundamentally
> > > paravirtualization.  So a completely generic solution would be a
> > > paravirtualized IOMMU interface, replacing VT-D for VMs. It might be
> > > justified if many platforms have hard to emulate interfaces.  
> > 
> > Hm, I'm not sure I understand the point here.
> > 
> > Either there is a way for the hypervisor to expose an IOMMU to a guest
> > (be it full hardware virt, or paravirt). Or there isn't.
> > 
> > If there is, it doesn't matter *how* it's done.  
> 
> Well it does matter for people doing it :)
> 
> > And if there isn't, the
> > whole discussion is moot anyway.  
> 
> Point was that we can always build a paravirt interface
> if it does not exist, but it's easier to maintain
> if it's minimal, being as close to emulating hardware
> as we can.
> 
> > -- 
> > dwmw2
> > 
> >   
> 
> 

Andy Lutomirski April 19, 2016, 4 p.m. UTC | #14
On Apr 19, 2016 2:13 AM, "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
>
> I guess you are right in that we should split this part out.
> What I wanted is really the combination
> PASSTHROUGH && !PLATFORM so that we can say "ok we don't
> need to guess, this device actually bypasses the IOMMU".

What happens when you use a device like this on Xen or with a similar
software translation layer?

I think that a "please bypass IOMMU" feature would be better in the
PCI, IOMMU, or platform code.  For Xen, virtio would still want to use
the DMA API, just without translating at the DMAR or hardware level.
Doing it in virtio is awkward, because virtio is involved at the
device level and the driver level, but the translation might be
entirely in between.

I think a nicer long-term approach would be to have a way to ask the
guest to set up a full 1:1 mapping for performance, but to still
handle the case where the guest refuses to do so or where there's more
than one translation layer involved.

But I agree that this part shouldn't delay the other part of your series.

--Andy
Andy Lutomirski April 19, 2016, 4:02 p.m. UTC | #15
On Tue, Apr 19, 2016 at 3:27 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Mon, Apr 18, 2016 at 12:24:15PM -0700, Andy Lutomirski wrote:
>> On Mon, Apr 18, 2016 at 11:29 AM, David Woodhouse <dwmw2@infradead.org> wrote:
>> > For x86, you *can* enable virtio-behind-IOMMU if your DMAR tables tell
>> > the truth, and even legacy kernels ought to cope with that.
>> > FSVO 'ought to' where I suspect some of them will actually crash with a
>> > NULL pointer dereference if there's no "catch-all" DMAR unit in the
>> > tables, which puts it back into the same camp as ARM and Power.
>>
>> I think x86 may get a bit of a free pass here.  AFAIK the QEMU IOMMU
>> implementation on x86 has always been "experimental", so it just might
>> be okay to change it in a way that causes some older kernels to OOPS.
>>
>> --Andy
>
> Since it's experimental, it might be OK to change *guest kernels*
> such that they oops on old QEMU.
> But guest kernels were not experimental - so we need a QEMU mode that
> makes them work fine. The more functionality is available in this QEMU
> mode, the better, because it's going to be the default for a while. For
> the same reason, it is preferable to also have new kernels not crash in
> this mode.
>

People add QEMU features that need new guest kernels all the time.
If you enable virtio-scsi and try to boot a guest that's too old, it
won't work.  So I don't see anything fundamentally wrong with saying
that the non-experimental QEMU Q35 IOMMU mode won't boot if the guest
kernel is too old.  It might be annoying, since old kernels do work on
actual Q35 hardware, but it at least seems to me that it might be
okay.

--Andy
Michael S. Tsirkin April 19, 2016, 4:04 p.m. UTC | #16
On Tue, Apr 19, 2016 at 09:00:27AM -0700, Andy Lutomirski wrote:
> On Apr 19, 2016 2:13 AM, "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >
> >
> > I guess you are right in that we should split this part out.
> > What I wanted is really the combination
> > PASSTHROUGH && !PLATFORM so that we can say "ok we don't
> > need to guess, this device actually bypasses the IOMMU".
> 
> What happens when you use a device like this on Xen or with a similar
> software translation layer?

I think you don't use it on Xen since virtio doesn't bypass an IOMMU there.
If you do you have misconfigured your device.
Michael S. Tsirkin April 19, 2016, 4:09 p.m. UTC | #17
On Tue, Apr 19, 2016 at 09:02:14AM -0700, Andy Lutomirski wrote:
> On Tue, Apr 19, 2016 at 3:27 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Mon, Apr 18, 2016 at 12:24:15PM -0700, Andy Lutomirski wrote:
> >> On Mon, Apr 18, 2016 at 11:29 AM, David Woodhouse <dwmw2@infradead.org> wrote:
> >> > For x86, you *can* enable virtio-behind-IOMMU if your DMAR tables tell
> >> > the truth, and even legacy kernels ought to cope with that.
> >> > FSVO 'ought to' where I suspect some of them will actually crash with a
> >> > NULL pointer dereference if there's no "catch-all" DMAR unit in the
> >> > tables, which puts it back into the same camp as ARM and Power.
> >>
> >> I think x86 may get a bit of a free pass here.  AFAIK the QEMU IOMMU
> >> implementation on x86 has always been "experimental", so it just might
> >> be okay to change it in a way that causes some older kernels to OOPS.
> >>
> >> --Andy
> >
> > Since it's experimental, it might be OK to change *guest kernels*
> > such that they oops on old QEMU.
> > But guest kernels were not experimental - so we need a QEMU mode that
> > makes them work fine. The more functionality is available in this QEMU
> > mode, the better, because it's going to be the default for a while. For
> > the same reason, it is preferable to also have new kernels not crash in
> > this mode.
> >
> 
> People add QEMU features that need new guest kernels all the time.
> If you enable virtio-scsi and try to boot a guest that's too old, it
> won't work.  So I don't see anything fundamentally wrong with saying
> that the non-experimental QEMU Q35 IOMMU mode won't boot if the guest
> kernel is too old.  It might be annoying, since old kernels do work on
> actual Q35 hardware, but it at least seems to me that it might be
> okay.
> 
> --Andy

Yes but we need a mode that makes both old and new kernels work, and
that should be the default for a while.  This is what the
IOMMU_PASSTHROUGH flag was about: old kernels ignore it and bypass DMA
API, new kernels go "oh compatibility mode" and bypass the IOMMU
within DMA API.
Andy Lutomirski April 19, 2016, 4:12 p.m. UTC | #18
On Tue, Apr 19, 2016 at 9:09 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Apr 19, 2016 at 09:02:14AM -0700, Andy Lutomirski wrote:
>> On Tue, Apr 19, 2016 at 3:27 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Mon, Apr 18, 2016 at 12:24:15PM -0700, Andy Lutomirski wrote:
>> >> On Mon, Apr 18, 2016 at 11:29 AM, David Woodhouse <dwmw2@infradead.org> wrote:
>> >> > For x86, you *can* enable virtio-behind-IOMMU if your DMAR tables tell
>> >> > the truth, and even legacy kernels ought to cope with that.
>> >> > FSVO 'ought to' where I suspect some of them will actually crash with a
>> >> > NULL pointer dereference if there's no "catch-all" DMAR unit in the
>> >> > tables, which puts it back into the same camp as ARM and Power.
>> >>
>> >> I think x86 may get a bit of a free pass here.  AFAIK the QEMU IOMMU
>> >> implementation on x86 has always been "experimental", so it just might
>> >> be okay to change it in a way that causes some older kernels to OOPS.
>> >>
>> >> --Andy
>> >
>> > Since it's experimental, it might be OK to change *guest kernels*
>> > such that they oops on old QEMU.
>> > But guest kernels were not experimental - so we need a QEMU mode that
>> > makes them work fine. The more functionality is available in this QEMU
>> > mode, the better, because it's going to be the default for a while. For
>> > the same reason, it is preferable to also have new kernels not crash in
>> > this mode.
>> >
>>
>> People add QEMU features that need new guest kernels all the time.
>> If you enable virtio-scsi and try to boot a guest that's too old, it
>> won't work.  So I don't see anything fundamentally wrong with saying
>> that the non-experimental QEMU Q35 IOMMU mode won't boot if the guest
>> kernel is too old.  It might be annoying, since old kernels do work on
>> actual Q35 hardware, but it at least seems to me that it might be
>> okay.
>>
>> --Andy
>
> Yes but we need a mode that makes both old and new kernels work, and
> that should be the default for a while.  This is what the
> IOMMU_PASSTHROUGH flag was about: old kernels ignore it and bypass DMA
> API, new kernels go "oh compatibility mode" and bypass the IOMMU
> within DMA API.

I thought that PLATFORM served that purpose.  Wouldn't the host
advertise PLATFORM support and, if the guest doesn't ack it, the host
device would skip translation?  Or is that problematic for vfio?

>
> --
> MST
Michael S. Tsirkin April 19, 2016, 4:20 p.m. UTC | #19
On Tue, Apr 19, 2016 at 09:12:03AM -0700, Andy Lutomirski wrote:
> On Tue, Apr 19, 2016 at 9:09 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Tue, Apr 19, 2016 at 09:02:14AM -0700, Andy Lutomirski wrote:
> >> On Tue, Apr 19, 2016 at 3:27 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> > On Mon, Apr 18, 2016 at 12:24:15PM -0700, Andy Lutomirski wrote:
> >> >> On Mon, Apr 18, 2016 at 11:29 AM, David Woodhouse <dwmw2@infradead.org> wrote:
> >> >> > For x86, you *can* enable virtio-behind-IOMMU if your DMAR tables tell
> >> >> > the truth, and even legacy kernels ought to cope with that.
> >> >> > FSVO 'ought to' where I suspect some of them will actually crash with a
> >> >> > NULL pointer dereference if there's no "catch-all" DMAR unit in the
> >> >> > tables, which puts it back into the same camp as ARM and Power.
> >> >>
> >> >> I think x86 may get a bit of a free pass here.  AFAIK the QEMU IOMMU
> >> >> implementation on x86 has always been "experimental", so it just might
> >> >> be okay to change it in a way that causes some older kernels to OOPS.
> >> >>
> >> >> --Andy
> >> >
> >> > Since it's experimental, it might be OK to change *guest kernels*
> >> > such that they oops on old QEMU.
> >> > But guest kernels were not experimental - so we need a QEMU mode that
> >> > makes them work fine. The more functionality is available in this QEMU
> >> > mode, the better, because it's going to be the default for a while. For
> >> > the same reason, it is preferable to also have new kernels not crash in
> >> > this mode.
> >> >
> >>
> >> People add QEMU features that need new guest kernels all the time.
> >> If you enable virtio-scsi and try to boot a guest that's too old, it
> >> won't work.  So I don't see anything fundamentally wrong with saying
> >> that the non-experimental QEMU Q35 IOMMU mode won't boot if the guest
> >> kernel is too old.  It might be annoying, since old kernels do work on
> >> actual Q35 hardware, but it at least seems to me that it might be
> >> okay.
> >>
> >> --Andy
> >
> > Yes, but we need a mode that makes both old and new kernels work, and
> > that should be the default for a while.  This is what the
> > IOMMU_PASSTHROUGH flag was about: old kernels ignore it and bypass DMA
> > API, new kernels go "oh compatibility mode" and bypass the IOMMU
> > within DMA API.
> 
> I thought that PLATFORM served that purpose.  Wouldn't the host
> advertise PLATFORM support and, if the guest doesn't ack it, the host
> device would skip translation?  Or is that problematic for vfio?

Exactly that's problematic for security.
You can't allow guest driver to decide whether device skips security.

> >
> > --
> > MST
> 
> 
> 
> -- 
> Andy Lutomirski
> AMA Capital Management, LLC
David Woodhouse April 19, 2016, 4:26 p.m. UTC | #20
On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote:
> 
> > I thought that PLATFORM served that purpose.  Wouldn't the host
> > advertise PLATFORM support and, if the guest doesn't ack it, the host
> > device would skip translation?  Or is that problematic for vfio?
> 
> Exactly that's problematic for security.
> You can't allow guest driver to decide whether device skips security.

Right. Because fundamentally, this *isn't* a property of the endpoint
device, and doesn't live in virtio itself.

It's a property of the platform IOMMU, and lives there.
Michael S. Tsirkin April 19, 2016, 5:49 p.m. UTC | #21
On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote:
> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote:
> > 
> > > I thought that PLATFORM served that purpose.  Wouldn't the host
> > > advertise PLATFORM support and, if the guest doesn't ack it, the host
> > > device would skip translation?  Or is that problematic for vfio?
> > 
> > Exactly that's problematic for security.
> > You can't allow guest driver to decide whether device skips security.
> 
> Right. Because fundamentally, this *isn't* a property of the endpoint
> device, and doesn't live in virtio itself.
> 
> It's a property of the platform IOMMU, and lives there.

It's a property of the hypervisor virtio implementation, and lives there.
Andy Lutomirski April 19, 2016, 6:01 p.m. UTC | #22
On Tue, Apr 19, 2016 at 10:49 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote:
>> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote:
>> >
>> > > I thought that PLATFORM served that purpose.  Wouldn't the host
>> > > advertise PLATFORM support and, if the guest doesn't ack it, the host
>> > > device would skip translation?  Or is that problematic for vfio?
>> >
>> > Exactly that's problematic for security.
>> > You can't allow guest driver to decide whether device skips security.
>>
>> Right. Because fundamentally, this *isn't* a property of the endpoint
>> device, and doesn't live in virtio itself.
>>
>> It's a property of the platform IOMMU, and lives there.
>
> It's a property of the hypervisor virtio implementation, and lives there.

It is now, but QEMU could, in principle, change the way it thinks
about it so that virtio devices would use the QEMU DMA API but ask
QEMU to pass everything through 1:1.  This would be entirely invisible
to guests but would make it be a property of the IOMMU implementation.
At that point, maybe QEMU could find a (platform dependent) way to
tell the guest what's going on.

FWIW, as far as I can tell, PPC and SPARC really could, in principle,
set up 1:1 mappings in the guest so that the virtio devices would work
regardless of whether QEMU is ignoring the IOMMU or not -- I think the
only obstacle is that the PPC and SPARC 1:1 mappings are currently set
up with an offset.  I don't know too much about those platforms, but
presumably the layout could be changed so that 1:1 really was 1:1.

--Andy
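
As an illustrative aside (invented names, not actual kernel code): the "offset" referred to here is the difference between the bus address a direct-mapped DMA window hands out and the guest-physical address a bypassing virtio device actually dereferences; the two only agree when that offset is zero.

#include <stdint.h>

typedef uint64_t dma_addr_t;
typedef uint64_t phys_addr_t;

/* Hypothetical per-platform constant; non-zero today on the platforms
 * discussed above. */
static const dma_addr_t direct_window_base = 0x800000000000ULL;

/* What the guest's direct-mapped DMA ops would hand to the device. */
static dma_addr_t direct_map(phys_addr_t pa)
{
    return direct_window_base + pa;
}

/* A device that ignores the IOMMU treats the handle as a guest-physical
 * address, so bypass only works if the window really is 1:1. */
static int bypass_is_safe(void)
{
    return direct_window_base == 0;
}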
Michael S. Tsirkin April 19, 2016, 8:16 p.m. UTC | #23
On Tue, Apr 19, 2016 at 11:01:38AM -0700, Andy Lutomirski wrote:
> On Tue, Apr 19, 2016 at 10:49 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote:
> >> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote:
> >> >
> >> > > I thought that PLATFORM served that purpose.  Wouldn't the host
> >> > > advertise PLATFORM support and, if the guest doesn't ack it, the host
> >> > > device would skip translation?  Or is that problematic for vfio?
> >> >
> >> > Exactly that's problematic for security.
> >> > You can't allow guest driver to decide whether device skips security.
> >>
> >> Right. Because fundamentally, this *isn't* a property of the endpoint
> >> device, and doesn't live in virtio itself.
> >>
> >> It's a property of the platform IOMMU, and lives there.
> >
> > It's a property of the hypervisor virtio implementation, and lives there.
> 
> It is now, but QEMU could, in principle, change the way it thinks
> about it so that virtio devices would use the QEMU DMA API but ask
> QEMU to pass everything through 1:1.  This would be entirely invisible
> to guests but would make it be a property of the IOMMU implementation.
> At that point, maybe QEMU could find a (platform dependent) way to
> tell the guest what's going on.
> 
> FWIW, as far as I can tell, PPC and SPARC really could, in principle,
> set up 1:1 mappings in the guest so that the virtio devices would work
> regardless of whether QEMU is ignoring the IOMMU or not -- I think the
> only obstacle is that the PPC and SPARC 1:1 mappings are currently set
> up with an offset.  I don't know too much about those platforms, but
> presumably the layout could be changed so that 1:1 really was 1:1.
> 
> --Andy

Sure. Do you see any reason why the decision to do this can't be
keyed off the virtio feature bit?
Andy Lutomirski April 19, 2016, 8:27 p.m. UTC | #24
On Tue, Apr 19, 2016 at 1:16 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Apr 19, 2016 at 11:01:38AM -0700, Andy Lutomirski wrote:
>> On Tue, Apr 19, 2016 at 10:49 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote:
>> >> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote:
>> >> >
>> >> > > I thought that PLATFORM served that purpose.  Wouldn't the host
>> >> > > advertise PLATFORM support and, if the guest doesn't ack it, the host
>> >> > > device would skip translation?  Or is that problematic for vfio?
>> >> >
>> >> > Exactly that's problematic for security.
>> >> > You can't allow guest driver to decide whether device skips security.
>> >>
>> >> Right. Because fundamentally, this *isn't* a property of the endpoint
>> >> device, and doesn't live in virtio itself.
>> >>
>> >> It's a property of the platform IOMMU, and lives there.
>> >
>> > It's a property of the hypervisor virtio implementation, and lives there.
>>
>> It is now, but QEMU could, in principle, change the way it thinks
>> about it so that virtio devices would use the QEMU DMA API but ask
>> QEMU to pass everything through 1:1.  This would be entirely invisible
>> to guests but would make it be a property of the IOMMU implementation.
>> At that point, maybe QEMU could find a (platform dependent) way to
>> tell the guest what's going on.
>>
>> FWIW, as far as I can tell, PPC and SPARC really could, in principle,
>> set up 1:1 mappings in the guest so that the virtio devices would work
>> regardless of whether QEMU is ignoring the IOMMU or not -- I think the
>> only obstacle is that the PPC and SPARC 1:1 mappings are currently set
>> up with an offset.  I don't know too much about those platforms, but
>> presumably the layout could be changed so that 1:1 really was 1:1.
>>
>> --Andy
>
> Sure. Do you see any reason why the decision to do this can't be
> keyed off the virtio feature bit?

I can think of three types of virtio host:

a) virtio always bypasses the IOMMU.

b) virtio never bypasses the IOMMU (unless DMAR tables or similar say
it does) -- i.e. virtio works like any other device.

c) virtio may bypass the IOMMU depending on what the guest asks it to do.

If this is keyed off a virtio feature bit and anyone tries to
implement (c), then vfio is going to have a problem.  And, if it's
keyed off a virtio feature bit, then (a) won't work on Xen or similar
setups unless the Xen hypervisor adds a giant and probably unreliable
kludge to support it.  Meanwhile, 4.6-rc works fine under Xen on a
default x86 QEMU configuration, and I'd really like to keep it that
way.

What could plausibly work using a virtio feature bit is for a device
to say "hey, I'm a new device and I support the platform-defined IOMMU
mechanism".  This bit would be *set* on default IOMMU-less QEMU
configurations and on physical virtio PCI cards.  The guest could
operate accordingly.  I'm not sure I see a good way for feature
negotiation to work the other direction, though.

PPC and SPARC could only set this bit on emulated devices if they know
that new guest kernels are in use.

--Andy
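
A rough sketch of the negotiation being proposed here (illustrative only; the helper names are invented and the bit number is reused from the RFC patch at the bottom of this page). The point that matters for vfio and for security is that the host derives the device's DMA behaviour from the bit it offers, never from whether the guest acks it:

#include <stdint.h>

#define VIRTIO_F_IOMMU_PLATFORM 34  /* "I follow the platform's DMA rules" */

struct virtio_dev_features {
    uint64_t offered;   /* chosen by the hypervisor or physical device */
    uint64_t acked;     /* chosen by the guest driver */
};

/* Host side: translation policy depends only on what was offered. */
static int host_uses_platform_iommu(const struct virtio_dev_features *f)
{
    return !!(f->offered & (1ULL << VIRTIO_F_IOMMU_PLATFORM));
}

/* Guest side: if the bit is offered, treat the device like any other PCI
 * device and use the DMA API; otherwise apply the legacy bypass quirk. */
static int guest_uses_dma_api(const struct virtio_dev_features *f)
{
    return !!(f->offered & (1ULL << VIRTIO_F_IOMMU_PLATFORM));
}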
Michael S. Tsirkin April 19, 2016, 8:54 p.m. UTC | #25
On Tue, Apr 19, 2016 at 01:27:29PM -0700, Andy Lutomirski wrote:
> On Tue, Apr 19, 2016 at 1:16 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Tue, Apr 19, 2016 at 11:01:38AM -0700, Andy Lutomirski wrote:
> >> On Tue, Apr 19, 2016 at 10:49 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> > On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote:
> >> >> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote:
> >> >> >
> >> >> > > I thought that PLATFORM served that purpose.  Wouldn't the host
> >> >> > > advertise PLATFORM support and, if the guest doesn't ack it, the host
> >> >> > > device would skip translation?  Or is that problematic for vfio?
> >> >> >
> >> >> > Exactly that's problematic for security.
> >> >> > You can't allow guest driver to decide whether device skips security.
> >> >>
> >> >> Right. Because fundamentally, this *isn't* a property of the endpoint
> >> >> device, and doesn't live in virtio itself.
> >> >>
> >> >> It's a property of the platform IOMMU, and lives there.
> >> >
> >> > It's a property of the hypervisor virtio implementation, and lives there.
> >>
> >> It is now, but QEMU could, in principle, change the way it thinks
> >> about it so that virtio devices would use the QEMU DMA API but ask
> >> QEMU to pass everything through 1:1.  This would be entirely invisible
> >> to guests but would make it be a property of the IOMMU implementation.
> >> At that point, maybe QEMU could find a (platform dependent) way to
> >> tell the guest what's going on.
> >>
> >> FWIW, as far as I can tell, PPC and SPARC really could, in principle,
> >> set up 1:1 mappings in the guest so that the virtio devices would work
> >> regardless of whether QEMU is ignoring the IOMMU or not -- I think the
> >> only obstacle is that the PPC and SPARC 1:1 mappings are currently set
> >> up with an offset.  I don't know too much about those platforms, but
> >> presumably the layout could be changed so that 1:1 really was 1:1.
> >>
> >> --Andy
> >
> > Sure. Do you see any reason why the decision to do this can't be
> > keyed off the virtio feature bit?
> 
> I can think of three types of virtio host:
> 
> a) virtio always bypasses the IOMMU.
> 
> b) virtio never bypasses the IOMMU (unless DMAR tables or similar say
> it does) -- i.e. virtio works like any other device.
> 
> c) virtio may bypass the IOMMU depending on what the guest asks it to do.

d) some virtio devices bypass the IOMMU and some don't,
e.g. it's harder to support IOMMU with vhost.


> If this is keyed off a virtio feature bit and anyone tries to
> implement (c), then vfio is going to have a problem.  And, if it's
> keyed off a virtio feature bit, then (a) won't work on Xen or similar
> setups unless the Xen hypervisor adds a giant and probably unreliable
> kludge to support it.  Meanwhile, 4.6-rc works fine under Xen on a
> default x86 QEMU configuration, and I'd really like to keep it that
> way.
> 
> What could plausibly work using a virtio feature bit is for a device
> to say "hey, I'm a new device and I support the platform-defined IOMMU
> mechanism".  This bit would be *set* on default IOMMU-less QEMU
> configurations and on physical virtio PCI cards.

And clear on xen.

>  The guest could
> operate accordingly.  I'm not sure I see a good way for feature
> negotiation to work the other direction, though.

I agree.

> PPC and SPARC could only set this bit on emulated devices if they know
> that new guest kernels are in use.
> 
> --Andy
Andy Lutomirski April 19, 2016, 9:07 p.m. UTC | #26
On Tue, Apr 19, 2016 at 1:54 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Apr 19, 2016 at 01:27:29PM -0700, Andy Lutomirski wrote:
>> On Tue, Apr 19, 2016 at 1:16 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Tue, Apr 19, 2016 at 11:01:38AM -0700, Andy Lutomirski wrote:
>> >> On Tue, Apr 19, 2016 at 10:49 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> >> > On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote:
>> >> >> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote:
>> >> >> >
>> >> >> > > I thought that PLATFORM served that purpose.  Wouldn't the host
>> >> >> > > advertise PLATFORM support and, if the guest doesn't ack it, the host
>> >> >> > > device would skip translation?  Or is that problematic for vfio?
>> >> >> >
>> >> >> > Exactly that's problematic for security.
>> >> >> > You can't allow guest driver to decide whether device skips security.
>> >> >>
>> >> >> Right. Because fundamentally, this *isn't* a property of the endpoint
>> >> >> device, and doesn't live in virtio itself.
>> >> >>
>> >> >> It's a property of the platform IOMMU, and lives there.
>> >> >
>> >> > It's a property of the hypervisor virtio implementation, and lives there.
>> >>
>> >> It is now, but QEMU could, in principle, change the way it thinks
>> >> about it so that virtio devices would use the QEMU DMA API but ask
>> >> QEMU to pass everything through 1:1.  This would be entirely invisible
>> >> to guests but would make it be a property of the IOMMU implementation.
>> >> At that point, maybe QEMU could find a (platform dependent) way to
>> >> tell the guest what's going on.
>> >>
>> >> FWIW, as far as I can tell, PPC and SPARC really could, in principle,
>> >> set up 1:1 mappings in the guest so that the virtio devices would work
>> >> regardless of whether QEMU is ignoring the IOMMU or not -- I think the
>> >> only obstacle is that the PPC and SPARC 1:1 mappings are currently set
>> >> up with an offset.  I don't know too much about those platforms, but
>> >> presumably the layout could be changed so that 1:1 really was 1:1.
>> >>
>> >> --Andy
>> >
>> > Sure. Do you see any reason why the decision to do this can't be
>> > keyed off the virtio feature bit?
>>
>> I can think of three types of virtio host:
>>
>> a) virtio always bypasses the IOMMU.
>>
>> b) virtio never bypasses the IOMMU (unless DMAR tables or similar say
>> it does) -- i.e. virtio works like any other device.
>>
>> c) virtio may bypass the IOMMU depending on what the guest asks it to do.
>
> d) some virtio devices bypass the IOMMU and some don't,
> e.g. it's harder to support IOMMU with vhost.
>
>
>> If this is keyed off a virtio feature bit and anyone tries to
>> implement (c), then vfio is going to have a problem.  And, if it's
>> keyed off a virtio feature bit, then (a) won't work on Xen or similar
>> setups unless the Xen hypervisor adds a giant and probably unreliable
>> kludge to support it.  Meanwhile, 4.6-rc works fine under Xen on a
>> default x86 QEMU configuration, and I'd really like to keep it that
>> way.
>>
>> What could plausibly work using a virtio feature bit is for a device
>> to say "hey, I'm a new device and I support the platform-defined IOMMU
>> mechanism".  This bit would be *set* on default IOMMU-less QEMU
>> configurations and on physical virtio PCI cards.
>
> And clear on xen.

How?  QEMU has no idea that the guest is running Xen.
Michael S. Tsirkin April 20, 2016, 1:14 p.m. UTC | #27
On Tue, Apr 19, 2016 at 02:07:01PM -0700, Andy Lutomirski wrote:
> On Tue, Apr 19, 2016 at 1:54 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Tue, Apr 19, 2016 at 01:27:29PM -0700, Andy Lutomirski wrote:
> >> On Tue, Apr 19, 2016 at 1:16 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> > On Tue, Apr 19, 2016 at 11:01:38AM -0700, Andy Lutomirski wrote:
> >> >> On Tue, Apr 19, 2016 at 10:49 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> >> > On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote:
> >> >> >> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote:
> >> >> >> >
> >> >> >> > > I thought that PLATFORM served that purpose.  Wouldn't the host
> >> >> >> > > advertise PLATFORM support and, if the guest doesn't ack it, the host
> >> >> >> > > device would skip translation?  Or is that problematic for vfio?
> >> >> >> >
> >> >> >> > Exactly that's problematic for security.
> >> >> >> > You can't allow guest driver to decide whether device skips security.
> >> >> >>
> >> >> >> Right. Because fundamentally, this *isn't* a property of the endpoint
> >> >> >> device, and doesn't live in virtio itself.
> >> >> >>
> >> >> >> It's a property of the platform IOMMU, and lives there.
> >> >> >
> >> >> > It's a property of the hypervisor virtio implementation, and lives there.
> >> >>
> >> >> It is now, but QEMU could, in principle, change the way it thinks
> >> >> about it so that virtio devices would use the QEMU DMA API but ask
> >> >> QEMU to pass everything through 1:1.  This would be entirely invisible
> >> >> to guests but would make it be a property of the IOMMU implementation.
> >> >> At that point, maybe QEMU could find a (platform dependent) way to
> >> >> tell the guest what's going on.
> >> >>
> >> >> FWIW, as far as I can tell, PPC and SPARC really could, in principle,
> >> >> set up 1:1 mappings in the guest so that the virtio devices would work
> >> >> regardless of whether QEMU is ignoring the IOMMU or not -- I think the
> >> >> only obstacle is that the PPC and SPARC 1:1 mappings are currently set
> >> >> up with an offset.  I don't know too much about those platforms, but
> >> >> presumably the layout could be changed so that 1:1 really was 1:1.
> >> >>
> >> >> --Andy
> >> >
> >> > Sure. Do you see any reason why the decision to do this can't be
> >> > keyed off the virtio feature bit?
> >>
> >> I can think of three types of virtio host:
> >>
> >> a) virtio always bypasses the IOMMU.
> >>
> >> b) virtio never bypasses the IOMMU (unless DMAR tables or similar say
> >> it does) -- i.e. virtio works like any other device.
> >>
> >> c) virtio may bypass the IOMMU depending on what the guest asks it to do.
> >
> > d) some virtio devices bypass the IOMMU and some don't,
> > e.g. it's harder to support IOMMU with vhost.
> >
> >
> >> If this is keyed off a virtio feature bit and anyone tries to
> >> implement (c), then vfio is going to have a problem.  And, if it's
> >> keyed off a virtio feature bit, then (a) won't work on Xen or similar
> >> setups unless the Xen hypervisor adds a giant and probably unreliable
> >> kludge to support it.  Meanwhile, 4.6-rc works fine under Xen on a
> >> default x86 QEMU configuration, and I'd really like to keep it that
> >> way.
> >>
> >> What could plausibly work using a virtio feature bit is for a device
> >> to say "hey, I'm a new device and I support the platform-defined IOMMU
> >> mechanism".  This bit would be *set* on default IOMMU-less QEMU
> >> configurations and on physical virtio PCI cards.
> >
> > And clear on xen.
> 
> How?  QEMU has no idea that the guest is running Xen.

I was under the impression that xen_enabled() is true in QEMU.
Am I wrong?
Andy Lutomirski April 20, 2016, 3:43 p.m. UTC | #28
On Apr 20, 2016 6:14 AM, "Michael S. Tsirkin" <mst@redhat.com> wrote:
>
> On Tue, Apr 19, 2016 at 02:07:01PM -0700, Andy Lutomirski wrote:
> > On Tue, Apr 19, 2016 at 1:54 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > On Tue, Apr 19, 2016 at 01:27:29PM -0700, Andy Lutomirski wrote:
> > >> On Tue, Apr 19, 2016 at 1:16 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > >> > On Tue, Apr 19, 2016 at 11:01:38AM -0700, Andy Lutomirski wrote:
> > >> >> On Tue, Apr 19, 2016 at 10:49 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > >> >> > On Tue, Apr 19, 2016 at 12:26:44PM -0400, David Woodhouse wrote:
> > >> >> >> On Tue, 2016-04-19 at 19:20 +0300, Michael S. Tsirkin wrote:
> > >> >> >> >
> > >> >> >> > > I thought that PLATFORM served that purpose.  Wouldn't the host
> > >> >> >> > > advertise PLATFORM support and, if the guest doesn't ack it, the host
> > >> >> >> > > device would skip translation?  Or is that problematic for vfio?
> > >> >> >> >
> > >> >> >> > Exactly that's problematic for security.
> > >> >> >> > You can't allow guest driver to decide whether device skips security.
> > >> >> >>
> > >> >> >> Right. Because fundamentally, this *isn't* a property of the endpoint
> > >> >> >> device, and doesn't live in virtio itself.
> > >> >> >>
> > >> >> >> It's a property of the platform IOMMU, and lives there.
> > >> >> >
> > >> >> > It's a property of the hypervisor virtio implementation, and lives there.
> > >> >>
> > >> >> It is now, but QEMU could, in principle, change the way it thinks
> > >> >> about it so that virtio devices would use the QEMU DMA API but ask
> > >> >> QEMU to pass everything through 1:1.  This would be entirely invisible
> > >> >> to guests but would make it be a property of the IOMMU implementation.
> > >> >> At that point, maybe QEMU could find a (platform dependent) way to
> > >> >> tell the guest what's going on.
> > >> >>
> > >> >> FWIW, as far as I can tell, PPC and SPARC really could, in principle,
> > >> >> set up 1:1 mappings in the guest so that the virtio devices would work
> > >> >> regardless of whether QEMU is ignoring the IOMMU or not -- I think the
> > >> >> only obstacle is that the PPC and SPARC 1:1 mappings are currently set
> > >> >> up with an offset.  I don't know too much about those platforms, but
> > >> >> presumably the layout could be changed so that 1:1 really was 1:1.
> > >> >>
> > >> >> --Andy
> > >> >
> > >> > Sure. Do you see any reason why the decision to do this can't be
> > >> > keyed off the virtio feature bit?
> > >>
> > >> I can think of three types of virtio host:
> > >>
> > >> a) virtio always bypasses the IOMMU.
> > >>
> > >> b) virtio never bypasses the IOMMU (unless DMAR tables or similar say
> > >> it does) -- i.e. virtio works like any other device.
> > >>
> > >> c) virtio may bypass the IOMMU depending on what the guest asks it to do.
> > >
> > > d) some virtio devices bypass the IOMMU and some don't,
> > > e.g. it's harder to support IOMMU with vhost.
> > >
> > >
> > >> If this is keyed off a virtio feature bit and anyone tries to
> > >> implement (c), then vfio is going to have a problem.  And, if it's
> > >> keyed off a virtio feature bit, then (a) won't work on Xen or similar
> > >> setups unless the Xen hypervisor adds a giant and probably unreliable
> > >> kludge to support it.  Meanwhile, 4.6-rc works fine under Xen on a
> > >> default x86 QEMU configuration, and I'd really like to keep it that
> > >> way.
> > >>
> > >> What could plausibly work using a virtio feature bit is for a device
> > >> to say "hey, I'm a new device and I support the platform-defined IOMMU
> > >> mechanism".  This bit would be *set* on default IOMMU-less QEMU
> > >> configurations and on physical virtio PCI cards.
> > >
> > > And clear on xen.
> >
> > How?  QEMU has no idea that the guest is running Xen.
>
> I was under the impression that xen_enabled() is true in QEMU.
> Am I wrong?

I'd be rather surprised, given that QEMU would have to inspect the
guest kernel to figure it out.  I'm talking about Xen under QEMU.  For
example, if you feed QEMU a guest disk image that contains Fedora with
the xen packages installed, you can boot it and get a grub menu.  If
you ask grub to boot Xen, you get Xen.  If you ask grub to boot Linux
directly, you don't get Xen.

I assume xen_enabled is for QEMU under Xen, i.e. QEMU, running under
Xen, supplying emulated devices to a Xen domU guest.  Since QEMU is
seeing the guest address space directly, this should be much the same
as QEMU !xen_enabled -- if you boot plain Linux, everything works, but
if you do Xen -> QEMU -> HVM guest running Xen PV -> Linux, then
virtio drivers in the Xen PV Linux guest need to translate addresses.
--Andy

>
> --
> MST
diff mbox

Patch

diff --git a/include/hw/virtio/virtio-access.h b/include/hw/virtio/virtio-access.h
index 967cc75..bb6f34e 100644
--- a/include/hw/virtio/virtio-access.h
+++ b/include/hw/virtio/virtio-access.h
@@ -23,7 +23,8 @@  static inline AddressSpace *virtio_get_dma_as(VirtIODevice *vdev)
     BusState *qbus = qdev_get_parent_bus(DEVICE(vdev));
     VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
 
-    if (k->get_dma_as) {
+    if ((vdev->host_features & (0x1ULL << VIRTIO_F_IOMMU_PLATFORM)) &&
+        k->get_dma_as) {
         return k->get_dma_as(qbus->parent);
     }
     return &address_space_memory;
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index b12faa9..34d3041 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -228,7 +228,11 @@  typedef struct VirtIORNGConf VirtIORNGConf;
     DEFINE_PROP_BIT64("notify_on_empty", _state, _field,  \
                       VIRTIO_F_NOTIFY_ON_EMPTY, true), \
     DEFINE_PROP_BIT64("any_layout", _state, _field, \
-                      VIRTIO_F_ANY_LAYOUT, true)
+                      VIRTIO_F_ANY_LAYOUT, true), \
+    DEFINE_PROP_BIT64("iommu_passthrough", _state, _field, \
+                      VIRTIO_F_IOMMU_PASSTHROUGH, false), \
+    DEFINE_PROP_BIT64("iommu_platform", _state, _field, \
+                      VIRTIO_F_IOMMU_PLATFORM, false)
 
 hwaddr virtio_queue_get_desc_addr(VirtIODevice *vdev, int n);
 hwaddr virtio_queue_get_avail_addr(VirtIODevice *vdev, int n);
diff --git a/include/standard-headers/linux/virtio_config.h b/include/standard-headers/linux/virtio_config.h
index bcc445b..5564dab 100644
--- a/include/standard-headers/linux/virtio_config.h
+++ b/include/standard-headers/linux/virtio_config.h
@@ -61,4 +61,12 @@ 
 /* v1.0 compliant. */
 #define VIRTIO_F_VERSION_1		32
 
+/* Request IOMMU passthrough (if available)
+ * Without VIRTIO_F_IOMMU_PLATFORM: bypass the IOMMU even if enabled.
+ * With VIRTIO_F_IOMMU_PLATFORM: suggest disabling IOMMU.
+ */
+#define VIRTIO_F_IOMMU_PASSTHROUGH	33
+
+/* Do not bypass the IOMMU (if configured) */
+#define VIRTIO_F_IOMMU_PLATFORM		34
 #endif /* _LINUX_VIRTIO_CONFIG_H */
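
Purely as an illustrative summary (not part of the patch), combining the virtio_get_dma_as() change in the first hunk with the header comment above gives roughly this decision table for the two new host_features bits, plus a helper that mirrors the QEMU-side check:

/*
 * iommu_platform  iommu_passthrough   device DMA                hint to the guest
 * --------------  -----------------   ------------------------  -------------------------------
 *      off              off           bypasses the IOMMU        none (legacy behaviour)
 *      off              on            bypasses the IOMMU        new guests may keep the DMA API
 *                                                                by using an identity mapping
 *      on               off           goes through the IOMMU    guest must use the DMA API
 *      on               on            goes through the IOMMU    guest is asked to prefer an
 *                                                                IOMMU passthrough mapping
 */

#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_F_IOMMU_PLATFORM 34   /* as defined in the last hunk */

/* Mirrors the check added to virtio_get_dma_as() in the first hunk. */
static bool virtio_device_behind_iommu(uint64_t host_features)
{
    return host_features & (0x1ULL << VIRTIO_F_IOMMU_PLATFORM);
}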