
[RFC] iommu: make no-quarantine mean no-quarantine

Message ID 0a292d952ff71dbfed5234d27b42a05f7b49a1fe.1619451434.git.scott.davis@starlab.io (mailing list archive)
State New
Series [RFC] iommu: make no-quarantine mean no-quarantine

Commit Message

Scott Davis April 26, 2021, 5:25 p.m. UTC
This patch modifies Xen's behavior when making devices assignable while the
iommu=no-quarantine command line option is in effect. Currently this option
only affects device deassignment, causing devices to get immediately assigned
back to Dom0 instead of to the quarantine dom_io domain. This patch extends
no-quarantine to device assignment as well, preventing devices from being
assigned to dom_io when they are made assignable while no-quarantine is in
effect.

Signed-off-by: Scott Davis <scott.davis@starlab.io>
---
First patch submission, apologies in advance for any formatting or other errors.

Background: I am setting up a QEMU-based development and testing environment for
the Crucible team at Star Lab that includes emulated PCIe devices for
passthrough and hotplug. I encountered an issue with `xl pci-assignable-add`
that causes the host QEMU to rapidly allocate memory until getting OOM-killed.
I then found that the issue could be worked around either by using manual sysfs
commands to rebind devices to pciback or by skipping over the quarantine logic
in `libxl__device_pci_assignable_add`, producing a working system. I hoped that
setting iommu=no-quarantine on the command line would have the same effect, only
to be surprised that it did not.
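
For reference, the manual sysfs workaround mentioned above can be sketched roughly as follows (the BDF is an example, and driver/path names may vary by kernel; on many kernels pciback is provided by the xen-pciback module):

```shell
# Hypothetical sketch: rebind a device to pciback via sysfs, bypassing
# libxl's quarantine logic in `xl pci-assignable-add`.
BDF=0000:01:00.0   # example device address

# Unbind the device from its current driver, if it has one.
if [ -e /sys/bus/pci/devices/$BDF/driver ]; then
    echo "$BDF" > /sys/bus/pci/devices/$BDF/driver/unbind
fi

# Expose a slot to pciback, then bind the device to it.
echo "$BDF" > /sys/bus/pci/drivers/pciback/new_slot
echo "$BDF" > /sys/bus/pci/drivers/pciback/bind
```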

This patch fixes the issue I encountered and, in my opinion, makes the behavior
of no-quarantine more intuitive. While I intend to root-cause whatever
interaction between Xen and QEMU's VT-d vIOMMU is causing the problem, I plan to
carry this patch, or one like it, in the meantime. I'd welcome any feedback on
its design or intent. Note that this RFC version is based and tested on
Crucible's 4.14.1 branch of Xen, but I will rebase and test on staging if it's
desired upstream in some form.
---
 docs/misc/xen-command-line.pandoc | 8 ++++----
 xen/drivers/passthrough/pci.c     | 9 +++++++--
 2 files changed, 11 insertions(+), 6 deletions(-)

--
2.25.1

Comments

Jan Beulich April 27, 2021, 6:56 a.m. UTC | #1
On 26.04.2021 19:25, Scott Davis wrote:
> This patch modifies Xen's behavior when making devices assignable while the
> iommu=no-quarantine command line option is in effect. Currently this option
> only affects device deassignment, causing devices to get immediately assigned
> back to Dom0 instead of to the quarantine dom_io domain. This patch extends
> no-quarantine to device assignment as well, preventing devices from being
> assigned to dom_io when they are made assignable while no-quarantine is in
> effect.

Well, the term "quarantine" to me means a safety action taken _after_
possible exposure to something "bad". Therefore I see this being specific
to device de-assignment as the logical thing. Hence if a mode like what
you describe was wanted, I don't think it should be the result of
"iommu=no-quarantine".

> First patch submission, apologies in advance for any formatting or other errors.

I couldn't spot any, except maybe ...

> Background: I am setting up a QEMU-based development and testing environment for
> the Crucible team at Star Lab that includes emulated PCIe devices for
> passthrough and hotplug. I encountered an issue with `xl pci-assignable-add`
> that causes the host QEMU to rapidly allocate memory until getting OOM-killed.
> I then found that the issue could be worked around either by using manual sysfs
> commands to rebind devices to pciback or by skipping over the quarantine logic
> in `libxl__device_pci_assignable_add`, producing a working system. I hoped that
> setting iommu=no-quarantine on the command line would have the same effect, only
> to be surprised that it did not.

... some of this "why do we want this" would belong in the commit message
imo, not just here.

Jan
Scott Davis April 27, 2021, 10 p.m. UTC | #2
Thanks for the feedback, Jan!

On 4/27/21, 2:56 AM, Jan Beulich wrote:
> On 26.04.2021 19:25, Scott Davis wrote:
>> This patch modifies Xen's behavior when making devices assignable while the
>> iommu=no-quarantine command line option is in effect. Currently this option
>> only affects device deassignment, causing devices to get immediately assigned
>> back to Dom0 instead of to the quarantine dom_io domain. This patch extends
>> no-quarantine to device assignment as well, preventing devices from being
>> assigned to dom_io when they are made assignable while no-quarantine is in
>> effect.
>
> Well, the term "quarantine" to me means a safety action taken _after_
> possible exposure to something "bad". Therefore I see this being specific
> to device de-assignment as the logical thing. Hence if a mode like what
> you describe was wanted, I don't think it should be the result of
> "iommu=no-quarantine".

Sorry, I'm a bit confused by this. Correct me if I'm wrong, but my understanding is
that the purpose of assigning a device to dom_io on de-assignment is to protect  
Dom0 from the effects of in-flight DMA operations initiated by a DomU. I assumed  
that the purpose of (temporarily) assigning to dom_io on assignment was the same  
but in reverse: protecting a DomU from Dom0-initiated ops. Is this not the case?

Also, documentation and code already refer to the operation in question as a 
"quarantine" (see xl command line docs and libxl__device_pci_assignable_add) 
and to devices that have undergone it as being "quarantined" (see assign_device 
in xen/drivers/passthrough/pci.c). So if that is not the correct term, there may 
be some additional changes needed for consistency. Is there another name that 
would be more appropriate?

I would also point out that, currently, there does not appear to be a way for an 
xl user to opt out of quarantine functionality in either direction other than by 
manually making devices assignable. There is no xl command line flag to disable 
it and iommu=no-quarantine will have no effect because any device that xl itself
makes assignable will have the struct pci_dev.quarantine flag set, which 
overrides iommu=no-quarantine. Is that intentional?

If I misunderstood and your objection is simply that "quarantine-on-assignment"  
and "quarantine-on-deassignment" should be controllable by separate iommu  
options, that's an easy enough change to make. Although I think that might also 
negate the need for/effect of struct pci_dev.quarantine as described above. If 
that's what is desired, any thoughts on what the new option(s) should be called?

>> Background: I am setting up a QEMU-based development and testing environment for
>> the Crucible team at Star Lab that includes emulated PCIe devices for
>> passthrough and hotplug. I encountered an issue with `xl pci-assignable-add`
>> that causes the host QEMU to rapidly allocate memory until getting OOM-killed.
>> I then found that the issue could be worked around either by using manual sysfs
>> commands to rebind devices to pciback or by skipping over the quarantine logic
>> in `libxl__device_pci_assignable_add`, producing a working system. I hoped that
>> setting iommu=no-quarantine on the command line would have the same effect, only
>> to be surprised that it did not.
>
> ... some of this "why do we want this" would belong in the commit message
> imo, not just here.

Thanks for the tip, I will include this info in the commit message in any future 
revisions.

Good day,
Scott
Jan Beulich April 28, 2021, 6:15 a.m. UTC | #3
On 28.04.2021 00:00, Scott Davis wrote:
> On 4/27/21, 2:56 AM, Jan Beulich wrote:
>> On 26.04.2021 19:25, Scott Davis wrote:
>>> This patch modifies Xen's behavior when making devices assignable while the
>>> iommu=no-quarantine command line option is in effect. Currently this option
>>> only affects device deassignment, causing devices to get immediately assigned
>>> back to Dom0 instead of to the quarantine dom_io domain. This patch extends
>>> no-quarantine to device assignment as well, preventing devices from being
>>> assigned to dom_io when they are made assignable while no-quarantine is in
>>> effect.
>>
>> Well, the term "quarantine" to me means a safety action taken _after_
>> possible exposure to something "bad". Therefore I see this being specific
>> to device de-assignment as the logical thing. Hence if a mode like what
>> you describe was wanted, I don't think it should be the result of
>> "iommu=no-quarantine".
> 
> Sorry I'm a bit confused by this. Correct me if wrong, but my understanding is  
> that the purpose of assigning a device to dom_io on de-assignment is to protect  
> Dom0 from the effects of in-flight DMA operations initiated by a DomU. I assumed  
> that the purpose of (temporarily) assigning to dom_io on assignment was the same  
> but in reverse: protecting a DomU from Dom0-initiated ops. Is this not the case?

Well, no, not really. Dom0 is considered fully trusted for a variety of
reasons.

> Also, documentation and code already refer to the operation in question as a 
> "quarantine" (see xl command line docs and libxl__device_pci_assignable_add) 
> and to devices that have undergone it as being "quarantined" (see assign_device 
> in xen/drivers/passthrough/pci.c). So if that is not the correct term, there may 
> be some additional changes needed for consistency. Is there another name that 
> would be more appropriate?

I don't see what's wrong with the term for how things are currently. If
you talk about an adjustment to terminology to accompany your proposed
change - not sure.

> I would also point out that, currently, there does not appear to be a way for an 
> xl user to opt out of quarantine functionality in either direction other than by 
> manually making devices assignable. There is no xl command line flag to disable 
> it and iommu=no-quarantine will have no effect because any device that xl itself
> makes assignable will have the struct pci_dev.quarantine flag set, which 
> overrides iommu=no-quarantine. Is that intentional?

Not sure here either: It may also have been that it was assumed to not
be of interest. Paul?

> If I misunderstood and your objection is simply that "quarantine-on-assignment"  
> and "quarantine-on-deassignment" should be controllable by separate iommu  
> options, that's an easy enough change to make.

Yes, effectively that's what I think things would want to be, if
"quarantine-on-assignment" is really something that we think is needed.
It would default to off imo.

> Although I think that might also 
> negate the need for/effect of struct pci_dev.quarantine as described above. If 
> that's what is desired, any thoughts on what the new option(s) should be called?

Following the extension to the command line option I'm putting in place
in "IOMMU: make DMA containment of quarantined devices optional" (which
I still need to get around to address review feedback for and resubmit),
I'd be inclined to suggest "iommu=quarantine=always" or
"iommu=quarantine=on-assign". Unless of course we'd prefer to have the
caller of the assignment operation have full control over the behavior
here anyway (in which case a command line option control simply is not
necessary).

Jan
Paul Durrant April 28, 2021, 7:19 a.m. UTC | #4
On 28/04/2021 07:15, Jan Beulich wrote:
> On 28.04.2021 00:00, Scott Davis wrote:
>> On 4/27/21, 2:56 AM, Jan Beulich wrote:
>>> On 26.04.2021 19:25, Scott Davis wrote:
>>>> This patch modifies Xen's behavior when making devices assignable while the
>>>> iommu=no-quarantine command line option is in effect. Currently this option
>>>> only affects device deassignment, causing devices to get immediately assigned
>>>> back to Dom0 instead of to the quarantine dom_io domain. This patch extends
>>>> no-quarantine to device assignment as well, preventing devices from being
>>>> assigned to dom_io when they are made assignable while no-quarantine is in
>>>> effect.
>>>
>>> Well, the term "quarantine" to me means a safety action taken _after_
>>> possible exposure to something "bad". Therefore I see this being specific
>>> to device de-assignment as the logical thing. Hence if a mode like what
>>> you describe was wanted, I don't think it should be the result of
>>> "iommu=no-quarantine".
>>
>> Sorry I'm a bit confused by this. Correct me if wrong, but my understanding is
>> that the purpose of assigning a device to dom_io on de-assignment is to protect
>> Dom0 from the effects of in-flight DMA operations initiated by a DomU. I assumed
>> that the purpose of (temporarily) assigning to dom_io on assignment was the same
>> but in reverse: protecting a DomU from Dom0-initiated ops. Is this not the case?
> 
> Well, no, not really. Dom0 is considered fully trusted for a variety of
> reasons.
> 
>> Also, documentation and code already refer to the operation in question as a
>> "quarantine" (see xl command line docs and libxl__device_pci_assignable_add)
>> and to devices that have undergone it as being "quarantined" (see assign_device
>> in xen/drivers/passthrough/pci.c). So if that is not the correct term, there may
>> be some additional changes needed for consistency. Is there another name that
>> would be more appropriate?
> 
> I don't see what's wrong with the term for how things are currently. If
> you talk about an adjustment to terminology to accompany your proposed
> change - not sure.
> 
>> I would also point out that, currently, there does not appear to be a way for an
>> xl user to opt out of quarantine functionality in either direction other than by
>> manually making devices assignable. There is no xl command line flag to disable
>> it and iommu=no-quarantine will have no effect because any device that xl itself
>> makes assignable will have the struct pci_dev.quarantine flag set, which
>> overrides iommu=no-quarantine. Is that intentional?
> 
> Not sure here either: It may also have been that it was assumed to not
> be of interest. Paul?
> 

TBH I'm not sure. When I implemented quarantining it was non-optional 
for good reason and no-quarantine came along later (and somewhat 
hurriedly IIRC).

>> If I misunderstood and your objection is simply that "quarantine-on-assignment"
>> and "quarantine-on-deassignment" should be controllable by separate iommu
>> options, that's an easy enough change to make.
> 
> Yes, effectively that's what I think things would want to be, if
> "quarantine-on-assignment" is really something that we think is needed.
> It would default to off imo.
> 
>> Although I think that might also
>> negate the need for/effect of struct pci_dev.quarantine as described above. If
>> that's what is desired, any thoughts on what the new option(s) should be called?
> 
> Following the extension to the command line option I'm putting in place
> in "IOMMU: make DMA containment of quarantined devices optional" (which
> I still need to get around to address review feedback for and resubmit),
> I'd be inclined to suggest "iommu=quarantine=always" or
> "iommu=quarantine=on-assign". Unless of course we'd prefer to have the
> caller of the assignment operation have full control over the behavior
> here anyway (in which case a command line option control simply is not
> necessary).
> 

I'm still not entirely sure why quarantining on assignment is a problem, other 
than it triggering an as-yet undiagnosed issue in QEMU, but I agree that 
the expectation of 'no-quarantine' meaning just that (i.e. the old 
dom0->domU and domU->dom0 transitions are re-instated) is reasonable. Do 
we really want yet more command line options?

   Paul
Jan Beulich April 28, 2021, 8:49 a.m. UTC | #5
On 28.04.2021 09:19, Paul Durrant wrote:
> On 28/04/2021 07:15, Jan Beulich wrote:
>> Following the extension to the command line option I'm putting in place
>> in "IOMMU: make DMA containment of quarantined devices optional" (which
>> I still need to get around to address review feedback for and resubmit),
>> I'd be inclined to suggest "iommu=quarantine=always" or
>> "iommu=quarantine=on-assign". Unless of course we'd prefer to have the
>> caller of the assignment operation have full control over the behavior
>> here anyway (in which case a command line option control simply is not
>> necessary).
>>
> 
> I'm still not entirely sure why quarantining on assignment is a problem,

Well, I continue to think that it is a mistake to hide problems (with
their hardware) from system administrators by default. I guess most
everyone else puts usability in the foreground, as my view of workarounds
(with non-benign [side-]effects) being enabled by default seems to be
generally different.

> other 
> than it triggering an as-yet undiagnosed issue in QEMU, but I agree that 
> that the expectation of 'no-quarantine' meaning just that (i.e. the old 
> dom0->domU and domU->dom0 transitions are re-instated) is reasonable.

I'm afraid I'm not clear what you're talking about here. What "old
transitions"? The ones prior to the introduction of quarantining? If
so, and if the tool stack is given (some level of) control, I guess
we'd first need to establish who "rules": The command line option,
or the tool stack (which imo ought to be acting whichever particular
way based on admin requests, not to blindly override Xen's defaults).

> Do we really want yet more command line options?

If we can avoid them without sacrificing functionality / flexibility ...

Jan
Paul Durrant April 28, 2021, 8:51 a.m. UTC | #6
On 28/04/2021 09:49, Jan Beulich wrote:
> On 28.04.2021 09:19, Paul Durrant wrote:
>> On 28/04/2021 07:15, Jan Beulich wrote:
>>> Following the extension to the command line option I'm putting in place
>>> in "IOMMU: make DMA containment of quarantined devices optional" (which
>>> I still need to get around to address review feedback for and resubmit),
>>> I'd be inclined to suggest "iommu=quarantine=always" or
>>> "iommu=quarantine=on-assign". Unless of course we'd prefer to have the
>>> caller of the assignment operation have full control over the behavior
>>> here anyway (in which case a command line option control simply is not
>>> necessary).
>>>
>>
>> I'm still not entirely sure why quarantining on assignment is a problem,
> 
> Well, I continue to think that it is a mistake to hide problems (with
> their hardware) from system administrators by default. I guess most
> everyone else puts usability in the foreground, as my view of workarounds
> (with non-benign [side-]effects) being enabled by default seems to be
> generally different.
> 
>> other
>> than it triggering an as-yet undiagnosed issue in QEMU, but I agree that
>> the expectation of 'no-quarantine' meaning just that (i.e. the old
>> dom0->domU and domU->dom0 transitions are re-instated) is reasonable.
> 
> I'm afraid I'm not clear what you're talking about here. What "old
> transitions"? The ones prior to the introduction of quarantining?

FAOD, that's what I meant.

   Paul

> If
> so, and if the tool stack is given (some level of) control, I guess
> we'd first need to establish who "rules": The command line option,
> or the tool stack (which imo ought to be acting whichever particular
> way based on admin requests, not to blindly override Xen's defaults).
> 
>> Do we really want yet more command line options?
> 
> If we can avoid them without sacrificing functionality / flexibility ...
> 
> Jan
>
Scott Davis April 29, 2021, 9:04 p.m. UTC | #7
On 4/28/21, 3:20 AM, Paul Durrant wrote:
>> Following the extension to the command line option I'm putting in place
>> in "IOMMU: make DMA containment of quarantined devices optional" (which
>> I still need to get around to address review feedback for and resubmit),
>> I'd be inclined to suggest "iommu=quarantine=always" or
>> "iommu=quarantine=on-assign". Unless of course we'd prefer to have the
>> caller of the assignment operation have full control over the behavior
>> here anyway (in which case a command line option control simply is not
>> necessary).
>
> I'm still not entirely sure why quarantining on assignment is a problem, other
> than it triggering an as-yet undiagnosed issue in QEMU, but I agree that
> the expectation of 'no-quarantine' meaning just that (i.e. the old
> dom0->domU and domU->dom0 transitions are re-instated) is reasonable. Do
> we really want yet more command line options?

Regarding the problem in QEMU, I traced the crash trigger down to a
write to the IQ tail register during the mapping operation into dom_io
(backtrace below). Along the way I noticed that, since a non-present
entry was being flushed, flush_context_qi only performs this
invalidation on an IOMMU with caching mode enabled (i.e. a software
IOMMU). Therefore this issue is probably only hittable when nesting.
Disabling caching mode on the QEMU vIOMMU was enough to prevent the
crash and give me a working system.
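
Caching mode is a property of QEMU's emulated VT-d device, so disabling it is a host-side command line change; an illustrative invocation (all other arguments elided, machine options assumed) would look like:

```shell
# Illustrative: run the nested-Xen guest with the vIOMMU's caching mode
# disabled, so the guest Xen no longer issues context-cache invalidations
# for non-present entries (flush_context_qi skips them in that case).
qemu-system-x86_64 \
    -machine q35,kernel-irqchip=split,accel=kvm \
    -device intel-iommu,intremap=on,caching-mode=off
    # remaining memory/disk/net/passthrough arguments elided
```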

(gdb) si
0xffff82d04025b68b  72  in qinval.c
   0xffff82d04025b687 <qinval_update_qtail+43>: ... shl    $0x4,%r12
=> 0xffff82d04025b68b <qinval_update_qtail+47>: ... mov    %r12,0x88(%rax)
(gdb) bt
#0  0xffff82d04025b68b in qinval_update_qtail (...) at qinval.c:72
#1  0xffff82d04025baa7 in queue_invalidate_context_sync (...) at qinval.c:101
#2  flush_context_qi (...) at qinval.c:341
#3  0xffff82d040259125 in iommu_flush_context_device (...) at iommu.c:400
#4  domain_context_mapping_one (...) at iommu.c:1436
#5  0xffff82d040259351 in domain_context_mapping (...) at iommu.c:1510
#6  0xffff82d040259d20 in reassign_device_ownership (...) at iommu.c:2412
#7  0xffff82d040259f19 in intel_iommu_assign_device (...) at iommu.c:2476
#8  0xffff82d040267154 in assign_device (...) at pci.c:1545
#9  iommu_do_pci_domctl (...) at pci.c:1732
#10 0xffff82d040264de3 in iommu_do_domctl (...) at iommu.c:539
#11 0xffff82d040322ca5 in arch_do_domctl (...) at domctl.c:1496
#12 0xffff82d040205a19 in do_domctl (...) at domctl.c:956
#13 0xffff82d040319476 in pv_hypercall (...) at hypercall.c:155
#14 0xffff82d040390432 in lstar_enter () at entry.S:271
#15 0x0000000000000000 in ?? ()

As a result of the above, I no longer have a need to patch Xen to work
around the problem. Though I do want to test against newer versions of
QEMU (currently on 4.2.1) to see if it still exists.

So unless there's interest among Xen developers for this patch, I will
probably shelve it for now, especially since it looks like Jan has some
ongoing work in this area that I had not previously discovered. If there
is interest, I just need a resolution on whether iommu=quarantine should
be left as a boolean or expanded to support always, never,
deassign-only, and (why not) assign-only.

Thanks,
Scott
Jan Beulich April 30, 2021, 7:15 a.m. UTC | #8
On 29.04.2021 23:04, Scott Davis wrote:
> On 4/28/21, 3:20 AM, Paul Durrant wrote:
>>> Following the extension to the command line option I'm putting in place
>>> in "IOMMU: make DMA containment of quarantined devices optional" (which
>>> I still need to get around to address review feedback for and resubmit),
>>> I'd be inclined to suggest "iommu=quarantine=always" or
>>> "iommu=quarantine=on-assign". Unless of course we'd prefer to have the
>>> caller of the assignment operation have full control over the behavior
>>> here anyway (in which case a command line option control simply is not
>>> necessary).
>>
>> I'm still not entirely sure why quarantining on assignment is a problem, other
>> than it triggering an as-yet undiagnosed issue in QEMU, but I agree that
>> the expectation of 'no-quarantine' meaning just that (i.e. the old
>> dom0->domU and domU->dom0 transitions are re-instated) is reasonable. Do
>> we really want yet more command line options?
> 
> Regarding the problem in QEMU, I traced the crash trigger down to a
> write to the IQ tail register during the mapping operation into dom_io
> (backtrace below). Along the way I noticed that, since a non-present
> entry was being flushed, flush_context_qi only performs this
> invalidation on an IOMMU with caching mode enabled (i.e. a software
> IOMMU). Therefore this issue is probably only hittable when nesting.
> Disabling caching mode on the QEMU vIOMMU was enough to prevent the
> crash and give me a working system.
> 
> (gdb) si
> 0xffff82d04025b68b  72  in qinval.c
>    0xffff82d04025b687 <qinval_update_qtail+43>: ... shl    $0x4,%r12
> => 0xffff82d04025b68b <qinval_update_qtail+47>: ... mov    %r12,0x88(%rax)
> (gdb) bt
> #0  0xffff82d04025b68b in qinval_update_qtail (...) at qinval.c:72
> #1  0xffff82d04025baa7 in queue_invalidate_context_sync (...) at qinval.c:101
> #2  flush_context_qi (...) at qinval.c:341
> #3  0xffff82d040259125 in iommu_flush_context_device (...) at iommu.c:400
> #4  domain_context_mapping_one (...) at iommu.c:1436
> #5  0xffff82d040259351 in domain_context_mapping (...) at iommu.c:1510
> #6  0xffff82d040259d20 in reassign_device_ownership (...) at iommu.c:2412
> #7  0xffff82d040259f19 in intel_iommu_assign_device (...) at iommu.c:2476
> #8  0xffff82d040267154 in assign_device (...) at pci.c:1545
> #9  iommu_do_pci_domctl (...) at pci.c:1732
> #10 0xffff82d040264de3 in iommu_do_domctl (...) at iommu.c:539
> #11 0xffff82d040322ca5 in arch_do_domctl (...) at domctl.c:1496
> #12 0xffff82d040205a19 in do_domctl (...) at domctl.c:956
> #13 0xffff82d040319476 in pv_hypercall (...) at hypercall.c:155
> #14 0xffff82d040390432 in lstar_enter () at entry.S:271
> #15 0x0000000000000000 in ?? ()

Interesting. This then leaves the question whether we submit a bogus
command, or whether qemu can't deal (correctly) with a valid one here.
So far you didn't tell us what the actual crash was. I guess it's not
even clear to me whether it's Xen or qemu that did crash for you. But
I have to also admit that until now it wasn't really clear to me that
you ran Xen _under_ qemu - instead I was assuming there was an
interaction problem with a qemu serving a guest.

Jan
Scott Davis April 30, 2021, 7:27 p.m. UTC | #9
On 4/30/21, 3:15 AM, Jan Beulich wrote:
> So far you didn't tell us what the actual crash was. I guess it's not
> even clear to me whether it's Xen or qemu that did crash for you. But
> I have to also admit that until now it wasn't really clear to me that
> you ran Xen _under_ qemu - instead I was assuming there was an
> interaction problem with a qemu serving a guest.

I explained this in my OP, sorry if it was not clear:

> Background: I am setting up a QEMU-based development and testing environment
> for the Crucible team at Star Lab that includes emulated PCIe devices for
> passthrough and hotplug. I encountered an issue with `xl pci-assignable-add`
> that causes the host QEMU to rapidly allocate memory until getting 
> OOM-killed.

As soon as Xen writes the IQT register, the host QEMU process locks up,
starts allocating several hundred MB/sec, and is soon OOM-killed by the
host kernel.
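
The gdb transcripts in this thread were taken against the guest Xen over QEMU's gdbstub; a setup along these lines (port and symbols file are illustrative) reproduces that kind of session:

```shell
# Illustrative: attach to the nested Xen via the host QEMU's gdbstub to
# catch the IQT (invalidation queue tail) register write in qinval.c.
# Assumes the host QEMU was started with `-gdb tcp::1234` (or `-s`).
gdb xen-syms \
    -ex 'target remote :1234' \
    -ex 'break qinval_update_qtail' \
    -ex 'continue'
```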

On 4/30/21, 3:15 AM, Jan Beulich wrote:
> Interesting. This then leaves the question whether we submit a bogus
> command, or whether qemu can't deal (correctly) with a valid one here.

I did some extra debugging to inspect the index values being written to
IQT as well as the invalidation descriptors themselves and everything
appeared fine to me on Xen's end. In fact, the descriptor written by
queue_invalidate_context_sync upon map into dom_io is entirely identical
to the one it writes upon unmap from dom0, which works without issue.
This points towards a QEMU bug to me:

(gdb) c
Thread 1 hit Breakpoint 4, queue_invalidate_context_sync (...) at qinval.c:101
(gdb) bt
#0  queue_invalidate_context_sync (...) at qinval.c:85
#1  flush_context_qi (...) at qinval.c:341
#2  iommu_flush_context_device (...) at iommu.c:400
#3  domain_context_unmap_one (...) at iommu.c:1606
#4  domain_context_unmap (...) at iommu.c:1671
#5  reassign_device_ownership (...) at iommu.c:2396
#6  intel_iommu_assign_device (...) at iommu.c:2476
#7  assign_device (...) at pci.c:1545
#8  iommu_do_pci_domctl (...) at pci.c:1732
#9  iommu_do_domctl (...) at iommu.c:539
...
(gdb) print index
$2 = 552
(gdb) print qinval_entry->q.cc_inv_dsc
$3 = {
  lo = {
    type = 1,
    granu = 3,
    res_1 = 0,
    did = 0,
    sid = 256,
    fm = 0,
    res_2 = 0
  },
  hi = {
    res = 0
  }
}
(gdb) c
Thread 1 hit Breakpoint 5, qinval_next_index (...) at qinval.c:58
(gdb) bt
#0  qinval_next_index (...) at qinval.c:58
#1  queue_invalidate_wait (...) at qinval.c:159
#2  invalidate_sync (...) at qinval.c:207
#3  queue_invalidate_context_sync (...) at qinval.c:106
...
(gdb) print tail
$4 = 553
(gdb) c
Thread 1 hit Breakpoint 5, qinval_next_index (...) at qinval.c:58
(gdb) bt
#0  qinval_next_index (...) at qinval.c:58
#3  queue_invalidate_iotlb_sync (...) at qinval.c:120
#4  flush_iotlb_qi (...) at qinval.c:376
#5  iommu_flush_iotlb_dsi (...) at iommu.c:499
#6  domain_context_unmap_one (...) at iommu.c:1611
#7  domain_context_unmap (...) at iommu.c:1671
...
(gdb) print tail
$5 = 554
(gdb) c
Thread 1 hit Breakpoint 5, qinval_next_index (...) at qinval.c:58
(gdb) bt
#0  qinval_next_index (...) at qinval.c:58
#1  queue_invalidate_wait (...) at qinval.c:159
#2  invalidate_sync (...) at qinval.c:207
#3  queue_invalidate_iotlb_sync (...) at qinval.c:143
...
(gdb) print tail
$6 = 555
(gdb) c
Thread 1 hit Breakpoint 5, qinval_next_index (...) at qinval.c:58
(gdb) bt
#0  qinval_next_index (...) at qinval.c:58
#1  queue_invalidate_context_sync (...) at qinval.c:86
#2  flush_context_qi (...) at qinval.c:341
#3  iommu_flush_context_device (...) at iommu.c:400
#4  domain_context_mapping_one (...) at iommu.c:1436
#5  domain_context_mapping (...) at iommu.c:1510
#6  reassign_device_ownership (...) at iommu.c:2412
...
(gdb) print tail
$7 = 556
(gdb) c
Thread 1 hit Breakpoint 4, queue_invalidate_context_sync (...) at qinval.c:101
(gdb) print index
$8 = 556
(gdb) print qinval_entry->q.cc_inv_dsc
$9 = {
  lo = {
    type = 1,
    granu = 3,
    res_1 = 0,
    did = 0,
    sid = 256,
    fm = 0,
    res_2 = 0
  },
  hi = {
    res = 0
  }
}
(gdb) c
Continuing.
Remote connection closed

With output from dom0 and Xen like:

[   31.002214] e1000e 0000:01:00.0 eth1: removed PHC
[   31.694270] e1000e: eth1 NIC Link is Down
[   31.717849] pciback 0000:01:00.0: seizing device
[   31.719464] Already setup the GSI :20
(XEN) [   83.572804] [VT-D]d0:PCIe: unmap 0000:01:00.0
(XEN) [  808.092310] [VT-D]d32753:PCIe: map 0000:01:00.0

Good day,
Scott

Patch

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index 4ae9391fcd..1015f95dd5 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -1317,10 +1317,10 @@  boolean (e.g. `iommu=no`) can override this and leave the IOMMUs disabled.
     successfully.

 *   The `quarantine` boolean can be used to control Xen's behavior when
-    de-assigning devices from guests.  If enabled (the default), Xen always
-    quarantines such devices; they must be explicitly assigned back to Dom0
-    before they can be used there again.  If disabled, Xen will only
-    quarantine devices the toolstack hass arranged for getting quarantined.
+    making devices assignable to guests and de-assigning devices from guests.
+    If enabled (the default), Xen will quarantine such devices; they must be
+    explicitly assigned back to Dom0 before they can be used there again.  If
+    disabled, Xen will not quarantine devices.

 *   The `sharept` boolean controls whether the IOMMU pagetables are shared
     with the CPU-side HAP pagetables, or allocated separately.  Sharing
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 2d6238a5bb..4c37f2d272 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -935,9 +935,14 @@  static int deassign_device(struct domain *d, uint16_t seg, uint8_t bus,
         return -EINVAL;

     ASSERT(pcidevs_locked());
-    pdev = pci_get_pdev_by_domain(d, seg, bus, devfn);
+    pdev = pci_get_pdev(seg, bus, devfn);
     if ( !pdev )
         return -ENODEV;
+    else if ( d == dom_io && pdev->domain != dom_io )
+        /* OK: Request to de-quarantine a device that is not quarantined */
+        return 0;
+    else if ( pdev->domain != d )
+        return -ENODEV;

     /* De-assignment from dom_io should de-quarantine the device */
     target = ((pdev->quarantine || iommu_quarantine) &&
@@ -1510,7 +1515,7 @@  static int assign_device(struct domain *d, u16 seg, u8 bus, u8 devfn, u32 flag)
     struct pci_dev *pdev;
     int rc = 0;

-    if ( !is_iommu_enabled(d) )
+    if ( !is_iommu_enabled(d) || (d == dom_io && !iommu_quarantine) )
         return 0;

     /* Prevent device assign if mem paging or mem sharing have been