diff mbox series

[v2] IOMMU: make DMA containment of quarantined devices optional

Message ID be16ddaa-ae99-f696-53c0-6a680ffa8504@suse.com (mailing list archive)
State New, archived
Headers show
Series [v2] IOMMU: make DMA containment of quarantined devices optional | expand

Commit Message

Jan Beulich Dec. 13, 2019, 12:53 p.m. UTC
Containing still in flight DMA was introduced to work around certain
devices / systems hanging hard upon hitting an IOMMU fault. Passing
through (such) devices (on such systems) is inherently insecure (as
guests could easily arrange for IOMMU faults to occur). Defaulting to
a mode where admins may not even become aware of issues with devices can
be considered undesirable. Therefore convert this mode of operation to
an optional one, not one enabled by default.

This involves resurrecting code commit ea38867831da ("x86 / iommu: set
up a scratch page in the quarantine domain") did remove, in a slightly
extended and abstracted fashion. Here, instead of reintroducing a pretty
pointless use of "goto" in domain_context_unmap(), and instead of making
the function (at least temporarily) inconsistent, take the opportunity
and replace the other similarly pointless "goto" as well.

In order to key the re-instated bypasses off of there (not) being a root
page table this further requires moving the allocate_domain_resources()
invocation from reassign_device() to amd_iommu_setup_domain_device() (or
else reassign_device() would allocate a root page table anyway); this is
benign to the second caller of the latter function.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
As far as 4.13 is concerned, I guess if we can't come to an agreement
here, the only other option is to revert ea38867831da from the branch,
for having been committed prematurely (I'm not so much worried about the
master branch, where we have ample time until 4.14). What I surely want
to see us avoid is a back and forth in behavior of released versions.
(Note that 4.12.2 is similarly blocked on a decision either way here.)

I'm happy to take better suggestions to replace "full".
---
v2: Don't use true/false. Introduce QUARANTINE_SKIP() (albeit I'm not
    really convinced this is an improvement). Add comment.

Comments

Jürgen Groß Dec. 13, 2019, 1:11 p.m. UTC | #1
On 13.12.19 13:53, Jan Beulich wrote:
> Containing still in flight DMA was introduced to work around certain
> devices / systems hanging hard upon hitting an IOMMU fault. Passing
> through (such) devices (on such systems) is inherently insecure (as
> guests could easily arrange for IOMMU faults to occur). Defaulting to
> a mode where admins may not even become aware of issues with devices can
> be considered undesirable. Therefore convert this mode of operation to
> an optional one, not one enabled by default.
> 
> This involves resurrecting code commit ea38867831da ("x86 / iommu: set
> up a scratch page in the quarantine domain") did remove, in a slightly
> extended and abstracted fashion. Here, instead of reintroducing a pretty
> pointless use of "goto" in domain_context_unmap(), and instead of making
> the function (at least temporarily) inconsistent, take the opportunity
> and replace the other similarly pointless "goto" as well.
> 
> In order to key the re-instated bypasses off of there (not) being a root
> page table this further requires moving the allocate_domain_resources()
> invocation from reassign_device() to amd_iommu_setup_domain_device() (or
> else reassign_device() would allocate a root page table anyway); this is
> benign to the second caller of the latter function.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> As far as 4.13 is concerned, I guess if we can't come to an agreement
> here, the only other option is to revert ea38867831da from the branch,
> for having been committed prematurely (I'm not so much worried about the
> master branch, where we have ample time until 4.14). What I surely want
> to see us avoid is a back and forth in behavior of released versions.
> (Note that 4.12.2 is similarly blocked on a decision either way here.)

I'm not really sure we really need to revert ea38867831da before the
4.13 release. It might not be optimal, but I'm quite sure the number of
cases where this could be an issue is rather small already, and I tend
to agree with Paul that admins who really care will more likely want to
select the option where the system will "just work". IMO the "noticeable
failure" is something which will be selected mostly by developers. But
I'm not an expert in that area, so I don't want to influence the
decision regarding the to be selected default too much.

In case we have a successful OSStest run soon I planned to do the
release without this patch.


Juergen
Paul Durrant Dec. 13, 2019, 1:12 p.m. UTC | #2
> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Jan
> Beulich
> Sent: 13 December 2019 12:53
> To: xen-devel@lists.xenproject.org
> Cc: Juergen Gross <jgross@suse.com>; Kevin Tian <kevin.tian@intel.com>;
> Stefano Stabellini <sstabellini@kernel.org>; Julien Grall
> <julien@xen.org>; Wei Liu <wl@xen.org>; Konrad Wilk
> <konrad.wilk@oracle.com>; George Dunlap <George.Dunlap@eu.citrix.com>;
> Andrew Cooper <andrew.cooper3@citrix.com>; Paul Durrant <paul@xen.org>;
> Ian Jackson <ian.jackson@citrix.com>; Roger Pau Monné
> <roger.pau@citrix.com>
> Subject: [Xen-devel] [PATCH v2] IOMMU: make DMA containment of quarantined
> devices optional
> 
> Containing still in flight DMA was introduced to work around certain
> devices / systems hanging hard upon hitting an IOMMU fault. Passing
> through (such) devices (on such systems) is inherently insecure (as
> guests could easily arrange for IOMMU faults to occur). Defaulting to
> a mode where admins may not even become aware of issues with devices can
> be considered undesirable. Therefore convert this mode of operation to
> an optional one, not one enabled by default.
> 
> This involves resurrecting code commit ea38867831da ("x86 / iommu: set
> up a scratch page in the quarantine domain") did remove, in a slightly
> extended and abstracted fashion. Here, instead of reintroducing a pretty
> pointless use of "goto" in domain_context_unmap(), and instead of making
> the function (at least temporarily) inconsistent, take the opportunity
> and replace the other similarly pointless "goto" as well.
> 
> In order to key the re-instated bypasses off of there (not) being a root
> page table this further requires moving the allocate_domain_resources()
> invocation from reassign_device() to amd_iommu_setup_domain_device() (or
> else reassign_device() would allocate a root page table anyway); this is
> benign to the second caller of the latter function.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> As far as 4.13 is concerned, I guess if we can't come to an agreement
> here, the only other option is to revert ea38867831da from the branch,
> for having been committed prematurely (I'm not so much worried about the
> master branch, where we have ample time until 4.14). What I surely want
> to see us avoid is a back and forth in behavior of released versions.
> (Note that 4.12.2 is similarly blocked on a decision either way here.)
> 
> I'm happy to take better suggestions to replace "full".

How about simply "sink", since that's what it does?

[snip]
> --- a/xen/drivers/passthrough/iommu.c
> +++ b/xen/drivers/passthrough/iommu.c
> @@ -30,13 +30,17 @@ bool_t __initdata iommu_enable = 1;
>  bool_t __read_mostly iommu_enabled;
>  bool_t __read_mostly force_iommu;
>  bool_t __read_mostly iommu_verbose;
> -bool __read_mostly iommu_quarantine = true;
>  bool_t __read_mostly iommu_igfx = 1;
>  bool_t __read_mostly iommu_snoop = 1;
>  bool_t __read_mostly iommu_qinval = 1;
>  bool_t __read_mostly iommu_intremap = 1;
>  bool_t __read_mostly iommu_crash_disable;
> 
> +#define IOMMU_quarantine_none  0
> +#define IOMMU_quarantine_basic 1
> +#define IOMMU_quarantine_full  2
> +uint8_t __read_mostly iommu_quarantine = IOMMU_quarantine_basic;

If we have 'IOMMU_quarantine_sink' instead of 'IOMMU_quarantine_full', then how about 'IOMMU_quarantine_write_fault' instead of 'IOMMU_quarantine_basic'?

  Paul
Jan Beulich Dec. 13, 2019, 1:21 p.m. UTC | #3
On 13.12.2019 14:11, Jürgen Groß wrote:
> On 13.12.19 13:53, Jan Beulich wrote:
>> Containing still in flight DMA was introduced to work around certain
>> devices / systems hanging hard upon hitting an IOMMU fault. Passing
>> through (such) devices (on such systems) is inherently insecure (as
>> guests could easily arrange for IOMMU faults to occur). Defaulting to
>> a mode where admins may not even become aware of issues with devices can
>> be considered undesirable. Therefore convert this mode of operation to
>> an optional one, not one enabled by default.
>>
>> This involves resurrecting code commit ea38867831da ("x86 / iommu: set
>> up a scratch page in the quarantine domain") did remove, in a slightly
>> extended and abstracted fashion. Here, instead of reintroducing a pretty
>> pointless use of "goto" in domain_context_unmap(), and instead of making
>> the function (at least temporarily) inconsistent, take the opportunity
>> and replace the other similarly pointless "goto" as well.
>>
>> In order to key the re-instated bypasses off of there (not) being a root
>> page table this further requires moving the allocate_domain_resources()
>> invocation from reassign_device() to amd_iommu_setup_domain_device() (or
>> else reassign_device() would allocate a root page table anyway); this is
>> benign to the second caller of the latter function.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> As far as 4.13 is concerned, I guess if we can't come to an agreement
>> here, the only other option is to revert ea38867831da from the branch,
>> for having been committed prematurely (I'm not so much worried about the
>> master branch, where we have ample time until 4.14). What I surely want
>> to see us avoid is a back and forth in behavior of released versions.
>> (Note that 4.12.2 is similarly blocked on a decision either way here.)
> 
> I'm not really sure we really need to revert ea38867831da before the
> 4.13 release. It might not be optimal, but I'm quite sure the number of
> cases where this could be an issue is rather small already, and I tend
> to agree with Paul that admins who really care will more likely want to
> select the option where the system will "just work". IMO the "noticeable
> failure" is something which will be selected mostly by developers. But
> I'm not an expert in that area, so I don't want to influence the
> decision regarding the to be selected default too much.

An admin not wanting to know is, to me, the same as them not wanting
to know about security issues, and hence not subscribing to our
announcements lists. I can accept this being a reasonable thing to
do when it is an _informed_ decision. But with the current
arrangements there's no way whatsoever for an admin to know.

Furthermore, once we ship a release with the defaults as they
currently are, changing the defaults again may be perceived as a
regression, which I think we should (want to) avoid.

> In case we have a successful OSStest run soon I planned to do the
> release without this patch.

I very much disagree - the patch shouldn't have been put in without
having heard back from Kevin. But yes, you're the release manager.

Jan
Jan Beulich Dec. 13, 2019, 1:25 p.m. UTC | #4
On 13.12.2019 14:12, Durrant, Paul wrote:
>> -----Original Message-----
>> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Jan
>> Beulich
>> Sent: 13 December 2019 12:53
>> To: xen-devel@lists.xenproject.org
>> Cc: Juergen Gross <jgross@suse.com>; Kevin Tian <kevin.tian@intel.com>;
>> Stefano Stabellini <sstabellini@kernel.org>; Julien Grall
>> <julien@xen.org>; Wei Liu <wl@xen.org>; Konrad Wilk
>> <konrad.wilk@oracle.com>; George Dunlap <George.Dunlap@eu.citrix.com>;
>> Andrew Cooper <andrew.cooper3@citrix.com>; Paul Durrant <paul@xen.org>;
>> Ian Jackson <ian.jackson@citrix.com>; Roger Pau Monné
>> <roger.pau@citrix.com>
>> Subject: [Xen-devel] [PATCH v2] IOMMU: make DMA containment of quarantined
>> devices optional
>>
>> Containing still in flight DMA was introduced to work around certain
>> devices / systems hanging hard upon hitting an IOMMU fault. Passing
>> through (such) devices (on such systems) is inherently insecure (as
>> guests could easily arrange for IOMMU faults to occur). Defaulting to
>> a mode where admins may not even become aware of issues with devices can
>> be considered undesirable. Therefore convert this mode of operation to
>> an optional one, not one enabled by default.
>>
>> This involves resurrecting code commit ea38867831da ("x86 / iommu: set
>> up a scratch page in the quarantine domain") did remove, in a slightly
>> extended and abstracted fashion. Here, instead of reintroducing a pretty
>> pointless use of "goto" in domain_context_unmap(), and instead of making
>> the function (at least temporarily) inconsistent, take the opportunity
>> and replace the other similarly pointless "goto" as well.
>>
>> In order to key the re-instated bypasses off of there (not) being a root
>> page table this further requires moving the allocate_domain_resources()
>> invocation from reassign_device() to amd_iommu_setup_domain_device() (or
>> else reassign_device() would allocate a root page table anyway); this is
>> benign to the second caller of the latter function.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> As far as 4.13 is concerned, I guess if we can't come to an agreement
>> here, the only other option is to revert ea38867831da from the branch,
>> for having been committed prematurely (I'm not so much worried about the
>> master branch, where we have ample time until 4.14). What I surely want
>> to see us avoid is a back and forth in behavior of released versions.
>> (Note that 4.12.2 is similarly blocked on a decision either way here.)
>>
>> I'm happy to take better suggestions to replace "full".
> 
> How about simply "sink", since that's what it does?

But it's not really a "sink", as we still fault writes (which is the
only thing I can see to be "sunk" if I'm getting the meaning of the
word right).

>> --- a/xen/drivers/passthrough/iommu.c
>> +++ b/xen/drivers/passthrough/iommu.c
>> @@ -30,13 +30,17 @@ bool_t __initdata iommu_enable = 1;
>>  bool_t __read_mostly iommu_enabled;
>>  bool_t __read_mostly force_iommu;
>>  bool_t __read_mostly iommu_verbose;
>> -bool __read_mostly iommu_quarantine = true;
>>  bool_t __read_mostly iommu_igfx = 1;
>>  bool_t __read_mostly iommu_snoop = 1;
>>  bool_t __read_mostly iommu_qinval = 1;
>>  bool_t __read_mostly iommu_intremap = 1;
>>  bool_t __read_mostly iommu_crash_disable;
>>
>> +#define IOMMU_quarantine_none  0
>> +#define IOMMU_quarantine_basic 1
>> +#define IOMMU_quarantine_full  2
>> +uint8_t __read_mostly iommu_quarantine = IOMMU_quarantine_basic;
> 
> If we have 'IOMMU_quarantine_sink' instead of 'IOMMU_quarantine_full',
> then how about 'IOMMU_quarantine_write_fault' instead of
> 'IOMMU_quarantine_basic'?

Why "write_fault"? Even in "full" mode you only avoid read faults
aiui (see also above). So if anything "write_fault" would be a
replacement for "full"; "basic" could be replaced by just "fault"
then.

Jan
Paul Durrant Dec. 13, 2019, 1:29 p.m. UTC | #5
> -----Original Message-----
> From: Jan Beulich <jbeulich@suse.com>
> Sent: 13 December 2019 13:26
> To: Durrant, Paul <pdurrant@amazon.com>
> Cc: xen-devel@lists.xenproject.org; Juergen Gross <jgross@suse.com>; Kevin
> Tian <kevin.tian@intel.com>; Stefano Stabellini <sstabellini@kernel.org>;
> Julien Grall <julien@xen.org>; Wei Liu <wl@xen.org>; Konrad Wilk
> <konrad.wilk@oracle.com>; George Dunlap <George.Dunlap@eu.citrix.com>;
> Andrew Cooper <andrew.cooper3@citrix.com>; Paul Durrant <paul@xen.org>;
> Ian Jackson <ian.jackson@citrix.com>; Roger Pau Monné
> <roger.pau@citrix.com>
> Subject: Re: [PATCH v2] IOMMU: make DMA containment of quarantined devices
> optional
> 
> On 13.12.2019 14:12, Durrant, Paul wrote:
> >> -----Original Message-----
> >> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of
> Jan
> >> Beulich
> >> Sent: 13 December 2019 12:53
> >> To: xen-devel@lists.xenproject.org
> >> Cc: Juergen Gross <jgross@suse.com>; Kevin Tian <kevin.tian@intel.com>;
> >> Stefano Stabellini <sstabellini@kernel.org>; Julien Grall
> >> <julien@xen.org>; Wei Liu <wl@xen.org>; Konrad Wilk
> >> <konrad.wilk@oracle.com>; George Dunlap <George.Dunlap@eu.citrix.com>;
> >> Andrew Cooper <andrew.cooper3@citrix.com>; Paul Durrant <paul@xen.org>;
> >> Ian Jackson <ian.jackson@citrix.com>; Roger Pau Monné
> >> <roger.pau@citrix.com>
> >> Subject: [Xen-devel] [PATCH v2] IOMMU: make DMA containment of
> quarantined
> >> devices optional
> >>
> >> Containing still in flight DMA was introduced to work around certain
> >> devices / systems hanging hard upon hitting an IOMMU fault. Passing
> >> through (such) devices (on such systems) is inherently insecure (as
> >> guests could easily arrange for IOMMU faults to occur). Defaulting to
> >> a mode where admins may not even become aware of issues with devices
> can
> >> be considered undesirable. Therefore convert this mode of operation to
> >> an optional one, not one enabled by default.
> >>
> >> This involves resurrecting code commit ea38867831da ("x86 / iommu: set
> >> up a scratch page in the quarantine domain") did remove, in a slightly
> >> extended and abstracted fashion. Here, instead of reintroducing a
> pretty
> >> pointless use of "goto" in domain_context_unmap(), and instead of
> making
> >> the function (at least temporarily) inconsistent, take the opportunity
> >> and replace the other similarly pointless "goto" as well.
> >>
> >> In order to key the re-instated bypasses off of there (not) being a
> root
> >> page table this further requires moving the allocate_domain_resources()
> >> invocation from reassign_device() to amd_iommu_setup_domain_device()
> (or
> >> else reassign_device() would allocate a root page table anyway); this
> is
> >> benign to the second caller of the latter function.
> >>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >> ---
> >> As far as 4.13 is concerned, I guess if we can't come to an agreement
> >> here, the only other option is to revert ea38867831da from the branch,
> >> for having been committed prematurely (I'm not so much worried about
> the
> >> master branch, where we have ample time until 4.14). What I surely want
> >> to see us avoid is a back and forth in behavior of released versions.
> >> (Note that 4.12.2 is similarly blocked on a decision either way here.)
> >>
> >> I'm happy to take better suggestions to replace "full".
> >
> > How about simply "sink", since that's what it does?
> 
> But it's not really a "sink", as we still fault writes (which is the
> only thing I can see to be "sunk" if I'm getting the meaning of the
> word right).
> 
> >> --- a/xen/drivers/passthrough/iommu.c
> >> +++ b/xen/drivers/passthrough/iommu.c
> >> @@ -30,13 +30,17 @@ bool_t __initdata iommu_enable = 1;
> >>  bool_t __read_mostly iommu_enabled;
> >>  bool_t __read_mostly force_iommu;
> >>  bool_t __read_mostly iommu_verbose;
> >> -bool __read_mostly iommu_quarantine = true;
> >>  bool_t __read_mostly iommu_igfx = 1;
> >>  bool_t __read_mostly iommu_snoop = 1;
> >>  bool_t __read_mostly iommu_qinval = 1;
> >>  bool_t __read_mostly iommu_intremap = 1;
> >>  bool_t __read_mostly iommu_crash_disable;
> >>
> >> +#define IOMMU_quarantine_none  0
> >> +#define IOMMU_quarantine_basic 1
> >> +#define IOMMU_quarantine_full  2
> >> +uint8_t __read_mostly iommu_quarantine = IOMMU_quarantine_basic;
> >
> > If we have 'IOMMU_quarantine_sink' instead of 'IOMMU_quarantine_full',
> > then how about 'IOMMU_quarantine_write_fault' instead of
> > 'IOMMU_quarantine_basic'?
> 
> Why "write_fault"? Even in "full" mode you only avoid read faults
> aiui (see also above). So if anything "write_fault" would be a
> replacement for "full"; "basic" could be replaced by just "fault"
> then.

Sorry, yes, I had things the wrong way round. "fault" and "write_fault" sound good.

  Paul

> 
> Jan
Jürgen Groß Dec. 13, 2019, 1:31 p.m. UTC | #6
On 13.12.19 14:21, Jan Beulich wrote:
> On 13.12.2019 14:11, Jürgen Groß wrote:
>> On 13.12.19 13:53, Jan Beulich wrote:
>>> Containing still in flight DMA was introduced to work around certain
>>> devices / systems hanging hard upon hitting an IOMMU fault. Passing
>>> through (such) devices (on such systems) is inherently insecure (as
>>> guests could easily arrange for IOMMU faults to occur). Defaulting to
>>> a mode where admins may not even become aware of issues with devices can
>>> be considered undesirable. Therefore convert this mode of operation to
>>> an optional one, not one enabled by default.
>>>
>>> This involves resurrecting code commit ea38867831da ("x86 / iommu: set
>>> up a scratch page in the quarantine domain") did remove, in a slightly
>>> extended and abstracted fashion. Here, instead of reintroducing a pretty
>>> pointless use of "goto" in domain_context_unmap(), and instead of making
>>> the function (at least temporarily) inconsistent, take the opportunity
>>> and replace the other similarly pointless "goto" as well.
>>>
>>> In order to key the re-instated bypasses off of there (not) being a root
>>> page table this further requires moving the allocate_domain_resources()
>>> invocation from reassign_device() to amd_iommu_setup_domain_device() (or
>>> else reassign_device() would allocate a root page table anyway); this is
>>> benign to the second caller of the latter function.
>>>
>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>> ---
>>> As far as 4.13 is concerned, I guess if we can't come to an agreement
>>> here, the only other option is to revert ea38867831da from the branch,
>>> for having been committed prematurely (I'm not so much worried about the
>>> master branch, where we have ample time until 4.14). What I surely want
>>> to see us avoid is a back and forth in behavior of released versions.
>>> (Note that 4.12.2 is similarly blocked on a decision either way here.)
>>
>> I'm not really sure we really need to revert ea38867831da before the
>> 4.13 release. It might not be optimal, but I'm quite sure the number of
>> cases where this could be an issue is rather small already, and I tend
>> to agree with Paul that admins who really care will more likely want to
>> select the option where the system will "just work". IMO the "noticeable
>> failure" is something which will be selected mostly by developers. But
>> I'm not an expert in that area, so I don't want to influence the
>> decision regarding the to be selected default too much.
> 
> An admin not wanting to know is, to me, the same as them not wanting
> to know about security issues, and hence not subscribing to our
> announcements lists. I can accept this being a reasonable thing to
> do when it is an _informed_ decision. But with the current
> arrangements there's no way whatsoever for an admin to know.

Maybe I have misunderstood the current state, but I thought that it
would just silently hide quirky devices without imposing a security
risk. We would not learn which devices are quirky, but OTOH I doubt
we'd get many reports about those in case your patch goes in.


Juergen
Jan Beulich Dec. 13, 2019, 1:35 p.m. UTC | #7
On 13.12.2019 14:29, Durrant, Paul wrote:
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: 13 December 2019 13:26
>>
>> On 13.12.2019 14:12, Durrant, Paul wrote:
>>>> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of Jan
>>>> Beulich
>>>> Sent: 13 December 2019 12:53
>>>>
>>>> +#define IOMMU_quarantine_none  0
>>>> +#define IOMMU_quarantine_basic 1
>>>> +#define IOMMU_quarantine_full  2
>>>> +uint8_t __read_mostly iommu_quarantine = IOMMU_quarantine_basic;
>>>
>>> If we have 'IOMMU_quarantine_sink' instead of 'IOMMU_quarantine_full',
>>> then how about 'IOMMU_quarantine_write_fault' instead of
>>> 'IOMMU_quarantine_basic'?
>>
>> Why "write_fault"? Even in "full" mode you only avoid read faults
>> aiui (see also above). So if anything "write_fault" would be a
>> replacement for "full"; "basic" could be replaced by just "fault"
>> then.
> 
> Sorry, yes, I had things the wrong way round. "fault" and "write_fault" sound good.

But the resulting command line option (iommu=quarantine=write-fault)
would then be quite a bit less nice imo, compare to the brief "full".
(I'm tempted to suggest "nrf" for "no read fault", but I guess that's
too ugly an acronym.)

Jan
Jan Beulich Dec. 13, 2019, 1:38 p.m. UTC | #8
On 13.12.2019 14:31, Jürgen Groß wrote:
> On 13.12.19 14:21, Jan Beulich wrote:
>> On 13.12.2019 14:11, Jürgen Groß wrote:
>>> On 13.12.19 13:53, Jan Beulich wrote:
>>>> Containing still in flight DMA was introduced to work around certain
>>>> devices / systems hanging hard upon hitting an IOMMU fault. Passing
>>>> through (such) devices (on such systems) is inherently insecure (as
>>>> guests could easily arrange for IOMMU faults to occur). Defaulting to
>>>> a mode where admins may not even become aware of issues with devices can
>>>> be considered undesirable. Therefore convert this mode of operation to
>>>> an optional one, not one enabled by default.
>>>>
>>>> This involves resurrecting code commit ea38867831da ("x86 / iommu: set
>>>> up a scratch page in the quarantine domain") did remove, in a slightly
>>>> extended and abstracted fashion. Here, instead of reintroducing a pretty
>>>> pointless use of "goto" in domain_context_unmap(), and instead of making
>>>> the function (at least temporarily) inconsistent, take the opportunity
>>>> and replace the other similarly pointless "goto" as well.
>>>>
>>>> In order to key the re-instated bypasses off of there (not) being a root
>>>> page table this further requires moving the allocate_domain_resources()
>>>> invocation from reassign_device() to amd_iommu_setup_domain_device() (or
>>>> else reassign_device() would allocate a root page table anyway); this is
>>>> benign to the second caller of the latter function.
>>>>
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>> ---
>>>> As far as 4.13 is concerned, I guess if we can't come to an agreement
>>>> here, the only other option is to revert ea38867831da from the branch,
>>>> for having been committed prematurely (I'm not so much worried about the
>>>> master branch, where we have ample time until 4.14). What I surely want
>>>> to see us avoid is a back and forth in behavior of released versions.
>>>> (Note that 4.12.2 is similarly blocked on a decision either way here.)
>>>
>>> I'm not really sure we really need to revert ea38867831da before the
>>> 4.13 release. It might not be optimal, but I'm quite sure the number of
>>> cases where this could be an issue is rather small already, and I tend
>>> to agree with Paul that admins who really care will more likely want to
>>> select the option where the system will "just work". IMO the "noticeable
>>> failure" is something which will be selected mostly by developers. But
>>> I'm not an expert in that area, so I don't want to influence the
>>> decision regarding the to be selected default too much.
>>
>> An admin not wanting to know is, to me, the same as them not wanting
>> to know about security issues, and hence not subscribing to our
>> announcements lists. I can accept this being a reasonable thing to
>> do when it is an _informed_ decision. But with the current
>> arrangements there's no way whatsoever for an admin to know.
> 
> Maybe I have misunderstood the current state, but I thought that it
> would just silently hide quirky devices without imposing a security
> risk. We would not learn which devices are quirky, but OTOH I doubt
> we'd get many reports about those in case your patch goes in.

We don't want or need such reports, that's not the point. The
security risk comes from the quirkiness of the devices - admins
may wrongly think all is well and expose quirky devices to not
sufficiently trusted guests. (I say this fully realizing that
exposing devices to untrusted guests is almost always a certain
level of risk.)

Jan
Jürgen Groß Dec. 13, 2019, 1:46 p.m. UTC | #9
On 13.12.19 14:38, Jan Beulich wrote:
> On 13.12.2019 14:31, Jürgen Groß wrote:
>> On 13.12.19 14:21, Jan Beulich wrote:
>>> On 13.12.2019 14:11, Jürgen Groß wrote:
>>>> On 13.12.19 13:53, Jan Beulich wrote:
>>>>> Containing still in flight DMA was introduced to work around certain
>>>>> devices / systems hanging hard upon hitting an IOMMU fault. Passing
>>>>> through (such) devices (on such systems) is inherently insecure (as
>>>>> guests could easily arrange for IOMMU faults to occur). Defaulting to
>>>>> a mode where admins may not even become aware of issues with devices can
>>>>> be considered undesirable. Therefore convert this mode of operation to
>>>>> an optional one, not one enabled by default.
>>>>>
>>>>> This involves resurrecting code commit ea38867831da ("x86 / iommu: set
>>>>> up a scratch page in the quarantine domain") did remove, in a slightly
>>>>> extended and abstracted fashion. Here, instead of reintroducing a pretty
>>>>> pointless use of "goto" in domain_context_unmap(), and instead of making
>>>>> the function (at least temporarily) inconsistent, take the opportunity
>>>>> and replace the other similarly pointless "goto" as well.
>>>>>
>>>>> In order to key the re-instated bypasses off of there (not) being a root
>>>>> page table this further requires moving the allocate_domain_resources()
>>>>> invocation from reassign_device() to amd_iommu_setup_domain_device() (or
>>>>> else reassign_device() would allocate a root page table anyway); this is
>>>>> benign to the second caller of the latter function.
>>>>>
>>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>>> ---
>>>>> As far as 4.13 is concerned, I guess if we can't come to an agreement
>>>>> here, the only other option is to revert ea38867831da from the branch,
>>>>> for having been committed prematurely (I'm not so much worried about the
>>>>> master branch, where we have ample time until 4.14). What I surely want
>>>>> to see us avoid is a back and forth in behavior of released versions.
>>>>> (Note that 4.12.2 is similarly blocked on a decision either way here.)
>>>>
>>>> I'm not really sure we really need to revert ea38867831da before the
>>>> 4.13 release. It might not be optimal, but I'm quite sure the number of
>>>> cases where this could be an issue is rather small already, and I tend
>>>> to agree with Paul that admins who really care will more likely want to
>>>> select the option where the system will "just work". IMO the "noticeable
>>>> failure" is something which will be selected mostly by developers. But
>>>> I'm not an expert in that area, so I don't want to influence the
>>>> decision regarding the to be selected default too much.
>>>
>>> An admin not wanting to know is, to me, the same as them not wanting
>>> to know about security issues, and hence not subscribing to our
>>> announcements lists. I can accept this being a reasonable thing to
>>> do when it is an _informed_ decision. But with the current
>>> arrangements there's no way whatsoever for an admin to know.
>>
>> Maybe I have misunderstood the current state, but I thought that it
>> would just silently hide quirky devices without imposing a security
>> risk. We would not learn which devices are quirky, but OTOH I doubt
>> we'd get many reports about those in case your patch goes in.
> 
> We don't want or need such reports, that's not the point. The
> security risk comes from the quirkiness of the devices - admins
> may wrongly think all is well and expose quirky devices to not
> sufficiently trusted guests. (I say this fully realizing that
> exposing devices to untrusted guests is almost always a certain
> level of risk.)

Do we _know_ those devices are problematic from security standpoint?
Normally the IOMMU should do the isolation just fine. If it doesn't
then its not the quirky device which is problematic, but the IOMMU.

I thought the problem was that the quirky devices would not stop all
(read) DMA even when being unassigned from the guest resulting in
fatal IOMMU faults. The dummy page should stop those faults to happen
resulting in a more stable system.

So what are the security problems which are added by this behavior?


Juergen
Paul Durrant Dec. 13, 2019, 1:53 p.m. UTC | #10
> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of
> Jürgen Groß
> Sent: 13 December 2019 13:47
> To: Jan Beulich <jbeulich@suse.com>
> Cc: Kevin Tian <kevin.tian@intel.com>; Stefano Stabellini
> <sstabellini@kernel.org>; Julien Grall <julien@xen.org>; Wei Liu
> <wl@xen.org>; Konrad Wilk <konrad.wilk@oracle.com>; George Dunlap
> <George.Dunlap@eu.citrix.com>; Andrew Cooper <andrew.cooper3@citrix.com>;
> Paul Durrant <paul@xen.org>; Ian Jackson <ian.jackson@citrix.com>; xen-
> devel@lists.xenproject.org; Roger Pau Monné <roger.pau@citrix.com>
> Subject: Re: [Xen-devel] [PATCH v2] IOMMU: make DMA containment of
> quarantined devices optional
> 
> On 13.12.19 14:38, Jan Beulich wrote:
> > On 13.12.2019 14:31, Jürgen Groß wrote:
> >> On 13.12.19 14:21, Jan Beulich wrote:
> >>> On 13.12.2019 14:11, Jürgen Groß wrote:
> >>>> On 13.12.19 13:53, Jan Beulich wrote:
> >>>>> Containing still in flight DMA was introduced to work around certain
> >>>>> devices / systems hanging hard upon hitting an IOMMU fault. Passing
> >>>>> through (such) devices (on such systems) is inherently insecure (as
> >>>>> guests could easily arrange for IOMMU faults to occur). Defaulting
> to
> >>>>> a mode where admins may not even become aware of issues with devices
> can
> >>>>> be considered undesirable. Therefore convert this mode of operation
> to
> >>>>> an optional one, not one enabled by default.
> >>>>>
> >>>>> This involves resurrecting code commit ea38867831da ("x86 / iommu:
> set
> >>>>> up a scratch page in the quarantine domain") did remove, in a
> slightly
> >>>>> extended and abstracted fashion. Here, instead of reintroducing a
> pretty
> >>>>> pointless use of "goto" in domain_context_unmap(), and instead of
> making
> >>>>> the function (at least temporarily) inconsistent, take the
> opportunity
> >>>>> and replace the other similarly pointless "goto" as well.
> >>>>>
> >>>>> In order to key the re-instated bypasses off of there (not) being a
> root
> >>>>> page table this further requires moving the
> allocate_domain_resources()
> >>>>> invocation from reassign_device() to amd_iommu_setup_domain_device()
> (or
> >>>>> else reassign_device() would allocate a root page table anyway);
> this is
> >>>>> benign to the second caller of the latter function.
> >>>>>
> >>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >>>>> ---
> >>>>> As far as 4.13 is concerned, I guess if we can't come to an
> agreement
> >>>>> here, the only other option is to revert ea38867831da from the
> branch,
> >>>>> for having been committed prematurely (I'm not so much worried about
> the
> >>>>> master branch, where we have ample time until 4.14). What I surely
> want
> >>>>> to see us avoid is a back and forth in behavior of released
> versions.
> >>>>> (Note that 4.12.2 is similarly blocked on a decision either way
> here.)
> >>>>
> >>>> I'm not really sure we really need to revert ea38867831da before the
> >>>> 4.13 release. It might not be optimal, but I'm quite sure the number
> of
> >>>> cases where this could be an issue is rather small already, and I
> tend
> >>>> to agree with Paul that admins who really care will more likely want
> to
> >>>> select the option where the system will "just work". IMO the
> "noticeable
> >>>> failure" is something which will be selected mostly by developers.
> But
> >>>> I'm not an expert in that area, so I don't want to influence the
> >>>> decision regarding the to be selected default too much.
> >>>
> >>> An admin not wanting to know is, to me, the same as them not wanting
> >>> to know about security issues, and hence not subscribing to our
> >>> announcements lists. I can accept this being a reasonable thing to
> >>> do when it is an _informed_ decision. But with the current
> >>> arrangements there's no way whatsoever for an admin to know.
> >>
> >> Maybe I have misunderstood the current state, but I thought that it
> >> would just silently hide quirky devices without imposing a security
> >> risk. We would not learn which devices are quirky, but OTOH I doubt
> >> we'd get many reports about those in case your patch goes in.
> >
> > We don't want or need such reports, that's not the point. The
> > security risk comes from the quirkiness of the devices - admins
> > may wrongly think all is well and expose quirky devices to not
> > sufficiently trusted guests. (I say this fully realizing that
> > exposing devices to untrusted guests is almost always a certain
> > level of risk.)
> 
> Do we _know_ those devices are problematic from security standpoint?
> Normally the IOMMU should do the isolation just fine. If it doesn't
> then its not the quirky device which is problematic, but the IOMMU.
> 
> I thought the problem was that the quirky devices would not stop all
> (read) DMA even when being unassigned from the guest resulting in
> fatal IOMMU faults. The dummy page should stop those faults to happen
> resulting in a more stable system.

That's right.

> 
> So what are the security problems which are added by this behavior?
> 

Since *not* having the 'sink' page allows a guest pull off a host DoS in the presence of such h/w, security is surely increased by having it?

  Paul

> 
> Juergen
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xenproject.org
> https://lists.xenproject.org/mailman/listinfo/xen-devel
Jan Beulich Dec. 13, 2019, 2:11 p.m. UTC | #11
On 13.12.2019 14:46, Jürgen Groß wrote:
> On 13.12.19 14:38, Jan Beulich wrote:
>> On 13.12.2019 14:31, Jürgen Groß wrote:
>>> Maybe I have misunderstood the current state, but I thought that it
>>> would just silently hide quirky devices without imposing a security
>>> risk. We would not learn which devices are quirky, but OTOH I doubt
>>> we'd get many reports about those in case your patch goes in.
>>
>> We don't want or need such reports, that's not the point. The
>> security risk comes from the quirkiness of the devices - admins
>> may wrongly think all is well and expose quirky devices to not
>> sufficiently trusted guests. (I say this fully realizing that
>> exposing devices to untrusted guests is almost always a certain
>> level of risk.)
> 
> Do we _know_ those devices are problematic from security standpoint?
> Normally the IOMMU should do the isolation just fine. If it doesn't
> then its not the quirky device which is problematic, but the IOMMU.
> 
> I thought the problem was that the quirky devices would not stop all
> (read) DMA even when being unassigned from the guest resulting in
> fatal IOMMU faults. The dummy page should stop those faults to happen
> resulting in a more stable system.

IOMMU faults by themselves are not impacting stability (they will
add processing overhead, yes). The problem, according to Paul's
description, is that the occurrence of at least some forms of IOMMU
faults (not present ones as it seems, as opposed to permission
violation ones) is fatal to certain systems. Irrespective of the
sink page used after de-assignment a guest can arrange for IOMMU
faults to occur even while it still has the device assigned. Hence
it is important for the admin to know that their system (not the
the particular device) behaves in this undesirable way.

> So what are the security problems which are added by this behavior?

There are no problems added; there are problems hidden from admins.

Jan
Roger Pau Monné Dec. 13, 2019, 2:12 p.m. UTC | #12
On Fri, Dec 13, 2019 at 01:53:29PM +0100, Jan Beulich wrote:
> Containing still in flight DMA was introduced to work around certain
> devices / systems hanging hard upon hitting an IOMMU fault. Passing
> through (such) devices (on such systems) is inherently insecure (as
> guests could easily arrange for IOMMU faults to occur). Defaulting to
> a mode where admins may not even become aware of issues with devices can
> be considered undesirable. Therefore convert this mode of operation to
> an optional one, not one enabled by default.
> 
> This involves resurrecting code commit ea38867831da ("x86 / iommu: set
> up a scratch page in the quarantine domain") did remove, in a slightly
> extended and abstracted fashion. Here, instead of reintroducing a pretty
> pointless use of "goto" in domain_context_unmap(), and instead of making
> the function (at least temporarily) inconsistent, take the opportunity
> and replace the other similarly pointless "goto" as well.
> 
> In order to key the re-instated bypasses off of there (not) being a root
> page table this further requires moving the allocate_domain_resources()
> invocation from reassign_device() to amd_iommu_setup_domain_device() (or
> else reassign_device() would allocate a root page table anyway); this is
> benign to the second caller of the latter function.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

I'm however not sure if the default quarantine mode should be the
basic or the full one.

At the end of day the full one does prevent hard hangs on specific
systems, but a guest with a device behind such bogus IOMMU can
trivially trigger those anyway.

> ---
> As far as 4.13 is concerned, I guess if we can't come to an agreement
> here, the only other option is to revert ea38867831da from the branch,
> for having been committed prematurely (I'm not so much worried about the
> master branch, where we have ample time until 4.14). What I surely want
> to see us avoid is a back and forth in behavior of released versions.
> (Note that 4.12.2 is similarly blocked on a decision either way here.)
> 
> I'm happy to take better suggestions to replace "full".

I was going to comment on v1, but I really have no better alternative.

> --- a/xen/drivers/passthrough/iommu.c
> +++ b/xen/drivers/passthrough/iommu.c
> @@ -30,13 +30,17 @@ bool_t __initdata iommu_enable = 1;
>  bool_t __read_mostly iommu_enabled;
>  bool_t __read_mostly force_iommu;
>  bool_t __read_mostly iommu_verbose;
> -bool __read_mostly iommu_quarantine = true;
>  bool_t __read_mostly iommu_igfx = 1;
>  bool_t __read_mostly iommu_snoop = 1;
>  bool_t __read_mostly iommu_qinval = 1;
>  bool_t __read_mostly iommu_intremap = 1;
>  bool_t __read_mostly iommu_crash_disable;
>  
> +#define IOMMU_quarantine_none  0
> +#define IOMMU_quarantine_basic 1
> +#define IOMMU_quarantine_full  2
> +uint8_t __read_mostly iommu_quarantine = IOMMU_quarantine_basic;

I wonder whether the default should be to use the sink page. Not using
it can lead to a hard hang on certain hardware according to the
description. OTOH if such devices are actually passed through, the
guest itself can trigger such page faults and hence freeze the system.

Thanks, Roger.
Jan Beulich Dec. 13, 2019, 2:23 p.m. UTC | #13
On 13.12.2019 14:53, Durrant, Paul wrote:
> Since *not* having the 'sink' page allows a guest pull off a host DoS
> in the presence of such h/w, security is surely increased by having it?

host		device		result w/o sink		result w/ sink
good		good		good			good
good		babbling	good 			good
wedge on fault	good		DoS (runtime)		DoS (runtime)
wedge on fault	babbling	DoS (runtime/late)	DoS (runtime only, silent)

I wouldn't call it an increase of security to fully hide post-
deassignment issues without doing anything about issues that can
arise while the device is still assigned.

Jan
Jürgen Groß Dec. 13, 2019, 2:24 p.m. UTC | #14
On 13.12.19 15:11, Jan Beulich wrote:
> On 13.12.2019 14:46, Jürgen Groß wrote:
>> On 13.12.19 14:38, Jan Beulich wrote:
>>> On 13.12.2019 14:31, Jürgen Groß wrote:
>>>> Maybe I have misunderstood the current state, but I thought that it
>>>> would just silently hide quirky devices without imposing a security
>>>> risk. We would not learn which devices are quirky, but OTOH I doubt
>>>> we'd get many reports about those in case your patch goes in.
>>>
>>> We don't want or need such reports, that's not the point. The
>>> security risk comes from the quirkiness of the devices - admins
>>> may wrongly think all is well and expose quirky devices to not
>>> sufficiently trusted guests. (I say this fully realizing that
>>> exposing devices to untrusted guests is almost always a certain
>>> level of risk.)
>>
>> Do we _know_ those devices are problematic from security standpoint?
>> Normally the IOMMU should do the isolation just fine. If it doesn't
>> then its not the quirky device which is problematic, but the IOMMU.
>>
>> I thought the problem was that the quirky devices would not stop all
>> (read) DMA even when being unassigned from the guest resulting in
>> fatal IOMMU faults. The dummy page should stop those faults to happen
>> resulting in a more stable system.
> 
> IOMMU faults by themselves are not impacting stability (they will
> add processing overhead, yes). The problem, according to Paul's
> description, is that the occurrence of at least some forms of IOMMU
> faults (not present ones as it seems, as opposed to permission
> violation ones) is fatal to certain systems. Irrespective of the
> sink page used after de-assignment a guest can arrange for IOMMU
> faults to occur even while it still has the device assigned. Hence
> it is important for the admin to know that their system (not the
> the particular device) behaves in this undesirable way.

So how does the admin learn this? Its not as if your patch would result
in a system crash or hang all the time, right? This would be the case
only if there either is a malicious (on purpose or due to a bug) guest
which gets the device assigned, or if there happens to be a pending DMA
operation when the device gets unassigned.

The malicious guest case would be caught without your patch as well.
And most cases of device unassignment would be fine, too, as a normal
guest shutdown/reboot includes stopping I/O activity.

> 
>> So what are the security problems which are added by this behavior?
> 
> There are no problems added; there are problems hidden from admins.

In my understanding in some corner cases, yes.


Juergen
Jan Beulich Dec. 13, 2019, 2:42 p.m. UTC | #15
On 13.12.2019 15:29, Jürgen Groß wrote:
> On 13.12.19 15:23, Jan Beulich wrote:
>> On 13.12.2019 14:53, Durrant, Paul wrote:
>>> Since *not* having the 'sink' page allows a guest pull off a host DoS
>>> in the presence of such h/w, security is surely increased by having it?
>>
>> host		device		result w/o sink		result w/ sink
>> good		good		good			good
>> good		babbling	good 			good
>> wedge on fault	good		DoS (runtime)		DoS (runtime)
> 
> I guess the DoS cases here are due to malicious guest actions?

Yes.

>> wedge on fault	babbling	DoS (runtime/late)	DoS (runtime only, silent)
> 
> And why is the sink page resulting in a silent DoS here?

Sorry, space restrictions may have lead to this being ambiguous:
There's still the runtime DoS; the would-be-DoS after deassignment
will go entirely silent (i.e. without making the admin aware of
the situation, not allowing them to take precautions against the
runtime aspects of this).

Jan
Jan Beulich Dec. 13, 2019, 2:45 p.m. UTC | #16
On 13.12.2019 15:24, Jürgen Groß wrote:
> On 13.12.19 15:11, Jan Beulich wrote:
>> On 13.12.2019 14:46, Jürgen Groß wrote:
>>> On 13.12.19 14:38, Jan Beulich wrote:
>>>> On 13.12.2019 14:31, Jürgen Groß wrote:
>>>>> Maybe I have misunderstood the current state, but I thought that it
>>>>> would just silently hide quirky devices without imposing a security
>>>>> risk. We would not learn which devices are quirky, but OTOH I doubt
>>>>> we'd get many reports about those in case your patch goes in.
>>>>
>>>> We don't want or need such reports, that's not the point. The
>>>> security risk comes from the quirkiness of the devices - admins
>>>> may wrongly think all is well and expose quirky devices to not
>>>> sufficiently trusted guests. (I say this fully realizing that
>>>> exposing devices to untrusted guests is almost always a certain
>>>> level of risk.)
>>>
>>> Do we _know_ those devices are problematic from security standpoint?
>>> Normally the IOMMU should do the isolation just fine. If it doesn't
>>> then its not the quirky device which is problematic, but the IOMMU.
>>>
>>> I thought the problem was that the quirky devices would not stop all
>>> (read) DMA even when being unassigned from the guest resulting in
>>> fatal IOMMU faults. The dummy page should stop those faults to happen
>>> resulting in a more stable system.
>>
>> IOMMU faults by themselves are not impacting stability (they will
>> add processing overhead, yes). The problem, according to Paul's
>> description, is that the occurrence of at least some forms of IOMMU
>> faults (not present ones as it seems, as opposed to permission
>> violation ones) is fatal to certain systems. Irrespective of the
>> sink page used after de-assignment a guest can arrange for IOMMU
>> faults to occur even while it still has the device assigned. Hence
>> it is important for the admin to know that their system (not the
>> the particular device) behaves in this undesirable way.
> 
> So how does the admin learn this? Its not as if your patch would result
> in a system crash or hang all the time, right? This would be the case
> only if there either is a malicious (on purpose or due to a bug) guest
> which gets the device assigned, or if there happens to be a pending DMA
> operation when the device gets unassigned.

I didn't claim the change would cover all cases. All I am claiming
is that it increases the chances of admins becoming aware of reasons
not to pass through devices to certain guests.

Jan
Jürgen Groß Dec. 13, 2019, 3:35 p.m. UTC | #17
On 13.12.19 15:45, Jan Beulich wrote:
> On 13.12.2019 15:24, Jürgen Groß wrote:
>> On 13.12.19 15:11, Jan Beulich wrote:
>>> On 13.12.2019 14:46, Jürgen Groß wrote:
>>>> On 13.12.19 14:38, Jan Beulich wrote:
>>>>> On 13.12.2019 14:31, Jürgen Groß wrote:
>>>>>> Maybe I have misunderstood the current state, but I thought that it
>>>>>> would just silently hide quirky devices without imposing a security
>>>>>> risk. We would not learn which devices are quirky, but OTOH I doubt
>>>>>> we'd get many reports about those in case your patch goes in.
>>>>>
>>>>> We don't want or need such reports, that's not the point. The
>>>>> security risk comes from the quirkiness of the devices - admins
>>>>> may wrongly think all is well and expose quirky devices to not
>>>>> sufficiently trusted guests. (I say this fully realizing that
>>>>> exposing devices to untrusted guests is almost always a certain
>>>>> level of risk.)
>>>>
>>>> Do we _know_ those devices are problematic from security standpoint?
>>>> Normally the IOMMU should do the isolation just fine. If it doesn't
>>>> then its not the quirky device which is problematic, but the IOMMU.
>>>>
>>>> I thought the problem was that the quirky devices would not stop all
>>>> (read) DMA even when being unassigned from the guest resulting in
>>>> fatal IOMMU faults. The dummy page should stop those faults to happen
>>>> resulting in a more stable system.
>>>
>>> IOMMU faults by themselves are not impacting stability (they will
>>> add processing overhead, yes). The problem, according to Paul's
>>> description, is that the occurrence of at least some forms of IOMMU
>>> faults (not present ones as it seems, as opposed to permission
>>> violation ones) is fatal to certain systems. Irrespective of the
>>> sink page used after de-assignment a guest can arrange for IOMMU
>>> faults to occur even while it still has the device assigned. Hence
>>> it is important for the admin to know that their system (not the
>>> the particular device) behaves in this undesirable way.
>>
>> So how does the admin learn this? Its not as if your patch would result
>> in a system crash or hang all the time, right? This would be the case
>> only if there either is a malicious (on purpose or due to a bug) guest
>> which gets the device assigned, or if there happens to be a pending DMA
>> operation when the device gets unassigned.
> 
> I didn't claim the change would cover all cases. All I am claiming
> is that it increases the chances of admins becoming aware of reasons
> not to pass through devices to certain guests.

So combined with your answer this means to me:

With your patch (or the original one reverted) a DoS will occur either
due to a malicious guest or in case a DMA is still pending. As a result
the admin will no longer pass this device to any untrusted guest.

With the current 4.13-staging a DoS will occur only due to a malicious
guest. The admin will then no longer pass this device to any untrusted
guest.

So right now without any untrusted guest no DoS, while possibly DoS with
your patch. How is that better?


Juergen
Jan Beulich Dec. 13, 2019, 5 p.m. UTC | #18
On 13.12.2019 16:35, Jürgen Groß wrote:
> On 13.12.19 15:45, Jan Beulich wrote:
>> On 13.12.2019 15:24, Jürgen Groß wrote:
>>> On 13.12.19 15:11, Jan Beulich wrote:
>>>> On 13.12.2019 14:46, Jürgen Groß wrote:
>>>>> On 13.12.19 14:38, Jan Beulich wrote:
>>>>>> On 13.12.2019 14:31, Jürgen Groß wrote:
>>>>>>> Maybe I have misunderstood the current state, but I thought that it
>>>>>>> would just silently hide quirky devices without imposing a security
>>>>>>> risk. We would not learn which devices are quirky, but OTOH I doubt
>>>>>>> we'd get many reports about those in case your patch goes in.
>>>>>>
>>>>>> We don't want or need such reports, that's not the point. The
>>>>>> security risk comes from the quirkiness of the devices - admins
>>>>>> may wrongly think all is well and expose quirky devices to not
>>>>>> sufficiently trusted guests. (I say this fully realizing that
>>>>>> exposing devices to untrusted guests is almost always a certain
>>>>>> level of risk.)
>>>>>
>>>>> Do we _know_ those devices are problematic from security standpoint?
>>>>> Normally the IOMMU should do the isolation just fine. If it doesn't
>>>>> then its not the quirky device which is problematic, but the IOMMU.
>>>>>
>>>>> I thought the problem was that the quirky devices would not stop all
>>>>> (read) DMA even when being unassigned from the guest resulting in
>>>>> fatal IOMMU faults. The dummy page should stop those faults to happen
>>>>> resulting in a more stable system.
>>>>
>>>> IOMMU faults by themselves are not impacting stability (they will
>>>> add processing overhead, yes). The problem, according to Paul's
>>>> description, is that the occurrence of at least some forms of IOMMU
>>>> faults (not present ones as it seems, as opposed to permission
>>>> violation ones) is fatal to certain systems. Irrespective of the
>>>> sink page used after de-assignment a guest can arrange for IOMMU
>>>> faults to occur even while it still has the device assigned. Hence
>>>> it is important for the admin to know that their system (not the
>>>> the particular device) behaves in this undesirable way.
>>>
>>> So how does the admin learn this? Its not as if your patch would result
>>> in a system crash or hang all the time, right? This would be the case
>>> only if there either is a malicious (on purpose or due to a bug) guest
>>> which gets the device assigned, or if there happens to be a pending DMA
>>> operation when the device gets unassigned.
>>
>> I didn't claim the change would cover all cases. All I am claiming
>> is that it increases the chances of admins becoming aware of reasons
>> not to pass through devices to certain guests.
> 
> So combined with your answer this means to me:
> 
> With your patch (or the original one reverted) a DoS will occur either
> due to a malicious guest or in case a DMA is still pending. As a result
> the admin will no longer pass this device to any untrusted guest.
> 
> With the current 4.13-staging a DoS will occur only due to a malicious
> guest. The admin will then no longer pass this device to any untrusted
> guest.
> 
> So right now without any untrusted guest no DoS, while possibly DoS with
> your patch. How is that better?

I'm afraid this way we can debate endlessly, because it's not like
there's a clear winner here.

Jan
Tian, Kevin Dec. 16, 2019, 5:58 a.m. UTC | #19
> From: Jürgen Groß <jgross@suse.com>
> Sent: Friday, December 13, 2019 11:36 PM
> 
> On 13.12.19 15:45, Jan Beulich wrote:
> > On 13.12.2019 15:24, Jürgen Groß wrote:
> >> On 13.12.19 15:11, Jan Beulich wrote:
> >>> On 13.12.2019 14:46, Jürgen Groß wrote:
> >>>> On 13.12.19 14:38, Jan Beulich wrote:
> >>>>> On 13.12.2019 14:31, Jürgen Groß wrote:
> >>>>>> Maybe I have misunderstood the current state, but I thought that it
> >>>>>> would just silently hide quirky devices without imposing a security
> >>>>>> risk. We would not learn which devices are quirky, but OTOH I doubt
> >>>>>> we'd get many reports about those in case your patch goes in.
> >>>>>
> >>>>> We don't want or need such reports, that's not the point. The
> >>>>> security risk comes from the quirkiness of the devices - admins
> >>>>> may wrongly think all is well and expose quirky devices to not
> >>>>> sufficiently trusted guests. (I say this fully realizing that
> >>>>> exposing devices to untrusted guests is almost always a certain
> >>>>> level of risk.)
> >>>>
> >>>> Do we _know_ those devices are problematic from security standpoint?
> >>>> Normally the IOMMU should do the isolation just fine. If it doesn't
> >>>> then its not the quirky device which is problematic, but the IOMMU.
> >>>>
> >>>> I thought the problem was that the quirky devices would not stop all
> >>>> (read) DMA even when being unassigned from the guest resulting in
> >>>> fatal IOMMU faults. The dummy page should stop those faults to
> happen
> >>>> resulting in a more stable system.
> >>>
> >>> IOMMU faults by themselves are not impacting stability (they will
> >>> add processing overhead, yes). The problem, according to Paul's
> >>> description, is that the occurrence of at least some forms of IOMMU
> >>> faults (not present ones as it seems, as opposed to permission
> >>> violation ones) is fatal to certain systems. Irrespective of the
> >>> sink page used after de-assignment a guest can arrange for IOMMU
> >>> faults to occur even while it still has the device assigned. Hence
> >>> it is important for the admin to know that their system (not the
> >>> the particular device) behaves in this undesirable way.
> >>
> >> So how does the admin learn this? Its not as if your patch would result
> >> in a system crash or hang all the time, right? This would be the case
> >> only if there either is a malicious (on purpose or due to a bug) guest
> >> which gets the device assigned, or if there happens to be a pending DMA
> >> operation when the device gets unassigned.
> >
> > I didn't claim the change would cover all cases. All I am claiming
> > is that it increases the chances of admins becoming aware of reasons
> > not to pass through devices to certain guests.
> 
> So combined with your answer this means to me:
> 
> With your patch (or the original one reverted) a DoS will occur either
> due to a malicious guest or in case a DMA is still pending. As a result
> the admin will no longer pass this device to any untrusted guest.
> 
> With the current 4.13-staging a DoS will occur only due to a malicious
> guest. The admin will then no longer pass this device to any untrusted
> guest.
> 
> So right now without any untrusted guest no DoS, while possibly DoS with
> your patch. How is that better?
> 

I'd suggest separating run-time DoS from original quarantine purpose
of this patch.

For quarantine, I'm with Jan that giving admin the chance of knowing
whether quarantine is required is important. Say an admin just gets
a sample device from a new vendor and needs to decide whether his
employer should put such device in their production system. It's
essential to have a default configuration which can warn on any 
possible violation of the expectations on a good assignable device. 
Then the admin can look at Xen user guide to find out what the warning
information means and then does whatever required (usually means
more scrutinization than the warning itself) to figure out whether 
identified problems are safe (e.g. by enabling quarantine) or are
real indicators of bogus implementation (then should not use it).
Having quarantine default on means that every admin should remember
that Xen already enables some band-aids on benign warnings so he
should explicitly turn off those options to do evaluation which, to me
is not realistic.

On the other hand, although a sink page can help mitigate run-time DoS,
how to handle it should be an admin policy. If DoS is caused by 
malicious guest, it makes more sense to warn the admin and then switch
to sink page after meeting a DoS detection condition (e.g. number of
faults within a timing window exceeds a threshold). Lots of such warnings
from multiple VMs may imply a massive attack then the admin may adopt
more comprehensive analysis or defensive means. Then it's also possible
that DoS is caused by a vulnerability within the guest. In such case, the
guest may want to contain the DoS attack itself through a virtual IOMMU.
Then Xen needs to know the fault and then forward to the guest through
the vIOMMU. In either case I don't think current patch is sufficient. So it's
better to leave DoS improved separately (of course the sink logic can be 
leveraged), given the timeline of 4.13. 

Thanks
Kevin
Tian, Kevin Dec. 16, 2019, 6:01 a.m. UTC | #20
> From: Roger Pau Monné <roger.pau@citrix.com>
> Sent: Friday, December 13, 2019 10:13 PM
> 
> On Fri, Dec 13, 2019 at 01:53:29PM +0100, Jan Beulich wrote:
> > Containing still in flight DMA was introduced to work around certain
> > devices / systems hanging hard upon hitting an IOMMU fault. Passing
> > through (such) devices (on such systems) is inherently insecure (as
> > guests could easily arrange for IOMMU faults to occur). Defaulting to
> > a mode where admins may not even become aware of issues with devices
> can
> > be considered undesirable. Therefore convert this mode of operation to
> > an optional one, not one enabled by default.
> >
> > This involves resurrecting code commit ea38867831da ("x86 / iommu: set
> > up a scratch page in the quarantine domain") did remove, in a slightly
> > extended and abstracted fashion. Here, instead of reintroducing a pretty
> > pointless use of "goto" in domain_context_unmap(), and instead of making
> > the function (at least temporarily) inconsistent, take the opportunity
> > and replace the other similarly pointless "goto" as well.
> >
> > In order to key the re-instated bypasses off of there (not) being a root
> > page table this further requires moving the allocate_domain_resources()
> > invocation from reassign_device() to amd_iommu_setup_domain_device()
> (or
> > else reassign_device() would allocate a root page table anyway); this is
> > benign to the second caller of the latter function.
> >
> > Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> I'm however not sure if the default quarantine mode should be the
> basic or the full one.
> 
> At the end of day the full one does prevent hard hangs on specific
> systems, but a guest with a device behind such bogus IOMMU can
> trivially trigger those anyway.

It's a bogus device behind a good IOMMU. If IOMMU is bogus, we
should not pass through devices behind it. 
Tian, Kevin Dec. 16, 2019, 6:39 a.m. UTC | #21
> From: Tian, Kevin
> Sent: Monday, December 16, 2019 1:58 PM
> 
> > From: Jürgen Groß <jgross@suse.com>
> > Sent: Friday, December 13, 2019 11:36 PM
> >
> > On 13.12.19 15:45, Jan Beulich wrote:
> > > On 13.12.2019 15:24, Jürgen Groß wrote:
> > >> On 13.12.19 15:11, Jan Beulich wrote:
> > >>> On 13.12.2019 14:46, Jürgen Groß wrote:
> > >>>> On 13.12.19 14:38, Jan Beulich wrote:
> > >>>>> On 13.12.2019 14:31, Jürgen Groß wrote:
> > >>>>>> Maybe I have misunderstood the current state, but I thought that it
> > >>>>>> would just silently hide quirky devices without imposing a security
> > >>>>>> risk. We would not learn which devices are quirky, but OTOH I
> doubt
> > >>>>>> we'd get many reports about those in case your patch goes in.
> > >>>>>
> > >>>>> We don't want or need such reports, that's not the point. The
> > >>>>> security risk comes from the quirkiness of the devices - admins
> > >>>>> may wrongly think all is well and expose quirky devices to not
> > >>>>> sufficiently trusted guests. (I say this fully realizing that
> > >>>>> exposing devices to untrusted guests is almost always a certain
> > >>>>> level of risk.)
> > >>>>
> > >>>> Do we _know_ those devices are problematic from security
> standpoint?
> > >>>> Normally the IOMMU should do the isolation just fine. If it doesn't
> > >>>> then its not the quirky device which is problematic, but the IOMMU.
> > >>>>
> > >>>> I thought the problem was that the quirky devices would not stop all
> > >>>> (read) DMA even when being unassigned from the guest resulting in
> > >>>> fatal IOMMU faults. The dummy page should stop those faults to
> > happen
> > >>>> resulting in a more stable system.
> > >>>
> > >>> IOMMU faults by themselves are not impacting stability (they will
> > >>> add processing overhead, yes). The problem, according to Paul's
> > >>> description, is that the occurrence of at least some forms of IOMMU
> > >>> faults (not present ones as it seems, as opposed to permission
> > >>> violation ones) is fatal to certain systems. Irrespective of the
> > >>> sink page used after de-assignment a guest can arrange for IOMMU
> > >>> faults to occur even while it still has the device assigned. Hence
> > >>> it is important for the admin to know that their system (not the
> > >>> the particular device) behaves in this undesirable way.
> > >>
> > >> So how does the admin learn this? Its not as if your patch would result
> > >> in a system crash or hang all the time, right? This would be the case
> > >> only if there either is a malicious (on purpose or due to a bug) guest
> > >> which gets the device assigned, or if there happens to be a pending DMA
> > >> operation when the device gets unassigned.
> > >
> > > I didn't claim the change would cover all cases. All I am claiming
> > > is that it increases the chances of admins becoming aware of reasons
> > > not to pass through devices to certain guests.
> >
> > So combined with your answer this means to me:
> >
> > With your patch (or the original one reverted) a DoS will occur either
> > due to a malicious guest or in case a DMA is still pending. As a result
> > the admin will no longer pass this device to any untrusted guest.
> >
> > With the current 4.13-staging a DoS will occur only due to a malicious
> > guest. The admin will then no longer pass this device to any untrusted
> > guest.
> >
> > So right now without any untrusted guest no DoS, while possibly DoS with
> > your patch. How is that better?
> >
> 
> I'd suggest separating run-time DoS from original quarantine purpose
> of this patch.
> 
> For quarantine, I'm with Jan that giving admin the chance of knowing
> whether quarantine is required is important. Say an admin just gets
> a sample device from a new vendor and needs to decide whether his
> employer should put such device in their production system. It's
> essential to have a default configuration which can warn on any
> possible violation of the expectations on a good assignable device.
> Then the admin can look at Xen user guide to find out what the warning
> information means and then does whatever required (usually means
> more scrutinization than the warning itself) to figure out whether
> identified problems are safe (e.g. by enabling quarantine) or are
> real indicators of bogus implementation (then should not use it).
> Having quarantine default on means that every admin should remember
> that Xen already enables some band-aids on benign warnings so he
> should explicitly turn off those options to do evaluation which, to me
> is not realistic.
> 
> On the other hand, although a sink page can help mitigate run-time DoS,
> how to handle it should be an admin policy. If DoS is caused by
> malicious guest, it makes more sense to warn the admin and then switch
> to sink page after meeting a DoS detection condition (e.g. number of
> faults within a timing window exceeds a threshold). Lots of such warnings
> from multiple VMs may imply a massive attack then the admin may adopt
> more comprehensive analysis or defensive means. Then it's also possible
> that DoS is caused by a vulnerability within the guest. In such case, the
> guest may want to contain the DoS attack itself through a virtual IOMMU.
> Then Xen needs to know the fault and then forward to the guest through
> the vIOMMU. In either case I don't think current patch is sufficient. So it's
> better to leave DoS improved separately (of course the sink logic can be
> leveraged), given the timeline of 4.13.
> 

btw I forgot that today we already have some-level of DoS detection. 
You can check pci_check_disable_device, which exactly does above
threshold check and then disable the PCI_COMMAND register. The 
missing part might be, the device still issues DMAs even after the
disabling. In such case, at least VT-d allows the software to disable fault
reporting on a given device. Such method betters suits warn-and-disable
model upon DoS attacking than blindly setting up a sink page.

Thanks
Kevin
Jürgen Groß Dec. 16, 2019, 7:24 a.m. UTC | #22
On 16.12.19 06:58, Tian, Kevin wrote:
>> From: Jürgen Groß <jgross@suse.com>
>> Sent: Friday, December 13, 2019 11:36 PM
>>
>> On 13.12.19 15:45, Jan Beulich wrote:
>>> On 13.12.2019 15:24, Jürgen Groß wrote:
>>>> On 13.12.19 15:11, Jan Beulich wrote:
>>>>> On 13.12.2019 14:46, Jürgen Groß wrote:
>>>>>> On 13.12.19 14:38, Jan Beulich wrote:
>>>>>>> On 13.12.2019 14:31, Jürgen Groß wrote:
>>>>>>>> Maybe I have misunderstood the current state, but I thought that it
>>>>>>>> would just silently hide quirky devices without imposing a security
>>>>>>>> risk. We would not learn which devices are quirky, but OTOH I doubt
>>>>>>>> we'd get many reports about those in case your patch goes in.
>>>>>>>
>>>>>>> We don't want or need such reports, that's not the point. The
>>>>>>> security risk comes from the quirkiness of the devices - admins
>>>>>>> may wrongly think all is well and expose quirky devices to not
>>>>>>> sufficiently trusted guests. (I say this fully realizing that
>>>>>>> exposing devices to untrusted guests is almost always a certain
>>>>>>> level of risk.)
>>>>>>
>>>>>> Do we _know_ those devices are problematic from security standpoint?
>>>>>> Normally the IOMMU should do the isolation just fine. If it doesn't
>>>>>> then its not the quirky device which is problematic, but the IOMMU.
>>>>>>
>>>>>> I thought the problem was that the quirky devices would not stop all
>>>>>> (read) DMA even when being unassigned from the guest resulting in
>>>>>> fatal IOMMU faults. The dummy page should stop those faults to
>> happen
>>>>>> resulting in a more stable system.
>>>>>
>>>>> IOMMU faults by themselves are not impacting stability (they will
>>>>> add processing overhead, yes). The problem, according to Paul's
>>>>> description, is that the occurrence of at least some forms of IOMMU
>>>>> faults (not present ones as it seems, as opposed to permission
>>>>> violation ones) is fatal to certain systems. Irrespective of the
>>>>> sink page used after de-assignment a guest can arrange for IOMMU
>>>>> faults to occur even while it still has the device assigned. Hence
>>>>> it is important for the admin to know that their system (not the
>>>>> the particular device) behaves in this undesirable way.
>>>>
>>>> So how does the admin learn this? Its not as if your patch would result
>>>> in a system crash or hang all the time, right? This would be the case
>>>> only if there either is a malicious (on purpose or due to a bug) guest
>>>> which gets the device assigned, or if there happens to be a pending DMA
>>>> operation when the device gets unassigned.
>>>
>>> I didn't claim the change would cover all cases. All I am claiming
>>> is that it increases the chances of admins becoming aware of reasons
>>> not to pass through devices to certain guests.
>>
>> So combined with your answer this means to me:
>>
>> With your patch (or the original one reverted) a DoS will occur either
>> due to a malicious guest or in case a DMA is still pending. As a result
>> the admin will no longer pass this device to any untrusted guest.
>>
>> With the current 4.13-staging a DoS will occur only due to a malicious
>> guest. The admin will then no longer pass this device to any untrusted
>> guest.
>>
>> So right now without any untrusted guest no DoS, while possibly DoS with
>> your patch. How is that better?
>>
> 
> I'd suggest separating run-time DoS from original quarantine purpose
> of this patch.
> 
> For quarantine, I'm with Jan that giving admin the chance of knowing
> whether quarantine is required is important. Say an admin just gets
> a sample device from a new vendor and needs to decide whether his
> employer should put such device in their production system. It's
> essential to have a default configuration which can warn on any
> possible violation of the expectations on a good assignable device.
> Then the admin can look at Xen user guide to find out what the warning
> information means and then does whatever required (usually means
> more scrutinization than the warning itself) to figure out whether
> identified problems are safe (e.g. by enabling quarantine) or are
> real indicators of bogus implementation (then should not use it).
> Having quarantine default on means that every admin should remember
> that Xen already enables some band-aids on benign warnings so he
> should explicitly turn off those options to do evaluation which, to me
> is not realistic.

This implies the admin is aware of the necessity to do that testing.
And for the tests to be conclusive he probably needs to do more than
just a basic "does it work" test, as the pending DMA might occur in
some cases only. And that is basically my problem with Jan's default:
an admin not doing enough testing (or non at all) will end up with a
DoS situation in production, while an admin knowing that he needs to
test properly is probably more aware of the needed command line option
for evaluation of the device's security relevance.

Its a complex problem and I think the decision for the default should
not be rushed. So I think it is best to discuss it on xen-devel and
leave the patch out of the initial 4.13 release (this patch is the last
pending one for 4.13 now). After a decision is made the patch can easily
by backported to 4.13 in case it is regarded to be important.


Juergen
Sander Eikelenboom Dec. 16, 2019, 8:55 a.m. UTC | #23
On 16/12/2019 08:24, Jürgen Groß wrote:
> On 16.12.19 06:58, Tian, Kevin wrote:
>>> From: Jürgen Groß <jgross@suse.com>
>>> Sent: Friday, December 13, 2019 11:36 PM
>>>
>>> On 13.12.19 15:45, Jan Beulich wrote:
>>>> On 13.12.2019 15:24, Jürgen Groß wrote:
>>>>> On 13.12.19 15:11, Jan Beulich wrote:
>>>>>> On 13.12.2019 14:46, Jürgen Groß wrote:
>>>>>>> On 13.12.19 14:38, Jan Beulich wrote:
>>>>>>>> On 13.12.2019 14:31, Jürgen Groß wrote:
>>>>>>>>> Maybe I have misunderstood the current state, but I thought that it
>>>>>>>>> would just silently hide quirky devices without imposing a security
>>>>>>>>> risk. We would not learn which devices are quirky, but OTOH I doubt
>>>>>>>>> we'd get many reports about those in case your patch goes in.
>>>>>>>>
>>>>>>>> We don't want or need such reports, that's not the point. The
>>>>>>>> security risk comes from the quirkiness of the devices - admins
>>>>>>>> may wrongly think all is well and expose quirky devices to not
>>>>>>>> sufficiently trusted guests. (I say this fully realizing that
>>>>>>>> exposing devices to untrusted guests is almost always a certain
>>>>>>>> level of risk.)
>>>>>>>
>>>>>>> Do we _know_ those devices are problematic from security standpoint?
>>>>>>> Normally the IOMMU should do the isolation just fine. If it doesn't
>>>>>>> then its not the quirky device which is problematic, but the IOMMU.
>>>>>>>
>>>>>>> I thought the problem was that the quirky devices would not stop all
>>>>>>> (read) DMA even when being unassigned from the guest resulting in
>>>>>>> fatal IOMMU faults. The dummy page should stop those faults to
>>> happen
>>>>>>> resulting in a more stable system.
>>>>>>
>>>>>> IOMMU faults by themselves are not impacting stability (they will
>>>>>> add processing overhead, yes). The problem, according to Paul's
>>>>>> description, is that the occurrence of at least some forms of IOMMU
>>>>>> faults (not present ones as it seems, as opposed to permission
>>>>>> violation ones) is fatal to certain systems. Irrespective of the
>>>>>> sink page used after de-assignment a guest can arrange for IOMMU
>>>>>> faults to occur even while it still has the device assigned. Hence
>>>>>> it is important for the admin to know that their system (not the
>>>>>> the particular device) behaves in this undesirable way.
>>>>>
>>>>> So how does the admin learn this? Its not as if your patch would result
>>>>> in a system crash or hang all the time, right? This would be the case
>>>>> only if there either is a malicious (on purpose or due to a bug) guest
>>>>> which gets the device assigned, or if there happens to be a pending DMA
>>>>> operation when the device gets unassigned.
>>>>
>>>> I didn't claim the change would cover all cases. All I am claiming
>>>> is that it increases the chances of admins becoming aware of reasons
>>>> not to pass through devices to certain guests.
>>>
>>> So combined with your answer this means to me:
>>>
>>> With your patch (or the original one reverted) a DoS will occur either
>>> due to a malicious guest or in case a DMA is still pending. As a result
>>> the admin will no longer pass this device to any untrusted guest.
>>>
>>> With the current 4.13-staging a DoS will occur only due to a malicious
>>> guest. The admin will then no longer pass this device to any untrusted
>>> guest.
>>>
>>> So right now without any untrusted guest no DoS, while possibly DoS with
>>> your patch. How is that better?
>>>
>>
>> I'd suggest separating run-time DoS from original quarantine purpose
>> of this patch.
>>
>> For quarantine, I'm with Jan that giving admin the chance of knowing
>> whether quarantine is required is important. Say an admin just gets
>> a sample device from a new vendor and needs to decide whether his
>> employer should put such device in their production system. It's
>> essential to have a default configuration which can warn on any
>> possible violation of the expectations on a good assignable device.
>> Then the admin can look at Xen user guide to find out what the warning
>> information means and then does whatever required (usually means
>> more scrutinization than the warning itself) to figure out whether
>> identified problems are safe (e.g. by enabling quarantine) or are
>> real indicators of bogus implementation (then should not use it).
>> Having quarantine default on means that every admin should remember
>> that Xen already enables some band-aids on benign warnings so he
>> should explicitly turn off those options to do evaluation which, to me
>> is not realistic.
> 
> This implies the admin is aware of the necessity to do that testing.
> And for the tests to be conclusive he probably needs to do more than
> just a basic "does it work" test, as the pending DMA might occur in
> some cases only. And that is basically my problem with Jan's default:
> an admin not doing enough testing (or non at all) will end up with a
> DoS situation in production, while an admin knowing that he needs to
> test properly is probably more aware of the needed command line option
> for evaluation of the device's security relevance.
> 
> Its a complex problem and I think the decision for the default should
> not be rushed. So I think it is best to discuss it on xen-devel and
> leave the patch out of the initial 4.13 release (this patch is the last
> pending one for 4.13 now). After a decision is made the patch can easily
> by backported to 4.13 in case it is regarded to be important.
> 
> 
> Juergen
> 
> 
Isn't it possible to quarantine by default, but still detect if there is
any (DMA) activity on that device and (perhaps after a certain threshold
or short time) print a big fat (once or rate limited) warning in Xen
logs that the device could be rogue and should be checked upon by an admin ?

--
Sander
diff mbox series

Patch

--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -1232,7 +1232,7 @@  detection of systems known to misbehave
 > Default: `new` unless directed-EOI is supported
 
 ### iommu
-    = List of [ <bool>, verbose, debug, force, required, quarantine,
+    = List of [ <bool>, verbose, debug, force, required, quarantine[=full],
                 sharept, intremap, intpost, crash-disable,
                 snoop, qinval, igfx, amd-iommu-perdev-intremap,
                 dom0-{passthrough,strict} ]
@@ -1270,11 +1270,13 @@  boolean (e.g. `iommu=no`) can override t
     will prevent Xen from booting if IOMMUs aren't discovered and enabled
     successfully.
 
-*   The `quarantine` boolean can be used to control Xen's behavior when
-    de-assigning devices from guests.  If enabled (the default), Xen always
-    quarantines such devices; they must be explicitly assigned back to Dom0
-    before they can be used there again.  If disabled, Xen will only
-    quarantine devices the toolstack hass arranged for getting quarantined.
+*   The `quarantine` option can be used to control Xen's behavior when
+    de-assigning devices from guests.  If set to true (the default), Xen
+    always quarantines such devices; they must be explicitly assigned back
+    to Dom0 before they can be used there again.  If set to "full", still
+    active DMA will additionally be directed to a "sink" page.  If set to
+    false, Xen will only quarantine devices the toolstack has arranged for
+    getting quarantined.
 
 *   The `sharept` boolean controls whether the IOMMU pagetables are shared
     with the CPU-side HAP pagetables, or allocated separately.  Sharing
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -28,6 +28,9 @@ 
 #include <asm/hvm/svm/amd-iommu-proto.h>
 #include "../ats.h"
 
+/* dom_io is used as a sentinel for quarantined devices */
+#define QUARANTINE_SKIP(d) ((d) == dom_io && !dom_iommu(d)->arch.root_table)
+
 static bool_t __read_mostly init_done;
 
 static const struct iommu_init_ops _iommu_init_ops;
@@ -85,18 +88,35 @@  int get_dma_requestor_id(uint16_t seg, u
     return req_id;
 }
 
-static void amd_iommu_setup_domain_device(
+static int __must_check allocate_domain_resources(struct domain_iommu *hd)
+{
+    int rc;
+
+    spin_lock(&hd->arch.mapping_lock);
+    rc = amd_iommu_alloc_root(hd);
+    spin_unlock(&hd->arch.mapping_lock);
+
+    return rc;
+}
+
+static int __must_check amd_iommu_setup_domain_device(
     struct domain *domain, struct amd_iommu *iommu,
     uint8_t devfn, struct pci_dev *pdev)
 {
     struct amd_iommu_dte *table, *dte;
     unsigned long flags;
-    int req_id, valid = 1;
+    int req_id, valid = 1, rc;
     u8 bus = pdev->bus;
-    const struct domain_iommu *hd = dom_iommu(domain);
+    struct domain_iommu *hd = dom_iommu(domain);
+
+    if ( QUARANTINE_SKIP(domain) )
+        return 0;
+
+    BUG_ON(!hd->arch.paging_mode || !iommu->dev_table.buffer);
 
-    BUG_ON( !hd->arch.root_table || !hd->arch.paging_mode ||
-            !iommu->dev_table.buffer );
+    rc = allocate_domain_resources(hd);
+    if ( rc )
+        return rc;
 
     if ( iommu_hwdom_passthrough && is_hardware_domain(domain) )
         valid = 0;
@@ -151,6 +171,8 @@  static void amd_iommu_setup_domain_devic
 
         amd_iommu_flush_iotlb(devfn, pdev, INV_IOMMU_ALL_PAGES_ADDRESS, 0);
     }
+
+    return 0;
 }
 
 int __init acpi_ivrs_init(void)
@@ -220,17 +242,6 @@  int amd_iommu_alloc_root(struct domain_i
     return 0;
 }
 
-static int __must_check allocate_domain_resources(struct domain_iommu *hd)
-{
-    int rc;
-
-    spin_lock(&hd->arch.mapping_lock);
-    rc = amd_iommu_alloc_root(hd);
-    spin_unlock(&hd->arch.mapping_lock);
-
-    return rc;
-}
-
 int amd_iommu_get_paging_mode(unsigned long entries)
 {
     int level = 1;
@@ -294,6 +305,9 @@  static void amd_iommu_disable_domain_dev
     int req_id;
     u8 bus = pdev->bus;
 
+    if ( QUARANTINE_SKIP(domain) )
+        return;
+
     BUG_ON ( iommu->dev_table.buffer == NULL );
     req_id = get_dma_requestor_id(iommu->seg, PCI_BDF2(bus, devfn));
     table = iommu->dev_table.buffer;
@@ -340,7 +354,6 @@  static int reassign_device(struct domain
 {
     struct amd_iommu *iommu;
     int bdf, rc;
-    struct domain_iommu *t = dom_iommu(target);
 
     bdf = PCI_BDF2(pdev->bus, pdev->devfn);
     iommu = find_iommu_for_device(pdev->seg, bdf);
@@ -361,11 +374,10 @@  static int reassign_device(struct domain
         pdev->domain = target;
     }
 
-    rc = allocate_domain_resources(t);
+    rc = amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
     if ( rc )
         return rc;
 
-    amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
     AMD_IOMMU_DEBUG("Re-assign %04x:%02x:%02x.%u from dom%d to dom%d\n",
                     pdev->seg, pdev->bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
                     source->domain_id, target->domain_id);
@@ -522,8 +534,7 @@  static int amd_iommu_add_device(u8 devfn
         spin_unlock_irqrestore(&iommu->lock, flags);
     }
 
-    amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
-    return 0;
+    return amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev);
 }
 
 static int amd_iommu_remove_device(u8 devfn, struct pci_dev *pdev)
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -30,13 +30,17 @@  bool_t __initdata iommu_enable = 1;
 bool_t __read_mostly iommu_enabled;
 bool_t __read_mostly force_iommu;
 bool_t __read_mostly iommu_verbose;
-bool __read_mostly iommu_quarantine = true;
 bool_t __read_mostly iommu_igfx = 1;
 bool_t __read_mostly iommu_snoop = 1;
 bool_t __read_mostly iommu_qinval = 1;
 bool_t __read_mostly iommu_intremap = 1;
 bool_t __read_mostly iommu_crash_disable;
 
+#define IOMMU_quarantine_none  0
+#define IOMMU_quarantine_basic 1
+#define IOMMU_quarantine_full  2
+uint8_t __read_mostly iommu_quarantine = IOMMU_quarantine_basic;
+
 static bool __hwdom_initdata iommu_hwdom_none;
 bool __hwdom_initdata iommu_hwdom_strict;
 bool __read_mostly iommu_hwdom_passthrough;
@@ -81,6 +85,8 @@  static int __init parse_iommu_param(cons
             force_iommu = val;
         else if ( (val = parse_boolean("quarantine", s, ss)) >= 0 )
             iommu_quarantine = val;
+        else if ( ss == s + 15 && !strncmp(s, "quarantine=full", 15) )
+            iommu_quarantine = IOMMU_quarantine_full;
         else if ( (val = parse_boolean("igfx", s, ss)) >= 0 )
             iommu_igfx = val;
         else if ( (val = parse_boolean("verbose", s, ss)) >= 0 )
@@ -451,7 +457,7 @@  static int __init iommu_quarantine_init(
     dom_io->options |= XEN_DOMCTL_CDF_iommu;
 
     rc = iommu_domain_init(dom_io, 0);
-    if ( rc )
+    if ( rc || iommu_quarantine < IOMMU_quarantine_full )
         return rc;
 
     if ( !hd->platform_ops->quarantine_init )
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -41,6 +41,9 @@ 
 #include "vtd.h"
 #include "../ats.h"
 
+/* dom_io is used as a sentinel for quarantined devices */
+#define QUARANTINE_SKIP(d) ((d) == dom_io && !dom_iommu(d)->arch.pgd_maddr)
+
 struct mapped_rmrr {
     struct list_head list;
     u64 base, end;
@@ -1291,6 +1294,9 @@  int domain_context_mapping_one(
     int agaw, rc, ret;
     bool_t flush_dev_iotlb;
 
+    if ( QUARANTINE_SKIP(domain) )
+        return 0;
+
     ASSERT(pcidevs_locked());
     spin_lock(&iommu->lock);
     maddr = bus_to_context_maddr(iommu, bus);
@@ -1537,6 +1543,9 @@  int domain_context_unmap_one(
     int iommu_domid, rc, ret;
     bool_t flush_dev_iotlb;
 
+    if ( QUARANTINE_SKIP(domain) )
+        return 0;
+
     ASSERT(pcidevs_locked());
     spin_lock(&iommu->lock);
 
@@ -1598,7 +1607,7 @@  static int domain_context_unmap(struct d
 {
     struct acpi_drhd_unit *drhd;
     struct vtd_iommu *iommu;
-    int ret = 0;
+    int ret;
     u8 seg = pdev->seg, bus = pdev->bus, tmp_bus, tmp_devfn, secbus;
     int found = 0;
 
@@ -1614,14 +1623,12 @@  static int domain_context_unmap(struct d
             printk(VTDPREFIX "d%d:Hostbridge: skip %04x:%02x:%02x.%u unmap\n",
                    domain->domain_id, seg, bus,
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
-        if ( !is_hardware_domain(domain) )
-            return -EPERM;
-        goto out;
+        return is_hardware_domain(domain) ? 0 : -EPERM;
 
     case DEV_TYPE_PCIe_BRIDGE:
     case DEV_TYPE_PCIe2PCI_BRIDGE:
     case DEV_TYPE_LEGACY_PCI_BRIDGE:
-        goto out;
+        return 0;
 
     case DEV_TYPE_PCIe_ENDPOINT:
         if ( iommu_debug )
@@ -1665,10 +1672,12 @@  static int domain_context_unmap(struct d
         dprintk(XENLOG_ERR VTDPREFIX, "d%d:unknown(%u): %04x:%02x:%02x.%u\n",
                 domain->domain_id, pdev->type,
                 seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
-        ret = -EINVAL;
-        goto out;
+        return -EINVAL;
     }
 
+    if ( QUARANTINE_SKIP(domain) )
+        return ret;
+
     /*
      * if no other devices under the same iommu owned by this domain,
      * clear iommu in iommu_bitmap and clear domain_id in domid_bitmp
@@ -1694,16 +1703,12 @@  static int domain_context_unmap(struct d
 
         iommu_domid = domain_iommu_domid(domain, iommu);
         if ( iommu_domid == -1 )
-        {
-            ret = -EINVAL;
-            goto out;
-        }
+            return -EINVAL;
 
         clear_bit(iommu_domid, iommu->domid_bitmap);
         iommu->domid_map[iommu_domid] = 0;
     }
 
-out:
     return ret;
 }
 
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -53,8 +53,10 @@  static inline bool_t dfn_eq(dfn_t x, dfn
 }
 
 extern bool_t iommu_enable, iommu_enabled;
-extern bool force_iommu, iommu_quarantine, iommu_verbose, iommu_igfx;
+extern bool force_iommu, iommu_verbose, iommu_igfx;
 extern bool_t iommu_snoop, iommu_qinval, iommu_intremap, iommu_intpost;
+/* Boolean except for the specific purposes of drivers/passthrough/iommu.c. */
+extern uint8_t iommu_quarantine;
 
 #if defined(CONFIG_IOMMU_FORCE_PT_SHARE)
 #define iommu_hap_pt_share true