
[v3,1/3] IOMMU: add VTD_CAP_CM to vIOMMU capability exposed to guest

Message ID 1463847590-22782-2-git-send-email-bd.aviv@gmail.com (mailing list archive)
State New, archived

Commit Message

Aviv B.D. May 21, 2016, 4:19 p.m. UTC
From: "Aviv Ben-David" <bd.aviv@gmail.com>

This flag tells the guest to also invalidate the TLB cache after unmap operations.

Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
---
 hw/i386/intel_iommu.c          | 3 ++-
 hw/i386/intel_iommu_internal.h | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

Comments

Jan Kiszka May 21, 2016, 4:42 p.m. UTC | #1
On 2016-05-21 18:19, Aviv B.D wrote:
> From: "Aviv Ben-David" <bd.aviv@gmail.com>
> 
> This flag tells the guest to invalidate tlb cache also after unmap operations.
> 
> Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
> ---
>  hw/i386/intel_iommu.c          | 3 ++-
>  hw/i386/intel_iommu_internal.h | 1 +
>  2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 347718f..1af8da8 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -1949,7 +1949,8 @@ static void vtd_init(IntelIOMMUState *s)
>      s->iq_last_desc_type = VTD_INV_DESC_NONE;
>      s->next_frcd_reg = 0;
>      s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND | VTD_CAP_MGAW |
> -             VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS;
> +             VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
> +             VTD_CAP_CM;

Again, this needs to be optional because not all guests will support it, or
they behave differently when it's set (I have one that refuses to work).

Jan

>      s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
>  
>      vtd_reset_context_cache(s);
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index e5f514c..ae40f73 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -190,6 +190,7 @@
>  #define VTD_CAP_MAMV                (VTD_MAMV << 48)
>  #define VTD_CAP_PSI                 (1ULL << 39)
>  #define VTD_CAP_SLLPS               ((1ULL << 34) | (1ULL << 35))
> +#define VTD_CAP_CM                  (1ULL << 7)
>  
>  /* Supported Adjusted Guest Address Widths */
>  #define VTD_CAP_SAGAW_SHIFT         8
>
Jason Wang May 24, 2016, 8:14 a.m. UTC | #2
On 2016-05-22 00:19, Aviv B.D wrote:
> From: "Aviv Ben-David" <bd.aviv@gmail.com>
>
> This flag tells the guest to invalidate tlb cache also after unmap operations.
>
> Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
> ---

Is this guest-visible behavior?  If yes, shouldn't we cache
translation fault conditions in the IOTLB?

>   hw/i386/intel_iommu.c          | 3 ++-
>   hw/i386/intel_iommu_internal.h | 1 +
>   2 files changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 347718f..1af8da8 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -1949,7 +1949,8 @@ static void vtd_init(IntelIOMMUState *s)
>       s->iq_last_desc_type = VTD_INV_DESC_NONE;
>       s->next_frcd_reg = 0;
>       s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND | VTD_CAP_MGAW |
> -             VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS;
> +             VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
> +             VTD_CAP_CM;
>       s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
>   
>       vtd_reset_context_cache(s);
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index e5f514c..ae40f73 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -190,6 +190,7 @@
>   #define VTD_CAP_MAMV                (VTD_MAMV << 48)
>   #define VTD_CAP_PSI                 (1ULL << 39)
>   #define VTD_CAP_SLLPS               ((1ULL << 34) | (1ULL << 35))
> +#define VTD_CAP_CM                  (1ULL << 7)
>   
>   /* Supported Adjusted Guest Address Widths */
>   #define VTD_CAP_SAGAW_SHIFT         8
Jan Kiszka May 24, 2016, 9:25 a.m. UTC | #3
On 2016-05-24 10:14, Jason Wang wrote:
> On 2016年05月22日 00:19, Aviv B.D wrote:
>> From: "Aviv Ben-David" <bd.aviv@gmail.com>
>>
>> This flag tells the guest to invalidate tlb cache also after unmap
>> operations.
>>
>> Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
>> ---
> 
> Is this a guest visible behavior?  If yes, shouldn't we cache
> translation fault conditions in IOTLB?

It is guest visible, and this first of all means, besides it needing to
be optional, that it definitely has to be off for compat machine types.
Or it stays off by default.

Jan
Aviv B.D. May 28, 2016, 4:12 p.m. UTC | #4
What is the best way to add this configuration option?

Aviv.

On Tue, May 24, 2016 at 12:25 PM Jan Kiszka <jan.kiszka@siemens.com> wrote:

> On 2016-05-24 10:14, Jason Wang wrote:
> > On 2016年05月22日 00:19, Aviv B.D wrote:
> >> From: "Aviv Ben-David" <bd.aviv@gmail.com>
> >>
> >> This flag tells the guest to invalidate tlb cache also after unmap
> >> operations.
> >>
> >> Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
> >> ---
> >
> > Is this a guest visible behavior?  If yes, shouldn't we cache
> > translation fault conditions in IOTLB?
>
> It is guest visible, and this first of all means, besides requiring to
> be optional, that it definitely needs to be off for compat systems. Or
> it stays off by default.
>
> Jan
>
> --
> Siemens AG, Corporate Technology, CT RDA ITP SES-DE
> Corporate Competence Center Embedded Linux
>
Jan Kiszka May 28, 2016, 4:34 p.m. UTC | #5
Build on top of Marcel's patches that create IOMMUs via -device, and add a device property. Sorry, no reference at hand.
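
For illustration, a minimal sketch of what that could look like, assuming a
boolean qdev property wired into vtd_init() (the property name "caching-mode"
and the caching_mode field are assumptions, not part of this series):

/* Hypothetical sketch only: property/field names are illustrative. */
static Property vtd_properties[] = {
    DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode, false),
    DEFINE_PROP_END_OF_LIST(),
};

static void vtd_init(IntelIOMMUState *s)
{
    /* ... existing initialization ... */
    s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND | VTD_CAP_MGAW |
             VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS;
    if (s->caching_mode) {
        s->cap |= VTD_CAP_CM;
    }
    /* ... */
}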

Jan

Sent with TouchDown from my Android phone (www.symantec.com)

-----Original Message-----
From: Aviv B.D. [bd.aviv@gmail.com]
Received: Saturday, 28 May 2016, 18:12
To: Kiszka, Jan (CT RDA ITP SES-DE) [jan.kiszka@siemens.com]; Jason Wang [jasowang@redhat.com]; qemu-devel@nongnu.org [qemu-devel@nongnu.org]
CC: Alex Williamson [alex.williamson@redhat.com]; Peter Xu [peterx@redhat.com]; Michael S. Tsirkin [mst@redhat.com]
Subject: Re: [Qemu-devel] [PATCH v3 1/3] IOMMU: add VTD_CAP_CM to vIOMMU capability exposed to guest

What is the best way to add this configuration option?

Aviv.

On Tue, May 24, 2016 at 12:25 PM Jan Kiszka <jan.kiszka@siemens.com<mailto:jan.kiszka@siemens.com>> wrote:
On 2016-05-24 10:14, Jason Wang wrote:
> On 2016年05月22日 00:19, Aviv B.D wrote:
>> From: "Aviv Ben-David" <bd.aviv@gmail.com<mailto:bd.aviv@gmail.com>>
>>
>> This flag tells the guest to invalidate tlb cache also after unmap
>> operations.
>>
>> Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com<mailto:bd.aviv@gmail.com>>
>> ---
>
> Is this a guest visible behavior?  If yes, shouldn't we cache
> translation fault conditions in IOTLB?

It is guest visible, and this first of all means, besides requiring to
be optional, that it definitely needs to be off for compat systems. Or
it stays off by default.

Jan

--
Siemens AG, Corporate Technology, CT RDA ITP SES-DE
Corporate Competence Center Embedded Linux
Peter Xu June 2, 2016, 8:44 a.m. UTC | #6
On Sat, May 21, 2016 at 06:42:03PM +0200, Jan Kiszka wrote:
> On 2016-05-21 18:19, Aviv B.D wrote:
> > From: "Aviv Ben-David" <bd.aviv@gmail.com>
> > 
> > This flag tells the guest to invalidate tlb cache also after unmap operations.
> > 
> > Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
> > ---
> >  hw/i386/intel_iommu.c          | 3 ++-
> >  hw/i386/intel_iommu_internal.h | 1 +
> >  2 files changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > index 347718f..1af8da8 100644
> > --- a/hw/i386/intel_iommu.c
> > +++ b/hw/i386/intel_iommu.c
> > @@ -1949,7 +1949,8 @@ static void vtd_init(IntelIOMMUState *s)
> >      s->iq_last_desc_type = VTD_INV_DESC_NONE;
> >      s->next_frcd_reg = 0;
> >      s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND | VTD_CAP_MGAW |
> > -             VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS;
> > +             VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
> > +             VTD_CAP_CM;
> 
> Again, needs to be optional because not all guests will support it or
> behave differently when it's set (I've one that refuses to work).

There should be more than one way to make it optional. Which is
better? What I can think of:

(Assume we have Marcel's "-device intel_iommu" working already)

1. Make the CM bit optional, i.e. we need to specify something like
   "-device intel_iommu,cmbit=on" or the CM bit gets disabled. If we
   have CM disabled but a VFIO device present, let QEMU raise an error
   (see the usage sketch below).

2. We automatically detect whether we need the CM bit. E.g., if we have
   VFIO and vIOMMU both enabled, we automatically set the bit. Another
   case: maybe we will in the future support a nested vIOMMU? If so,
   we can do the same thing for the nested feature.
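
As a usage sketch for option 1 (all names hypothetical, following the
assumed "-device intel_iommu" syntax above):

  # CM exposed; a VFIO device may be assigned:
  qemu-system-x86_64 -M q35 -device intel_iommu,cmbit=on \
      -device vfio-pci,host=01:00.0 ...

  # CM disabled; QEMU would reject the VFIO device with an error:
  qemu-system-x86_64 -M q35 -device intel_iommu,cmbit=off \
      -device vfio-pci,host=01:00.0 ...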

-- peterx
Alex Williamson June 2, 2016, 1 p.m. UTC | #7
On Thu, 2 Jun 2016 16:44:39 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Sat, May 21, 2016 at 06:42:03PM +0200, Jan Kiszka wrote:
> > On 2016-05-21 18:19, Aviv B.D wrote:  
> > > From: "Aviv Ben-David" <bd.aviv@gmail.com>
> > > 
> > > This flag tells the guest to invalidate tlb cache also after unmap operations.
> > > 
> > > Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
> > > ---
> > >  hw/i386/intel_iommu.c          | 3 ++-
> > >  hw/i386/intel_iommu_internal.h | 1 +
> > >  2 files changed, 3 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > index 347718f..1af8da8 100644
> > > --- a/hw/i386/intel_iommu.c
> > > +++ b/hw/i386/intel_iommu.c
> > > @@ -1949,7 +1949,8 @@ static void vtd_init(IntelIOMMUState *s)
> > >      s->iq_last_desc_type = VTD_INV_DESC_NONE;
> > >      s->next_frcd_reg = 0;
> > >      s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND | VTD_CAP_MGAW |
> > > -             VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS;
> > > +             VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
> > > +             VTD_CAP_CM;  
> > 
> > Again, needs to be optional because not all guests will support it or
> > behave differently when it's set (I've one that refuses to work).  
> 
> There should be more than one way to make it optional. Which is
> better? What I can think of:
> 
> (Assume we have Marcel's "-device intel_iommu" working already)
> 
> 1. Let the CM bit optional, or say, we need to specify something like
>    "-device intel_iommu,cmbit=on" or we will disable CM bit. If we
>    have CM disabled but with VFIO device, let QEMU raise error.
> 
> 2. We automatically detect whether we need CM bit. E.g., if we have
>    VFIO and vIOMMU both enabled, we automatically set the bit. Another
>    case is maybe we would in the future support nested vIOMMU? If so,
>    we can do the same thing for the nested feature.


Why do we need to support VT-d for guests that do not support CM=1?
The VT-d spec indicates that software should be written to handle both
caching modes (6.1).  Granted this is a *should* and not a *must*,
but can't we consider guests that do not support CM=1 incompatible with
emulated VT-d?  If CM=0 needs to be supported then we need to shadow
all of the remapping structures, since vfio effectively becomes a cache
of the guest's translations, and keeping it coherent would otherwise depend
on the invalidation of both present and non-present entries.  What guests
do not support CM=1?  Thanks,

Alex
Jan Kiszka June 2, 2016, 1:14 p.m. UTC | #8
On 2016-06-02 15:00, Alex Williamson wrote:
> On Thu, 2 Jun 2016 16:44:39 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
>> On Sat, May 21, 2016 at 06:42:03PM +0200, Jan Kiszka wrote:
>>> On 2016-05-21 18:19, Aviv B.D wrote:  
>>>> From: "Aviv Ben-David" <bd.aviv@gmail.com>
>>>>
>>>> This flag tells the guest to invalidate tlb cache also after unmap operations.
>>>>
>>>> Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
>>>> ---
>>>>  hw/i386/intel_iommu.c          | 3 ++-
>>>>  hw/i386/intel_iommu_internal.h | 1 +
>>>>  2 files changed, 3 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>> index 347718f..1af8da8 100644
>>>> --- a/hw/i386/intel_iommu.c
>>>> +++ b/hw/i386/intel_iommu.c
>>>> @@ -1949,7 +1949,8 @@ static void vtd_init(IntelIOMMUState *s)
>>>>      s->iq_last_desc_type = VTD_INV_DESC_NONE;
>>>>      s->next_frcd_reg = 0;
>>>>      s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND | VTD_CAP_MGAW |
>>>> -             VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS;
>>>> +             VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
>>>> +             VTD_CAP_CM;  
>>>
>>> Again, needs to be optional because not all guests will support it or
>>> behave differently when it's set (I've one that refuses to work).  
>>
>> There should be more than one way to make it optional. Which is
>> better? What I can think of:
>>
>> (Assume we have Marcel's "-device intel_iommu" working already)
>>
>> 1. Let the CM bit optional, or say, we need to specify something like
>>    "-device intel_iommu,cmbit=on" or we will disable CM bit. If we
>>    have CM disabled but with VFIO device, let QEMU raise error.
>>
>> 2. We automatically detect whether we need CM bit. E.g., if we have
>>    VFIO and vIOMMU both enabled, we automatically set the bit. Another
>>    case is maybe we would in the future support nested vIOMMU? If so,
>>    we can do the same thing for the nested feature.
> 
> 
> Why do we need to support VT-d for guests that do not support CM=1?
> The VT-d spec indicates that software should be written to handle both
> caching modes (6.1).  Granted this is a *should* and not a *must*,
> but can't we consider guests that do not support CM=1 incompatible with
> emulated VT-d?  If CM=0 needs to be supported then we need to shadow
> all of the remapping structures since vfio effectively becomes a cache
> of the that would otherwise depend on the invalidation of both present
> and non-present entries.  What guests do not support CM=1?  Thanks,

- there is at least one guest that does not support CM=1 yet (Jailhouse)
- there might be more, or there might be broken ones, as hardware
  generally doesn't have CM=1, so this case is typically untested
- an AMD IOMMU (to my current understanding) will require shadowing
  anyway, as it has no comparable concept, so we will eventually be
  able to use that strategy also for VT-d

Jan
Jan Kiszka June 2, 2016, 1:17 p.m. UTC | #9
On 2016-06-02 15:14, Jan Kiszka wrote:
> On 2016-06-02 15:00, Alex Williamson wrote:
>> On Thu, 2 Jun 2016 16:44:39 +0800
>> Peter Xu <peterx@redhat.com> wrote:
>>
>>> On Sat, May 21, 2016 at 06:42:03PM +0200, Jan Kiszka wrote:
>>>> On 2016-05-21 18:19, Aviv B.D wrote:  
>>>>> From: "Aviv Ben-David" <bd.aviv@gmail.com>
>>>>>
>>>>> This flag tells the guest to invalidate tlb cache also after unmap operations.
>>>>>
>>>>> Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
>>>>> ---
>>>>>  hw/i386/intel_iommu.c          | 3 ++-
>>>>>  hw/i386/intel_iommu_internal.h | 1 +
>>>>>  2 files changed, 3 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>>> index 347718f..1af8da8 100644
>>>>> --- a/hw/i386/intel_iommu.c
>>>>> +++ b/hw/i386/intel_iommu.c
>>>>> @@ -1949,7 +1949,8 @@ static void vtd_init(IntelIOMMUState *s)
>>>>>      s->iq_last_desc_type = VTD_INV_DESC_NONE;
>>>>>      s->next_frcd_reg = 0;
>>>>>      s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND | VTD_CAP_MGAW |
>>>>> -             VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS;
>>>>> +             VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
>>>>> +             VTD_CAP_CM;  
>>>>
>>>> Again, needs to be optional because not all guests will support it or
>>>> behave differently when it's set (I've one that refuses to work).  
>>>
>>> There should be more than one way to make it optional. Which is
>>> better? What I can think of:
>>>
>>> (Assume we have Marcel's "-device intel_iommu" working already)
>>>
>>> 1. Let the CM bit optional, or say, we need to specify something like
>>>    "-device intel_iommu,cmbit=on" or we will disable CM bit. If we
>>>    have CM disabled but with VFIO device, let QEMU raise error.
>>>
>>> 2. We automatically detect whether we need CM bit. E.g., if we have
>>>    VFIO and vIOMMU both enabled, we automatically set the bit. Another
>>>    case is maybe we would in the future support nested vIOMMU? If so,
>>>    we can do the same thing for the nested feature.
>>
>>
>> Why do we need to support VT-d for guests that do not support CM=1?
>> The VT-d spec indicates that software should be written to handle both
>> caching modes (6.1).  Granted this is a *should* and not a *must*,
>> but can't we consider guests that do not support CM=1 incompatible with
>> emulated VT-d?  If CM=0 needs to be supported then we need to shadow
>> all of the remapping structures since vfio effectively becomes a cache
>> of the that would otherwise depend on the invalidation of both present
>> and non-present entries.  What guests do not support CM=1?  Thanks,
> 
> - there is at least one guest that does not support CM=1 yet (Jailhouse)
> - there might be more or there might be broken ones as hardware
>   generally doesn't have CM=1, thus this case is typically untested
> - an AMD IOMMU (to my current understanding) will require shadowing
>   anyway has it has no comparable concept, thus we will eventually be
>   able to use that strategy also for VT-d

- more accurate hardware emulation, i.e. fewer differences between
  modelled and real behaviour
  (one reason IR will be optional with VT-d is that the Q35 chipset
  didn't include it)

Jan
Michael S. Tsirkin June 2, 2016, 4:15 p.m. UTC | #10
On Thu, Jun 02, 2016 at 03:14:36PM +0200, Jan Kiszka wrote:
> On 2016-06-02 15:00, Alex Williamson wrote:
> > On Thu, 2 Jun 2016 16:44:39 +0800
> > Peter Xu <peterx@redhat.com> wrote:
> > 
> >> On Sat, May 21, 2016 at 06:42:03PM +0200, Jan Kiszka wrote:
> >>> On 2016-05-21 18:19, Aviv B.D wrote:  
> >>>> From: "Aviv Ben-David" <bd.aviv@gmail.com>
> >>>>
> >>>> This flag tells the guest to invalidate tlb cache also after unmap operations.
> >>>>
> >>>> Signed-off-by: Aviv Ben-David <bd.aviv@gmail.com>
> >>>> ---
> >>>>  hw/i386/intel_iommu.c          | 3 ++-
> >>>>  hw/i386/intel_iommu_internal.h | 1 +
> >>>>  2 files changed, 3 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> >>>> index 347718f..1af8da8 100644
> >>>> --- a/hw/i386/intel_iommu.c
> >>>> +++ b/hw/i386/intel_iommu.c
> >>>> @@ -1949,7 +1949,8 @@ static void vtd_init(IntelIOMMUState *s)
> >>>>      s->iq_last_desc_type = VTD_INV_DESC_NONE;
> >>>>      s->next_frcd_reg = 0;
> >>>>      s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND | VTD_CAP_MGAW |
> >>>> -             VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS;
> >>>> +             VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
> >>>> +             VTD_CAP_CM;  
> >>>
> >>> Again, needs to be optional because not all guests will support it or
> >>> behave differently when it's set (I've one that refuses to work).  
> >>
> >> There should be more than one way to make it optional. Which is
> >> better? What I can think of:
> >>
> >> (Assume we have Marcel's "-device intel_iommu" working already)
> >>
> >> 1. Let the CM bit optional, or say, we need to specify something like
> >>    "-device intel_iommu,cmbit=on" or we will disable CM bit. If we
> >>    have CM disabled but with VFIO device, let QEMU raise error.
> >>
> >> 2. We automatically detect whether we need CM bit. E.g., if we have
> >>    VFIO and vIOMMU both enabled, we automatically set the bit. Another
> >>    case is maybe we would in the future support nested vIOMMU? If so,
> >>    we can do the same thing for the nested feature.
> > 
> > 
> > Why do we need to support VT-d for guests that do not support CM=1?

I don't think we need to do this. The spec is rather clear on this point.
Just fix the guests.
Support for CM=0 might still be useful, because CM=1 is not required if
the device is restartable (most software devices would be).

I prefer cmbit=on by default; with cmbit=off, don't allow VFIO.


> > The VT-d spec indicates that software should be written to handle both
> > caching modes (6.1).  Granted this is a *should* and not a *must*,
> > but can't we consider guests that do not support CM=1 incompatible with
> > emulated VT-d?  If CM=0 needs to be supported then we need to shadow
> > all of the remapping structures since vfio effectively becomes a cache
> > of the that would otherwise depend on the invalidation of both present
> > and non-present entries.  What guests do not support CM=1?  Thanks,
> 
> - there is at least one guest that does not support CM=1 yet (Jailhouse)
> - there might be more or there might be broken ones as hardware
>   generally doesn't have CM=1, thus this case is typically untested
> - an AMD IOMMU (to my current understanding) will require shadowing
>   anyway has it has no comparable concept,


I was rather sure this is it:
	26 NpCache: not present table entries cached. RO. Reset Xb. 1=Indicates
	that the IOMMU caches page table entries that are marked as not present.
	When this bit is set, software must issue an invalidate after any change
	to a PDE or PTE. 0=Indicates that the IOMMU caches only page table
	entries that are marked as present. When NpCache is clear, software must
	issue an invalidate after any change to a PDE or PTE marked present
	before the change.  Implementation note: For hardware implementations of
	the IOMMU, this bit must be 0b.

This implementation seems to set this bit.


> thus we will eventually be
>   able to use that strategy also for VT-d
> 
> Jan
> 
>
Peter Xu June 6, 2016, 5:04 a.m. UTC | #11
On Thu, Jun 02, 2016 at 03:14:36PM +0200, Jan Kiszka wrote:
> On 2016-06-02 15:00, Alex Williamson wrote:
> > On Thu, 2 Jun 2016 16:44:39 +0800
> > Peter Xu <peterx@redhat.com> wrote:
[...]
> >> There should be more than one way to make it optional. Which is
> >> better? What I can think of:
> >>
> >> (Assume we have Marcel's "-device intel_iommu" working already)
> >>
> >> 1. Let the CM bit optional, or say, we need to specify something like
> >>    "-device intel_iommu,cmbit=on" or we will disable CM bit. If we
> >>    have CM disabled but with VFIO device, let QEMU raise error.
> >>
> >> 2. We automatically detect whether we need CM bit. E.g., if we have
> >>    VFIO and vIOMMU both enabled, we automatically set the bit. Another
> >>    case is maybe we would in the future support nested vIOMMU? If so,
> >>    we can do the same thing for the nested feature.
> > 
> > 
> > Why do we need to support VT-d for guests that do not support CM=1?
> > The VT-d spec indicates that software should be written to handle both
> > caching modes (6.1).  Granted this is a *should* and not a *must*,
> > but can't we consider guests that do not support CM=1 incompatible with
> > emulated VT-d?  If CM=0 needs to be supported then we need to shadow
> > all of the remapping structures since vfio effectively becomes a cache
> > of the that would otherwise depend on the invalidation of both present
> > and non-present entries.  What guests do not support CM=1?  Thanks,
> 
> - there is at least one guest that does not support CM=1 yet (Jailhouse)

Besides the fact that there might be guests that do not support
CM=1, are there performance considerations? When the user's
configuration does not require the CM capability (e.g., a generic VM
configuration, without VFIO), shall we allow the user to disable the CM
bit so that we can get better IOMMU performance (avoiding extra and
useless invalidations)?

-- peterx
Alex Williamson June 6, 2016, 1:11 p.m. UTC | #12
On Mon, 6 Jun 2016 13:04:07 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Thu, Jun 02, 2016 at 03:14:36PM +0200, Jan Kiszka wrote:
> > On 2016-06-02 15:00, Alex Williamson wrote:  
> > > On Thu, 2 Jun 2016 16:44:39 +0800
> > > Peter Xu <peterx@redhat.com> wrote:  
> [...]
> > >> There should be more than one way to make it optional. Which is
> > >> better? What I can think of:
> > >>
> > >> (Assume we have Marcel's "-device intel_iommu" working already)
> > >>
> > >> 1. Let the CM bit optional, or say, we need to specify something like
> > >>    "-device intel_iommu,cmbit=on" or we will disable CM bit. If we
> > >>    have CM disabled but with VFIO device, let QEMU raise error.
> > >>
> > >> 2. We automatically detect whether we need CM bit. E.g., if we have
> > >>    VFIO and vIOMMU both enabled, we automatically set the bit. Another
> > >>    case is maybe we would in the future support nested vIOMMU? If so,
> > >>    we can do the same thing for the nested feature.  
> > > 
> > > 
> > > Why do we need to support VT-d for guests that do not support CM=1?
> > > The VT-d spec indicates that software should be written to handle both
> > > caching modes (6.1).  Granted this is a *should* and not a *must*,
> > > but can't we consider guests that do not support CM=1 incompatible with
> > > emulated VT-d?  If CM=0 needs to be supported then we need to shadow
> > > all of the remapping structures since vfio effectively becomes a cache
> > > of the that would otherwise depend on the invalidation of both present
> > > and non-present entries.  What guests do not support CM=1?  Thanks,  
> > 
> > - there is at least one guest that does not support CM=1 yet (Jailhouse)  
> 
> Besides the reason that there might have guests that do not support
> CM=1, will there be performance considerations? When user's
> configuration does not require CM capability (e.g., generic VM
> configuration, without VFIO), shall we allow user to disable the CM
> bit so that we can have better IOMMU performance (avoid extra and
> useless invalidations)?

With Alexey's proposed patch to have callback ops when the iommu
notifier list adds its first entry and removes its last, any of the
additional overhead to generate notifies when nobody is listening can
be avoided.  These same callbacks would be the ones that need to
generate a hw_error if a notifier is added while running in CM=0.
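
A rough sketch of how such a callback could be used by the VT-d model (the
hook name, its signature, and the caching_mode field are assumptions based
on the description above, not the actual series):

/* Hypothetical: invoked when the first iommu notifier is registered. */
static void vtd_iommu_notify_started(MemoryRegion *iommu)
{
    VTDAddressSpace *vtd_as = container_of(iommu, VTDAddressSpace, iommu);
    IntelIOMMUState *s = vtd_as->iommu_state;

    if (!s->caching_mode) {
        /* vfio must observe every mapping change; without CM=1 the
         * guest is not obliged to report non-present -> present updates. */
        hw_error("device assignment with intel_iommu requires caching-mode=on");
    }
}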
Thanks,

Alex
Peter Xu June 6, 2016, 1:43 p.m. UTC | #13
On Mon, Jun 06, 2016 at 07:11:41AM -0600, Alex Williamson wrote:
> On Mon, 6 Jun 2016 13:04:07 +0800
> Peter Xu <peterx@redhat.com> wrote:
[...]
> > Besides the reason that there might have guests that do not support
> > CM=1, will there be performance considerations? When user's
> > configuration does not require CM capability (e.g., generic VM
> > configuration, without VFIO), shall we allow user to disable the CM
> > bit so that we can have better IOMMU performance (avoid extra and
> > useless invalidations)?
> 
> With Alexey's proposed patch to have callback ops when the iommu
> notifier list adds its first entry and removes its last, any of the
> additional overhead to generate notifies when nobody is listening can
> be avoided.  These same callbacks would be the ones that need to
> generate a hw_error if a notifier is added while running in CM=0.

Not familiar with Alexey's patch, but is that for VFIO only? I mean, if
we configure CM bit = 1, the guest kernel will send an invalidation request
every time it creates new entries (context entries, or IOTLB
entries). Even without VFIO notifiers, the guest needs to trap into QEMU
so that QEMU can process the invalidation requests. This is avoidable if we
are not using VFIO devices at all (so there is no need to maintain any
mappings), right?

If we allow the user to specify cmbit={0|1}, the user can decide whether
to take this benefit.

Thanks,

-- peterx
Alex Williamson June 6, 2016, 5:02 p.m. UTC | #14
On Mon, 6 Jun 2016 21:43:17 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Mon, Jun 06, 2016 at 07:11:41AM -0600, Alex Williamson wrote:
> > On Mon, 6 Jun 2016 13:04:07 +0800
> > Peter Xu <peterx@redhat.com> wrote:  
> [...]
> > > Besides the reason that there might have guests that do not support
> > > CM=1, will there be performance considerations? When user's
> > > configuration does not require CM capability (e.g., generic VM
> > > configuration, without VFIO), shall we allow user to disable the CM
> > > bit so that we can have better IOMMU performance (avoid extra and
> > > useless invalidations)?  
> > 
> > With Alexey's proposed patch to have callback ops when the iommu
> > notifier list adds its first entry and removes its last, any of the
> > additional overhead to generate notifies when nobody is listening can
> > be avoided.  These same callbacks would be the ones that need to
> > generate a hw_error if a notifier is added while running in CM=0.  
> 
> Not familar with Alexey's patch

https://lists.nongnu.org/archive/html/qemu-devel/2016-06/msg00079.html

>, but is that for VFIO only?

vfio is currently the only user of the iommu notifier, but the
interface is generic, which is how it should (must) be.

> I mean, if
> we configured CMbit=1, guest kernel will send invalidation request
> every time it creates new entries (context entries, or iotlb
> entries). Even without VFIO notifiers, guest need to trap into QEMU
> and process the invalidation requests. This is avoidable if we are not
> using VFIO devices at all (so no need to maintain any mappings),
> right?

CM=1 only defines that not-present and invalid entries can be cached;
any change to existing entries requires an invalidation regardless of
CM.  What you're looking for sounds more like ECAP.C:

C: Page-walk Coherency
  This field indicates if hardware access to the root, context,
  extended-context and interrupt-remap tables, and second-level paging
  structures for requests-without PASID, are coherent (snooped) or not.
    • 0: Indicates hardware accesses to remapping structures are non-coherent.
    • 1: Indicates hardware accesses to remapping structures are coherent.

Without both CM=0 and C=0, our only virtualization mechanism for
maintaining a hardware cache coherent with the guest view of the iommu
would be to shadow all of the VT-d structures.  For purely emulated
devices, maybe we can get away with that, but I doubt the current
ghashes used for the iotlb are prepared for it.

> If we allow user to specify cmbit={0|1}, user can decide whether
> he/she would like to take this benefit.

So long as the *default* gives us the ability to support an external
hardware cache, like vfio, and we generate a hw_error or equivalent to
avoid unsafe combinations, you're free to enable whatever other
shortcuts you want.  Thanks,

Alex
Peter Xu June 7, 2016, 3:20 a.m. UTC | #15
On Mon, Jun 06, 2016 at 11:02:11AM -0600, Alex Williamson wrote:
> On Mon, 6 Jun 2016 21:43:17 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
> > On Mon, Jun 06, 2016 at 07:11:41AM -0600, Alex Williamson wrote:
> > > On Mon, 6 Jun 2016 13:04:07 +0800
> > > Peter Xu <peterx@redhat.com> wrote:  
> > [...]
> > > > Besides the reason that there might have guests that do not support
> > > > CM=1, will there be performance considerations? When user's
> > > > configuration does not require CM capability (e.g., generic VM
> > > > configuration, without VFIO), shall we allow user to disable the CM
> > > > bit so that we can have better IOMMU performance (avoid extra and
> > > > useless invalidations)?  
> > > 
> > > With Alexey's proposed patch to have callback ops when the iommu
> > > notifier list adds its first entry and removes its last, any of the
> > > additional overhead to generate notifies when nobody is listening can
> > > be avoided.  These same callbacks would be the ones that need to
> > > generate a hw_error if a notifier is added while running in CM=0.  
> > 
> > Not familar with Alexey's patch
> 
> https://lists.nongnu.org/archive/html/qemu-devel/2016-06/msg00079.html

Thanks for the pointer. :)

> 
> >, but is that for VFIO only?
> 
> vfio is currently the only user of the iommu notifier, but the
> interface is generic, which is how it should (must) be.

Yes.

> 
> > I mean, if
> > we configured CMbit=1, guest kernel will send invalidation request
> > every time it creates new entries (context entries, or iotlb
> > entries). Even without VFIO notifiers, guest need to trap into QEMU
> > and process the invalidation requests. This is avoidable if we are not
> > using VFIO devices at all (so no need to maintain any mappings),
> > right?
> 
> CM=1 only defines that not-present and invalid entries can be cached,
> any changes to existing entries requires an invalidation regardless of
> CM.  What you're looking for sounds more like ECAP.C:

Yes, but I guess what I was talking about is the CM bit, not ECAP.C.
When we clear/replace one context entry, the guest kernel will definitely
send one context-entry invalidation to QEMU:

static void domain_context_clear_one(struct intel_iommu *iommu, u8 bus, u8 devfn)
{
	if (!iommu)
		return;

	clear_context_table(iommu, bus, devfn);
	iommu->flush.flush_context(iommu, 0, 0, 0,
					   DMA_CCMD_GLOBAL_INVL);
	iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH);
}

... While if we are creating a new one (like attaching a new VFIO
device?), it's an optional behavior depending on whether CM bit is
set:

static int domain_context_mapping_one(struct dmar_domain *domain,
				      struct intel_iommu *iommu,
				      u8 bus, u8 devfn)
{
    ...
	/*
	 * It's a non-present to present mapping. If hardware doesn't cache
	 * non-present entry we only need to flush the write-buffer. If the
	 * _does_ cache non-present entries, then it does so in the special
	 * domain #0, which we have to flush:
	 */
	if (cap_caching_mode(iommu->cap)) {
		iommu->flush.flush_context(iommu, 0,
					   (((u16)bus) << 8) | devfn,
					   DMA_CCMD_MASK_NOBIT,
					   DMA_CCMD_DEVICE_INVL);
		iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
	} else {
		iommu_flush_write_buffer(iommu);
	}
    ...
}

Only if cap_caching_mode() is set (which checks bit 7, the CM bit) will we
send these invalidations. What I meant is that we should allow the
user to specify the CM bit, so that when we are not using VFIO
devices, we can skip the above flush_context() and flush_iotlb()
etc... So, besides the fact that some guests do not support the
CM bit (like Jailhouse), performance might be another consideration
point for allowing users to specify the CM bit themselves.
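
For reference, the Linux macro referenced above is defined along these lines
in include/linux/intel-iommu.h (CM is bit 7 of the capability register):

#define cap_caching_mode(c)	(((c) >> 7) & 1)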

> 
> C: Page-walk Coherency
>   This field indicates if hardware access to the root, context,
>   extended-context and interrupt-remap tables, and second-level paging
>   structures for requests-without PASID, are coherent (snooped) or not.
>     • 0: Indicates hardware accesses to remapping structures are non-coherent.
>     • 1: Indicates hardware accesses to remapping structures are coherent.
> 
> Without both CM=0 and C=0, our only virtualization mechanism for
> maintaining a hardware cache coherent with the guest view of the iommu
> would be to shadow all of the VT-d structures.  For purely emulated
> devices, maybe we can get away with that, but I doubt the current
> ghashes used for the iotlb are prepared for it.

Actually I hadn't noticed this bit yet. I see that it decides
whether the guest kernel needs to issue a specific clflush() when modifying
IOMMU PTEs, but shouldn't we always flush the memory cache so that we
can be sure the IOMMU sees the same memory data as the CPU does?
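
(For context, the Linux VT-d driver only issues clflush for
remapping-structure updates when ECAP.C is clear; the relevant helper in
drivers/iommu/intel-iommu.c looks roughly like this:)

static void __iommu_flush_cache(struct intel_iommu *iommu, void *addr, int size)
{
	/* CPU cache flush is only needed when the IOMMU's accesses to the
	 * remapping structures are not coherent (ECAP.C == 0). */
	if (!ecap_coherent(iommu->ecap))
		clflush_cache_range(addr, size);
}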

Thanks!

-- peterx
Alex Williamson June 7, 2016, 3:58 a.m. UTC | #16
On Tue, 7 Jun 2016 11:20:32 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Mon, Jun 06, 2016 at 11:02:11AM -0600, Alex Williamson wrote:
> > On Mon, 6 Jun 2016 21:43:17 +0800
> > Peter Xu <peterx@redhat.com> wrote:
> >   
> > > On Mon, Jun 06, 2016 at 07:11:41AM -0600, Alex Williamson wrote:  
> > > > On Mon, 6 Jun 2016 13:04:07 +0800
> > > > Peter Xu <peterx@redhat.com> wrote:    
> > > [...]  
> > > > > Besides the reason that there might have guests that do not support
> > > > > CM=1, will there be performance considerations? When user's
> > > > > configuration does not require CM capability (e.g., generic VM
> > > > > configuration, without VFIO), shall we allow user to disable the CM
> > > > > bit so that we can have better IOMMU performance (avoid extra and
> > > > > useless invalidations)?    
> > > > 
> > > > With Alexey's proposed patch to have callback ops when the iommu
> > > > notifier list adds its first entry and removes its last, any of the
> > > > additional overhead to generate notifies when nobody is listening can
> > > > be avoided.  These same callbacks would be the ones that need to
> > > > generate a hw_error if a notifier is added while running in CM=0.    
> > > 
> > > Not familar with Alexey's patch  
> > 
> > https://lists.nongnu.org/archive/html/qemu-devel/2016-06/msg00079.html  
> 
> Thanks for the pointer. :)
> 
> >   
> > >, but is that for VFIO only?  
> > 
> > vfio is currently the only user of the iommu notifier, but the
> > interface is generic, which is how it should (must) be.  
> 
> Yes.
> 
> >   
> > > I mean, if
> > > we configured CMbit=1, guest kernel will send invalidation request
> > > every time it creates new entries (context entries, or iotlb
> > > entries). Even without VFIO notifiers, guest need to trap into QEMU
> > > and process the invalidation requests. This is avoidable if we are not
> > > using VFIO devices at all (so no need to maintain any mappings),
> > > right?  
> > 
> > CM=1 only defines that not-present and invalid entries can be cached,
> > any changes to existing entries requires an invalidation regardless of
> > CM.  What you're looking for sounds more like ECAP.C:  
> 
> Yes, but I guess what I was talking about is CM bit but not ECAP.C.
> When we clear/replace one context entry, guest kernel will definitely
> send one context entry invalidation to QEMU:
> 
> static void domain_context_clear_one(struct intel_iommu *iommu, u8 bus, u8 devfn)
> {
> 	if (!iommu)
> 		return;
> 
> 	clear_context_table(iommu, bus, devfn);
> 	iommu->flush.flush_context(iommu, 0, 0, 0,
> 					   DMA_CCMD_GLOBAL_INVL);
> 	iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH);
> }
> 
> ... While if we are creating a new one (like attaching a new VFIO
> device?), it's an optional behavior depending on whether CM bit is
> set:
> 
> static int domain_context_mapping_one(struct dmar_domain *domain,
> 				      struct intel_iommu *iommu,
> 				      u8 bus, u8 devfn)
> {
>     ...
> 	/*
> 	 * It's a non-present to present mapping. If hardware doesn't cache
> 	 * non-present entry we only need to flush the write-buffer. If the
> 	 * _does_ cache non-present entries, then it does so in the special
> 	 * domain #0, which we have to flush:
> 	 */
> 	if (cap_caching_mode(iommu->cap)) {
> 		iommu->flush.flush_context(iommu, 0,
> 					   (((u16)bus) << 8) | devfn,
> 					   DMA_CCMD_MASK_NOBIT,
> 					   DMA_CCMD_DEVICE_INVL);
> 		iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
> 	} else {
> 		iommu_flush_write_buffer(iommu);
> 	}
>     ...
> }
> 
> Only if cap_caching_mode() is set (which is bit 7, the CM bit), we
> will send these invalidations. What I meant is that, we should allow
> user to specify the CM bit, so that when we are not using VFIO
> devices, we can skip the above flush_content() and flush_iotlb()
> etc... So, besides the truth that we have some guests do not support
> CM bit (like Jailhouse), performance might be another consideration
> point that we should allow user to specify the CM bit themselfs.

I'm dubious of this; IOMMU drivers are already aware that hardware
flushes are expensive and do batching to optimize them.  The queued
invalidation mechanism itself is meant to allow asynchronous
invalidations.  QEMU invalidating a virtual IOMMU might very well be
faster than hardware.

> > 
> > C: Page-walk Coherency
> >   This field indicates if hardware access to the root, context,
> >   extended-context and interrupt-remap tables, and second-level paging
> >   structures for requests-without PASID, are coherent (snooped) or not.
> >     • 0: Indicates hardware accesses to remapping structures are non-coherent.
> >     • 1: Indicates hardware accesses to remapping structures are coherent.
> > 
> > Without both CM=0 and C=0, our only virtualization mechanism for
> > maintaining a hardware cache coherent with the guest view of the iommu
> > would be to shadow all of the VT-d structures.  For purely emulated
> > devices, maybe we can get away with that, but I doubt the current
> > ghashes used for the iotlb are prepared for it.  
> 
> Actually I haven't noticed this bit yet. I see that this will decide
> whether guest kernel need to send specific clflush() when modifying
> IOMMU PTEs, but shouldn't we flush the memory cache always so that we
> can sure IOMMU can see the same memory data as CPU does?

I think it would be a question of how much the g_hash code really buys
us in the VT-d code; it might be faster to do a lookup each time if it
means fewer flushes.  Those hashes are useless overhead for assigned
devices, so maybe we can avoid them when we only have assigned
devices ;)  Thanks,

Alex
Peter Xu June 7, 2016, 5 a.m. UTC | #17
On Mon, Jun 06, 2016 at 09:58:09PM -0600, Alex Williamson wrote:
> On Tue, 7 Jun 2016 11:20:32 +0800
> Peter Xu <peterx@redhat.com> wrote:
[...]
> > Only if cap_caching_mode() is set (which is bit 7, the CM bit), we
> > will send these invalidations. What I meant is that, we should allow
> > user to specify the CM bit, so that when we are not using VFIO
> > devices, we can skip the above flush_content() and flush_iotlb()
> > etc... So, besides the truth that we have some guests do not support
> > CM bit (like Jailhouse), performance might be another consideration
> > point that we should allow user to specify the CM bit themselfs.
> 
> I'm dubious of this, IOMMU drivers are already aware that hardware
> flushes are expensive and do batching to optimize it.  The queued
> invalidation mechanism itself is meant to allow asynchronous
> invalidations.  QEMU invalidating a virtual IOMMU might very well be
> faster than hardware.

Agree. However, it seems that current Linux is still not taking advantage
of this... check qi_flush_context() and qi_flush_iotlb().
qi_submit_sync() is used for both, which sends one invalidation with an
explicit wait to make sure it is synchronous.
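
(Trimmed and paraphrased for illustration: the flush path in
drivers/iommu/dmar.c of that era looks roughly like this, with both the
context-cache and IOTLB flushes building one descriptor and submitting it
synchronously:)

void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
		    unsigned int size_order, u64 type)
{
	struct qi_desc desc;

	desc.low = QI_IOTLB_DID(did) | QI_IOTLB_GRAN(type) | QI_IOTLB_TYPE;
	desc.high = QI_IOTLB_ADDR(addr) | QI_IOTLB_AM(size_order);

	/* queues the descriptor and waits for a trailing wait descriptor
	 * to complete, so every flush is effectively synchronous */
	qi_submit_sync(&desc, iommu);
}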

> 
> > > 
> > > C: Page-walk Coherency
> > >   This field indicates if hardware access to the root, context,
> > >   extended-context and interrupt-remap tables, and second-level paging
> > >   structures for requests-without PASID, are coherent (snooped) or not.
> > >     • 0: Indicates hardware accesses to remapping structures are non-coherent.
> > >     • 1: Indicates hardware accesses to remapping structures are coherent.
> > > 
> > > Without both CM=0 and C=0, our only virtualization mechanism for
> > > maintaining a hardware cache coherent with the guest view of the iommu
> > > would be to shadow all of the VT-d structures.  For purely emulated
> > > devices, maybe we can get away with that, but I doubt the current
> > > ghashes used for the iotlb are prepared for it.  
> > 
> > Actually I haven't noticed this bit yet. I see that this will decide
> > whether guest kernel need to send specific clflush() when modifying
> > IOMMU PTEs, but shouldn't we flush the memory cache always so that we
> > can sure IOMMU can see the same memory data as CPU does?
> 
> I think it would be a question of how much the g_hash code really buys
> us in the VT-d code, it might be faster to do a lookup each time if it
> means fewer flushes.  Those hashes are useless overhead for assigned
> devices, so maybe we can avoid them when we only have assigned
> devices ;)  Thanks,

Errr, I just noticed that VFIO devices do not need the emulated
cache. There is indeed a lot of pending work TBD on the vIOMMU side...

Thanks!

-- peterx
Kai Huang June 7, 2016, 5:21 a.m. UTC | #18
On 6/7/2016 3:58 PM, Alex Williamson wrote:
> On Tue, 7 Jun 2016 11:20:32 +0800
> Peter Xu <peterx@redhat.com> wrote:
>
>> On Mon, Jun 06, 2016 at 11:02:11AM -0600, Alex Williamson wrote:
>>> On Mon, 6 Jun 2016 21:43:17 +0800
>>> Peter Xu <peterx@redhat.com> wrote:
>>>
>>>> On Mon, Jun 06, 2016 at 07:11:41AM -0600, Alex Williamson wrote:
>>>>> On Mon, 6 Jun 2016 13:04:07 +0800
>>>>> Peter Xu <peterx@redhat.com> wrote:
>>>> [...]
>>>>>> Besides the reason that there might have guests that do not support
>>>>>> CM=1, will there be performance considerations? When user's
>>>>>> configuration does not require CM capability (e.g., generic VM
>>>>>> configuration, without VFIO), shall we allow user to disable the CM
>>>>>> bit so that we can have better IOMMU performance (avoid extra and
>>>>>> useless invalidations)?
>>>>>
>>>>> With Alexey's proposed patch to have callback ops when the iommu
>>>>> notifier list adds its first entry and removes its last, any of the
>>>>> additional overhead to generate notifies when nobody is listening can
>>>>> be avoided.  These same callbacks would be the ones that need to
>>>>> generate a hw_error if a notifier is added while running in CM=0.
>>>>
>>>> Not familar with Alexey's patch
>>>
>>> https://lists.nongnu.org/archive/html/qemu-devel/2016-06/msg00079.html
>>
>> Thanks for the pointer. :)
>>
>>>
>>>> , but is that for VFIO only?
>>>
>>> vfio is currently the only user of the iommu notifier, but the
>>> interface is generic, which is how it should (must) be.
>>
>> Yes.
>>
>>>
>>>> I mean, if
>>>> we configured CMbit=1, guest kernel will send invalidation request
>>>> every time it creates new entries (context entries, or iotlb
>>>> entries). Even without VFIO notifiers, guest need to trap into QEMU
>>>> and process the invalidation requests. This is avoidable if we are not
>>>> using VFIO devices at all (so no need to maintain any mappings),
>>>> right?
>>>
>>> CM=1 only defines that not-present and invalid entries can be cached,
>>> any changes to existing entries requires an invalidation regardless of
>>> CM.  What you're looking for sounds more like ECAP.C:
>>
>> Yes, but I guess what I was talking about is CM bit but not ECAP.C.
>> When we clear/replace one context entry, guest kernel will definitely
>> send one context entry invalidation to QEMU:
>>
>> static void domain_context_clear_one(struct intel_iommu *iommu, u8 bus, u8 devfn)
>> {
>> 	if (!iommu)
>> 		return;
>>
>> 	clear_context_table(iommu, bus, devfn);
>> 	iommu->flush.flush_context(iommu, 0, 0, 0,
>> 					   DMA_CCMD_GLOBAL_INVL);
>> 	iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH);
>> }
>>
>> ... While if we are creating a new one (like attaching a new VFIO
>> device?), it's an optional behavior depending on whether CM bit is
>> set:
>>
>> static int domain_context_mapping_one(struct dmar_domain *domain,
>> 				      struct intel_iommu *iommu,
>> 				      u8 bus, u8 devfn)
>> {
>>     ...
>> 	/*
>> 	 * It's a non-present to present mapping. If hardware doesn't cache
>> 	 * non-present entry we only need to flush the write-buffer. If the
>> 	 * _does_ cache non-present entries, then it does so in the special
>> 	 * domain #0, which we have to flush:
>> 	 */
>> 	if (cap_caching_mode(iommu->cap)) {
>> 		iommu->flush.flush_context(iommu, 0,
>> 					   (((u16)bus) << 8) | devfn,
>> 					   DMA_CCMD_MASK_NOBIT,
>> 					   DMA_CCMD_DEVICE_INVL);
>> 		iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
>> 	} else {
>> 		iommu_flush_write_buffer(iommu);
>> 	}
>>     ...
>> }
>>
>> Only if cap_caching_mode() is set (which is bit 7, the CM bit), we
>> will send these invalidations. What I meant is that, we should allow
>> user to specify the CM bit, so that when we are not using VFIO
>> devices, we can skip the above flush_content() and flush_iotlb()
>> etc... So, besides the truth that we have some guests do not support
>> CM bit (like Jailhouse), performance might be another consideration
>> point that we should allow user to specify the CM bit themselfs.
>
> I'm dubious of this, IOMMU drivers are already aware that hardware
> flushes are expensive and do batching to optimize it.  The queued
> invalidation mechanism itself is meant to allow asynchronous
> invalidations.  QEMU invalidating a virtual IOMMU might very well be
> faster than hardware.

Doing batching doesn't mean we can eliminate the IOTLB flush for mappings
from non-present to present in the case of CM=1, while in the case of CM=0
those IOTLB flushes are not necessary, just as the code above shows. Therefore,
generally speaking, CM=0 should have better performance than CM=1, even
for QEMU's vIOMMU.

In my understanding, the purpose of exposing CM=1 is to force the guest to
do an IOTLB flush for each mapping change (including from non-present to
present) so that QEMU is able to emulate each mapping change from the guest
(correct me if I am wrong). If the previous statement stands, CM=1 is
really a workaround for making vfio-assigned devices and a vIOMMU work
together, and unfortunately this cannot work on other vendors' IOMMUs
without a CM bit, such as AMD's IOMMU.

So what are the requirements for making vfio-assigned devices and a vIOMMU
work together? I think it would be more helpful to implement a more
generic solution that monitors and emulates the guest vIOMMU's page tables,
rather than simply exposing CM=1 to the guest, as that only works on the Intel IOMMU.

And what do you mean by asynchronous invalidations? I think the IOVA of the
changed mappings cannot be used until the mappings are invalidated. It
doesn't matter whether the invalidation is done via QI or a register.

Thanks,
-Kai

>
>>>
>>> C: Page-walk Coherency
>>>   This field indicates if hardware access to the root, context,
>>>   extended-context and interrupt-remap tables, and second-level paging
>>>   structures for requests-without PASID, are coherent (snooped) or not.
>>>     • 0: Indicates hardware accesses to remapping structures are non-coherent.
>>>     • 1: Indicates hardware accesses to remapping structures are coherent.
>>>
>>> Without both CM=0 and C=0, our only virtualization mechanism for
>>> maintaining a hardware cache coherent with the guest view of the iommu
>>> would be to shadow all of the VT-d structures.  For purely emulated
>>> devices, maybe we can get away with that, but I doubt the current
>>> ghashes used for the iotlb are prepared for it.
>>
>> Actually I haven't noticed this bit yet. I see that this will decide
>> whether guest kernel need to send specific clflush() when modifying
>> IOMMU PTEs, but shouldn't we flush the memory cache always so that we
>> can sure IOMMU can see the same memory data as CPU does?
>
> I think it would be a question of how much the g_hash code really buys
> us in the VT-d code, it might be faster to do a lookup each time if it
> means fewer flushes.  Those hashes are useless overhead for assigned
> devices, so maybe we can avoid them when we only have assigned
> devices ;)  Thanks,
>
> Alex
>
>
Alex Williamson June 7, 2016, 6:46 p.m. UTC | #19
On Tue, 7 Jun 2016 17:21:06 +1200
"Huang, Kai" <kai.huang@linux.intel.com> wrote:

> On 6/7/2016 3:58 PM, Alex Williamson wrote:
> > On Tue, 7 Jun 2016 11:20:32 +0800
> > Peter Xu <peterx@redhat.com> wrote:
> >  
> >> On Mon, Jun 06, 2016 at 11:02:11AM -0600, Alex Williamson wrote:  
> >>> On Mon, 6 Jun 2016 21:43:17 +0800
> >>> Peter Xu <peterx@redhat.com> wrote:
> >>>  
> >>>> On Mon, Jun 06, 2016 at 07:11:41AM -0600, Alex Williamson wrote:  
> >>>>> On Mon, 6 Jun 2016 13:04:07 +0800
> >>>>> Peter Xu <peterx@redhat.com> wrote:  
> >>>> [...]  
> >>>>>> Besides the reason that there might have guests that do not support
> >>>>>> CM=1, will there be performance considerations? When user's
> >>>>>> configuration does not require CM capability (e.g., generic VM
> >>>>>> configuration, without VFIO), shall we allow user to disable the CM
> >>>>>> bit so that we can have better IOMMU performance (avoid extra and
> >>>>>> useless invalidations)?  
> >>>>>
> >>>>> With Alexey's proposed patch to have callback ops when the iommu
> >>>>> notifier list adds its first entry and removes its last, any of the
> >>>>> additional overhead to generate notifies when nobody is listening can
> >>>>> be avoided.  These same callbacks would be the ones that need to
> >>>>> generate a hw_error if a notifier is added while running in CM=0.  
> >>>>
> >>>> Not familar with Alexey's patch  
> >>>
> >>> https://lists.nongnu.org/archive/html/qemu-devel/2016-06/msg00079.html  
> >>
> >> Thanks for the pointer. :)
> >>  
> >>>  
> >>>> , but is that for VFIO only?  
> >>>
> >>> vfio is currently the only user of the iommu notifier, but the
> >>> interface is generic, which is how it should (must) be.  
> >>
> >> Yes.
> >>  
> >>>  
> >>>> I mean, if
> >>>> we configured CMbit=1, guest kernel will send invalidation request
> >>>> every time it creates new entries (context entries, or iotlb
> >>>> entries). Even without VFIO notifiers, guest need to trap into QEMU
> >>>> and process the invalidation requests. This is avoidable if we are not
> >>>> using VFIO devices at all (so no need to maintain any mappings),
> >>>> right?  
> >>>
> >>> CM=1 only defines that not-present and invalid entries can be cached,
> >>> any changes to existing entries requires an invalidation regardless of
> >>> CM.  What you're looking for sounds more like ECAP.C:  
> >>
> >> Yes, but I guess what I was talking about is CM bit but not ECAP.C.
> >> When we clear/replace one context entry, guest kernel will definitely
> >> send one context entry invalidation to QEMU:
> >>
> >> static void domain_context_clear_one(struct intel_iommu *iommu, u8 bus, u8 devfn)
> >> {
> >> 	if (!iommu)
> >> 		return;
> >>
> >> 	clear_context_table(iommu, bus, devfn);
> >> 	iommu->flush.flush_context(iommu, 0, 0, 0,
> >> 					   DMA_CCMD_GLOBAL_INVL);
> >> 	iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH);
> >> }
> >>
> >> ... While if we are creating a new one (like attaching a new VFIO
> >> device?), it's an optional behavior depending on whether CM bit is
> >> set:
> >>
> >> static int domain_context_mapping_one(struct dmar_domain *domain,
> >> 				      struct intel_iommu *iommu,
> >> 				      u8 bus, u8 devfn)
> >> {
> >>     ...
> >> 	/*
> >> 	 * It's a non-present to present mapping. If hardware doesn't cache
> >> 	 * non-present entry we only need to flush the write-buffer. If the
> >> 	 * _does_ cache non-present entries, then it does so in the special
> >> 	 * domain #0, which we have to flush:
> >> 	 */
> >> 	if (cap_caching_mode(iommu->cap)) {
> >> 		iommu->flush.flush_context(iommu, 0,
> >> 					   (((u16)bus) << 8) | devfn,
> >> 					   DMA_CCMD_MASK_NOBIT,
> >> 					   DMA_CCMD_DEVICE_INVL);
> >> 		iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
> >> 	} else {
> >> 		iommu_flush_write_buffer(iommu);
> >> 	}
> >>     ...
> >> }
> >>
> >> Only if cap_caching_mode() is set (which is bit 7, the CM bit), we
> >> will send these invalidations. What I meant is that, we should allow
> >> user to specify the CM bit, so that when we are not using VFIO
> >> devices, we can skip the above flush_content() and flush_iotlb()
> >> etc... So, besides the truth that we have some guests do not support
> >> CM bit (like Jailhouse), performance might be another consideration
> >> point that we should allow user to specify the CM bit themselfs.  
> >
> > I'm dubious of this, IOMMU drivers are already aware that hardware
> > flushes are expensive and do batching to optimize it.  The queued
> > invalidation mechanism itself is meant to allow asynchronous
> > invalidations.  QEMU invalidating a virtual IOMMU might very well be
> > faster than hardware.  
> 
> Do batching doesn't mean we can eliminate the IOTLB flush for mappings 
> from non-present to present, in case of CM=1, while in case CM=0 those 
> IOTLB flush are not necessary, just like the code above shows. Therefore 
> generally speaking CM=0 should have better performance than CM=1, even 
> for Qemu's vIOMMU.
> 
> In my understanding the purpose of exposing CM=1 is to force guest do 
> IOTLB flush for each mapping change (including from non-present to 
> present) so Qemu is able to emulate each mapping change from guest 
> (correct me if I was wrong). If previous statement stands, CM=1 is 
> really a workaround for making vfio assigned devices and vIOMMU work 
> together, and unfortunately this cannot work on other vendor's IOMMU 
> without CM bit, such as AMD's IOMMU.
> 
> So what's the requirements of making vfio assigned devices and vIOMMU 
> work together? I think it should be more helpful to implement a more 
> generic solution to monitor and emulate guest vIOMMU's page table, 
> rather than simply exposing CM=1 to guest, as it only works on intel IOMMU.
> 
> And what do you mean asynchronous invalidations? I think the iova of the 
> changed mappings cannot be used until the mappings are invalidated. It 
> doesn't matter whether the invalidation is done via QI or register.

Ok, so you're arguing that CM=0 is more efficient than CM=1 because it
eliminates some portion of the invalidations required of the guest,
while at the same time arguing for a more general solution of shadow
page tables which would trap into the vIOMMU at every update,
eliminating all the batching done by the guest IOMMU driver code
attempting to reduce and consolidate the number of flushes done.  All
of this with only speculation on what might be more efficient.  Can we
get vIOMMU working, especially with assigned devices, before we
speculate further?

How do we expect a vIOMMU to be used in a guest?  In the case of
emulated devices, what value does it provide?  Are emulated devices
isolated from one another by the vIOMMU?  No.  Do we have 32-bit
emulated devices for which DMA translation at the vIOMMU is
significantly more efficient than bounce buffers within the guest?
Probably not, and if we did we could just emulate 64bit devices.  So I
assume that beyond being a development tool, our primary useful feature
of a vIOMMU is to expose devices to guest userspace (and thus nested
guests) via tools like vfio.  Am I wrong here?  In this use case, it's
most efficient to run with iommu=pt in the L1 guest, which would make
any sort of invalidations a very rare event.  Making use of vfio inside
the L1 guest would then move a device from the static-identity domain
in the L1 guest into a new domain where the mappings for that domain
are also relatively static.  So I really don't see why we're trying to
optimize the invalidation of the vIOMMU at this point.  I also have to
believe that the hardware folks that designed VT-d believed there to be
a performance advantage to using emulated VT-d with CM=1 versus
shadowing all of the VT-d data structures in the hypervisor or they
wouldn't have added this bit to the specification.  Thanks,

Alex
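
As background for the mechanism being debated above: under CM=1, the
invalidations the guest is forced to issue are exactly the points at which
QEMU can shadow mapping changes into vfio. Below is a rough sketch of that
idea against QEMU's IOMMU notifier interface of this period; it is not code
from this series, and vtd_resolve_iova() plus the surrounding wiring are
assumptions for illustration only.

/* Sketch only (assumes QEMU's intel_iommu internal headers): on a guest
 * IOTLB invalidation covering [iova, iova + size), re-walk the guest page
 * table and emit an UNMAP or MAP notification so a listener such as vfio
 * can update the host IOMMU.  vtd_resolve_iova() is a hypothetical helper
 * standing in for the real page-table walk. */
static void vtd_replay_range(IntelIOMMUState *s, MemoryRegion *iommu_mr,
                             uint16_t domain_id, hwaddr iova, hwaddr size)
{
    IOMMUTLBEntry entry = {
        .target_as = &address_space_memory,
        .iova = iova,
        .translated_addr = 0,
        .addr_mask = size - 1,
        .perm = IOMMU_NONE,              /* default: report an unmap */
    };
    hwaddr gpa;

    if (vtd_resolve_iova(s, domain_id, iova, &gpa)) {
        /* The entry is (now) present: report the new mapping instead. */
        entry.translated_addr = gpa;
        entry.perm = IOMMU_RW;
    }
    memory_region_notify_iommu(iommu_mr, entry);
}
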
Kai Huang June 7, 2016, 10:39 p.m. UTC | #20
On 6/8/2016 6:46 AM, Alex Williamson wrote:
> On Tue, 7 Jun 2016 17:21:06 +1200
> "Huang, Kai" <kai.huang@linux.intel.com> wrote:
>
>> On 6/7/2016 3:58 PM, Alex Williamson wrote:
>>> On Tue, 7 Jun 2016 11:20:32 +0800
>>> Peter Xu <peterx@redhat.com> wrote:
>>>
>>>> On Mon, Jun 06, 2016 at 11:02:11AM -0600, Alex Williamson wrote:
>>>>> On Mon, 6 Jun 2016 21:43:17 +0800
>>>>> Peter Xu <peterx@redhat.com> wrote:
>>>>>
>>>>>> On Mon, Jun 06, 2016 at 07:11:41AM -0600, Alex Williamson wrote:
>>>>>>> On Mon, 6 Jun 2016 13:04:07 +0800
>>>>>>> Peter Xu <peterx@redhat.com> wrote:
>>>>>> [...]
>>>>>>>> Besides the reason that there might be guests that do not support
>>>>>>>> CM=1, will there be performance considerations? When the user's
>>>>>>>> configuration does not require the CM capability (e.g., a generic VM
>>>>>>>> configuration without VFIO), shall we allow the user to disable the CM
>>>>>>>> bit so that we can have better IOMMU performance (avoiding extra and
>>>>>>>> useless invalidations)?
>>>>>>>
>>>>>>> With Alexey's proposed patch to have callback ops when the iommu
>>>>>>> notifier list adds its first entry and removes its last, any of the
>>>>>>> additional overhead to generate notifies when nobody is listening can
>>>>>>> be avoided.  These same callbacks would be the ones that need to
>>>>>>> generate a hw_error if a notifier is added while running in CM=0.
>>>>>>
>>>>>> Not familiar with Alexey's patch
>>>>>
>>>>> https://lists.nongnu.org/archive/html/qemu-devel/2016-06/msg00079.html
>>>>
>>>> Thanks for the pointer. :)
>>>>
>>>>>
>>>>>> , but is that for VFIO only?
>>>>>
>>>>> vfio is currently the only user of the iommu notifier, but the
>>>>> interface is generic, which is how it should (must) be.
>>>>
>>>> Yes.
>>>>
>>>>>
>>>>>> I mean, if
>>>>>> we configure CM bit=1, the guest kernel will send an invalidation
>>>>>> request every time it creates new entries (context entries or IOTLB
>>>>>> entries). Even without VFIO notifiers, the guest needs to trap into QEMU
>>>>>> and process the invalidation requests. This is avoidable if we are not
>>>>>> using VFIO devices at all (so there is no need to maintain any mappings),
>>>>>> right?
>>>>>
>>>>> CM=1 only defines that not-present and invalid entries can be cached;
>>>>> any change to existing entries requires an invalidation regardless of
>>>>> CM.  What you're looking for sounds more like ECAP.C:
>>>>
>>>> Yes, but I guess what I was talking about is the CM bit, not ECAP.C.
>>>> When we clear/replace one context entry, guest kernel will definitely
>>>> send one context entry invalidation to QEMU:
>>>>
>>>> static void domain_context_clear_one(struct intel_iommu *iommu, u8 bus, u8 devfn)
>>>> {
>>>> 	if (!iommu)
>>>> 		return;
>>>>
>>>> 	clear_context_table(iommu, bus, devfn);
>>>> 	iommu->flush.flush_context(iommu, 0, 0, 0,
>>>> 					   DMA_CCMD_GLOBAL_INVL);
>>>> 	iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH);
>>>> }
>>>>
>>>> ... While if we are creating a new one (like attaching a new VFIO
>>>> device?), it's optional behavior depending on whether the CM bit is
>>>> set:
>>>>
>>>> static int domain_context_mapping_one(struct dmar_domain *domain,
>>>> 				      struct intel_iommu *iommu,
>>>> 				      u8 bus, u8 devfn)
>>>> {
>>>>     ...
>>>> 	/*
>>>> 	 * It's a non-present to present mapping. If hardware doesn't cache
>>>> 	 * non-present entry we only need to flush the write-buffer. If the
>>>> 	 * _does_ cache non-present entries, then it does so in the special
>>>> 	 * domain #0, which we have to flush:
>>>> 	 */
>>>> 	if (cap_caching_mode(iommu->cap)) {
>>>> 		iommu->flush.flush_context(iommu, 0,
>>>> 					   (((u16)bus) << 8) | devfn,
>>>> 					   DMA_CCMD_MASK_NOBIT,
>>>> 					   DMA_CCMD_DEVICE_INVL);
>>>> 		iommu->flush.flush_iotlb(iommu, did, 0, 0, DMA_TLB_DSI_FLUSH);
>>>> 	} else {
>>>> 		iommu_flush_write_buffer(iommu);
>>>> 	}
>>>>     ...
>>>> }
>>>>
>>>> Only if cap_caching_mode() is set (which is bit 7, the CM bit) will we
>>>> send these invalidations. What I meant is that we should allow the
>>>> user to specify the CM bit, so that when we are not using VFIO
>>>> devices, we can skip the above flush_context() and flush_iotlb()
>>>> etc... So, besides the fact that some guests do not support the
>>>> CM bit (like Jailhouse), performance might be another reason to
>>>> allow the user to specify the CM bit themselves.
>>>
>>> I'm dubious of this, IOMMU drivers are already aware that hardware
>>> flushes are expensive and do batching to optimize it.  The queued
>>> invalidation mechanism itself is meant to allow asynchronous
>>> invalidations.  QEMU invalidating a virtual IOMMU might very well be
>>> faster than hardware.
>>
>> Doing batching doesn't mean we can eliminate the IOTLB flush for mappings
>> from non-present to present in the CM=1 case, while with CM=0 those
>> IOTLB flushes are not necessary, as the code above shows. Therefore,
>> generally speaking, CM=0 should have better performance than CM=1, even
>> for QEMU's vIOMMU.
>>
>> In my understanding, the purpose of exposing CM=1 is to force the guest
>> to do an IOTLB flush for each mapping change (including from non-present
>> to present) so that QEMU is able to emulate each mapping change from the
>> guest (correct me if I am wrong). If the previous statement stands, CM=1
>> is really a workaround for making vfio-assigned devices and the vIOMMU
>> work together, and unfortunately this cannot work on other vendors'
>> IOMMUs without a CM bit, such as AMD's IOMMU.
>>
>> So what are the requirements for making vfio-assigned devices and the
>> vIOMMU work together? I think it would be more helpful to implement a more
>> generic solution that monitors and emulates the guest vIOMMU's page tables,
>> rather than simply exposing CM=1 to the guest, as that only works on Intel's IOMMU.
>>
>> And what do you mean by asynchronous invalidations? I think the IOVA of the
>> changed mappings cannot be used until the mappings are invalidated. It
>> doesn't matter whether the invalidation is done via QI or a register write.
>
> Ok, so you're arguing that CM=0 is more efficient than CM=1 because it
> eliminates some portion of the invalidations necessary by the guest,
> while at the same time arguing for a more general solution of shadow
> page tables which would trap into the vIOMMU at every update,
> eliminating all the batching done by the guest IOMMU driver code
> attempting to reduce and consolidate the number of flushes done.  All
> of this with only speculation on what might be more efficient.  Can we
> get vIOMMU working, especially with assigned devices, before we
> speculate further?
>
> How do we expect a vIOMMU to be used in a guest?  In the case of
> emulated devices, what value does it provide?  Are emulated devices
> isolated from one another by the vIOMMU?  No.  Do we have 32-bit
> emulated devices for which DMA translation at the vIOMMU is
> significantly more efficient than bounce buffers within the guest?
> Probably not, and if we did we could just emulate 64bit devices.  So I
> assume that beyond being a development tool, our primary useful feature
> of a vIOMMU is to expose devices to guest userspace (and thus nested
> guests) via tools like vfio.  Am I wrong here?  In this use case, it's
> most efficient to run with iommu=pt in the L1 guest, which would make
> any sort of invalidations a very rare event.  Making use of vfio inside
> the L1 guest would then move a device from the static-identity domain
> in the L1 guest into a new domain where the mappings for that domain
> are also relatively static.  So I really don't see why we're trying to
> optimize the invalidation of the vIOMMU at this point.  I also have to
> believe that the hardware folks that designed VT-d believed there to be
> a performance advantage to using emulated VT-d with CM=1 versus
> shadowing all of the VT-d data structures in the hypervisor or they
> wouldn't have added this bit to the specification.  Thanks,
Hi Alex,

Sorry for jumping into this discussion suddenly. Yes, I absolutely agree
that getting the vIOMMU working with assigned devices is more important
than arguing about vIOMMU performance with respect to the CM bit. Actually,
I am very eager to make the vIOMMU work with vfio for assigned devices in
the guest, as I want to try DPDK via VFIO in the guest (this is the reason
I searched for vIOMMU support in QEMU and found this discussion). My concern
about the CM bit is not performance, but that it is not a generic approach;
still, it is better than nothing :)

Thanks,
-Kai

>
> Alex
>
>
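
A note on the notifier hooks referenced in the quoted thread (callbacks
fired when the first IOMMU notifier is registered and the last one removed):
below is a minimal sketch of how the emulated VT-d could use them to reject
vfio-style listeners while CM is not exposed. The callback and struct-field
names follow Alexey's proposed interface and the intel_iommu code of this
era, but treat them as assumptions rather than the merged API.

/* Sketch only: reject IOMMU notifier registration (e.g. from vfio) when the
 * emulated VT-d does not advertise CM, since mapping changes could not be
 * replayed to the listener correctly in that mode.  Assumed to be wired up
 * via .notify_started / .notify_stopped in the vtd_iommu_ops table. */
static void vtd_iommu_notify_started(MemoryRegion *iommu)
{
    VTDAddressSpace *vtd_as = container_of(iommu, VTDAddressSpace, iommu);
    IntelIOMMUState *s = vtd_as->iommu_state;

    if (!(s->cap & VTD_CAP_CM)) {
        hw_error("Device %02x:%02x.%x requires an IOMMU notifier, which is "
                 "only supported when caching mode (CM) is enabled",
                 pci_bus_num(vtd_as->bus), PCI_SLOT(vtd_as->devfn),
                 PCI_FUNC(vtd_as->devfn));
    }
}

static void vtd_iommu_notify_stopped(MemoryRegion *iommu)
{
    /* Nothing to do in this sketch; a real implementation might stop
     * generating map/unmap notifications here. */
}

With something like this in place, the extra invalidation traffic only
matters once a notifier is actually registered, which ties back to the
performance concern raised earlier in the thread.
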
diff mbox

Patch

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 347718f..1af8da8 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1949,7 +1949,8 @@  static void vtd_init(IntelIOMMUState *s)
     s->iq_last_desc_type = VTD_INV_DESC_NONE;
     s->next_frcd_reg = 0;
     s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND | VTD_CAP_MGAW |
-             VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS;
+             VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
+             VTD_CAP_CM;
     s->ecap = VTD_ECAP_QI | VTD_ECAP_IRO;
 
     vtd_reset_context_cache(s);
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index e5f514c..ae40f73 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -190,6 +190,7 @@ 
 #define VTD_CAP_MAMV                (VTD_MAMV << 48)
 #define VTD_CAP_PSI                 (1ULL << 39)
 #define VTD_CAP_SLLPS               ((1ULL << 34) | (1ULL << 35))
+#define VTD_CAP_CM                  (1ULL << 7)
 
 /* Supported Adjusted Guest Address Widths */
 #define VTD_CAP_SAGAW_SHIFT         8
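
Both review comments above ask for the CM bit to be configurable rather than
always set. A minimal sketch of one way to do that with a qdev property
follows; the "caching-mode" property name and the caching_mode field are
illustrative assumptions, not necessarily what was eventually merged.

/* Sketch only: gate VTD_CAP_CM behind a user-visible property so guests
 * that cannot handle CM=1 (or configurations without vfio) keep CM=0.
 * Assumes a new "bool caching_mode;" field in IntelIOMMUState and that the
 * array is hooked up via dc->props = vtd_properties in class_init. */
static Property vtd_properties[] = {
    DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode, false),
    DEFINE_PROP_END_OF_LIST(),
};

/* ... and in vtd_init(), set CM only when requested: */
    s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND | VTD_CAP_MGAW |
             VTD_CAP_SAGAW | VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS;
    if (s->caching_mode) {
        s->cap |= VTD_CAP_CM;
    }

The property would then be enabled on the intel-iommu device only when vfio
passthrough behind the vIOMMU is wanted; the exact command-line syntax for
doing so depends on the QEMU version in use.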