[0/3] AMD/IOMMU: re-work mode updating

Message ID db66edf2-ca66-4127-8920-ba55f4aee14e@suse.com (mailing list archive)

Message

Jan Beulich Nov. 6, 2019, 3:16 p.m. UTC
update_paging_mode() in the AMD IOMMU code expects to be invoked with
the PCI devices lock held. Since the check occurs only when the mode
actually needs updating, the violation of this rule by the majority
of callers went unnoticed until per-domain IOMMU setup was changed
to do away with on-demand creation of IOMMU page tables.

Unfortunately the only halfway reasonable fix to this that I could
come up with requires more re-work than would seem desirable at this
time of the release process, but addressing the issue seems
unavoidable to me as its manifestation is a regression from the
IOMMU page table setup re-work. The change also isn't without risk
of further regressions - if in patch 2 I've missed a code path that
would also need to invoke the new hook, then this might mean non-
working guests (with passed-through devices on AMD hardware).

1: AMD/IOMMU: don't needlessly trigger errors/crashes when unmapping a page
2: introduce GFN notification for translated domains
3: AMD/IOMMU: use notify_dfn() hook to update paging mode

Jan

Comments

Andrew Cooper Nov. 6, 2019, 5:31 p.m. UTC | #1
On 06/11/2019 15:16, Jan Beulich wrote:
> update_paging_mode() in the AMD IOMMU code expects to be invoked with
> the PCI devices lock held. The check occurring only when the mode
> actually needs updating, the violation of this rule by the majority
> of callers did go unnoticed until per-domain IOMMU setup was changed
> to do away with on-demand creation of IOMMU page tables.
>
> Unfortunately the only half way reasonable fix to this that I could
> come up with requires more re-work than would seem desirable at this
> time of the release process, but addressing the issue seems
> unavoidable to me as its manifestation is a regression from the
> IOMMU page table setup re-work. The change also isn't without risk
> of further regressions - if in patch 2 I've missed a code path that
> would also need to invoke the new hook, then this might mean non-
> working guests (with passed-through devices on AMD hardware).
>
> 1: AMD/IOMMU: don't needlessly trigger errors/crashes when unmapping a page
> 2: introduce GFN notification for translated domains
> 3: AMD/IOMMU: use notify_dfn() hook to update paging mode

Having now looked at all three, why don't we just delete the dynamic
height of AMD IOMMU pagetables?

This series looks suspiciously like it is adding new common
infrastructure to work around the fact we're doing something fairly dumb
to begin with.

Hardcoding at 4 levels is, at the very worst, 2 extra pages per domain,
and a substantial reduction in the complexity of the IOMMU code.

~Andrew
Sander Eikelenboom Nov. 6, 2019, 6:29 p.m. UTC | #2
On 06/11/2019 16:16, Jan Beulich wrote:
> update_paging_mode() in the AMD IOMMU code expects to be invoked with
> the PCI devices lock held. The check occurring only when the mode
> actually needs updating, the violation of this rule by the majority
> of callers did go unnoticed until per-domain IOMMU setup was changed
> to do away with on-demand creation of IOMMU page tables.
> 
> Unfortunately the only half way reasonable fix to this that I could
> come up with requires more re-work than would seem desirable at this
> time of the release process, but addressing the issue seems
> unavoidable to me as its manifestation is a regression from the
> IOMMU page table setup re-work. The change also isn't without risk
> of further regressions - if in patch 2 I've missed a code path that
> would also need to invoke the new hook, then this might mean non-
> working guests (with passed-through devices on AMD hardware).
> 
> 1: AMD/IOMMU: don't needlessly trigger errors/crashes when unmapping a page
> 2: introduce GFN notification for translated domains
> 3: AMD/IOMMU: use notify_dfn() hook to update paging mode
> 
> Jan
> 

Hi Jan,

I just tested and I don't get the "pcidevs" message any more.

I assume this was only a fix for that issue, so it's probably expected
that the other issue:
   AMD-Vi: INVALID_DEV_REQUEST 00000800 8a000000 f8000840 000000fd
   and a malfunctioning device in one of the guests
is still around.

--
Sander
Jan Beulich Nov. 7, 2019, 7:32 a.m. UTC | #3
On 06.11.2019 19:29, Sander Eikelenboom wrote:
> On 06/11/2019 16:16, Jan Beulich wrote:
>> update_paging_mode() in the AMD IOMMU code expects to be invoked with
>> the PCI devices lock held. The check occurring only when the mode
>> actually needs updating, the violation of this rule by the majority
>> of callers did go unnoticed until per-domain IOMMU setup was changed
>> to do away with on-demand creation of IOMMU page tables.
>>
>> Unfortunately the only half way reasonable fix to this that I could
>> come up with requires more re-work than would seem desirable at this
>> time of the release process, but addressing the issue seems
>> unavoidable to me as its manifestation is a regression from the
>> IOMMU page table setup re-work. The change also isn't without risk
>> of further regressions - if in patch 2 I've missed a code path that
>> would also need to invoke the new hook, then this might mean non-
>> working guests (with passed-through devices on AMD hardware).
>>
>> 1: AMD/IOMMU: don't needlessly trigger errors/crashes when unmapping a page
>> 2: introduce GFN notification for translated domains
>> 3: AMD/IOMMU: use notify_dfn() hook to update paging mode
> 
> I just tested and I don't get the "pcidevs" message any more.

Thanks for testing the series.

> I assume this only was a fix for that issue, so it's probably expected
> that the other issue:
>    AMD-Vi: INVALID_DEV_REQUEST 00000800 8a000000 f8000840 000000fd
>    and malfunctioning device in one of the guests.
> is still around.

Indeed. Someone (possibly me) still needs to look into the other one.

Jan
Jan Beulich Nov. 7, 2019, 7:36 a.m. UTC | #4
On 06.11.2019 18:31, Andrew Cooper wrote:
> On 06/11/2019 15:16, Jan Beulich wrote:
>> update_paging_mode() in the AMD IOMMU code expects to be invoked with
>> the PCI devices lock held. The check occurring only when the mode
>> actually needs updating, the violation of this rule by the majority
>> of callers did go unnoticed until per-domain IOMMU setup was changed
>> to do away with on-demand creation of IOMMU page tables.
>>
>> Unfortunately the only half way reasonable fix to this that I could
>> come up with requires more re-work than would seem desirable at this
>> time of the release process, but addressing the issue seems
>> unavoidable to me as its manifestation is a regression from the
>> IOMMU page table setup re-work. The change also isn't without risk
>> of further regressions - if in patch 2 I've missed a code path that
>> would also need to invoke the new hook, then this might mean non-
>> working guests (with passed-through devices on AMD hardware).
>>
>> 1: AMD/IOMMU: don't needlessly trigger errors/crashes when unmapping a page
>> 2: introduce GFN notification for translated domains
>> 3: AMD/IOMMU: use notify_dfn() hook to update paging mode
> 
> Having now looked at all three, why don't we just delete the dynamic
> height of AMD IOMMU pagetables?
> 
> This series looks suspiciously like it is adding new common
> infrastructure to work around the fact we're doing something fairly dumb
> to begin with.
> 
> Hardcoding at 4 levels is, at the very worst, 2 extra pages per domain,
> and a substantial reduction in the complexity of the IOMMU code.

Yet an additional level of page walks hardware has to perform. Also
4 levels won't cover all possible 52 address bits. And finally, the
more applicable your "substantial reduction", the less suitable such
a change may be at this point of the release (but I didn't look at
this side of things in any detail, so it may well not be an issue).

Jan
Andrew Cooper Nov. 7, 2019, 12:49 p.m. UTC | #5
On 07/11/2019 07:36, Jan Beulich wrote:
> On 06.11.2019 18:31, Andrew Cooper wrote:
>> On 06/11/2019 15:16, Jan Beulich wrote:
>>> update_paging_mode() in the AMD IOMMU code expects to be invoked with
>>> the PCI devices lock held. The check occurring only when the mode
>>> actually needs updating, the violation of this rule by the majority
>>> of callers did go unnoticed until per-domain IOMMU setup was changed
>>> to do away with on-demand creation of IOMMU page tables.
>>>
>>> Unfortunately the only half way reasonable fix to this that I could
>>> come up with requires more re-work than would seem desirable at this
>>> time of the release process, but addressing the issue seems
>>> unavoidable to me as its manifestation is a regression from the
>>> IOMMU page table setup re-work. The change also isn't without risk
>>> of further regressions - if in patch 2 I've missed a code path that
>>> would also need to invoke the new hook, then this might mean non-
>>> working guests (with passed-through devices on AMD hardware).
>>>
>>> 1: AMD/IOMMU: don't needlessly trigger errors/crashes when unmapping a page
>>> 2: introduce GFN notification for translated domains
>>> 3: AMD/IOMMU: use notify_dfn() hook to update paging mode
>> Having now looked at all three, why don't we just delete the dynamic
>> height of AMD IOMMU pagetables?
>>
>> This series looks suspiciously like it is adding new common
>> infrastructure to work around the fact we're doing something fairly dumb
>> to begin with.
>>
>> Hardcoding at 4 levels is, at the very worst, 2 extra pages per domain,
>> and a substantial reduction in the complexity of the IOMMU code.
> Yet an additional level of page walks hardware has to perform. Also
> 4 levels won't cover all possible 52 address bits. And finally, the
> more applicable your "substantial reduction", the less suitable such
> a change may be at this point of the release (but I didn't look at
> this side of things in any detail, so it may well not be an issue).

There is, in practice, no such thing as an HVM guest using 2 levels. 
The VRAM just below the 4G boundary will force a resize to 3 levels
during domain construction, and as a 1-line fix for 4.13, this probably
isn't the worst idea going.

There are no AMD systems which support >48 bit PA space, so 4 levels is
sufficient for now, but fundamentally details such as the size of GPA
space should be specified in domain_create() and remain static for the
lifetime of the domain.

As far as I can tell, it is only AMD systems with IOMMUs which permit
the PA space to be variable, and I still can't help but feeling that
this series is attempting to work around a problem we shouldn't have in
the first place.

~Andrew
Jan Beulich Nov. 7, 2019, 1:17 p.m. UTC | #6
On 07.11.2019 13:49, Andrew Cooper wrote:
> On 07/11/2019 07:36, Jan Beulich wrote:
>> On 06.11.2019 18:31, Andrew Cooper wrote:
>>> On 06/11/2019 15:16, Jan Beulich wrote:
>>>> update_paging_mode() in the AMD IOMMU code expects to be invoked with
>>>> the PCI devices lock held. The check occurring only when the mode
>>>> actually needs updating, the violation of this rule by the majority
>>>> of callers did go unnoticed until per-domain IOMMU setup was changed
>>>> to do away with on-demand creation of IOMMU page tables.
>>>>
>>>> Unfortunately the only half way reasonable fix to this that I could
>>>> come up with requires more re-work than would seem desirable at this
>>>> time of the release process, but addressing the issue seems
>>>> unavoidable to me as its manifestation is a regression from the
>>>> IOMMU page table setup re-work. The change also isn't without risk
>>>> of further regressions - if in patch 2 I've missed a code path that
>>>> would also need to invoke the new hook, then this might mean non-
>>>> working guests (with passed-through devices on AMD hardware).
>>>>
>>>> 1: AMD/IOMMU: don't needlessly trigger errors/crashes when unmapping a page
>>>> 2: introduce GFN notification for translated domains
>>>> 3: AMD/IOMMU: use notify_dfn() hook to update paging mode
>>> Having now looked at all three, why don't we just delete the dynamic
>>> height of AMD IOMMU pagetables?
>>>
>>> This series looks suspiciously like it is adding new common
>>> infrastructure to work around the fact we're doing something fairly dumb
>>> to begin with.
>>>
>>> Hardcoding at 4 levels is, at the very worst, 2 extra pages per domain,
>>> and a substantial reduction in the complexity of the IOMMU code.
>> Yet an additional level of page walks hardware has to perform. Also
>> 4 levels won't cover all possible 52 address bits. And finally, the
>> more applicable your "substantial reduction", the less suitable such
>> a change may be at this point of the release (but I didn't look at
>> this side of things in any detail, so it may well not be an issue).
> 
> There is, in practice, no such thing as an HVM guest using 2 levels. 
> The VRAM just below the 4G boundary will force a resize to 3 levels
> during domain construction, and as a 1-line fix for 4.13, this probably
> isn't the worst idea going.

So here (with the 1-line fix remark) you talk about 3 levels. Yet
switching the 2 that we start from to 3 won't fix anything, as we
still may need to go to 4 for huge guests. Such a change would
merely eliminate the indeed pretty pointless move from 2 to 3 which
now happens for all domains as their memory gets populated.

> There are no AMD systems which support >48 bit PA space, so 4 levels is
> sufficient for now, but fundamentally details such as the size of GPA
> space should be specified in domain_create() and remain static for the
> lifetime of the domain.

I agree GPA dimensions ought to be static. But the number-of-levels
adjustment the code does has nothing to do with variable GPA
boundaries. Even for a domain with a, say, 36-bit GFN space it
may be beneficial to run with only 3 levels, as long as it has
"little" enough memory assigned. In fact the number of levels the
hardware has to walk is the one aspect you don't even touch in your
reply.

Jan