diff mbox series

[v3,20/23] VT-d: free all-empty page tables

Message ID 807a48fe-3829-d976-75dc-1767d32fb0f4@suse.com (mailing list archive)
State New, archived
Headers show
Series IOMMU: superpage support when not sharing pagetables | expand

Commit Message

Jan Beulich Jan. 10, 2022, 4:36 p.m. UTC
When a page table ends up with no present entries left, it can be
replaced by a non-present entry at the next higher level. The page table
itself can then be scheduled for freeing.

Note that while its output isn't used there yet,
pt_update_contig_markers() right away needs to be called in all places
where entries get updated, not just the one where entries get cleared.

Note further that while pt_update_contig_markers() updates perhaps
several PTEs within the table, since these are changes to "avail" bits
only I do not think that cache flushing would be needed afterwards. Such
cache flushing (of entire pages, unless adding yet more logic to me more
selective) would be quite noticable performance-wise (very prominent
during Dom0 boot).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: Properly bound loop. Re-base over changes earlier in the series.
v2: New.
---
The hang during boot on my Latitude E6410 (see the respective code
comment) was pretty close after iommu_enable_translation(). No errors,
no watchdog would kick in, just sometimes the first few pixel lines of
the next log message's (XEN) prefix would have made it out to the screen
(and there's no serial there). It's been a lot of experimenting until I
figured the workaround (which I consider ugly, but halfway acceptable).
I've been trying hard to make sure the workaround wouldn't be masking a
real issue, yet I'm still wary of it possibly doing so ... My best guess
at this point is that on these old IOMMUs the ignored bits 52...61
aren't really ignored for present entries, but also aren't "reserved"
enough to trigger faults. This guess is from having tried to set other
bits in this range (unconditionally, and with the workaround here in
place), which yielded the same behavior.

Comments

Tian, Kevin Feb. 18, 2022, 5:20 a.m. UTC | #1
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Tuesday, January 11, 2022 12:36 AM
> 
> When a page table ends up with no present entries left, it can be
> replaced by a non-present entry at the next higher level. The page table
> itself can then be scheduled for freeing.
> 
> Note that while its output isn't used there yet,
> pt_update_contig_markers() right away needs to be called in all places
> where entries get updated, not just the one where entries get cleared.
> 
> Note further that while pt_update_contig_markers() updates perhaps
> several PTEs within the table, since these are changes to "avail" bits
> only I do not think that cache flushing would be needed afterwards. Such
> cache flushing (of entire pages, unless adding yet more logic to me more
> selective) would be quite noticable performance-wise (very prominent
> during Dom0 boot).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> v3: Properly bound loop. Re-base over changes earlier in the series.
> v2: New.
> ---
> The hang during boot on my Latitude E6410 (see the respective code
> comment) was pretty close after iommu_enable_translation(). No errors,
> no watchdog would kick in, just sometimes the first few pixel lines of
> the next log message's (XEN) prefix would have made it out to the screen
> (and there's no serial there). It's been a lot of experimenting until I
> figured the workaround (which I consider ugly, but halfway acceptable).
> I've been trying hard to make sure the workaround wouldn't be masking a
> real issue, yet I'm still wary of it possibly doing so ... My best guess
> at this point is that on these old IOMMUs the ignored bits 52...61
> aren't really ignored for present entries, but also aren't "reserved"
> enough to trigger faults. This guess is from having tried to set other

Is this machine able to capture any VT-d faults before? If yes maybe
you will observe more information if trying to tweak those bits at a later
time (instead of when iommu is enabled)?

Thanks
Kevin
Jan Beulich Feb. 18, 2022, 8:31 a.m. UTC | #2
On 18.02.2022 06:20, Tian, Kevin wrote:
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Tuesday, January 11, 2022 12:36 AM
>>
>> When a page table ends up with no present entries left, it can be
>> replaced by a non-present entry at the next higher level. The page table
>> itself can then be scheduled for freeing.
>>
>> Note that while its output isn't used there yet,
>> pt_update_contig_markers() right away needs to be called in all places
>> where entries get updated, not just the one where entries get cleared.
>>
>> Note further that while pt_update_contig_markers() updates perhaps
>> several PTEs within the table, since these are changes to "avail" bits
>> only I do not think that cache flushing would be needed afterwards. Such
>> cache flushing (of entire pages, unless adding yet more logic to me more
>> selective) would be quite noticable performance-wise (very prominent
>> during Dom0 boot).
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> v3: Properly bound loop. Re-base over changes earlier in the series.
>> v2: New.
>> ---
>> The hang during boot on my Latitude E6410 (see the respective code
>> comment) was pretty close after iommu_enable_translation(). No errors,
>> no watchdog would kick in, just sometimes the first few pixel lines of
>> the next log message's (XEN) prefix would have made it out to the screen
>> (and there's no serial there). It's been a lot of experimenting until I
>> figured the workaround (which I consider ugly, but halfway acceptable).
>> I've been trying hard to make sure the workaround wouldn't be masking a
>> real issue, yet I'm still wary of it possibly doing so ... My best guess
>> at this point is that on these old IOMMUs the ignored bits 52...61
>> aren't really ignored for present entries, but also aren't "reserved"
>> enough to trigger faults. This guess is from having tried to set other
> 
> Is this machine able to capture any VT-d faults before?

Not sure what you mean here. I don't think I can trigger any I/O at this
point in time, and hence I also couldn't try to trigger a fault. But if
the question is whether fault reporting at this time actually works,
then yes, I think so: This is during Dom0 construction, i.e. late enough
for fault reporting to be fully set up and enabled.

Jan

> If yes maybe
> you will observe more information if trying to tweak those bits at a later
> time (instead of when iommu is enabled)?
> 
> Thanks
> Kevin
Tian, Kevin March 14, 2022, 4:01 a.m. UTC | #3
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Friday, February 18, 2022 4:31 PM
> 
> On 18.02.2022 06:20, Tian, Kevin wrote:
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: Tuesday, January 11, 2022 12:36 AM
> >>
> >> When a page table ends up with no present entries left, it can be
> >> replaced by a non-present entry at the next higher level. The page table
> >> itself can then be scheduled for freeing.
> >>
> >> Note that while its output isn't used there yet,
> >> pt_update_contig_markers() right away needs to be called in all places
> >> where entries get updated, not just the one where entries get cleared.
> >>
> >> Note further that while pt_update_contig_markers() updates perhaps
> >> several PTEs within the table, since these are changes to "avail" bits
> >> only I do not think that cache flushing would be needed afterwards. Such
> >> cache flushing (of entire pages, unless adding yet more logic to me more
> >> selective) would be quite noticable performance-wise (very prominent
> >> during Dom0 boot).
> >>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >> ---
> >> v3: Properly bound loop. Re-base over changes earlier in the series.
> >> v2: New.
> >> ---
> >> The hang during boot on my Latitude E6410 (see the respective code
> >> comment) was pretty close after iommu_enable_translation(). No errors,
> >> no watchdog would kick in, just sometimes the first few pixel lines of
> >> the next log message's (XEN) prefix would have made it out to the screen
> >> (and there's no serial there). It's been a lot of experimenting until I
> >> figured the workaround (which I consider ugly, but halfway acceptable).
> >> I've been trying hard to make sure the workaround wouldn't be masking a
> >> real issue, yet I'm still wary of it possibly doing so ... My best guess
> >> at this point is that on these old IOMMUs the ignored bits 52...61
> >> aren't really ignored for present entries, but also aren't "reserved"
> >> enough to trigger faults. This guess is from having tried to set other
> >
> > Is this machine able to capture any VT-d faults before?
> 
> Not sure what you mean here. I don't think I can trigger any I/O at this
> point in time, and hence I also couldn't try to trigger a fault. But if
> the question is whether fault reporting at this time actually works,
> then yes, I think so: This is during Dom0 construction, i.e. late enough
> for fault reporting to be fully set up and enabled.
> 

My point was that with your guess that the ignored bits are not
ignored some VT-d faults should be triggered. If the reason why
you cannot observe such faults is because they happened too
early so no much can be shown on the screen then trying to
setting those bits at much later point might get more shown to
verify your guess. 

btw any progress since last post? How urgent do you want this
feature in (compared to the issue that it may paper over)? 

Thanks
Kevin
Jan Beulich March 14, 2022, 7:33 a.m. UTC | #4
On 14.03.2022 05:01, Tian, Kevin wrote:
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Friday, February 18, 2022 4:31 PM
>>
>> On 18.02.2022 06:20, Tian, Kevin wrote:
>>>> From: Jan Beulich <jbeulich@suse.com>
>>>> Sent: Tuesday, January 11, 2022 12:36 AM
>>>>
>>>> When a page table ends up with no present entries left, it can be
>>>> replaced by a non-present entry at the next higher level. The page table
>>>> itself can then be scheduled for freeing.
>>>>
>>>> Note that while its output isn't used there yet,
>>>> pt_update_contig_markers() right away needs to be called in all places
>>>> where entries get updated, not just the one where entries get cleared.
>>>>
>>>> Note further that while pt_update_contig_markers() updates perhaps
>>>> several PTEs within the table, since these are changes to "avail" bits
>>>> only I do not think that cache flushing would be needed afterwards. Such
>>>> cache flushing (of entire pages, unless adding yet more logic to me more
>>>> selective) would be quite noticable performance-wise (very prominent
>>>> during Dom0 boot).
>>>>
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>> ---
>>>> v3: Properly bound loop. Re-base over changes earlier in the series.
>>>> v2: New.
>>>> ---
>>>> The hang during boot on my Latitude E6410 (see the respective code
>>>> comment) was pretty close after iommu_enable_translation(). No errors,
>>>> no watchdog would kick in, just sometimes the first few pixel lines of
>>>> the next log message's (XEN) prefix would have made it out to the screen
>>>> (and there's no serial there). It's been a lot of experimenting until I
>>>> figured the workaround (which I consider ugly, but halfway acceptable).
>>>> I've been trying hard to make sure the workaround wouldn't be masking a
>>>> real issue, yet I'm still wary of it possibly doing so ... My best guess
>>>> at this point is that on these old IOMMUs the ignored bits 52...61
>>>> aren't really ignored for present entries, but also aren't "reserved"
>>>> enough to trigger faults. This guess is from having tried to set other
>>>
>>> Is this machine able to capture any VT-d faults before?
>>
>> Not sure what you mean here. I don't think I can trigger any I/O at this
>> point in time, and hence I also couldn't try to trigger a fault. But if
>> the question is whether fault reporting at this time actually works,
>> then yes, I think so: This is during Dom0 construction, i.e. late enough
>> for fault reporting to be fully set up and enabled.
>>
> 
> My point was that with your guess that the ignored bits are not
> ignored some VT-d faults should be triggered. If the reason why
> you cannot observe such faults is because they happened too
> early so no much can be shown on the screen then trying to
> setting those bits at much later point might get more shown to
> verify your guess. 

Pretty clearly there aren't any faults. And in fact my suspicion is
that the bits are used for addressing memory, and then the memory
access hangs (doesn't complete).

> btw any progress since last post? How urgent do you want this
> feature in (compared to the issue that it may paper over)? 

Well, one way or another the issue needs to be dealt with for this
series to eventually go in. To be honest I hadn't expected that it
would still be pending ...

Jan
Tian, Kevin March 17, 2022, 5:55 a.m. UTC | #5
> From: Jan Beulich <jbeulich@suse.com>
> Sent: Monday, March 14, 2022 3:33 PM
> 
> On 14.03.2022 05:01, Tian, Kevin wrote:
> >> From: Jan Beulich <jbeulich@suse.com>
> >> Sent: Friday, February 18, 2022 4:31 PM
> >>
> >> On 18.02.2022 06:20, Tian, Kevin wrote:
> >>>> From: Jan Beulich <jbeulich@suse.com>
> >>>> Sent: Tuesday, January 11, 2022 12:36 AM
> >>>>
> >>>> When a page table ends up with no present entries left, it can be
> >>>> replaced by a non-present entry at the next higher level. The page table
> >>>> itself can then be scheduled for freeing.
> >>>>
> >>>> Note that while its output isn't used there yet,
> >>>> pt_update_contig_markers() right away needs to be called in all places
> >>>> where entries get updated, not just the one where entries get cleared.
> >>>>
> >>>> Note further that while pt_update_contig_markers() updates perhaps
> >>>> several PTEs within the table, since these are changes to "avail" bits
> >>>> only I do not think that cache flushing would be needed afterwards.
> Such
> >>>> cache flushing (of entire pages, unless adding yet more logic to me
> more
> >>>> selective) would be quite noticable performance-wise (very prominent
> >>>> during Dom0 boot).
> >>>>
> >>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >>>> ---
> >>>> v3: Properly bound loop. Re-base over changes earlier in the series.
> >>>> v2: New.
> >>>> ---
> >>>> The hang during boot on my Latitude E6410 (see the respective code
> >>>> comment) was pretty close after iommu_enable_translation(). No
> errors,
> >>>> no watchdog would kick in, just sometimes the first few pixel lines of
> >>>> the next log message's (XEN) prefix would have made it out to the
> screen
> >>>> (and there's no serial there). It's been a lot of experimenting until I
> >>>> figured the workaround (which I consider ugly, but halfway acceptable).
> >>>> I've been trying hard to make sure the workaround wouldn't be
> masking a
> >>>> real issue, yet I'm still wary of it possibly doing so ... My best guess
> >>>> at this point is that on these old IOMMUs the ignored bits 52...61
> >>>> aren't really ignored for present entries, but also aren't "reserved"
> >>>> enough to trigger faults. This guess is from having tried to set other
> >>>
> >>> Is this machine able to capture any VT-d faults before?
> >>
> >> Not sure what you mean here. I don't think I can trigger any I/O at this
> >> point in time, and hence I also couldn't try to trigger a fault. But if
> >> the question is whether fault reporting at this time actually works,
> >> then yes, I think so: This is during Dom0 construction, i.e. late enough
> >> for fault reporting to be fully set up and enabled.
> >>
> >
> > My point was that with your guess that the ignored bits are not
> > ignored some VT-d faults should be triggered. If the reason why
> > you cannot observe such faults is because they happened too
> > early so no much can be shown on the screen then trying to
> > setting those bits at much later point might get more shown to
> > verify your guess.
> 
> Pretty clearly there aren't any faults. And in fact my suspicion is
> that the bits are used for addressing memory, and then the memory
> access hangs (doesn't complete).
> 
> > btw any progress since last post? How urgent do you want this
> > feature in (compared to the issue that it may paper over)?
> 
> Well, one way or another the issue needs to be dealt with for this
> series to eventually go in. To be honest I hadn't expected that it
> would still be pending ...
> 

Sorry I didn't get your meaning. Do you mean that you didn't
expect that I haven't given r-b or that you haven't found time
to root-cause this issue?

Thanks
Kevin
Jan Beulich March 17, 2022, 8:55 a.m. UTC | #6
On 17.03.2022 06:55, Tian, Kevin wrote:
>> From: Jan Beulich <jbeulich@suse.com>
>> Sent: Monday, March 14, 2022 3:33 PM
>>
>> On 14.03.2022 05:01, Tian, Kevin wrote:
>>>> From: Jan Beulich <jbeulich@suse.com>
>>>> Sent: Friday, February 18, 2022 4:31 PM
>>>>
>>>> On 18.02.2022 06:20, Tian, Kevin wrote:
>>>>>> From: Jan Beulich <jbeulich@suse.com>
>>>>>> Sent: Tuesday, January 11, 2022 12:36 AM
>>>>>>
>>>>>> When a page table ends up with no present entries left, it can be
>>>>>> replaced by a non-present entry at the next higher level. The page table
>>>>>> itself can then be scheduled for freeing.
>>>>>>
>>>>>> Note that while its output isn't used there yet,
>>>>>> pt_update_contig_markers() right away needs to be called in all places
>>>>>> where entries get updated, not just the one where entries get cleared.
>>>>>>
>>>>>> Note further that while pt_update_contig_markers() updates perhaps
>>>>>> several PTEs within the table, since these are changes to "avail" bits
>>>>>> only I do not think that cache flushing would be needed afterwards.
>> Such
>>>>>> cache flushing (of entire pages, unless adding yet more logic to me
>> more
>>>>>> selective) would be quite noticable performance-wise (very prominent
>>>>>> during Dom0 boot).
>>>>>>
>>>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>>>> ---
>>>>>> v3: Properly bound loop. Re-base over changes earlier in the series.
>>>>>> v2: New.
>>>>>> ---
>>>>>> The hang during boot on my Latitude E6410 (see the respective code
>>>>>> comment) was pretty close after iommu_enable_translation(). No
>> errors,
>>>>>> no watchdog would kick in, just sometimes the first few pixel lines of
>>>>>> the next log message's (XEN) prefix would have made it out to the
>> screen
>>>>>> (and there's no serial there). It's been a lot of experimenting until I
>>>>>> figured the workaround (which I consider ugly, but halfway acceptable).
>>>>>> I've been trying hard to make sure the workaround wouldn't be
>> masking a
>>>>>> real issue, yet I'm still wary of it possibly doing so ... My best guess
>>>>>> at this point is that on these old IOMMUs the ignored bits 52...61
>>>>>> aren't really ignored for present entries, but also aren't "reserved"
>>>>>> enough to trigger faults. This guess is from having tried to set other
>>>>>
>>>>> Is this machine able to capture any VT-d faults before?
>>>>
>>>> Not sure what you mean here. I don't think I can trigger any I/O at this
>>>> point in time, and hence I also couldn't try to trigger a fault. But if
>>>> the question is whether fault reporting at this time actually works,
>>>> then yes, I think so: This is during Dom0 construction, i.e. late enough
>>>> for fault reporting to be fully set up and enabled.
>>>>
>>>
>>> My point was that with your guess that the ignored bits are not
>>> ignored some VT-d faults should be triggered. If the reason why
>>> you cannot observe such faults is because they happened too
>>> early so no much can be shown on the screen then trying to
>>> setting those bits at much later point might get more shown to
>>> verify your guess.
>>
>> Pretty clearly there aren't any faults. And in fact my suspicion is
>> that the bits are used for addressing memory, and then the memory
>> access hangs (doesn't complete).
>>
>>> btw any progress since last post? How urgent do you want this
>>> feature in (compared to the issue that it may paper over)?
>>
>> Well, one way or another the issue needs to be dealt with for this
>> series to eventually go in. To be honest I hadn't expected that it
>> would still be pending ...
>>
> 
> Sorry I didn't get your meaning. Do you mean that you didn't
> expect that I haven't given r-b or that you haven't found time
> to root-cause this issue?

Neither - the comment about the series as a whole still being pending
was a general one. After all it's been over half a year since the
original posting of the first parts of it.

As to root-causing this issue: I don't see any reasonable way for me
to do so. Hence it's not a matter of finding time anymore (that was
only the case until I could actually associate the behavior with the
one specific piece of code that causes it), but a matter of simply
not being in the position to sensibly try to dig deeper. I continue
to think that the only reasonable way to gain further insight is for
someone with access to the sources of the (I assume) involved
microcode in the chipset to spell out what the expected behavior
given that microcode would actually be. Without such knowledge I do
not see any alternative to what I'm currently doing to document and
work around the issue. Yet I also understand that given this is
rather old hardware, there's little interest at Intel to actually
put time into such research.

Jan
diff mbox series

Patch

--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -42,6 +42,9 @@ 
 #include "vtd.h"
 #include "../ats.h"
 
+#define CONTIG_MASK DMA_PTE_CONTIG_MASK
+#include <asm/pt-contig-markers.h>
+
 /* dom_io is used as a sentinel for quarantined devices */
 #define QUARANTINE_SKIP(d) ((d) == dom_io && !dom_iommu(d)->arch.vtd.pgd_maddr)
 
@@ -452,6 +455,9 @@  static uint64_t addr_to_dma_page_maddr(s
 
             write_atomic(&pte->val, new_pte.val);
             iommu_sync_cache(pte, sizeof(struct dma_pte));
+            pt_update_contig_markers(&parent->val,
+                                     address_level_offset(addr, level),
+                                     level, PTE_kind_table);
         }
 
         if ( --level == target )
@@ -879,9 +885,31 @@  static int dma_pte_clear_one(struct doma
 
     old = *pte;
     dma_clear_pte(*pte);
+    iommu_sync_cache(pte, sizeof(*pte));
+
+    while ( pt_update_contig_markers(&page->val,
+                                     address_level_offset(addr, level),
+                                     level, PTE_kind_null) &&
+            ++level < min_pt_levels )
+    {
+        struct page_info *pg = maddr_to_page(pg_maddr);
+
+        unmap_vtd_domain_page(page);
+
+        pg_maddr = addr_to_dma_page_maddr(domain, addr, level, flush_flags,
+                                          false);
+        BUG_ON(pg_maddr < PAGE_SIZE);
+
+        page = map_vtd_domain_page(pg_maddr);
+        pte = &page[address_level_offset(addr, level)];
+        dma_clear_pte(*pte);
+        iommu_sync_cache(pte, sizeof(*pte));
+
+        *flush_flags |= IOMMU_FLUSHF_all;
+        iommu_queue_free_pgtable(domain, pg);
+    }
 
     spin_unlock(&hd->arch.mapping_lock);
-    iommu_sync_cache(pte, sizeof(struct dma_pte));
 
     unmap_vtd_domain_page(page);
 
@@ -2037,8 +2065,21 @@  static int __must_check intel_iommu_map_
     }
 
     *pte = new;
-
     iommu_sync_cache(pte, sizeof(struct dma_pte));
+
+    /*
+     * While the (ab)use of PTE_kind_table here allows to save some work in
+     * the function, the main motivation for it is that it avoids a so far
+     * unexplained hang during boot (while preparing Dom0) on a Westmere
+     * based laptop.
+     */
+    pt_update_contig_markers(&page->val,
+                             address_level_offset(dfn_to_daddr(dfn), level),
+                             level,
+                             (hd->platform_ops->page_sizes &
+                              (1UL << level_to_offset_bits(level + 1))
+                              ? PTE_kind_leaf : PTE_kind_table));
+
     spin_unlock(&hd->arch.mapping_lock);
     unmap_vtd_domain_page(page);