Message ID | 20220922160516.5929-1-roger.pau@citrix.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | x86/ept: limit calls to memory_type_changed() | expand |
On 22.09.2022 18:05, Roger Pau Monne wrote: > memory_type_changed() is currently only implemented for Intel EPT, and > results in the invalidation of EMT attributes on all the entries in > the EPT page tables. Such invalidation causes EPT_MISCONFIG vmexits > when the guest tries to access any gfns for the first time, which > results in the recalculation of the EMT for the accessed page. The > vmexit and the recalculations are expensive, and as such should be > avoided when possible. > > Remove the call to memory_type_changed() from > XEN_DOMCTL_memory_mapping: there are no modifications of the > iomem_caps ranges anymore that could alter the return of > cache_flush_permitted() from that domctl. I certainly agree - this was an oversight when the two aspects were split. One might argue this is a (performance) fix to the earlier commit, and hence might want to go on its own with a Fixes: tag. > Calls to memory_type_changed() resulting from changes to the domain > iomem_caps or ioport_caps ranges are only relevant for EMT > calculations if the IOMMU is not enabled, and the call has resulted in > a change to the return value of cache_flush_permitted(). I'm less certain here: These shouldn't be frequent operations, so their impact on the guest should be limited? And if we were to restrict the calls, I think we need to clearly tie together the various places which need updating together in case e.g. the condition in epte_get_entry_emt() is changed. Minimally by way of comments, but maybe by way of a small helper function (for which I can't seem to be able to think of a good name) sitting next to epte_get_entry_emt(). > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> > --- > I feel it's a bit weird to have calls to memory_type_changed() in > common domctl code - for once the domctl that trigger the call doesn't > change memory types, just adds or removes ranges from iomem_caps > (which in turn affects the behaviour of epte_get_entry_emt()). Do you have a better suggestion? Jan
On Thu, Sep 22, 2022 at 09:21:59PM +0200, Jan Beulich wrote: > On 22.09.2022 18:05, Roger Pau Monne wrote: > > memory_type_changed() is currently only implemented for Intel EPT, and > > results in the invalidation of EMT attributes on all the entries in > > the EPT page tables. Such invalidation causes EPT_MISCONFIG vmexits > > when the guest tries to access any gfns for the first time, which > > results in the recalculation of the EMT for the accessed page. The > > vmexit and the recalculations are expensive, and as such should be > > avoided when possible. > > > > Remove the call to memory_type_changed() from > > XEN_DOMCTL_memory_mapping: there are no modifications of the > > iomem_caps ranges anymore that could alter the return of > > cache_flush_permitted() from that domctl. > > I certainly agree - this was an oversight when the two aspects were > split. One might argue this is a (performance) fix to the earlier > commit, and hence might want to go on its own with a Fixes: tag. Was wondering myself, didn't add the 'Fixes:' tag because of the extra content. > > Calls to memory_type_changed() resulting from changes to the domain > > iomem_caps or ioport_caps ranges are only relevant for EMT > > calculations if the IOMMU is not enabled, and the call has resulted in > > a change to the return value of cache_flush_permitted(). > > I'm less certain here: These shouldn't be frequent operations, so > their impact on the guest should be limited? Citrix has an use case for vGPU where IOMMU regions are added and removed during guest runtime. Such functionality makes uses of both XEN_DOMCTL_iomem_permission and XEN_DOMCTL_memory_mapping. While the memory_type_changed() call in XEN_DOMCTL_memory_mapping seems to be the most problematic performance wise, I though it was nice to try to avoid memory_type_changed() as much as possible, as those tax the guest quite heavily with EPT_MISCONFIG faults and the recalculation logic. > And if we were to restrict the calls, I think we need to clearly > tie together the various places which need updating together in > case e.g. the condition in epte_get_entry_emt() is changed. > Minimally by way of comments, but maybe by way of a small helper > function (for which I can't seem to be able to think of a good > name) sitting next to epte_get_entry_emt(). Such helper function is also kind of problematic, as it would have to live in p2m-ept.c but be used in domctl.c and x86/domctl.c? It would have to go through the p2m_domain indirection structure. Do you have any suggestions about how the function should look like? I'm afraid the fact it needs the previous cache_flush_permitted() value makes it kind of weird to encapsulate. I've attempted to add comments to make it clear why the new checks are added, but I would also need to add a comment to epte_get_entry_emt() to notice changes in the condition need to be propagated to call sites of memory_type_changed(). > > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> > > --- > > I feel it's a bit weird to have calls to memory_type_changed() in > > common domctl code - for once the domctl that trigger the call doesn't > > change memory types, just adds or removes ranges from iomem_caps > > (which in turn affects the behaviour of epte_get_entry_emt()). > > Do you have a better suggestion? No, not really, because we need the return value of cache_flush_permitted() before and after the changes, so it's not as easy as introducing a single helper sadly. Thanks, Roger.
On 23.09.2022 10:35, Roger Pau Monné wrote: > On Thu, Sep 22, 2022 at 09:21:59PM +0200, Jan Beulich wrote: >> On 22.09.2022 18:05, Roger Pau Monne wrote: >>> memory_type_changed() is currently only implemented for Intel EPT, and >>> results in the invalidation of EMT attributes on all the entries in >>> the EPT page tables. Such invalidation causes EPT_MISCONFIG vmexits >>> when the guest tries to access any gfns for the first time, which >>> results in the recalculation of the EMT for the accessed page. The >>> vmexit and the recalculations are expensive, and as such should be >>> avoided when possible. >>> >>> Remove the call to memory_type_changed() from >>> XEN_DOMCTL_memory_mapping: there are no modifications of the >>> iomem_caps ranges anymore that could alter the return of >>> cache_flush_permitted() from that domctl. >> >> I certainly agree - this was an oversight when the two aspects were >> split. One might argue this is a (performance) fix to the earlier >> commit, and hence might want to go on its own with a Fixes: tag. > > Was wondering myself, didn't add the 'Fixes:' tag because of the extra > content. > >>> Calls to memory_type_changed() resulting from changes to the domain >>> iomem_caps or ioport_caps ranges are only relevant for EMT >>> calculations if the IOMMU is not enabled, and the call has resulted in >>> a change to the return value of cache_flush_permitted(). >> >> I'm less certain here: These shouldn't be frequent operations, so >> their impact on the guest should be limited? > > Citrix has an use case for vGPU where IOMMU regions are added and > removed during guest runtime. Such functionality makes uses of both > XEN_DOMCTL_iomem_permission and XEN_DOMCTL_memory_mapping. I see. Maybe this would want saying in the description, to express that there's little expected benefit for upstream. > While the memory_type_changed() call in XEN_DOMCTL_memory_mapping > seems to be the most problematic performance wise, I though it was > nice to try to avoid memory_type_changed() as much as possible, as > those tax the guest quite heavily with EPT_MISCONFIG faults and the > recalculation logic. Trying to avoid this is certainly desirable, I agree. But we need to make sure that it's not "easy" to break things by touching one place but leaving others alone which really would need keeping in sync. Therefore I'd see such added logic as acceptable only if the risk towards future changes is sufficiently low. >> And if we were to restrict the calls, I think we need to clearly >> tie together the various places which need updating together in >> case e.g. the condition in epte_get_entry_emt() is changed. >> Minimally by way of comments, but maybe by way of a small helper >> function (for which I can't seem to be able to think of a good >> name) sitting next to epte_get_entry_emt(). > > Such helper function is also kind of problematic, as it would have to > live in p2m-ept.c but be used in domctl.c and x86/domctl.c? It would > have to go through the p2m_domain indirection structure. It would need abstraction at the arch level as well as for !HVM configs on x86. I'm not sure the indirection layer would actually be needed, as the contents of the function - despite wanting placing in p2m-ept.c - isn't really vendor dependent. (If AMD/SVM gained a need for a similar helper, things would nee re-evaluating.) > Do you have any suggestions about how the function should look like? > I'm afraid the fact it needs the previous cache_flush_permitted() > value makes it kind of weird to encapsulate. Indeed. > I've attempted to add comments to make it clear why the new checks are > added, but I would also need to add a comment to epte_get_entry_emt() > to notice changes in the condition need to be propagated to call sites > of memory_type_changed(). Right - it may suffice to have one more extensive comment, but _all_ involved parties will need to have at least a cross reference such that one can easily find all pieces of code needing to be kept in sync. Jan
On Mon, Sep 26, 2022 at 09:33:10AM +0200, Jan Beulich wrote: > On 23.09.2022 10:35, Roger Pau Monné wrote: > > On Thu, Sep 22, 2022 at 09:21:59PM +0200, Jan Beulich wrote: > >> On 22.09.2022 18:05, Roger Pau Monne wrote: > >>> memory_type_changed() is currently only implemented for Intel EPT, and > >>> results in the invalidation of EMT attributes on all the entries in > >>> the EPT page tables. Such invalidation causes EPT_MISCONFIG vmexits > >>> when the guest tries to access any gfns for the first time, which > >>> results in the recalculation of the EMT for the accessed page. The > >>> vmexit and the recalculations are expensive, and as such should be > >>> avoided when possible. > >>> > >>> Remove the call to memory_type_changed() from > >>> XEN_DOMCTL_memory_mapping: there are no modifications of the > >>> iomem_caps ranges anymore that could alter the return of > >>> cache_flush_permitted() from that domctl. > >> > >> I certainly agree - this was an oversight when the two aspects were > >> split. One might argue this is a (performance) fix to the earlier > >> commit, and hence might want to go on its own with a Fixes: tag. > > > > Was wondering myself, didn't add the 'Fixes:' tag because of the extra > > content. > > > >>> Calls to memory_type_changed() resulting from changes to the domain > >>> iomem_caps or ioport_caps ranges are only relevant for EMT > >>> calculations if the IOMMU is not enabled, and the call has resulted in > >>> a change to the return value of cache_flush_permitted(). > >> > >> I'm less certain here: These shouldn't be frequent operations, so > >> their impact on the guest should be limited? > > > > Citrix has an use case for vGPU where IOMMU regions are added and > > removed during guest runtime. Such functionality makes uses of both > > XEN_DOMCTL_iomem_permission and XEN_DOMCTL_memory_mapping. > > I see. Maybe this would want saying in the description, to express > that there's little expected benefit for upstream. I guess any OS that moves BARs around will also trigger such code paths, but that might not be very common. I can add something to the description. > > While the memory_type_changed() call in XEN_DOMCTL_memory_mapping > > seems to be the most problematic performance wise, I though it was > > nice to try to avoid memory_type_changed() as much as possible, as > > those tax the guest quite heavily with EPT_MISCONFIG faults and the > > recalculation logic. > > Trying to avoid this is certainly desirable, I agree. But we need > to make sure that it's not "easy" to break things by touching one > place but leaving others alone which really would need keeping in > sync. Therefore I'd see such added logic as acceptable only if the > risk towards future changes is sufficiently low. > > >> And if we were to restrict the calls, I think we need to clearly > >> tie together the various places which need updating together in > >> case e.g. the condition in epte_get_entry_emt() is changed. > >> Minimally by way of comments, but maybe by way of a small helper > >> function (for which I can't seem to be able to think of a good > >> name) sitting next to epte_get_entry_emt(). > > > > Such helper function is also kind of problematic, as it would have to > > live in p2m-ept.c but be used in domctl.c and x86/domctl.c? It would > > have to go through the p2m_domain indirection structure. > > It would need abstraction at the arch level as well as for !HVM configs > on x86. I'm not sure the indirection layer would actually be needed, as > the contents of the function - despite wanting placing in p2m-ept.c - > isn't really vendor dependent. (If AMD/SVM gained a need for a similar > helper, things would nee re-evaluating.) Maybe it would be better to add the calls to memory_type_changed() directly in iomem_{permit,deny}_access() and ioports_{permit,deny}_access itself? That would also allow to remove the noop Arm memory_type_changed() halper. Thanks, Roger.
On Mon, Sep 26, 2022 at 04:50:22PM +0200, Roger Pau Monné wrote: > On Mon, Sep 26, 2022 at 09:33:10AM +0200, Jan Beulich wrote: > > On 23.09.2022 10:35, Roger Pau Monné wrote: > > > On Thu, Sep 22, 2022 at 09:21:59PM +0200, Jan Beulich wrote: > > >> On 22.09.2022 18:05, Roger Pau Monne wrote: > > >>> memory_type_changed() is currently only implemented for Intel EPT, and > > >>> results in the invalidation of EMT attributes on all the entries in > > >>> the EPT page tables. Such invalidation causes EPT_MISCONFIG vmexits > > >>> when the guest tries to access any gfns for the first time, which > > >>> results in the recalculation of the EMT for the accessed page. The > > >>> vmexit and the recalculations are expensive, and as such should be > > >>> avoided when possible. > > >>> > > >>> Remove the call to memory_type_changed() from > > >>> XEN_DOMCTL_memory_mapping: there are no modifications of the > > >>> iomem_caps ranges anymore that could alter the return of > > >>> cache_flush_permitted() from that domctl. > > >> > > >> I certainly agree - this was an oversight when the two aspects were > > >> split. One might argue this is a (performance) fix to the earlier > > >> commit, and hence might want to go on its own with a Fixes: tag. > > > > > > Was wondering myself, didn't add the 'Fixes:' tag because of the extra > > > content. > > > > > >>> Calls to memory_type_changed() resulting from changes to the domain > > >>> iomem_caps or ioport_caps ranges are only relevant for EMT > > >>> calculations if the IOMMU is not enabled, and the call has resulted in > > >>> a change to the return value of cache_flush_permitted(). > > >> > > >> I'm less certain here: These shouldn't be frequent operations, so > > >> their impact on the guest should be limited? > > > > > > Citrix has an use case for vGPU where IOMMU regions are added and > > > removed during guest runtime. Such functionality makes uses of both > > > XEN_DOMCTL_iomem_permission and XEN_DOMCTL_memory_mapping. > > > > I see. Maybe this would want saying in the description, to express > > that there's little expected benefit for upstream. > > I guess any OS that moves BARs around will also trigger such code > paths, but that might not be very common. I can add something to the > description. > > > > While the memory_type_changed() call in XEN_DOMCTL_memory_mapping > > > seems to be the most problematic performance wise, I though it was > > > nice to try to avoid memory_type_changed() as much as possible, as > > > those tax the guest quite heavily with EPT_MISCONFIG faults and the > > > recalculation logic. > > > > Trying to avoid this is certainly desirable, I agree. But we need > > to make sure that it's not "easy" to break things by touching one > > place but leaving others alone which really would need keeping in > > sync. Therefore I'd see such added logic as acceptable only if the > > risk towards future changes is sufficiently low. > > > > >> And if we were to restrict the calls, I think we need to clearly > > >> tie together the various places which need updating together in > > >> case e.g. the condition in epte_get_entry_emt() is changed. > > >> Minimally by way of comments, but maybe by way of a small helper > > >> function (for which I can't seem to be able to think of a good > > >> name) sitting next to epte_get_entry_emt(). > > > > > > Such helper function is also kind of problematic, as it would have to > > > live in p2m-ept.c but be used in domctl.c and x86/domctl.c? It would > > > have to go through the p2m_domain indirection structure. > > > > It would need abstraction at the arch level as well as for !HVM configs > > on x86. I'm not sure the indirection layer would actually be needed, as > > the contents of the function - despite wanting placing in p2m-ept.c - > > isn't really vendor dependent. (If AMD/SVM gained a need for a similar > > helper, things would nee re-evaluating.) > > Maybe it would be better to add the calls to memory_type_changed() > directly in iomem_{permit,deny}_access() and > ioports_{permit,deny}_access itself? > > That would also allow to remove the noop Arm memory_type_changed() > halper. Correction: the Arm memory_type_changed() needs to stay, as iomem_{permit,deny}_access() is common code. Regards, Roger.
On 26.09.2022 17:25, Roger Pau Monné wrote: > On Mon, Sep 26, 2022 at 04:50:22PM +0200, Roger Pau Monné wrote: >> On Mon, Sep 26, 2022 at 09:33:10AM +0200, Jan Beulich wrote: >>> On 23.09.2022 10:35, Roger Pau Monné wrote: >>>> On Thu, Sep 22, 2022 at 09:21:59PM +0200, Jan Beulich wrote: >>>>> On 22.09.2022 18:05, Roger Pau Monne wrote: >>>>> And if we were to restrict the calls, I think we need to clearly >>>>> tie together the various places which need updating together in >>>>> case e.g. the condition in epte_get_entry_emt() is changed. >>>>> Minimally by way of comments, but maybe by way of a small helper >>>>> function (for which I can't seem to be able to think of a good >>>>> name) sitting next to epte_get_entry_emt(). >>>> >>>> Such helper function is also kind of problematic, as it would have to >>>> live in p2m-ept.c but be used in domctl.c and x86/domctl.c? It would >>>> have to go through the p2m_domain indirection structure. >>> >>> It would need abstraction at the arch level as well as for !HVM configs >>> on x86. I'm not sure the indirection layer would actually be needed, as >>> the contents of the function - despite wanting placing in p2m-ept.c - >>> isn't really vendor dependent. (If AMD/SVM gained a need for a similar >>> helper, things would nee re-evaluating.) >> >> Maybe it would be better to add the calls to memory_type_changed() >> directly in iomem_{permit,deny}_access() and >> ioports_{permit,deny}_access itself? I'm of two minds - on one hand that would nicely take the call "out of sight", but otoh this would feel like a layering violation. Yet then maybe it's a layering violation no matter where that call lives. >> That would also allow to remove the noop Arm memory_type_changed() >> halper. > > Correction: the Arm memory_type_changed() needs to stay, as > iomem_{permit,deny}_access() is common code. Right, or we'd need some other arch abstraction. (I wonder whether long term Arm can actually get away without this. Even on the AMD side of x86 I don't think it's quite right that adding/removing of MMIO ranges has no effect on the memory type of accesses.) Jan
On Mon, Sep 26, 2022 at 05:36:41PM +0200, Jan Beulich wrote: > On 26.09.2022 17:25, Roger Pau Monné wrote: > > On Mon, Sep 26, 2022 at 04:50:22PM +0200, Roger Pau Monné wrote: > >> On Mon, Sep 26, 2022 at 09:33:10AM +0200, Jan Beulich wrote: > >>> On 23.09.2022 10:35, Roger Pau Monné wrote: > >>>> On Thu, Sep 22, 2022 at 09:21:59PM +0200, Jan Beulich wrote: > >>>>> On 22.09.2022 18:05, Roger Pau Monne wrote: > >>>>> And if we were to restrict the calls, I think we need to clearly > >>>>> tie together the various places which need updating together in > >>>>> case e.g. the condition in epte_get_entry_emt() is changed. > >>>>> Minimally by way of comments, but maybe by way of a small helper > >>>>> function (for which I can't seem to be able to think of a good > >>>>> name) sitting next to epte_get_entry_emt(). > >>>> > >>>> Such helper function is also kind of problematic, as it would have to > >>>> live in p2m-ept.c but be used in domctl.c and x86/domctl.c? It would > >>>> have to go through the p2m_domain indirection structure. > >>> > >>> It would need abstraction at the arch level as well as for !HVM configs > >>> on x86. I'm not sure the indirection layer would actually be needed, as > >>> the contents of the function - despite wanting placing in p2m-ept.c - > >>> isn't really vendor dependent. (If AMD/SVM gained a need for a similar > >>> helper, things would nee re-evaluating.) > >> > >> Maybe it would be better to add the calls to memory_type_changed() > >> directly in iomem_{permit,deny}_access() and > >> ioports_{permit,deny}_access itself? > > I'm of two minds - on one hand that would nicely take the call "out of > sight", but otoh this would feel like a layering violation. Yet then > maybe it's a layering violation no matter where that call lives. Kind of, I think it's slightly better than having the callers take care of calling memory_type_changed(), and prevents new users of {iomem,ioports}_{permit,deny}_access() missing the calls to memory_type_changed(). Let me post what I have with this approach. > >> That would also allow to remove the noop Arm memory_type_changed() > >> halper. > > > > Correction: the Arm memory_type_changed() needs to stay, as > > iomem_{permit,deny}_access() is common code. > > Right, or we'd need some other arch abstraction. (I wonder whether > long term Arm can actually get away without this. Even on the AMD side > of x86 I don't think it's quite right that adding/removing of MMIO > ranges has no effect on the memory type of accesses.) IIRC there's no way for the hypervisor to infer cache attributes on AMD SVM for NPT entries, but maybe I'm missing something. Guest MTRRs settings are completely ignored for AMD guests. I'm not able ATM however to find in the AMD PM how effective cache attributes are calculated when using NPT however. I would guess host MTRR + guest PAT? Thanks, Roger.
On 22/09/2022 17:05, Roger Pau Monne wrote: > memory_type_changed() is currently only implemented for Intel EPT, and > results in the invalidation of EMT attributes on all the entries in > the EPT page tables. Such invalidation causes EPT_MISCONFIG vmexits > when the guest tries to access any gfns for the first time, which > results in the recalculation of the EMT for the accessed page. The > vmexit and the recalculations are expensive, and as such should be > avoided when possible. > > Remove the call to memory_type_changed() from > XEN_DOMCTL_memory_mapping: there are no modifications of the > iomem_caps ranges anymore that could alter the return of > cache_flush_permitted() from that domctl. > > Calls to memory_type_changed() resulting from changes to the domain > iomem_caps or ioport_caps ranges are only relevant for EMT > calculations if the IOMMU is not enabled, and the call has resulted in > a change to the return value of cache_flush_permitted(). This, and the perf problem Citrix have found, is caused by a more fundamental bug which I identified during XSA-402. Xen is written with assumption that cacheability other than WB is dependent on having devices. While this is perhaps true of current configurations available, it is a layering violation, and will cease being true in order to support encrypted RAM (and by extension, encrypted VMs). At the moment, we know the IOMMU-ness of a domain right from the outset, but the cacheability permits are dynamic, based on the non-emptyness of the domain's device list, ioport list, and various others. All the memory_type_changed() calls here are to cover the fact that the original design was buggy by not having the cacheability-ness part of domain create in the first place. The appropriate fix, but definitely 4.18 work at this point, is to have a new CDF flag which permits the use of non-WB cacheability. For testing purposes alone, turning it on on an otherwise "plain VM" is useful (its how I actually debugged XSA-402, and the only sane way to go about investigating the MTRR per disasters for VGPU VMs[1]), but for regular usecases, it wants cross-checking with the IOMMU flag (and encrypted VM flag in the future), and for all dynamic list checks to turn into a simple 'd->config & CDF_full_cacheability'. This way, we delete all calls to memory_type_changed() which are trying to cover the various dynamic lists becoming empty/non-empty, and we remove several ordering-of-hypercalls bugs where non-cacheable mappings can't actually be created on a VM declared to have an IOMMU until a device has actually been assigned to start with. ~Andrew [1] MTRR handling is also buggy with reduced cacheability, causing some areas of RAM to be used UC; notably the grant table. This manifests as PV device perf being worse than qemu-emulated device perf, only when a GPU is added to a VM[2]. Instead of fixing this properly, it was hacked around by forcing IPAT=1 for Xenheap pages, which only "fixed" the problem on Intel (AMD has no equivalent mechanism), and needs reverting and fixing properly (i.e. get the vMTRR layout working correctly) to support VMs with encrypted RAM. [2] There's a second bug with memory_type_changed() in that it causes dreadful system performance during VM migration, which is something to do with the interaction of restoring vMTRRs for a VM that has a device but isn't running yet. This still needs investigating, and I suspect it's got a similar root cause.
On 26.09.2022 17:58, Roger Pau Monné wrote: > On Mon, Sep 26, 2022 at 05:36:41PM +0200, Jan Beulich wrote: >> On 26.09.2022 17:25, Roger Pau Monné wrote: >>> Correction: the Arm memory_type_changed() needs to stay, as >>> iomem_{permit,deny}_access() is common code. >> >> Right, or we'd need some other arch abstraction. (I wonder whether >> long term Arm can actually get away without this. Even on the AMD side >> of x86 I don't think it's quite right that adding/removing of MMIO >> ranges has no effect on the memory type of accesses.) > > IIRC there's no way for the hypervisor to infer cache attributes on > AMD SVM for NPT entries, but maybe I'm missing something. Guest MTRRs > settings are completely ignored for AMD guests. Right, as documented: "Note that there is no hardware support for guest MTRRs; the VMM can simulate their effect by altering the memory types in the nested page tables." That's something we imo should do, but which I don't think we actually do (see p2m_type_to_flags()). We respect the PAT bit when splitting large pages, but I don't think we ever set the bit when making new / updated entries. > I'm not able ATM > however to find in the AMD PM how effective cache attributes are > calculated when using NPT however. I would guess host MTRR + guest > PAT? First guest and host PAT are combined, then the result is combined with (host) MTRR. See the tables in the "Nested Paging" sub-section "Combining Memory Types, MTRRs". Of course things are quite a bit more limited (but also simpler) in shadow mode. Jan
On Tue, Sep 27, 2022 at 08:35:20AM +0200, Jan Beulich wrote: > On 26.09.2022 17:58, Roger Pau Monné wrote: > > On Mon, Sep 26, 2022 at 05:36:41PM +0200, Jan Beulich wrote: > >> On 26.09.2022 17:25, Roger Pau Monné wrote: > >>> Correction: the Arm memory_type_changed() needs to stay, as > >>> iomem_{permit,deny}_access() is common code. > >> > >> Right, or we'd need some other arch abstraction. (I wonder whether > >> long term Arm can actually get away without this. Even on the AMD side > >> of x86 I don't think it's quite right that adding/removing of MMIO > >> ranges has no effect on the memory type of accesses.) > > > > IIRC there's no way for the hypervisor to infer cache attributes on > > AMD SVM for NPT entries, but maybe I'm missing something. Guest MTRRs > > settings are completely ignored for AMD guests. > > Right, as documented: "Note that there is no hardware support for guest > MTRRs; the VMM can simulate their effect by altering the memory types > in the nested page tables." That's something we imo should do, but which > I don't think we actually do (see p2m_type_to_flags()). We respect the > PAT bit when splitting large pages, but I don't think we ever set the > bit when making new / updated entries. > > > I'm not able ATM > > however to find in the AMD PM how effective cache attributes are > > calculated when using NPT however. I would guess host MTRR + guest > > PAT? > > First guest and host PAT are combined, then the result is combined with > (host) MTRR. See the tables in the "Nested Paging" sub-section "Combining > Memory Types, MTRRs". Of course things are quite a bit more limited (but > also simpler) in shadow mode. Thanks, so we could indeed do something similar as to what we do for Intel and set a cache attribute in the nested page tables, at which point we would need epte_get_entry_emt() to be not EPT specific anymore. I've created: https://gitlab.com/xen-project/xen/-/issues/88 To have some reminder of this pending work, or else I would forget. Thanks, Roger.
On Mon, Sep 26, 2022 at 06:03:24PM +0000, Andrew Cooper wrote: > On 22/09/2022 17:05, Roger Pau Monne wrote: > > memory_type_changed() is currently only implemented for Intel EPT, and > > results in the invalidation of EMT attributes on all the entries in > > the EPT page tables. Such invalidation causes EPT_MISCONFIG vmexits > > when the guest tries to access any gfns for the first time, which > > results in the recalculation of the EMT for the accessed page. The > > vmexit and the recalculations are expensive, and as such should be > > avoided when possible. > > > > Remove the call to memory_type_changed() from > > XEN_DOMCTL_memory_mapping: there are no modifications of the > > iomem_caps ranges anymore that could alter the return of > > cache_flush_permitted() from that domctl. > > > > Calls to memory_type_changed() resulting from changes to the domain > > iomem_caps or ioport_caps ranges are only relevant for EMT > > calculations if the IOMMU is not enabled, and the call has resulted in > > a change to the return value of cache_flush_permitted(). > > This, and the perf problem Citrix have found, is caused by a more > fundamental bug which I identified during XSA-402. > > Xen is written with assumption that cacheability other than WB is > dependent on having devices. While this is perhaps true of current > configurations available, it is a layering violation, and will cease > being true in order to support encrypted RAM (and by extension, > encrypted VMs). I assumed this was done as a performance improvement (or hack). > At the moment, we know the IOMMU-ness of a domain right from the outset, > but the cacheability permits are dynamic, based on the non-emptyness of > the domain's device list, ioport list, and various others. Well, as long as there's an IOMMU assigned cacheability will be calculated taking into account the guest MTRR values, and won't be forced to WB, regardless of the emptiness of the device or IO ports/memory capability lists. Just setting `passthrough=1` on the guest config file without any devices actually passed through should prevent the forcing WB. The use case of allowing io memory or ports to be assigned to an HVM guest without an IOMMU has always confused me, but I guess if it's there it's because it's used by someone. > All the memory_type_changed() calls here are to cover the fact that the > original design was buggy by not having the cacheability-ness part of > domain create in the first place. > > The appropriate fix, but definitely 4.18 work at this point, is to have > a new CDF flag which permits the use of non-WB cacheability. > > For testing purposes alone, turning it on on an otherwise "plain VM" is > useful (its how I actually debugged XSA-402, and the only sane way to go > about investigating the MTRR per disasters for VGPU VMs[1]), Wasn't it enough to set `passthrough=1` in order to enable cacheability for debugging purposes? (not that I oppose to adding the flag, just curious why enabling the IOMMU wasn't enough). > but for > regular usecases, it wants cross-checking with the IOMMU flag (and > encrypted VM flag in the future), and for all dynamic list checks to > turn into a simple 'd->config & CDF_full_cacheability'. > > This way, we delete all calls to memory_type_changed() which are trying > to cover the various dynamic lists becoming empty/non-empty, and we > remove several ordering-of-hypercalls bugs where non-cacheable mappings > can't actually be created on a VM declared to have an IOMMU until a > device has actually been assigned to start with. It should be possible with current code to create non-cacheable mappings on a VM as long as the IOMMU is enabled, regardless of whether no devices are assigned to the VM. > ~Andrew > > [1] MTRR handling is also buggy with reduced cacheability, causing some > areas of RAM to be used UC; notably the grant table. This manifests as > PV device perf being worse than qemu-emulated device perf, only when a > GPU is added to a VM[2]. Instead of fixing this properly, it was hacked > around by forcing IPAT=1 for Xenheap pages, which only "fixed" the > problem on Intel (AMD has no equivalent mechanism), and needs reverting > and fixing properly (i.e. get the vMTRR layout working correctly) to > support VMs with encrypted RAM. My understanding of the original problem was slightly different: we place the grant table in a BAR region of the xenpci device, and hence the gMTRRs are set to UC by the guest. The workaround in Xen is to cope with existing Windows guest drivers not setting the gMTRR values for the grant table frames to WB. Even if we calculate all the cache attributes correctly we would still need the 'hack'. > > [2] There's a second bug with memory_type_changed() in that it causes > dreadful system performance during VM migration, which is something to > do with the interaction of restoring vMTRRs for a VM that has a device > but isn't running yet. This still needs investigating, and I suspect > it's got a similar root cause. XenServer seems to have a custom patch for this: https://github.com/xenserver/xen.pg/blob/XS-8.3.x/patches/0001-x86-HVM-Avoid-cache-flush-operations-during-hvm_load.patch But we could likely avoid the flush completely if the VM hasn't been started yet. Thanks, Roger.
diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c index 020df615bd..f1150dffa5 100644 --- a/xen/arch/x86/domctl.c +++ b/xen/arch/x86/domctl.c @@ -222,6 +222,7 @@ long arch_do_domctl( unsigned int fp = domctl->u.ioport_permission.first_port; unsigned int np = domctl->u.ioport_permission.nr_ports; int allow = domctl->u.ioport_permission.allow_access; + bool flush_permitted = cache_flush_permitted(d); if ( (fp + np) <= fp || (fp + np) > MAX_IOPORTS ) ret = -EINVAL; @@ -232,7 +233,13 @@ long arch_do_domctl( ret = ioports_permit_access(d, fp, fp + np - 1); else ret = ioports_deny_access(d, fp, fp + np - 1); - if ( !ret ) + if ( !ret && !is_iommu_enabled(d) && + flush_permitted != cache_flush_permitted(d) ) + /* + * Only flush if the output of cache_flush_permitted() changes and + * IOMMU is not enabled for the domain, otherwise it makes no + * difference for EMT calculation purposes. + */ memory_type_changed(d); break; } @@ -586,6 +593,7 @@ long arch_do_domctl( struct hvm_domain *hvm; struct g2m_ioport *g2m_ioport; int found = 0; + bool flush_permitted = cache_flush_permitted(d); ret = -EOPNOTSUPP; if ( !is_hvm_domain(d) ) @@ -666,7 +674,13 @@ long arch_do_domctl( "ioport_map: error %ld denying dom%d access to [%x,%x]\n", ret, d->domain_id, fmp, fmp + np - 1); } - if ( !ret ) + if ( !ret && !is_iommu_enabled(d) && + flush_permitted != cache_flush_permitted(d) ) + /* + * Only flush if the output of cache_flush_permitted() changes and + * IOMMU is not enabled for the domain, otherwise it makes no + * difference for EMT calculation purposes. + */ memory_type_changed(d); break; } diff --git a/xen/common/domctl.c b/xen/common/domctl.c index 452266710a..1f2f2dfcc2 100644 --- a/xen/common/domctl.c +++ b/xen/common/domctl.c @@ -703,6 +703,7 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl) unsigned long mfn = op->u.iomem_permission.first_mfn; unsigned long nr_mfns = op->u.iomem_permission.nr_mfns; int allow = op->u.iomem_permission.allow_access; + bool flush_permitted = cache_flush_permitted(d); ret = -EINVAL; if ( (mfn + nr_mfns - 1) < mfn ) /* wrap? */ @@ -716,7 +717,13 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl) ret = iomem_permit_access(d, mfn, mfn + nr_mfns - 1); else ret = iomem_deny_access(d, mfn, mfn + nr_mfns - 1); - if ( !ret ) + if ( !ret && !is_iommu_enabled(d) && + flush_permitted != cache_flush_permitted(d) ) + /* + * Only flush if the output of cache_flush_permitted() changes and + * IOMMU is not enabled for the domain, otherwise it makes no + * difference for effective cache attribute calculation purposes. + */ memory_type_changed(d); break; } @@ -778,8 +785,6 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl) "memory_map: error %ld removing dom%d access to [%lx,%lx]\n", ret, d->domain_id, mfn, mfn_end); } - /* Do this unconditionally to cover errors on above failure paths. */ - memory_type_changed(d); break; }
memory_type_changed() is currently only implemented for Intel EPT, and results in the invalidation of EMT attributes on all the entries in the EPT page tables. Such invalidation causes EPT_MISCONFIG vmexits when the guest tries to access any gfns for the first time, which results in the recalculation of the EMT for the accessed page. The vmexit and the recalculations are expensive, and as such should be avoided when possible. Remove the call to memory_type_changed() from XEN_DOMCTL_memory_mapping: there are no modifications of the iomem_caps ranges anymore that could alter the return of cache_flush_permitted() from that domctl. Calls to memory_type_changed() resulting from changes to the domain iomem_caps or ioport_caps ranges are only relevant for EMT calculations if the IOMMU is not enabled, and the call has resulted in a change to the return value of cache_flush_permitted(). Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> --- I feel it's a bit weird to have calls to memory_type_changed() in common domctl code - for once the domctl that trigger the call doesn't change memory types, just adds or removes ranges from iomem_caps (which in turn affects the behaviour of epte_get_entry_emt()). --- xen/arch/x86/domctl.c | 18 ++++++++++++++++-- xen/common/domctl.c | 11 ++++++++--- 2 files changed, 24 insertions(+), 5 deletions(-)