[v4,10/21] AMD/IOMMU: allow use of superpage mappings

Message ID	5866e22e-9f31-84ab-1df9-db84aa802944@suse.com (mailing list archive)
State	Superseded
Headers	show Return-Path: <xen-devel-bounces@lists.xenproject.org> Errors-To: xen-devel-bounces@lists.xenproject.org Precedence: list Sender: "Xen-devel" <xen-devel-bounces@lists.xenproject.org> Message-ID: <5866e22e-9f31-84ab-1df9-db84aa802944@suse.com> Date: Mon, 25 Apr 2022 10:38:06 +0200 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Thunderbird/91.8.1 Subject: [PATCH v4 10/21] AMD/IOMMU: allow use of superpage mappings Content-Language: en-US From: Jan Beulich <jbeulich@suse.com> To: "xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org> Cc: Andrew Cooper <andrew.cooper3@citrix.com>, Paul Durrant <paul@xen.org>, =?utf-8?q?Roger_Pau_Monn=C3=A9?= <roger.pau@citrix.com> References: <b92e294e-7277-d977-bb96-7c28d60000c6@suse.com> In-Reply-To: <b92e294e-7277-d977-bb96-7c28d60000c6@suse.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit MIME-Version: 1.0
Series	IOMMU: superpage support when not sharing pagetables \| expand [v4,00/21] IOMMU: superpage support when not sharing pagetables [v4,01/21] AMD/IOMMU: correct potentially-UB shifts [v4,02/21] IOMMU: simplify unmap-on-error in iommu_map() [v4,03/21] IOMMU: add order parameter to ->{,un}map_page() hooks [v4,04/21] IOMMU: have iommu_{,un}map() split requests into largest possible chunks [v4,05/21] IOMMU/x86: restrict IO-APIC mappings for PV Dom0 [v4,06/21] IOMMU/x86: perform PV Dom0 mappings in batches [v4,07/21] IOMMU/x86: support freeing of pagetables [v4,08/21] AMD/IOMMU: walk trees upon page fault [v4,09/21] AMD/IOMMU: return old PTE from {set,clear}_iommu_pte_present() [v4,10/21] AMD/IOMMU: allow use of superpage mappings [v4,11/21] VT-d: allow use of superpage mappings [v4,12/21] IOMMU: fold flush-all hook into "flush one" [v4,13/21] IOMMU/x86: prefill newly allocate page tables [v4,14/21] x86: introduce helper for recording degree of contiguity in page tables [v4,15/21] AMD/IOMMU: free all-empty page tables [v4,16/21] VT-d: free all-empty page tables [v4,17/21] AMD/IOMMU: replace all-contiguous page tables by superpage mappings [v4,18/21] VT-d: replace all-contiguous page tables by superpage mappings [v4,19/21] IOMMU/x86: add perf counters for page table splitting / coalescing [v4,20/21] VT-d: fold iommu_flush_iotlb{,_pages}() [v4,21/21] VT-d: fold dma_pte_clear_one() into its only caller

Message ID

5866e22e-9f31-84ab-1df9-db84aa802944@suse.com (mailing list archive)

State

Superseded

Headers

Errors-To: xen-devel-bounces@lists.xenproject.org
Precedence: list
Sender: "Xen-devel" <xen-devel-bounces@lists.xenproject.org>
Message-ID: <5866e22e-9f31-84ab-1df9-db84aa802944@suse.com>
Date: Mon, 25 Apr 2022 10:38:06 +0200
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
 Thunderbird/91.8.1
Subject: [PATCH v4 10/21] AMD/IOMMU: allow use of superpage mappings
Content-Language: en-US
From: Jan Beulich <jbeulich@suse.com>
To: "xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>, Paul Durrant <paul@xen.org>,
	=?utf-8?q?Roger_Pau_Monn=C3=A9?= <roger.pau@citrix.com>
References: <b92e294e-7277-d977-bb96-7c28d60000c6@suse.com>
In-Reply-To: <b92e294e-7277-d977-bb96-7c28d60000c6@suse.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
MIME-Version: 1.0
X-MS-Exchange-AntiSpam-MessageData-ChunkCount: 1
X-MS-Exchange-AntiSpam-MessageData-0: =?utf-8?q?1/OZyOGn8SDmLpaVacKj53o1gQl1?=
	=?utf-8?q?nC1tFbofV9cmn2yWEjurIFgd91usNm2Fmx76GDy7lPBYIniM1joBNAzz21eEi/YJQ?=
	=?utf-8?q?8E/Sq1qWSSp9JmJ70BKF5spJ1wq9qrZdZwnx8gppLwE4KGLjBXOv++VvuxyejhH3/?=
	=?utf-8?q?9oTASXt8aHX1HgAS01efDAjpO6i7g17pkDDPALfZUW/y6SSB7A6mrDQxBVbcueKgR?=
	=?utf-8?q?ELHHCclVDFS6GfpcLhh5QmqemNizD1mN6FzCNH17Cde4ZVmfy4hI/Blj091cFwPa+?=
	=?utf-8?q?PtDa402WC8bdB7hJ9Xu2eKBYdjI7xxBEuY99Q0pp2DuqYn0OV1UmnQEibQ5nU1ppn?=
	=?utf-8?q?G4sNJ0dOtB1u9F77B6KBEGMZ4AFQ10eu8k6Dj7+rJuOsKfn9N5Ju9xZ3/kmQpEHRX?=
	=?utf-8?q?cGpq5zgbYZA6bsFG2z0rxqeTuaSTqA3dyfwTZT4LooaqX1w8ObbE39QgeO3FRh6uX?=
	=?utf-8?q?+cz0z7teLBTw91iWjn97ZqwEJVmdbMU2LiU9ateWf+J8qynBc5wvyk1aR/svbtof6?=
	=?utf-8?q?3PjTkzaMbiovPFa4pkPGKUX9bMvavmOD8m4gawbcqGV6ojnrxBD8gGh7bairswp6F?=
	=?utf-8?q?qKur4nlZupGu7zc7dapLqojyiPIVLNu+sfLfXH3RxzWgJejHWuzysvOXQjkGWxHiC?=
	=?utf-8?q?U3Y+d03GJfrhCoNukOwx3QwA6vMjs61cuJgNgc0idmZad7WCNbqugqivwhguGguSt?=
	=?utf-8?q?eD5/KcW1e6Otk/ESdGZC98EzyY61NZL6JFlWQbSyt84QWSCN0tTazeAJ85ZXGSC5k?=
	=?utf-8?q?iy2tk10Jf+PbfpX8T54Gfb2M3KUSfbn3DVzJobmyh+hi9VtQvPcKr899t/4HEcJec?=
	=?utf-8?q?68536MQFbcNCK9ot1ZaTkDNZNifTBCt05DVqqmE3vKqEc/2XBSqOwxtym8xbWi/5K?=
	=?utf-8?q?N/Hoo25PJuY+RLQa7Mu/i6WyufbTXHn6cBqSkmF/P3C6YllhK3PpPY1Hkbiow9f/t?=
	=?utf-8?q?sVWZYgJhooV44e3FbqP66jf/yBFpuFx+NgNYcfqBDORQwE2xH7SQx21e9CfkJxI3V?=
	=?utf-8?q?fOD829OYBJUBWSuGRZS0s6/SUysW4OJjXqN/CI/iZzyM3wBGMBx6Y9gGQadxpJvu9?=
	=?utf-8?q?xIhvZ/GzBpFwrQxPYiNd7gN5YnCcA74flo6LHsIjy2VGsdaWNGeb+M8JL/ceIJezY?=
	=?utf-8?q?VWvvZoM2itkkVxjh9YBKbh5yUy4ZM6J1s1h4ta29SQdRgo6eNt1B0m4H2jB3FC1rN?=
	=?utf-8?q?tyUO/zqNaHlGLP5d2A3baGRCJ/paG19A1H0eZRxcQpMqhIBB5OrHjBlCPoUUN8G89?=
	=?utf-8?q?yQIJ9UqlSORD+qp2EO8yEoxvQLGVKyOOWF0mwtnrC4kZXRkOYIgtz9OcpLEBc8dKR?=
	=?utf-8?q?Ap0rexS7DllwKU+gypz9pU/Lrs1O/O5iENIZT3lBmWQAUqDlno+H8i2WrQ2wRIepi?=
	=?utf-8?q?wHX2BoaqmxxV82ov9kmnfcfEwMBoXPCjVGLi3pT93oNw4J3gr6JGeTYgWRMUByJ31?=
	=?utf-8?q?eiLpjJMIAik7s0bYhfvp7Wsliu03coR/VFIgUurIu2UB3pE6eFlVLRbppNsOBfir3?=
	=?utf-8?q?odgI33n0uHeRZIp6XemgHuJa4Oj1FZ7tLwpXQaI6oqh+F1ZjUqtro8nt2ocfY3EYc?=
	=?utf-8?q?JUNlBYWhSCzxQ9DnWRihOb/IhhTSWXG6iUtRUWjzryFxK+z9k8KrzjI1I/llavEWN?=
	=?utf-8?q?L5VmMwqgd7K/+yaC7DUYoM63T9QW4rMA=3D=3D?=
X-OriginatorOrg: suse.com
X-MS-Exchange-CrossTenant-Network-Message-Id: 
 d56cf4ea-ca11-43f8-2c34-08da2696eab8
X-MS-Exchange-CrossTenant-AuthSource: DU2PR04MB8616.eurprd04.prod.outlook.com
X-MS-Exchange-CrossTenant-AuthAs: Internal
X-MS-Exchange-CrossTenant-OriginalArrivalTime: 25 Apr 2022 08:38:05.2328
 (UTC)
X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted
X-MS-Exchange-CrossTenant-Id: f7a17af6-1c5c-4a36-aa8b-f5be247aa4ba
X-MS-Exchange-CrossTenant-MailboxType: HOSTED
X-MS-Exchange-CrossTenant-UserPrincipalName: 
 KSo6xq+YQlGLdbM93qG9dyJ3VhSsbd5Vboxi0sDqpRnf0RaaO16IK9bKlB8dD6nUEeR6imRzjEUM5GRcJaA+rw==
X-MS-Exchange-Transport-CrossTenantHeadersStamped: VE1PR04MB7424

Series

IOMMU: superpage support when not sharing pagetables | expand

Commit Message

Jan Beulich April 25, 2022, 8:38 a.m. UTC

No separate feature flags exist which would control availability of
these; the only restriction is HATS (establishing the maximum number of
page table levels in general), and even that has a lower bound of 4.
Thus we can unconditionally announce 2M, 1G, and 512G mappings. (Via
non-default page sizes the implementation in principle permits arbitrary
size mappings, but these require multiple identical leaf PTEs to be
written, which isn't all that different from having to write multiple
consecutive PTEs with increasing frame numbers. IMO that's therefore
beneficial only on hardware where suitable TLBs exist; I'm unaware of
such hardware.)

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I'm not fully sure about allowing 512G mappings: The scheduling-for-
freeing of intermediate page tables would take quite a while when
replacing a tree of 4k mappings by a single 512G one. Yet then again
there's no present code path via which 512G chunks of memory could be
allocated (and hence mapped) anyway, so this would only benefit huge
systems where 512 1G mappings could be re-coalesced (once suitable code
is in place) into a single L4 entry. And re-coalescing wouldn't result
in scheduling-for-freeing of full trees of lower level pagetables.
---
v4: Change type of queue_free_pt()'s 1st parameter. Re-base.
v3: Rename queue_free_pt()'s last parameter. Replace "level > 1" checks
    where possible.

Comments

Roger Pau Monne May 5, 2022, 1:19 p.m. UTC | #1

On Mon, Apr 25, 2022 at 10:38:06AM +0200, Jan Beulich wrote:
> No separate feature flags exist which would control availability of
> these; the only restriction is HATS (establishing the maximum number of
> page table levels in general), and even that has a lower bound of 4.
> Thus we can unconditionally announce 2M, 1G, and 512G mappings. (Via
> non-default page sizes the implementation in principle permits arbitrary
> size mappings, but these require multiple identical leaf PTEs to be
> written, which isn't all that different from having to write multiple
> consecutive PTEs with increasing frame numbers. IMO that's therefore
> beneficial only on hardware where suitable TLBs exist; I'm unaware of
> such hardware.)
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

> ---
> I'm not fully sure about allowing 512G mappings: The scheduling-for-
> freeing of intermediate page tables would take quite a while when
> replacing a tree of 4k mappings by a single 512G one. Yet then again
> there's no present code path via which 512G chunks of memory could be
> allocated (and hence mapped) anyway, so this would only benefit huge
> systems where 512 1G mappings could be re-coalesced (once suitable code
> is in place) into a single L4 entry. And re-coalescing wouldn't result
> in scheduling-for-freeing of full trees of lower level pagetables.

I would think part of this should go into the commit message, as to
why enabling 512G superpages is fine.

> ---
> v4: Change type of queue_free_pt()'s 1st parameter. Re-base.
> v3: Rename queue_free_pt()'s last parameter. Replace "level > 1" checks
>     where possible.
> 
> --- a/xen/drivers/passthrough/amd/iommu_map.c
> +++ b/xen/drivers/passthrough/amd/iommu_map.c
> @@ -32,12 +32,13 @@ static unsigned int pfn_to_pde_idx(unsig
>  }
>  
>  static union amd_iommu_pte clear_iommu_pte_present(unsigned long l1_mfn,
> -                                                   unsigned long dfn)
> +                                                   unsigned long dfn,
> +                                                   unsigned int level)
>  {
>      union amd_iommu_pte *table, *pte, old;
>  
>      table = map_domain_page(_mfn(l1_mfn));
> -    pte = &table[pfn_to_pde_idx(dfn, 1)];
> +    pte = &table[pfn_to_pde_idx(dfn, level)];
>      old = *pte;
>  
>      write_atomic(&pte->raw, 0);
> @@ -351,11 +352,32 @@ static int iommu_pde_from_dfn(struct dom
>      return 0;
>  }
>  
> +static void queue_free_pt(struct domain_iommu *hd, mfn_t mfn, unsigned int level)
> +{
> +    if ( level > 1 )
> +    {
> +        union amd_iommu_pte *pt = map_domain_page(mfn);
> +        unsigned int i;
> +
> +        for ( i = 0; i < PTE_PER_TABLE_SIZE; ++i )
> +            if ( pt[i].pr && pt[i].next_level )
> +            {
> +                ASSERT(pt[i].next_level < level);
> +                queue_free_pt(hd, _mfn(pt[i].mfn), pt[i].next_level);
> +            }
> +
> +        unmap_domain_page(pt);
> +    }
> +
> +    iommu_queue_free_pgtable(hd, mfn_to_page(mfn));
> +}
> +
>  int cf_check amd_iommu_map_page(
>      struct domain *d, dfn_t dfn, mfn_t mfn, unsigned int flags,
>      unsigned int *flush_flags)
>  {
>      struct domain_iommu *hd = dom_iommu(d);
> +    unsigned int level = (IOMMUF_order(flags) / PTE_PER_TABLE_SHIFT) + 1;
>      int rc;
>      unsigned long pt_mfn = 0;
>      union amd_iommu_pte old;
> @@ -384,7 +406,7 @@ int cf_check amd_iommu_map_page(
>          return rc;
>      }
>  

I think it might be helpful to assert or otherwise check that the
input order is supported by the IOMMU, just to be on the safe side.

> -    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, flush_flags, true) ||
> +    if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn, flush_flags, true) ||
>           !pt_mfn )
>      {
>          spin_unlock(&hd->arch.mapping_lock);
> @@ -394,8 +416,8 @@ int cf_check amd_iommu_map_page(
>          return -EFAULT;
>      }
>  
> -    /* Install 4k mapping */
> -    old = set_iommu_pte_present(pt_mfn, dfn_x(dfn), mfn_x(mfn), 1,
> +    /* Install mapping */
> +    old = set_iommu_pte_present(pt_mfn, dfn_x(dfn), mfn_x(mfn), level,
>                                  (flags & IOMMUF_writable),
>                                  (flags & IOMMUF_readable));
>  
> @@ -403,8 +425,13 @@ int cf_check amd_iommu_map_page(
>  
>      *flush_flags |= IOMMU_FLUSHF_added;
>      if ( old.pr )
> +    {
>          *flush_flags |= IOMMU_FLUSHF_modified;
>  
> +        if ( IOMMUF_order(flags) && old.next_level )
> +            queue_free_pt(hd, _mfn(old.mfn), old.next_level);
> +    }
> +
>      return 0;
>  }
>  
> @@ -413,6 +440,7 @@ int cf_check amd_iommu_unmap_page(
>  {
>      unsigned long pt_mfn = 0;
>      struct domain_iommu *hd = dom_iommu(d);
> +    unsigned int level = (order / PTE_PER_TABLE_SHIFT) + 1;
>      union amd_iommu_pte old = {};
>  
>      spin_lock(&hd->arch.mapping_lock);
> @@ -423,7 +451,7 @@ int cf_check amd_iommu_unmap_page(
>          return 0;
>      }
>  
> -    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, flush_flags, false) )
> +    if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn, flush_flags, false) )
>      {
>          spin_unlock(&hd->arch.mapping_lock);
>          AMD_IOMMU_ERROR("invalid IO pagetable entry dfn = %"PRI_dfn"\n",
> @@ -435,14 +463,19 @@ int cf_check amd_iommu_unmap_page(
>      if ( pt_mfn )
>      {
>          /* Mark PTE as 'page not present'. */
> -        old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn));
> +        old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level);
>      }
>  
>      spin_unlock(&hd->arch.mapping_lock);
>  
>      if ( old.pr )
> +    {
>          *flush_flags |= IOMMU_FLUSHF_modified;
>  
> +        if ( order && old.next_level )
> +            queue_free_pt(hd, _mfn(old.mfn), old.next_level);
> +    }
> +
>      return 0;
>  }
>  
> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> @@ -747,7 +747,7 @@ static void cf_check amd_dump_page_table
>  }
>  
>  static const struct iommu_ops __initconst_cf_clobber _iommu_ops = {
> -    .page_sizes = PAGE_SIZE_4K,
> +    .page_sizes = PAGE_SIZE_4K | PAGE_SIZE_2M | PAGE_SIZE_1G | PAGE_SIZE_512G,

As mentioned on a previous email, I'm worried if we ever get to
replace an entry populated with 4K pages with a 512G superpage, as the
freeing cost of the associated pagetables would be quite high.

I guess we will have to implement a more preemptive freeing behavior
if issues arise.

Thanks, Roger.

Jan Beulich May 5, 2022, 2:34 p.m. UTC | #2

On 05.05.2022 15:19, Roger Pau Monné wrote:
> On Mon, Apr 25, 2022 at 10:38:06AM +0200, Jan Beulich wrote:
>> No separate feature flags exist which would control availability of
>> these; the only restriction is HATS (establishing the maximum number of
>> page table levels in general), and even that has a lower bound of 4.
>> Thus we can unconditionally announce 2M, 1G, and 512G mappings. (Via
>> non-default page sizes the implementation in principle permits arbitrary
>> size mappings, but these require multiple identical leaf PTEs to be
>> written, which isn't all that different from having to write multiple
>> consecutive PTEs with increasing frame numbers. IMO that's therefore
>> beneficial only on hardware where suitable TLBs exist; I'm unaware of
>> such hardware.)
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

>> ---
>> I'm not fully sure about allowing 512G mappings: The scheduling-for-
>> freeing of intermediate page tables would take quite a while when
>> replacing a tree of 4k mappings by a single 512G one. Yet then again
>> there's no present code path via which 512G chunks of memory could be
>> allocated (and hence mapped) anyway, so this would only benefit huge
>> systems where 512 1G mappings could be re-coalesced (once suitable code
>> is in place) into a single L4 entry. And re-coalescing wouldn't result
>> in scheduling-for-freeing of full trees of lower level pagetables.
> 
> I would think part of this should go into the commit message, as to
> why enabling 512G superpages is fine.

Together with what you say at the bottom I wonder whether, rather than
moving this into the description in a slightly edited form, I shouldn't
drop the PAGE_SIZE_512G there. I don't think that would invalidate your
R-b.

>> @@ -384,7 +406,7 @@ int cf_check amd_iommu_map_page(
>>          return rc;
>>      }
>>  
> 
> I think it might be helpful to assert or otherwise check that the
> input order is supported by the IOMMU, just to be on the safe side.

Well, yes, I can certainly do so. Given how the code was developed it
didn't seem very likely that such a fundamental assumption could be
violated, but I guess I see your point.

Jan

>> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
>> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
>> @@ -747,7 +747,7 @@ static void cf_check amd_dump_page_table
>>  }
>>  
>>  static const struct iommu_ops __initconst_cf_clobber _iommu_ops = {
>> -    .page_sizes = PAGE_SIZE_4K,
>> +    .page_sizes = PAGE_SIZE_4K | PAGE_SIZE_2M | PAGE_SIZE_1G | PAGE_SIZE_512G,
> 
> As mentioned on a previous email, I'm worried if we ever get to
> replace an entry populated with 4K pages with a 512G superpage, as the
> freeing cost of the associated pagetables would be quite high.
> 
> I guess we will have to implement a more preemptive freeing behavior
> if issues arise.
> 
> Thanks, Roger.
>

Roger Pau Monne May 5, 2022, 3:26 p.m. UTC | #3

On Thu, May 05, 2022 at 04:34:54PM +0200, Jan Beulich wrote:
> On 05.05.2022 15:19, Roger Pau Monné wrote:
> > On Mon, Apr 25, 2022 at 10:38:06AM +0200, Jan Beulich wrote:
> >> No separate feature flags exist which would control availability of
> >> these; the only restriction is HATS (establishing the maximum number of
> >> page table levels in general), and even that has a lower bound of 4.
> >> Thus we can unconditionally announce 2M, 1G, and 512G mappings. (Via
> >> non-default page sizes the implementation in principle permits arbitrary
> >> size mappings, but these require multiple identical leaf PTEs to be
> >> written, which isn't all that different from having to write multiple
> >> consecutive PTEs with increasing frame numbers. IMO that's therefore
> >> beneficial only on hardware where suitable TLBs exist; I'm unaware of
> >> such hardware.)
> >>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> > 
> > Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> Thanks.
> 
> >> ---
> >> I'm not fully sure about allowing 512G mappings: The scheduling-for-
> >> freeing of intermediate page tables would take quite a while when
> >> replacing a tree of 4k mappings by a single 512G one. Yet then again
> >> there's no present code path via which 512G chunks of memory could be
> >> allocated (and hence mapped) anyway, so this would only benefit huge
> >> systems where 512 1G mappings could be re-coalesced (once suitable code
> >> is in place) into a single L4 entry. And re-coalescing wouldn't result
> >> in scheduling-for-freeing of full trees of lower level pagetables.
> > 
> > I would think part of this should go into the commit message, as to
> > why enabling 512G superpages is fine.
> 
> Together with what you say at the bottom I wonder whether, rather than
> moving this into the description in a slightly edited form, I shouldn't
> drop the PAGE_SIZE_512G there. I don't think that would invalidate your
> R-b.

Right, might be good to add a comment that 512G super pages could be
enabled (ie: there's no hardware limitation), but we need to be sure
that the freeing of the removed page table pages doesn't starve the
pCPU.

Thanks, Roger.

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -32,12 +32,13 @@  static unsigned int pfn_to_pde_idx(unsig
 }
 
 static union amd_iommu_pte clear_iommu_pte_present(unsigned long l1_mfn,
-                                                   unsigned long dfn)
+                                                   unsigned long dfn,
+                                                   unsigned int level)
 {
     union amd_iommu_pte *table, *pte, old;
 
     table = map_domain_page(_mfn(l1_mfn));
-    pte = &table[pfn_to_pde_idx(dfn, 1)];
+    pte = &table[pfn_to_pde_idx(dfn, level)];
     old = *pte;
 
     write_atomic(&pte->raw, 0);
@@ -351,11 +352,32 @@  static int iommu_pde_from_dfn(struct dom
     return 0;
 }
 
+static void queue_free_pt(struct domain_iommu *hd, mfn_t mfn, unsigned int level)
+{
+    if ( level > 1 )
+    {
+        union amd_iommu_pte *pt = map_domain_page(mfn);
+        unsigned int i;
+
+        for ( i = 0; i < PTE_PER_TABLE_SIZE; ++i )
+            if ( pt[i].pr && pt[i].next_level )
+            {
+                ASSERT(pt[i].next_level < level);
+                queue_free_pt(hd, _mfn(pt[i].mfn), pt[i].next_level);
+            }
+
+        unmap_domain_page(pt);
+    }
+
+    iommu_queue_free_pgtable(hd, mfn_to_page(mfn));
+}
+
 int cf_check amd_iommu_map_page(
     struct domain *d, dfn_t dfn, mfn_t mfn, unsigned int flags,
     unsigned int *flush_flags)
 {
     struct domain_iommu *hd = dom_iommu(d);
+    unsigned int level = (IOMMUF_order(flags) / PTE_PER_TABLE_SHIFT) + 1;
     int rc;
     unsigned long pt_mfn = 0;
     union amd_iommu_pte old;
@@ -384,7 +406,7 @@  int cf_check amd_iommu_map_page(
         return rc;
     }
 
-    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, flush_flags, true) ||
+    if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn, flush_flags, true) ||
          !pt_mfn )
     {
         spin_unlock(&hd->arch.mapping_lock);
@@ -394,8 +416,8 @@  int cf_check amd_iommu_map_page(
         return -EFAULT;
     }
 
-    /* Install 4k mapping */
-    old = set_iommu_pte_present(pt_mfn, dfn_x(dfn), mfn_x(mfn), 1,
+    /* Install mapping */
+    old = set_iommu_pte_present(pt_mfn, dfn_x(dfn), mfn_x(mfn), level,
                                 (flags & IOMMUF_writable),
                                 (flags & IOMMUF_readable));
 
@@ -403,8 +425,13 @@  int cf_check amd_iommu_map_page(
 
     *flush_flags |= IOMMU_FLUSHF_added;
     if ( old.pr )
+    {
         *flush_flags |= IOMMU_FLUSHF_modified;
 
+        if ( IOMMUF_order(flags) && old.next_level )
+            queue_free_pt(hd, _mfn(old.mfn), old.next_level);
+    }
+
     return 0;
 }
 
@@ -413,6 +440,7 @@  int cf_check amd_iommu_unmap_page(
 {
     unsigned long pt_mfn = 0;
     struct domain_iommu *hd = dom_iommu(d);
+    unsigned int level = (order / PTE_PER_TABLE_SHIFT) + 1;
     union amd_iommu_pte old = {};
 
     spin_lock(&hd->arch.mapping_lock);
@@ -423,7 +451,7 @@  int cf_check amd_iommu_unmap_page(
         return 0;
     }
 
-    if ( iommu_pde_from_dfn(d, dfn_x(dfn), 1, &pt_mfn, flush_flags, false) )
+    if ( iommu_pde_from_dfn(d, dfn_x(dfn), level, &pt_mfn, flush_flags, false) )
     {
         spin_unlock(&hd->arch.mapping_lock);
         AMD_IOMMU_ERROR("invalid IO pagetable entry dfn = %"PRI_dfn"\n",
@@ -435,14 +463,19 @@  int cf_check amd_iommu_unmap_page(
     if ( pt_mfn )
     {
         /* Mark PTE as 'page not present'. */
-        old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn));
+        old = clear_iommu_pte_present(pt_mfn, dfn_x(dfn), level);
     }
 
     spin_unlock(&hd->arch.mapping_lock);
 
     if ( old.pr )
+    {
         *flush_flags |= IOMMU_FLUSHF_modified;
 
+        if ( order && old.next_level )
+            queue_free_pt(hd, _mfn(old.mfn), old.next_level);
+    }
+
     return 0;
 }
 
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -747,7 +747,7 @@  static void cf_check amd_dump_page_table
 }
 
 static const struct iommu_ops __initconst_cf_clobber _iommu_ops = {
-    .page_sizes = PAGE_SIZE_4K,
+    .page_sizes = PAGE_SIZE_4K | PAGE_SIZE_2M | PAGE_SIZE_1G | PAGE_SIZE_512G,
     .init = amd_iommu_domain_init,
     .hwdom_init = amd_iommu_hwdom_init,
     .quarantine_init = amd_iommu_quarantine_init,
--- a/xen/include/xen/page-defs.h
+++ b/xen/include/xen/page-defs.h
@@ -21,4 +21,19 @@ 
 #define PAGE_MASK_64K               PAGE_MASK_GRAN(64K)
 #define PAGE_ALIGN_64K(addr)        PAGE_ALIGN_GRAN(64K, addr)
 
+#define PAGE_SHIFT_2M               21
+#define PAGE_SIZE_2M                PAGE_SIZE_GRAN(2M)
+#define PAGE_MASK_2M                PAGE_MASK_GRAN(2M)
+#define PAGE_ALIGN_2M(addr)         PAGE_ALIGN_GRAN(2M, addr)
+
+#define PAGE_SHIFT_1G               30
+#define PAGE_SIZE_1G                PAGE_SIZE_GRAN(1G)
+#define PAGE_MASK_1G                PAGE_MASK_GRAN(1G)
+#define PAGE_ALIGN_1G(addr)         PAGE_ALIGN_GRAN(1G, addr)
+
+#define PAGE_SHIFT_512G             39
+#define PAGE_SIZE_512G              PAGE_SIZE_GRAN(512G)
+#define PAGE_MASK_512G              PAGE_MASK_GRAN(512G)
+#define PAGE_ALIGN_512G(addr)       PAGE_ALIGN_GRAN(512G, addr)
+
 #endif /* __XEN_PAGE_DEFS_H__ */

[v4,10/21] AMD/IOMMU: allow use of superpage mappings

Commit Message

Comments

Patch