x86/pv: Flush TLB in response to paging structure changes

Message ID	20201021130708.12249-1-andrew.cooper3@citrix.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=90EN=D4=lists.xenproject.org=xen-devel-bounces@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org F134122454 Errors-To: xen-devel-bounces@lists.xenproject.org Precedence: list Sender: "Xen-devel" <xen-devel-bounces@lists.xenproject.org> IronPort-SDR: QzQyDFug2bnDXNed5yYiXo+hSko/zYmU8QQ2+xRD9bJ0g7yu6l7W1VzEcalKqPv3t+mat6uoQl bAoB+L4QBLt0rFPJTY3oYhL9DylF0ab4g3tKffZDd6XH5p6HJybUD3u/5Gxjfz/hDvuoLPfzTd rVAYeESMVoxulph7lOXVN6VYo1fsXrQoy6aspq2zNrTELMbMTLujkr0XoBpHv4EGG4FJZZNAau 8VLiMYevf1MZpEVD3oArMdy2qO4fKLyJR+yoYBAmu+RpmRUKWIXGjmeJX7cKzk/vKkbJAiRHfX t3Q= From: Andrew Cooper <andrew.cooper3@citrix.com> To: Xen-devel <xen-devel@lists.xenproject.org> CC: Andrew Cooper <andrew.cooper3@citrix.com>, Jan Beulich <JBeulich@suse.com>, =?utf-8?q?Roger_Pau_Monn=C3=A9?= <roger.pau@citrix.com>, Wei Liu <wl@xen.org> Subject: [PATCH] x86/pv: Flush TLB in response to paging structure changes Date: Wed, 21 Oct 2020 14:07:08 +0100 Message-ID: <20201021130708.12249-1-andrew.cooper3@citrix.com> MIME-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 8bit
Series	x86/pv: Flush TLB in response to paging structure changes \| expand x86/pv: Flush TLB in response to paging structure changes

Andrew Cooper Oct. 21, 2020, 1:07 p.m. UTC

With MMU_UPDATE, a PV guest can make changes to higher level pagetables.  This
is from Xen's point of view (as the update only affects guest mappings), and
the guest is required to flush suitably after making updates.

However, Xen's use of linear pagetables (UPDATE_VA_MAPPING, GNTTABOP_map,
writeable pagetables, etc.) is an implementation detail outside of the
API/ABI.

Changes in the paging structure require invalidations in the linear pagetable
range for subsequent accesses into the linear pagetables to access non-stale
mappings.  Xen must provide suitable flushing to prevent intermixed guest
actions from accidentally accessing/modifying the wrong pagetable.

For all L2 and higher modifications, flush the TLB.  PV guests cannot create
L2 or higher entries with the Global bit set, so no mappings established in
the linear range can be global.  (This could in principle be an order 39 flush
starting at LINEAR_PT_VIRT_START, but no such mechanism exists in practice.)

This combines with sync_guest for XPTI L4 "shadowing", but has some asymmetry
between local and remote flush requirements.  Replace the sync_guest boolean
with flush_flags_{local,all} and accumulate flags, performing all required
flushing at the end.

This is XSA-286.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>

v2:
 * Use two separate flush flags.
 * Use non-global flushes.
---
 xen/arch/x86/mm.c | 61 +++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 43 insertions(+), 18 deletions(-)

Jan Beulich Oct. 21, 2020, 1:56 p.m. UTC | #1

On 21.10.2020 15:07, Andrew Cooper wrote:
> @@ -4037,6 +4037,9 @@ long do_mmu_update(
>                          break;
>                      rc = mod_l2_entry(va, l2e_from_intpte(req.val), mfn,
>                                        cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
> +                    /* Paging structure maybe changed.  Flush linear range. */
> +                    if ( !rc )
> +                        flush_flags_all |= FLUSH_TLB;
>                      break;
>  
>                  case PGT_l3_page_table:
> @@ -4044,6 +4047,9 @@ long do_mmu_update(
>                          break;
>                      rc = mod_l3_entry(va, l3e_from_intpte(req.val), mfn,
>                                        cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
> +                    /* Paging structure maybe changed.  Flush linear range. */
> +                    if ( !rc )
> +                        flush_flags_all |= FLUSH_TLB;
>                      break;
>  
>                  case PGT_l4_page_table:
> @@ -4051,27 +4057,28 @@ long do_mmu_update(
>                          break;
>                      rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
>                                        cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
> -                    if ( !rc && pt_owner->arch.pv.xpti )
> +                    /* Paging structure maybe changed.  Flush linear range. */
> +                    if ( !rc )
>                      {
> -                        bool local_in_use = false;
> +                        bool local_in_use = mfn_eq(
> +                            pagetable_get_mfn(curr->arch.guest_table), mfn);
>  
> -                        if ( mfn_eq(pagetable_get_mfn(curr->arch.guest_table),
> -                                    mfn) )
> -                        {
> -                            local_in_use = true;
> -                            get_cpu_info()->root_pgt_changed = true;
> -                        }
> +                        flush_flags_all |= FLUSH_TLB;
> +
> +                        if ( local_in_use )
> +                            flush_flags_local |= FLUSH_TLB | FLUSH_ROOT_PGTBL;

Aiui here (and in the code consuming the flags) you build upon
flush_flags_local, when not zero, always being a superset of
flush_flags_all. I think this is a trap to fall into when later
wanting to change this code, but as per below this won't hold
anymore anyway, I think. Hence here I think you want to set
FLUSH_TLB unconditionally, and above for L3 and L2 you want to
set it in both variables. Or, if I'm wrong below, a comment to
that effect may help people avoid falling into this trap.

An alternative would be to have

    flush_flags_local |= (flush_flags_all & FLUSH_TLB);

before doing the actual flush.

>                          /*
>                           * No need to sync if all uses of the page can be
>                           * accounted to the page lock we hold, its pinned
>                           * status, and uses on this (v)CPU.
>                           */
> -                        if ( (page->u.inuse.type_info & PGT_count_mask) >
> +                        if ( pt_owner->arch.pv.xpti &&

I assume you've moved this here to avoid the further non-trivial
checks when possible, but you've not put it around the setting
of FLUSH_ROOT_PGTBL in flush_flags_local because setting
->root_pgt_changed is benign when !XPTI?

> +                             (page->u.inuse.type_info & PGT_count_mask) >
>                               (1 + !!(page->u.inuse.type_info & PGT_pinned) +
>                                mfn_eq(pagetable_get_mfn(curr->arch.guest_table_user),
>                                       mfn) + local_in_use) )
> -                            sync_guest = true;
> +                            flush_flags_all |= FLUSH_ROOT_PGTBL;

This needs to also specify FLUSH_TLB_GLOBAL, or else original
XPTI behavior gets broken. Since the local CPU doesn't need this,
the variable may then better be named flush_flags_remote?

Or if I'm wrong here, the reasoning behind the dropping of the
global flush in this case needs putting in the description, not
the least because it would mean the change introducing it went
too far.

> @@ -4173,18 +4180,36 @@ long do_mmu_update(
>      if ( va )
>          unmap_domain_page(va);
>  
> -    if ( sync_guest )
> +    /*
> +     * Flushing needs to occur for one of several reasons.
> +     *
> +     * 1) An update to an L2 or higher occured.  This potentially changes the
> +     *    pagetable structure, requiring a flush of the linear range.
> +     * 2) An update to an L4 occured, and XPTI is enabled.  All CPUs running
> +     *    on a copy of this L4 need refreshing.
> +     */
> +    if ( flush_flags_all || flush_flags_local )

Minor remark: At least in x86 code it is more efficient to use
| instead of || in such cases, to avoid relying on the compiler
to carry out this small optimization. It may well be that all
compilers we care about do so, but it's certainly not the case
for all the compilers I've ever worked with.

>      {
> +        cpumask_t *mask = pt_owner->dirty_cpumask;
> +
>          /*
> -         * Force other vCPU-s of the affected guest to pick up L4 entry
> -         * changes (if any).
> +         * Local flushing may be asymmetric with remote.  If there is local
> +         * flushing to do, perform it separately and omit the current CPU from
> +         * pt_owner->dirty_cpumask.
>           */
> -        unsigned int cpu = smp_processor_id();
> -        cpumask_t *mask = per_cpu(scratch_cpumask, cpu);
> +        if ( flush_flags_local )
> +        {
> +            unsigned int cpu = smp_processor_id();
> +
> +            mask = per_cpu(scratch_cpumask, cpu);
> +            cpumask_copy(mask, pt_owner->dirty_cpumask);
> +            __cpumask_clear_cpu(cpu, mask);
> +
> +            flush_local(flush_flags_local);
> +        }
>  
> -        cpumask_andnot(mask, pt_owner->dirty_cpumask, cpumask_of(cpu));

I understand you're of the opinion that cpumask_copy() +
__cpumask_clear_cpu() is more efficient than cpumask_andnot()?
(I guess I agree for high enough CPU counts.)

Jan

Andrew Cooper Oct. 21, 2020, 3:39 p.m. UTC | #2

On 21/10/2020 14:56, Jan Beulich wrote:
> On 21.10.2020 15:07, Andrew Cooper wrote:
>> @@ -4037,6 +4037,9 @@ long do_mmu_update(
>>                          break;
>>                      rc = mod_l2_entry(va, l2e_from_intpte(req.val), mfn,
>>                                        cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
>> +                    /* Paging structure maybe changed.  Flush linear range. */
>> +                    if ( !rc )
>> +                        flush_flags_all |= FLUSH_TLB;
>>                      break;
>>  
>>                  case PGT_l3_page_table:
>> @@ -4044,6 +4047,9 @@ long do_mmu_update(
>>                          break;
>>                      rc = mod_l3_entry(va, l3e_from_intpte(req.val), mfn,
>>                                        cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
>> +                    /* Paging structure maybe changed.  Flush linear range. */
>> +                    if ( !rc )
>> +                        flush_flags_all |= FLUSH_TLB;
>>                      break;
>>  
>>                  case PGT_l4_page_table:
>> @@ -4051,27 +4057,28 @@ long do_mmu_update(
>>                          break;
>>                      rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
>>                                        cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
>> -                    if ( !rc && pt_owner->arch.pv.xpti )
>> +                    /* Paging structure maybe changed.  Flush linear range. */
>> +                    if ( !rc )
>>                      {
>> -                        bool local_in_use = false;
>> +                        bool local_in_use = mfn_eq(
>> +                            pagetable_get_mfn(curr->arch.guest_table), mfn);
>>  
>> -                        if ( mfn_eq(pagetable_get_mfn(curr->arch.guest_table),
>> -                                    mfn) )
>> -                        {
>> -                            local_in_use = true;
>> -                            get_cpu_info()->root_pgt_changed = true;
>> -                        }
>> +                        flush_flags_all |= FLUSH_TLB;
>> +
>> +                        if ( local_in_use )
>> +                            flush_flags_local |= FLUSH_TLB | FLUSH_ROOT_PGTBL;
> Aiui here (and in the code consuming the flags) you build upon
> flush_flags_local, when not zero, always being a superset of
> flush_flags_all. I think this is a trap to fall into when later
> wanting to change this code, but as per below this won't hold
> anymore anyway, I think. Hence here I think you want to set
> FLUSH_TLB unconditionally, and above for L3 and L2 you want to
> set it in both variables. Or, if I'm wrong below, a comment to
> that effect may help people avoid falling into this trap.
>
> An alternative would be to have
>
>     flush_flags_local |= (flush_flags_all & FLUSH_TLB);
>
> before doing the actual flush.

Honestly, this is what I meant by stating that the asymmetry is a total
mess.

I originally named all 'remote', but this is even less accurate, it may
still contain the current cpu.

Our matrix of complexity:

* FLUSH_TLB for L2+ structure changes
* FLUSH_TLB_GLOBAL/FLUSH_ROOT_PGTBL for XPTI

with optimisations to skip GLOBAL/ROOT_PGTBL on the local CPU if none of
the updates hit the L4-in-use, and to skip the remote if we hold all
references on the L4.

Everything is complicated because pt_owner may not be current, for
toolstack operations constructing a new domain.

>
>>                          /*
>>                           * No need to sync if all uses of the page can be
>>                           * accounted to the page lock we hold, its pinned
>>                           * status, and uses on this (v)CPU.
>>                           */
>> -                        if ( (page->u.inuse.type_info & PGT_count_mask) >
>> +                        if ( pt_owner->arch.pv.xpti &&
> I assume you've moved this here to avoid the further non-trivial
> checks when possible, but you've not put it around the setting
> of FLUSH_ROOT_PGTBL in flush_flags_local because setting
> ->root_pgt_changed is benign when !XPTI?

No - that was accidental, while attempting to reduce the diff.

>
>> +                             (page->u.inuse.type_info & PGT_count_mask) >
>>                               (1 + !!(page->u.inuse.type_info & PGT_pinned) +
>>                                mfn_eq(pagetable_get_mfn(curr->arch.guest_table_user),
>>                                       mfn) + local_in_use) )
>> -                            sync_guest = true;
>> +                            flush_flags_all |= FLUSH_ROOT_PGTBL;
> This needs to also specify FLUSH_TLB_GLOBAL, or else original
> XPTI behavior gets broken. Since the local CPU doesn't need this,
> the variable may then better be named flush_flags_remote?

See above.  remote is even more confusing than all.

>
> Or if I'm wrong here, the reasoning behind the dropping of the
> global flush in this case needs putting in the description, not
> the least because it would mean the change introducing it went
> too far.
>
>> @@ -4173,18 +4180,36 @@ long do_mmu_update(
>>      if ( va )
>>          unmap_domain_page(va);
>>  
>> -    if ( sync_guest )
>> +    /*
>> +     * Flushing needs to occur for one of several reasons.
>> +     *
>> +     * 1) An update to an L2 or higher occured.  This potentially changes the
>> +     *    pagetable structure, requiring a flush of the linear range.
>> +     * 2) An update to an L4 occured, and XPTI is enabled.  All CPUs running
>> +     *    on a copy of this L4 need refreshing.
>> +     */
>> +    if ( flush_flags_all || flush_flags_local )
> Minor remark: At least in x86 code it is more efficient to use
> | instead of || in such cases, to avoid relying on the compiler
> to carry out this small optimization.

This transformation should not be recommended in general.  There are
cases, including this one, where it is at best, no effect, and at worst,
wrong.

I don't care about people using ancient compilers.  They've got far
bigger (== more impactful) problems than (the absence of) this
transformation, and the TLB flush will dwarf anything the compiler does
here.

However, the hand "optimised" case breaks a compiler spotting that the
entire second clause is actually redundant for now.

I specifically didn't encode the dependency, to avoid subtle bugs
if/when someone alters the logic.

>
>>      {
>> +        cpumask_t *mask = pt_owner->dirty_cpumask;
>> +
>>          /*
>> -         * Force other vCPU-s of the affected guest to pick up L4 entry
>> -         * changes (if any).
>> +         * Local flushing may be asymmetric with remote.  If there is local
>> +         * flushing to do, perform it separately and omit the current CPU from
>> +         * pt_owner->dirty_cpumask.
>>           */
>> -        unsigned int cpu = smp_processor_id();
>> -        cpumask_t *mask = per_cpu(scratch_cpumask, cpu);
>> +        if ( flush_flags_local )
>> +        {
>> +            unsigned int cpu = smp_processor_id();
>> +
>> +            mask = per_cpu(scratch_cpumask, cpu);
>> +            cpumask_copy(mask, pt_owner->dirty_cpumask);
>> +            __cpumask_clear_cpu(cpu, mask);
>> +
>> +            flush_local(flush_flags_local);
>> +        }
>>  
>> -        cpumask_andnot(mask, pt_owner->dirty_cpumask, cpumask_of(cpu));
> I understand you're of the opinion that cpumask_copy() +
> __cpumask_clear_cpu() is more efficient than cpumask_andnot()?
> (I guess I agree for high enough CPU counts.)

Its faster in all cases, even low CPU counts.

~Andrew

Andrew Cooper Oct. 21, 2020, 3:55 p.m. UTC | #3

On 21/10/2020 16:39, Andrew Cooper wrote:
>>> @@ -4051,27 +4057,28 @@ long do_mmu_update(
>>>                          break;
>>>                      rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
>>>                                        cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
>>> -                    if ( !rc && pt_owner->arch.pv.xpti )
>>> +                    /* Paging structure maybe changed.  Flush linear range. */
>>> +                    if ( !rc )
>>>                      {
>>> -                        bool local_in_use = false;
>>> +                        bool local_in_use = mfn_eq(
>>> +                            pagetable_get_mfn(curr->arch.guest_table), mfn);
>>>  
>>> -                        if ( mfn_eq(pagetable_get_mfn(curr->arch.guest_table),
>>> -                                    mfn) )
>>> -                        {
>>> -                            local_in_use = true;
>>> -                            get_cpu_info()->root_pgt_changed = true;
>>> -                        }
>>> +                        flush_flags_all |= FLUSH_TLB;
>>> +
>>> +                        if ( local_in_use )
>>> +                            flush_flags_local |= FLUSH_TLB | FLUSH_ROOT_PGTBL;
>> Aiui here (and in the code consuming the flags) you build upon
>> flush_flags_local, when not zero, always being a superset of
>> flush_flags_all. I think this is a trap to fall into when later
>> wanting to change this code, but as per below this won't hold
>> anymore anyway, I think. Hence here I think you want to set
>> FLUSH_TLB unconditionally, and above for L3 and L2 you want to
>> set it in both variables. Or, if I'm wrong below, a comment to
>> that effect may help people avoid falling into this trap.
>>
>> An alternative would be to have
>>
>>     flush_flags_local |= (flush_flags_all & FLUSH_TLB);
>>
>> before doing the actual flush.

Also, what I forgot to say in the previous reply, this is still buggy if
hypothetically speaking FLUSH_CACHE had managed to be accumulated in
flush_flags_all.

You cannot have general accumulation logic, a special case for local,
and any catch-all logic like that, because the only correct way to do
the catch-all logic will clobber the special case for local.

~Andrew

Andrew Cooper Oct. 21, 2020, 4:53 p.m. UTC | #4

On 21/10/2020 16:55, Andrew Cooper wrote:
> On 21/10/2020 16:39, Andrew Cooper wrote:
>>>> @@ -4051,27 +4057,28 @@ long do_mmu_update(
>>>>                          break;
>>>>                      rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
>>>>                                        cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
>>>> -                    if ( !rc && pt_owner->arch.pv.xpti )
>>>> +                    /* Paging structure maybe changed.  Flush linear range. */
>>>> +                    if ( !rc )
>>>>                      {
>>>> -                        bool local_in_use = false;
>>>> +                        bool local_in_use = mfn_eq(
>>>> +                            pagetable_get_mfn(curr->arch.guest_table), mfn);
>>>>  
>>>> -                        if ( mfn_eq(pagetable_get_mfn(curr->arch.guest_table),
>>>> -                                    mfn) )
>>>> -                        {
>>>> -                            local_in_use = true;
>>>> -                            get_cpu_info()->root_pgt_changed = true;
>>>> -                        }
>>>> +                        flush_flags_all |= FLUSH_TLB;
>>>> +
>>>> +                        if ( local_in_use )
>>>> +                            flush_flags_local |= FLUSH_TLB | FLUSH_ROOT_PGTBL;
>>> Aiui here (and in the code consuming the flags) you build upon
>>> flush_flags_local, when not zero, always being a superset of
>>> flush_flags_all. I think this is a trap to fall into when later
>>> wanting to change this code, but as per below this won't hold
>>> anymore anyway, I think. Hence here I think you want to set
>>> FLUSH_TLB unconditionally, and above for L3 and L2 you want to
>>> set it in both variables. Or, if I'm wrong below, a comment to
>>> that effect may help people avoid falling into this trap.
>>>
>>> An alternative would be to have
>>>
>>>     flush_flags_local |= (flush_flags_all & FLUSH_TLB);
>>>
>>> before doing the actual flush.
> Also, what I forgot to say in the previous reply, this is still buggy if
> hypothetically speaking FLUSH_CACHE had managed to be accumulated in
> flush_flags_all.
>
> You cannot have general accumulation logic, a special case for local,
> and any catch-all logic like that, because the only correct way to do
> the catch-all logic will clobber the special case for local.

I'm going to try a 3rd time with flush_flags and local_may_skip which
defaults to GLOBAL|ROOT_PGTBL, and may get clobbered.

This seems like it might be a less fragile way of expressing the
optimisation.

~Andrew

Jan Beulich Oct. 22, 2020, 7:22 a.m. UTC | #5

On 21.10.2020 17:39, Andrew Cooper wrote:
> On 21/10/2020 14:56, Jan Beulich wrote:
>> On 21.10.2020 15:07, Andrew Cooper wrote:
>>> @@ -4037,6 +4037,9 @@ long do_mmu_update(
>>>                          break;
>>>                      rc = mod_l2_entry(va, l2e_from_intpte(req.val), mfn,
>>>                                        cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
>>> +                    /* Paging structure maybe changed.  Flush linear range. */
>>> +                    if ( !rc )
>>> +                        flush_flags_all |= FLUSH_TLB;
>>>                      break;
>>>  
>>>                  case PGT_l3_page_table:
>>> @@ -4044,6 +4047,9 @@ long do_mmu_update(
>>>                          break;
>>>                      rc = mod_l3_entry(va, l3e_from_intpte(req.val), mfn,
>>>                                        cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
>>> +                    /* Paging structure maybe changed.  Flush linear range. */
>>> +                    if ( !rc )
>>> +                        flush_flags_all |= FLUSH_TLB;
>>>                      break;
>>>  
>>>                  case PGT_l4_page_table:
>>> @@ -4051,27 +4057,28 @@ long do_mmu_update(
>>>                          break;
>>>                      rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
>>>                                        cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
>>> -                    if ( !rc && pt_owner->arch.pv.xpti )
>>> +                    /* Paging structure maybe changed.  Flush linear range. */
>>> +                    if ( !rc )
>>>                      {
>>> -                        bool local_in_use = false;
>>> +                        bool local_in_use = mfn_eq(
>>> +                            pagetable_get_mfn(curr->arch.guest_table), mfn);
>>>  
>>> -                        if ( mfn_eq(pagetable_get_mfn(curr->arch.guest_table),
>>> -                                    mfn) )
>>> -                        {
>>> -                            local_in_use = true;
>>> -                            get_cpu_info()->root_pgt_changed = true;
>>> -                        }
>>> +                        flush_flags_all |= FLUSH_TLB;
>>> +
>>> +                        if ( local_in_use )
>>> +                            flush_flags_local |= FLUSH_TLB | FLUSH_ROOT_PGTBL;
>> Aiui here (and in the code consuming the flags) you build upon
>> flush_flags_local, when not zero, always being a superset of
>> flush_flags_all. I think this is a trap to fall into when later
>> wanting to change this code, but as per below this won't hold
>> anymore anyway, I think. Hence here I think you want to set
>> FLUSH_TLB unconditionally, and above for L3 and L2 you want to
>> set it in both variables. Or, if I'm wrong below, a comment to
>> that effect may help people avoid falling into this trap.
>>
>> An alternative would be to have
>>
>>     flush_flags_local |= (flush_flags_all & FLUSH_TLB);
>>
>> before doing the actual flush.
> 
> Honestly, this is what I meant by stating that the asymmetry is a total
> mess.
> 
> I originally named all 'remote', but this is even less accurate, it may
> still contain the current cpu.
> 
> Our matrix of complexity:
> 
> * FLUSH_TLB for L2+ structure changes
> * FLUSH_TLB_GLOBAL/FLUSH_ROOT_PGTBL for XPTI
> 
> with optimisations to skip GLOBAL/ROOT_PGTBL on the local CPU if none of
> the updates hit the L4-in-use, and to skip the remote if we hold all
> references on the L4.
> 
> Everything is complicated because pt_owner may not be current, for
> toolstack operations constructing a new domain.

Which is a case I'm wondering why we flush in the first place.
The L4 under construction can't be in use yet, and hence
updates to it shouldn't need syncing. Otoh
pt_owner->dirty_cpumask obviously is empty at that point, i.e.
special casing this likely isn't really worth it.

>>> @@ -4173,18 +4180,36 @@ long do_mmu_update(
>>>      if ( va )
>>>          unmap_domain_page(va);
>>>  
>>> -    if ( sync_guest )
>>> +    /*
>>> +     * Flushing needs to occur for one of several reasons.
>>> +     *
>>> +     * 1) An update to an L2 or higher occured.  This potentially changes the
>>> +     *    pagetable structure, requiring a flush of the linear range.
>>> +     * 2) An update to an L4 occured, and XPTI is enabled.  All CPUs running
>>> +     *    on a copy of this L4 need refreshing.
>>> +     */
>>> +    if ( flush_flags_all || flush_flags_local )
>> Minor remark: At least in x86 code it is more efficient to use
>> | instead of || in such cases, to avoid relying on the compiler
>> to carry out this small optimization.
> 
> This transformation should not be recommended in general.  There are
> cases, including this one, where it is at best, no effect, and at worst,
> wrong.
> 
> I don't care about people using ancient compilers.  They've got far
> bigger (== more impactful) problems than (the absence of) this
> transformation, and the TLB flush will dwarf anything the compiler does
> here.
> 
> However, the hand "optimised" case breaks a compiler spotting that the
> entire second clause is actually redundant for now.

Oh, you mean non-zero flush_flags_local implying non-zero
flush_flags_all? Not sure why a compiler recognizing this shouldn't
be able to optimize away | when it is able to optimize away || in
this case. Anyway - minor point, as said.

> I specifically didn't encode the dependency, to avoid subtle bugs
> if/when someone alters the logic.

Good point, but let me point out then that you encoded another
subtle dependency, which I didn't spell out completely before
because with the suggested adjustments it disappears: You may not
omit FLUSH_TLB from flush_flags_local when !local_in_use, as the L4
being changed may be one recursively hanging off of the L4 we're
running on.

>>>      {
>>> +        cpumask_t *mask = pt_owner->dirty_cpumask;
>>> +
>>>          /*
>>> -         * Force other vCPU-s of the affected guest to pick up L4 entry
>>> -         * changes (if any).
>>> +         * Local flushing may be asymmetric with remote.  If there is local
>>> +         * flushing to do, perform it separately and omit the current CPU from
>>> +         * pt_owner->dirty_cpumask.
>>>           */
>>> -        unsigned int cpu = smp_processor_id();
>>> -        cpumask_t *mask = per_cpu(scratch_cpumask, cpu);
>>> +        if ( flush_flags_local )
>>> +        {
>>> +            unsigned int cpu = smp_processor_id();
>>> +
>>> +            mask = per_cpu(scratch_cpumask, cpu);
>>> +            cpumask_copy(mask, pt_owner->dirty_cpumask);
>>> +            __cpumask_clear_cpu(cpu, mask);
>>> +
>>> +            flush_local(flush_flags_local);
>>> +        }
>>>  
>>> -        cpumask_andnot(mask, pt_owner->dirty_cpumask, cpumask_of(cpu));
>> I understand you're of the opinion that cpumask_copy() +
>> __cpumask_clear_cpu() is more efficient than cpumask_andnot()?
>> (I guess I agree for high enough CPU counts.)
> 
> Its faster in all cases, even low CPU counts.

Is it? Data for BTR with a memory operand is available in the ORM only
for some Atom CPUs, and its latency and throughput aren't very good
there. Anyway - again a minor aspect, but I'll keep this in mind for
future code I write, as so far I've preferred the "andnot" form in
such cases.

Jan

x86/pv: Flush TLB in response to paging structure changes

Commit Message

Comments

Patch