[v2,2/4] xen/x86: Drop unnecessary barriers

Message ID 1502882530-31700-3-git-send-email-andrew.cooper3@citrix.com (mailing list archive)
State New, archived

Commit Message

Andrew Cooper Aug. 16, 2017, 11:22 a.m. UTC
x86's current implementation of wmb() is a compiler barrier.  As a result, the
only change in this patch is to remove an mfence instruction from
cpuidle_disable_deep_cstate().

None of these barriers serve any purpose.  Most aren't synchronising
with any remote cpus, whereas the mcetelem barriers are redundant with
spin_unlock(), which already has full read/write barrier semantics.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>

v2:
 * s/erronious/unnecessary/
 * Drop more unnecessary barriers
---
 xen/arch/x86/acpi/cpu_idle.c             | 2 --
 xen/arch/x86/cpu/mcheck/mce.c            | 3 ---
 xen/arch/x86/cpu/mcheck/mctelem.c        | 3 ---
 xen/arch/x86/crash.c                     | 3 ---
 xen/arch/x86/mm/shadow/multi.c           | 1 -
 xen/arch/x86/smpboot.c                   | 2 --
 xen/drivers/passthrough/amd/iommu_init.c | 2 --
 7 files changed, 16 deletions(-)

Comments

Jan Beulich Aug. 16, 2017, 3:23 p.m. UTC | #1
>>> On 16.08.17 at 13:22, <andrew.cooper3@citrix.com> wrote:
> x86's current implementation of wmb() is a compiler barrier.  As a result, the
> only change in this patch is to remove an mfence instruction from
> cpuidle_disable_deep_cstate().
> 
> None of these barriers serve any purpose.  Most aren't synchronising
> with any remote cpus, whereas the mcetelem barriers are redundant with
> spin_unlock(), which already has full read/write barrier semantics.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

For the relevant parts
Acked-by: Jan Beulich <jbeulich@suse.com>
For the parts the ack doesn't extend to, however:

> --- a/xen/arch/x86/mm/shadow/multi.c
> +++ b/xen/arch/x86/mm/shadow/multi.c
> @@ -3112,7 +3112,6 @@ static int sh_page_fault(struct vcpu *v,
>       * will make sure no inconsistent mapping being translated into
>       * shadow page table. */
>      version = atomic_read(&d->arch.paging.shadow.gtable_dirty_version);
> -    rmb();
>      walk_ok = sh_walk_guest_tables(v, va, &gw, error_code);

Isn't this supposed to make sure version is being read first? I.e.
doesn't this at least need to be barrier()?
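
(barrier() being the usual compiler-only fence, i.e. something along
these lines -- a sketch, not a quote of the exact header:

    /* Stops the compiler reordering memory accesses across this
     * point; emits no instruction at runtime. */
    #define barrier() __asm__ __volatile__("" : : : "memory")
)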

> index a459e99..d5b6049 100644
> --- a/xen/drivers/passthrough/amd/iommu_init.c
> +++ b/xen/drivers/passthrough/amd/iommu_init.c
> @@ -558,7 +558,6 @@ static void parse_event_log_entry(struct amd_iommu *iommu, u32 entry[])
>              return;
>          }
>          udelay(1);
> -        rmb();
>          code = get_field_from_reg_u32(entry[1], IOMMU_EVENT_CODE_MASK,
>                                        IOMMU_EVENT_CODE_SHIFT);
>      }
> @@ -663,7 +662,6 @@ void parse_ppr_log_entry(struct amd_iommu *iommu, u32 entry[])
>              return;
>          }
>          udelay(1);
> -        rmb();
>          code = get_field_from_reg_u32(entry[1], IOMMU_PPR_LOG_CODE_MASK,
>                                        IOMMU_PPR_LOG_CODE_SHIFT);
>      }

With these fully removed, what keeps the compiler from moving
the entry[1] reads out of the loop? Implementation details of
udelay() don't count...

Jan
Andrew Cooper Aug. 16, 2017, 4:47 p.m. UTC | #2
On 16/08/17 16:23, Jan Beulich wrote:
>>>> On 16.08.17 at 13:22, <andrew.cooper3@citrix.com> wrote:
>> x86's current implementation of wmb() is a compiler barrier.  As a result, the
>> only change in this patch is to remove an mfence instruction from
>> cpuidle_disable_deep_cstate().
>>
>> None of these barriers serve any purpose.  Most aren't synchronising
>> with any remote cpus, whereas the mcetelem barriers are redundant with
>> spin_unlock(), which already has full read/write barrier semantics.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> For the relevant parts
> Acked-by: Jan Beulich <jbeulich@suse.com>
> For the parts the ack doesn't extend to, however:
>
>> --- a/xen/arch/x86/mm/shadow/multi.c
>> +++ b/xen/arch/x86/mm/shadow/multi.c
>> @@ -3112,7 +3112,6 @@ static int sh_page_fault(struct vcpu *v,
>>       * will make sure no inconsistent mapping being translated into
>>       * shadow page table. */
>>      version = atomic_read(&d->arch.paging.shadow.gtable_dirty_version);
>> -    rmb();
>>      walk_ok = sh_walk_guest_tables(v, va, &gw, error_code);
> Isn't this supposed to make sure version is being read first? I.e.
> doesn't this at least need to be barrier()?

atomic_read() is not free to be reordered by the compiler.  It is an asm
volatile with a volatile memory reference.
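
The idiom is roughly this (a sketch, not the verbatim Xen definition):

    static inline int atomic_read_sketch(const atomic_t *v)
    {
        int ret;

        /* An asm volatile with a volatile memory operand: the compiler
         * may not elide or duplicate the load, nor reorder it relative
         * to other volatile accesses. */
        asm volatile ( "mov %1, %0"
                       : "=r" (ret)
                       : "m" (*(const volatile int *)&v->counter) );

        return ret;
    }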

>
>> index a459e99..d5b6049 100644
>> --- a/xen/drivers/passthrough/amd/iommu_init.c
>> +++ b/xen/drivers/passthrough/amd/iommu_init.c
>> @@ -558,7 +558,6 @@ static void parse_event_log_entry(struct amd_iommu *iommu, u32 entry[])
>>              return;
>>          }
>>          udelay(1);
>> -        rmb();
>>          code = get_field_from_reg_u32(entry[1], IOMMU_EVENT_CODE_MASK,
>>                                        IOMMU_EVENT_CODE_SHIFT);
>>      }
>> @@ -663,7 +662,6 @@ void parse_ppr_log_entry(struct amd_iommu *iommu, u32 entry[])
>>              return;
>>          }
>>          udelay(1);
>> -        rmb();
>>          code = get_field_from_reg_u32(entry[1], IOMMU_PPR_LOG_CODE_MASK,
>>                                        IOMMU_PPR_LOG_CODE_SHIFT);
>>      }
> With these fully removed, what keeps the compiler from moving
> the entry[1] reads out of the loop? Implementation details of
> udelay() don't count...

It is a write to the control variable which is derived from a non-local
non-constant object.  It can't be hoisted at all.

Consider this simplified version:

while ( count == 0 )
    count = entry[1];

If entry were const, the compiler would be free to expect that the value
doesn't change on repeated reads, but that is not the case here.

~Andrew
Andrew Cooper Aug. 16, 2017, 5:03 p.m. UTC | #3
On 16/08/17 17:47, Andrew Cooper wrote:
> On 16/08/17 16:23, Jan Beulich wrote:
>>>>> On 16.08.17 at 13:22, <andrew.cooper3@citrix.com> wrote:
>>> x86's current implementation of wmb() is a compiler barrier.  As a result, the
>>> only change in this patch is to remove an mfence instruction from
>>> cpuidle_disable_deep_cstate().
>>>
>>> None of these barriers serve any purpose.  Most aren't synchronising
>>> with any remote cpus, whereas the mcetelem barriers are redundant with
>>> spin_unlock(), which already has full read/write barrier semantics.
>>>
>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> For the relevant parts
>> Acked-by: Jan Beulich <jbeulich@suse.com>
>> For the parts the ack doesn't extend to, however:
>>
>>> --- a/xen/arch/x86/mm/shadow/multi.c
>>> +++ b/xen/arch/x86/mm/shadow/multi.c
>>> @@ -3112,7 +3112,6 @@ static int sh_page_fault(struct vcpu *v,
>>>       * will make sure no inconsistent mapping being translated into
>>>       * shadow page table. */
>>>      version = atomic_read(&d->arch.paging.shadow.gtable_dirty_version);
>>> -    rmb();
>>>      walk_ok = sh_walk_guest_tables(v, va, &gw, error_code);
>> Isn't this supposed to make sure version is being read first? I.e.
>> doesn't this at least need to be barrier()?
> atomic_read() is not free to be reordered by the compiler.  It is an asm
> volatile with a volatile memory reference.
>
>>> index a459e99..d5b6049 100644
>>> --- a/xen/drivers/passthrough/amd/iommu_init.c
>>> +++ b/xen/drivers/passthrough/amd/iommu_init.c
>>> @@ -558,7 +558,6 @@ static void parse_event_log_entry(struct amd_iommu *iommu, u32 entry[])
>>>              return;
>>>          }
>>>          udelay(1);
>>> -        rmb();
>>>          code = get_field_from_reg_u32(entry[1], IOMMU_EVENT_CODE_MASK,
>>>                                        IOMMU_EVENT_CODE_SHIFT);
>>>      }
>>> @@ -663,7 +662,6 @@ void parse_ppr_log_entry(struct amd_iommu *iommu, u32 entry[])
>>>              return;
>>>          }
>>>          udelay(1);
>>> -        rmb();
>>>          code = get_field_from_reg_u32(entry[1], IOMMU_PPR_LOG_CODE_MASK,
>>>                                        IOMMU_PPR_LOG_CODE_SHIFT);
>>>      }
>> With these fully removed, what keeps the compiler from moving
>> the entry[1] reads out of the loop? Implementation details of
>> udelay() don't count...
> It is a write to the control variable which is derived from a non-local
> non-constant object.  It can't be hoisted at all.
>
> Consider this simplified version:
>
> while ( count == 0 )
>     count = entry[1];
>
> If entry were const, the compiler would be free to expect that the value
> doesn't change on repeated reads, but that is not the case here.

(And continuing my run of luck today), it turns out that GCC does
compile my example here to an infinite loop.

ffff82d08026025f:       84 c0                   test   %al,%al
ffff82d080260261:       75 0a                   jne    ffff82d08026026d <parse_ppr_log_entry+0x29>
ffff82d080260263:       8b 46 04                mov    0x4(%rsi),%eax
ffff82d080260266:       c1 e8 1c                shr    $0x1c,%eax
ffff82d080260269:       84 c0                   test   %al,%al
ffff82d08026026b:       74 fc                   je     ffff82d080260269 <parse_ppr_log_entry+0x25>


I will move this to being a barrier() with a hoisting comment (to avoid
it looking like an SMP issue), and I'm going to have to re-evaluate how
sane I think the C standard is.
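
Concretely, something like this sketch (not the final patch):

    udelay(1);
    /*
     * Prevent the compiler hoisting the entry[1] read out of the
     * loop; this is about hoisting, not SMP ordering.
     */
    barrier();
    code = get_field_from_reg_u32(entry[1], IOMMU_EVENT_CODE_MASK,
                                  IOMMU_EVENT_CODE_SHIFT);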

~Andrew
Jan Beulich Aug. 17, 2017, 7:48 a.m. UTC | #4
>>> On 16.08.17 at 18:47, <andrew.cooper3@citrix.com> wrote:
> On 16/08/17 16:23, Jan Beulich wrote:
>>>>> On 16.08.17 at 13:22, <andrew.cooper3@citrix.com> wrote:
>>> --- a/xen/arch/x86/mm/shadow/multi.c
>>> +++ b/xen/arch/x86/mm/shadow/multi.c
>>> @@ -3112,7 +3112,6 @@ static int sh_page_fault(struct vcpu *v,
>>>       * will make sure no inconsistent mapping being translated into
>>>       * shadow page table. */
>>>      version = atomic_read(&d->arch.paging.shadow.gtable_dirty_version);
>>> -    rmb();
>>>      walk_ok = sh_walk_guest_tables(v, va, &gw, error_code);
>> Isn't this supposed to make sure version is being read first? I.e.
>> doesn't this at least need to be barrier()?
> 
> atomic_read() is not free to be reordered by the compiler.  It is an asm
> volatile with a volatile memory reference.

Oh, right - I did forget about the volatiles there (since generally,
like in Linux, we appear to try to avoid volatile).

Jan
Jan Beulich Aug. 17, 2017, 7:50 a.m. UTC | #5
>>> On 16.08.17 at 19:03, <andrew.cooper3@citrix.com> wrote:
> On 16/08/17 17:47, Andrew Cooper wrote:
>> On 16/08/17 16:23, Jan Beulich wrote:
>>>>>> On 16.08.17 at 13:22, <andrew.cooper3@citrix.com> wrote:
>>>> --- a/xen/drivers/passthrough/amd/iommu_init.c
>>>> +++ b/xen/drivers/passthrough/amd/iommu_init.c
>>>> @@ -558,7 +558,6 @@ static void parse_event_log_entry(struct amd_iommu *iommu, u32 entry[])
>>>>              return;
>>>>          }
>>>>          udelay(1);
>>>> -        rmb();
>>>>          code = get_field_from_reg_u32(entry[1], IOMMU_EVENT_CODE_MASK,
>>>>                                        IOMMU_EVENT_CODE_SHIFT);
>>>>      }
>>>> @@ -663,7 +662,6 @@ void parse_ppr_log_entry(struct amd_iommu *iommu, u32 entry[])
>>>>              return;
>>>>          }
>>>>          udelay(1);
>>>> -        rmb();
>>>>          code = get_field_from_reg_u32(entry[1], IOMMU_PPR_LOG_CODE_MASK,
>>>>                                        IOMMU_PPR_LOG_CODE_SHIFT);
>>>>      }
>>> With these fully removed, what keeps the compiler from moving
>>> the entry[1] reads out of the loop? Implementation details of
>>> udelay() don't count...
>> It is a write to the control variable which is derived from a non-local
>> non-constant object.  It can't be hoisted at all.
>>
>> Consider this simplified version:
>>
>> while ( count == 0 )
>>     count = entry[1];
>>
>> If entry were const, the compiler would be free to expect that the value
>> doesn't change on repeated reads, but that is not the case here.
> 
> (And continuing my run of luck today), it turns out that GCC does
> compile my example here to an infinite loop.
> 
> ffff82d08026025f:       84 c0                   test   %al,%al
> ffff82d080260261:       75 0a                   jne    ffff82d08026026d <parse_ppr_log_entry+0x29>
> ffff82d080260263:       8b 46 04                mov    0x4(%rsi),%eax
> ffff82d080260266:       c1 e8 1c                shr    $0x1c,%eax
> ffff82d080260269:       84 c0                   test   %al,%al
> ffff82d08026026b:       74 fc                   je     ffff82d080260269 <parse_ppr_log_entry+0x25>
> 
> 
> I will move this to being a barrier() with a hoisting comment (to avoid
> it looking like an SMP issue), and I'm going to have to re-evaluate how
> sane I think the C standard is.

Well, as always, the standard assumes just a single thread (i.e.
const-ness doesn't matter at all here).

Jan
Tim Deegan Aug. 18, 2017, 1:55 p.m. UTC | #6
At 12:22 +0100 on 16 Aug (1502886128), Andrew Cooper wrote:
> diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
> index c9c2252..1e3dfaf 100644
> --- a/xen/arch/x86/mm/shadow/multi.c
> +++ b/xen/arch/x86/mm/shadow/multi.c
> @@ -3112,7 +3112,6 @@ static int sh_page_fault(struct vcpu *v,
>       * will make sure no inconsistent mapping being translated into
>       * shadow page table. */
>      version = atomic_read(&d->arch.paging.shadow.gtable_dirty_version);
> -    rmb();
>      walk_ok = sh_walk_guest_tables(v, va, &gw, error_code);

Nack.  We must read the version before reading the tables, or we might
accidentally use out-of-date tables.

If anything, this needs more barriers!  There ought to be a read
barrier before we re-read the version in shadow_check_gwalk(), but
taking the paging lock DTRT.  And there ought to be a wmb() before we
increment the version later on, which I guess I'll make a patch for.
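
Roughly this shape (a hypothetical sketch, not an actual patch):

    /* Reader: sample the version before walking the guest tables. */
    version = atomic_read(&d->arch.paging.shadow.gtable_dirty_version);
    rmb();
    walk_ok = sh_walk_guest_tables(v, va, &gw, error_code);

    /* Writer: publish the table updates before bumping the version. */
    wmb();
    atomic_inc(&d->arch.paging.shadow.gtable_dirty_version);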

Cheers,

Tim.
Tim Deegan Aug. 18, 2017, 2:07 p.m. UTC | #7
At 14:55 +0100 on 18 Aug (1503068128), Tim Deegan wrote:
> At 12:22 +0100 on 16 Aug (1502886128), Andrew Cooper wrote:
> > diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
> > index c9c2252..1e3dfaf 100644
> > --- a/xen/arch/x86/mm/shadow/multi.c
> > +++ b/xen/arch/x86/mm/shadow/multi.c
> > @@ -3112,7 +3112,6 @@ static int sh_page_fault(struct vcpu *v,
> >       * will make sure no inconsistent mapping being translated into
> >       * shadow page table. */
> >      version = atomic_read(&d->arch.paging.shadow.gtable_dirty_version);
> > -    rmb();
> >      walk_ok = sh_walk_guest_tables(v, va, &gw, error_code);
> 
> Nack.  We must read the version before reading the tables, or we might
> accidentally use out-of-date tables.
> 
> If anything, this needs more barriers!  There ought to be a read
> barrier before we re-read the version in shadow_check_gwalk(), but
> taking the paging lock DTRT.  And there ought to be a wmb() before we
> increment the version later on, which I guess I'll make a patch for.

These can be smp_*mb(), though, to align with the rest of the series.

Tim.
Tim Deegan Aug. 18, 2017, 2:47 p.m. UTC | #8
At 01:48 -0600 on 17 Aug (1502934495), Jan Beulich wrote:
> >>> On 16.08.17 at 18:47, <andrew.cooper3@citrix.com> wrote:
> > On 16/08/17 16:23, Jan Beulich wrote:
> >>>>> On 16.08.17 at 13:22, <andrew.cooper3@citrix.com> wrote:
> >>> --- a/xen/arch/x86/mm/shadow/multi.c
> >>> +++ b/xen/arch/x86/mm/shadow/multi.c
> >>> @@ -3112,7 +3112,6 @@ static int sh_page_fault(struct vcpu *v,
> >>>       * will make sure no inconsistent mapping being translated into
> >>>       * shadow page table. */
> >>>      version = atomic_read(&d->arch.paging.shadow.gtable_dirty_version);
> >>> -    rmb();
> >>>      walk_ok = sh_walk_guest_tables(v, va, &gw, error_code);
> >> Isn't this supposed to make sure version is being read first? I.e.
> >> doesn't this at least need to be barrier()?
> > 
> > atomic_read() is not free to be reordered by the compiler.  It is an asm
> > volatile with a volatile memory reference.
> 
> Oh, right - I did forget about the volatiles there (since generally,
> like in Linux, we appear to try to avoid volatile).

FWIW, I don't think that's quite right.  The GCC docs I have say that
"volatile" will stop the compiler from omitting an asm altogether, or
hoisting it out of a loop (on the assumption that it will always
produce the same output for the same inputs).  And that "the compiler
can move even volatile 'asm' instructions relative to other code,
including across jump instructions."
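
A small illustration of that last point (hypothetical code, not from
the tree): without a "memory" clobber, the compiler is allowed to move
the plain store across the volatile asm:

    int x;

    void example(void)
    {
        asm volatile ( "mfence" ); /* volatile, but no memory clobber */
        x = 1;                     /* may legally move above the asm */
    }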

Cheers,

Tim.
Jan Beulich Aug. 18, 2017, 3:04 p.m. UTC | #9
>>> On 18.08.17 at 16:47, <tim@xen.org> wrote:
> At 01:48 -0600 on 17 Aug (1502934495), Jan Beulich wrote:
>> >>> On 16.08.17 at 18:47, <andrew.cooper3@citrix.com> wrote:
>> > On 16/08/17 16:23, Jan Beulich wrote:
>> >>>>> On 16.08.17 at 13:22, <andrew.cooper3@citrix.com> wrote:
>> >>> --- a/xen/arch/x86/mm/shadow/multi.c
>> >>> +++ b/xen/arch/x86/mm/shadow/multi.c
>> >>> @@ -3112,7 +3112,6 @@ static int sh_page_fault(struct vcpu *v,
>> >>>       * will make sure no inconsistent mapping being translated into
>> >>>       * shadow page table. */
>> >>>      version = atomic_read(&d->arch.paging.shadow.gtable_dirty_version);
>> >>> -    rmb();
>> >>>      walk_ok = sh_walk_guest_tables(v, va, &gw, error_code);
>> >> Isn't this supposed to make sure version is being read first? I.e.
>> >> doesn't this at least need to be barrier()?
>> > 
>> > atomic_read() is not free to be reordered by the compiler.  It is an asm
>> > volatile with a volatile memory reference.
>> 
>> Oh, right - I did forget about the volatiles there (since generally,
>> like in Linux, we appear to try to avoid volatile).
> 
> FWIW, I don't think that's quite right.  The GCC docs I have say that
> "volatile" will stop the compiler from omitting an asm altogether, or
> hoisting it out of a loop (on the assumption that it will always
> produce the same output for the same inputs).  And that "the compiler
> can move even volatile 'asm' instructions relative to other code,
> including across jump instructions."

Oh, I had talked about the volatile qualifiers, not the volatile asm-s.

Jan
Tim Deegan Aug. 18, 2017, 3:07 p.m. UTC | #10
At 15:47 +0100 on 18 Aug (1503071247), Tim Deegan wrote:
> At 01:48 -0600 on 17 Aug (1502934495), Jan Beulich wrote:
> > >>> On 16.08.17 at 18:47, <andrew.cooper3@citrix.com> wrote:
> > > atomic_read() is not free to be reordered by the compiler.  It is an asm
> > > volatile with a volatile memory reference.
> > 
> > Oh, right - I did forget about the volatiles there (since generally,
> > like in Linux, we appear to try to avoid volatile).
> 
> FWIW, I don't think that's quite right.  The GCC docs I have say that
> "volatile" will stop the compiler from omitting an asm altogether, or
> hoisting it out of a loop (on the assumption that it will always
> produce the same output for the same inputs).  And that "the compiler
> can move even volatile 'asm' instructions relative to other code,
> including across jump instructions."

...and indeed: https://godbolt.org/g/KW19QR

Tim.
Tim Deegan Aug. 18, 2017, 3:13 p.m. UTC | #11
At 09:04 -0600 on 18 Aug (1503047077), Jan Beulich wrote:
> >>> On 18.08.17 at 16:47, <tim@xen.org> wrote:
> > At 01:48 -0600 on 17 Aug (1502934495), Jan Beulich wrote:
> >> >>> On 16.08.17 at 18:47, <andrew.cooper3@citrix.com> wrote:
> >> > On 16/08/17 16:23, Jan Beulich wrote:
> >> >>>>> On 16.08.17 at 13:22, <andrew.cooper3@citrix.com> wrote:
> >> >>> --- a/xen/arch/x86/mm/shadow/multi.c
> >> >>> +++ b/xen/arch/x86/mm/shadow/multi.c
> >> >>> @@ -3112,7 +3112,6 @@ static int sh_page_fault(struct vcpu *v,
> >> >>>       * will make sure no inconsistent mapping being translated into
> >> >>>       * shadow page table. */
> >> >>>      version = atomic_read(&d->arch.paging.shadow.gtable_dirty_version);
> >> >>> -    rmb();
> >> >>>      walk_ok = sh_walk_guest_tables(v, va, &gw, error_code);
> >> >> Isn't this supposed to make sure version is being read first? I.e.
> >> >> doesn't this at least need to be barrier()?
> >> > 
> >> > atomic_read() is not free to be reordered by the compiler.  It is an asm
> >> > volatile with a volatile memory reference.
> >> 
> >> Oh, right - I did forget about the volatiles there (since generally,
> >> like in Linux, we appear to try to avoid volatile).
> > 
> > FWIW, I don't think that's quite right.  The GCC docs I have say that
> > "volatile" will stop the compiler from omitting an asm altogether, or
> > hoisting it out of a loop (on the assumption that it will always
> > produce the same output for the same inputs).  And that "the compiler
> > can move even volatile 'asm' instructions relative to other code,
> > including across jump instructions."
> 
> Oh, I had talked about the volatile qualifiers, no the volatile asm-s.

I'm not sure what other volatile you mean here, but accesses to
volatile objects are only ordered WRT other _volatile_ accesses.
So, e.g.: https://godbolt.org/g/L2qa8h
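
In miniature, the sort of thing that example shows (hypothetical code,
not the exact godbolt source):

    volatile int flag;
    int data;

    int read_both(void)
    {
        int f = flag; /* volatile access: must be performed in order */
        int d = data; /* plain access: free to be reordered around it */
        return f + d;
    }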

Tim.

Patch

diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index 482b8a7..5879ad6 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -1331,8 +1331,6 @@  void cpuidle_disable_deep_cstate(void)
             max_cstate = 1;
     }
 
-    mb();
-
     hpet_disable_legacy_broadcast();
 }
 
diff --git a/xen/arch/x86/cpu/mcheck/mce.c b/xen/arch/x86/cpu/mcheck/mce.c
index 30525dd..aa6e556 100644
--- a/xen/arch/x86/cpu/mcheck/mce.c
+++ b/xen/arch/x86/cpu/mcheck/mce.c
@@ -86,7 +86,6 @@  static x86_mce_vector_t _machine_check_vector = unexpected_machine_check;
 void x86_mce_vector_register(x86_mce_vector_t hdlr)
 {
     _machine_check_vector = hdlr;
-    wmb();
 }
 
 /* Call the installed machine check handler for this CPU setup. */
@@ -369,8 +368,6 @@  mcheck_mca_logout(enum mca_source who, struct mca_banks *bankmask,
             mcabank_clear(i);
         else if ( who == MCA_MCE_SCAN && need_clear)
             mcabanks_set(i, clear_bank);
-
-        wmb();
     }
 
     if (mig && errcnt > 0) {
diff --git a/xen/arch/x86/cpu/mcheck/mctelem.c b/xen/arch/x86/cpu/mcheck/mctelem.c
index b144a66..1731514 100644
--- a/xen/arch/x86/cpu/mcheck/mctelem.c
+++ b/xen/arch/x86/cpu/mcheck/mctelem.c
@@ -520,7 +520,6 @@  mctelem_cookie_t mctelem_consume_oldest_begin(mctelem_class_t which)
 	}
 
 	mctelem_processing_hold(tep);
-	wmb();
 	spin_unlock(&processing_lock);
 	return MCTE2COOKIE(tep);
 }
@@ -531,7 +530,6 @@  void mctelem_consume_oldest_end(mctelem_cookie_t cookie)
 
 	spin_lock(&processing_lock);
 	mctelem_processing_release(tep);
-	wmb();
 	spin_unlock(&processing_lock);
 }
 
@@ -547,6 +545,5 @@  void mctelem_ack(mctelem_class_t which, mctelem_cookie_t cookie)
 	spin_lock(&processing_lock);
 	if (tep == mctctl.mctc_processing_head[target])
 		mctelem_processing_release(tep);
-	wmb();
 	spin_unlock(&processing_lock);
 }
diff --git a/xen/arch/x86/crash.c b/xen/arch/x86/crash.c
index 82535c4..8d74258 100644
--- a/xen/arch/x86/crash.c
+++ b/xen/arch/x86/crash.c
@@ -146,9 +146,6 @@  static void nmi_shootdown_cpus(void)
     write_atomic((unsigned long *)__va(__pa(&exception_table[TRAP_nmi])),
                  (unsigned long)&do_nmi_crash);
 
-    /* Ensure the new callback function is set before sending out the NMI. */
-    wmb();
-
     smp_send_nmi_allbutself();
 
     msecs = 1000; /* Wait at most a second for the other cpus to stop */
diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
index c9c2252..1e3dfaf 100644
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -3112,7 +3112,6 @@  static int sh_page_fault(struct vcpu *v,
      * will make sure no inconsistent mapping being translated into
      * shadow page table. */
     version = atomic_read(&d->arch.paging.shadow.gtable_dirty_version);
-    rmb();
     walk_ok = sh_walk_guest_tables(v, va, &gw, error_code);
 
 #if (SHADOW_OPTIMIZATIONS & SHOPT_OUT_OF_SYNC)
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 8d91f6c..5b094b4 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -355,7 +355,6 @@  void start_secondary(void *unused)
     spin_debug_enable();
     set_cpu_sibling_map(cpu);
     notify_cpu_starting(cpu);
-    wmb();
 
     /*
      * We need to hold vector_lock so there the set of online cpus
@@ -371,7 +370,6 @@  void start_secondary(void *unused)
     local_irq_enable();
     mtrr_ap_init();
 
-    wmb();
     startup_cpu_idle_loop();
 }
 
diff --git a/xen/drivers/passthrough/amd/iommu_init.c b/xen/drivers/passthrough/amd/iommu_init.c
index a459e99..d5b6049 100644
--- a/xen/drivers/passthrough/amd/iommu_init.c
+++ b/xen/drivers/passthrough/amd/iommu_init.c
@@ -558,7 +558,6 @@  static void parse_event_log_entry(struct amd_iommu *iommu, u32 entry[])
             return;
         }
         udelay(1);
-        rmb();
         code = get_field_from_reg_u32(entry[1], IOMMU_EVENT_CODE_MASK,
                                       IOMMU_EVENT_CODE_SHIFT);
     }
@@ -663,7 +662,6 @@  void parse_ppr_log_entry(struct amd_iommu *iommu, u32 entry[])
             return;
         }
         udelay(1);
-        rmb();
         code = get_field_from_reg_u32(entry[1], IOMMU_PPR_LOG_CODE_MASK,
                                       IOMMU_PPR_LOG_CODE_SHIFT);
     }