diff mbox

XSA 154 and ISA region (640K -> 1MB) WB cache instead of UC

Message ID 20160817203238.GA9408@char.us.oracle.com (mailing list archive)
State New, archived
Headers show

Commit Message

Konrad Rzeszutek Wilk Aug. 17, 2016, 8:32 p.m. UTC
Hey Jan, et. al.,

One of the interesting things about XSA 154 fix ("x86: enforce consistent
cachability of MMIO mappings") is that when certain applications (mcelog)
are trying to map /dev/mmap and lurk in ISA regions - we get:

[   49.399053] WARNING: CPU: 0 PID: 2471 at arch/x86/mm/pat.c:913 untrack_pfn+0x93/0xc0()
[   49.399055] Modules linked in: bnx2fc fcoe libfcoe libfc 8021q mrp garp stp llc bonding dm_multipath vfat fat iTCO_wdt iTCO_vendor_support pcspkr ipmi_devintf ipmi_si ipmi_msghandler sb_edac edac_core i2c_i801 i2c_core lpc_ich mfd_core shpchp ioatdma sg ext4 jbd2 mbcache sr_mod cdrom sd_mod usb_storage ahci libahci megaraid_sas qla2xxx scsi_transport_fc crc32c_intel be2iscsi bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi ipv6 cxgb3 libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi ixgbe dca ptp pps_core vxlan udp_tunnel ip6_udp_tunnel mdio dm_mirror dm_region_hash dm_log dm_mod
[   49.399131] CPU: 0 PID: 2471 Comm: mcelog Not tainted 4.1
[   49.399134] Hardware name: Oracle Corporation SUN SERVER X4-2       /ASSY,MB,X4-2, 1U      , BIOS 25030100 04/15/2015
[   49.399138]  0000000000000000 ffff880074673c28 ffffffff816c66f0 0000000000000000
[   49.399143]  0000000000000391 ffff880074673c68 ffffffff81084745 ffff880074673c78
[   49.399148]  ffff88014b625db0 0000000000000000 ffff880074673d58 00007f39290ab000
[   49.399152] Call Trace:
[   49.399166]  [<ffffffff816c66f0>] dump_stack+0x63/0x83
[   49.399175]  [<ffffffff81084745>] warn_slowpath_common+0x95/0xe0
[   49.399180]  [<ffffffff810847aa>] warn_slowpath_null+0x1a/0x20
[   49.399183]  [<ffffffff810725f3>] untrack_pfn+0x93/0xc0
[   49.399190]  [<ffffffff811b90f9>] unmap_single_vma+0xa9/0x100
[   49.399194]  [<ffffffff811b9644>] unmap_vmas+0x54/0xa0
[   49.399199]  [<ffffffff811bf0da>] exit_mmap+0x9a/0x150
[   49.399204]  [<ffffffff810825d3>] mmput+0x73/0x110
[   49.399208]  [<ffffffff81082775>] dup_mm+0x105/0x110
[   49.399213]  [<ffffffff81083b1d>] copy_process+0x11ed/0x1240
[   49.399218]  [<ffffffff81084009>] do_fork+0x79/0x280
[   49.399226]  [<ffffffff810259d3>] ? syscall_trace_enter_phase1+0x153/0x180
[   49.399231]  [<ffffffff81084226>] SyS_clone+0x16/0x20
[   49.399235]  [<ffffffff816cb3ee>] system_call_fastpath+0x12/0x71
[   49.399239] ---[ end trace a61cd3d271a53a54 ]---

The reason is that Linux kernel assumes that the range from 640KB -> 1MB can
be mapped as write-back (see is_new_memtype_allowed and x86_platform.is_untracked_pat_range).
But we enforce the uncached mode and Linux complains.

With the mmio-relax=1, Linux gets its way and is happy.

With the patch below:


Things work much better - as we don't treat the 640KB->1MB region specially.

Anyhow what I am wondering:

 a) Should we add a edge case in the hypervisor to allow multiple mappings
   for this region? I am thinking no.. but it sounds like mapping ISA region
   as WB is safe even in baremetal?

 b) Or would it be better to let Linux do its thing and treat 640KB->1MB
   as uncached instead of writeback?

   Looking at the kernel it assumes that WB is ok for 640KB->1MB.
   The comment says:
   " /* Low ISA region is always mapped WB in page table. No need to track *"

   which is probably true on baremetal. But with Xen PV:

 856         /*                                                                      
 857          * In domU, the ISA region is normal, usable memory, but we             
 858          * reserve ISA memory anyway because too many things poke               
 859          * about in there.                                                      
 860          */                                                                     
 861         e820_add_region(ISA_START_ADDRESS, ISA_END_ADDRESS - ISA_START_ADDRESS, 
 862                         E820_RESERVED);                                 

   which would imply we don't have any page table mappings.

   And then the quick fix I provided above looks like the right solution?


CC-ing Boris, Daniel, Juergen, Steve, and Chuck.

Comments

Jan Beulich Aug. 18, 2016, 10:06 a.m. UTC | #1
>>> On 17.08.16 at 22:32, <konrad.wilk@oracle.com> wrote:
> One of the interesting things about XSA 154 fix ("x86: enforce consistent
> cachability of MMIO mappings") is that when certain applications (mcelog)
> are trying to map /dev/mmap and lurk in ISA regions - we get:

DYM /dev/mem ? Most accesses to which are bogus in PV guests
(often including Dom0) anyway.

> [   49.399053] WARNING: CPU: 0 PID: 2471 at arch/x86/mm/pat.c:913 untrack_pfn+0x93/0xc0()

What Linux version is this? untrack_pfn() doesn't span line 913 in
4.8-rc2. And follow_phys() appears to only check whether the write
flag is set as expected; I can't see any cachability checks. Plus it
gets called only when both incoming address and size are zero.

> Anyhow what I am wondering:
> 
>  a) Should we add a edge case in the hypervisor to allow multiple mappings
>    for this region? I am thinking no.. but it sounds like mapping ISA region
>    as WB is safe even in baremetal?

We should never allow multiple mappings with different cachability.
And I don't understand what makes you think the ISA region is safe
to map WB? There might be ROMs, MMIO regions, or simply nothing
there, neither of which is safe to map WB. ROMs certainly could be
WP, but that would require a way to reliably size not only ISA
extension ROMs, but also main and video BIOS.

>  b) Or would it be better to let Linux do its thing and treat 640KB->1MB
>    as uncached instead of writeback?

According to what you wrote earlier the two parts of the sentence
read contradictory to me.

>    Looking at the kernel it assumes that WB is ok for 640KB->1MB.
>    The comment says:
>    " /* Low ISA region is always mapped WB in page table. No need to track *"

As per above it's not clear to me what this comment is backed by.

Jan
Andrew Cooper Aug. 18, 2016, 10:16 a.m. UTC | #2
On 18/08/16 11:06, Jan Beulich wrote:
>>>> On 17.08.16 at 22:32, <konrad.wilk@oracle.com> wrote:
>> One of the interesting things about XSA 154 fix ("x86: enforce consistent
>> cachability of MMIO mappings") is that when certain applications (mcelog)
>> are trying to map /dev/mmap and lurk in ISA regions - we get:
> DYM /dev/mem ? Most accesses to which are bogus in PV guests
> (often including Dom0) anyway.
>
>> [   49.399053] WARNING: CPU: 0 PID: 2471 at arch/x86/mm/pat.c:913 untrack_pfn+0x93/0xc0()
> What Linux version is this? untrack_pfn() doesn't span line 913 in
> 4.8-rc2. And follow_phys() appears to only check whether the write
> flag is set as expected; I can't see any cachability checks. Plus it
> gets called only when both incoming address and size are zero.
>
>> Anyhow what I am wondering:
>>
>>  a) Should we add a edge case in the hypervisor to allow multiple mappings
>>    for this region? I am thinking no.. but it sounds like mapping ISA region
>>    as WB is safe even in baremetal?
> We should never allow multiple mappings with different cachability.
> And I don't understand what makes you think the ISA region is safe
> to map WB? There might be ROMs, MMIO regions, or simply nothing
> there, neither of which is safe to map WB. ROMs certainly could be
> WP, but that would require a way to reliably size not only ISA
> extension ROMs, but also main and video BIOS.
>
>>  b) Or would it be better to let Linux do its thing and treat 640KB->1MB
>>    as uncached instead of writeback?
> According to what you wrote earlier the two parts of the sentence
> read contradictory to me.
>
>>    Looking at the kernel it assumes that WB is ok for 640KB->1MB.
>>    The comment says:
>>    " /* Low ISA region is always mapped WB in page table. No need to track *"
> As per above it's not clear to me what this comment is backed by.

This states what is in the pagetables.  Not the combined result with MTRRs.

WB in the pagetables and WC/UB in the MTRRs is a legal combination which
functions correctly.

~Andrew
Jan Beulich Aug. 18, 2016, 11:12 a.m. UTC | #3
>>> On 18.08.16 at 12:16, <andrew.cooper3@citrix.com> wrote:
> On 18/08/16 11:06, Jan Beulich wrote:
>>>>> On 17.08.16 at 22:32, <konrad.wilk@oracle.com> wrote:
>>>    Looking at the kernel it assumes that WB is ok for 640KB->1MB.
>>>    The comment says:
>>>    " /* Low ISA region is always mapped WB in page table. No need to track 
> *"
>> As per above it's not clear to me what this comment is backed by.
> 
> This states what is in the pagetables.  Not the combined result with MTRRs.
> 
> WB in the pagetables and WC/UB in the MTRRs is a legal combination which
> functions correctly.

True, but then again - haven't I been told multiple times that Linux
nowadays prefers to run without using MTRRs?

Jan
Alan Cox Aug. 18, 2016, 3:35 p.m. UTC | #4
On Thu, 18 Aug 2016 05:12:54 -0600
"Jan Beulich" <JBeulich@suse.com> wrote:

> >>> On 18.08.16 at 12:16, <andrew.cooper3@citrix.com> wrote:  
> > On 18/08/16 11:06, Jan Beulich wrote:  
> >>>>> On 17.08.16 at 22:32, <konrad.wilk@oracle.com> wrote:  
> >>>    Looking at the kernel it assumes that WB is ok for 640KB->1MB.
> >>>    The comment says:
> >>>    " /* Low ISA region is always mapped WB in page table. No need to track   
> > *"  
> >> As per above it's not clear to me what this comment is backed by.  
> > 
> > This states what is in the pagetables.  Not the combined result with MTRRs.
> > 
> > WB in the pagetables and WC/UB in the MTRRs is a legal combination which
> > functions correctly.  
> 
> True, but then again - haven't I been told multiple times that Linux
> nowadays prefers to run without using MTRRs?

The BIOS sets up the fixed MTRR registers for the 640K-1MB window. Those
are separate to the variable range MTRR registers used for main memory
with specific mappings for segments A000 to BFFF then C000-C7FF /
C800-CFFF / etc up to FFFF.

Alan
Konrad Rzeszutek Wilk Aug. 19, 2016, 2:52 p.m. UTC | #5
On Thu, Aug 18, 2016 at 04:06:33AM -0600, Jan Beulich wrote:
> >>> On 17.08.16 at 22:32, <konrad.wilk@oracle.com> wrote:
> > One of the interesting things about XSA 154 fix ("x86: enforce consistent
> > cachability of MMIO mappings") is that when certain applications (mcelog)
> > are trying to map /dev/mmap and lurk in ISA regions - we get:
> 
> DYM /dev/mem ? Most accesses to which are bogus in PV guests
> (often including Dom0) anyway.

Yes.
> 
> > [   49.399053] WARNING: CPU: 0 PID: 2471 at arch/x86/mm/pat.c:913 untrack_pfn+0x93/0xc0()
> 
> What Linux version is this? untrack_pfn() doesn't span line 913 in

4.1
> 4.8-rc2. And follow_phys() appears to only check whether the write
> flag is set as expected; I can't see any cachability checks. Plus it
> gets called only when both incoming address and size are zero.

The error that happens is much sooner - that is when the VMA is setup
with the incorrect page attributes. Specifically: reserve_memtype which

 548         /* Low ISA region is always mapped WB in page table. No need to track */
 549         if (x86_platform.is_untracked_pat_range(start, end)) {                  
 550                 if (new_type)                                                   
 551                         *new_type = _PAGE_CACHE_MODE_WB;                        
 552                 return 0;                                                       
 553         }                                           

(And this for a change is v4.8-rc2)
> 
> > Anyhow what I am wondering:
> > 
> >  a) Should we add a edge case in the hypervisor to allow multiple mappings
> >    for this region? I am thinking no.. but it sounds like mapping ISA region
> >    as WB is safe even in baremetal?
> 
> We should never allow multiple mappings with different cachability.
> And I don't understand what makes you think the ISA region is safe
> to map WB? There might be ROMs, MMIO regions, or simply nothing
> there, neither of which is safe to map WB. ROMs certainly could be
> WP, but that would require a way to reliably size not only ISA
> extension ROMs, but also main and video BIOS.
> 
> >  b) Or would it be better to let Linux do its thing and treat 640KB->1MB
> >    as uncached instead of writeback?
> 
> According to what you wrote earlier the two parts of the sentence
> read contradictory to me.
> 
> >    Looking at the kernel it assumes that WB is ok for 640KB->1MB.
> >    The comment says:
> >    " /* Low ISA region is always mapped WB in page table. No need to track *"
> 
> As per above it's not clear to me what this comment is backed by.

I was hoping you would know :-)

Ah, commit 2e5d9c857d4e6c9e7b7d8c8c86a68a7842d213d6
Author: venkatesh.pallipadi@intel.com <venkatesh.pallipadi@intel.com>
Date:   Tue Mar 18 17:00:14 2008 -0700

    x86: PAT infrastructure patch
    
    Sets up pat_init() infrastructure.
    

which sets the MTRR for that region.
> 
> Jan
>
Jan Beulich Aug. 19, 2016, 3:14 p.m. UTC | #6
>>> On 19.08.16 at 16:52, <konrad.wilk@oracle.com> wrote:
> On Thu, Aug 18, 2016 at 04:06:33AM -0600, Jan Beulich wrote:
>> >>> On 17.08.16 at 22:32, <konrad.wilk@oracle.com> wrote:
>> >    Looking at the kernel it assumes that WB is ok for 640KB->1MB.
>> >    The comment says:
>> >    " /* Low ISA region is always mapped WB in page table. No need to track *"
>> 
>> As per above it's not clear to me what this comment is backed by.
> 
> I was hoping you would know :-)
> 
> Ah, commit 2e5d9c857d4e6c9e7b7d8c8c86a68a7842d213d6
> Author: venkatesh.pallipadi@intel.com <venkatesh.pallipadi@intel.com>
> Date:   Tue Mar 18 17:00:14 2008 -0700
> 
>     x86: PAT infrastructure patch
>     
>     Sets up pat_init() infrastructure.
>     
> 
> which sets the MTRR for that region.

Hmm, that's the commit which introduced pat.c years ago. I can't
see it writing an MTRRs though, nor can I see it backing the comment
in adds in any way. I guess I'm confused...

Jan
diff mbox

Patch

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 70a38c1..e5ff5a5 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -288,6 +287,12 @@  static void __init xen_banner(void)
 	       version >> 16, version & 0xffff, extra.extraversion,
 	       xen_feature(XENFEAT_mmu_pt_update_preserve_ad) ? " (preserve-AD)" : "");
 }
+
+static bool xen_ignore(u64 s, u64 e)
+{
+	return false;
+}
+
 /* Check if running on Xen version (major, minor) or later */
 bool
 xen_running_on_version_or_later(unsigned int major, unsigned int minor)
@@ -1563,7 +1570,7 @@  asmlinkage __visible void __init xen_start_kernel(void)
 		x86_init.resources.memory_setup = xen_memory_setup;
 	x86_init.oem.arch_setup = xen_arch_setup;
 	x86_init.oem.banner = xen_banner;
-
+	x86_platform.is_untracked_pat_range = xen_ignore;
 	xen_init_time_ops();
 
 	/*