diff mbox series

[2/7] x86: introduce ioremap_wc()

Message ID 0e1d3e9e-ed29-d7e8-e4fc-e48a0454faa4@suse.com (mailing list archive)
State Superseded
Headers show
Series x86: memcpy() / memset() (non-)ERMS flavors plus fallout | expand

Commit Message

Jan Beulich April 27, 2021, 12:54 p.m. UTC
In order for a to-be-introduced ERMS form of memcpy() to not regress
boot performance on certain systems when video output is active, we
first need to arrange for avoiding further dependency on firmware
setting up MTRRs in a way we can actually further modify. On many
systems, due to the continuously growing amounts of installed memory,
MTRRs get configured with at least one huge WB range, and with MMIO
ranges below 4Gb then forced to UC via overlapping MTRRs. mtrr_add(), as
it is today, can't deal with such a setup. Hence on such systems we
presently leave the frame buffer mapped UC, leading to significantly
reduced performance when using REP STOSB / REP MOVSB.

On post-PentiumII hardware (i.e. any that's capable of running 64-bit
code), an effective memory type of WC can be achieved without MTRRs, by
simply referencing the respective PAT entry from the PTEs. While this
will leave the switch to ERMS forms of memset() and memcpy() with
largely unchanged performance, the change here on its own improves
performance on affected systems quite significantly: Measuring just the
individual affected memcpy() invocations yielded a speedup by a factor
of over 250 on my initial (Skylake) test system. memset() isn't getting
improved by as much there, but still by a factor of about 20.

While adding {__,}PAGE_HYPERVISOR_WC, also add {__,}PAGE_HYPERVISOR_WT
to, at the very least, make clear what PTE flags this memory type uses.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
TBD: Both callers are __init, so in principle ioremap_wc() could be so,
     too, at least for the time being.
TBD: If the VGA range is WC in the fixed range MTRRs, reusing the low
     1st Mb mapping (like ioremap() does) would be an option.

Comments

Andrew Cooper April 27, 2021, 5:13 p.m. UTC | #1
On 27/04/2021 13:54, Jan Beulich wrote:
> In order for a to-be-introduced ERMS form of memcpy() to not regress
> boot performance on certain systems when video output is active, we
> first need to arrange for avoiding further dependency on firmware
> setting up MTRRs in a way we can actually further modify. On many
> systems, due to the continuously growing amounts of installed memory,
> MTRRs get configured with at least one huge WB range, and with MMIO
> ranges below 4Gb then forced to UC via overlapping MTRRs. mtrr_add(), as
> it is today, can't deal with such a setup. Hence on such systems we
> presently leave the frame buffer mapped UC, leading to significantly
> reduced performance when using REP STOSB / REP MOVSB.
>
> On post-PentiumII hardware (i.e. any that's capable of running 64-bit
> code), an effective memory type of WC can be achieved without MTRRs, by
> simply referencing the respective PAT entry from the PTEs. While this
> will leave the switch to ERMS forms of memset() and memcpy() with
> largely unchanged performance, the change here on its own improves
> performance on affected systems quite significantly: Measuring just the
> individual affected memcpy() invocations yielded a speedup by a factor
> of over 250 on my initial (Skylake) test system. memset() isn't getting
> improved by as much there, but still by a factor of about 20.
>
> While adding {__,}PAGE_HYPERVISOR_WC, also add {__,}PAGE_HYPERVISOR_WT
> to, at the very least, make clear what PTE flags this memory type uses.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---

Seeing as MTRRs are full of firmware issues, shouldn't we also
cross-check that the vram is marked WC, or we'll still fall into bad
perf from combining down to UC.  (Obviously follow-up work if so.)

> TBD: Both callers are __init, so in principle ioremap_wc() could be so,
>      too, at least for the time being.

I don't see us making use this at runtime.  Uses of WC are few and far
between.

> TBD: If the VGA range is WC in the fixed range MTRRs, reusing the low
>      1st Mb mapping (like ioremap() does) would be an option.

It might be fine to do that unconditionally.  The low VRAM has had known
settings for 2 decades now.

That said, the low 1MB does use UC- mappings, which means we're entirely
dependent on MTRRs specifying WC for sensible performance.

> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -5883,6 +5883,20 @@ void __iomem *ioremap(paddr_t pa, size_t
>      return (void __force __iomem *)va;
>  }
>  
> +void __iomem *ioremap_wc(paddr_t pa, size_t len)
> +{
> +    mfn_t mfn = _mfn(PFN_DOWN(pa));
> +    unsigned int offs = pa & (PAGE_SIZE - 1);
> +    unsigned int nr = PFN_UP(offs + len);
> +    void *va;
> +
> +    WARN_ON(page_is_ram_type(mfn_x(mfn), RAM_TYPE_CONVENTIONAL));
> +
> +    va = __vmap(&mfn, nr, 1, 1, PAGE_HYPERVISOR_WC, VMAP_DEFAULT);

This doesn't look correct.  granularity and nr are passed the wrong way
around, but maybe that's related to the fact that only a single mfn is
passed.  I'm confused.

Also, several truncations will occur for a framebuffer > 4G, both with
calculations here, and the types of __vmap()'s parameters.

> +
> +    return (void __force __iomem *)(va + offs);
> +}
> +
>  int create_perdomain_mapping(struct domain *d, unsigned long va,
>                               unsigned int nr, l1_pgentry_t **pl1tab,
>                               struct page_info **ppg)
> --- a/xen/drivers/video/vesa.c
> +++ b/xen/drivers/video/vesa.c
> @@ -9,9 +9,9 @@
>  #include <xen/param.h>
>  #include <xen/xmalloc.h>
>  #include <xen/kernel.h>
> +#include <xen/mm.h>
>  #include <xen/vga.h>
>  #include <asm/io.h>
> -#include <asm/page.h>
>  #include "font.h"
>  #include "lfb.h"
>  
> @@ -103,7 +103,7 @@ void __init vesa_init(void)
>      lfbp.text_columns = vlfb_info.width / font->width;
>      lfbp.text_rows = vlfb_info.height / font->height;
>  
> -    lfbp.lfb = lfb = ioremap(lfb_base(), vram_remap);
> +    lfbp.lfb = lfb = ioremap_wc(lfb_base(), vram_remap);
>      if ( !lfb )
>          return;
>  
> @@ -179,8 +179,7 @@ void __init vesa_mtrr_init(void)
>  
>  static void lfb_flush(void)
>  {
> -    if ( vesa_mtrr == 3 )
> -        __asm__ __volatile__ ("sfence" : : : "memory");
> +    __asm__ __volatile__ ("sfence" : : : "memory");

wmb(), seeing as that is the operation we mean here?

>  }
>  
>  void __init vesa_endboot(bool_t keep)
> --- a/xen/drivers/video/vga.c
> +++ b/xen/drivers/video/vga.c
> @@ -79,7 +79,7 @@ void __init video_init(void)
>      {
>      case XEN_VGATYPE_TEXT_MODE_3:
>          if ( page_is_ram_type(paddr_to_pfn(0xB8000), RAM_TYPE_CONVENTIONAL) ||
> -             ((video = ioremap(0xB8000, 0x8000)) == NULL) )
> +             ((video = ioremap_wc(0xB8000, 0x8000)) == NULL) )
>              return;
>          outw(0x200a, 0x3d4); /* disable cursor */
>          columns = vga_console_info.u.text_mode_3.columns;
> @@ -164,7 +164,11 @@ void __init video_endboot(void)
>      {
>      case XEN_VGATYPE_TEXT_MODE_3:
>          if ( !vgacon_keep )
> +        {
>              memset(video, 0, columns * lines * 2);
> +            iounmap(video);
> +            video = ZERO_BLOCK_PTR;
> +        }
>          break;

Shouldn't this hunk be in patch 5?

>      case XEN_VGATYPE_VESA_LFB:
>      case XEN_VGATYPE_EFI_LFB:
> --- a/xen/include/asm-x86/mm.h
> +++ b/xen/include/asm-x86/mm.h
> @@ -615,6 +615,8 @@ void destroy_perdomain_mapping(struct do
>                                 unsigned int nr);
>  void free_perdomain_mappings(struct domain *);
>  
> +void __iomem *ioremap_wc(paddr_t, size_t);

I'm not sure we want to add a second prototype.  ARM has ioremap_wc()
too, and we absolutely don't want them to get out of sync, and we have
two new architectures on the horizon.

Perhaps a new xen/ioremap.h which includes asm/ioremap.h  (although
thinking forward to encrypted RAM, we might want something which can
also encompass the memremap*() variants.)

ARM can #define ioremap_wc ioremap_wc and provide their inline wrapper. 
x86 can fall back to the common prototype.  Other architectures can do
whatever is best for them.

~Andrew
Jan Beulich April 28, 2021, 9:41 a.m. UTC | #2
On 27.04.2021 19:13, Andrew Cooper wrote:
> On 27/04/2021 13:54, Jan Beulich wrote:
>> In order for a to-be-introduced ERMS form of memcpy() to not regress
>> boot performance on certain systems when video output is active, we
>> first need to arrange for avoiding further dependency on firmware
>> setting up MTRRs in a way we can actually further modify. On many
>> systems, due to the continuously growing amounts of installed memory,
>> MTRRs get configured with at least one huge WB range, and with MMIO
>> ranges below 4Gb then forced to UC via overlapping MTRRs. mtrr_add(), as
>> it is today, can't deal with such a setup. Hence on such systems we
>> presently leave the frame buffer mapped UC, leading to significantly
>> reduced performance when using REP STOSB / REP MOVSB.
>>
>> On post-PentiumII hardware (i.e. any that's capable of running 64-bit
>> code), an effective memory type of WC can be achieved without MTRRs, by
>> simply referencing the respective PAT entry from the PTEs. While this
>> will leave the switch to ERMS forms of memset() and memcpy() with
>> largely unchanged performance, the change here on its own improves
>> performance on affected systems quite significantly: Measuring just the
>> individual affected memcpy() invocations yielded a speedup by a factor
>> of over 250 on my initial (Skylake) test system. memset() isn't getting
>> improved by as much there, but still by a factor of about 20.
>>
>> While adding {__,}PAGE_HYPERVISOR_WC, also add {__,}PAGE_HYPERVISOR_WT
>> to, at the very least, make clear what PTE flags this memory type uses.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
> 
> Seeing as MTRRs are full of firmware issues, shouldn't we also
> cross-check that the vram is marked WC, or we'll still fall into bad
> perf from combining down to UC.  (Obviously follow-up work if so.)

VRAM generally isn't marked WC nowadays, at least from my observations.
And it doesn't need to be - there's no "combining down to UC" when the
WC PAT entry is chosen by the PTE. Plus firmware doing so may actually
be counter-productive, as the VRAM address range may easily move when
BARs get re-assigned by an OS.

>> TBD: Both callers are __init, so in principle ioremap_wc() could be so,
>>      too, at least for the time being.
> 
> I don't see us making use this at runtime.  Uses of WC are few and far
> between.

Okay, will mark the function __init then.

>> TBD: If the VGA range is WC in the fixed range MTRRs, reusing the low
>>      1st Mb mapping (like ioremap() does) would be an option.
> 
> It might be fine to do that unconditionally.  The low VRAM has had known
> settings for 2 decades now.
> 
> That said, the low 1MB does use UC- mappings, which means we're entirely
> dependent on MTRRs specifying WC for sensible performance.

The two parts of your reply look to contradict one another: I've been
suggesting to use the low-Mb mapping precisely only when the fixed
range MTRRs covering the range are all saying WC.

>> --- a/xen/arch/x86/mm.c
>> +++ b/xen/arch/x86/mm.c
>> @@ -5883,6 +5883,20 @@ void __iomem *ioremap(paddr_t pa, size_t
>>      return (void __force __iomem *)va;
>>  }
>>  
>> +void __iomem *ioremap_wc(paddr_t pa, size_t len)
>> +{
>> +    mfn_t mfn = _mfn(PFN_DOWN(pa));
>> +    unsigned int offs = pa & (PAGE_SIZE - 1);
>> +    unsigned int nr = PFN_UP(offs + len);
>> +    void *va;
>> +
>> +    WARN_ON(page_is_ram_type(mfn_x(mfn), RAM_TYPE_CONVENTIONAL));
>> +
>> +    va = __vmap(&mfn, nr, 1, 1, PAGE_HYPERVISOR_WC, VMAP_DEFAULT);
> 
> This doesn't look correct.  granularity and nr are passed the wrong way
> around, but maybe that's related to the fact that only a single mfn is
> passed.  I'm confused.

It's identical to ioremap(). And yes, it's this way because of there
just being a single MFN getting passed in here. If I flipped the 2nd
and 3rd arguments, nr calls to map_pages_to_xen() would result
instead of just one.

> Also, several truncations will occur for a framebuffer > 4G, both with
> calculations here, and the types of __vmap()'s parameters.

Truncations may occur, indeed, but for frame buffers >= 16Tb. I
didn't think I needed to worry about this more here than ioremap()
does.

>> @@ -179,8 +179,7 @@ void __init vesa_mtrr_init(void)
>>  
>>  static void lfb_flush(void)
>>  {
>> -    if ( vesa_mtrr == 3 )
>> -        __asm__ __volatile__ ("sfence" : : : "memory");
>> +    __asm__ __volatile__ ("sfence" : : : "memory");
> 
> wmb(), seeing as that is the operation we mean here?

Not sure, to be honest. It's not been all this long ago that
smp_wmb() and wmb() were the same. And it's also not a write
barrier we mean here, but specifically SFENCE (which merely
happens to be the insn of choice for implementing wmb()). We
don't care about having a full-fledged write barrier here -
other writes getting reordered would in principle be fine.

>> --- a/xen/drivers/video/vga.c
>> +++ b/xen/drivers/video/vga.c
>> @@ -79,7 +79,7 @@ void __init video_init(void)
>>      {
>>      case XEN_VGATYPE_TEXT_MODE_3:
>>          if ( page_is_ram_type(paddr_to_pfn(0xB8000), RAM_TYPE_CONVENTIONAL) ||
>> -             ((video = ioremap(0xB8000, 0x8000)) == NULL) )
>> +             ((video = ioremap_wc(0xB8000, 0x8000)) == NULL) )
>>              return;
>>          outw(0x200a, 0x3d4); /* disable cursor */
>>          columns = vga_console_info.u.text_mode_3.columns;
>> @@ -164,7 +164,11 @@ void __init video_endboot(void)
>>      {
>>      case XEN_VGATYPE_TEXT_MODE_3:
>>          if ( !vgacon_keep )
>> +        {
>>              memset(video, 0, columns * lines * 2);
>> +            iounmap(video);
>> +            video = ZERO_BLOCK_PTR;
>> +        }
>>          break;
> 
> Shouldn't this hunk be in patch 5?

Only if ioremap_wc() unconditionally re-used the low-Mb mapping;
I don't want to introduce another leak, paralleling the one which
patch 5 addresses. And as per above I see ioremap_wc() use __va()
at best conditionally.

>> --- a/xen/include/asm-x86/mm.h
>> +++ b/xen/include/asm-x86/mm.h
>> @@ -615,6 +615,8 @@ void destroy_perdomain_mapping(struct do
>>                                 unsigned int nr);
>>  void free_perdomain_mappings(struct domain *);
>>  
>> +void __iomem *ioremap_wc(paddr_t, size_t);
> 
> I'm not sure we want to add a second prototype.  ARM has ioremap_wc()
> too, and we absolutely don't want them to get out of sync, and we have
> two new architectures on the horizon.
> 
> Perhaps a new xen/ioremap.h which includes asm/ioremap.h  (although
> thinking forward to encrypted RAM, we might want something which can
> also encompass the memremap*() variants.)
> 
> ARM can #define ioremap_wc ioremap_wc and provide their inline wrapper. 
> x86 can fall back to the common prototype.  Other architectures can do
> whatever is best for them.

I'm afraid I don't see how what you suggest would not lead to
duplicate prototypes. Arm wanting the function be an inline
one precludes exposing the extern declaration there. Hence
there will still be a risk of things going out of sync. In
fact I first had the declaration next to ioremap()'s, until I
ran into the build issue on Arm.

Jan
diff mbox series

Patch

--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -5883,6 +5883,20 @@  void __iomem *ioremap(paddr_t pa, size_t
     return (void __force __iomem *)va;
 }
 
+void __iomem *ioremap_wc(paddr_t pa, size_t len)
+{
+    mfn_t mfn = _mfn(PFN_DOWN(pa));
+    unsigned int offs = pa & (PAGE_SIZE - 1);
+    unsigned int nr = PFN_UP(offs + len);
+    void *va;
+
+    WARN_ON(page_is_ram_type(mfn_x(mfn), RAM_TYPE_CONVENTIONAL));
+
+    va = __vmap(&mfn, nr, 1, 1, PAGE_HYPERVISOR_WC, VMAP_DEFAULT);
+
+    return (void __force __iomem *)(va + offs);
+}
+
 int create_perdomain_mapping(struct domain *d, unsigned long va,
                              unsigned int nr, l1_pgentry_t **pl1tab,
                              struct page_info **ppg)
--- a/xen/drivers/video/vesa.c
+++ b/xen/drivers/video/vesa.c
@@ -9,9 +9,9 @@ 
 #include <xen/param.h>
 #include <xen/xmalloc.h>
 #include <xen/kernel.h>
+#include <xen/mm.h>
 #include <xen/vga.h>
 #include <asm/io.h>
-#include <asm/page.h>
 #include "font.h"
 #include "lfb.h"
 
@@ -103,7 +103,7 @@  void __init vesa_init(void)
     lfbp.text_columns = vlfb_info.width / font->width;
     lfbp.text_rows = vlfb_info.height / font->height;
 
-    lfbp.lfb = lfb = ioremap(lfb_base(), vram_remap);
+    lfbp.lfb = lfb = ioremap_wc(lfb_base(), vram_remap);
     if ( !lfb )
         return;
 
@@ -179,8 +179,7 @@  void __init vesa_mtrr_init(void)
 
 static void lfb_flush(void)
 {
-    if ( vesa_mtrr == 3 )
-        __asm__ __volatile__ ("sfence" : : : "memory");
+    __asm__ __volatile__ ("sfence" : : : "memory");
 }
 
 void __init vesa_endboot(bool_t keep)
--- a/xen/drivers/video/vga.c
+++ b/xen/drivers/video/vga.c
@@ -79,7 +79,7 @@  void __init video_init(void)
     {
     case XEN_VGATYPE_TEXT_MODE_3:
         if ( page_is_ram_type(paddr_to_pfn(0xB8000), RAM_TYPE_CONVENTIONAL) ||
-             ((video = ioremap(0xB8000, 0x8000)) == NULL) )
+             ((video = ioremap_wc(0xB8000, 0x8000)) == NULL) )
             return;
         outw(0x200a, 0x3d4); /* disable cursor */
         columns = vga_console_info.u.text_mode_3.columns;
@@ -164,7 +164,11 @@  void __init video_endboot(void)
     {
     case XEN_VGATYPE_TEXT_MODE_3:
         if ( !vgacon_keep )
+        {
             memset(video, 0, columns * lines * 2);
+            iounmap(video);
+            video = ZERO_BLOCK_PTR;
+        }
         break;
     case XEN_VGATYPE_VESA_LFB:
     case XEN_VGATYPE_EFI_LFB:
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -615,6 +615,8 @@  void destroy_perdomain_mapping(struct do
                                unsigned int nr);
 void free_perdomain_mappings(struct domain *);
 
+void __iomem *ioremap_wc(paddr_t, size_t);
+
 extern int memory_add(unsigned long spfn, unsigned long epfn, unsigned int pxm);
 
 void domain_set_alloc_bitsize(struct domain *d);
--- a/xen/include/asm-x86/page.h
+++ b/xen/include/asm-x86/page.h
@@ -349,8 +349,10 @@  void efi_update_l4_pgtable(unsigned int
 #define __PAGE_HYPERVISOR_RX      (_PAGE_PRESENT | _PAGE_ACCESSED)
 #define __PAGE_HYPERVISOR         (__PAGE_HYPERVISOR_RX | \
                                    _PAGE_DIRTY | _PAGE_RW)
+#define __PAGE_HYPERVISOR_WT      (__PAGE_HYPERVISOR | _PAGE_PWT)
 #define __PAGE_HYPERVISOR_UCMINUS (__PAGE_HYPERVISOR | _PAGE_PCD)
 #define __PAGE_HYPERVISOR_UC      (__PAGE_HYPERVISOR | _PAGE_PCD | _PAGE_PWT)
+#define __PAGE_HYPERVISOR_WC      (__PAGE_HYPERVISOR | _PAGE_PAT)
 #define __PAGE_HYPERVISOR_SHSTK   (__PAGE_HYPERVISOR_RO | _PAGE_DIRTY)
 
 #define MAP_SMALL_PAGES _PAGE_AVAIL0 /* don't use superpages mappings */
--- a/xen/include/asm-x86/x86_64/page.h
+++ b/xen/include/asm-x86/x86_64/page.h
@@ -154,6 +154,10 @@  static inline intpte_t put_pte_flags(uns
                                  _PAGE_GLOBAL | _PAGE_NX)
 #define PAGE_HYPERVISOR_UC      (__PAGE_HYPERVISOR_UC | \
                                  _PAGE_GLOBAL | _PAGE_NX)
+#define PAGE_HYPERVISOR_WC      (__PAGE_HYPERVISOR_WC | \
+                                 _PAGE_GLOBAL | _PAGE_NX)
+#define PAGE_HYPERVISOR_WT      (__PAGE_HYPERVISOR_WT | \
+                                 _PAGE_GLOBAL | _PAGE_NX)
 
 #endif /* __X86_64_PAGE_H__ */