[v3,00/21] huge page clearing optimizations

Message ID 20220606202109.1306034-1-ankur.a.arora@oracle.com (mailing list archive)

Ankur Arora June 6, 2022, 8:20 p.m. UTC
This series introduces two optimizations in the huge page clearing path:

 1. extends the clear_page() machinery to also handle extents larger
    than a single page.
 2. support non-cached page clearing for huge and gigantic pages.

The first optimization is useful for hugepage fault handling, the
second for prefaulting, or for gigantic pages.

The immediate motivation is to speedup creation of large VMs backed
by huge pages.

Performance
==

VM creation (192GB VM with prealloc'd 2MB backing pages) sees significant
run-time improvements:

 Icelakex:
                          Time (s)        Delta (%)
 clear_page_erms()     22.37 ( +- 0.14s )            #  9.21 bytes/ns
 clear_pages_erms()    16.49 ( +- 0.06s )  -26.28%   # 12.50 bytes/ns
 clear_pages_movnt()    9.42 ( +- 0.20s )  -42.87%   # 21.88 bytes/ns

 Milan:
                          Time (s)        Delta (%)
 clear_page_erms()     16.49 ( +- 0.06s )            # 12.50 bytes/ns
 clear_pages_erms()    11.82 ( +- 0.06s )  -28.32%   # 17.44 bytes/ns
 clear_pages_clzero()   4.91 ( +- 0.27s )  -58.49%   # 41.98 bytes/ns

As a side-effect, non-polluting clearing by eliding zero filling of
caches also shows better LLC miss rates. For a kbuild+background
page-clearing job, this shows up as a small improvement (~2%) in
runtime.

Discussion
==


With the motivation out of the way, the following note describes
v3's handling of past review comments (and other sticking points for
series of this nature -- especially the non-cached part -- over the
years):

1. Non-cached clearing is unnecessary on x86: x86 already uses 'REP;STOS',
   which, unlike a MOVNT loop, has semantically richer information available
   that can be used by current (and/or future) processors to make the
   same cache-elision optimization.

   All true, except that a) current-gen uarchs often don't, and b) even when
   they do, the kernel, by clearing at 4K granularity, doesn't expose
   the extent information in a way that processors could easily
   optimize for.

   For a), I tested a bunch of REP-STOSB/MOVNTI/CLZERO loops with different
   chunk sizes (in user-space over a VA extent of 4GB, page-size=4K.)
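
   For reference, the measurement loop was roughly along the following
   lines (a reconstruction rather than the exact test; the MOVNTI and
   CLZERO variants only differ in the inner clearing primitive):

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <time.h>
        #include <sys/mman.h>

        #define EXTENT  (4ULL << 30)    /* 4GB VA extent, 4K pages */

        /* Clear one chunk with a single REP STOSB. */
        static void clear_rep_stosb(void *dst, unsigned long len)
        {
                asm volatile("rep stosb"
                             : "+D" (dst), "+c" (len)
                             : "a" (0)
                             : "memory");
        }

        int main(void)
        {
                unsigned long chunk = 2UL << 20;  /* swept from 4K to 1GB */
                struct timespec t0, t1;
                char *buf = mmap(NULL, EXTENT, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
                                 -1, 0);

                if (buf == MAP_FAILED)
                        return 1;

                clock_gettime(CLOCK_MONOTONIC, &t0);
                for (unsigned long off = 0; off < EXTENT; off += chunk)
                        clear_rep_stosb(buf + off, chunk);
                clock_gettime(CLOCK_MONOTONIC, &t1);

                printf("chunk=%lu: %.0f MBps\n", chunk,
                       EXTENT / ((t1.tv_sec - t0.tv_sec) +
                                 (t1.tv_nsec - t0.tv_nsec) / 1e9) / 1e6);
                return 0;
        }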

   Intel Icelake (LLC=48MB, no_turbo=1):

     chunk-size    REP-STOSB       MOVNTI
                        MBps         MBps

             4K         9444        24510
            64K        11931        24508
             2M        12355        24524
             8M        12369        24525
            32M        12368        24523
           128M        12374        24522
            1GB        12372        24561

   Which is pretty flat across chunk-sizes.


   AMD Milan (LLC=32MB, boost=0):

     chunk-size    REP-STOSB       MOVNTI        CLZERO 
                        MBps         MBps          MBps 
                                                        
             4K        13034        17815         45579 
            64K        15196        18549         46038 
             2M        14821        18581         39064 
             8M        13964        18557         46045 
            32M        22525        18560         45969 
           128M        29311        18581         38924 
            1GB        35807        18574         45981 

    The scaling on Milan starts right around chunk=LLC-size. It
    asymptotically does seem to get close to CLZERO performance, but the
    scaling is linear and not a step function.

    For b), as I mention above, the kernel, by zeroing at 4K granularity,
    doesn't send the right signal to the uarch (though the largest
    extent we can use for huge pages is 2MB, and lower for preemptible
    kernels, which from these numbers is not large enough.)
    Still, using clear_page_extent() with larger extents would send the
    uarch a hint that it could capitalize on in the future.

    This is addressed in patches 1-6:
	"mm, huge-page: reorder arguments to process_huge_page()"
	"mm, huge-page: refactor process_subpage()"
	"clear_page: add generic clear_user_pages()"
	"mm, clear_huge_page: support clear_user_pages()"
	"mm/huge_page: generalize process_huge_page()"
	"x86/clear_page: add clear_pages()"

     with patch 5, "mm/huge_page: generalize process_huge_page()"
     containing the core logic.
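
     For illustration (a sketch of the direction, not the patch code
     itself), a multi-page clear on x86-64 could look something like:

        /*
         * Sketch: clear a physically contiguous extent of @npages 4K pages
         * with a single REP STOSB, exposing the full extent length to the
         * processor instead of clearing one page at a time.
         */
        static inline void clear_pages(void *addr, unsigned int npages)
        {
                unsigned long len = (unsigned long)npages << PAGE_SHIFT;

                asm volatile("rep stosb"
                             : "+D" (addr), "+c" (len)
                             : "a" (0)
                             : "memory");
        }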

2. Non-caching stores (via MOVNTI, CLZERO on x86) are weakly ordered with
   respect to the cache hierarchy and unless they are combined with an
   appropriate fence, are unsafe to use.

   This is true and is a problem. Patch 12, "sparse: add address_space
   __incoherent" adds a new sparse address_space which is used in
   the architectural interfaces to make sure that any user is cognizant
   of its use:

	void clear_user_pages_incoherent(__incoherent void *page, ...)
	void clear_pages_incoherent(__incoherent void *page, ...)

   One other place it is needed (and is missing) is in highmem:
       void clear_user_highpages_incoherent(struct page *page, ...).

   Given the natural highmem interface, I couldn't think of a good
   way to add the annotation here.
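
   For context, the annotation itself would be declared much like the
   existing sparse address spaces; the form below is illustrative rather
   than the exact hunk:

        /* include/linux/compiler_types.h */
        #ifdef __CHECKER__
        # define __incoherent  __attribute__((noderef, address_space(__incoherent)))
        #else
        # define __incoherent
        #endif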

3. Non-caching stores are generally slower than cached for extents
   smaller than LLC-size, and faster for larger ones.

   This means that if you choose the non-caching path for too small an
   extent, you would see performance regressions. There is of course
   benefit in not filling the cache with zeroes but that is a somewhat
   nebulous advantage and AFAICT there are no representative tests that
   probe for it.
   (Note that this slowness isn't a consequence of the extra fence --
   that is expensive but stops being noticeable for chunk-size >=
   ~32K-128K depending on uarch.)

   This is handled by adding an arch specific threshold (with a
   default CLEAR_PAGE_NON_CACHING_THRESHOLD=8MB.) in patches 15 and 16,
   "mm/clear_page: add clear_page_non_caching_threshold()",
   "x86/clear_page: add arch_clear_page_non_caching_threshold()".

   Further, a single call to clear_huge_pages() or get_/pin_user_pages()
   might only see a small portion of an extent being cleared in each
   iteration. To make sure we choose non-caching stores when working with
   large extents, patch 18, "gup: add FOLL_HINT_BULK,
   FAULT_FLAG_NON_CACHING", adds a new flag that gup users can use for
    this purpose. This is used in patch 20, "vfio_iommu_type1: specify
    FOLL_HINT_BULK to pin_user_pages()", when pinning process memory
    while attaching passthrough PCIe devices.
  
   The get_user_pages() logic to handle these flags is in patch 19,
   "gup: hint non-caching if clearing large regions".

4. Subpoint of 3) above (non-caching stores are faster for extents
   larger than LLC-sized) is generally true, with a side of Brownian
   motion thrown in. For instance, MOVNTI (for > LLC-size) performs well
   on Broadwell and Ice Lake, but on Skylake/Cascade Lake, sandwiched
   in between the two, it does not.

   To deal with this, we use Ingo's suggestion of "trust but verify"
   (https://lore.kernel.org/lkml/20201014153127.GB1424414@gmail.com/),
   where we enable MOVNT by default and only disable it on slow
   uarchs.
   If the non-caching path ends up being a part of the kernel, uarchs
   that regress would hopefully show up early enough in chip testing.

   Patch 11, "x86/cpuid: add X86_FEATURE_MOVNT_SLOW" adds this logic
   and patch 21, "x86/cpu/intel: set X86_FEATURE_MOVNT_SLOW for
   Skylake" disables the non-caching path for Skylake.

Performance numbers are in patches 6 and 19, "x86/clear_page: add
clear_pages()", "gup: hint non-caching if clearing large regions".

Also at:
  github.com/terminus/linux clear-page-non-caching.upstream-v3

Comments appreciated!

Changelog
==

v2: https://lore.kernel.org/lkml/20211020170305.376118-1-ankur.a.arora@oracle.com/
  - Add multi-page clearing: this addresses comments from Ingo
    (from v1), and from an offlist discussion with Linus.
  - Rename clear_pages_uncached() to make the lack of safety
    more obvious: this addresses comments from Andy Lutomirski.
  - Simplify the clear_huge_page() changes.
  - Usual cleanups etc.
  - Rebased to v5.18.


v1: https://lore.kernel.org/lkml/20201014083300.19077-1-ankur.a.arora@oracle.com/
  - Make the unsafe nature of clear_page_uncached() more obvious.
  - Invert X86_FEATURE_NT_GOOD to X86_FEATURE_MOVNT_SLOW, so we don't
    have to explicitly enable it for every new model: suggestion from
    Ingo Molnar.
  - Add GUP path (and appropriate threshold) to allow the uncached path
    to be used for huge pages.
  - Make the code more generic so it's tied to fewer x86 specific assumptions.

Thanks
Ankur

Ankur Arora (21):
  mm, huge-page: reorder arguments to process_huge_page()
  mm, huge-page: refactor process_subpage()
  clear_page: add generic clear_user_pages()
  mm, clear_huge_page: support clear_user_pages()
  mm/huge_page: generalize process_huge_page()
  x86/clear_page: add clear_pages()
  x86/asm: add memset_movnti()
  perf bench: add memset_movnti()
  x86/asm: add clear_pages_movnt()
  x86/asm: add clear_pages_clzero()
  x86/cpuid: add X86_FEATURE_MOVNT_SLOW
  sparse: add address_space __incoherent
  clear_page: add generic clear_user_pages_incoherent()
  x86/clear_page: add clear_pages_incoherent()
  mm/clear_page: add clear_page_non_caching_threshold()
  x86/clear_page: add arch_clear_page_non_caching_threshold()
  clear_huge_page: use non-cached clearing
  gup: add FOLL_HINT_BULK, FAULT_FLAG_NON_CACHING
  gup: hint non-caching if clearing large regions
  vfio_iommu_type1: specify FOLL_HINT_BULK to pin_user_pages()
  x86/cpu/intel: set X86_FEATURE_MOVNT_SLOW for Skylake

 arch/alpha/include/asm/page.h                |   1 +
 arch/arc/include/asm/page.h                  |   1 +
 arch/arm/include/asm/page.h                  |   1 +
 arch/arm64/include/asm/page.h                |   1 +
 arch/csky/include/asm/page.h                 |   1 +
 arch/hexagon/include/asm/page.h              |   1 +
 arch/ia64/include/asm/page.h                 |   1 +
 arch/m68k/include/asm/page.h                 |   1 +
 arch/microblaze/include/asm/page.h           |   1 +
 arch/mips/include/asm/page.h                 |   1 +
 arch/nios2/include/asm/page.h                |   2 +
 arch/openrisc/include/asm/page.h             |   1 +
 arch/parisc/include/asm/page.h               |   1 +
 arch/powerpc/include/asm/page.h              |   1 +
 arch/riscv/include/asm/page.h                |   1 +
 arch/s390/include/asm/page.h                 |   1 +
 arch/sh/include/asm/page.h                   |   1 +
 arch/sparc/include/asm/page_32.h             |   1 +
 arch/sparc/include/asm/page_64.h             |   1 +
 arch/um/include/asm/page.h                   |   1 +
 arch/x86/include/asm/cacheinfo.h             |   1 +
 arch/x86/include/asm/cpufeatures.h           |   1 +
 arch/x86/include/asm/page.h                  |  26 ++
 arch/x86/include/asm/page_64.h               |  64 ++++-
 arch/x86/kernel/cpu/amd.c                    |   2 +
 arch/x86/kernel/cpu/bugs.c                   |  30 +++
 arch/x86/kernel/cpu/cacheinfo.c              |  13 +
 arch/x86/kernel/cpu/cpu.h                    |   2 +
 arch/x86/kernel/cpu/intel.c                  |   2 +
 arch/x86/kernel/setup.c                      |   6 +
 arch/x86/lib/clear_page_64.S                 |  78 ++++--
 arch/x86/lib/memset_64.S                     |  68 ++---
 arch/xtensa/include/asm/page.h               |   1 +
 drivers/vfio/vfio_iommu_type1.c              |   3 +
 fs/hugetlbfs/inode.c                         |   7 +-
 include/asm-generic/clear_page.h             |  69 +++++
 include/asm-generic/page.h                   |   1 +
 include/linux/compiler_types.h               |   2 +
 include/linux/highmem.h                      |  46 ++++
 include/linux/mm.h                           |  10 +-
 include/linux/mm_types.h                     |   2 +
 mm/gup.c                                     |  18 ++
 mm/huge_memory.c                             |   3 +-
 mm/hugetlb.c                                 |  10 +-
 mm/memory.c                                  | 264 +++++++++++++++----
 tools/arch/x86/lib/memset_64.S               |  68 ++---
 tools/perf/bench/mem-memset-x86-64-asm-def.h |   6 +-
 47 files changed, 680 insertions(+), 144 deletions(-)
 create mode 100644 include/asm-generic/clear_page.h

Comments

Linus Torvalds June 6, 2022, 9:53 p.m. UTC | #1
On Mon, Jun 6, 2022 at 1:22 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> This series introduces two optimizations in the huge page clearing path:
>
>  1. extends the clear_page() machinery to also handle extents larger
>     than a single page.
>  2. support non-cached page clearing for huge and gigantic pages.
>
> The first optimization is useful for hugepage fault handling, the
> second for prefaulting, or for gigantic pages.

Please just split these two issues up into entirely different patch series.

That said, I have a few complaints about the individual patches even
in this form, to the point where I think the whole series is nasty:

 - get rid of 3/21 entirely. It's wrong in every possible way:

    (a) That shouldn't be an inline function in a header file at all.
If you're clearing several pages of data, that just shouldn't be an
inline function.

    (b) Get rid of __HAVE_ARCH_CLEAR_USER_PAGES. I hate how people
make up those idiotic pointless names.

         If you have to use a #ifdef, just use the name of the
function that the architecture overrides, not some other new name.

   But you don't need it at all, because

    (c) Just make a __weak function called clear_user_highpages() in
mm/highmem.c, and allow architectures to just create their own
non-weak ones.

 - patches 4/21 and 5/21: can we instead just get rid of that silly
"process_huge_page()" thing entirely. It's disgusting, and it's a big
part of why 'rep movs/stos' cannot work efficiently. It also makes NO
SENSE if you then use non-temporal accesses.

   So instead of doubling down on the craziness of that function, just
get rid of it entirely.

   There are two users, and they want to clear a hugepage and copy it
respectively. Don't make it harder than it is.

    *Maybe* the code wants to do a "prefetch" afterwards. Who knows.
But I really think you should do the crapectomy first, make the code
simpler and more straightforward, and just allow architectures to
override the *simple* "copy or clear a large page" rather than keep
feeding this butt-ugly monstrosity.

 - 13/21: see 3/21.

 - 14-17/21: see 4/21 and 5/21. Once you do the crapectomy and get rid
of the crazy process_huge_page() abstraction, and just let
architectures do their own clear/copy huge pages, *all* this craziness
goes away. Those "when to use which type of clear/copy" becomes a
*local* question, no silly arch_clear_page_non_caching_threshold()
garbage.

So I really don't like this series. A *lot* of it comes from that
horrible process_huge_page() model, and the whole model is just wrong
and pointless. You're literally trying to fix the mess that that
function is, but you're keeping the fundamental problem around.

The whole *point* of your patch-set is to use non-temporal stores,
which makes all the process_huge_page() things entirely pointless, and
only complicates things.

And even if we don't use non-temporal stores, that process_huge_page()
thing makes for trouble for any "rep stos/movs" implementation that
might actually do a better job if it was just chunked up in bigger
chunks.

Yes, yes, you probably still want to chunk that up somewhat due to
latency reasons, but even then architectures might as well just make
their own decisions, rather than have the core mm code make one
clearly bad decision for them. Maybe chunking it up in bigger chunks
than one page.

Maybe an architecture could do even more radical things like "let's
just 'rep stos' for the whole area, but set a special thread flag that
causes the interrupt return to break it up on return to kernel space".
IOW, the "latency fix" might not even be about chunking it up, it
might look more like our exception handling thing.

So I really think that crapectomy should be the first thing you do,
and that should be that first part of "extends the clear_page()
machinery to also handle extents larger than a single page"

                Linus
Ankur Arora June 7, 2022, 3:08 p.m. UTC | #2
[ Fixed email for Joao Martins. ]

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Mon, Jun 6, 2022 at 1:22 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
[snip]

> So I really don't like this series. A *lot* of it comes from that
> horrible process_huge_page() model, and the whole model is just wrong
> and pointless. You're literally trying to fix the mess that that
> function is, but you're keeping the fundamental problem around.
>
> The whole *point* of your patch-set is to use non-temporal stores,
> which makes all the process_huge_page() things entirely pointless, and
> only complicates things.
>
> And even if we don't use non-temporal stores, that process_huge_page()
> thing makes for trouble for any "rep stos/movs" implementation that
> might actually do a better job if it was just chunked up in bigger
> chunks.

This makes sense to me. There is a lot of unnecessary machinery
around process_huge_page() and this series adds more of it.

For highmem and page-at-a-time archs we would need to keep some
of the same optimizations (via the common clear/copy_user_highpages().)

Still, that rids the arch code of pointless constraints, as you
say below.

> Yes, yes, you probably still want to chunk that up somewhat due to
> latency reasons, but even then architectures might as well just make
> their own decisions, rather than have the core mm code make one
> clearly bad decision for them. Maybe chunking it up in bigger chunks
> than one page.

Right. Or doing the whole contiguous area in one or a few chunks,
and then touching the faulting cachelines towards the end.

> Maybe an architecture could do even more radical things like "let's
> just 'rep stos' for the whole area, but set a special thread flag that
> causes the interrupt return to break it up on return to kernel space".
> IOW, the "latency fix" might not even be about chunking it up, it
> might look more like our exception handling thing.

When I was thinking about this earlier, I had a vague inkling of
setting a thread flag and defer writes to the last few cachelines
for just before returning to user-space.
Can you elaborate a little about what you are describing above?

> So I really think that crapectomy should be the first thing you do,
> and that should be that first part of "extends the clear_page()
> machinery to also handle extents larger than a single page"

Ack that. And, thanks for the detailed comments.

--
ankur
Linus Torvalds June 7, 2022, 5:56 p.m. UTC | #3
On Tue, Jun 7, 2022 at 8:10 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> For highmem and page-at-a-time archs we would need to keep some
> of the same optimizations (via the common clear/copy_user_highpages().)

Yeah, I guess that we could keep the code for legacy use, just make
the existing code be marked __weak so that it can be ignored for any
further work.

IOW, the first patch might be to just add that __weak to
'clear_huge_page()' and 'copy_user_huge_page()'.

At that point, any architecture can just say "I will implement my own
versions of these two".

In fact, you can start with just one or the other, which is probably
nicer to keep the patch series smaller (ie do the simpler
"clear_huge_page()" first).

I worry a bit about the insanity of the "gigantic" pages, and the
mem_map_next() games it plays, but that code is from 2008 and I really
doubt it makes any sense to keep around at least for x86. The source
of that abomination is powerpc, and I do not think that whole issue
with MAX_ORDER_NR_PAGES makes any difference on x86, at least.

It most definitely makes no sense when there is no highmem issues, and
all those 'struct page' games should just be deleted (or at least
relegated entirely to that "legacy __weak function" case so that sane
situations don't need to care).

For that same HIGHMEM reason it's probably a good idea to limit the
new case just to x86-64, and leave 32-bit x86 behind.

> Right. Or doing the whole contiguous area in one or a few chunks,
> and then touching the faulting cachelines towards the end.

Yeah, just add a prefetch for the 'addr_hint' part at the end.

> > Maybe an architecture could do even more radical things like "let's
> > just 'rep stos' for the whole area, but set a special thread flag that
> > causes the interrupt return to break it up on return to kernel space".
> > IOW, the "latency fix" might not even be about chunking it up, it
> > might look more like our exception handling thing.
>
> When I was thinking about this earlier, I had a vague inkling of
> setting a thread flag and defer writes to the last few cachelines
> for just before returning to user-space.
> Can you elaborate a little about what you are describing above?

So 'process_huge_page()' (and the gigantic page case) does three very
different things:

 (a) that page chunking for highmem accesses

 (b) the page access _ordering_ for the cache hinting reasons

 (c) the chunking for _latency_ reasons

and I think all of them are basically "bad legacy" reasons, in that

 (a) HIGHMEM doesn't exist on sane architectures that we care about these days

 (b) the cache hinting ordering makes no sense if you do non-temporal
accesses (and might then be replaced by a possible "prefetch" at the
end)

 (c) the latency reasons still *do* exist, but only with PREEMPT_NONE

So what I was alluding to with those "more radical approaches" was
that PREEMPT_NONE case: we would probably still want to chunk things
up for latency reasons and do that "cond_resched()" in  between
chunks.

Now, there are alternatives here:

 (a) only override that existing disgusting (but tested) function when
both CONFIG_HIGHMEM and CONFIG_PREEMPT_NONE are false

 (b) do something like this:

    void clear_huge_page(struct page *page,
        unsigned long addr_hint,
        unsigned int pages_per_huge_page)
    {
        void *addr = page_address(page);
    #ifdef CONFIG_PREEMPT_NONE
        for (int i = 0; i < pages_per_huge_page; i++) {
            clear_page(addr, PAGE_SIZE);
            cond_preempt();
        }
    #else
        nontemporal_clear_big_area(addr, PAGE_SIZE*pages_per_huge_page);
        prefetch(addr_hint);
    #endif
    }

 or (c), do that "more radical approach", where you do something like this:

    void clear_huge_page(struct page *page,
        unsigned long addr_hint,
        unsigned int pages_per_huge_page)
    {
        void *addr = page_address(page);
        set_thread_flag(TIF_PREEMPT_ME);
        nontemporal_clear_big_area(addr, PAGE_SIZE*pages_per_huge_page);
        clear_thread_flag(TIF_PREEMPT_ME);
        prefetch(addr_hint);
    }

and then you make the "return to kernel mode" check the TIF_PREEMPT_ME
case and actually force preemption even on a non-preempt kernel.

It's _probably_ the case that CONFIG_PREEMPT_NONE is so rare that it's
not even worth doing. I dunno.

And all of the above pseudo-code may _look_ like real code, but is
entirely untested and entirely handwavy "something like this".

Hmm?

               Linus
Ankur Arora June 8, 2022, 7:24 p.m. UTC | #4
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, Jun 7, 2022 at 8:10 AM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>> For highmem and page-at-a-time archs we would need to keep some
>> of the same optimizations (via the common clear/copy_user_highpages().)
>
> Yeah, I guess that we could keep the code for legacy use, just make
> the existing code be marked __weak so that it can be ignored for any
> further work.
>
> IOW, the first patch might be to just add that __weak to
> 'clear_huge_page()' and 'copy_user_huge_page()'.
>
> At that point, any architecture can just say "I will implement my own
> versions of these two".
>
> In fact, you can start with just one or the other, which is probably
> nicer to keep the patch series smaller (ie do the simpler
> "clear_huge_page()" first).

Agreed. Best way to iron out all the kinks too.

> I worry a bit about the insanity of the "gigantic" pages, and the
> mem_map_next() games it plays, but that code is from 2008 and I really
> doubt it makes any sense to keep around at least for x86. The source
> of that abomination is powerpc, and I do not think that whole issue
> with MAX_ORDER_NR_PAGES makes any difference on x86, at least.

Looking at it now, it seems to be caused by the wide range of
MAX_ZONEORDER values on powerpc? It made my head hurt so I didn't try
to figure it out in detail.

But, even on x86, AFAICT gigantic pages could straddle MAX_SECTION_BITS?
Arch-specific clear_huge_page() code could, however, handle 1GB pages
via some kind of static loop around (30 - MAX_SECTION_BITS).

I'm a little fuzzy on CONFIG_SPARSEMEM_EXTREME, and !SPARSEMEM_VMEMMAP
configs. But, I think we should be able to not look up pfn_to_page(),
page_to_pfn() at all or at least not in the inner loop.

> It most definitely makes no sense when there is no highmem issues, and
> all those 'struct page' games should just be deleted (or at least
> relegated entirely to that "legacy __weak function" case so that sane
> situations don't need to care).

Yeah, I'm hoping to do exactly that.

> For that same HIGHMEM reason it's probably a good idea to limit the
> new case just to x86-64, and leave 32-bit x86 behind.

Ack that.

>> Right. Or doing the whole contiguous area in one or a few chunks,
>> and then touching the faulting cachelines towards the end.
>
> Yeah, just add a prefetch for the 'addr_hint' part at the end.
>
>> > Maybe an architecture could do even more radical things like "let's
>> > just 'rep stos' for the whole area, but set a special thread flag that
>> > causes the interrupt return to break it up on return to kernel space".
>> > IOW, the "latency fix" might not even be about chunking it up, it
>> > might look more like our exception handling thing.
>>
>> When I was thinking about this earlier, I had a vague inkling of
>> setting a thread flag and defer writes to the last few cachelines
>> for just before returning to user-space.
>> Can you elaborate a little about what you are describing above?
>
> So 'process_huge_page()' (and the gigantic page case) does three very
> different things:
>
>  (a) that page chunking for highmem accesses
>
>  (b) the page access _ordering_ for the cache hinting reasons
>
>  (c) the chunking for _latency_ reasons
>
> and I think all of them are basically "bad legacy" reasons, in that
>
>  (a) HIGHMEM doesn't exist on sane architectures that we care about these days
>
>  (b) the cache hinting ordering makes no sense if you do non-temporal
> accesses (and might then be replaced by a possible "prefetch" at the
> end)
>
>  (c) the latency reasons still *do* exist, but only with PREEMPT_NONE
>
> So what I was alluding to with those "more radical approaches" was
> that PREEMPT_NONE case: we would probably still want to chunk things
> up for latency reasons and do that "cond_resched()" in  between
> chunks.

Thanks for the detail. That helps.

> Now, there are alternatives here:
>
>  (a) only override that existing disgusting (but tested) function when
> both CONFIG_HIGHMEM and CONFIG_PREEMPT_NONE are false
>
>  (b) do something like this:
>
>     void clear_huge_page(struct page *page,
>         unsigned long addr_hint,
>         unsigned int pages_per_huge_page)
>     {
>         void *addr = page_address(page);
>     #ifdef CONFIG_PREEMPT_NONE
>         for (int i = 0; i < pages_per_huge_page; i++) {
>             clear_page(addr, PAGE_SIZE);
>             cond_preempt();
>         }
>     #else
>         nontemporal_clear_big_area(addr, PAGE_SIZE*pages_per_huge_page);
>         prefetch(addr_hint);
>     #endif
>     }

We'll need a preemption point there for CONFIG_PREEMPT_VOLUNTARY
as well, right? Either way, as you said earlier, we could chunk
up in bigger units than a single page.
(In the numbers I had posted earlier, chunking in units of up to 1MB
gave ~25% higher clearing BW. Don't think the microcode setup costs
are that high, but don't have a good explanation why.)

>  or (c), do that "more radical approach", where you do something like this:
>
>     void clear_huge_page(struct page *page,
>         unsigned long addr_hint,
>         unsigned int pages_per_huge_page)
>     {
>         void *addr = page_address(page);
>         set_thread_flag(TIF_PREEMPT_ME);
>         nontemporal_clear_big_area(addr, PAGE_SIZE*pages_per_huge_page);
>         clear_thread_flag(TIF_PREEMPT_ME);
>         prefetch(addr_hint);
>     }
>
> and then you make the "return to kernel mode" check the TIF_PREEMPT_ME
> case and actually force preemption even on a non-preempt kernel.

I like this one. I'll try out (b) and (c) and see how the code shakes
out.

Just one minor point -- seems to me that the choice of nontemporal or
temporal might have to be based on a hint to clear_huge_page().

Basically the nontemporal path is only faster for
(pages_per_huge_page * PAGE_SIZE > LLC-size).

So in the page-fault path it might make sense to use the temporal
path (except for gigantic pages.) In the prefault path, nontemporal
might be better.

Thanks

--
ankur
Linus Torvalds June 8, 2022, 7:39 p.m. UTC | #5
On Wed, Jun 8, 2022 at 12:25 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
> But, even on x86, AFAICT gigantic pages could straddle MAX_SECTION_BITS?
> An arch specific clear_huge_page() code could, however handle 1GB pages
> via some kind of static loop around (30 - MAX_SECTION_BITS).

Even if gigantic pages straddle that area, it simply shouldn't matter.

The only reason that MAX_SECTION_BITS matters is for the 'struct page *' lookup.

And the only reason for *that* is because of HIGHMEM.

So it's all entirely silly and pointless on any sane architecture, I think.

> We'll need a preemption point there for CONFIG_PREEMPT_VOLUNTARY
> as well, right?

Ahh, yes.  I should have looked at the code, and not just gone by my
"PREEMPT_NONE vs PREEMPT" thing that entirely forgot about how we
split that up.

> Just one minor point -- seems to me that the choice of nontemporal or
> temporal might have to be based on a hint to clear_huge_page().

Quite possibly. But I'd prefer that  as a separate "look, this
improves numbers by X%" thing from the whole "let's make the
clear_huge_page() interface at least sane".

                 Linus
Matthew Wilcox June 8, 2022, 7:49 p.m. UTC | #6
On Tue, Jun 07, 2022 at 10:56:01AM -0700, Linus Torvalds wrote:
> I worry a bit about the insanity of the "gigantic" pages, and the
> mem_map_next() games it plays, but that code is from 2008 and I really
> doubt it makes any sense to keep around at least for x86. The source
> of that abomination is powerpc, and I do not think that whole issue
> with MAX_ORDER_NR_PAGES makes any difference on x86, at least.

Oh, argh, I meant to delete mem_map_next(), and forgot.

If you need to use struct page (a later message hints you don't), just
use nth_page() directly.  I optimised it so it's not painful except on
SPARSEMEM && !SPARSEMEM_VMEMMAP back in December in commit 659508f9c936.
And nobody cares about performance on SPARSEMEM && !SPARSEMEM_VMEMMAP
systems.
Matthew Wilcox June 8, 2022, 7:51 p.m. UTC | #7
On Wed, Jun 08, 2022 at 08:49:57PM +0100, Matthew Wilcox wrote:
> On Tue, Jun 07, 2022 at 10:56:01AM -0700, Linus Torvalds wrote:
> > I worry a bit about the insanity of the "gigantic" pages, and the
> > mem_map_next() games it plays, but that code is from 2008 and I really
> > doubt it makes any sense to keep around at least for x86. The source
> > of that abomination is powerpc, and I do not think that whole issue
> > with MAX_ORDER_NR_PAGES makes any difference on x86, at least.
> 
> Oh, argh, I meant to delete mem_map_next(), and forgot.
> 
> If you need to use struct page (a later message hints you don't), just
> use nth_page() directly.  I optimised it so it's not painful except on
> SPARSEMEM && !SPARSEMEM_VMEMMAP back in December in commit 659508f9c936.
> And nobody cares about performance on SPARSEMEM && !SPARSEMEM_VMEMMAP
> systems.

Oops, wrong commit.  I meant 1cfcee728391 from June 2021.
Ankur Arora June 8, 2022, 8:21 p.m. UTC | #8
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Wed, Jun 8, 2022 at 12:25 PM Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>> But, even on x86, AFAICT gigantic pages could straddle MAX_SECTION_BITS?
>> An arch specific clear_huge_page() code could, however handle 1GB pages
>> via some kind of static loop around (30 - MAX_SECTION_BITS).
>
> Even if gigantic pages straddle that area, it simply shouldn't matter.
>
> The only reason that MAX_SECTION_BITS matters is for the 'struct page *' lookup.
>
> And the only reason for *that* is because of HIGHMEM.
>
> So it's all entirely silly and pointless on any sane architecture, I think.
>
>> We'll need a preemption point there for CONFIG_PREEMPT_VOLUNTARY
>> as well, right?
>
> Ahh, yes.  I should have looked at the code, and not just gone by my
> "PREEMPT_NONE vs PREEMPT" thing that entirely forgot about how we
> split that up.
>
>> Just one minor point -- seems to me that the choice of nontemporal or
>> temporal might have to be based on a hint to clear_huge_page().
>
> Quite possibly. But I'd prefer that  as a separate "look, this
> improves numbers by X%" thing from the whole "let's make the
> clear_huge_page() interface at least sane".

Makes sense to me.

--
ankur