
[v11,07/14] mm: multi-gen LRU: exploit locality in rmap

Message ID 20220518014632.922072-8-yuzhao@google.com (mailing list archive)
State New
Series Multi-Gen LRU Framework

Commit Message

Yu Zhao May 18, 2022, 1:46 a.m. UTC
Searching the rmap for PTEs mapping each page on an LRU list (to test
and clear the accessed bit) can be expensive because pages from
different VMAs (PA space) are not cache friendly to the rmap (VA
space). For workloads mostly using mapped pages, the rmap has a high
CPU cost in the reclaim path.

This patch exploits spatial locality to reduce the trips into the
rmap. When shrink_page_list() walks the rmap and finds a young PTE, a
new function lru_gen_look_around() scans at most BITS_PER_LONG-1
adjacent PTEs. On finding another young PTE, it clears the accessed
bit and updates the gen counter of the page mapped by this PTE to
(max_seq%MAX_NR_GENS)+1.
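
In essence, the new function runs a loop like the condensed sketch below
(a simplified paraphrase of the mm/vmscan.c hunk in this patch; the real
code also clamps the window to the PMD and VMA boundaries, skips pages from
other nodes and memcgs, and batches the generation updates under the LRU
lock). Here pte, start, end and new_gen = lru_gen_from_seq(max_seq) are the
locals of lru_gen_look_around():

	for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) {
		if (!pte_present(pte[i]) || !pte_young(pte[i]))
			continue;

		/* clear the accessed bit; no TLB flush, no MMU notifier */
		if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
			continue;

		/* promote the mapped folio to the youngest generation */
		folio_update_gen(pfn_folio(pte_pfn(pte[i])), new_gen);
	}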

Server benchmark results:
  Single workload:
    fio (buffered I/O): no change

  Single workload:
    memcached (anon): +[5.5, 7.5]%
                Ops/sec      KB/sec
      patch1-6: 1120643.70   43588.06
      patch1-7: 1193918.93   46438.15

  Configurations:
    no change

Client benchmark results:
  kswapd profiles:
    patch1-6
      35.99%  lzo1x_1_do_compress (real work)
      19.40%  page_vma_mapped_walk
       6.31%  _raw_spin_unlock_irq
       3.95%  do_raw_spin_lock
       2.39%  anon_vma_interval_tree_iter_first
       2.25%  ptep_clear_flush
       1.92%  __anon_vma_interval_tree_subtree_search
       1.70%  folio_referenced_one
       1.68%  __zram_bvec_write
       1.43%  anon_vma_interval_tree_iter_next

    patch1-7
      45.90%  lzo1x_1_do_compress (real work)
       9.14%  page_vma_mapped_walk
       6.81%  _raw_spin_unlock_irq
       2.80%  ptep_clear_flush
       2.34%  __zram_bvec_write
       2.29%  do_raw_spin_lock
       1.84%  lru_gen_look_around
       1.78%  memmove
       1.74%  obj_malloc
       1.50%  free_unref_page_list

  Configurations:
    no change

Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
 include/linux/memcontrol.h |  31 ++++++++
 include/linux/mm.h         |   5 ++
 include/linux/mmzone.h     |   6 ++
 mm/internal.h              |   1 +
 mm/memcontrol.c            |   1 +
 mm/rmap.c                  |   7 ++
 mm/swap.c                  |   4 +-
 mm/vmscan.c                | 157 +++++++++++++++++++++++++++++++++++++
 8 files changed, 210 insertions(+), 2 deletions(-)

Comments

Barry Song June 6, 2022, 9:25 a.m. UTC | #1
On Wed, May 18, 2022 at 4:49 PM Yu Zhao <yuzhao@google.com> wrote:
>
> Searching the rmap for PTEs mapping each page on an LRU list (to test
> and clear the accessed bit) can be expensive because pages from
> different VMAs (PA space) are not cache friendly to the rmap (VA
> space). For workloads mostly using mapped pages, the rmap has a high
> CPU cost in the reclaim path.
>
> This patch exploits spatial locality to reduce the trips into the
> rmap. When shrink_page_list() walks the rmap and finds a young PTE, a
> new function lru_gen_look_around() scans at most BITS_PER_LONG-1
> adjacent PTEs. On finding another young PTE, it clears the accessed
> bit and updates the gen counter of the page mapped by this PTE to
> (max_seq%MAX_NR_GENS)+1.
>
> Server benchmark results:
>   Single workload:
>     fio (buffered I/O): no change
>
>   Single workload:
>     memcached (anon): +[5.5, 7.5]%
>                 Ops/sec      KB/sec
>       patch1-6: 1120643.70   43588.06
>       patch1-7: 1193918.93   46438.15
>
>   Configurations:
>     no change
>
> Client benchmark results:
>   kswapd profiles:
>     patch1-6
>       35.99%  lzo1x_1_do_compress (real work)
>       19.40%  page_vma_mapped_walk
>        6.31%  _raw_spin_unlock_irq
>        3.95%  do_raw_spin_lock
>        2.39%  anon_vma_interval_tree_iter_first
>        2.25%  ptep_clear_flush
>        1.92%  __anon_vma_interval_tree_subtree_search
>        1.70%  folio_referenced_one
>        1.68%  __zram_bvec_write
>        1.43%  anon_vma_interval_tree_iter_next
>
>     patch1-7
>       45.90%  lzo1x_1_do_compress (real work)
>        9.14%  page_vma_mapped_walk
>        6.81%  _raw_spin_unlock_irq
>        2.80%  ptep_clear_flush
>        2.34%  __zram_bvec_write
>        2.29%  do_raw_spin_lock
>        1.84%  lru_gen_look_around
>        1.78%  memmove
>        1.74%  obj_malloc
>        1.50%  free_unref_page_list
>
>   Configurations:
>     no change
>
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> Acked-by: Brian Geffon <bgeffon@google.com>
> Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
> Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
> Acked-by: Steven Barrett <steven@liquorix.net>
> Acked-by: Suleiman Souhlal <suleiman@google.com>
> Tested-by: Daniel Byrne <djbyrne@mtu.edu>
> Tested-by: Donald Carr <d@chaos-reins.com>
> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
> Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
> Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
> Tested-by: Sofia Trinh <sofia.trinh@edi.works>
> Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> ---
>  include/linux/memcontrol.h |  31 ++++++++
>  include/linux/mm.h         |   5 ++
>  include/linux/mmzone.h     |   6 ++
>  mm/internal.h              |   1 +
>  mm/memcontrol.c            |   1 +
>  mm/rmap.c                  |   7 ++
>  mm/swap.c                  |   4 +-
>  mm/vmscan.c                | 157 +++++++++++++++++++++++++++++++++++++
>  8 files changed, 210 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 89b14729d59f..2bfdcc77648a 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -438,6 +438,7 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
>   * - LRU isolation
>   * - lock_page_memcg()
>   * - exclusive reference
> + * - mem_cgroup_trylock_pages()
>   *
>   * For a kmem folio a caller should hold an rcu read lock to protect memcg
>   * associated with a kmem folio from being released.
> @@ -499,6 +500,7 @@ static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio)
>   * - LRU isolation
>   * - lock_page_memcg()
>   * - exclusive reference
> + * - mem_cgroup_trylock_pages()
>   *
>   * For a kmem page a caller should hold an rcu read lock to protect memcg
>   * associated with a kmem page from being released.
> @@ -948,6 +950,23 @@ void unlock_page_memcg(struct page *page);
>
>  void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val);
>
> +/* try to stablize folio_memcg() for all the pages in a memcg */
> +static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
> +{
> +       rcu_read_lock();
> +
> +       if (mem_cgroup_disabled() || !atomic_read(&memcg->moving_account))
> +               return true;
> +
> +       rcu_read_unlock();
> +       return false;
> +}
> +
> +static inline void mem_cgroup_unlock_pages(void)
> +{
> +       rcu_read_unlock();
> +}
> +
>  /* idx can be of type enum memcg_stat_item or node_stat_item */
>  static inline void mod_memcg_state(struct mem_cgroup *memcg,
>                                    int idx, int val)
> @@ -1386,6 +1405,18 @@ static inline void folio_memcg_unlock(struct folio *folio)
>  {
>  }
>
> +static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
> +{
> +       /* to match folio_memcg_rcu() */
> +       rcu_read_lock();
> +       return true;
> +}
> +
> +static inline void mem_cgroup_unlock_pages(void)
> +{
> +       rcu_read_unlock();
> +}
> +
>  static inline void mem_cgroup_handle_over_high(void)
>  {
>  }
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 894c289c2c06..4e8ab4ad4473 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1523,6 +1523,11 @@ static inline unsigned long folio_pfn(struct folio *folio)
>         return page_to_pfn(&folio->page);
>  }
>
> +static inline struct folio *pfn_folio(unsigned long pfn)
> +{
> +       return page_folio(pfn_to_page(pfn));
> +}
> +
>  static inline atomic_t *folio_pincount_ptr(struct folio *folio)
>  {
>         return &folio_page(folio, 1)->compound_pincount;
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 2d023d243e73..f0b980362186 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -374,6 +374,7 @@ enum lruvec_flags {
>  #ifndef __GENERATING_BOUNDS_H
>
>  struct lruvec;
> +struct page_vma_mapped_walk;
>
>  #define LRU_GEN_MASK           ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
>  #define LRU_REFS_MASK          ((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)
> @@ -429,6 +430,7 @@ struct lru_gen_struct {
>  };
>
>  void lru_gen_init_lruvec(struct lruvec *lruvec);
> +void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
>
>  #ifdef CONFIG_MEMCG
>  void lru_gen_init_memcg(struct mem_cgroup *memcg);
> @@ -441,6 +443,10 @@ static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
>  {
>  }
>
> +static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
> +{
> +}
> +
>  #ifdef CONFIG_MEMCG
>  static inline void lru_gen_init_memcg(struct mem_cgroup *memcg)
>  {
> diff --git a/mm/internal.h b/mm/internal.h
> index cf16280ce132..59d2422b647d 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -68,6 +68,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf);
>  void folio_rotate_reclaimable(struct folio *folio);
>  bool __folio_end_writeback(struct folio *folio);
>  void deactivate_file_folio(struct folio *folio);
> +void folio_activate(struct folio *folio);
>
>  void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
>                 unsigned long floor, unsigned long ceiling);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2ee074f80e72..98aa720ac639 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2769,6 +2769,7 @@ static void commit_charge(struct folio *folio, struct mem_cgroup *memcg)
>          * - LRU isolation
>          * - lock_page_memcg()
>          * - exclusive reference
> +        * - mem_cgroup_trylock_pages()
>          */
>         folio->memcg_data = (unsigned long)memcg;
>  }
> diff --git a/mm/rmap.c b/mm/rmap.c
> index fedb82371efe..7cb7ef29088a 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -73,6 +73,7 @@
>  #include <linux/page_idle.h>
>  #include <linux/memremap.h>
>  #include <linux/userfaultfd_k.h>
> +#include <linux/mm_inline.h>
>
>  #include <asm/tlbflush.h>
>
> @@ -821,6 +822,12 @@ static bool folio_referenced_one(struct folio *folio,
>                 }
>
>                 if (pvmw.pte) {
> +                       if (lru_gen_enabled() && pte_young(*pvmw.pte) &&
> +                           !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) {
> +                               lru_gen_look_around(&pvmw);
> +                               referenced++;
> +                       }
> +
>                         if (ptep_clear_flush_young_notify(vma, address,

Hello, Yu.
look_around() calls ptep_test_and_clear_young(pvmw->vma, addr, pte + i)
only, without flush and notify. For flush, there is a TLB operation on arm64:
static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
                                         unsigned long address, pte_t *ptep)
{
        int young = ptep_test_and_clear_young(vma, address, ptep);

        if (young) {
                /*
                 * We can elide the trailing DSB here since the worst that can
                 * happen is that a CPU continues to use the young entry in its
                 * TLB and we mistakenly reclaim the associated page. The
                 * window for such an event is bounded by the next
                 * context-switch, which provides a DSB to complete the TLB
                 * invalidation.
                 */
                flush_tlb_page_nosync(vma, address);
        }

        return young;
}

Does it mean the current kernel is over-cautious? Is it
safe to call ptep_test_and_clear_young() only?

Btw, lru_gen_look_around() has already included 'address'; are we doing
the PTE check for 'address' twice here?


>                                                 pvmw.pte)) {
>                                 /*
> diff --git a/mm/swap.c b/mm/swap.c
> index a99d22308f28..0aa1d0b33d42 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -342,7 +342,7 @@ static bool need_activate_page_drain(int cpu)
>         return pagevec_count(&per_cpu(lru_pvecs.activate_page, cpu)) != 0;
>  }
>
> -static void folio_activate(struct folio *folio)
> +void folio_activate(struct folio *folio)
>  {
>         if (folio_test_lru(folio) && !folio_test_active(folio) &&
>             !folio_test_unevictable(folio)) {
> @@ -362,7 +362,7 @@ static inline void activate_page_drain(int cpu)
>  {
>  }
>
> -static void folio_activate(struct folio *folio)
> +void folio_activate(struct folio *folio)
>  {
>         struct lruvec *lruvec;
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 891f0ab69b3a..cf89a28c3b0e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1554,6 +1554,11 @@ static unsigned int shrink_page_list(struct list_head *page_list,
>                 if (!sc->may_unmap && page_mapped(page))
>                         goto keep_locked;
>
> +               /* folio_update_gen() tried to promote this page? */
> +               if (lru_gen_enabled() && !ignore_references &&
> +                   page_mapped(page) && PageReferenced(page))
> +                       goto keep_locked;
> +
>                 may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
>                         (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
>
> @@ -3137,6 +3142,28 @@ static bool positive_ctrl_err(struct ctrl_pos *sp, struct ctrl_pos *pv)
>   *                          the aging
>   ******************************************************************************/
>
> +static int folio_update_gen(struct folio *folio, int gen)
> +{
> +       unsigned long new_flags, old_flags = READ_ONCE(folio->flags);
> +
> +       VM_WARN_ON_ONCE(gen >= MAX_NR_GENS);
> +       VM_WARN_ON_ONCE(!rcu_read_lock_held());
> +
> +       do {
> +               /* lru_gen_del_folio() has isolated this page? */
> +               if (!(old_flags & LRU_GEN_MASK)) {
> +                       /* for shrink_page_list() */
> +                       new_flags = old_flags | BIT(PG_referenced);
> +                       continue;
> +               }
> +
> +               new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_MASK | LRU_REFS_FLAGS);
> +               new_flags |= (gen + 1UL) << LRU_GEN_PGOFF;
> +       } while (!try_cmpxchg(&folio->flags, &old_flags, new_flags));
> +
> +       return ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> +}
> +
>  static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
>  {
>         int type = folio_is_file_lru(folio);
> @@ -3147,6 +3174,11 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai
>         VM_WARN_ON_ONCE_FOLIO(!(old_flags & LRU_GEN_MASK), folio);
>
>         do {
> +               new_gen = ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> +               /* folio_update_gen() has promoted this page? */
> +               if (new_gen >= 0 && new_gen != old_gen)
> +                       return new_gen;
> +
>                 new_gen = (old_gen + 1) % MAX_NR_GENS;
>
>                 new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_MASK | LRU_REFS_FLAGS);
> @@ -3365,6 +3397,125 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
>         } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
>  }
>
> +/*
> + * This function exploits spatial locality when shrink_page_list() walks the
> + * rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages.
> + */
> +void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
> +{
> +       int i;
> +       pte_t *pte;
> +       unsigned long start;
> +       unsigned long end;
> +       unsigned long addr;
> +       unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)] = {};
> +       struct folio *folio = pfn_folio(pvmw->pfn);
> +       struct mem_cgroup *memcg = folio_memcg(folio);
> +       struct pglist_data *pgdat = folio_pgdat(folio);
> +       struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
> +       DEFINE_MAX_SEQ(lruvec);
> +       int old_gen, new_gen = lru_gen_from_seq(max_seq);
> +
> +       lockdep_assert_held(pvmw->ptl);
> +       VM_WARN_ON_ONCE_FOLIO(folio_test_lru(folio), folio);
> +
> +       if (spin_is_contended(pvmw->ptl))
> +               return;
> +
> +       start = max(pvmw->address & PMD_MASK, pvmw->vma->vm_start);
> +       end = pmd_addr_end(pvmw->address, pvmw->vma->vm_end);
> +
> +       if (end - start > MIN_LRU_BATCH * PAGE_SIZE) {
> +               if (pvmw->address - start < MIN_LRU_BATCH * PAGE_SIZE / 2)
> +                       end = start + MIN_LRU_BATCH * PAGE_SIZE;
> +               else if (end - pvmw->address < MIN_LRU_BATCH * PAGE_SIZE / 2)
> +                       start = end - MIN_LRU_BATCH * PAGE_SIZE;
> +               else {
> +                       start = pvmw->address - MIN_LRU_BATCH * PAGE_SIZE / 2;
> +                       end = pvmw->address + MIN_LRU_BATCH * PAGE_SIZE / 2;
> +               }
> +       }
> +
> +       pte = pvmw->pte - (pvmw->address - start) / PAGE_SIZE;
> +
> +       rcu_read_lock();
> +       arch_enter_lazy_mmu_mode();
> +
> +       for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) {
> +               unsigned long pfn = pte_pfn(pte[i]);
> +
> +               VM_WARN_ON_ONCE(addr < pvmw->vma->vm_start || addr >= pvmw->vma->vm_end);
> +
> +               if (!pte_present(pte[i]) || is_zero_pfn(pfn))
> +                       continue;
> +
> +               if (WARN_ON_ONCE(pte_devmap(pte[i]) || pte_special(pte[i])))
> +                       continue;
> +
> +               if (!pte_young(pte[i]))
> +                       continue;
> +
> +               VM_WARN_ON_ONCE(!pfn_valid(pfn));
> +               if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
> +                       continue;
> +
> +               folio = pfn_folio(pfn);
> +               if (folio_nid(folio) != pgdat->node_id)
> +                       continue;
> +
> +               if (folio_memcg_rcu(folio) != memcg)
> +                       continue;
> +
> +               if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
> +                       continue;
> +
> +               if (pte_dirty(pte[i]) && !folio_test_dirty(folio) &&
> +                   !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
> +                     !folio_test_swapcache(folio)))
> +                       folio_mark_dirty(folio);
> +
> +               old_gen = folio_lru_gen(folio);
> +               if (old_gen < 0)
> +                       folio_set_referenced(folio);
> +               else if (old_gen != new_gen)
> +                       __set_bit(i, bitmap);
> +       }
> +
> +       arch_leave_lazy_mmu_mode();
> +       rcu_read_unlock();
> +
> +       if (bitmap_weight(bitmap, MIN_LRU_BATCH) < PAGEVEC_SIZE) {
> +               for_each_set_bit(i, bitmap, MIN_LRU_BATCH) {
> +                       folio = pfn_folio(pte_pfn(pte[i]));
> +                       folio_activate(folio);
> +               }
> +               return;
> +       }
> +
> +       /* folio_update_gen() requires stable folio_memcg() */
> +       if (!mem_cgroup_trylock_pages(memcg))
> +               return;
> +
> +       spin_lock_irq(&lruvec->lru_lock);
> +       new_gen = lru_gen_from_seq(lruvec->lrugen.max_seq);
> +
> +       for_each_set_bit(i, bitmap, MIN_LRU_BATCH) {
> +               folio = pfn_folio(pte_pfn(pte[i]));
> +               if (folio_memcg_rcu(folio) != memcg)
> +                       continue;
> +
> +               old_gen = folio_update_gen(folio, new_gen);
> +               if (old_gen < 0 || old_gen == new_gen)
> +                       continue;
> +
> +               lru_gen_update_size(lruvec, folio, old_gen, new_gen);
> +       }
> +
> +       spin_unlock_irq(&lruvec->lru_lock);
> +
> +       mem_cgroup_unlock_pages();
> +}
> +
>  /******************************************************************************
>   *                          the eviction
>   ******************************************************************************/
> @@ -3401,6 +3552,12 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx)
>                 return true;
>         }
>
> +       /* promoted */
> +       if (gen != lru_gen_from_seq(lrugen->min_seq[type])) {
> +               list_move(&folio->lru, &lrugen->lists[gen][type][zone]);
> +               return true;
> +       }
> +
>         /* protected */
>         if (tier > tier_idx) {
>                 int hist = lru_hist_from_seq(lrugen->min_seq[type]);
> --
> 2.36.0.550.gb090851708-goog
>

Thanks
Barry
Barry Song June 6, 2022, 10:37 p.m. UTC | #2
On Tue, Jun 7, 2022 at 10:21 PM Will Deacon <will@kernel.org> wrote:
>
> On Tue, Jun 07, 2022 at 07:37:10PM +1200, Barry Song wrote:
> > On Mon, Jun 6, 2022 at 9:25 PM Barry Song <21cnbao@gmail.com> wrote:
> > > On Wed, May 18, 2022 at 4:49 PM Yu Zhao <yuzhao@google.com> wrote:
> > > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > > index fedb82371efe..7cb7ef29088a 100644
> > > > --- a/mm/rmap.c
> > > > +++ b/mm/rmap.c
> > > > @@ -73,6 +73,7 @@
> > > >  #include <linux/page_idle.h>
> > > >  #include <linux/memremap.h>
> > > >  #include <linux/userfaultfd_k.h>
> > > > +#include <linux/mm_inline.h>
> > > >
> > > >  #include <asm/tlbflush.h>
> > > >
> > > > @@ -821,6 +822,12 @@ static bool folio_referenced_one(struct folio *folio,
> > > >                 }
> > > >
> > > >                 if (pvmw.pte) {
> > > > +                       if (lru_gen_enabled() && pte_young(*pvmw.pte) &&
> > > > +                           !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) {
> > > > +                               lru_gen_look_around(&pvmw);
> > > > +                               referenced++;
> > > > +                       }
> > > > +
> > > >                         if (ptep_clear_flush_young_notify(vma, address,
> > >
> > > Hello, Yu.
> > > look_around() is calling ptep_test_and_clear_young(pvmw->vma, addr, pte + i)
> > > only without flush and notify. for flush, there is a tlb operation for arm64:
> > > static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> > >                                          unsigned long address, pte_t *ptep)
> > > {
> > >         int young = ptep_test_and_clear_young(vma, address, ptep);
> > >
> > >         if (young) {
> > >                 /*
> > >                  * We can elide the trailing DSB here since the worst that can
> > >                  * happen is that a CPU continues to use the young entry in its
> > >                  * TLB and we mistakenly reclaim the associated page. The
> > >                  * window for such an event is bounded by the next
> > >                  * context-switch, which provides a DSB to complete the TLB
> > >                  * invalidation.
> > >                  */
> > >                 flush_tlb_page_nosync(vma, address);
> > >         }
> > >
> > >         return young;
> > > }
> > >
> > > Does it mean the current kernel is over cautious?  is it
> > > safe to call ptep_test_and_clear_young() only?
> >
> > I can't really explain why we are getting a random app/java vm crash in monkey
> > test by using ptep_test_and_clear_young() only in lru_gen_look_around() on an
> > armv8-a machine without hardware PTE young support.
> >
> > Moving to  ptep_clear_flush_young() in look_around can make the random
> > hang disappear according to zhanyuan(Cc-ed).
> >
> > On x86, ptep_clear_flush_young() is exactly ptep_test_and_clear_young()
> > after
> >  'commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case clear
> > the accessed bit instead of flushing the TLB")'
> >
> > But on arm64, they are different. according to Will's comments in this
> > thread which
> > tried to make arm64 same with x86,
> > https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1793881.html
> >
> > "
> > This is blindly copied from x86 and isn't true for us: we don't invalidate
> > the TLB on context switch. That means our window for keeping the stale
> > entries around is potentially much bigger and might not be a great idea.
> >
> > If we roll a TLB invalidation routine without the trailing DSB, what sort of
> > performance does that get you?
> > "
> > We shouldn't think ptep_clear_flush_young() is safe enough in LRU to
> > clear PTE young? Any comments from Will?
>
> Given that this issue is specific to the multi-gen LRU work, I think Yu is
> the best person to comment. However, looking quickly at your analysis above,
> I wonder if the code is relying on this sequence:
>
>
>         ptep_test_and_clear_young(vma, address, ptep);
>         ptep_clear_flush_young(vma, address, ptep);
>
>
> to invalidate the TLB. On arm64, that won't be the case, as the invalidation
> in ptep_clear_flush_young() is predicated on the pte being young (and this
> patches the generic implementation in mm/pgtable-generic.c. In fact, that
> second function call is always going to be a no-op unless the pte became
> young again in the middle.

Hi Will,
Thanks for your reply; sorry for failing to make my question clear.
My question is as below:
right now lru_gen_look_around() uses ptep_test_and_clear_young()
only, without flush, to clear the PTEs of a couple of pages, including the specific
address:
void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
{
       ...

       for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) {
               ...

               if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
                       continue;

               ...
}

I wonder if it is safe on arm64. Do we need to move to ptep_clear_flush_young()
in the loop?

>
> Will

Thanks
Barry
Barry Song June 7, 2022, 7:37 a.m. UTC | #3
On Mon, Jun 6, 2022 at 9:25 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, May 18, 2022 at 4:49 PM Yu Zhao <yuzhao@google.com> wrote:

...

> > @@ -821,6 +822,12 @@ static bool folio_referenced_one(struct folio *folio,
> >                 }
> >
> >                 if (pvmw.pte) {
> > +                       if (lru_gen_enabled() && pte_young(*pvmw.pte) &&
> > +                           !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) {
> > +                               lru_gen_look_around(&pvmw);
> > +                               referenced++;
> > +                       }
> > +
> >                         if (ptep_clear_flush_young_notify(vma, address,
>
> Hello, Yu.
> look_around() is calling ptep_test_and_clear_young(pvmw->vma, addr, pte + i)
> only without flush and notify. for flush, there is a tlb operation for arm64:
> static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>                                          unsigned long address, pte_t *ptep)
> {
>         int young = ptep_test_and_clear_young(vma, address, ptep);
>
>         if (young) {
>                 /*
>                  * We can elide the trailing DSB here since the worst that can
>                  * happen is that a CPU continues to use the young entry in its
>                  * TLB and we mistakenly reclaim the associated page. The
>                  * window for such an event is bounded by the next
>                  * context-switch, which provides a DSB to complete the TLB
>                  * invalidation.
>                  */
>                 flush_tlb_page_nosync(vma, address);
>         }
>
>         return young;
> }
>
> Does it mean the current kernel is over cautious?  is it
> safe to call ptep_test_and_clear_young() only?

I can't really explain why we are getting random app/Java VM crashes in monkey
testing when using ptep_test_and_clear_young() only in lru_gen_look_around() on an
armv8-a machine without hardware PTE young support.

Moving to ptep_clear_flush_young() in look_around makes the random
hang disappear, according to zhanyuan (Cc-ed).

On x86, ptep_clear_flush_young() is exactly ptep_test_and_clear_young()
after
 'commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case clear
the accessed bit instead of flushing the TLB")'

But on arm64 they are different, according to Will's comments in this
thread, which
tried to make arm64 the same as x86:
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1793881.html

"
This is blindly copied from x86 and isn't true for us: we don't invalidate
the TLB on context switch. That means our window for keeping the stale
entries around is potentially much bigger and might not be a great idea.

If we roll a TLB invalidation routine without the trailing DSB, what sort of
performance does that get you?
"
We shouldn't think ptep_clear_flush_young() is safe enough in LRU to
clear PTE young? Any comments from Will?

>
> btw, lru_gen_look_around() has already included 'address', are we doing
> pte check for 'address' twice here?
>

Thanks
Barry
Will Deacon June 7, 2022, 10:21 a.m. UTC | #4
On Tue, Jun 07, 2022 at 07:37:10PM +1200, Barry Song wrote:
> On Mon, Jun 6, 2022 at 9:25 PM Barry Song <21cnbao@gmail.com> wrote:
> > On Wed, May 18, 2022 at 4:49 PM Yu Zhao <yuzhao@google.com> wrote:
> > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > index fedb82371efe..7cb7ef29088a 100644
> > > --- a/mm/rmap.c
> > > +++ b/mm/rmap.c
> > > @@ -73,6 +73,7 @@
> > >  #include <linux/page_idle.h>
> > >  #include <linux/memremap.h>
> > >  #include <linux/userfaultfd_k.h>
> > > +#include <linux/mm_inline.h>
> > >
> > >  #include <asm/tlbflush.h>
> > >
> > > @@ -821,6 +822,12 @@ static bool folio_referenced_one(struct folio *folio,
> > >                 }
> > >
> > >                 if (pvmw.pte) {
> > > +                       if (lru_gen_enabled() && pte_young(*pvmw.pte) &&
> > > +                           !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) {
> > > +                               lru_gen_look_around(&pvmw);
> > > +                               referenced++;
> > > +                       }
> > > +
> > >                         if (ptep_clear_flush_young_notify(vma, address,
> >
> > Hello, Yu.
> > look_around() is calling ptep_test_and_clear_young(pvmw->vma, addr, pte + i)
> > only without flush and notify. for flush, there is a tlb operation for arm64:
> > static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> >                                          unsigned long address, pte_t *ptep)
> > {
> >         int young = ptep_test_and_clear_young(vma, address, ptep);
> >
> >         if (young) {
> >                 /*
> >                  * We can elide the trailing DSB here since the worst that can
> >                  * happen is that a CPU continues to use the young entry in its
> >                  * TLB and we mistakenly reclaim the associated page. The
> >                  * window for such an event is bounded by the next
> >                  * context-switch, which provides a DSB to complete the TLB
> >                  * invalidation.
> >                  */
> >                 flush_tlb_page_nosync(vma, address);
> >         }
> >
> >         return young;
> > }
> >
> > Does it mean the current kernel is over cautious?  is it
> > safe to call ptep_test_and_clear_young() only?
> 
> I can't really explain why we are getting a random app/java vm crash in monkey
> test by using ptep_test_and_clear_young() only in lru_gen_look_around() on an
> armv8-a machine without hardware PTE young support.
> 
> Moving to  ptep_clear_flush_young() in look_around can make the random
> hang disappear according to zhanyuan(Cc-ed).
> 
> On x86, ptep_clear_flush_young() is exactly ptep_test_and_clear_young()
> after
>  'commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case clear
> the accessed bit instead of flushing the TLB")'
> 
> But on arm64, they are different. according to Will's comments in this
> thread which
> tried to make arm64 same with x86,
> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1793881.html
> 
> "
> This is blindly copied from x86 and isn't true for us: we don't invalidate
> the TLB on context switch. That means our window for keeping the stale
> entries around is potentially much bigger and might not be a great idea.
> 
> If we roll a TLB invalidation routine without the trailing DSB, what sort of
> performance does that get you?
> "
> We shouldn't think ptep_clear_flush_young() is safe enough in LRU to
> clear PTE young? Any comments from Will?

Given that this issue is specific to the multi-gen LRU work, I think Yu is
the best person to comment. However, looking quickly at your analysis above,
I wonder if the code is relying on this sequence:


	ptep_test_and_clear_young(vma, address, ptep);
	ptep_clear_flush_young(vma, address, ptep);


to invalidate the TLB. On arm64, that won't be the case, as the invalidation
in ptep_clear_flush_young() is predicated on the pte being young (and this
matches the generic implementation in mm/pgtable-generic.c). In fact, that
second function call is always going to be a no-op unless the pte became
young again in the middle.
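
For reference, the generic version (roughly what mm/pgtable-generic.c
provides when an architecture doesn't override it; quoted from memory, so
treat it as a sketch) only flushes when the PTE was found young:

int ptep_clear_flush_young(struct vm_area_struct *vma,
			   unsigned long address, pte_t *ptep)
{
	int young;

	/* clear the accessed bit first... */
	young = ptep_test_and_clear_young(vma, address, ptep);
	/* ...and only pay for a TLB flush if it was actually set */
	if (young)
		flush_tlb_page(vma, address);

	return young;
}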

Will
Will Deacon June 7, 2022, 10:43 a.m. UTC | #5
On Tue, Jun 07, 2022 at 10:37:46AM +1200, Barry Song wrote:
> On Tue, Jun 7, 2022 at 10:21 PM Will Deacon <will@kernel.org> wrote:
> > On Tue, Jun 07, 2022 at 07:37:10PM +1200, Barry Song wrote:
> > > I can't really explain why we are getting a random app/java vm crash in monkey
> > > test by using ptep_test_and_clear_young() only in lru_gen_look_around() on an
> > > armv8-a machine without hardware PTE young support.
> > >
> > > Moving to  ptep_clear_flush_young() in look_around can make the random
> > > hang disappear according to zhanyuan(Cc-ed).
> > >
> > > On x86, ptep_clear_flush_young() is exactly ptep_test_and_clear_young()
> > > after
> > >  'commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case clear
> > > the accessed bit instead of flushing the TLB")'
> > >
> > > But on arm64, they are different. according to Will's comments in this
> > > thread which
> > > tried to make arm64 same with x86,
> > > https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1793881.html
> > >
> > > "
> > > This is blindly copied from x86 and isn't true for us: we don't invalidate
> > > the TLB on context switch. That means our window for keeping the stale
> > > entries around is potentially much bigger and might not be a great idea.
> > >
> > > If we roll a TLB invalidation routine without the trailing DSB, what sort of
> > > performance does that get you?
> > > "
> > > We shouldn't think ptep_clear_flush_young() is safe enough in LRU to
> > > clear PTE young? Any comments from Will?
> >
> > Given that this issue is specific to the multi-gen LRU work, I think Yu is
> > the best person to comment. However, looking quickly at your analysis above,
> > I wonder if the code is relying on this sequence:
> >
> >
> >         ptep_test_and_clear_young(vma, address, ptep);
> >         ptep_clear_flush_young(vma, address, ptep);
> >
> >
> > to invalidate the TLB. On arm64, that won't be the case, as the invalidation
> > in ptep_clear_flush_young() is predicated on the pte being young (and this
> > patches the generic implementation in mm/pgtable-generic.c. In fact, that
> > second function call is always going to be a no-op unless the pte became
> > young again in the middle.
> 
> thanks for your reply, sorry for failing to let you understand my question.
> my question is actually as below,
> right now  lru_gen_look_around() is using ptep_test_and_clear_young()
> only without flush to clear pte for a couple of pages including the specific
> address:
> void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
> {
>        ...
> 
>        for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) {
>                ...
> 
>                if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
>                        continue;
> 
>                ...
> }
> 
> I wonder if it is safe to arm64. Do we need to move to ptep_clear_flush_young()
> in the loop?

I don't know what this code is doing, so Yu is the best person to answer
that. There's nothing inherently dangerous about eliding the TLB
maintenance; it really depends on the guarantees needed by the caller.

However, the snippet you posted from folio_referenced_one():

 |                  if (pvmw.pte) {
 |  +                       if (lru_gen_enabled() && pte_young(*pvmw.pte) &&
 |  +                           !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) {
 |  +                               lru_gen_look_around(&pvmw);
 |  +                               referenced++;
 |  +                       }
 |  +
 |                          if (ptep_clear_flush_young_notify(vma, address,


Does seem to call lru_gen_look_around() *and*
ptep_clear_flush_young_notify(), which is what prompted my question as it
looks pretty suspicious to me.

Will
Yu Zhao June 7, 2022, 6:58 p.m. UTC | #6
On Mon, Jun 6, 2022 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, May 18, 2022 at 4:49 PM Yu Zhao <yuzhao@google.com> wrote:

...

> > @@ -821,6 +822,12 @@ static bool folio_referenced_one(struct folio *folio,
> >                 }
> >
> >                 if (pvmw.pte) {
> > +                       if (lru_gen_enabled() && pte_young(*pvmw.pte) &&
> > +                           !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) {
> > +                               lru_gen_look_around(&pvmw);
> > +                               referenced++;
> > +                       }
> > +
> >                         if (ptep_clear_flush_young_notify(vma, address,
>
> Hello, Yu.
> look_around() is calling ptep_test_and_clear_young(pvmw->vma, addr, pte + i)
> only without flush and notify. for flush, there is a tlb operation for arm64:
> static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>                                          unsigned long address, pte_t *ptep)
> {
>         int young = ptep_test_and_clear_young(vma, address, ptep);
>
>         if (young) {
>                 /*
>                  * We can elide the trailing DSB here since the worst that can
>                  * happen is that a CPU continues to use the young entry in its
>                  * TLB and we mistakenly reclaim the associated page. The
>                  * window for such an event is bounded by the next
>                  * context-switch, which provides a DSB to complete the TLB
>                  * invalidation.
>                  */
>                 flush_tlb_page_nosync(vma, address);
>         }
>
>         return young;
> }
>
> Does it mean the current kernel is over cautious?

Hi Barry,

This is up to individual archs. For x86, ptep_clear_flush_young() is
ptep_test_and_clear_young(). For arm64, I'd say yes, based on Figure 1
of Navarro, Juan, et al. "Practical, transparent operating system
support for superpages." [1].

int ptep_clear_flush_young(struct vm_area_struct *vma,
                           unsigned long address, pte_t *ptep)
{
        /*
         * On x86 CPUs, clearing the accessed bit without a TLB flush
         * doesn't cause data corruption. [ It could cause incorrect
         * page aging and the (mistaken) reclaim of hot pages, but the
         * chance of that should be relatively low. ]
         *
         * So as a performance optimization don't flush the TLB when
         * clearing the accessed bit, it will eventually be flushed by
         * a context switch or a VM operation anyway. [ In the rare
         * event of it not getting flushed for a long time the delay
         * shouldn't really matter because there's no real memory
         * pressure for swapout to react to. ]
         */
        return ptep_test_and_clear_young(vma, address, ptep);
}

[1] https://www.usenix.org/legacy/events/osdi02/tech/full_papers/navarro/navarro.pdf

> is it
> safe to call ptep_test_and_clear_young() only?

Yes. Though the h/w A-bit is designed to allow OSes to skip TLB
flushes when unmapping, the Linux kernel doesn't do this.

> btw, lru_gen_look_around() has already included 'address', are we doing
> pte check for 'address' twice here?

Yes for the host MMU but not for the KVM MMU. ptep_clear_flush_young_notify() goes
into the MMU notifier. We don't use the _notify variant in
lru_gen_look_around() because GPA space generally exhibits no memory
locality.
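
For context, the _notify variant is essentially the following wrapper
(paraphrased from include/linux/mmu_notifier.h; the exact macro differs
slightly between kernel versions), which is why it also ages the secondary
(e.g. KVM) MMU:

#define ptep_clear_flush_young_notify(__vma, __address, __ptep)	\
({									\
	/* age the primary MMU first */					\
	int __young = ptep_clear_flush_young(__vma, __address, __ptep);\
	/* then ask MMU notifiers (e.g. KVM) to age the same range */	\
	__young |= mmu_notifier_clear_flush_young((__vma)->vm_mm,	\
						  __address,		\
						  (__address) + PAGE_SIZE);\
	__young;							\
})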

Thanks.
Yu Zhao June 7, 2022, 7:07 p.m. UTC | #7
On Tue, Jun 7, 2022 at 1:37 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Mon, Jun 6, 2022 at 9:25 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Wed, May 18, 2022 at 4:49 PM Yu Zhao <yuzhao@google.com> wrote:

...

> I can't really explain why we are getting a random app/java vm crash in monkey
> test by using ptep_test_and_clear_young() only in lru_gen_look_around() on an
> armv8-a machine without hardware PTE young support.
>
> Moving to  ptep_clear_flush_young() in look_around can make the random
> hang disappear according to zhanyuan(Cc-ed).

This sounds too familiar -- let me ask again: was the following commit
included during the test?

  07509e10dcc7 arm64: pgtable: Fix pte_accessible()

If not, it will cause exactly the problem you described. And what
about this one?

  e914d8f00391 mm: fix unexpected zeroed page mapping with zram swap

Missing it also causes userspace memory corruption on Android, i.e.,
random app crashes.

> On x86, ptep_clear_flush_young() is exactly ptep_test_and_clear_young()
> after
>  'commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case clear
> the accessed bit instead of flushing the TLB")'
>
> But on arm64, they are different. according to Will's comments in this
> thread which
> tried to make arm64 same with x86,
> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1793881.html
>
> "
> This is blindly copied from x86 and isn't true for us: we don't invalidate
> the TLB on context switch. That means our window for keeping the stale
> entries around is potentially much bigger and might not be a great idea.
>
> If we roll a TLB invalidation routine without the trailing DSB, what sort of
> performance does that get you?
> "
> We shouldn't think ptep_clear_flush_young() is safe enough in LRU to
> clear PTE young? Any comments from Will?
>
> >
> > btw, lru_gen_look_around() has already included 'address', are we doing
> > pte check for 'address' twice here?

Explained in the previous reply. Hope that clarifies things.
Yu Zhao June 7, 2022, 9:06 p.m. UTC | #8
On Tue, Jun 7, 2022 at 4:44 AM Will Deacon <will@kernel.org> wrote:
>
> On Tue, Jun 07, 2022 at 10:37:46AM +1200, Barry Song wrote:
> > On Tue, Jun 7, 2022 at 10:21 PM Will Deacon <will@kernel.org> wrote:
> > > On Tue, Jun 07, 2022 at 07:37:10PM +1200, Barry Song wrote:
> > > > I can't really explain why we are getting a random app/java vm crash in monkey
> > > > test by using ptep_test_and_clear_young() only in lru_gen_look_around() on an
> > > > armv8-a machine without hardware PTE young support.
> > > >
> > > > Moving to  ptep_clear_flush_young() in look_around can make the random
> > > > hang disappear according to zhanyuan(Cc-ed).
> > > >
> > > > On x86, ptep_clear_flush_young() is exactly ptep_test_and_clear_young()
> > > > after
> > > >  'commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case clear
> > > > the accessed bit instead of flushing the TLB")'
> > > >
> > > > But on arm64, they are different. according to Will's comments in this
> > > > thread which
> > > > tried to make arm64 same with x86,
> > > > https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1793881.html
> > > >
> > > > "
> > > > This is blindly copied from x86 and isn't true for us: we don't invalidate
> > > > the TLB on context switch. That means our window for keeping the stale
> > > > entries around is potentially much bigger and might not be a great idea.
> > > >
> > > > If we roll a TLB invalidation routine without the trailing DSB, what sort of
> > > > performance does that get you?
> > > > "
> > > > We shouldn't think ptep_clear_flush_young() is safe enough in LRU to
> > > > clear PTE young? Any comments from Will?
> > >
> > > Given that this issue is specific to the multi-gen LRU work, I think Yu is
> > > the best person to comment. However, looking quickly at your analysis above,
> > > I wonder if the code is relying on this sequence:
> > >
> > >
> > >         ptep_test_and_clear_young(vma, address, ptep);
> > >         ptep_clear_flush_young(vma, address, ptep);
> > >
> > >
> > > to invalidate the TLB. On arm64, that won't be the case, as the invalidation
> > > in ptep_clear_flush_young() is predicated on the pte being young (and this
> > > patches the generic implementation in mm/pgtable-generic.c. In fact, that
> > > second function call is always going to be a no-op unless the pte became
> > > young again in the middle.
> >
> > thanks for your reply, sorry for failing to let you understand my question.
> > my question is actually as below,
> > right now  lru_gen_look_around() is using ptep_test_and_clear_young()
> > only without flush to clear pte for a couple of pages including the specific
> > address:
> > void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
> > {
> >        ...
> >
> >        for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) {
> >                ...
> >
> >                if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
> >                        continue;
> >
> >                ...
> > }
> >
> > I wonder if it is safe to arm64. Do we need to move to ptep_clear_flush_young()
> > in the loop?
>
> I don't know what this code is doing, so Yu is the best person to answer
> that. There's nothing inherently dangerous about eliding the TLB
> maintenance; it really depends on the guarantees needed by the caller.

Ack.

> However, the snippet you posted from folio_referenced_one():
>
>  |                  if (pvmw.pte) {
>  |  +                       if (lru_gen_enabled() && pte_young(*pvmw.pte) &&
>  |  +                           !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) {
>  |  +                               lru_gen_look_around(&pvmw);
>  |  +                               referenced++;
>  |  +                       }
>  |  +
>  |                          if (ptep_clear_flush_young_notify(vma, address,
>
>
> Does seem to call lru_gen_look_around() *and*
> ptep_clear_flush_young_notify(), which is what prompted my question as it
> looks pretty suspicious to me.

The _notify variant reaches into the MMU notifier --
lru_gen_look_around() doesn't do that because GPA space generally has
no locality. I hope this explains why both.

As to why the code is organized this way -- it depends on the point of
view. Mine is that lru_gen_look_around() is an add-on, since its logic
is independent/separable from ptep_clear_flush_young_notify(). We can
make lru_gen_look_around() include ptep_clear_flush_young_notify(),
but that would make the code functionally intertwined, which is bad
for my taste.
Barry Song June 8, 2022, 12:43 a.m. UTC | #9
On Wed, Jun 8, 2022 at 9:07 AM Yu Zhao <yuzhao@google.com> wrote:
>
> On Tue, Jun 7, 2022 at 4:44 AM Will Deacon <will@kernel.org> wrote:
> >
> > On Tue, Jun 07, 2022 at 10:37:46AM +1200, Barry Song wrote:
> > > On Tue, Jun 7, 2022 at 10:21 PM Will Deacon <will@kernel.org> wrote:
> > > > On Tue, Jun 07, 2022 at 07:37:10PM +1200, Barry Song wrote:
> > > > > I can't really explain why we are getting a random app/java vm crash in monkey
> > > > > test by using ptep_test_and_clear_young() only in lru_gen_look_around() on an
> > > > > armv8-a machine without hardware PTE young support.
> > > > >
> > > > > Moving to  ptep_clear_flush_young() in look_around can make the random
> > > > > hang disappear according to zhanyuan(Cc-ed).
> > > > >
> > > > > On x86, ptep_clear_flush_young() is exactly ptep_test_and_clear_young()
> > > > > after
> > > > >  'commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case clear
> > > > > the accessed bit instead of flushing the TLB")'
> > > > >
> > > > > But on arm64, they are different. according to Will's comments in this
> > > > > thread which
> > > > > tried to make arm64 same with x86,
> > > > > https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1793881.html
> > > > >
> > > > > "
> > > > > This is blindly copied from x86 and isn't true for us: we don't invalidate
> > > > > the TLB on context switch. That means our window for keeping the stale
> > > > > entries around is potentially much bigger and might not be a great idea.
> > > > >
> > > > > If we roll a TLB invalidation routine without the trailing DSB, what sort of
> > > > > performance does that get you?
> > > > > "
> > > > > We shouldn't think ptep_clear_flush_young() is safe enough in LRU to
> > > > > clear PTE young? Any comments from Will?
> > > >
> > > > Given that this issue is specific to the multi-gen LRU work, I think Yu is
> > > > the best person to comment. However, looking quickly at your analysis above,
> > > > I wonder if the code is relying on this sequence:
> > > >
> > > >
> > > >         ptep_test_and_clear_young(vma, address, ptep);
> > > >         ptep_clear_flush_young(vma, address, ptep);
> > > >
> > > >
> > > > to invalidate the TLB. On arm64, that won't be the case, as the invalidation
> > > > in ptep_clear_flush_young() is predicated on the pte being young (and this
> > > > patches the generic implementation in mm/pgtable-generic.c. In fact, that
> > > > second function call is always going to be a no-op unless the pte became
> > > > young again in the middle.
> > >
> > > thanks for your reply, sorry for failing to let you understand my question.
> > > my question is actually as below,
> > > right now  lru_gen_look_around() is using ptep_test_and_clear_young()
> > > only without flush to clear pte for a couple of pages including the specific
> > > address:
> > > void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
> > > {
> > >        ...
> > >
> > >        for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) {
> > >                ...
> > >
> > >                if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
> > >                        continue;
> > >
> > >                ...
> > > }
> > >
> > > I wonder if it is safe to arm64. Do we need to move to ptep_clear_flush_young()
> > > in the loop?
> >
> > I don't know what this code is doing, so Yu is the best person to answer
> > that. There's nothing inherently dangerous about eliding the TLB
> > maintenance; it really depends on the guarantees needed by the caller.
>
> Ack.
>
> > However, the snippet you posted from folio_referenced_one():
> >
> >  |                  if (pvmw.pte) {
> >  |  +                       if (lru_gen_enabled() && pte_young(*pvmw.pte) &&
> >  |  +                           !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) {
> >  |  +                               lru_gen_look_around(&pvmw);
> >  |  +                               referenced++;
> >  |  +                       }
> >  |  +
> >  |                          if (ptep_clear_flush_young_notify(vma, address,
> >
> >
> > Does seem to call lru_gen_look_around() *and*
> > ptep_clear_flush_young_notify(), which is what prompted my question as it
> > looks pretty suspicious to me.
>
> The _notify varint reaches into the MMU notifier --
> lru_gen_look_around() doesn't do that because GPA space generally has
> no locality. I hope this explains why both.
>
> As to why the code is organized this way -- it depends on the point of
> view. Mine is that lru_gen_look_around() is an add-on, since its logic
> is independent/separable from ptep_clear_flush_young_notify(). We can
> make lru_gen_look_around() include ptep_clear_flush_young_notify(),
> but that would make the code functionally interwinted, which is bad
> for my taste.

Given that we used to flush when clearing the PTE young bit in the LRU code,
we are now moving to a no-op for that flush in almost all cases, unless the
address becomes young again exactly between look_around and
ptep_clear_flush_young_notify(). In effect we are dropping the flush. So the
question is: were we overcautious? Do we actually not need the flush at all,
even without MGLRU?
On arm64, without the flush, stale TLB entries might be used for a
relatively long time, as commented in [1]. Does that actually do any harm?
[1]https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1793881.html

Thanks
Barry
Barry Song June 8, 2022, 7:48 a.m. UTC | #10
On Wed, Jun 8, 2022 at 7:07 AM Yu Zhao <yuzhao@google.com> wrote:
>
> On Tue, Jun 7, 2022 at 1:37 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Mon, Jun 6, 2022 at 9:25 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Wed, May 18, 2022 at 4:49 PM Yu Zhao <yuzhao@google.com> wrote:
>
> ...
>
> > I can't really explain why we are getting a random app/java vm crash in monkey
> > test by using ptep_test_and_clear_young() only in lru_gen_look_around() on an
> > armv8-a machine without hardware PTE young support.
> >
> > Moving to  ptep_clear_flush_young() in look_around can make the random
> > hang disappear according to zhanyuan(Cc-ed).
>
> This sounds too familiar -- let me ask again: was the following commit
> included during the test?
>
>   07509e10dcc7 arm64: pgtable: Fix pte_accessible()
>
> If not, it will cause exactly the problem you described. And what
> about this one?
>
>   e914d8f00391 mm: fix unexpected zeroed page mapping with zram swap
>
> Missing it also causes userspace memory corruption on Android, i.e.,
> random app crashes.
>

According to zhanyuan's testing, we can confirm the above two commits
fix the random Android crashes.

Thanks
Barry
Linus Torvalds June 8, 2022, 3:51 p.m. UTC | #11
On Tue, Jun 7, 2022 at 5:43 PM Barry Song <21cnbao@gmail.com> wrote:
>
> Given we used to have a flush for clear pte young in LRU, right now we are
> moving to nop in almost all cases for the flush unless the address becomes
> young exactly after look_around and before ptep_clear_flush_young_notify.
> It means we are actually dropping flush. So the question is,  were we
> overcautious? we actually don't need the flush at all even without mglru?

We stopped flushing the TLB on A bit clears on x86 back in 2014.

See commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case
clear the accessed bit instead of flushing the TLB").

               Linus
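
For reference, after that commit the x86 override essentially reduces to the
following (a simplified sketch; see arch/x86/mm/pgtable.c and the commit
message for the full rationale):

int ptep_clear_flush_young(struct vm_area_struct *vma,
                           unsigned long address, pte_t *ptep)
{
        /*
         * Skipping the TLB flush here is tolerable for reclaim: the A-bit
         * only feeds reclaim heuristics, and the page is flushed properly
         * when it is actually unmapped.
         */
        return ptep_test_and_clear_young(vma, address, ptep);
}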
Barry Song June 8, 2022, 10:45 p.m. UTC | #12
On Thu, Jun 9, 2022 at 3:52 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, Jun 7, 2022 at 5:43 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > Given we used to have a flush for clear pte young in LRU, right now we are
> > moving to nop in almost all cases for the flush unless the address becomes
> > young exactly after look_around and before ptep_clear_flush_young_notify.
> > It means we are actually dropping flush. So the question is,  were we
> > overcautious? we actually don't need the flush at all even without mglru?
>
> We stopped flushing the TLB on A bit clears on x86 back in 2014.
>
> See commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case
> clear the accessed bit instead of flushing the TLB").

This is true for x86, RISC-V, powerpc and s390, but it is not true for
most platforms.

There was an attempt to do the same thing on arm64:
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1793830.html
but arm64 still sent a nosync tlbi and depended on a deferred dsb:
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1794484.html

Plus, the generic code will also do a TLB flush:
int ptep_clear_flush_young(struct vm_area_struct *vma,
                           unsigned long address, pte_t *ptep)
{
        int young;
        young = ptep_test_and_clear_young(vma, address, ptep);
        if (young)
                flush_tlb_page(vma, address);
        return young;
}

We used to use ptep_test_and_clear_young() only in rmap.c for page_referenced()
in 2.6.0:
https://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux-fullhistory.git/tree/mm/rmap.c?h=v2.6.0
int page_referenced(struct page * page)
{
      ...
      if (ptep_test_and_clear_young(p))
      ...
}

but by 2.6.12 it had already become ptep_clear_flush_young() in
page_referenced_one():
https://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux-fullhistory.git/tree/mm/rmap.c?h=v2.6.12

I failed to find the history explaining why 2.6.12 switched to
ptep_clear_flush_young() in the LRU code, but I am still curious how
flushing (or not) affects the LRU on those platforms where
ptep_clear_flush_young() and ptep_test_and_clear_young() differ.

>
>                Linus

Thanks
Barry
Yu Zhao June 16, 2022, 9:55 p.m. UTC | #13
On Wed, Jun 8, 2022 at 4:46 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Thu, Jun 9, 2022 at 3:52 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Tue, Jun 7, 2022 at 5:43 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > Given we used to have a flush for clear pte young in LRU, right now we are
> > > moving to nop in almost all cases for the flush unless the address becomes
> > > young exactly after look_around and before ptep_clear_flush_young_notify.
> > > It means we are actually dropping flush. So the question is,  were we
> > > overcautious? we actually don't need the flush at all even without mglru?
> >
> > We stopped flushing the TLB on A bit clears on x86 back in 2014.
> >
> > See commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case
> > clear the accessed bit instead of flushing the TLB").
>
> This is true for x86, RISC-V, powerpc and S390. but it is not true for
> most platforms.
>
> There was an attempt to do the same thing in arm64:
> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1793830.html
> but arm64 still sent a nosync tlbi and depent on a deferred to dsb :
> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1794484.html

Barry, you've already answered your own question.

Without commit 07509e10dcc7 ("arm64: pgtable: Fix pte_accessible()"):
   #define pte_accessible(mm, pte)        \
  -       (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid_young(pte))
  +       (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid(pte))

You missed all TLB flushes for PTEs that have gone through
ptep_test_and_clear_young() on the reclaim path. But most of the time,
you got away with it, only occasional app crashes:
https://lore.kernel.org/r/CAGsJ_4w6JjuG4rn2P=d974wBOUtXUUnaZKnx+-G6a8_mSROa+Q@mail.gmail.com/
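
For context, this is where pte_accessible() matters on the reclaim path: the
generic ptep_clear_flush() only flushes when pte_accessible() returns true, so
a PTE whose A-bit was cleared earlier by ptep_test_and_clear_young() must still
be reported as accessible, or the unmap-time flush is skipped. A simplified
sketch (based on mm/pgtable-generic.c; details may differ across kernel
versions):

pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
                       pte_t *ptep)
{
        struct mm_struct *mm = vma->vm_mm;
        pte_t pte;

        pte = ptep_get_and_clear(mm, address, ptep);
        /* without the fix, a valid but !young PTE skipped this flush on arm64 */
        if (pte_accessible(mm, pte))
                flush_tlb_page(vma, address);
        return pte;
}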

Why?
Barry Song June 16, 2022, 10:33 p.m. UTC | #14
On Fri, Jun 17, 2022 at 9:56 AM Yu Zhao <yuzhao@google.com> wrote:
>
> On Wed, Jun 8, 2022 at 4:46 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Thu, Jun 9, 2022 at 3:52 AM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > On Tue, Jun 7, 2022 at 5:43 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > Given we used to have a flush for clear pte young in LRU, right now we are
> > > > moving to nop in almost all cases for the flush unless the address becomes
> > > > young exactly after look_around and before ptep_clear_flush_young_notify.
> > > > It means we are actually dropping flush. So the question is,  were we
> > > > overcautious? we actually don't need the flush at all even without mglru?
> > >
> > > We stopped flushing the TLB on A bit clears on x86 back in 2014.
> > >
> > > See commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case
> > > clear the accessed bit instead of flushing the TLB").
> >
> > This is true for x86, RISC-V, powerpc and S390. but it is not true for
> > most platforms.
> >
> > There was an attempt to do the same thing in arm64:
> > https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1793830.html
> > but arm64 still sent a nosync tlbi and depent on a deferred to dsb :
> > https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1794484.html
>
> Barry, you've already answered your own question.
>
> Without commit 07509e10dcc7 arm64: pgtable: Fix pte_accessible():
>    #define pte_accessible(mm, pte)        \
>   -       (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid_young(pte))
>   +       (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid(pte))
>
> You missed all TLB flushes for PTEs that have gone through
> ptep_test_and_clear_young() on the reclaim path. But most of the time,
> you got away with it, only occasional app crashes:
> https://lore.kernel.org/r/CAGsJ_4w6JjuG4rn2P=d974wBOUtXUUnaZKnx+-G6a8_mSROa+Q@mail.gmail.com/
>
> Why?

Yes. On arm64, ptep_test_and_clear_young() without a flush can cause
random app crashes.
ptep_test_and_clear_young() + flush doesn't show this kind of crash, though.
But after applying commit 07509e10dcc7 ("arm64: pgtable: Fix
pte_accessible()"), ptep_test_and_clear_young() without a flush no longer
causes app crashes on arm64.

ptep_test_and_clear_young(), with flush, without commit 07509e10dcc7:   OK
ptep_test_and_clear_young(), without flush, with commit 07509e10dcc7:   OK
ptep_test_and_clear_young(), without flush, without commit 07509e10dcc7:   CRASH

So is it possible that other platforms have problems similar to arm64's
when commit 07509e10dcc7 isn't there? Anyway, we depend on the platforms
that can really use MGLRU to expose this kind of potential bug.

BTW, do you think it is safe to totally remove the flush code even for the
original LRU? I don't see a fundamental difference between MGLRU and the LRU
on this "flush" thing. Since MGLRU doesn't need the flush, why does the LRU
need it? The flush is very expensive; if we do think it is unnecessary, will
we remove it for the original LRU as well?

BTW, look_around is a great idea. Our experiments also show a significant
decrease in the CPU consumption of page_referenced().

Thanks
Barry
Yu Zhao June 16, 2022, 11:29 p.m. UTC | #15
On Thu, Jun 16, 2022 at 4:33 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Fri, Jun 17, 2022 at 9:56 AM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Wed, Jun 8, 2022 at 4:46 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Thu, Jun 9, 2022 at 3:52 AM Linus Torvalds
> > > <torvalds@linux-foundation.org> wrote:
> > > >
> > > > On Tue, Jun 7, 2022 at 5:43 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > Given we used to have a flush for clear pte young in LRU, right now we are
> > > > > moving to nop in almost all cases for the flush unless the address becomes
> > > > > young exactly after look_around and before ptep_clear_flush_young_notify.
> > > > > It means we are actually dropping flush. So the question is,  were we
> > > > > overcautious? we actually don't need the flush at all even without mglru?
> > > >
> > > > We stopped flushing the TLB on A bit clears on x86 back in 2014.
> > > >
> > > > See commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case
> > > > clear the accessed bit instead of flushing the TLB").
> > >
> > > This is true for x86, RISC-V, powerpc and S390. but it is not true for
> > > most platforms.
> > >
> > > There was an attempt to do the same thing in arm64:
> > > https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1793830.html
> > > but arm64 still sent a nosync tlbi and depent on a deferred to dsb :
> > > https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1794484.html
> >
> > Barry, you've already answered your own question.
> >
> > Without commit 07509e10dcc7 arm64: pgtable: Fix pte_accessible():
> >    #define pte_accessible(mm, pte)        \
> >   -       (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid_young(pte))
> >   +       (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid(pte))
> >
> > You missed all TLB flushes for PTEs that have gone through
> > ptep_test_and_clear_young() on the reclaim path. But most of the time,
> > you got away with it, only occasional app crashes:
> > https://lore.kernel.org/r/CAGsJ_4w6JjuG4rn2P=d974wBOUtXUUnaZKnx+-G6a8_mSROa+Q@mail.gmail.com/
> >
> > Why?
>
> Yes. On the arm64 platform, ptep_test_and_clear_young() without flush
> can cause random
> App to crash.
> ptep_test_and_clear_young() + flush won't have this kind of crashes though.
> But after applying commit 07509e10dcc7 arm64: pgtable: Fix
> pte_accessible(), on arm64,
> ptep_test_and_clear_young() without flush won't cause App to crash.
>
> ptep_test_and_clear_young(), with flush, without commit 07509e10dcc7:   OK
> ptep_test_and_clear_young(), without flush, with commit 07509e10dcc7:   OK
> ptep_test_and_clear_young(), without flush, without commit 07509e10dcc7:   CRASH

I agree -- my question was rhetorical :)

I was trying to imply this logic:
1. We cleared the A-bit in PTEs with ptep_test_and_clear_young()
2. We missed the TLB flush for those PTEs on the reclaim path, i.e.,
   case 3 above (cases 1 & 2 guarantee flushes)
3. We saw crashes, but only occasionally

If the TLB had cached those PTEs, we would have seen the crashes more
often, which contradicts our observation. So the conclusion is that the
TLB didn't cache them most of the time, meaning flushing the TLB just
for the sake of the A-bit isn't necessary.

> do you think it is safe to totally remove the flush code even for
> the original
> LRU?

Affirmative, based not only on my words, but also on third parties':
1. Your (indirect) observation
2. Alexander's benchmark:
https://lore.kernel.org/r/BYAPR12MB271295B398729E07F31082A7CFAA0@BYAPR12MB2712.namprd12.prod.outlook.com/
3. The fundamental hardware limitation in terms of the TLB scalability
(Fig. 1): https://www.usenix.org/legacy/events/osdi02/tech/full_papers/navarro/navarro.pdf
Yu Zhao June 17, 2022, 1:42 a.m. UTC | #16
On Thu, Jun 16, 2022 at 5:29 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Thu, Jun 16, 2022 at 4:33 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Fri, Jun 17, 2022 at 9:56 AM Yu Zhao <yuzhao@google.com> wrote:
> > >
> > > On Wed, Jun 8, 2022 at 4:46 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Thu, Jun 9, 2022 at 3:52 AM Linus Torvalds
> > > > <torvalds@linux-foundation.org> wrote:
> > > > >
> > > > > On Tue, Jun 7, 2022 at 5:43 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > >
> > > > > > Given we used to have a flush for clear pte young in LRU, right now we are
> > > > > > moving to nop in almost all cases for the flush unless the address becomes
> > > > > > young exactly after look_around and before ptep_clear_flush_young_notify.
> > > > > > It means we are actually dropping flush. So the question is,  were we
> > > > > > overcautious? we actually don't need the flush at all even without mglru?
> > > > >
> > > > > We stopped flushing the TLB on A bit clears on x86 back in 2014.
> > > > >
> > > > > See commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case
> > > > > clear the accessed bit instead of flushing the TLB").
> > > >
> > > > This is true for x86, RISC-V, powerpc and S390. but it is not true for
> > > > most platforms.
> > > >
> > > > There was an attempt to do the same thing in arm64:
> > > > https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1793830.html
> > > > but arm64 still sent a nosync tlbi and depent on a deferred to dsb :
> > > > https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1794484.html
> > >
> > > Barry, you've already answered your own question.
> > >
> > > Without commit 07509e10dcc7 arm64: pgtable: Fix pte_accessible():
> > >    #define pte_accessible(mm, pte)        \
> > >   -       (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid_young(pte))
> > >   +       (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid(pte))
> > >
> > > You missed all TLB flushes for PTEs that have gone through
> > > ptep_test_and_clear_young() on the reclaim path. But most of the time,
> > > you got away with it, only occasional app crashes:
> > > https://lore.kernel.org/r/CAGsJ_4w6JjuG4rn2P=d974wBOUtXUUnaZKnx+-G6a8_mSROa+Q@mail.gmail.com/
> > >
> > > Why?
> >
> > Yes. On the arm64 platform, ptep_test_and_clear_young() without flush
> > can cause random
> > App to crash.
> > ptep_test_and_clear_young() + flush won't have this kind of crashes though.
> > But after applying commit 07509e10dcc7 arm64: pgtable: Fix
> > pte_accessible(), on arm64,
> > ptep_test_and_clear_young() without flush won't cause App to crash.
> >
> > ptep_test_and_clear_young(), with flush, without commit 07509e10dcc7:   OK
> > ptep_test_and_clear_young(), without flush, with commit 07509e10dcc7:   OK
> > ptep_test_and_clear_young(), without flush, without commit 07509e10dcc7:   CRASH
>
> I agree -- my question was rhetorical :)
>
> I was trying to imply this logic:
> 1. We cleared the A-bit in PTEs with ptep_test_and_clear_young()
> 2. We missed TLB flush for those PTEs on the reclaim path, i.e., case
> 3 (case 1 & 2 guarantee flushes)
> 3. We saw crashes, but only occasionally
>
> Assuming TLB cached those PTEs, we would have seen the crashes more
> often, which contradicts our observation. So the conclusion is TLB
> didn't cache them most of the time, meaning flushing TLB just for the
> sake of the A-bit isn't necessary.
>
> > do you think it is safe to totally remove the flush code even for
> > the original
> > LRU?
>
> Affirmative, based on not only my words, but 3rd parties':
> 1. Your (indirect) observation
> 2. Alexander's benchmark:
> https://lore.kernel.org/r/BYAPR12MB271295B398729E07F31082A7CFAA0@BYAPR12MB2712.namprd12.prod.outlook.com/
> 3. The fundamental hardware limitation in terms of the TLB scalability
> (Fig. 1): https://www.usenix.org/legacy/events/osdi02/tech/full_papers/navarro/navarro.pdf

4. Intel's commit b13b1d2d8692 ("x86/mm: In the PTE swapout page
reclaim case clear the accessed bit instead of flushing the TLB")
Barry Song June 17, 2022, 2:01 a.m. UTC | #17
On Fri, Jun 17, 2022 at 1:43 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Thu, Jun 16, 2022 at 5:29 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Thu, Jun 16, 2022 at 4:33 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Fri, Jun 17, 2022 at 9:56 AM Yu Zhao <yuzhao@google.com> wrote:
> > > >
> > > > On Wed, Jun 8, 2022 at 4:46 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > On Thu, Jun 9, 2022 at 3:52 AM Linus Torvalds
> > > > > <torvalds@linux-foundation.org> wrote:
> > > > > >
> > > > > > On Tue, Jun 7, 2022 at 5:43 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > >
> > > > > > > Given we used to have a flush for clear pte young in LRU, right now we are
> > > > > > > moving to nop in almost all cases for the flush unless the address becomes
> > > > > > > young exactly after look_around and before ptep_clear_flush_young_notify.
> > > > > > > It means we are actually dropping flush. So the question is,  were we
> > > > > > > overcautious? we actually don't need the flush at all even without mglru?
> > > > > >
> > > > > > We stopped flushing the TLB on A bit clears on x86 back in 2014.
> > > > > >
> > > > > > See commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case
> > > > > > clear the accessed bit instead of flushing the TLB").
> > > > >
> > > > > This is true for x86, RISC-V, powerpc and S390. but it is not true for
> > > > > most platforms.
> > > > >
> > > > > There was an attempt to do the same thing in arm64:
> > > > > https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1793830.html
> > > > > but arm64 still sent a nosync tlbi and depent on a deferred to dsb :
> > > > > https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1794484.html
> > > >
> > > > Barry, you've already answered your own question.
> > > >
> > > > Without commit 07509e10dcc7 arm64: pgtable: Fix pte_accessible():
> > > >    #define pte_accessible(mm, pte)        \
> > > >   -       (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid_young(pte))
> > > >   +       (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid(pte))
> > > >
> > > > You missed all TLB flushes for PTEs that have gone through
> > > > ptep_test_and_clear_young() on the reclaim path. But most of the time,
> > > > you got away with it, only occasional app crashes:
> > > > https://lore.kernel.org/r/CAGsJ_4w6JjuG4rn2P=d974wBOUtXUUnaZKnx+-G6a8_mSROa+Q@mail.gmail.com/
> > > >
> > > > Why?
> > >
> > > Yes. On the arm64 platform, ptep_test_and_clear_young() without flush
> > > can cause random
> > > App to crash.
> > > ptep_test_and_clear_young() + flush won't have this kind of crashes though.
> > > But after applying commit 07509e10dcc7 arm64: pgtable: Fix
> > > pte_accessible(), on arm64,
> > > ptep_test_and_clear_young() without flush won't cause App to crash.
> > >
> > > ptep_test_and_clear_young(), with flush, without commit 07509e10dcc7:   OK
> > > ptep_test_and_clear_young(), without flush, with commit 07509e10dcc7:   OK
> > > ptep_test_and_clear_young(), without flush, without commit 07509e10dcc7:   CRASH
> >
> > I agree -- my question was rhetorical :)
> >
> > I was trying to imply this logic:
> > 1. We cleared the A-bit in PTEs with ptep_test_and_clear_young()
> > 2. We missed TLB flush for those PTEs on the reclaim path, i.e., case
> > 3 (case 1 & 2 guarantee flushes)
> > 3. We saw crashes, but only occasionally
> >
> > Assuming TLB cached those PTEs, we would have seen the crashes more
> > often, which contradicts our observation. So the conclusion is TLB
> > didn't cache them most of the time, meaning flushing TLB just for the
> > sake of the A-bit isn't necessary.
> >
> > > do you think it is safe to totally remove the flush code even for
> > > the original
> > > LRU?
> >
> > Affirmative, based on not only my words, but 3rd parties':
> > 1. Your (indirect) observation
> > 2. Alexander's benchmark:
> > https://lore.kernel.org/r/BYAPR12MB271295B398729E07F31082A7CFAA0@BYAPR12MB2712.namprd12.prod.outlook.com/
> > 3. The fundamental hardware limitation in terms of the TLB scalability
> > (Fig. 1): https://www.usenix.org/legacy/events/osdi02/tech/full_papers/navarro/navarro.pdf
>
> 4. Intel's commit b13b1d2d8692 ("x86/mm: In the PTE swapout page
> reclaim case clear the accessed bit instead of flushing the TLB")

Hi Yu,
I am going to send an RFC based on the above discussion.

diff --git a/mm/rmap.c b/mm/rmap.c
index 5bcb334cd6f2..7ce6f0b6c330 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -830,7 +830,7 @@ static bool folio_referenced_one(struct folio *folio,
                }

                if (pvmw.pte) {
-                       if (ptep_clear_flush_young_notify(vma, address,
+                       if (ptep_clear_young_notify(vma, address,
                                                pvmw.pte)) {
                                /*
                                 * Don't treat a reference through
Thanks
Barry
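
For reference, the non-flushing notify variant used here differs from
ptep_clear_flush_young_notify() in that it neither flushes the TLB nor uses the
flushing MMU-notifier hook. A function-style sketch (the real code is a macro
in include/linux/mmu_notifier.h; exact details vary by kernel version):

static inline int sketch_ptep_clear_young_notify(struct vm_area_struct *vma,
                                                 unsigned long address,
                                                 pte_t *ptep)
{
        int young;

        /* clear the A-bit in the primary MMU, without flushing the TLB */
        young = ptep_test_and_clear_young(vma, address, ptep);
        /* clear the "young" state in secondary MMUs, also without flushing */
        young |= mmu_notifier_clear_young(vma->vm_mm, address,
                                          address + PAGE_SIZE);
        return young;
}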
Yu Zhao June 17, 2022, 3:03 a.m. UTC | #18
On Thu, Jun 16, 2022 at 8:01 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Fri, Jun 17, 2022 at 1:43 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Thu, Jun 16, 2022 at 5:29 PM Yu Zhao <yuzhao@google.com> wrote:
> > >
> > > On Thu, Jun 16, 2022 at 4:33 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Fri, Jun 17, 2022 at 9:56 AM Yu Zhao <yuzhao@google.com> wrote:
> > > > >
> > > > > On Wed, Jun 8, 2022 at 4:46 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > >
> > > > > > On Thu, Jun 9, 2022 at 3:52 AM Linus Torvalds
> > > > > > <torvalds@linux-foundation.org> wrote:
> > > > > > >
> > > > > > > On Tue, Jun 7, 2022 at 5:43 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Given we used to have a flush for clear pte young in LRU, right now we are
> > > > > > > > moving to nop in almost all cases for the flush unless the address becomes
> > > > > > > > young exactly after look_around and before ptep_clear_flush_young_notify.
> > > > > > > > It means we are actually dropping flush. So the question is,  were we
> > > > > > > > overcautious? we actually don't need the flush at all even without mglru?
> > > > > > >
> > > > > > > We stopped flushing the TLB on A bit clears on x86 back in 2014.
> > > > > > >
> > > > > > > See commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case
> > > > > > > clear the accessed bit instead of flushing the TLB").
> > > > > >
> > > > > > This is true for x86, RISC-V, powerpc and S390. but it is not true for
> > > > > > most platforms.
> > > > > >
> > > > > > There was an attempt to do the same thing in arm64:
> > > > > > https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1793830.html
> > > > > > but arm64 still sent a nosync tlbi and depent on a deferred to dsb :
> > > > > > https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1794484.html
> > > > >
> > > > > Barry, you've already answered your own question.
> > > > >
> > > > > Without commit 07509e10dcc7 arm64: pgtable: Fix pte_accessible():
> > > > >    #define pte_accessible(mm, pte)        \
> > > > >   -       (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid_young(pte))
> > > > >   +       (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid(pte))
> > > > >
> > > > > You missed all TLB flushes for PTEs that have gone through
> > > > > ptep_test_and_clear_young() on the reclaim path. But most of the time,
> > > > > you got away with it, only occasional app crashes:
> > > > > https://lore.kernel.org/r/CAGsJ_4w6JjuG4rn2P=d974wBOUtXUUnaZKnx+-G6a8_mSROa+Q@mail.gmail.com/
> > > > >
> > > > > Why?
> > > >
> > > > Yes. On the arm64 platform, ptep_test_and_clear_young() without flush
> > > > can cause random
> > > > App to crash.
> > > > ptep_test_and_clear_young() + flush won't have this kind of crashes though.
> > > > But after applying commit 07509e10dcc7 arm64: pgtable: Fix
> > > > pte_accessible(), on arm64,
> > > > ptep_test_and_clear_young() without flush won't cause App to crash.
> > > >
> > > > ptep_test_and_clear_young(), with flush, without commit 07509e10dcc7:   OK
> > > > ptep_test_and_clear_young(), without flush, with commit 07509e10dcc7:   OK
> > > > ptep_test_and_clear_young(), without flush, without commit 07509e10dcc7:   CRASH
> > >
> > > I agree -- my question was rhetorical :)
> > >
> > > I was trying to imply this logic:
> > > 1. We cleared the A-bit in PTEs with ptep_test_and_clear_young()
> > > 2. We missed TLB flush for those PTEs on the reclaim path, i.e., case
> > > 3 (case 1 & 2 guarantee flushes)
> > > 3. We saw crashes, but only occasionally
> > >
> > > Assuming TLB cached those PTEs, we would have seen the crashes more
> > > often, which contradicts our observation. So the conclusion is TLB
> > > didn't cache them most of the time, meaning flushing TLB just for the
> > > sake of the A-bit isn't necessary.
> > >
> > > > do you think it is safe to totally remove the flush code even for
> > > > the original
> > > > LRU?
> > >
> > > Affirmative, based on not only my words, but 3rd parties':
> > > 1. Your (indirect) observation
> > > 2. Alexander's benchmark:
> > > https://lore.kernel.org/r/BYAPR12MB271295B398729E07F31082A7CFAA0@BYAPR12MB2712.namprd12.prod.outlook.com/
> > > 3. The fundamental hardware limitation in terms of the TLB scalability
> > > (Fig. 1): https://www.usenix.org/legacy/events/osdi02/tech/full_papers/navarro/navarro.pdf
> >
> > 4. Intel's commit b13b1d2d8692 ("x86/mm: In the PTE swapout page
> > reclaim case clear the accessed bit instead of flushing the TLB")
>
> Hi Yu,
> I am going to send a RFC based on the above discussion.
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 5bcb334cd6f2..7ce6f0b6c330 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -830,7 +830,7 @@ static bool folio_referenced_one(struct folio *folio,
>                 }
>
>                 if (pvmw.pte) {
> -                       if (ptep_clear_flush_young_notify(vma, address,
> +                       if (ptep_clear_young_notify(vma, address,
>                                                 pvmw.pte)) {
>                                 /*
>                                  * Don't treat a reference through

Thanks!

This might make a difference on my 64 core Altra -- I'll test after
you post the RFC.
Yu Zhao June 17, 2022, 3:17 a.m. UTC | #19
On Thu, Jun 16, 2022 at 9:03 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Thu, Jun 16, 2022 at 8:01 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Fri, Jun 17, 2022 at 1:43 PM Yu Zhao <yuzhao@google.com> wrote:
> > >
> > > On Thu, Jun 16, 2022 at 5:29 PM Yu Zhao <yuzhao@google.com> wrote:
> > > >
> > > > On Thu, Jun 16, 2022 at 4:33 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > On Fri, Jun 17, 2022 at 9:56 AM Yu Zhao <yuzhao@google.com> wrote:
> > > > > >
> > > > > > On Wed, Jun 8, 2022 at 4:46 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > >
> > > > > > > On Thu, Jun 9, 2022 at 3:52 AM Linus Torvalds
> > > > > > > <torvalds@linux-foundation.org> wrote:
> > > > > > > >
> > > > > > > > On Tue, Jun 7, 2022 at 5:43 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > Given we used to have a flush for clear pte young in LRU, right now we are
> > > > > > > > > moving to nop in almost all cases for the flush unless the address becomes
> > > > > > > > > young exactly after look_around and before ptep_clear_flush_young_notify.
> > > > > > > > > It means we are actually dropping flush. So the question is,  were we
> > > > > > > > > overcautious? we actually don't need the flush at all even without mglru?
> > > > > > > >
> > > > > > > > We stopped flushing the TLB on A bit clears on x86 back in 2014.
> > > > > > > >
> > > > > > > > See commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case
> > > > > > > > clear the accessed bit instead of flushing the TLB").
> > > > > > >
> > > > > > > This is true for x86, RISC-V, powerpc and S390. but it is not true for
> > > > > > > most platforms.
> > > > > > >
> > > > > > > There was an attempt to do the same thing in arm64:
> > > > > > > https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1793830.html
> > > > > > > but arm64 still sent a nosync tlbi and depent on a deferred to dsb :
> > > > > > > https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1794484.html
> > > > > >
> > > > > > Barry, you've already answered your own question.
> > > > > >
> > > > > > Without commit 07509e10dcc7 arm64: pgtable: Fix pte_accessible():
> > > > > >    #define pte_accessible(mm, pte)        \
> > > > > >   -       (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid_young(pte))
> > > > > >   +       (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid(pte))
> > > > > >
> > > > > > You missed all TLB flushes for PTEs that have gone through
> > > > > > ptep_test_and_clear_young() on the reclaim path. But most of the time,
> > > > > > you got away with it, only occasional app crashes:
> > > > > > https://lore.kernel.org/r/CAGsJ_4w6JjuG4rn2P=d974wBOUtXUUnaZKnx+-G6a8_mSROa+Q@mail.gmail.com/
> > > > > >
> > > > > > Why?
> > > > >
> > > > > Yes. On the arm64 platform, ptep_test_and_clear_young() without flush
> > > > > can cause random
> > > > > App to crash.
> > > > > ptep_test_and_clear_young() + flush won't have this kind of crashes though.
> > > > > But after applying commit 07509e10dcc7 arm64: pgtable: Fix
> > > > > pte_accessible(), on arm64,
> > > > > ptep_test_and_clear_young() without flush won't cause App to crash.
> > > > >
> > > > > ptep_test_and_clear_young(), with flush, without commit 07509e10dcc7:   OK
> > > > > ptep_test_and_clear_young(), without flush, with commit 07509e10dcc7:   OK
> > > > > ptep_test_and_clear_young(), without flush, without commit 07509e10dcc7:   CRASH
> > > >
> > > > I agree -- my question was rhetorical :)
> > > >
> > > > I was trying to imply this logic:
> > > > 1. We cleared the A-bit in PTEs with ptep_test_and_clear_young()
> > > > 2. We missed TLB flush for those PTEs on the reclaim path, i.e., case
> > > > 3 (case 1 & 2 guarantee flushes)
> > > > 3. We saw crashes, but only occasionally
> > > >
> > > > Assuming TLB cached those PTEs, we would have seen the crashes more
> > > > often, which contradicts our observation. So the conclusion is TLB
> > > > didn't cache them most of the time, meaning flushing TLB just for the
> > > > sake of the A-bit isn't necessary.
> > > >
> > > > > do you think it is safe to totally remove the flush code even for
> > > > > the original
> > > > > LRU?
> > > >
> > > > Affirmative, based on not only my words, but 3rd parties':
> > > > 1. Your (indirect) observation
> > > > 2. Alexander's benchmark:
> > > > https://lore.kernel.org/r/BYAPR12MB271295B398729E07F31082A7CFAA0@BYAPR12MB2712.namprd12.prod.outlook.com/
> > > > 3. The fundamental hardware limitation in terms of the TLB scalability
> > > > (Fig. 1): https://www.usenix.org/legacy/events/osdi02/tech/full_papers/navarro/navarro.pdf
> > >
> > > 4. Intel's commit b13b1d2d8692 ("x86/mm: In the PTE swapout page
> > > reclaim case clear the accessed bit instead of flushing the TLB")
> >
> > Hi Yu,
> > I am going to send a RFC based on the above discussion.
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 5bcb334cd6f2..7ce6f0b6c330 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -830,7 +830,7 @@ static bool folio_referenced_one(struct folio *folio,
> >                 }
> >
> >                 if (pvmw.pte) {
> > -                       if (ptep_clear_flush_young_notify(vma, address,
> > +                       if (ptep_clear_young_notify(vma, address,
> >                                                 pvmw.pte)) {
> >                                 /*
> >                                  * Don't treat a reference through
>
> Thanks!
>
> This might make a difference on my 64 core Altra -- I'll test after
> you post the RFC.

Also, IIRC, it made no difference on POWER9 because POWER9 flushes the TLB
regardless of which variant is used.
Yu Zhao June 19, 2022, 8:36 p.m. UTC | #20
On Thu, Jun 16, 2022 at 9:17 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Thu, Jun 16, 2022 at 9:03 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Thu, Jun 16, 2022 at 8:01 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Fri, Jun 17, 2022 at 1:43 PM Yu Zhao <yuzhao@google.com> wrote:
> > > >
> > > > On Thu, Jun 16, 2022 at 5:29 PM Yu Zhao <yuzhao@google.com> wrote:
> > > > >
> > > > > On Thu, Jun 16, 2022 at 4:33 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Jun 17, 2022 at 9:56 AM Yu Zhao <yuzhao@google.com> wrote:
> > > > > > >
> > > > > > > On Wed, Jun 8, 2022 at 4:46 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Thu, Jun 9, 2022 at 3:52 AM Linus Torvalds
> > > > > > > > <torvalds@linux-foundation.org> wrote:
> > > > > > > > >
> > > > > > > > > On Tue, Jun 7, 2022 at 5:43 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Given we used to have a flush for clear pte young in LRU, right now we are
> > > > > > > > > > moving to nop in almost all cases for the flush unless the address becomes
> > > > > > > > > > young exactly after look_around and before ptep_clear_flush_young_notify.
> > > > > > > > > > It means we are actually dropping flush. So the question is,  were we
> > > > > > > > > > overcautious? we actually don't need the flush at all even without mglru?
> > > > > > > > >
> > > > > > > > > We stopped flushing the TLB on A bit clears on x86 back in 2014.
> > > > > > > > >
> > > > > > > > > See commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case
> > > > > > > > > clear the accessed bit instead of flushing the TLB").
> > > > > > > >
> > > > > > > > This is true for x86, RISC-V, powerpc and S390. but it is not true for
> > > > > > > > most platforms.
> > > > > > > >
> > > > > > > > There was an attempt to do the same thing in arm64:
> > > > > > > > https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1793830.html
> > > > > > > > but arm64 still sent a nosync tlbi and depent on a deferred to dsb :
> > > > > > > > https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1794484.html
> > > > > > >
> > > > > > > Barry, you've already answered your own question.
> > > > > > >
> > > > > > > Without commit 07509e10dcc7 arm64: pgtable: Fix pte_accessible():
> > > > > > >    #define pte_accessible(mm, pte)        \
> > > > > > >   -       (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid_young(pte))
> > > > > > >   +       (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid(pte))
> > > > > > >
> > > > > > > You missed all TLB flushes for PTEs that have gone through
> > > > > > > ptep_test_and_clear_young() on the reclaim path. But most of the time,
> > > > > > > you got away with it, only occasional app crashes:
> > > > > > > https://lore.kernel.org/r/CAGsJ_4w6JjuG4rn2P=d974wBOUtXUUnaZKnx+-G6a8_mSROa+Q@mail.gmail.com/
> > > > > > >
> > > > > > > Why?
> > > > > >
> > > > > > Yes. On the arm64 platform, ptep_test_and_clear_young() without flush
> > > > > > can cause random
> > > > > > App to crash.
> > > > > > ptep_test_and_clear_young() + flush won't have this kind of crashes though.
> > > > > > But after applying commit 07509e10dcc7 arm64: pgtable: Fix
> > > > > > pte_accessible(), on arm64,
> > > > > > ptep_test_and_clear_young() without flush won't cause App to crash.
> > > > > >
> > > > > > ptep_test_and_clear_young(), with flush, without commit 07509e10dcc7:   OK
> > > > > > ptep_test_and_clear_young(), without flush, with commit 07509e10dcc7:   OK
> > > > > > ptep_test_and_clear_young(), without flush, without commit 07509e10dcc7:   CRASH
> > > > >
> > > > > I agree -- my question was rhetorical :)
> > > > >
> > > > > I was trying to imply this logic:
> > > > > 1. We cleared the A-bit in PTEs with ptep_test_and_clear_young()
> > > > > 2. We missed TLB flush for those PTEs on the reclaim path, i.e., case
> > > > > 3 (case 1 & 2 guarantee flushes)
> > > > > 3. We saw crashes, but only occasionally
> > > > >
> > > > > Assuming TLB cached those PTEs, we would have seen the crashes more
> > > > > often, which contradicts our observation. So the conclusion is TLB
> > > > > didn't cache them most of the time, meaning flushing TLB just for the
> > > > > sake of the A-bit isn't necessary.
> > > > >
> > > > > > do you think it is safe to totally remove the flush code even for
> > > > > > the original
> > > > > > LRU?
> > > > >
> > > > > Affirmative, based on not only my words, but 3rd parties':
> > > > > 1. Your (indirect) observation
> > > > > 2. Alexander's benchmark:
> > > > > https://lore.kernel.org/r/BYAPR12MB271295B398729E07F31082A7CFAA0@BYAPR12MB2712.namprd12.prod.outlook.com/
> > > > > 3. The fundamental hardware limitation in terms of the TLB scalability
> > > > > (Fig. 1): https://www.usenix.org/legacy/events/osdi02/tech/full_papers/navarro/navarro.pdf
> > > >
> > > > 4. Intel's commit b13b1d2d8692 ("x86/mm: In the PTE swapout page
> > > > reclaim case clear the accessed bit instead of flushing the TLB")
> > >
> > > Hi Yu,
> > > I am going to send a RFC based on the above discussion.
> > >
> > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > index 5bcb334cd6f2..7ce6f0b6c330 100644
> > > --- a/mm/rmap.c
> > > +++ b/mm/rmap.c
> > > @@ -830,7 +830,7 @@ static bool folio_referenced_one(struct folio *folio,
> > >                 }
> > >
> > >                 if (pvmw.pte) {
> > > -                       if (ptep_clear_flush_young_notify(vma, address,
> > > +                       if (ptep_clear_young_notify(vma, address,
> > >                                                 pvmw.pte)) {
> > >                                 /*
> > >                                  * Don't treat a reference through
> >
> > Thanks!
> >
> > This might make a difference on my 64 core Altra -- I'll test after
> > you post the RFC.
>
> Also, IIRC, it made no difference on POWER9 because POWER9
> flushes TBL regardless which variant is used.
  ^^^^^^^ doesn't flush

I just verified this on POWER9. So on top of exhibits 1-4, we got:
  5. 3cb1aa7aa3940 ("powerpc/64s: Implement ptep_clear_flush_young
that does not flush TLBs")
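
To illustrate what such an arch override amounts to (an illustrative sketch,
not the literal code added by that commit), an architecture that does not want
a TLB flush when the A-bit is cleared can provide its own
ptep_clear_flush_young() that falls back to the plain test-and-clear:

#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
                                         unsigned long address, pte_t *ptep)
{
        /* clear the A-bit but deliberately skip the TLB invalidation */
        return ptep_test_and_clear_young(vma, address, ptep);
}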
Barry Song June 19, 2022, 9:56 p.m. UTC | #21
On Mon, Jun 20, 2022 at 8:37 AM Yu Zhao <yuzhao@google.com> wrote:
>
> On Thu, Jun 16, 2022 at 9:17 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Thu, Jun 16, 2022 at 9:03 PM Yu Zhao <yuzhao@google.com> wrote:
> > >
> > > On Thu, Jun 16, 2022 at 8:01 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Fri, Jun 17, 2022 at 1:43 PM Yu Zhao <yuzhao@google.com> wrote:
> > > > >
> > > > > On Thu, Jun 16, 2022 at 5:29 PM Yu Zhao <yuzhao@google.com> wrote:
> > > > > >
> > > > > > On Thu, Jun 16, 2022 at 4:33 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > >
> > > > > > > On Fri, Jun 17, 2022 at 9:56 AM Yu Zhao <yuzhao@google.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, Jun 8, 2022 at 4:46 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > On Thu, Jun 9, 2022 at 3:52 AM Linus Torvalds
> > > > > > > > > <torvalds@linux-foundation.org> wrote:
> > > > > > > > > >
> > > > > > > > > > On Tue, Jun 7, 2022 at 5:43 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Given we used to have a flush for clear pte young in LRU, right now we are
> > > > > > > > > > > moving to nop in almost all cases for the flush unless the address becomes
> > > > > > > > > > > young exactly after look_around and before ptep_clear_flush_young_notify.
> > > > > > > > > > > It means we are actually dropping flush. So the question is,  were we
> > > > > > > > > > > overcautious? we actually don't need the flush at all even without mglru?
> > > > > > > > > >
> > > > > > > > > > We stopped flushing the TLB on A bit clears on x86 back in 2014.
> > > > > > > > > >
> > > > > > > > > > See commit b13b1d2d8692 ("x86/mm: In the PTE swapout page reclaim case
> > > > > > > > > > clear the accessed bit instead of flushing the TLB").
> > > > > > > > >
> > > > > > > > > This is true for x86, RISC-V, powerpc and S390. but it is not true for
> > > > > > > > > most platforms.
> > > > > > > > >
> > > > > > > > > There was an attempt to do the same thing in arm64:
> > > > > > > > > https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1793830.html
> > > > > > > > > but arm64 still sent a nosync tlbi and depent on a deferred to dsb :
> > > > > > > > > https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1794484.html
> > > > > > > >
> > > > > > > > Barry, you've already answered your own question.
> > > > > > > >
> > > > > > > > Without commit 07509e10dcc7 arm64: pgtable: Fix pte_accessible():
> > > > > > > >    #define pte_accessible(mm, pte)        \
> > > > > > > >   -       (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid_young(pte))
> > > > > > > >   +       (mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid(pte))
> > > > > > > >
> > > > > > > > You missed all TLB flushes for PTEs that have gone through
> > > > > > > > ptep_test_and_clear_young() on the reclaim path. But most of the time,
> > > > > > > > you got away with it, only occasional app crashes:
> > > > > > > > https://lore.kernel.org/r/CAGsJ_4w6JjuG4rn2P=d974wBOUtXUUnaZKnx+-G6a8_mSROa+Q@mail.gmail.com/
> > > > > > > >
> > > > > > > > Why?
> > > > > > >
> > > > > > > Yes. On the arm64 platform, ptep_test_and_clear_young() without flush
> > > > > > > can cause random
> > > > > > > App to crash.
> > > > > > > ptep_test_and_clear_young() + flush won't have this kind of crashes though.
> > > > > > > But after applying commit 07509e10dcc7 arm64: pgtable: Fix
> > > > > > > pte_accessible(), on arm64,
> > > > > > > ptep_test_and_clear_young() without flush won't cause App to crash.
> > > > > > >
> > > > > > > ptep_test_and_clear_young(), with flush, without commit 07509e10dcc7:   OK
> > > > > > > ptep_test_and_clear_young(), without flush, with commit 07509e10dcc7:   OK
> > > > > > > ptep_test_and_clear_young(), without flush, without commit 07509e10dcc7:   CRASH
> > > > > >
> > > > > > I agree -- my question was rhetorical :)
> > > > > >
> > > > > > I was trying to imply this logic:
> > > > > > 1. We cleared the A-bit in PTEs with ptep_test_and_clear_young()
> > > > > > 2. We missed TLB flush for those PTEs on the reclaim path, i.e., case
> > > > > > 3 (case 1 & 2 guarantee flushes)
> > > > > > 3. We saw crashes, but only occasionally
> > > > > >
> > > > > > Assuming TLB cached those PTEs, we would have seen the crashes more
> > > > > > often, which contradicts our observation. So the conclusion is TLB
> > > > > > didn't cache them most of the time, meaning flushing TLB just for the
> > > > > > sake of the A-bit isn't necessary.
> > > > > >
> > > > > > > do you think it is safe to totally remove the flush code even for
> > > > > > > the original
> > > > > > > LRU?
> > > > > >
> > > > > > Affirmative, based on not only my words, but 3rd parties':
> > > > > > 1. Your (indirect) observation
> > > > > > 2. Alexander's benchmark:
> > > > > > https://lore.kernel.org/r/BYAPR12MB271295B398729E07F31082A7CFAA0@BYAPR12MB2712.namprd12.prod.outlook.com/
> > > > > > 3. The fundamental hardware limitation in terms of the TLB scalability
> > > > > > (Fig. 1): https://www.usenix.org/legacy/events/osdi02/tech/full_papers/navarro/navarro.pdf
> > > > >
> > > > > 4. Intel's commit b13b1d2d8692 ("x86/mm: In the PTE swapout page
> > > > > reclaim case clear the accessed bit instead of flushing the TLB")
> > > >
> > > > Hi Yu,
> > > > I am going to send a RFC based on the above discussion.
> > > >
> > > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > > index 5bcb334cd6f2..7ce6f0b6c330 100644
> > > > --- a/mm/rmap.c
> > > > +++ b/mm/rmap.c
> > > > @@ -830,7 +830,7 @@ static bool folio_referenced_one(struct folio *folio,
> > > >                 }
> > > >
> > > >                 if (pvmw.pte) {
> > > > -                       if (ptep_clear_flush_young_notify(vma, address,
> > > > +                       if (ptep_clear_young_notify(vma, address,
> > > >                                                 pvmw.pte)) {
> > > >                                 /*
> > > >                                  * Don't treat a reference through
> > >
> > > Thanks!
> > >
> > > This might make a difference on my 64 core Altra -- I'll test after
> > > you post the RFC.
> >
> > Also, IIRC, it made no difference on POWER9 because POWER9
> > flushes TBL regardless which variant is used.
>   ^^^^^^^ doesn't flush
>
> I just verified this on POWER9. So on top of exhibit 1-4, we got:
>   5. 3cb1aa7aa3940 ("powerpc/64s: Implement ptep_clear_flush_young
> that does not flush TLBs")

Thanks, Yu. I posted an RFC:
https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@gmail.com/

We can clarify everything in that thread :-)

Thanks
Barry
diff mbox series

Patch

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 89b14729d59f..2bfdcc77648a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -438,6 +438,7 @@  static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
  * - LRU isolation
  * - lock_page_memcg()
  * - exclusive reference
+ * - mem_cgroup_trylock_pages()
  *
  * For a kmem folio a caller should hold an rcu read lock to protect memcg
  * associated with a kmem folio from being released.
@@ -499,6 +500,7 @@  static inline struct mem_cgroup *folio_memcg_rcu(struct folio *folio)
  * - LRU isolation
  * - lock_page_memcg()
  * - exclusive reference
+ * - mem_cgroup_trylock_pages()
  *
  * For a kmem page a caller should hold an rcu read lock to protect memcg
  * associated with a kmem page from being released.
@@ -948,6 +950,23 @@  void unlock_page_memcg(struct page *page);
 
 void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val);
 
+/* try to stabilize folio_memcg() for all the pages in a memcg */
+static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
+{
+	rcu_read_lock();
+
+	if (mem_cgroup_disabled() || !atomic_read(&memcg->moving_account))
+		return true;
+
+	rcu_read_unlock();
+	return false;
+}
+
+static inline void mem_cgroup_unlock_pages(void)
+{
+	rcu_read_unlock();
+}
+
 /* idx can be of type enum memcg_stat_item or node_stat_item */
 static inline void mod_memcg_state(struct mem_cgroup *memcg,
 				   int idx, int val)
@@ -1386,6 +1405,18 @@  static inline void folio_memcg_unlock(struct folio *folio)
 {
 }
 
+static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
+{
+	/* to match folio_memcg_rcu() */
+	rcu_read_lock();
+	return true;
+}
+
+static inline void mem_cgroup_unlock_pages(void)
+{
+	rcu_read_unlock();
+}
+
 static inline void mem_cgroup_handle_over_high(void)
 {
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 894c289c2c06..4e8ab4ad4473 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1523,6 +1523,11 @@  static inline unsigned long folio_pfn(struct folio *folio)
 	return page_to_pfn(&folio->page);
 }
 
+static inline struct folio *pfn_folio(unsigned long pfn)
+{
+	return page_folio(pfn_to_page(pfn));
+}
+
 static inline atomic_t *folio_pincount_ptr(struct folio *folio)
 {
 	return &folio_page(folio, 1)->compound_pincount;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2d023d243e73..f0b980362186 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -374,6 +374,7 @@  enum lruvec_flags {
 #ifndef __GENERATING_BOUNDS_H
 
 struct lruvec;
+struct page_vma_mapped_walk;
 
 #define LRU_GEN_MASK		((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
 #define LRU_REFS_MASK		((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)
@@ -429,6 +430,7 @@  struct lru_gen_struct {
 };
 
 void lru_gen_init_lruvec(struct lruvec *lruvec);
+void lru_gen_look_around(struct page_vma_mapped_walk *pvmw);
 
 #ifdef CONFIG_MEMCG
 void lru_gen_init_memcg(struct mem_cgroup *memcg);
@@ -441,6 +443,10 @@  static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
 {
 }
 
+static inline void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+{
+}
+
 #ifdef CONFIG_MEMCG
 static inline void lru_gen_init_memcg(struct mem_cgroup *memcg)
 {
diff --git a/mm/internal.h b/mm/internal.h
index cf16280ce132..59d2422b647d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -68,6 +68,7 @@  vm_fault_t do_swap_page(struct vm_fault *vmf);
 void folio_rotate_reclaimable(struct folio *folio);
 bool __folio_end_writeback(struct folio *folio);
 void deactivate_file_folio(struct folio *folio);
+void folio_activate(struct folio *folio);
 
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2ee074f80e72..98aa720ac639 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2769,6 +2769,7 @@  static void commit_charge(struct folio *folio, struct mem_cgroup *memcg)
 	 * - LRU isolation
 	 * - lock_page_memcg()
 	 * - exclusive reference
+	 * - mem_cgroup_trylock_pages()
 	 */
 	folio->memcg_data = (unsigned long)memcg;
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index fedb82371efe..7cb7ef29088a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -73,6 +73,7 @@ 
 #include <linux/page_idle.h>
 #include <linux/memremap.h>
 #include <linux/userfaultfd_k.h>
+#include <linux/mm_inline.h>
 
 #include <asm/tlbflush.h>
 
@@ -821,6 +822,12 @@  static bool folio_referenced_one(struct folio *folio,
 		}
 
 		if (pvmw.pte) {
+			if (lru_gen_enabled() && pte_young(*pvmw.pte) &&
+			    !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ))) {
+				lru_gen_look_around(&pvmw);
+				referenced++;
+			}
+
 			if (ptep_clear_flush_young_notify(vma, address,
 						pvmw.pte)) {
 				/*
diff --git a/mm/swap.c b/mm/swap.c
index a99d22308f28..0aa1d0b33d42 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -342,7 +342,7 @@  static bool need_activate_page_drain(int cpu)
 	return pagevec_count(&per_cpu(lru_pvecs.activate_page, cpu)) != 0;
 }
 
-static void folio_activate(struct folio *folio)
+void folio_activate(struct folio *folio)
 {
 	if (folio_test_lru(folio) && !folio_test_active(folio) &&
 	    !folio_test_unevictable(folio)) {
@@ -362,7 +362,7 @@  static inline void activate_page_drain(int cpu)
 {
 }
 
-static void folio_activate(struct folio *folio)
+void folio_activate(struct folio *folio)
 {
 	struct lruvec *lruvec;
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 891f0ab69b3a..cf89a28c3b0e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1554,6 +1554,11 @@  static unsigned int shrink_page_list(struct list_head *page_list,
 		if (!sc->may_unmap && page_mapped(page))
 			goto keep_locked;
 
+		/* folio_update_gen() tried to promote this page? */
+		if (lru_gen_enabled() && !ignore_references &&
+		    page_mapped(page) && PageReferenced(page))
+			goto keep_locked;
+
 		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
 			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
 
@@ -3137,6 +3142,28 @@  static bool positive_ctrl_err(struct ctrl_pos *sp, struct ctrl_pos *pv)
  *                          the aging
  ******************************************************************************/
 
+static int folio_update_gen(struct folio *folio, int gen)
+{
+	unsigned long new_flags, old_flags = READ_ONCE(folio->flags);
+
+	VM_WARN_ON_ONCE(gen >= MAX_NR_GENS);
+	VM_WARN_ON_ONCE(!rcu_read_lock_held());
+
+	do {
+		/* lru_gen_del_folio() has isolated this page? */
+		if (!(old_flags & LRU_GEN_MASK)) {
+			/* for shrink_page_list() */
+			new_flags = old_flags | BIT(PG_referenced);
+			continue;
+		}
+
+		new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_MASK | LRU_REFS_FLAGS);
+		new_flags |= (gen + 1UL) << LRU_GEN_PGOFF;
+	} while (!try_cmpxchg(&folio->flags, &old_flags, new_flags));
+
+	return ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+}
+
 static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
 {
 	int type = folio_is_file_lru(folio);
@@ -3147,6 +3174,11 @@  static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai
 	VM_WARN_ON_ONCE_FOLIO(!(old_flags & LRU_GEN_MASK), folio);
 
 	do {
+		new_gen = ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
+		/* folio_update_gen() has promoted this page? */
+		if (new_gen >= 0 && new_gen != old_gen)
+			return new_gen;
+
 		new_gen = (old_gen + 1) % MAX_NR_GENS;
 
 		new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_MASK | LRU_REFS_FLAGS);
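
Both helpers above read and write a generation number that is stored off-by-one in folio->flags, so that a zero field means "not on a multi-gen LRU list". A small user-space sketch of that encoding, with GEN_PGOFF/GEN_WIDTH as hypothetical stand-ins for the series' LRU_GEN_PGOFF and LRU_GEN_MASK:

#include <assert.h>

#define GEN_PGOFF	2UL	/* hypothetical bit offset in flags */
#define GEN_WIDTH	3UL	/* enough for gen+1 with MAX_NR_GENS == 4 */
#define GEN_MASK	(((1UL << GEN_WIDTH) - 1) << GEN_PGOFF)

/* store gen+1 so that a cleared field decodes to -1 ("off the LRU") */
static unsigned long set_gen(unsigned long flags, int gen)
{
	flags &= ~GEN_MASK;
	return flags | ((gen + 1UL) << GEN_PGOFF);
}

static int get_gen(unsigned long flags)
{
	return (int)((flags & GEN_MASK) >> GEN_PGOFF) - 1;
}

int main(void)
{
	unsigned long flags = 0;

	assert(get_gen(flags) == -1);	/* not on a multi-gen LRU list */
	flags = set_gen(flags, 3);
	assert(get_gen(flags) == 3);
	return 0;
}
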
@@ -3365,6 +3397,125 @@  static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
 }
 
+/*
+ * This function exploits spatial locality when shrink_page_list() walks the
+ * rmap. It scans the adjacent PTEs of a young PTE and promotes hot pages.
+ */
+void lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
+{
+	int i;
+	pte_t *pte;
+	unsigned long start;
+	unsigned long end;
+	unsigned long addr;
+	unsigned long bitmap[BITS_TO_LONGS(MIN_LRU_BATCH)] = {};
+	struct folio *folio = pfn_folio(pvmw->pfn);
+	struct mem_cgroup *memcg = folio_memcg(folio);
+	struct pglist_data *pgdat = folio_pgdat(folio);
+	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+	DEFINE_MAX_SEQ(lruvec);
+	int old_gen, new_gen = lru_gen_from_seq(max_seq);
+
+	lockdep_assert_held(pvmw->ptl);
+	VM_WARN_ON_ONCE_FOLIO(folio_test_lru(folio), folio);
+
+	if (spin_is_contended(pvmw->ptl))
+		return;
+
+	start = max(pvmw->address & PMD_MASK, pvmw->vma->vm_start);
+	end = pmd_addr_end(pvmw->address, pvmw->vma->vm_end);
+
+	if (end - start > MIN_LRU_BATCH * PAGE_SIZE) {
+		if (pvmw->address - start < MIN_LRU_BATCH * PAGE_SIZE / 2)
+			end = start + MIN_LRU_BATCH * PAGE_SIZE;
+		else if (end - pvmw->address < MIN_LRU_BATCH * PAGE_SIZE / 2)
+			start = end - MIN_LRU_BATCH * PAGE_SIZE;
+		else {
+			start = pvmw->address - MIN_LRU_BATCH * PAGE_SIZE / 2;
+			end = pvmw->address + MIN_LRU_BATCH * PAGE_SIZE / 2;
+		}
+	}
+
+	pte = pvmw->pte - (pvmw->address - start) / PAGE_SIZE;
+
+	rcu_read_lock();
+	arch_enter_lazy_mmu_mode();
+
+	for (i = 0, addr = start; addr != end; i++, addr += PAGE_SIZE) {
+		unsigned long pfn = pte_pfn(pte[i]);
+
+		VM_WARN_ON_ONCE(addr < pvmw->vma->vm_start || addr >= pvmw->vma->vm_end);
+
+		if (!pte_present(pte[i]) || is_zero_pfn(pfn))
+			continue;
+
+		if (WARN_ON_ONCE(pte_devmap(pte[i]) || pte_special(pte[i])))
+			continue;
+
+		if (!pte_young(pte[i]))
+			continue;
+
+		VM_WARN_ON_ONCE(!pfn_valid(pfn));
+		if (pfn < pgdat->node_start_pfn || pfn >= pgdat_end_pfn(pgdat))
+			continue;
+
+		folio = pfn_folio(pfn);
+		if (folio_nid(folio) != pgdat->node_id)
+			continue;
+
+		if (folio_memcg_rcu(folio) != memcg)
+			continue;
+
+		if (!ptep_test_and_clear_young(pvmw->vma, addr, pte + i))
+			continue;
+
+		if (pte_dirty(pte[i]) && !folio_test_dirty(folio) &&
+		    !(folio_test_anon(folio) && folio_test_swapbacked(folio) &&
+		      !folio_test_swapcache(folio)))
+			folio_mark_dirty(folio);
+
+		old_gen = folio_lru_gen(folio);
+		if (old_gen < 0)
+			folio_set_referenced(folio);
+		else if (old_gen != new_gen)
+			__set_bit(i, bitmap);
+	}
+
+	arch_leave_lazy_mmu_mode();
+	rcu_read_unlock();
+
+	if (bitmap_weight(bitmap, MIN_LRU_BATCH) < PAGEVEC_SIZE) {
+		for_each_set_bit(i, bitmap, MIN_LRU_BATCH) {
+			folio = pfn_folio(pte_pfn(pte[i]));
+			folio_activate(folio);
+		}
+		return;
+	}
+
+	/* folio_update_gen() requires stable folio_memcg() */
+	if (!mem_cgroup_trylock_pages(memcg))
+		return;
+
+	spin_lock_irq(&lruvec->lru_lock);
+	new_gen = lru_gen_from_seq(lruvec->lrugen.max_seq);
+
+	for_each_set_bit(i, bitmap, MIN_LRU_BATCH) {
+		folio = pfn_folio(pte_pfn(pte[i]));
+		if (folio_memcg_rcu(folio) != memcg)
+			continue;
+
+		old_gen = folio_update_gen(folio, new_gen);
+		if (old_gen < 0 || old_gen == new_gen)
+			continue;
+
+		lru_gen_update_size(lruvec, folio, old_gen, new_gen);
+	}
+
+	spin_unlock_irq(&lruvec->lru_lock);
+
+	mem_cgroup_unlock_pages();
+}
+
 /******************************************************************************
  *                          the eviction
  ******************************************************************************/
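
Before walking the PTEs, lru_gen_look_around() above clamps its window to at most MIN_LRU_BATCH pages around the faulting address, staying inside the VMA and the enclosing PMD. A minimal sketch of that window selection, taking the already-clamped [lo, hi) range as input and using illustrative constants rather than the kernel's:

#include <stdio.h>

#define PAGE_SIZE	4096UL
#define MIN_LRU_BATCH	64UL	/* BITS_PER_LONG on 64-bit, as in the series */

/* lo/hi stand in for max(addr & PMD_MASK, vm_start) and pmd_addr_end() */
static void pick_window(unsigned long addr, unsigned long lo, unsigned long hi,
			unsigned long *start, unsigned long *end)
{
	unsigned long span = MIN_LRU_BATCH * PAGE_SIZE;

	*start = lo;
	*end = hi;
	if (*end - *start <= span)
		return;			/* the whole range fits */

	if (addr - *start < span / 2)	/* near the low end: grow upward */
		*end = *start + span;
	else if (*end - addr < span / 2)	/* near the high end: grow downward */
		*start = *end - span;
	else {				/* otherwise centre on addr */
		*start = addr - span / 2;
		*end = addr + span / 2;
	}
}

int main(void)
{
	unsigned long start, end;

	/* hypothetical 1 MiB VMA with the young PTE in the middle */
	pick_window(0x480000, 0x400000, 0x500000, &start, &end);
	printf("[%#lx, %#lx)\n", start, end);	/* a 256 KiB window */
	return 0;
}

The hits inside that window are then either activated one by one (when fewer than PAGEVEC_SIZE bits are set) or batch-promoted under lru_lock, which keeps the common case off the LRU lock.
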
@@ -3401,6 +3552,12 @@  static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx)
 		return true;
 	}
 
+	/* promoted */
+	if (gen != lru_gen_from_seq(lrugen->min_seq[type])) {
+		list_move(&folio->lru, &lrugen->lists[gen][type][zone]);
+		return true;
+	}
+
 	/* protected */
 	if (tier > tier_idx) {
 		int hist = lru_hist_from_seq(lrugen->min_seq[type]);