
[-next,v5] mm: vmscan: retry folios written back while isolated for traditional LRU

Message ID 20241220010931.3603111-1-chenridong@huaweicloud.com (mailing list archive)
State New
Series [-next,v5] mm: vmscan: retry folios written back while isolated for traditional LRU

Commit Message

Chen Ridong Dec. 20, 2024, 1:09 a.m. UTC
From: Chen Ridong <chenridong@huawei.com>

The page reclaim isolates a batch of folios from the tail of one of the
LRU lists and works on those folios one by one.  For a suitable
swap-backed folio, if the swap device is async, it queues that folio for
writeback.  After the page reclaim finishes an entire batch, it puts back
the folios it queued for writeback to the head of the original LRU list.

In the meantime, the page writeback flushes the queued folios also by
batches.  Its batching logic is independent from that of the page reclaim.
For each of the folios it writes back, the page writeback calls
folio_rotate_reclaimable() which tries to rotate a folio to the tail.

folio_rotate_reclaimable() only works for a folio after the page reclaim
has put it back.  If an async swap device is fast enough, the page
writeback can finish with that folio while the page reclaim is still
working on the rest of the batch containing it.  In this case, that folio
will remain at the head, and the page reclaim will not retry it until the
scan eventually reaches it again.

Commit 359a5e1416ca ("mm: multi-gen LRU: retry folios written back
while isolated") fixed this issue, but only for MGLRU; it also exists in
the traditional active/inactive LRU.  The issue is worse when a THP is
split, because the split makes the isolated list longer, so a batch of
folios takes longer to reclaim.

Fix the traditional LRU in the same way: first extract the common logic
into a new helper, find_folios_written_back(), then reuse it in
shrink_inactive_list(), and finally retry reclaiming those folios that
may have missed the rotation.

Link: https://lore.kernel.org/linux-kernel/20241010081802.290893-1-chenridong@huaweicloud.com/
Link: https://lore.kernel.org/linux-kernel/CAGsJ_4zqL8ZHNRZ44o_CC69kE7DBVXvbZfvmQxMGiFqRxqHQdA@mail.gmail.com/
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 mm/vmscan.c | 108 ++++++++++++++++++++++++++++++++++------------------
 1 file changed, 70 insertions(+), 38 deletions(-)
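
In short, the fix makes shrink_inactive_list() follow the same retry
pattern that evict_folios() already uses for MGLRU.  A simplified sketch
of the resulting control flow (locking, statistics and tracing omitted;
see the full patch below):

	LIST_HEAD(folio_list);	/* folios isolated from the LRU tail */
	LIST_HEAD(clean_list);	/* folios that likely missed folio_rotate_reclaimable() */
	bool is_retried = false;

retry:
	nr_reclaimed = shrink_folio_list(&folio_list, pgdat, sc, &stat, false);

	/* move folios that were already written back to clean_list (first pass only) */
	find_folios_written_back(&folio_list, &clean_list, is_retried);

	/* put everything else back on the LRU */
	move_folios_to_lru(lruvec, &folio_list);

	if (!list_empty(&clean_list)) {
		/* give the clean folios a second chance before returning */
		list_splice_init(&clean_list, &folio_list);
		is_retried = true;
		goto retry;
	}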

Comments

Barry Song Dec. 20, 2024, 2:30 a.m. UTC | #1
On Fri, Dec 20, 2024 at 2:19 PM Chen Ridong <chenridong@huaweicloud.com> wrote:
>
> From: Chen Ridong <chenridong@huawei.com>
>
> The page reclaim isolates a batch of folios from the tail of one of the
> LRU lists and works on those folios one by one.  For a suitable
> swap-backed folio, if the swap device is async, it queues that folio for
> writeback.  After the page reclaim finishes an entire batch, it puts back
> the folios it queued for writeback to the head of the original LRU list.
>
> In the meantime, the page writeback flushes the queued folios also by
> batches.  Its batching logic is independent from that of the page reclaim.
> For each of the folios it writes back, the page writeback calls
> folio_rotate_reclaimable() which tries to rotate a folio to the tail.
>
> folio_rotate_reclaimable() only works for a folio after the page reclaim
> has put it back.  If an async swap device is fast enough, the page
> writeback can finish with that folio while the page reclaim is still
> working on the rest of the batch containing it.  In this case, that folio
> will remain at the head and the page reclaim will not retry it before
> reaching there.
>
> The commit 359a5e1416ca ("mm: multi-gen LRU: retry folios written back
> while isolated") only fixed the issue for mglru. However, this issue
> also exists in the traditional active/inactive LRU. This issue will be
> worse if THP is split, which makes the list longer and needs longer time
> to finish a batch of folios reclaim.
>
> This issue should be fixed in the same way for the traditional LRU.
> Therefore, the common logic was extracted to the 'find_folios_written_back'
> function firstly, which is then reused in the 'shrink_inactive_list'
> function. Finally, retry reclaiming those folios that may have missed the
> rotation for traditional LRU.
>
> Link: https://lore.kernel.org/linux-kernel/20241010081802.290893-1-chenridong@huaweicloud.com/
> Link: https://lore.kernel.org/linux-kernel/CAGsJ_4zqL8ZHNRZ44o_CC69kE7DBVXvbZfvmQxMGiFqRxqHQdA@mail.gmail.com/
> Signed-off-by: Chen Ridong <chenridong@huawei.com>
> ---
>  mm/vmscan.c | 108 ++++++++++++++++++++++++++++++++++------------------
>  1 file changed, 70 insertions(+), 38 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 39886f435ec5..e67e446540ba 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -283,6 +283,39 @@ static void set_task_reclaim_state(struct task_struct *task,
>         task->reclaim_state = rs;
>  }
>
> +/**
> + * find_folios_written_back - Find and move the written back folios to a new list.
> + * @list: filios list
> + * @clean: the written back folios list
> + * @is_retried: whether the list has already been retried.
> + */
> +static inline void find_folios_written_back(struct list_head *list,
> +               struct list_head *clean, bool is_retried)
> +{
> +       struct folio *folio;
> +       struct folio *next;
> +
> +       list_for_each_entry_safe_reverse(folio, next, list, lru) {
> +               if (!folio_evictable(folio)) {
> +                       list_del(&folio->lru);
> +                       folio_putback_lru(folio);
> +                       continue;
> +               }
> +
> +               /* retry folios that may have missed folio_rotate_reclaimable() */
> +               if (!is_retried && !folio_test_active(folio) && !folio_mapped(folio) &&
> +                   !folio_test_dirty(folio) && !folio_test_writeback(folio)) {
> +                       list_move(&folio->lru, clean);
> +                       continue;
> +               }
> +
> +               /* don't add rejected folios to the oldest generation */
> +               if (lru_gen_enabled() && !lru_gen_distance(folio, false))
> +                       set_mask_bits(&folio->flags, LRU_REFS_FLAGS, BIT(PG_active));
> +       }
> +
> +}
> +
>  /*
>   * flush_reclaim_state(): add pages reclaimed outside of LRU-based reclaim to
>   * scan_control->nr_reclaimed.
> @@ -1959,14 +1992,18 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>                 enum lru_list lru)
>  {
>         LIST_HEAD(folio_list);
> +       LIST_HEAD(clean_list);
>         unsigned long nr_scanned;
> -       unsigned int nr_reclaimed = 0;
> +       unsigned int nr_reclaimed, total_reclaimed = 0;
> +       unsigned int nr_pageout = 0;
> +       unsigned int nr_unqueued_dirty = 0;
>         unsigned long nr_taken;
>         struct reclaim_stat stat;
>         bool file = is_file_lru(lru);
>         enum vm_event_item item;
>         struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>         bool stalled = false;
> +       bool is_retried = false;
>
>         while (unlikely(too_many_isolated(pgdat, file, sc))) {
>                 if (stalled)
> @@ -2000,22 +2037,47 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>         if (nr_taken == 0)
>                 return 0;
>
> +retry:
>         nr_reclaimed = shrink_folio_list(&folio_list, pgdat, sc, &stat, false);
>
> +       sc->nr.dirty += stat.nr_dirty;
> +       sc->nr.congested += stat.nr_congested;
> +       sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
> +       sc->nr.writeback += stat.nr_writeback;
> +       sc->nr.immediate += stat.nr_immediate;
> +       total_reclaimed += nr_reclaimed;
> +       nr_pageout += stat.nr_pageout;
> +       nr_unqueued_dirty += stat.nr_unqueued_dirty;
> +
> +       trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
> +                       nr_scanned, nr_reclaimed, &stat, sc->priority, file);

This is a bit odd, as nr_scanned during a retry still uses the previous
nr_scanned value.  However, I find that mglru shows no difference:

retry:
        reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false);
        sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
        sc->nr_reclaimed += reclaimed;
        trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
                        scanned, reclaimed, &stat, sc->priority,
                        type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);

Currently, the active/inactive path aligns with mglru in this trace.  It
seems that userspace BPF should recognize that the nr_scanned reported
during a retry doesn't mean we are isolating that many new folios.
Ideally, an is_retry flag should be passed to the trace, allowing
userspace to identify that it's a retry and disregard the nr_scanned
value.

It might be worth addressing this in a separate patch.  Adding Bixuan to
clarify how userspace depends on this trace and whether "retry" will
break his userspace BPF for both the MGLRU and active/inactive cases.

Otherwise, the patch looks good to me.
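
Just for illustration, a minimal sketch of what that could look like at
the call site.  The extra "is_retried" argument is hypothetical and is
not part of this patch; it would also require extending the
mm_vmscan_lru_shrink_inactive trace event definition itself:

	/* hypothetical: tell userspace this is a retry pass so it can
	 * disregard nr_scanned */
	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
			nr_scanned, nr_reclaimed, &stat,
			sc->priority, file, is_retried);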

> +
> +       find_folios_written_back(&folio_list, &clean_list, is_retried);
> +
>         spin_lock_irq(&lruvec->lru_lock);
>         move_folios_to_lru(lruvec, &folio_list);
>
>         __mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(),
>                                         stat.nr_demoted);
> -       __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
>         item = PGSTEAL_KSWAPD + reclaimer_offset();
>         if (!cgroup_reclaim(sc))
>                 __count_vm_events(item, nr_reclaimed);
>         __count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
>         __count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
> +
> +       if (!list_empty(&clean_list)) {
> +               list_splice_init(&clean_list, &folio_list);
> +               is_retried = true;
> +               spin_unlock_irq(&lruvec->lru_lock);
> +               goto retry;
> +       }
> +       __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
>         spin_unlock_irq(&lruvec->lru_lock);
> +       sc->nr.taken += nr_taken;
> +       if (file)
> +               sc->nr.file_taken += nr_taken;
>
> -       lru_note_cost(lruvec, file, stat.nr_pageout, nr_scanned - nr_reclaimed);
> +       lru_note_cost(lruvec, file, nr_pageout, nr_scanned - total_reclaimed);
>
>         /*
>          * If dirty folios are scanned that are not queued for IO, it
> @@ -2028,7 +2090,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>          * the flushers simply cannot keep up with the allocation
>          * rate. Nudge the flusher threads in case they are asleep.
>          */
> -       if (stat.nr_unqueued_dirty == nr_taken) {
> +       if (nr_unqueued_dirty == nr_taken) {
>                 wakeup_flusher_threads(WB_REASON_VMSCAN);
>                 /*
>                  * For cgroupv1 dirty throttling is achieved by waking up
> @@ -2043,18 +2105,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>                         reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
>         }
>
> -       sc->nr.dirty += stat.nr_dirty;
> -       sc->nr.congested += stat.nr_congested;
> -       sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
> -       sc->nr.writeback += stat.nr_writeback;
> -       sc->nr.immediate += stat.nr_immediate;
> -       sc->nr.taken += nr_taken;
> -       if (file)
> -               sc->nr.file_taken += nr_taken;
> -
> -       trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
> -                       nr_scanned, nr_reclaimed, &stat, sc->priority, file);
> -       return nr_reclaimed;
> +       return total_reclaimed;
>  }
>
>  /*
> @@ -4585,12 +4636,10 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
>         int reclaimed;
>         LIST_HEAD(list);
>         LIST_HEAD(clean);
> -       struct folio *folio;
> -       struct folio *next;
>         enum vm_event_item item;
>         struct reclaim_stat stat;
>         struct lru_gen_mm_walk *walk;
> -       bool skip_retry = false;
> +       bool is_retried = false;
>         struct lru_gen_folio *lrugen = &lruvec->lrugen;
>         struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>         struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> @@ -4616,24 +4665,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
>                         scanned, reclaimed, &stat, sc->priority,
>                         type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
>
> -       list_for_each_entry_safe_reverse(folio, next, &list, lru) {
> -               if (!folio_evictable(folio)) {
> -                       list_del(&folio->lru);
> -                       folio_putback_lru(folio);
> -                       continue;
> -               }
> -
> -               /* retry folios that may have missed folio_rotate_reclaimable() */
> -               if (!skip_retry && !folio_test_active(folio) && !folio_mapped(folio) &&
> -                   !folio_test_dirty(folio) && !folio_test_writeback(folio)) {
> -                       list_move(&folio->lru, &clean);
> -                       continue;
> -               }
> -
> -               /* don't add rejected folios to the oldest generation */
> -               if (!lru_gen_distance(folio, false))
> -                       set_mask_bits(&folio->flags, LRU_REFS_FLAGS, BIT(PG_active));
> -       }
> +       find_folios_written_back(&list, &clean, is_retried);
>
>         spin_lock_irq(&lruvec->lru_lock);
>
> @@ -4656,7 +4688,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
>         list_splice_init(&clean, &list);
>
>         if (!list_empty(&list)) {
> -               skip_retry = true;
> +               is_retried = true;
>                 goto retry;
>         }
>
> --
> 2.34.1
>

Thanks
barry
Barry Song Dec. 20, 2024, 3:09 a.m. UTC | #2
On Fri, Dec 20, 2024 at 3:30 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Fri, Dec 20, 2024 at 2:19 PM Chen Ridong <chenridong@huaweicloud.com> wrote:
> >
> > From: Chen Ridong <chenridong@huawei.com>
> >
> > The page reclaim isolates a batch of folios from the tail of one of the
> > LRU lists and works on those folios one by one.  For a suitable
> > swap-backed folio, if the swap device is async, it queues that folio for
> > writeback.  After the page reclaim finishes an entire batch, it puts back
> > the folios it queued for writeback to the head of the original LRU list.
> >
> > In the meantime, the page writeback flushes the queued folios also by
> > batches.  Its batching logic is independent from that of the page reclaim.
> > For each of the folios it writes back, the page writeback calls
> > folio_rotate_reclaimable() which tries to rotate a folio to the tail.
> >
> > folio_rotate_reclaimable() only works for a folio after the page reclaim
> > has put it back.  If an async swap device is fast enough, the page
> > writeback can finish with that folio while the page reclaim is still
> > working on the rest of the batch containing it.  In this case, that folio
> > will remain at the head and the page reclaim will not retry it before
> > reaching there.
> >
> > The commit 359a5e1416ca ("mm: multi-gen LRU: retry folios written back
> > while isolated") only fixed the issue for mglru. However, this issue
> > also exists in the traditional active/inactive LRU. This issue will be
> > worse if THP is split, which makes the list longer and needs longer time
> > to finish a batch of folios reclaim.
> >
> > This issue should be fixed in the same way for the traditional LRU.
> > Therefore, the common logic was extracted to the 'find_folios_written_back'
> > function firstly, which is then reused in the 'shrink_inactive_list'
> > function. Finally, retry reclaiming those folios that may have missed the
> > rotation for traditional LRU.
> >
> > Link: https://lore.kernel.org/linux-kernel/20241010081802.290893-1-chenridong@huaweicloud.com/
> > Link: https://lore.kernel.org/linux-kernel/CAGsJ_4zqL8ZHNRZ44o_CC69kE7DBVXvbZfvmQxMGiFqRxqHQdA@mail.gmail.com/
> > Signed-off-by: Chen Ridong <chenridong@huawei.com>
> > ---
> >  mm/vmscan.c | 108 ++++++++++++++++++++++++++++++++++------------------
> >  1 file changed, 70 insertions(+), 38 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 39886f435ec5..e67e446540ba 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -283,6 +283,39 @@ static void set_task_reclaim_state(struct task_struct *task,
> >         task->reclaim_state = rs;
> >  }
> >
> > +/**
> > + * find_folios_written_back - Find and move the written back folios to a new list.
> > + * @list: filios list
> > + * @clean: the written back folios list
> > + * @is_retried: whether the list has already been retried.
> > + */
> > +static inline void find_folios_written_back(struct list_head *list,
> > +               struct list_head *clean, bool is_retried)
> > +{
> > +       struct folio *folio;
> > +       struct folio *next;
> > +
> > +       list_for_each_entry_safe_reverse(folio, next, list, lru) {
> > +               if (!folio_evictable(folio)) {
> > +                       list_del(&folio->lru);
> > +                       folio_putback_lru(folio);
> > +                       continue;
> > +               }
> > +
> > +               /* retry folios that may have missed folio_rotate_reclaimable() */
> > +               if (!is_retried && !folio_test_active(folio) && !folio_mapped(folio) &&
> > +                   !folio_test_dirty(folio) && !folio_test_writeback(folio)) {
> > +                       list_move(&folio->lru, clean);
> > +                       continue;
> > +               }
> > +
> > +               /* don't add rejected folios to the oldest generation */
> > +               if (lru_gen_enabled() && !lru_gen_distance(folio, false))
> > +                       set_mask_bits(&folio->flags, LRU_REFS_FLAGS, BIT(PG_active));
> > +       }
> > +
> > +}
> > +
> >  /*
> >   * flush_reclaim_state(): add pages reclaimed outside of LRU-based reclaim to
> >   * scan_control->nr_reclaimed.
> > @@ -1959,14 +1992,18 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
> >                 enum lru_list lru)
> >  {
> >         LIST_HEAD(folio_list);
> > +       LIST_HEAD(clean_list);
> >         unsigned long nr_scanned;
> > -       unsigned int nr_reclaimed = 0;
> > +       unsigned int nr_reclaimed, total_reclaimed = 0;
> > +       unsigned int nr_pageout = 0;
> > +       unsigned int nr_unqueued_dirty = 0;
> >         unsigned long nr_taken;
> >         struct reclaim_stat stat;
> >         bool file = is_file_lru(lru);
> >         enum vm_event_item item;
> >         struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> >         bool stalled = false;
> > +       bool is_retried = false;

The name is_retried is a bit confusing. It should be is_retry or
is_retrying, since we are currently retrying, not stating that we have
already retried.

> >
> >         while (unlikely(too_many_isolated(pgdat, file, sc))) {
> >                 if (stalled)
> > @@ -2000,22 +2037,47 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
> >         if (nr_taken == 0)
> >                 return 0;
> >
> > +retry:
> >         nr_reclaimed = shrink_folio_list(&folio_list, pgdat, sc, &stat, false);
> >
> > +       sc->nr.dirty += stat.nr_dirty;
> > +       sc->nr.congested += stat.nr_congested;
> > +       sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
> > +       sc->nr.writeback += stat.nr_writeback;
> > +       sc->nr.immediate += stat.nr_immediate;
> > +       total_reclaimed += nr_reclaimed;
> > +       nr_pageout += stat.nr_pageout;
> > +       nr_unqueued_dirty += stat.nr_unqueued_dirty;
> > +
> > +       trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
> > +                       nr_scanned, nr_reclaimed, &stat, sc->priority, file);
>
> This is a bit odd, as nr_scanned during a retry still uses the
> previous nr_scanned
> value. However, I find that mglru shows no difference.
>
> retry:
>         reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false);
>         sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
>         sc->nr_reclaimed += reclaimed;
>         trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
>                         scanned, reclaimed, &stat, sc->priority,
>                         type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
>
> Currently, the active/inactive state aligns with mglru in this trace.
> It seems that
> the userspace BPF should recognize that the nr_scanned during a retry doesn't
> indicate we are isolating new nr_scanned folios. Ideally, the is_retry
> flag should
> be passed to the trace, allowing userspace to identify that it's a retry and
> disregard the nr_scanned value.
>
> It might be worth addressing this in a separate patch. Add Bixuan to clarify
> how userspace depends on this trace and if "retry" will break his userspace
> BPF for both MGLRU and active/inactive cases.
>
> Otherwise, the patch looks good to me.
>

By the way, it's completely clear that the trace was added after mglru's retry:
https://lore.kernel.org/linux-mm/20240105013607.2868-3-cuibixuan@vivo.com/

Therefore, I don't believe the potential confusion about nr_scanned in the trace
should prevent Ridong's fix for the missed rotation of written-back folios from
proceeding.

If there is an issue with that, we should open a separate thread to address the
trace.

Please feel free to add the tag below in a future version after you fix
"is_retried".

Reviewed-by: Barry Song <baohua@kernel.org>

> > +
> > +       find_folios_written_back(&folio_list, &clean_list, is_retried);
> > +
> >         spin_lock_irq(&lruvec->lru_lock);
> >         move_folios_to_lru(lruvec, &folio_list);
> >
> >         __mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(),
> >                                         stat.nr_demoted);
> > -       __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
> >         item = PGSTEAL_KSWAPD + reclaimer_offset();
> >         if (!cgroup_reclaim(sc))
> >                 __count_vm_events(item, nr_reclaimed);
> >         __count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
> >         __count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
> > +
> > +       if (!list_empty(&clean_list)) {
> > +               list_splice_init(&clean_list, &folio_list);
> > +               is_retried = true;
> > +               spin_unlock_irq(&lruvec->lru_lock);
> > +               goto retry;
> > +       }
> > +       __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
> >         spin_unlock_irq(&lruvec->lru_lock);
> > +       sc->nr.taken += nr_taken;
> > +       if (file)
> > +               sc->nr.file_taken += nr_taken;
> >
> > -       lru_note_cost(lruvec, file, stat.nr_pageout, nr_scanned - nr_reclaimed);
> > +       lru_note_cost(lruvec, file, nr_pageout, nr_scanned - total_reclaimed);
> >
> >         /*
> >          * If dirty folios are scanned that are not queued for IO, it
> > @@ -2028,7 +2090,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
> >          * the flushers simply cannot keep up with the allocation
> >          * rate. Nudge the flusher threads in case they are asleep.
> >          */
> > -       if (stat.nr_unqueued_dirty == nr_taken) {
> > +       if (nr_unqueued_dirty == nr_taken) {
> >                 wakeup_flusher_threads(WB_REASON_VMSCAN);
> >                 /*
> >                  * For cgroupv1 dirty throttling is achieved by waking up
> > @@ -2043,18 +2105,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
> >                         reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
> >         }
> >
> > -       sc->nr.dirty += stat.nr_dirty;
> > -       sc->nr.congested += stat.nr_congested;
> > -       sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
> > -       sc->nr.writeback += stat.nr_writeback;
> > -       sc->nr.immediate += stat.nr_immediate;
> > -       sc->nr.taken += nr_taken;
> > -       if (file)
> > -               sc->nr.file_taken += nr_taken;
> > -
> > -       trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
> > -                       nr_scanned, nr_reclaimed, &stat, sc->priority, file);
> > -       return nr_reclaimed;
> > +       return total_reclaimed;
> >  }
> >
> >  /*
> > @@ -4585,12 +4636,10 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
> >         int reclaimed;
> >         LIST_HEAD(list);
> >         LIST_HEAD(clean);
> > -       struct folio *folio;
> > -       struct folio *next;
> >         enum vm_event_item item;
> >         struct reclaim_stat stat;
> >         struct lru_gen_mm_walk *walk;
> > -       bool skip_retry = false;
> > +       bool is_retried = false;
> >         struct lru_gen_folio *lrugen = &lruvec->lrugen;
> >         struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> >         struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> > @@ -4616,24 +4665,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
> >                         scanned, reclaimed, &stat, sc->priority,
> >                         type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
> >
> > -       list_for_each_entry_safe_reverse(folio, next, &list, lru) {
> > -               if (!folio_evictable(folio)) {
> > -                       list_del(&folio->lru);
> > -                       folio_putback_lru(folio);
> > -                       continue;
> > -               }
> > -
> > -               /* retry folios that may have missed folio_rotate_reclaimable() */
> > -               if (!skip_retry && !folio_test_active(folio) && !folio_mapped(folio) &&
> > -                   !folio_test_dirty(folio) && !folio_test_writeback(folio)) {
> > -                       list_move(&folio->lru, &clean);
> > -                       continue;
> > -               }
> > -
> > -               /* don't add rejected folios to the oldest generation */
> > -               if (!lru_gen_distance(folio, false))
> > -                       set_mask_bits(&folio->flags, LRU_REFS_FLAGS, BIT(PG_active));
> > -       }
> > +       find_folios_written_back(&list, &clean, is_retried);
> >
> >         spin_lock_irq(&lruvec->lru_lock);
> >
> > @@ -4656,7 +4688,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
> >         list_splice_init(&clean, &list);
> >
> >         if (!list_empty(&list)) {
> > -               skip_retry = true;
> > +               is_retried = true;
> >                 goto retry;
> >         }
> >
> > --
> > 2.34.1
> >
>

Thanks
barry
Chen Ridong Dec. 20, 2024, 7:48 a.m. UTC | #3
On 2024/12/20 11:09, Barry Song wrote:
> On Fri, Dec 20, 2024 at 3:30 PM Barry Song <21cnbao@gmail.com> wrote:
>>
>> On Fri, Dec 20, 2024 at 2:19 PM Chen Ridong <chenridong@huaweicloud.com> wrote:
>>>
>>> From: Chen Ridong <chenridong@huawei.com>
>>>
>>> The page reclaim isolates a batch of folios from the tail of one of the
>>> LRU lists and works on those folios one by one.  For a suitable
>>> swap-backed folio, if the swap device is async, it queues that folio for
>>> writeback.  After the page reclaim finishes an entire batch, it puts back
>>> the folios it queued for writeback to the head of the original LRU list.
>>>
>>> In the meantime, the page writeback flushes the queued folios also by
>>> batches.  Its batching logic is independent from that of the page reclaim.
>>> For each of the folios it writes back, the page writeback calls
>>> folio_rotate_reclaimable() which tries to rotate a folio to the tail.
>>>
>>> folio_rotate_reclaimable() only works for a folio after the page reclaim
>>> has put it back.  If an async swap device is fast enough, the page
>>> writeback can finish with that folio while the page reclaim is still
>>> working on the rest of the batch containing it.  In this case, that folio
>>> will remain at the head and the page reclaim will not retry it before
>>> reaching there.
>>>
>>> The commit 359a5e1416ca ("mm: multi-gen LRU: retry folios written back
>>> while isolated") only fixed the issue for mglru. However, this issue
>>> also exists in the traditional active/inactive LRU. This issue will be
>>> worse if THP is split, which makes the list longer and needs longer time
>>> to finish a batch of folios reclaim.
>>>
>>> This issue should be fixed in the same way for the traditional LRU.
>>> Therefore, the common logic was extracted to the 'find_folios_written_back'
>>> function firstly, which is then reused in the 'shrink_inactive_list'
>>> function. Finally, retry reclaiming those folios that may have missed the
>>> rotation for traditional LRU.
>>>
>>> Link: https://lore.kernel.org/linux-kernel/20241010081802.290893-1-chenridong@huaweicloud.com/
>>> Link: https://lore.kernel.org/linux-kernel/CAGsJ_4zqL8ZHNRZ44o_CC69kE7DBVXvbZfvmQxMGiFqRxqHQdA@mail.gmail.com/
>>> Signed-off-by: Chen Ridong <chenridong@huawei.com>
>>> ---
>>>  mm/vmscan.c | 108 ++++++++++++++++++++++++++++++++++------------------
>>>  1 file changed, 70 insertions(+), 38 deletions(-)
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 39886f435ec5..e67e446540ba 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -283,6 +283,39 @@ static void set_task_reclaim_state(struct task_struct *task,
>>>         task->reclaim_state = rs;
>>>  }
>>>
>>> +/**
>>> + * find_folios_written_back - Find and move the written back folios to a new list.
>>> + * @list: filios list
>>> + * @clean: the written back folios list
>>> + * @is_retried: whether the list has already been retried.
>>> + */
>>> +static inline void find_folios_written_back(struct list_head *list,
>>> +               struct list_head *clean, bool is_retried)
>>> +{
>>> +       struct folio *folio;
>>> +       struct folio *next;
>>> +
>>> +       list_for_each_entry_safe_reverse(folio, next, list, lru) {
>>> +               if (!folio_evictable(folio)) {
>>> +                       list_del(&folio->lru);
>>> +                       folio_putback_lru(folio);
>>> +                       continue;
>>> +               }
>>> +
>>> +               /* retry folios that may have missed folio_rotate_reclaimable() */
>>> +               if (!is_retried && !folio_test_active(folio) && !folio_mapped(folio) &&
>>> +                   !folio_test_dirty(folio) && !folio_test_writeback(folio)) {
>>> +                       list_move(&folio->lru, clean);
>>> +                       continue;
>>> +               }
>>> +
>>> +               /* don't add rejected folios to the oldest generation */
>>> +               if (lru_gen_enabled() && !lru_gen_distance(folio, false))
>>> +                       set_mask_bits(&folio->flags, LRU_REFS_FLAGS, BIT(PG_active));
>>> +       }
>>> +
>>> +}
>>> +
>>>  /*
>>>   * flush_reclaim_state(): add pages reclaimed outside of LRU-based reclaim to
>>>   * scan_control->nr_reclaimed.
>>> @@ -1959,14 +1992,18 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>>>                 enum lru_list lru)
>>>  {
>>>         LIST_HEAD(folio_list);
>>> +       LIST_HEAD(clean_list);
>>>         unsigned long nr_scanned;
>>> -       unsigned int nr_reclaimed = 0;
>>> +       unsigned int nr_reclaimed, total_reclaimed = 0;
>>> +       unsigned int nr_pageout = 0;
>>> +       unsigned int nr_unqueued_dirty = 0;
>>>         unsigned long nr_taken;
>>>         struct reclaim_stat stat;
>>>         bool file = is_file_lru(lru);
>>>         enum vm_event_item item;
>>>         struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>>         bool stalled = false;
>>> +       bool is_retried = false;
> 
> The name is_retried is a bit confusing. It should be is_retry or
> is_retrying since
> we are currently retrying, not that we have already retried.
> 
>>>
>>>         while (unlikely(too_many_isolated(pgdat, file, sc))) {
>>>                 if (stalled)
>>> @@ -2000,22 +2037,47 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>>>         if (nr_taken == 0)
>>>                 return 0;
>>>
>>> +retry:
>>>         nr_reclaimed = shrink_folio_list(&folio_list, pgdat, sc, &stat, false);
>>>
>>> +       sc->nr.dirty += stat.nr_dirty;
>>> +       sc->nr.congested += stat.nr_congested;
>>> +       sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
>>> +       sc->nr.writeback += stat.nr_writeback;
>>> +       sc->nr.immediate += stat.nr_immediate;
>>> +       total_reclaimed += nr_reclaimed;
>>> +       nr_pageout += stat.nr_pageout;
>>> +       nr_unqueued_dirty += stat.nr_unqueued_dirty;
>>> +
>>> +       trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
>>> +                       nr_scanned, nr_reclaimed, &stat, sc->priority, file);
>>
>> This is a bit odd, as nr_scanned during a retry still uses the
>> previous nr_scanned
>> value. However, I find that mglru shows no difference.
>>
>> retry:
>>         reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false);
>>         sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
>>         sc->nr_reclaimed += reclaimed;
>>         trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
>>                         scanned, reclaimed, &stat, sc->priority,
>>                         type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
>>
>> Currently, the active/inactive state aligns with mglru in this trace.
>> It seems that
>> the userspace BPF should recognize that the nr_scanned during a retry doesn't
>> indicate we are isolating new nr_scanned folios. Ideally, the is_retry
>> flag should
>> be passed to the trace, allowing userspace to identify that it's a retry and
>> disregard the nr_scanned value.
>>
>> It might be worth addressing this in a separate patch. Add Bixuan to clarify
>> how userspace depends on this trace and if "retry" will break his userspace
>> BPF for both MGLRU and active/inactive cases.
>>
>> Otherwise, the patch looks good to me.
>>
> 
> By the way, it's completely clear that the trace was added after mglru's retry:
> https://lore.kernel.org/linux-mm/20240105013607.2868-3-cuibixuan@vivo.com/
> 
> Therefore, I don't believe the potential confusion about nr_scanned in the trace
> should prevent Ridong's fix for the missed rotation of written-back folios from
> proceeding.
> 
> If there is an issue with that, we should open a separate thread to address the
> trace.
> 
> Please feel free to add the below in the future version after you fix
> "is_retried".
> 
> Reviewed-by: Barry Song <baohua@kernel.org>
> 

Thank you very much.
I will update.

Best regards
Ridong

>>> +
>>> +       find_folios_written_back(&folio_list, &clean_list, is_retried);
>>> +
>>>         spin_lock_irq(&lruvec->lru_lock);
>>>         move_folios_to_lru(lruvec, &folio_list);
>>>
>>>         __mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(),
>>>                                         stat.nr_demoted);
>>> -       __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
>>>         item = PGSTEAL_KSWAPD + reclaimer_offset();
>>>         if (!cgroup_reclaim(sc))
>>>                 __count_vm_events(item, nr_reclaimed);
>>>         __count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
>>>         __count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
>>> +
>>> +       if (!list_empty(&clean_list)) {
>>> +               list_splice_init(&clean_list, &folio_list);
>>> +               is_retried = true;
>>> +               spin_unlock_irq(&lruvec->lru_lock);
>>> +               goto retry;
>>> +       }
>>> +       __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
>>>         spin_unlock_irq(&lruvec->lru_lock);
>>> +       sc->nr.taken += nr_taken;
>>> +       if (file)
>>> +               sc->nr.file_taken += nr_taken;
>>>
>>> -       lru_note_cost(lruvec, file, stat.nr_pageout, nr_scanned - nr_reclaimed);
>>> +       lru_note_cost(lruvec, file, nr_pageout, nr_scanned - total_reclaimed);
>>>
>>>         /*
>>>          * If dirty folios are scanned that are not queued for IO, it
>>> @@ -2028,7 +2090,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>>>          * the flushers simply cannot keep up with the allocation
>>>          * rate. Nudge the flusher threads in case they are asleep.
>>>          */
>>> -       if (stat.nr_unqueued_dirty == nr_taken) {
>>> +       if (nr_unqueued_dirty == nr_taken) {
>>>                 wakeup_flusher_threads(WB_REASON_VMSCAN);
>>>                 /*
>>>                  * For cgroupv1 dirty throttling is achieved by waking up
>>> @@ -2043,18 +2105,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>>>                         reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
>>>         }
>>>
>>> -       sc->nr.dirty += stat.nr_dirty;
>>> -       sc->nr.congested += stat.nr_congested;
>>> -       sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
>>> -       sc->nr.writeback += stat.nr_writeback;
>>> -       sc->nr.immediate += stat.nr_immediate;
>>> -       sc->nr.taken += nr_taken;
>>> -       if (file)
>>> -               sc->nr.file_taken += nr_taken;
>>> -
>>> -       trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
>>> -                       nr_scanned, nr_reclaimed, &stat, sc->priority, file);
>>> -       return nr_reclaimed;
>>> +       return total_reclaimed;
>>>  }
>>>
>>>  /*
>>> @@ -4585,12 +4636,10 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
>>>         int reclaimed;
>>>         LIST_HEAD(list);
>>>         LIST_HEAD(clean);
>>> -       struct folio *folio;
>>> -       struct folio *next;
>>>         enum vm_event_item item;
>>>         struct reclaim_stat stat;
>>>         struct lru_gen_mm_walk *walk;
>>> -       bool skip_retry = false;
>>> +       bool is_retried = false;
>>>         struct lru_gen_folio *lrugen = &lruvec->lrugen;
>>>         struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>>         struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>> @@ -4616,24 +4665,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
>>>                         scanned, reclaimed, &stat, sc->priority,
>>>                         type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
>>>
>>> -       list_for_each_entry_safe_reverse(folio, next, &list, lru) {
>>> -               if (!folio_evictable(folio)) {
>>> -                       list_del(&folio->lru);
>>> -                       folio_putback_lru(folio);
>>> -                       continue;
>>> -               }
>>> -
>>> -               /* retry folios that may have missed folio_rotate_reclaimable() */
>>> -               if (!skip_retry && !folio_test_active(folio) && !folio_mapped(folio) &&
>>> -                   !folio_test_dirty(folio) && !folio_test_writeback(folio)) {
>>> -                       list_move(&folio->lru, &clean);
>>> -                       continue;
>>> -               }
>>> -
>>> -               /* don't add rejected folios to the oldest generation */
>>> -               if (!lru_gen_distance(folio, false))
>>> -                       set_mask_bits(&folio->flags, LRU_REFS_FLAGS, BIT(PG_active));
>>> -       }
>>> +       find_folios_written_back(&list, &clean, is_retried);
>>>
>>>         spin_lock_irq(&lruvec->lru_lock);
>>>
>>> @@ -4656,7 +4688,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
>>>         list_splice_init(&clean, &list);
>>>
>>>         if (!list_empty(&list)) {
>>> -               skip_retry = true;
>>> +               is_retried = true;
>>>                 goto retry;
>>>         }
>>>
>>> --
>>> 2.34.1
>>>
>>
> 
> Thanks
> barry
kernel test robot Dec. 20, 2024, 5:32 p.m. UTC | #4
Hi Chen,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Chen-Ridong/mm-vmscan-retry-folios-written-back-while-isolated-for-traditional-LRU/20241220-092147
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20241220010931.3603111-1-chenridong%40huaweicloud.com
patch subject: [PATCH -next v5] mm: vmscan: retry folios written back while isolated for traditional LRU
config: i386-buildonly-randconfig-004-20241220 (https://download.01.org/0day-ci/archive/20241221/202412210101.Og2hX0Rs-lkp@intel.com/config)
compiler: clang version 19.1.3 (https://github.com/llvm/llvm-project ab51eccf88f5321e7c60591c5546b254b6afab99)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241221/202412210101.Og2hX0Rs-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202412210101.Og2hX0Rs-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from mm/vmscan.c:30:
   include/linux/mm_inline.h:47:41: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
      47 |         __mod_lruvec_state(lruvec, NR_LRU_BASE + lru, nr_pages);
         |                                    ~~~~~~~~~~~ ^ ~~~
   include/linux/mm_inline.h:49:22: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
      49 |                                 NR_ZONE_LRU_BASE + lru, nr_pages);
         |                                 ~~~~~~~~~~~~~~~~ ^ ~~~
   In file included from mm/vmscan.c:42:
   In file included from include/linux/migrate.h:8:
   include/linux/hugetlb.h:1063:5: warning: no previous prototype for function 'replace_free_hugepage_folios' [-Wmissing-prototypes]
    1063 | int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn)
         |     ^
   include/linux/hugetlb.h:1063:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
    1063 | int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn)
         | ^
         | static 
>> mm/vmscan.c:313:29: error: call to undeclared function 'lru_gen_distance'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
     313 |                 if (lru_gen_enabled() && !lru_gen_distance(folio, false))
         |                                           ^
   mm/vmscan.c:442:51: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
     442 |                         size += zone_page_state(zone, NR_ZONE_LRU_BASE + lru);
         |                                                       ~~~~~~~~~~~~~~~~ ^ ~~~
   mm/vmscan.c:1806:4: warning: arithmetic between different enumeration types ('enum vm_event_item' and 'enum zone_type') [-Wenum-enum-conversion]
    1806 |                         __count_zid_vm_events(PGSCAN_SKIP, zid, nr_skipped[zid]);
         |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/vmstat.h:139:34: note: expanded from macro '__count_zid_vm_events'
     139 |         __count_vm_events(item##_NORMAL - ZONE_NORMAL + zid, delta)
         |                           ~~~~~~~~~~~~~ ^ ~~~~~~~~~~~
   mm/vmscan.c:2330:51: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
    2330 |         inactive = lruvec_page_state(lruvec, NR_LRU_BASE + inactive_lru);
         |                                              ~~~~~~~~~~~ ^ ~~~~~~~~~~~~
   mm/vmscan.c:2331:49: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
    2331 |         active = lruvec_page_state(lruvec, NR_LRU_BASE + active_lru);
         |                                            ~~~~~~~~~~~ ^ ~~~~~~~~~~
   mm/vmscan.c:6271:3: warning: arithmetic between different enumeration types ('enum vm_event_item' and 'enum zone_type') [-Wenum-enum-conversion]
    6271 |                 __count_zid_vm_events(ALLOCSTALL, sc->reclaim_idx, 1);
         |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/vmstat.h:139:34: note: expanded from macro '__count_zid_vm_events'
     139 |         __count_vm_events(item##_NORMAL - ZONE_NORMAL + zid, delta)
         |                           ~~~~~~~~~~~~~ ^ ~~~~~~~~~~~
   8 warnings and 1 error generated.


vim +/lru_gen_distance +313 mm/vmscan.c

   285	
   286	/**
   287	 * find_folios_written_back - Find and move the written back folios to a new list.
   288	 * @list: filios list
   289	 * @clean: the written back folios list
   290	 * @is_retried: whether the list has already been retried.
   291	 */
   292	static inline void find_folios_written_back(struct list_head *list,
   293			struct list_head *clean, bool is_retried)
   294	{
   295		struct folio *folio;
   296		struct folio *next;
   297	
   298		list_for_each_entry_safe_reverse(folio, next, list, lru) {
   299			if (!folio_evictable(folio)) {
   300				list_del(&folio->lru);
   301				folio_putback_lru(folio);
   302				continue;
   303			}
   304	
   305			/* retry folios that may have missed folio_rotate_reclaimable() */
   306			if (!is_retried && !folio_test_active(folio) && !folio_mapped(folio) &&
   307			    !folio_test_dirty(folio) && !folio_test_writeback(folio)) {
   308				list_move(&folio->lru, clean);
   309				continue;
   310			}
   311	
   312			/* don't add rejected folios to the oldest generation */
 > 313			if (lru_gen_enabled() && !lru_gen_distance(folio, false))
   314				set_mask_bits(&folio->flags, LRU_REFS_FLAGS, BIT(PG_active));
   315		}
   316
kernel test robot Dec. 21, 2024, 2:35 a.m. UTC | #5
Hi Chen,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Chen-Ridong/mm-vmscan-retry-folios-written-back-while-isolated-for-traditional-LRU/20241220-092147
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20241220010931.3603111-1-chenridong%40huaweicloud.com
patch subject: [PATCH -next v5] mm: vmscan: retry folios written back while isolated for traditional LRU
config: i386-buildonly-randconfig-003-20241220 (https://download.01.org/0day-ci/archive/20241221/202412211023.UIBZqNFK-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241221/202412211023.UIBZqNFK-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202412211023.UIBZqNFK-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from include/linux/migrate.h:8,
                    from mm/vmscan.c:42:
   include/linux/hugetlb.h:1063:5: warning: no previous prototype for 'replace_free_hugepage_folios' [-Wmissing-prototypes]
    1063 | int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn)
         |     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
   mm/vmscan.c: In function 'find_folios_written_back':
>> mm/vmscan.c:313:43: error: implicit declaration of function 'lru_gen_distance'; did you mean 'node_distance'? [-Werror=implicit-function-declaration]
     313 |                 if (lru_gen_enabled() && !lru_gen_distance(folio, false))
         |                                           ^~~~~~~~~~~~~~~~
         |                                           node_distance
   cc1: some warnings being treated as errors


vim +313 mm/vmscan.c

   285	
   286	/**
   287	 * find_folios_written_back - Find and move the written back folios to a new list.
   288	 * @list: filios list
   289	 * @clean: the written back folios list
   290	 * @is_retried: whether the list has already been retried.
   291	 */
   292	static inline void find_folios_written_back(struct list_head *list,
   293			struct list_head *clean, bool is_retried)
   294	{
   295		struct folio *folio;
   296		struct folio *next;
   297	
   298		list_for_each_entry_safe_reverse(folio, next, list, lru) {
   299			if (!folio_evictable(folio)) {
   300				list_del(&folio->lru);
   301				folio_putback_lru(folio);
   302				continue;
   303			}
   304	
   305			/* retry folios that may have missed folio_rotate_reclaimable() */
   306			if (!is_retried && !folio_test_active(folio) && !folio_mapped(folio) &&
   307			    !folio_test_dirty(folio) && !folio_test_writeback(folio)) {
   308				list_move(&folio->lru, clean);
   309				continue;
   310			}
   311	
   312			/* don't add rejected folios to the oldest generation */
 > 313			if (lru_gen_enabled() && !lru_gen_distance(folio, false))
   314				set_mask_bits(&folio->flags, LRU_REFS_FLAGS, BIT(PG_active));
   315		}
   316
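
Both robot reports point at the same problem: lru_gen_distance() is only
available when CONFIG_LRU_GEN is enabled, while the new
find_folios_written_back() helper is now built unconditionally.  One
possible way to keep the !CONFIG_LRU_GEN build working is a stub; this is
only a sketch, and the exact signature/return type of lru_gen_distance()
is assumed here, not taken from the patch:

#ifndef CONFIG_LRU_GEN
/* never reached: lru_gen_enabled() is false without CONFIG_LRU_GEN */
static inline int lru_gen_distance(struct folio *folio, bool reclaiming)
{
	return -1;
}
#endif

Alternatively, the branch that calls lru_gen_distance() could be guarded
with an #ifdef so the call is compiled out entirely.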

Patch

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 39886f435ec5..e67e446540ba 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -283,6 +283,39 @@  static void set_task_reclaim_state(struct task_struct *task,
 	task->reclaim_state = rs;
 }
 
+/**
+ * find_folios_written_back - Find and move the written back folios to a new list.
+ * @list: folios list
+ * @clean: the written back folios list
+ * @is_retried: whether the list has already been retried.
+ */
+static inline void find_folios_written_back(struct list_head *list,
+		struct list_head *clean, bool is_retried)
+{
+	struct folio *folio;
+	struct folio *next;
+
+	list_for_each_entry_safe_reverse(folio, next, list, lru) {
+		if (!folio_evictable(folio)) {
+			list_del(&folio->lru);
+			folio_putback_lru(folio);
+			continue;
+		}
+
+		/* retry folios that may have missed folio_rotate_reclaimable() */
+		if (!is_retried && !folio_test_active(folio) && !folio_mapped(folio) &&
+		    !folio_test_dirty(folio) && !folio_test_writeback(folio)) {
+			list_move(&folio->lru, clean);
+			continue;
+		}
+
+		/* don't add rejected folios to the oldest generation */
+		if (lru_gen_enabled() && !lru_gen_distance(folio, false))
+			set_mask_bits(&folio->flags, LRU_REFS_FLAGS, BIT(PG_active));
+	}
+
+}
+
 /*
  * flush_reclaim_state(): add pages reclaimed outside of LRU-based reclaim to
  * scan_control->nr_reclaimed.
@@ -1959,14 +1992,18 @@  static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 		enum lru_list lru)
 {
 	LIST_HEAD(folio_list);
+	LIST_HEAD(clean_list);
 	unsigned long nr_scanned;
-	unsigned int nr_reclaimed = 0;
+	unsigned int nr_reclaimed, total_reclaimed = 0;
+	unsigned int nr_pageout = 0;
+	unsigned int nr_unqueued_dirty = 0;
 	unsigned long nr_taken;
 	struct reclaim_stat stat;
 	bool file = is_file_lru(lru);
 	enum vm_event_item item;
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	bool stalled = false;
+	bool is_retried = false;
 
 	while (unlikely(too_many_isolated(pgdat, file, sc))) {
 		if (stalled)
@@ -2000,22 +2037,47 @@  static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 	if (nr_taken == 0)
 		return 0;
 
+retry:
 	nr_reclaimed = shrink_folio_list(&folio_list, pgdat, sc, &stat, false);
 
+	sc->nr.dirty += stat.nr_dirty;
+	sc->nr.congested += stat.nr_congested;
+	sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
+	sc->nr.writeback += stat.nr_writeback;
+	sc->nr.immediate += stat.nr_immediate;
+	total_reclaimed += nr_reclaimed;
+	nr_pageout += stat.nr_pageout;
+	nr_unqueued_dirty += stat.nr_unqueued_dirty;
+
+	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
+			nr_scanned, nr_reclaimed, &stat, sc->priority, file);
+
+	find_folios_written_back(&folio_list, &clean_list, is_retried);
+
 	spin_lock_irq(&lruvec->lru_lock);
 	move_folios_to_lru(lruvec, &folio_list);
 
 	__mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(),
 					stat.nr_demoted);
-	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
 	item = PGSTEAL_KSWAPD + reclaimer_offset();
 	if (!cgroup_reclaim(sc))
 		__count_vm_events(item, nr_reclaimed);
 	__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
 	__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
+
+	if (!list_empty(&clean_list)) {
+		list_splice_init(&clean_list, &folio_list);
+		is_retried = true;
+		spin_unlock_irq(&lruvec->lru_lock);
+		goto retry;
+	}
+	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
 	spin_unlock_irq(&lruvec->lru_lock);
+	sc->nr.taken += nr_taken;
+	if (file)
+		sc->nr.file_taken += nr_taken;
 
-	lru_note_cost(lruvec, file, stat.nr_pageout, nr_scanned - nr_reclaimed);
+	lru_note_cost(lruvec, file, nr_pageout, nr_scanned - total_reclaimed);
 
 	/*
 	 * If dirty folios are scanned that are not queued for IO, it
@@ -2028,7 +2090,7 @@  static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 	 * the flushers simply cannot keep up with the allocation
 	 * rate. Nudge the flusher threads in case they are asleep.
 	 */
-	if (stat.nr_unqueued_dirty == nr_taken) {
+	if (nr_unqueued_dirty == nr_taken) {
 		wakeup_flusher_threads(WB_REASON_VMSCAN);
 		/*
 		 * For cgroupv1 dirty throttling is achieved by waking up
@@ -2043,18 +2105,7 @@  static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
 	}
 
-	sc->nr.dirty += stat.nr_dirty;
-	sc->nr.congested += stat.nr_congested;
-	sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
-	sc->nr.writeback += stat.nr_writeback;
-	sc->nr.immediate += stat.nr_immediate;
-	sc->nr.taken += nr_taken;
-	if (file)
-		sc->nr.file_taken += nr_taken;
-
-	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
-			nr_scanned, nr_reclaimed, &stat, sc->priority, file);
-	return nr_reclaimed;
+	return total_reclaimed;
 }
 
 /*
@@ -4585,12 +4636,10 @@  static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
 	int reclaimed;
 	LIST_HEAD(list);
 	LIST_HEAD(clean);
-	struct folio *folio;
-	struct folio *next;
 	enum vm_event_item item;
 	struct reclaim_stat stat;
 	struct lru_gen_mm_walk *walk;
-	bool skip_retry = false;
+	bool is_retried = false;
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
@@ -4616,24 +4665,7 @@  static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
 			scanned, reclaimed, &stat, sc->priority,
 			type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
 
-	list_for_each_entry_safe_reverse(folio, next, &list, lru) {
-		if (!folio_evictable(folio)) {
-			list_del(&folio->lru);
-			folio_putback_lru(folio);
-			continue;
-		}
-
-		/* retry folios that may have missed folio_rotate_reclaimable() */
-		if (!skip_retry && !folio_test_active(folio) && !folio_mapped(folio) &&
-		    !folio_test_dirty(folio) && !folio_test_writeback(folio)) {
-			list_move(&folio->lru, &clean);
-			continue;
-		}
-
-		/* don't add rejected folios to the oldest generation */
-		if (!lru_gen_distance(folio, false))
-			set_mask_bits(&folio->flags, LRU_REFS_FLAGS, BIT(PG_active));
-	}
+	find_folios_written_back(&list, &clean, is_retried);
 
 	spin_lock_irq(&lruvec->lru_lock);
 
@@ -4656,7 +4688,7 @@  static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
 	list_splice_init(&clean, &list);
 
 	if (!list_empty(&list)) {
-		skip_retry = true;
+		is_retried = true;
 		goto retry;
 	}