
[v17,14/21] mm/compaction: do page isolation first in compaction

Message ID 1595681998-19193-15-git-send-email-alex.shi@linux.alibaba.com (mailing list archive)
State New, archived
Series: per memcg lru lock

Commit Message

Alex Shi July 25, 2020, 12:59 p.m. UTC
Currently, compaction takes the lru_lock and then does page isolation,
which works fine with pgdat->lru_lock, since any page isolation would
compete for that lock. If we want to change to a memcg lru_lock, we
have to isolate the page before taking the lru_lock; isolation then
blocks the page's memcg from changing, since a memcg change relies on
page isolation too. Then we could safely use a per-memcg lru_lock later.

The new page isolation uses the previously introduced TestClearPageLRU()
plus pgdat lru locking, which will be changed to the memcg lru lock later.

Hugh Dickins <hughd@google.com> fixed the following bugs in an early
version of this patch:

Fix lots of crashes under compaction load: isolate_migratepages_block()
must clean up appropriately when rejecting a page, setting PageLRU again
if it had been cleared; and a put_page() after get_page_unless_zero()
cannot safely be done while holding locked_lruvec - it may turn out to
be the final put_page(), which will take an lruvec lock when PageLRU.
And move __isolate_lru_page_prepare back after get_page_unless_zero to
make trylock_page() safe:
trylock_page() is not safe to use at this time: its setting PG_locked
can race with the page being freed or allocated ("Bad page"), and can
also erase flags being set by one of those "sole owners" of a freshly
allocated page who use non-atomic __SetPageFlag().

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
---
 include/linux/swap.h |  2 +-
 mm/compaction.c      | 42 +++++++++++++++++++++++++++++++++---------
 mm/vmscan.c          | 46 ++++++++++++++++++++++++++--------------------
 3 files changed, 60 insertions(+), 30 deletions(-)

Comments

Alexander H Duyck Aug. 4, 2020, 9:35 p.m. UTC | #1
On Sat, Jul 25, 2020 at 6:00 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
> Currently, compaction takes the lru_lock and then does page isolation,
> which works fine with pgdat->lru_lock, since any page isolation would
> compete for that lock. If we want to change to a memcg lru_lock, we
> have to isolate the page before taking the lru_lock; isolation then
> blocks the page's memcg from changing, since a memcg change relies on
> page isolation too. Then we could safely use a per-memcg lru_lock later.
>
> The new page isolation uses the previously introduced TestClearPageLRU()
> plus pgdat lru locking, which will be changed to the memcg lru lock later.
>
> Hugh Dickins <hughd@google.com> fixed the following bugs in an early
> version of this patch:
>
> Fix lots of crashes under compaction load: isolate_migratepages_block()
> must clean up appropriately when rejecting a page, setting PageLRU again
> if it had been cleared; and a put_page() after get_page_unless_zero()
> cannot safely be done while holding locked_lruvec - it may turn out to
> be the final put_page(), which will take an lruvec lock when PageLRU.
> And move __isolate_lru_page_prepare back after get_page_unless_zero to
> make trylock_page() safe:
> trylock_page() is not safe to use at this time: its setting PG_locked
> can race with the page being freed or allocated ("Bad page"), and can
> also erase flags being set by one of those "sole owners" of a freshly
> allocated page who use non-atomic __SetPageFlag().
>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> ---
>  include/linux/swap.h |  2 +-
>  mm/compaction.c      | 42 +++++++++++++++++++++++++++++++++---------
>  mm/vmscan.c          | 46 ++++++++++++++++++++++++++--------------------
>  3 files changed, 60 insertions(+), 30 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 2c29399b29a0..6d23d3beeff7 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -358,7 +358,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
>  extern unsigned long zone_reclaimable_pages(struct zone *zone);
>  extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>                                         gfp_t gfp_mask, nodemask_t *mask);
> -extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
> +extern int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode);
>  extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>                                                   unsigned long nr_pages,
>                                                   gfp_t gfp_mask,
> diff --git a/mm/compaction.c b/mm/compaction.c
> index f14780fc296a..2da2933fe56b 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -869,6 +869,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                 if (!valid_page && IS_ALIGNED(low_pfn, pageblock_nr_pages)) {
>                         if (!cc->ignore_skip_hint && get_pageblock_skip(page)) {
>                                 low_pfn = end_pfn;
> +                               page = NULL;
>                                 goto isolate_abort;
>                         }
>                         valid_page = page;
> @@ -950,6 +951,21 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                 if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
>                         goto isolate_fail;
>
> +               /*
> +                * Be careful not to clear PageLRU until after we're
> +                * sure the page is not being freed elsewhere -- the
> +                * page release code relies on it.
> +                */
> +               if (unlikely(!get_page_unless_zero(page)))
> +                       goto isolate_fail;
> +
> +               if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
> +                       goto isolate_fail_put;
> +
> +               /* Try isolate the page */
> +               if (!TestClearPageLRU(page))
> +                       goto isolate_fail_put;
> +
>                 /* If we already hold the lock, we can skip some rechecking */
>                 if (!locked) {
>                         locked = compact_lock_irqsave(&pgdat->lru_lock,

So this flow doesn't match what we have below in
isolate_lru_pages(). I went digging through the history and realized I
brought this up before and you referenced the following patch from
Hugh:
https://lore.kernel.org/lkml/alpine.LSU.2.11.2006111529010.10801@eggly.anvils/

As such I am assuming this flow is needed because we aren't holding an
LRU lock, and the flow in mm/vmscan.c works because that is being
called while holding an LRU lock. I am wondering if we are
overcomplicating things by keeping the LRU check in
__isolate_lru_page_prepare(). If we were to pull it out, then you could
just perform the get_page_unless_zero and TestClearPageLRU checks before
you call the function, and consolidate that code into a single function.
For example, you could combine them into:
static inline bool get_lru_page_unless_zero(struct page *page)
{
	/*
	 * Be careful not to clear PageLRU until after we're
	 * sure the page is not being freed elsewhere -- the
	 * page release code relies on it.
	 */
	if (unlikely(!get_page_unless_zero(page)))
		return false;
	if (TestClearPageLRU(page))
		return true;
	put_page(page);
	return false;
}

Then the logic becomes that you have to either call
get_lru_page_unless_zero before calling __isolate_lru_page_prepare or
you have to be holding the LRU lock.
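
As a rough, untested sketch (names follow the patch above), the isolation
step in isolate_migratepages_block() might then become something like the
following; note that a prepare failure now has to restore the LRU bit,
since the helper has already cleared it:

		/* combined reference grab + LRU claim proposed above */
		if (!get_lru_page_unless_zero(page))
			goto isolate_fail;

		/* only mode checks remain; the PageLRU test is gone */
		if (__isolate_lru_page_prepare(page, isolate_mode) != 0) {
			/* we already own the LRU bit, so hand it back */
			SetPageLRU(page);
			goto isolate_fail_put;
		}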

> @@ -962,10 +978,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                                         goto isolate_abort;
>                         }
>
> -                       /* Recheck PageLRU and PageCompound under lock */
> -                       if (!PageLRU(page))
> -                               goto isolate_fail;
> -
>                         /*
>                          * Page become compound since the non-locked check,
>                          * and it's on LRU. It can only be a THP so the order
> @@ -973,16 +985,13 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                          */
>                         if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
>                                 low_pfn += compound_nr(page) - 1;
> -                               goto isolate_fail;
> +                               SetPageLRU(page);
> +                               goto isolate_fail_put;
>                         }
>                 }
>
>                 lruvec = mem_cgroup_page_lruvec(page, pgdat);
>
> -               /* Try isolate the page */
> -               if (__isolate_lru_page(page, isolate_mode) != 0)
> -                       goto isolate_fail;
> -
>                 /* The whole page is taken off the LRU; skip the tail pages. */
>                 if (PageCompound(page))
>                         low_pfn += compound_nr(page) - 1;
> @@ -1011,6 +1020,15 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                 }
>
>                 continue;
> +
> +isolate_fail_put:
> +               /* Avoid potential deadlock in freeing page under lru_lock */
> +               if (locked) {
> +                       spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +                       locked = false;
> +               }
> +               put_page(page);
> +
>  isolate_fail:
>                 if (!skip_on_failure)
>                         continue;
> @@ -1047,9 +1065,15 @@ static bool too_many_isolated(pg_data_t *pgdat)
>         if (unlikely(low_pfn > end_pfn))
>                 low_pfn = end_pfn;
>
> +       page = NULL;
> +
>  isolate_abort:
>         if (locked)
>                 spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +       if (page) {
> +               SetPageLRU(page);
> +               put_page(page);
> +       }
>
>         /*
>          * Updated the cached scanner pfn once the pageblock has been scanned
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4183ae6b54b5..f77748adc340 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1544,20 +1544,20 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
>   *
>   * returns 0 on success, -ve errno on failure.
>   */
> -int __isolate_lru_page(struct page *page, isolate_mode_t mode)
> +int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode)
>  {
>         int ret = -EINVAL;
>
> -       /* Only take pages on the LRU. */
> -       if (!PageLRU(page))
> -               return ret;
> -
>         /* Compaction should not handle unevictable pages but CMA can do so */
>         if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE))
>                 return ret;
>
>         ret = -EBUSY;
>
> +       /* Only take pages on the LRU. */
> +       if (!PageLRU(page))
> +               return ret;
> +
>         /*
>          * To minimise LRU disruption, the caller can indicate that it only
>          * wants to isolate pages it will be able to operate on without

So the question I would have is whether we really need to be checking
PageLRU here. I wonder if this isn't another spot where we would be
better served by just assuming that PageLRU has been checked while
holding the lock, or tested and cleared while holding a page
reference. The original patch from Hugh referenced above mentions a
desire to do away with __isolate_lru_page_prepare entirely, so I
wonder if it wouldn't be good to be proactive and pull out the bits we
think we might need versus the ones we don't.
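
As a minimal sketch, assuming the PageLRU test moves out to the callers
and the existing ISOLATE_ASYNC_MIGRATE dirty/writeback handling stays
as-is, the prepare step would reduce to mode-only checks:

int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode)
{
	/* Compaction should not handle unevictable pages but CMA can do so */
	if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE))
		return -EINVAL;

	/* ... existing ISOLATE_ASYNC_MIGRATE dirty/writeback checks ... */

	if ((mode & ISOLATE_UNMAPPED) && page_mapped(page))
		return -EBUSY;

	return 0;
}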

> @@ -1598,20 +1598,9 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
>         if ((mode & ISOLATE_UNMAPPED) && page_mapped(page))
>                 return ret;
>
> -       if (likely(get_page_unless_zero(page))) {
> -               /*
> -                * Be careful not to clear PageLRU until after we're
> -                * sure the page is not being freed elsewhere -- the
> -                * page release code relies on it.
> -                */
> -               ClearPageLRU(page);
> -               ret = 0;
> -       }
> -
> -       return ret;
> +       return 0;
>  }
>
> -
>  /*
>   * Update LRU sizes after isolating pages. The LRU size updates must
>   * be complete before mem_cgroup_update_lru_size due to a sanity check.
> @@ -1691,17 +1680,34 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>                  * only when the page is being freed somewhere else.
>                  */
>                 scan += nr_pages;
> -               switch (__isolate_lru_page(page, mode)) {
> +               switch (__isolate_lru_page_prepare(page, mode)) {

So after looking through the code I realized that "mode" here will
always be either 0 or ISOLATE_UNMAPPED. I assume this is why we aren't
worried about the trylock_page call messing things up. With that said
it looks like the function just breaks down to three tests, first for
PageUnevictable(), then PageLRU(), and then possibly page_mapped(). As
such I believe dropping the PageLRU check from the function as I
suggested above should be safe: the test is at risk of racing anyway,
since the bit could be cleared out from under us, and the bit isn't
really protecting anything here given that we are holding the LRU lock.

>                 case 0:
> +                       /*
> +                        * Be careful not to clear PageLRU until after we're
> +                        * sure the page is not being freed elsewhere -- the
> +                        * page release code relies on it.
> +                        */
> +                       if (unlikely(!get_page_unless_zero(page)))
> +                               goto busy;
> +
> +                       if (!TestClearPageLRU(page)) {
> +                               /*
> +                                * This page may in other isolation path,
> +                                * but we still hold lru_lock.
> +                                */
> +                               put_page(page);
> +                               goto busy;
> +                       }
> +

This piece could be consolidated via the single function I called out above.
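
Roughly, as an untested sketch using the combined helper above, the
case 0 branch would collapse to:

		case 0:
			if (!get_lru_page_unless_zero(page))
				goto busy;
			nr_taken += nr_pages;
			nr_zone_taken[page_zonenum(page)] += nr_pages;
			list_move(&page->lru, dst);
			break;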

>                         nr_taken += nr_pages;
>                         nr_zone_taken[page_zonenum(page)] += nr_pages;
>                         list_move(&page->lru, dst);
>                         break;
> -
> +busy:
>                 case -EBUSY:
>                         /* else it is being freed elsewhere */
>                         list_move(&page->lru, src);
> -                       continue;
> +                       break;
>                 default:
>                         BUG();
> --
> 1.8.3.1
>
Alexander H Duyck Aug. 6, 2020, 6:38 p.m. UTC | #2
On Sat, Jul 25, 2020 at 6:00 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
> [...]
>
> @@ -1047,9 +1065,15 @@ static bool too_many_isolated(pg_data_t *pgdat)
>         if (unlikely(low_pfn > end_pfn))
>                 low_pfn = end_pfn;
>
> +       page = NULL;
> +
>  isolate_abort:
>         if (locked)
>                 spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> +       if (page) {
> +               SetPageLRU(page);
> +               put_page(page);
> +       }
>
>         /*
>          * Updated the cached scanner pfn once the pageblock has been scanned

We should probably be calling SetPageLRU before we release the lru
lock instead of after. It might make sense to just call it before we
get here, similar to how you did in the isolate_fail_put case a few
lines later. Otherwise this seems to violate the rules you had set up
earlier where we were only going to be setting the LRU bit while
holding the LRU lock.
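
Something along these lines, as an untested sketch of that reordering:

isolate_abort:
	/* restore the LRU bit while the lock, if taken, is still held */
	if (page)
		SetPageLRU(page);
	if (locked)
		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
	if (page)
		put_page(page);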
Alex Shi Aug. 7, 2020, 3:24 a.m. UTC | #3
On 2020/8/7 2:38 AM, Alexander Duyck wrote:
>> +
>>  isolate_abort:
>>         if (locked)
>>                 spin_unlock_irqrestore(&pgdat->lru_lock, flags);
>> +       if (page) {
>> +               SetPageLRU(page);
>> +               put_page(page);
>> +       }
>>
>>         /*
>>          * Updated the cached scanner pfn once the pageblock has been scanned
> We should probably be calling SetPageLRU before we release the lru
> lock instead of after. It might make sense to just call it before we
> get here, similar to how you did in the isolate_fail_put case a few
> lines later. Otherwise this seems to violate the rules you had set up
> earlier where we were only going to be setting the LRU bit while
> holding the LRU lock.

Hi Alex,

Setting the bit outside the lock should be fine here; I never said we must
set the bit while holding the lock.
And this page is held via get_page_unless_zero(), so there is no worry
about it being released.

Thanks
Alex
Alexander H Duyck Aug. 7, 2020, 2:51 p.m. UTC | #4
On Thu, Aug 6, 2020 at 8:25 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
>
>
> On 2020/8/7 2:38 AM, Alexander Duyck wrote:
> >> +
> >>  isolate_abort:
> >>         if (locked)
> >>                 spin_unlock_irqrestore(&pgdat->lru_lock, flags);
> >> +       if (page) {
> >> +               SetPageLRU(page);
> >> +               put_page(page);
> >> +       }
> >>
> >>         /*
> >>          * Updated the cached scanner pfn once the pageblock has been scanned
> > We should probably be calling SetPageLRU before we release the lru
> > lock instead of after. It might make sense to just call it before we
> > get here, similar to how you did in the isolate_fail_put case a few
> > lines later. Otherwise this seems to violate the rules you had set up
> > earlier where we were only going to be setting the LRU bit while
> > holding the LRU lock.
>
> Hi Alex,
>
> Setting the bit outside the lock should be fine here; I never said we must
> set the bit while holding the lock.
> And this page is held via get_page_unless_zero(), so there is no worry
> about it being released.
>
> Thanks
> Alex

I wonder if this entire section shouldn't be restructured. This is the
only spot I can see where we are resetting the LRU flag instead of
pulling the page from the LRU list with the lock held. Looking over
the code it seems like something like that should be possible. I am
not sure the LRU lock is really protecting us in either the
PageCompound check nor the skip bits. It seems like holding a
reference on the page should prevent it from switching between
compound or not, and the skip bits are per pageblock with the LRU bits
being per node/memcg which I would think implies that we could have
multiple LRU locks that could apply to a single skip bit.
Alex Shi Aug. 10, 2020, 1:10 p.m. UTC | #5
On 2020/8/7 10:51 PM, Alexander Duyck wrote:
> I wonder if this entire section shouldn't be restructured. This is the
> only spot I can see where we are resetting the LRU flag instead of
> pulling the page from the LRU list with the lock held. Looking over
> the code it seems like something like that should be possible. I am
> not sure the LRU lock is really protecting us in either the
> PageCompound check nor the skip bits. It seems like holding a
> reference on the page should prevent it from switching between
> compound or not, and the skip bits are per pageblock with the LRU bits
> being per node/memcg which I would think implies that we could have
> multiple LRU locks that could apply to a single skip bit.

Hi Alexander,

I haven't found a problem yet with the compound or skip bit usage. Would
you clarify the issue you are concerned about?

Thanks!
Alexander H Duyck Aug. 10, 2020, 2:41 p.m. UTC | #6
On Mon, Aug 10, 2020 at 6:10 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
>
>
> On 2020/8/7 10:51 PM, Alexander Duyck wrote:
> > I wonder if this entire section shouldn't be restructured. This is the
> > only spot I can see where we are resetting the LRU flag instead of
> > pulling the page from the LRU list with the lock held. Looking over
> > the code it seems like something like that should be possible. I am
> > not sure the LRU lock is really protecting us in either the
> > PageCompound check nor the skip bits. It seems like holding a
> > reference on the page should prevent it from switching between
> > compound or not, and the skip bits are per pageblock with the LRU bits
> > being per node/memcg which I would think implies that we could have
> > multiple LRU locks that could apply to a single skip bit.
>
> Hi Alexander,
>
> I haven't found a problem yet with the compound or skip bit usage. Would
> you clarify the issue you are concerned about?
>
> Thanks!

The point I was getting at is that the LRU lock is being used to
protect these and with your changes I don't think that makes sense
anymore.

The skip bits are per-pageblock bits. With your change the LRU lock is
now per memcg first and then per node. As such I do not believe it
really provides any sort of exclusive access to the skip bits. I still
have to look into this more, but it seems like you need a lock per
either section or zone that can be used to protect those bits and deal
with this sooner rather than waiting until you have found an LRU page.
The one part that is confusing though is that the definition of the
skip bits seems to call out that they are a hint since they are not
protected by a lock, but that is exactly what has been happening here.

The point I was getting at with the PageCompound check is that instead
of needing the LRU lock you should be able to look at PageCompound as
soon as you call get_page_unless_zero() and preempt the need to set
the LRU bit again. Instead of trying to rely on the LRU lock to
guarantee that the page hasn't been merged you could just rely on the
fact that you are holding a reference to it so it isn't going to
switch between being compound or order 0 since it cannot be freed. It
spoils the idea I originally had of combining the logic for
get_page_unless_zero and TestClearPageLRU into a single function, but
the advantage is you aren't clearing the LRU flag unless you are
actually going to pull the page from the LRU list.

My main worry is that this is the one spot where we appear to be
clearing the LRU bit without ever actually pulling the page off of the
LRU list, and I am thinking we would be better served by addressing
the skip and PageCompound checks earlier rather than adding code to
set the bit again if either of those cases are encountered. This way
we don't pseudo-pin pages in the LRU if they are compound or supposed
to be skipped.

Thanks.

- Alex
Alex Shi Aug. 11, 2020, 8:22 a.m. UTC | #7
On 2020/8/10 10:41 PM, Alexander Duyck wrote:
> On Mon, Aug 10, 2020 at 6:10 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>> [...]
>>
>> Hi Alexander,
>>
>> I haven't found a problem yet with the compound or skip bit usage. Would
>> you clarify the issue you are concerned about?
>>
>> Thanks!
> 
> The point I was getting at is that the LRU lock is being used to
> protect these and with your changes I don't think that makes sense
> anymore.
> 
> The skip bits are per-pageblock bits. With your change the LRU lock is
> now per memcg first and then per node. As such I do not believe it
> really provides any sort of exclusive access to the skip bits. I still
> have to look into this more, but it seems like you need a lock per
> either section or zone that can be used to protect those bits and deal
> with this sooner rather than waiting until you have found an LRU page.
> The one part that is confusing though is that the definition of the
> skip bits seems to call out that they are a hint since they are not
> protected by a lock, but that is exactly what has been happening here.
> 

The skip bits are safe here: even if they race with another skip action,
the block will still be skipped. The skip action just tries to avoid doing
too much compaction; it is not an exclusive operation that needs to avoid
races.


> The point I was getting at with the PageCompound check is that instead
> of needing the LRU lock you should be able to look at PageCompound as
> soon as you call get_page_unless_zero() and preempt the need to set
> the LRU bit again. Instead of trying to rely on the LRU lock to
> guarantee that the page hasn't been merged you could just rely on the
> fact that you are holding a reference to it so it isn't going to
> switch between being compound or order 0 since it cannot be freed. It
> spoils the idea I originally had of combining the logic for
> get_page_unless_zero and TestClearPageLRU into a single function, but
> the advantage is you aren't clearing the LRU flag unless you are
> actually going to pull the page from the LRU list.

Sorry, I still cannot follow you here. The compound code part is unchanged
and follows the original logic. So would you like to post new code so we
can see if it works?

Thanks
Alex

> 
> My main worry is that this is the one spot where we appear to be
> clearing the LRU bit without ever actually pulling the page off of the
> LRU list, and I am thinking we would be better served by addressing
> the skip and PageCompound checks earlier rather than adding code to
> set the bit again if either of those cases are encountered. This way
> we don't pseudo-pin pages in the LRU if they are compound or supposed
> to be skipped.
> 
> Thanks.
> 
> - Alex
>
Alexander H Duyck Aug. 11, 2020, 2:47 p.m. UTC | #8
On Tue, Aug 11, 2020 at 1:23 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
> [...]
> >
> > The point I was getting at is that the LRU lock is being used to
> > protect these and with your changes I don't think that makes sense
> > anymore.
> >
> > The skip bits are per-pageblock bits. With your change the LRU lock is
> > now per memcg first and then per node. As such I do not believe it
> > really provides any sort of exclusive access to the skip bits. I still
> > have to look into this more, but it seems like you need a lock per
> > either section or zone that can be used to protect those bits and deal
> > with this sooner rather than waiting until you have found an LRU page.
> > The one part that is confusing though is that the definition of the
> > skip bits seems to call out that they are a hint since they are not
> > protected by a lock, but that is exactly what has been happening here.
> >
>
> The skip bits are safe here: even if they race with another skip action,
> the block will still be skipped. The skip action just tries to avoid doing
> too much compaction; it is not an exclusive operation that needs to avoid
> races.

That would be the case if they didn't have the impact that they
currently do on the compaction process. What I am getting at is that a
race was introduced when you placed this test between the clearing of
the LRU flag and the actual pulling of the page from the LRU list. So
if you tested the skip bits before clearing the LRU flag then I would
be okay with the code, however because it is triggering an abort after
the LRU flag is cleared then you are creating a situation where
multiple processes will be stomping all over each other as you can
have each thread essentially take a page via the LRU flag, but only
one thread will process a page and it could skip over all other pages
that preemptively had their LRU flag cleared.

If you take a look at test_and_set_skip(), the function only acts on
the pageblock aligned PFN for a given range. With the changes you have
in place now that would mean that only one thread would ever actually
call this function anyway since the first PFN would take the LRU flag
so no other thread could follow through and test or set the bit as
well. The expectation before was that all threads would encounter this
test and either proceed after setting the bit for the first PFN or
abort after testing the first PFN. With your changes only the first
thread actually runs this test and then it and the others will likely
encounter multiple failures as they are all clearing LRU bits
simultaneously and tripping each other up. That is why the skip bit
must have a test and set done before you even get to the point of
clearing the LRU flag.

> > The point I was getting at with the PageCompound check is that instead
> > of needing the LRU lock you should be able to look at PageCompound as
> > soon as you call get_page_unless_zero() and preempt the need to set
> > the LRU bit again. Instead of trying to rely on the LRU lock to
> > guarantee that the page hasn't been merged you could just rely on the
> > fact that you are holding a reference to it so it isn't going to
> > switch between being compound or order 0 since it cannot be freed. It
> > spoils the idea I originally had of combining the logic for
> > get_page_unless_zero and TestClearPageLRU into a single function, but
> > the advantage is you aren't clearing the LRU flag unless you are
> > actually going to pull the page from the LRU list.
>
> Sorry, I still cannot follow you here. The compound code part is unchanged
> and follows the original logic. So would you like to post new code so we
> can see if it works?

No there are significant changes as you reordered all of the
operations. Prior to your change the LRU bit was checked, but not
cleared before testing for PageCompound. Now you are clearing it
before you are testing if it is a compound page. So if compaction is
running we will be seeing the pages in the LRU stay put, but the
compound bit flickering off and on if the compound page is encountered
with the wrong or NULL lruvec. What I was suggesting is that the
PageCompound test probably doesn't need to be concerned with the lock
after your changes. You could test it after you call
get_page_unless_zero() and before you call
__isolate_lru_page_prepare(). Instead of relying on the LRU lock to
protect us from the page switching between compound and not we would
be relying on the fact that we are holding a reference to the page so
it should not be freed and transition between compound or not.
Alex Shi Aug. 12, 2020, 11:43 a.m. UTC | #9
On 2020/8/11 10:47 PM, Alexander Duyck wrote:
> On Tue, Aug 11, 2020 at 1:23 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>>
>> [...]
>>
>> The skip bits are safe here: even if they race with another skip action,
>> the block will still be skipped. The skip action just tries to avoid doing
>> too much compaction; it is not an exclusive operation that needs to avoid
>> races.
> 
> That would be the case if they didn't have the impact that they
> currently do on the compaction process. What I am getting at is that a
> race was introduced when you placed this test between the clearing of
> the LRU flag and the actual pulling of the page from the LRU list. So
> if you tested the skip bits before clearing the LRU flag then I would
> be okay with the code, however because it is triggering an abort after

Hi Alexander,

Thanks a lot for the comments and suggestions!

I have tried your suggestion:

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
---
 mm/compaction.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index b99c96c4862d..6c881dee8c9a 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -988,6 +988,13 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
 			goto isolate_fail_put;

+		/* Try get exclusive access under lock */
+		if (!skip_updated) {
+			skip_updated = true;
+			if (test_and_set_skip(cc, page, low_pfn))
+				goto isolate_fail_put;
+		}
+
 		/* Try isolate the page */
 		if (!TestClearPageLRU(page))
 			goto isolate_fail_put;
@@ -1006,13 +1013,6 @@ static bool too_many_isolated(pg_data_t *pgdat)

 			lruvec_memcg_debug(lruvec, page);

-			/* Try get exclusive access under lock */
-			if (!skip_updated) {
-				skip_updated = true;
-				if (test_and_set_skip(cc, page, low_pfn))
-					goto isolate_abort;
-			}
-
 			/*
 			 * Page become compound since the non-locked check,
 			 * and it's on LRU. It can only be a THP so the order
--

Performance of the case-lru-file-mmap-read test in vm-scalability dropped
a bit; not helpful.

> the LRU flag is cleared then you are creating a situation where
> multiple processes will be stomping all over each other as you can
> have each thread essentially take a page via the LRU flag, but only
> one thread will process a page and it could skip over all other pages
> that preemptively had their LRU flag cleared.

It increases contention a bit here, but the lru_lock does reduce some of
it, and the skip bit lets threads stop each other via an array check
(bitmap). So compared to the whole-node lru_lock, the net profit is clear
in patch 17.

> 
> If you take a look at test_and_set_skip(), the function only acts on
> the pageblock aligned PFN for a given range. With the changes you have
> in place now that would mean that only one thread would ever actually
> call this function anyway since the first PFN would take the LRU flag
> so no other thread could follow through and test or set the bit as

Isn't it good that only one process can do test_and_set_skip()? Isn't
that what 'skip' is meant to do?

> well. The expectation before was that all threads would encounter this
> test and either proceed after setting the bit for the first PFN or
> abort after testing the first PFN. With your changes only the first
> thread actually runs this test and then it and the others will likely
> encounter multiple failures as they are all clearing LRU bits
> simultaneously and tripping each other up. That is why the skip bit
> must have a test and set done before you even get to the point of
> clearing the LRU flag.

It makes things worse on my machine; would you like to give it a try yourself?

> 
>>> The point I was getting at with the PageCompound check is that instead
>>> of needing the LRU lock you should be able to look at PageCompound as
>>> soon as you call get_page_unless_zero() and preempt the need to set
>>> the LRU bit again. Instead of trying to rely on the LRU lock to
>>> guarantee that the page hasn't been merged you could just rely on the
>>> fact that you are holding a reference to it so it isn't going to
>>> switch between being compound or order 0 since it cannot be freed. It
>>> spoils the idea I originally had of combining the logic for
>>> get_page_unless_zero and TestClearPageLRU into a single function, but
>>> the advantage is you aren't clearing the LRU flag unless you are
>>> actually going to pull the page from the LRU list.
>>
>> Sorry, I still cannot follow you here. The compound code part is unchanged
>> and follows the original logic. So would you like to post new code so we
>> can see if it works?
> 
> No there are significant changes as you reordered all of the
> operations. Prior to your change the LRU bit was checked, but not
> cleared before testing for PageCompound. Now you are clearing it
> before you are testing if it is a compound page. So if compaction is
> running we will be seeing the pages in the LRU stay put, but the
> compound bit flickering off and on if the compound page is encountered
> with the wrong or NULL lruvec. What I was suggesting is that the

The lruvec could be wrong or NULL here; that is the cornerstone of the
whole patchset.

> PageCompound test probably doesn't need to be concerned with the lock
> after your changes. You could test it after you call
> get_page_unless_zero() and before you call
> __isolate_lru_page_prepare(). Instead of relying on the LRU lock to
> protect us from the page switching between compound and not we would
> be relying on the fact that we are holding a reference to the page so
> it should not be freed and transition between compound or not.
> 

I have tried the patch as you suggested; it gives no clear help on
performance in the above vm-scalability case. Maybe that is because we
already checked the same thing before taking the lock.

diff --git a/mm/compaction.c b/mm/compaction.c
index b99c96c4862d..cf2ac5148001 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -985,6 +985,16 @@ static bool too_many_isolated(pg_data_t *pgdat)
 		if (unlikely(!get_page_unless_zero(page)))
 			goto isolate_fail;

+			/*
+			 * Page become compound since the non-locked check,
+			 * and it's on LRU. It can only be a THP so the order
+			 * is safe to read and it's 0 for tail pages.
+			 */
+			if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
+				low_pfn += compound_nr(page) - 1;
+				goto isolate_fail_put;
+			}
+
 		if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
 			goto isolate_fail_put;

@@ -1013,16 +1023,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
 					goto isolate_abort;
 			}

-			/*
-			 * Page become compound since the non-locked check,
-			 * and it's on LRU. It can only be a THP so the order
-			 * is safe to read and it's 0 for tail pages.
-			 */
-			if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
-				low_pfn += compound_nr(page) - 1;
-				SetPageLRU(page);
-				goto isolate_fail_put;
-			}
 		} else
 			rcu_read_unlock();

Thanks
Alex
Alex Shi Aug. 12, 2020, 12:16 p.m. UTC | #10
On 2020/8/12 7:43 PM, Alex Shi wrote:
>>> Sorry, I still cannot follow you here. The compound code part is unchanged
>>> and follows the original logic. So would you like to post new code so we
>>> can see if it works?
>> No there are significant changes as you reordered all of the
>> operations. Prior to your change the LRU bit was checked, but not
>> cleared before testing for PageCompound. Now you are clearing it
>> before you are testing if it is a compound page. So if compaction is
>> running we will be seeing the pages in the LRU stay put, but the
>> compound bit flickering off and on if the compound page is encountered
>> with the wrong or NULL lruvec. What I was suggesting is that the

> The lruvec could be wrong or NULL here; that is the cornerstone of the
> whole patchset.
> 
Sorry for the typo: s/could/could not/
Alexander H Duyck Aug. 12, 2020, 4:51 p.m. UTC | #11
On Wed, Aug 12, 2020 at 4:44 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
> [...]
>
> >> The skip bits are safe here: even if they race with another skip action,
> >> the block will still be skipped. The skip action just tries to avoid doing
> >> too much compaction; it is not an exclusive operation that needs to avoid
> >> races.
> >
> > That would be the case if they didn't have the impact that they
> > currently do on the compaction process. What I am getting at is that a
> > race was introduced when you placed this test between the clearing of
> > the LRU flag and the actual pulling of the page from the LRU list. So
> > if you tested the skip bits before clearing the LRU flag then I would
> > be okay with the code, however because it is triggering an abort after
>
> Hi Alexander,
>
> Thanks a lot for the comments and suggestions!
>
> I have tried your suggestion:
>
> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> ---
>  mm/compaction.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index b99c96c4862d..6c881dee8c9a 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -988,6 +988,13 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                 if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
>                         goto isolate_fail_put;
>
> +               /* Try get exclusive access under lock */
> +               if (!skip_updated) {
> +                       skip_updated = true;
> +                       if (test_and_set_skip(cc, page, low_pfn))
> +                               goto isolate_fail_put;
> +               }
> +
>                 /* Try isolate the page */
>                 if (!TestClearPageLRU(page))
>                         goto isolate_fail_put;

I would have made this check much sooner, probably before you call
get_page_unless_zero(), so as to avoid the unnecessary atomic operations.

> @@ -1006,13 +1013,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
>
>                         lruvec_memcg_debug(lruvec, page);
>
> -                       /* Try get exclusive access under lock */
> -                       if (!skip_updated) {
> -                               skip_updated = true;
> -                               if (test_and_set_skip(cc, page, low_pfn))
> -                                       goto isolate_abort;
> -                       }
> -
>                         /*
>                          * Page become compound since the non-locked check,
>                          * and it's on LRU. It can only be a THP so the order
> --
>
> Performance of the case-lru-file-mmap-read test in vm-scalability dropped
> a bit; not helpful.

So one issue with this change is that it is still too late to be of
much benefit. Really you should probably be doing this much sooner,
for example somewhere before the get_page_unless_zero(). Also the
thing that still has me scratching my head is the "Try get exclusive
access under lock" comment. The function declaration says this is
supposed to be a hint, but we were using the LRU lock to synchronize
it. I'm wondering if we should really be protecting this with the zone
lock since we are modifying the pageblock flags which also contain the
migration type value for the pageblock and are only modified while
holding the zone lock.
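
As a hypothetical sketch of that idea (untested; test_and_set_skip() and
page_zone() are the existing helpers, only the locking placement is new):

		if (!skip_updated) {
			struct zone *zone = page_zone(page);
			unsigned long zflags;
			bool skip;

			/*
			 * Serialize the skip-bit update with the zone lock,
			 * which already protects the other pageblock flags.
			 */
			skip_updated = true;
			spin_lock_irqsave(&zone->lock, zflags);
			skip = test_and_set_skip(cc, page, low_pfn);
			spin_unlock_irqrestore(&zone->lock, zflags);
			if (skip)
				goto isolate_fail_put;
		}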

> > the LRU flag is cleared then you are creating a situation where
> > multiple processes will be stomping all over each other as you can
> > have each thread essentially take a page via the LRU flag, but only
> > one thread will process a page and it could skip over all other pages
> > that preemptively had their LRU flag cleared.
>
> It increases contention a bit here, but the lru_lock does reduce some of
> it, and the skip bit lets threads stop each other via an array check
> (bitmap). So compared to the whole-node lru_lock, the net profit is clear
> in patch 17.

My concern is that what you can end up with is multiple threads all
working over the same pageblock for isolation. With the old code the
LRU lock was used to make certain that test_and_set_skip was being
synchronized on the first page in the pageblock so you would only have
one thread going through and working a single pageblock. However after
your changes it doesn't seem like the test_and_set_skip has that
protection since only one thread will ever be able to successfully
call it for the first page in the pageblock assuming that the LRU flag
is set on the first page in the pageblock.

> >
> > If you take a look at test_and_set_skip(), the function only acts on
> > the pageblock aligned PFN for a given range. With the changes you have
> > in place now that would mean that only one thread would ever actually
> > call this function anyway since the first PFN would take the LRU flag
> > so no other thread could follow through and test or set the bit as
>
> Isn't it good that only one process can do test_and_set_skip()? Isn't
> that what 'skip' is meant to do?

So only one thread really getting to fully use test_and_set_skip is
good, however the issue is that there is nothing to synchronize the
testing from the other threads. As a result the other threads could
have isolated other pages within the pageblock before the thread that
is calling test_and_set_skip will get to complete the setting of the
skip bit. This will result in isolation failures for the thread that
set the skip bit which may be undesirable behavior.

With the old code the threads were all synchronized on testing the
first PFN in the pageblock while holding the LRU lock and that is what
we lost. My concern is the cases where skip_on_failure == true are
going to fail much more often now as the threads can easily interfere
with each other.

> > well. The expectation before was that all threads would encounter this
> > test and either proceed after setting the bit for the first PFN or
> > abort after testing the first PFN. With your changes only the first
> > thread actually runs this test and then it and the others will likely
> > encounter multiple failures as they are all clearing LRU bits
> > simultaneously and tripping each other up. That is why the skip bit
> > must have a test and set done before you even get to the point of
> > clearing the LRU flag.
>
> It makes things worse on my machine; would you like to give it a try yourself?

I plan to do that. I have already been working on a few things to
clean up and optimize your patch set further. I will try to submit an
RFC this evening so we can discuss.

> >
> >>> The point I was getting at with the PageCompound check is that instead
> >>> of needing the LRU lock you should be able to look at PageCompound as
> >>> soon as you call get_page_unless_zero() and preempt the need to set
> >>> the LRU bit again. Instead of trying to rely on the LRU lock to
> >>> guarantee that the page hasn't been merged you could just rely on the
> >>> fact that you are holding a reference to it so it isn't going to
> >>> switch between being compound or order 0 since it cannot be freed. It
> >>> spoils the idea I originally had of combining the logic for
> >>> get_page_unless_zero and TestClearPageLRU into a single function, but
> >>> the advantage is you aren't clearing the LRU flag unless you are
> >>> actually going to pull the page from the LRU list.
> >>
> >> Sorry, I still cannot follow you here. The compound code part is unchanged
> >> and follows the original logic. So would you like to post new code so we
> >> can see if it works?
> >
> > No there are significant changes as you reordered all of the
> > operations. Prior to your change the LRU bit was checked, but not
> > cleared before testing for PageCompound. Now you are clearing it
> > before you are testing if it is a compound page. So if compaction is
> > running we will be seeing the pages in the LRU stay put, but the
> > compound bit flickering off and on if the compound page is encountered
> > with the wrong or NULL lruvec. What I was suggesting is that the
>
> The lruvec could be wrong or NULL here; that is the cornerstone of the
> whole patchset.

Sorry, I had a typo in my comment as well: it is the LRU bit that
will be flickering, not the compound bit. The goal here is to avoid
clearing the LRU bit unless we are sure we are going to take the
lruvec lock and pull the page from the list.

> > PageCompound test probably doesn't need to be concerned with the lock
> > after your changes. You could test it after you call
> > get_page_unless_zero() and before you call
> > __isolate_lru_page_prepare(). Instead of relying on the LRU lock to
> > protect us from the page switching between compound and not we would
> > be relying on the fact that we are holding a reference to the page so
> > it should not be freed and transition between compound or not.
> >
>
> I have tried the patch as you suggested; it gives no clear help on
> performance in the above vm-scalability case. Maybe that's because we
> already checked the same thing before taking the lock.
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index b99c96c4862d..cf2ac5148001 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -985,6 +985,16 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                 if (unlikely(!get_page_unless_zero(page)))
>                         goto isolate_fail;
>
> +                       /*
> +                        * Page become compound since the non-locked check,
> +                        * and it's on LRU. It can only be a THP so the order
> +                        * is safe to read and it's 0 for tail pages.
> +                        */
> +                       if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
> +                               low_pfn += compound_nr(page) - 1;
> +                               goto isolate_fail_put;
> +                       }
> +
>                 if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
>                         goto isolate_fail_put;
>
> @@ -1013,16 +1023,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
>                                         goto isolate_abort;
>                         }
>
> -                       /*
> -                        * Page become compound since the non-locked check,
> -                        * and it's on LRU. It can only be a THP so the order
> -                        * is safe to read and it's 0 for tail pages.
> -                        */
> -                       if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
> -                               low_pfn += compound_nr(page) - 1;
> -                               SetPageLRU(page);
> -                               goto isolate_fail_put;
> -                       }
>                 } else
>                         rcu_read_unlock();
>

So actually there is more we could do than just this. Specifically a
few lines below the rcu_read_lock there is yet another PageCompound
check that sets low_pfn yet again. So in theory we could combine both
of those and modify the code so you end up with something more like:
@@ -968,6 +974,16 @@ isolate_migratepages_block(struct compact_control
*cc, unsigned long low_pfn,
                if (unlikely(!get_page_unless_zero(page)))
                        goto isolate_fail;

+               if (PageCompound(page)) {
+                       const unsigned int order = compound_order(page);
+
+                       if (likely(order < MAX_ORDER))
+                               low_pfn += (1UL << order) - 1;
+
+                       if (unlikely(!cc->alloc_contig))
+                               goto isolate_fail_put;
+               }
+
                if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
                        goto isolate_fail_put;

Doing this you would be more likely to skip over the entire compound
page in a single jump should you fail to take the LRU bit or encounter
a busy page in __isolate_lru_page_prepare. I had copied this bit from an
earlier check and modified it as I was not sure I could guarantee that
this is a THP since we haven't taken the LRU lock yet. However I believe
the page cannot be split up while we are holding the extra reference, so
the PageCompound flag and order should not change until we call put_page.
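
As a quick sanity check of the arithmetic above, assuming a 2MB THP on
x86-64 (order 9):

	/*
	 * low_pfn += (1UL << 9) - 1 advances by 511; together with the
	 * loop's own low_pfn++ the scanner lands on the first PFN after
	 * the compound page instead of visiting all 512 tail PFNs.
	 */
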
Alex Shi Aug. 13, 2020, 1:46 a.m. UTC | #12
On 2020/8/13 12:51 AM, Alexander Duyck wrote:
> On Wed, Aug 12, 2020 at 4:44 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>>
>>
>>
>> On 2020/8/11 10:47 PM, Alexander Duyck wrote:
>>> On Tue, Aug 11, 2020 at 1:23 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2020/8/10 10:41 PM, Alexander Duyck wrote:
>>>>> On Mon, Aug 10, 2020 at 6:10 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2020/8/7 10:51 PM, Alexander Duyck wrote:
>>>>>>> I wonder if this entire section shouldn't be restructured. This is the
>>>>>>> only spot I can see where we are resetting the LRU flag instead of
>>>>>>> pulling the page from the LRU list with the lock held. Looking over
>>>>>>> the code it seems like something like that should be possible. I am
>>>>>>> not sure the LRU lock is really protecting us in either the
>>>>>>> PageCompound check nor the skip bits. It seems like holding a
>>>>>>> reference on the page should prevent it from switching between
>>>>>>> compound or not, and the skip bits are per pageblock with the LRU bits
>>>>>>> being per node/memcg which I would think implies that we could have
>>>>>>> multiple LRU locks that could apply to a single skip bit.
>>>>>>
>>>>>> Hi Alexander,
>>>>>>
>>>>>> I don't see a problem yet with the compound or skip bit usage. Would you
>>>>>> clarify the issue you are concerned about?
>>>>>>
>>>>>> Thanks!
>>>>>
>>>>> The point I was getting at is that the LRU lock is being used to
>>>>> protect these and with your changes I don't think that makes sense
>>>>> anymore.
>>>>>
>>>>> The skip bits are per-pageblock bits. With your change the LRU lock is
>>>>> now per memcg first and then per node. As such I do not believe it
>>>>> really provides any sort of exclusive access to the skip bits. I still
>>>>> have to look into this more, but it seems like you need a lock per
>>>>> either section or zone that can be used to protect those bits and deal
>>>>> with this sooner rather than waiting until you have found an LRU page.
>>>>> The one part that is confusing though is that the definition of the
>>>>> skip bits seems to call out that they are a hint since they are not
>>>>> protected by a lock, but that is exactly what has been happening here.
>>>>>
>>>>
>>>> The skip bits are safe here: even if they race with another skip action,
>>>> the block will still be skipped. The skip action just tries not to compact
>>>> too much; it is not an exclusive action that needs to avoid races.
>>>
>>> That would be the case if it didn't have the impact that they
>>> currently do on the compaction process. What I am getting at is that a
>>> race was introduced when you placed this test between the clearing of
>>> the LRU flag and the actual pulling of the page from the LRU list. So
>>> if you tested the skip bits before clearing the LRU flag then I would
>>> be okay with the code, however because it is triggering an abort after
>>
>> Hi Alexander,
>>
>> Thanks a lot for comments and suggestions!
>>
>> I have tried your suggestion:
>>
>> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
>> ---
>>  mm/compaction.c | 14 +++++++-------
>>  1 file changed, 7 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index b99c96c4862d..6c881dee8c9a 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -988,6 +988,13 @@ static bool too_many_isolated(pg_data_t *pgdat)
>>                 if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
>>                         goto isolate_fail_put;
>>
>> +               /* Try get exclusive access under lock */
>> +               if (!skip_updated) {
>> +                       skip_updated = true;
>> +                       if (test_and_set_skip(cc, page, low_pfn))
>> +                               goto isolate_fail_put;
>> +               }
>> +
>>                 /* Try isolate the page */
>>                 if (!TestClearPageLRU(page))
>>                         goto isolate_fail_put;
> 
> I would have made this much sooner. Probably before you call
> get_page_unless_zero so as to avoid the unnecessary atomic operations.
> 
>> @@ -1006,13 +1013,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
>>
>>                         lruvec_memcg_debug(lruvec, page);
>>
>> -                       /* Try get exclusive access under lock */
>> -                       if (!skip_updated) {
>> -                               skip_updated = true;
>> -                               if (test_and_set_skip(cc, page, low_pfn))
>> -                                       goto isolate_abort;
>> -                       }
>> -
>>                         /*
>>                          * Page become compound since the non-locked check,
>>                          * and it's on LRU. It can only be a THP so the order
>> --
>>
>> Performance of case-lru-file-mmap-read in vm-scalability dropped a bit;
>> not helpful.
> 
> So one issue with this change is that it is still too late to be of
> much benefit. Really you should probably be doing this much sooner,
> for example somewhere before the get_page_unless_zero(). Also the
> thing that still has me scratching my head is the "Try get exclusive
> access under lock" comment. The function declaration says this is
> supposed to be a hint, but we were using the LRU lock to synchronize
> it. I'm wondering if we should really be protecting this with the zone
> lock since we are modifying the pageblock flags which also contain the
> migration type value for the pageblock and are only modified while
> holding the zone lock.

Zone lock is probably better; you can try and test it.
> 
>>> the LRU flag is cleared then you are creating a situation where
>>> multiple processes will be stomping all over each other as you can
>>> have each thread essentially take a page via the LRU flag, but only
>>> one thread will process a page and it could skip over all other pages
>>> that preemptively had their LRU flag cleared.
>>
>> It increases contention a bit here, but the lru_lock does reduce some of it,
>> and the skip bit can stop the other threads via an array check (bitmap).
>> So compared to the whole-node lru_lock, the net gain is clear in patch 17.
> 
> My concern is that what you can end up with is multiple threads all
> working over the same pageblock for isolation. With the old code the
> LRU lock was used to make certain that test_and_set_skip was being
> synchronized on the first page in the pageblock so you would only have
>>> one thread going through and working a single pageblock. However, after
>>> your changes it doesn't seem like test_and_set_skip has that protection
>>> anymore, since only one thread will ever be able to successfully call it
>>> for the first page in the pageblock, assuming that the LRU flag is set on
>>> the first page in the pageblock.
> 
>>>
>>> If you take a look at the test_and_set_skip the function only acts on
>>> the pageblock aligned PFN for a given range. With the changes you have
>>> in place now that would mean that only one thread would ever actually
>>> call this function anyway since the first PFN would take the LRU flag
>>> so no other thread could follow through and test or set the bit as
>>
>> Isn't it good that only one process can do test_and_set_skip? Isn't that
>> what the 'skip' is meant to be?
> 
> So only one thread really getting to fully use test_and_set_skip is
> good, however the issue is that there is nothing to synchronize the
> testing from the other threads. As a result the other threads could
> have isolated other pages within the pageblock before the thread that
> is calling test_and_set_skip will get to complete the setting of the
> skip bit. This will result in isolation failures for the thread that
> set the skip bit which may be undesirable behavior.
> 
> With the old code the threads were all synchronized on testing the
> first PFN in the pageblock while holding the LRU lock and that is what
> we lost. My concern is the cases where skip_on_failure == true are
> going to fail much more often now as the threads can easily interfere
> with each other.

I have a patch to fix this, which is on 
	https://github.com/alexshi/linux.git lrunext
> 
>>> well. The expectation before was that all threads would encounter this
>>> test and either proceed after setting the bit for the first PFN or
>>> abort after testing the first PFN. With your changes only the first
>>> thread actually runs this test and then it and the others will likely
>>> encounter multiple failures as they are all clearing LRU bits
>>> simultaneously and tripping each other up. That is why the skip bit
>>> must have a test and set done before you even get to the point of
>>> clearing the LRU flag.
>>
>> It makes things worse on my machine; would you like to try it yourself?
> 
> I plan to do that. I have already been working on a few things to
> clean up and optimize your patch set further. I will try to submit an
> RFC this evening so we can discuss.
> 

Glad to see your new code soon. Would you like to do it based on
		https://github.com/alexshi/linux.git lrunext

>>>
>>>>> The point I was getting at with the PageCompound check is that instead
>>>>> of needing the LRU lock you should be able to look at PageCompound as
>>>>> soon as you call get_page_unless_zero() and preempt the need to set
>>>>> the LRU bit again. Instead of trying to rely on the LRU lock to
>>>>> guarantee that the page hasn't been merged you could just rely on the
>>>>> fact that you are holding a reference to it so it isn't going to
>>>>> switch between being compound or order 0 since it cannot be freed. It
>>>>> spoils the idea I originally had of combining the logic for
>>>>> get_page_unless_zero and TestClearPageLRU into a single function, but
>>>>> the advantage is you aren't clearing the LRU flag unless you are
>>>>> actually going to pull the page from the LRU list.
>>>>
>>>> Sorry, I still cannot follow you here. The compound code part is unchanged
>>>> and follows the original logic. So would you like to post new code so we
>>>> can see if it works?
>>>
>>> No there are significant changes as you reordered all of the
>>> operations. Prior to your change the LRU bit was checked, but not
>>> cleared before testing for PageCompound. Now you are clearing it
>>> before you are testing if it is a compound page. So if compaction is
>>> running we will be seeing the pages in the LRU stay put, but the
>>> compound bit flickering off and on if the compound page is encountered
>>> with the wrong or NULL lruvec. What I was suggesting is that the
>>
>> The lruvec could be wrong or NULL here; that is the cornerstone of the
>> whole patchset.
> 
> Sorry, I had a typo in my comment as well: it is the LRU bit that
> will be flickering, not the compound bit. The goal here is to avoid
> clearing the LRU bit unless we are sure we are going to take the
> lruvec lock and pull the page from the list.
> 
>>> PageCompound test probably doesn't need to be concerned with the lock
>>> after your changes. You could test it after you call
>>> get_page_unless_zero() and before you call
>>> __isolate_lru_page_prepare(). Instead of relying on the LRU lock to
>>> protect us from the page switching between compound and not we would
>>> be relying on the fact that we are holding a reference to the page so
>>> it should not be freed and transition between compound or not.
>>>
>>
>> I have tried the patch as you suggested; it gives no clear help on
>> performance in the above vm-scalability case. Maybe that's because we
>> already checked the same thing before taking the lock.
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index b99c96c4862d..cf2ac5148001 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -985,6 +985,16 @@ static bool too_many_isolated(pg_data_t *pgdat)
>>                 if (unlikely(!get_page_unless_zero(page)))
>>                         goto isolate_fail;
>>
>> +                       /*
>> +                        * Page become compound since the non-locked check,
>> +                        * and it's on LRU. It can only be a THP so the order
>> +                        * is safe to read and it's 0 for tail pages.
>> +                        */
>> +                       if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
>> +                               low_pfn += compound_nr(page) - 1;
>> +                               goto isolate_fail_put;
>> +                       }
>> +
>>                 if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
>>                         goto isolate_fail_put;
>>
>> @@ -1013,16 +1023,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
>>                                         goto isolate_abort;
>>                         }
>>
>> -                       /*
>> -                        * Page become compound since the non-locked check,
>> -                        * and it's on LRU. It can only be a THP so the order
>> -                        * is safe to read and it's 0 for tail pages.
>> -                        */
>> -                       if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
>> -                               low_pfn += compound_nr(page) - 1;
>> -                               SetPageLRU(page);
>> -                               goto isolate_fail_put;
>> -                       }
>>                 } else
>>                         rcu_read_unlock();
>>
> 
> So actually there is more we could do than just this. Specifically a
> few lines below the rcu_read_lock there is yet another PageCompound
> check that sets low_pfn yet again. So in theory we could combine both
> of those and modify the code so you end up with something more like:
> @@ -968,6 +974,16 @@ isolate_migratepages_block(struct compact_control
> *cc, unsigned long low_pfn,
>                 if (unlikely(!get_page_unless_zero(page)))
>                         goto isolate_fail;
> 
> +               if (PageCompound(page)) {
> +                       const unsigned int order = compound_order(page);
> +
> +                       if (likely(order < MAX_ORDER))
> +                               low_pfn += (1UL << order) - 1;
> +
> +                       if (unlikely(!cc->alloc_contig))
> +                               goto isolate_fail_put;
> 

The current code doesn't check this unless 'locked' changed. But anyway,
checking it for every page may have no performance impact.

> +               }
> +
>                 if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
>                         goto isolate_fail_put;
> 
> Doing this you would be more likely to skip over the entire compound
> page in a single jump should you fail to take the LRU bit or encounter
> a busy page in __isolate_lru_page_prepare. I had copied this bit from an
> earlier check and modified it as I was not sure I could guarantee that
> this is a THP since we haven't taken the LRU lock yet. However I believe
> the page cannot be split up while we are holding the extra reference, so
> the PageCompound flag and order should not change until we call put_page.
> 

It looks like lock_page protects this instead of get_page, which just
works after the split function is called.

Thanks
Alex
Alexander H Duyck Aug. 13, 2020, 2:17 a.m. UTC | #13
On Wed, Aug 12, 2020 at 6:47 PM Alex Shi <alex.shi@linux.alibaba.com> wrote:
>
>
>
> On 2020/8/13 12:51 AM, Alexander Duyck wrote:
> > On Wed, Aug 12, 2020 at 4:44 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
> >>
> >>
> >>
> >> On 2020/8/11 10:47 PM, Alexander Duyck wrote:
> >>> On Tue, Aug 11, 2020 at 1:23 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 2020/8/10 10:41 PM, Alexander Duyck wrote:
> >>>>> On Mon, Aug 10, 2020 at 6:10 AM Alex Shi <alex.shi@linux.alibaba.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 2020/8/7 10:51 PM, Alexander Duyck wrote:
> >>>>>>> I wonder if this entire section shouldn't be restructured. This is the
> >>>>>>> only spot I can see where we are resetting the LRU flag instead of
> >>>>>>> pulling the page from the LRU list with the lock held. Looking over
> >>>>>>> the code it seems like something like that should be possible. I am
> >>>>>>> not sure the LRU lock is really protecting us in either the
> >>>>>>> PageCompound check nor the skip bits. It seems like holding a
> >>>>>>> reference on the page should prevent it from switching between
> >>>>>>> compound or not, and the skip bits are per pageblock with the LRU bits
> >>>>>>> being per node/memcg which I would think implies that we could have
> >>>>>>> multiple LRU locks that could apply to a single skip bit.
> >>>>>>
> >>>>>> Hi Alexander,
> >>>>>>
> >>>>>> I don't see a problem yet with the compound or skip bit usage. Would you
> >>>>>> clarify the issue you are concerned about?
> >>>>>>
> >>>>>> Thanks!
> >>>>>
> >>>>> The point I was getting at is that the LRU lock is being used to
> >>>>> protect these and with your changes I don't think that makes sense
> >>>>> anymore.
> >>>>>
> >>>>> The skip bits are per-pageblock bits. With your change the LRU lock is
> >>>>> now per memcg first and then per node. As such I do not believe it
> >>>>> really provides any sort of exclusive access to the skip bits. I still
> >>>>> have to look into this more, but it seems like you need a lock per
> >>>>> either section or zone that can be used to protect those bits and deal
> >>>>> with this sooner rather than waiting until you have found an LRU page.
> >>>>> The one part that is confusing though is that the definition of the
> >>>>> skip bits seems to call out that they are a hint since they are not
> >>>>> protected by a lock, but that is exactly what has been happening here.
> >>>>>
> >>>>
> >>>> The skip bits are safe here: even if they race with another skip action,
> >>>> the block will still be skipped. The skip action just tries not to compact
> >>>> too much; it is not an exclusive action that needs to avoid races.
> >>>
> >>> That would be the case if it didn't have the impact that they
> >>> currently do on the compaction process. What I am getting at is that a
> >>> race was introduced when you placed this test between the clearing of
> >>> the LRU flag and the actual pulling of the page from the LRU list. So
> >>> if you tested the skip bits before clearing the LRU flag then I would
> >>> be okay with the code, however because it is triggering an abort after
> >>
> >> Hi Alexander,
> >>
> >> Thanks a lot for comments and suggestions!
> >>
> >> I have tried your suggestion:
> >>
> >> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
> >> ---
> >>  mm/compaction.c | 14 +++++++-------
> >>  1 file changed, 7 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/mm/compaction.c b/mm/compaction.c
> >> index b99c96c4862d..6c881dee8c9a 100644
> >> --- a/mm/compaction.c
> >> +++ b/mm/compaction.c
> >> @@ -988,6 +988,13 @@ static bool too_many_isolated(pg_data_t *pgdat)
> >>                 if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
> >>                         goto isolate_fail_put;
> >>
> >> +               /* Try get exclusive access under lock */
> >> +               if (!skip_updated) {
> >> +                       skip_updated = true;
> >> +                       if (test_and_set_skip(cc, page, low_pfn))
> >> +                               goto isolate_fail_put;
> >> +               }
> >> +
> >>                 /* Try isolate the page */
> >>                 if (!TestClearPageLRU(page))
> >>                         goto isolate_fail_put;
> >
> > I would have made this much sooner. Probably before you call
> > get_page_unless_zero so as to avoid the unnecessary atomic operations.
> >
> >> @@ -1006,13 +1013,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
> >>
> >>                         lruvec_memcg_debug(lruvec, page);
> >>
> >> -                       /* Try get exclusive access under lock */
> >> -                       if (!skip_updated) {
> >> -                               skip_updated = true;
> >> -                               if (test_and_set_skip(cc, page, low_pfn))
> >> -                                       goto isolate_abort;
> >> -                       }
> >> -
> >>                         /*
> >>                          * Page become compound since the non-locked check,
> >>                          * and it's on LRU. It can only be a THP so the order
> >> --
> >>
> >> Performance of case-lru-file-mmap-read in vm-scalability dropped a bit;
> >> not helpful.
> >
> > So one issue with this change is that it is still too late to be of
> > much benefit. Really you should probably be doing this much sooner,
> > for example somewhere before the get_page_unless_zero(). Also the
> > thing that still has me scratching my head is the "Try get exclusive
> > access under lock" comment. The function declaration says this is
> > supposed to be a hint, but we were using the LRU lock to synchronize
> > it. I'm wondering if we should really be protecting this with the zone
> > lock since we are modifying the pageblock flags which also contain the
> > migration type value for the pageblock and are only modified while
> > holding the zone lock.
>
> Zone lock is probably better; you can try and test it.

So I spent a good chunk of today looking the code over and what I
realized is that we probably don't even really need to have this code
protected by the zone lock since the LRU bit in the pageblock should
do most of the work for us. In addition we can get rid of the test
portion of this and just make it a set only operation if I am not
mistaken.
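
Roughly what I have in mind for the set-only variant is sketched below;
set_skip_hint() is a hypothetical name, and the cached-PFN updates that
test_and_set_skip() also performs are omitted for brevity:

static void set_skip_hint(struct compact_control *cc, struct page *page,
			  unsigned long pfn)
{
	if (cc->no_set_skip_hint)
		return;

	/* The skip bit still lives on the pageblock-aligned PFN only */
	if (!IS_ALIGNED(pfn, pageblock_nr_pages))
		return;

	/* Set unconditionally; with no test half, no lock is needed */
	set_pageblock_skip(page);
}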

> >>> the LRU flag is cleared then you are creating a situation where
> >>> multiple processes will be stomping all over each other as you can
> >>> have each thread essentially take a page via the LRU flag, but only
> >>> one thread will process a page and it could skip over all other pages
> >>> that preemptively had their LRU flag cleared.
> >>
> >> It increases contention a bit here, but the lru_lock does reduce some of it,
> >> and the skip bit can stop the other threads via an array check (bitmap).
> >> So compared to the whole-node lru_lock, the net gain is clear in patch 17.
> >
> > My concern is that what you can end up with is multiple threads all
> > working over the same pageblock for isolation. With the old code the
> > LRU lock was used to make certain that test_and_set_skip was being
> > synchronized on the first page in the pageblock so you would only have
> > one thread going through and working a single pageblock. However, after
> > your changes it doesn't seem like test_and_set_skip has that protection
> > anymore, since only one thread will ever be able to successfully call it
> > for the first page in the pageblock, assuming that the LRU flag is set on
> > the first page in the pageblock.
> >
> >>>
> >>> If you take a look at the test_and_set_skip the function only acts on
> >>> the pageblock aligned PFN for a given range. With the changes you have
> >>> in place now that would mean that only one thread would ever actually
> >>> call this function anyway since the first PFN would take the LRU flag
> >>> so no other thread could follow through and test or set the bit as
> >>
> >> Isn't it good that only one process can do test_and_set_skip? Isn't that
> >> what the 'skip' is meant to be?
> >
> > So only one thread really getting to fully use test_and_set_skip is
> > good, however the issue is that there is nothing to synchronize the
> > testing from the other threads. As a result the other threads could
> > have isolated other pages within the pageblock before the thread that
> > is calling test_and_set_skip will get to complete the setting of the
> > skip bit. This will result in isolation failures for the thread that
> > set the skip bit which may be undesirable behavior.
> >
> > With the old code the threads were all synchronized on testing the
> > first PFN in the pageblock while holding the LRU lock and that is what
> > we lost. My concern is the cases where skip_on_failure == true are
> > going to fail much more often now as the threads can easily interfere
> > with each other.
>
> I have a patch to fix this, which is on
>         https://github.com/alexshi/linux.git lrunext

I don't think that patch helps to address anything. You are now
failing to set the bit in the case that something modifies the
pageblock flags while you are attempting to do so. I think it would be
better to just leave the cmpxchg loop as it is.
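
For context, the cmpxchg loop in question (in set_pfnblock_flags_mask(),
quoted roughly; variable names approximate) retries until the update
lands rather than giving up when a concurrent writer changes the word:

	word = READ_ONCE(bitmap[word_bitidx]);
	for (;;) {
		old_word = cmpxchg(&bitmap[word_bitidx], word,
				   (word & ~mask) | flags);
		if (word == old_word)
			break;		/* our update is in place */
		word = old_word;	/* lost a race; retry on new value */
	}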

> >
> >>> well. The expectation before was that all threads would encounter this
> >>> test and either proceed after setting the bit for the first PFN or
> >>> abort after testing the first PFN. With your changes only the first
> >>> thread actually runs this test and then it and the others will likely
> >>> encounter multiple failures as they are all clearing LRU bits
> >>> simultaneously and tripping each other up. That is why the skip bit
> >>> must have a test and set done before you even get to the point of
> >>> clearing the LRU flag.
> >>
> >> It makes things worse on my machine; would you like to try it yourself?
> >
> > I plan to do that. I have already been working on a few things to
> > clean up and optimize your patch set further. I will try to submit an
> > RFC this evening so we can discuss.
> >
>
> Glad to see your new code soon. Would you like to do it based on
>                 https://github.com/alexshi/linux.git lrunext

I can rebase off of that tree. It may add another half hour or so. I
have barely had any time to test my code. When I enabled some of the
debugging features in the kernel related to using the vm-scalability
tests the boot time became incredibly slow so I may just make certain
I can boot and not mess the system up before submitting my patches as
an RFC. I can probably try testing them more tomorrow.

> >>>
> >>>>> The point I was getting at with the PageCompound check is that instead
> >>>>> of needing the LRU lock you should be able to look at PageCompound as
> >>>>> soon as you call get_page_unless_zero() and preempt the need to set
> >>>>> the LRU bit again. Instead of trying to rely on the LRU lock to
> >>>>> guarantee that the page hasn't been merged you could just rely on the
> >>>>> fact that you are holding a reference to it so it isn't going to
> >>>>> switch between being compound or order 0 since it cannot be freed. It
> >>>>> spoils the idea I originally had of combining the logic for
> >>>>> get_page_unless_zero and TestClearPageLRU into a single function, but
> >>>>> the advantage is you aren't clearing the LRU flag unless you are
> >>>>> actually going to pull the page from the LRU list.
> >>>>
> >>>> Sorry, I still cannot follow you here. The compound code part is unchanged
> >>>> and follows the original logic. So would you like to post new code so we
> >>>> can see if it works?
> >>>
> >>> No there are significant changes as you reordered all of the
> >>> operations. Prior to your change the LRU bit was checked, but not
> >>> cleared before testing for PageCompound. Now you are clearing it
> >>> before you are testing if it is a compound page. So if compaction is
> >>> running we will be seeing the pages in the LRU stay put, but the
> >>> compound bit flickering off and on if the compound page is encountered
> >>> with the wrong or NULL lruvec. What I was suggesting is that the
> >>
> >> The lruvec could be wrong or NULL here; that is the cornerstone of the
> >> whole patchset.
> >
> > Sorry, I had a typo in my comment as well: it is the LRU bit that
> > will be flickering, not the compound bit. The goal here is to avoid
> > clearing the LRU bit unless we are sure we are going to take the
> > lruvec lock and pull the page from the list.
> >
> >>> PageCompound test probably doesn't need to be concerned with the lock
> >>> after your changes. You could test it after you call
> >>> get_page_unless_zero() and before you call
> >>> __isolate_lru_page_prepare(). Instead of relying on the LRU lock to
> >>> protect us from the page switching between compound and not we would
> >>> be relying on the fact that we are holding a reference to the page so
> >>> it should not be freed and transition between compound or not.
> >>>
> >>
> >> I have tried the patch as you suggested; it gives no clear help on
> >> performance in the above vm-scalability case. Maybe that's because we
> >> already checked the same thing before taking the lock.
> >>
> >> diff --git a/mm/compaction.c b/mm/compaction.c
> >> index b99c96c4862d..cf2ac5148001 100644
> >> --- a/mm/compaction.c
> >> +++ b/mm/compaction.c
> >> @@ -985,6 +985,16 @@ static bool too_many_isolated(pg_data_t *pgdat)
> >>                 if (unlikely(!get_page_unless_zero(page)))
> >>                         goto isolate_fail;
> >>
> >> +                       /*
> >> +                        * Page become compound since the non-locked check,
> >> +                        * and it's on LRU. It can only be a THP so the order
> >> +                        * is safe to read and it's 0 for tail pages.
> >> +                        */
> >> +                       if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
> >> +                               low_pfn += compound_nr(page) - 1;
> >> +                               goto isolate_fail_put;
> >> +                       }
> >> +
> >>                 if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
> >>                         goto isolate_fail_put;
> >>
> >> @@ -1013,16 +1023,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
> >>                                         goto isolate_abort;
> >>                         }
> >>
> >> -                       /*
> >> -                        * Page become compound since the non-locked check,
> >> -                        * and it's on LRU. It can only be a THP so the order
> >> -                        * is safe to read and it's 0 for tail pages.
> >> -                        */
> >> -                       if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
> >> -                               low_pfn += compound_nr(page) - 1;
> >> -                               SetPageLRU(page);
> >> -                               goto isolate_fail_put;
> >> -                       }
> >>                 } else
> >>                         rcu_read_unlock();
> >>
> >
> > So actually there is more we could do than just this. Specifically a
> > few lines below the rcu_read_lock there is yet another PageCompound
> > check that sets low_pfn yet again. So in theory we could combine both
> > of those and modify the code so you end up with something more like:
> > @@ -968,6 +974,16 @@ isolate_migratepages_block(struct compact_control
> > *cc, unsigned long low_pfn,
> >                 if (unlikely(!get_page_unless_zero(page)))
> >                         goto isolate_fail;
> >
> > +               if (PageCompound(page)) {
> > +                       const unsigned int order = compound_order(page);
> > +
> > +                       if (likely(order < MAX_ORDER))
> > +                               low_pfn += (1UL << order) - 1;
> > +
> > +                       if (unlikely(!cc->alloc_contig))
> > +                               goto isolate_fail_put;
> >
>
> The current code doesn't check this unless 'locked' changed. But anyway,
> checking it for every page may have no performance impact.

Yes and no. The same code is also run outside the lock, and that is why
I suggested merging the two and creating this block of logic. It will
be clearer once I have done some initial smoke testing and submitted
my patch.

> > +               }
> > +
> >                 if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
> >                         goto isolate_fail_put;
> >
> > Doing this you would be more likely to skip over the entire compound
> > page in a single jump should you fail to take the LRU bit or encounter
> > a busy page in __isolate_lru_page_prepare. I had copied this bit from an
> > earlier check and modified it as I was not sure I could guarantee that
> > this is a THP since we haven't taken the LRU lock yet. However I believe
> > the page cannot be split up while we are holding the extra reference, so
> > the PageCompound flag and order should not change until we call put_page.
> >
>
> It looks like lock_page protects this instead of get_page, which just
> works after the split function is called.

So I thought that the call to page_ref_freeze that is used in
functions like split_huge_page_to_list is meant to address this case.
What it is essentially doing is setting the reference count to zero if
the count is at the expected value. So with the get_page_unless_zero
it would either fail because the value is already zero, or the
page_ref_freeze would fail because the count would be one higher than
the expected value. Either that or I am still missing another piece in
the understanding of this.
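
For reference, page_ref_freeze() is essentially a cmpxchg of the expected
count against zero (as defined in include/linux/page_ref.h), which is what
makes the two paths mutually exclusive:

static inline int page_ref_freeze(struct page *page, int count)
{
	/* Succeeds only if nobody else holds an extra reference */
	return likely(atomic_cmpxchg(&page->_refcount, count, 0) == count);
}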

Thanks.

- Alex
Alex Shi Aug. 13, 2020, 3:52 a.m. UTC | #14
On 2020/8/13 10:17 AM, Alexander Duyck wrote:
>> Zone lock is probably better; you can try and test it.
> So I spent a good chunk of today looking the code over and what I
> realized is that we probably don't even really need to have this code
> protected by the zone lock since the LRU bit in the pageblock should
> do most of the work for us. In addition we can get rid of the test
> portion of this and just make it a set only operation if I am not
> mistaken.
> 
>>>>> the LRU flag is cleared then you are creating a situation where
>>>>> multiple processes will be stomping all over each other as you can
>>>>> have each thread essentially take a page via the LRU flag, but only
>>>>> one thread will process a page and it could skip over all other pages
>>>>> that preemptively had their LRU flag cleared.
>>>> It increases contention a bit here, but the lru_lock does reduce some of it,
>>>> and the skip bit can stop the other threads via an array check (bitmap).
>>>> So compared to the whole-node lru_lock, the net gain is clear in patch 17.
>>> My concern is that what you can end up with is multiple threads all
>>> working over the same pageblock for isolation. With the old code the
>>> LRU lock was used to make certain that test_and_set_skip was being
>>> synchronized on the first page in the pageblock so you would only have
>>> one thread going through and working a single pageblock. However, after
>>> your changes it doesn't seem like test_and_set_skip has that protection
>>> anymore, since only one thread will ever be able to successfully call it
>>> for the first page in the pageblock, assuming that the LRU flag is set on
>>> the first page in the pageblock.
>>>
>>>>> If you take a look at the test_and_set_skip the function only acts on
>>>>> the pageblock aligned PFN for a given range. With the changes you have
>>>>> in place now that would mean that only one thread would ever actually
>>>>> call this function anyway since the first PFN would take the LRU flag
>>>>> so no other thread could follow through and test or set the bit as
>>>> Isn't it good that only one process can do test_and_set_skip? Isn't that
>>>> what the 'skip' is meant to be?
>>> So only one thread really getting to fully use test_and_set_skip is
>>> good, however the issue is that there is nothing to synchronize the
>>> testing from the other threads. As a result the other threads could
>>> have isolated other pages within the pageblock before the thread that
>>> is calling test_and_set_skip will get to complete the setting of the
>>> skip bit. This will result in isolation failures for the thread that
>>> set the skip bit which may be undesirable behavior.
>>>
>>> With the old code the threads were all synchronized on testing the
>>> first PFN in the pageblock while holding the LRU lock and that is what
>>> we lost. My concern is the cases where skip_on_failure == true are
>>> going to fail much more often now as the threads can easily interfere
>>> with each other.
>> I have a patch to fix this, which is on
>>         https://github.com/alexshi/linux.git lrunext
> I don't think that patch helps to address anything. You are now
> failing to set the bit in the case that something modifies the
> pageblock flags while you are attempting to do so. I think it would be
> better to just leave the cmpxchg loop as it is.

It does increase case-lru-file-mmap-read performance in vm-scalability by
about 3%. Yes, I am glad to see it can be made better.


> 
>>>>> well. The expectation before was that all threads would encounter this
>>>>> test and either proceed after setting the bit for the first PFN or
>>>>> abort after testing the first PFN. With your changes only the first
>>>>> thread actually runs this test and then it and the others will likely
>>>>> encounter multiple failures as they are all clearing LRU bits
>>>>> simultaneously and tripping each other up. That is why the skip bit
>>>>> must have a test and set done before you even get to the point of
>>>>> clearing the LRU flag.
>>>> It makes things worse on my machine; would you like to try it yourself?
>>> I plan to do that. I have already been working on a few things to
>>> clean up and optimize your patch set further. I will try to submit an
>>> RFC this evening so we can discuss.
>>>
>> Glad to see your new code soon. Would you like to do it based on
>>                 https://github.com/alexshi/linux.git lrunext
> I can rebase off of that tree. It may add another half hour or so. I
> have barely had any time to test my code. When I enabled some of the
> debugging features in the kernel related to using the vm-scalability
> tests the boot time became incredibly slow so I may just make certain
> I can boot and not mess the system up before submitting my patches as
> an RFC. I can probably try testing them more tomorrow.
> 
>>>>>>> The point I was getting at with the PageCompound check is that instead
>>>>>>> of needing the LRU lock you should be able to look at PageCompound as
>>>>>>> soon as you call get_page_unless_zero() and preempt the need to set
>>>>>>> the LRU bit again. Instead of trying to rely on the LRU lock to
>>>>>>> guarantee that the page hasn't been merged you could just rely on the
>>>>>>> fact that you are holding a reference to it so it isn't going to
>>>>>>> switch between being compound or order 0 since it cannot be freed. It
>>>>>>> spoils the idea I originally had of combining the logic for
>>>>>>> get_page_unless_zero and TestClearPageLRU into a single function, but
>>>>>>> the advantage is you aren't clearing the LRU flag unless you are
>>>>>>> actually going to pull the page from the LRU list.
>>>>>> Sorry, I still cannot follow you here. The compound code part is unchanged
>>>>>> and follows the original logic. So would you like to post new code so we
>>>>>> can see if it works?
>>>>> No there are significant changes as you reordered all of the
>>>>> operations. Prior to your change the LRU bit was checked, but not
>>>>> cleared before testing for PageCompound. Now you are clearing it
>>>>> before you are testing if it is a compound page. So if compaction is
>>>>> running we will be seeing the pages in the LRU stay put, but the
>>>>> compound bit flickering off and on if the compound page is encountered
>>>>> with the wrong or NULL lruvec. What I was suggesting is that the
>>>> The lruvec could be wrong or NULL here; that is the cornerstone of the
>>>> whole patchset.
>>> Sorry, I had a typo in my comment as well: it is the LRU bit that
>>> will be flickering, not the compound bit. The goal here is to avoid
>>> clearing the LRU bit unless we are sure we are going to take the
>>> lruvec lock and pull the page from the list.
>>>
>>>>> PageCompound test probably doesn't need to be concerned with the lock
>>>>> after your changes. You could test it after you call
>>>>> get_page_unless_zero() and before you call
>>>>> __isolate_lru_page_prepare(). Instead of relying on the LRU lock to
>>>>> protect us from the page switching between compound and not we would
>>>>> be relying on the fact that we are holding a reference to the page so
>>>>> it should not be freed and transition between compound or not.
>>>>>
>>>> I have tried the patch as you suggested; it gives no clear help on
>>>> performance in the above vm-scalability case. Maybe that's because we
>>>> already checked the same thing before taking the lock.
>>>>
>>>> diff --git a/mm/compaction.c b/mm/compaction.c
>>>> index b99c96c4862d..cf2ac5148001 100644
>>>> --- a/mm/compaction.c
>>>> +++ b/mm/compaction.c
>>>> @@ -985,6 +985,16 @@ static bool too_many_isolated(pg_data_t *pgdat)
>>>>                 if (unlikely(!get_page_unless_zero(page)))
>>>>                         goto isolate_fail;
>>>>
>>>> +                       /*
>>>> +                        * Page become compound since the non-locked check,
>>>> +                        * and it's on LRU. It can only be a THP so the order
>>>> +                        * is safe to read and it's 0 for tail pages.
>>>> +                        */
>>>> +                       if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
>>>> +                               low_pfn += compound_nr(page) - 1;
>>>> +                               goto isolate_fail_put;
>>>> +                       }
>>>> +
>>>>                 if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
>>>>                         goto isolate_fail_put;
>>>>
>>>> @@ -1013,16 +1023,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
>>>>                                         goto isolate_abort;
>>>>                         }
>>>>
>>>> -                       /*
>>>> -                        * Page become compound since the non-locked check,
>>>> -                        * and it's on LRU. It can only be a THP so the order
>>>> -                        * is safe to read and it's 0 for tail pages.
>>>> -                        */
>>>> -                       if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
>>>> -                               low_pfn += compound_nr(page) - 1;
>>>> -                               SetPageLRU(page);
>>>> -                               goto isolate_fail_put;
>>>> -                       }
>>>>                 } else
>>>>                         rcu_read_unlock();
>>>>
>>> So actually there is more we could do than just this. Specifically a
>>> few lines below the rcu_read_lock there is yet another PageCompound
>>> check that sets low_pfn yet again. So in theory we could combine both
>>> of those and modify the code so you end up with something more like:
>>> @@ -968,6 +974,16 @@ isolate_migratepages_block(struct compact_control
>>> *cc, unsigned long low_pfn,
>>>                 if (unlikely(!get_page_unless_zero(page)))
>>>                         goto isolate_fail;
>>>
>>> +               if (PageCompound(page)) {
>>> +                       const unsigned int order = compound_order(page);
>>> +
>>> +                       if (likely(order < MAX_ORDER))
>>> +                               low_pfn += (1UL << order) - 1;
>>> +
>>> +                       if (unlikely(!cc->alloc_contig))
>>> +                               goto isolate_fail_put;
>>>
>> The current code doesn't check this unless 'locked' changed. But anyway,
>> checking it for every page may have no performance impact.
> Yes and no. The same code is also run outside the lock, and that is why
> I suggested merging the two and creating this block of logic. It will
> be clearer once I have done some initial smoke testing and submitted
> my patch.
> 
>>> +               }
>>> +
>>>                 if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
>>>                         goto isolate_fail_put;
>>>
>>> Doing this you would be more likely to skip over the entire compound
>>> page in a single jump should you fail to take the LRU bit or encounter
>>> a busy page in __isolate_lru_page_prepare. I had copied this bit from an
>>> earlier check and modified it as I was not sure I could guarantee that
>>> this is a THP since we haven't taken the LRU lock yet. However I believe
>>> the page cannot be split up while we are holding the extra reference, so
>>> the PageCompound flag and order should not change until we call put_page.
>>>
>> It looks like lock_page protects this instead of get_page, which just
>> works after the split function is called.
> So I thought that the call to page_ref_freeze that is used in
> functions like split_huge_page_to_list is meant to address this case.
> What it is essentially doing is setting the reference count to zero if
> the count is at the expected value. So with the get_page_unless_zero
> it would either fail because the value is already zero, or the
> page_ref_freeze would fail because the count would be one higher than
> the expected value. Either that or I am still missing another piece in
> the understanding of this.

Uh, the preceding xa_lock or anon_vma lock guards the refcount, so it's a
long locking path...

Thanks
Alex
Alexander H Duyck Aug. 17, 2020, 10:58 p.m. UTC | #15
> @@ -1691,17 +1680,34 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>                  * only when the page is being freed somewhere else.
>                  */
>                 scan += nr_pages;
> -               switch (__isolate_lru_page(page, mode)) {
> +               switch (__isolate_lru_page_prepare(page, mode)) {
>                 case 0:
> +                       /*
> +                        * Be careful not to clear PageLRU until after we're
> +                        * sure the page is not being freed elsewhere -- the
> +                        * page release code relies on it.
> +                        */
> +                       if (unlikely(!get_page_unless_zero(page)))
> +                               goto busy;
> +
> +                       if (!TestClearPageLRU(page)) {
> +                               /*
> +                                * This page may in other isolation path,
> +                                * but we still hold lru_lock.
> +                                */
> +                               put_page(page);
> +                               goto busy;
> +                       }
> +

So I was reviewing the code and came across this piece. It has me a
bit concerned since we are calling put_page while holding the LRU lock
that was taken before calling the function. We should be fine in terms
of not encountering a deadlock: since the LRU bit is cleared, the page
release path shouldn't grab the LRU lock again. However, we could end
up grabbing the zone lock while holding the LRU lock, which would be
an issue.

One other thought I had is that this might be safe because the
assumption would be that another thread is holding a reference on the
page, has already called TestClearPageLRU on the page and retrieved
the LRU bit, and is waiting on us to release the LRU lock before it
can pull the page off of the list. In that case put_page will never
decrement the reference count to 0. I believe that is the current case
but I cannot be certain.

I'm just wondering if we should just replace the put_page(page) with a
WARN_ON(put_page_testzero(page)) and a bit more documentation. If I am
not mistaken it should never be possible for the reference count to
actually hit zero here.
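
In other words, something like this in isolate_lru_pages() (a sketch of
the suggestion, assuming the racing isolator always still holds its own
reference at this point):

		if (!TestClearPageLRU(page)) {
			/*
			 * The page was grabbed by a racing isolation path,
			 * which must still hold its reference while we hold
			 * the LRU lock, so ours cannot be the last one.
			 */
			WARN_ON(put_page_testzero(page));
			goto busy;
		}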

Thanks.

- Alex

Patch

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2c29399b29a0..6d23d3beeff7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -358,7 +358,7 @@  extern void lru_cache_add_active_or_unevictable(struct page *page,
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
-extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
+extern int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode);
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  unsigned long nr_pages,
 						  gfp_t gfp_mask,
diff --git a/mm/compaction.c b/mm/compaction.c
index f14780fc296a..2da2933fe56b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -869,6 +869,7 @@  static bool too_many_isolated(pg_data_t *pgdat)
 		if (!valid_page && IS_ALIGNED(low_pfn, pageblock_nr_pages)) {
 			if (!cc->ignore_skip_hint && get_pageblock_skip(page)) {
 				low_pfn = end_pfn;
+				page = NULL;
 				goto isolate_abort;
 			}
 			valid_page = page;
@@ -950,6 +951,21 @@  static bool too_many_isolated(pg_data_t *pgdat)
 		if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
 			goto isolate_fail;
 
+		/*
+		 * Be careful not to clear PageLRU until after we're
+		 * sure the page is not being freed elsewhere -- the
+		 * page release code relies on it.
+		 */
+		if (unlikely(!get_page_unless_zero(page)))
+			goto isolate_fail;
+
+		if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
+			goto isolate_fail_put;
+
+		/* Try isolate the page */
+		if (!TestClearPageLRU(page))
+			goto isolate_fail_put;
+
 		/* If we already hold the lock, we can skip some rechecking */
 		if (!locked) {
 			locked = compact_lock_irqsave(&pgdat->lru_lock,
@@ -962,10 +978,6 @@  static bool too_many_isolated(pg_data_t *pgdat)
 					goto isolate_abort;
 			}
 
-			/* Recheck PageLRU and PageCompound under lock */
-			if (!PageLRU(page))
-				goto isolate_fail;
-
 			/*
 			 * Page become compound since the non-locked check,
 			 * and it's on LRU. It can only be a THP so the order
@@ -973,16 +985,13 @@  static bool too_many_isolated(pg_data_t *pgdat)
 			 */
 			if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
 				low_pfn += compound_nr(page) - 1;
-				goto isolate_fail;
+				SetPageLRU(page);
+				goto isolate_fail_put;
 			}
 		}
 
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 
-		/* Try isolate the page */
-		if (__isolate_lru_page(page, isolate_mode) != 0)
-			goto isolate_fail;
-
 		/* The whole page is taken off the LRU; skip the tail pages. */
 		if (PageCompound(page))
 			low_pfn += compound_nr(page) - 1;
@@ -1011,6 +1020,15 @@  static bool too_many_isolated(pg_data_t *pgdat)
 		}
 
 		continue;
+
+isolate_fail_put:
+		/* Avoid potential deadlock in freeing page under lru_lock */
+		if (locked) {
+			spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+			locked = false;
+		}
+		put_page(page);
+
 isolate_fail:
 		if (!skip_on_failure)
 			continue;
@@ -1047,9 +1065,15 @@  static bool too_many_isolated(pg_data_t *pgdat)
 	if (unlikely(low_pfn > end_pfn))
 		low_pfn = end_pfn;
 
+	page = NULL;
+
 isolate_abort:
 	if (locked)
 		spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+	if (page) {
+		SetPageLRU(page);
+		put_page(page);
+	}
 
 	/*
 	 * Updated the cached scanner pfn once the pageblock has been scanned
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4183ae6b54b5..f77748adc340 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1544,20 +1544,20 @@  unsigned int reclaim_clean_pages_from_list(struct zone *zone,
  *
  * returns 0 on success, -ve errno on failure.
  */
-int __isolate_lru_page(struct page *page, isolate_mode_t mode)
+int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode)
 {
 	int ret = -EINVAL;
 
-	/* Only take pages on the LRU. */
-	if (!PageLRU(page))
-		return ret;
-
 	/* Compaction should not handle unevictable pages but CMA can do so */
 	if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE))
 		return ret;
 
 	ret = -EBUSY;
 
+	/* Only take pages on the LRU. */
+	if (!PageLRU(page))
+		return ret;
+
 	/*
 	 * To minimise LRU disruption, the caller can indicate that it only
 	 * wants to isolate pages it will be able to operate on without
@@ -1598,20 +1598,9 @@  int __isolate_lru_page(struct page *page, isolate_mode_t mode)
 	if ((mode & ISOLATE_UNMAPPED) && page_mapped(page))
 		return ret;
 
-	if (likely(get_page_unless_zero(page))) {
-		/*
-		 * Be careful not to clear PageLRU until after we're
-		 * sure the page is not being freed elsewhere -- the
-		 * page release code relies on it.
-		 */
-		ClearPageLRU(page);
-		ret = 0;
-	}
-
-	return ret;
+	return 0;
 }
 
-
 /*
  * Update LRU sizes after isolating pages. The LRU size updates must
  * be complete before mem_cgroup_update_lru_size due to a sanity check.
@@ -1691,17 +1680,34 @@  static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 		 * only when the page is being freed somewhere else.
 		 */
 		scan += nr_pages;
-		switch (__isolate_lru_page(page, mode)) {
+		switch (__isolate_lru_page_prepare(page, mode)) {
 		case 0:
+			/*
+			 * Be careful not to clear PageLRU until after we're
+			 * sure the page is not being freed elsewhere -- the
+			 * page release code relies on it.
+			 */
+			if (unlikely(!get_page_unless_zero(page)))
+				goto busy;
+
+			if (!TestClearPageLRU(page)) {
+				/*
+				 * This page may in other isolation path,
+				 * but we still hold lru_lock.
+				 */
+				put_page(page);
+				goto busy;
+			}
+
 			nr_taken += nr_pages;
 			nr_zone_taken[page_zonenum(page)] += nr_pages;
 			list_move(&page->lru, dst);
 			break;
-
+busy:
 		case -EBUSY:
 			/* else it is being freed elsewhere */
 			list_move(&page->lru, src);
-			continue;
+			break;
 
 		default:
 			BUG();