diff mbox series

mm, page_isolation: remove drain_all_pages() in set_migratetype_isolate()

Message ID 20181214023912.77474-1-richard.weiyang@gmail.com (mailing list archive)
State New, archived
Headers show
Series mm, page_isolation: remove drain_all_pages() in set_migratetype_isolate() | expand

Commit Message

Wei Yang Dec. 14, 2018, 2:39 a.m. UTC
Below is a brief call flow for __offline_pages() and
alloc_contig_range():

  __offline_pages()/alloc_contig_range()
      start_isolate_page_range()
          set_migratetype_isolate()
              drain_all_pages()
      drain_all_pages()

Since set_migratetype_isolate() is only used in
start_isolate_page_range(), which is just used in __offline_pages() and
alloc_contig_range(). And both of them call drain_all_pages() if every
check looks good. This means it is not necessary call drain_all_pages()
in each iteration of set_migratetype_isolate().

By doing so, the logic seems a little bit clearer.
set_migratetype_isolate() handles pages in Buddy, while
drain_all_pages() takes care of pages in pcp.

Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
---
 mm/page_isolation.c | 2 --
 1 file changed, 2 deletions(-)

Comments

Andrew Morton Dec. 14, 2018, 3:57 a.m. UTC | #1
On Fri, 14 Dec 2018 10:39:12 +0800 Wei Yang <richard.weiyang@gmail.com> wrote:

> Below is a brief call flow for __offline_pages()

Offtopic...

set_migratetype_isolate() has the comment

	/*
	 * immobile means "not-on-lru" pages. If immobile is larger than
	 * removable-by-driver pages reported by notifier, we'll fail.
	 */

what the heck does that mean?  It used to talk about unmovable pages,
but this was mysteriously changed to use the unique term "immobile" by
Minchan's ee6f509c32 ("mm: factor out memory isolate functions"). 
Could someone please take a look?


> and
> alloc_contig_range():
> 
>   __offline_pages()/alloc_contig_range()
>       start_isolate_page_range()
>           set_migratetype_isolate()
>               drain_all_pages()
>       drain_all_pages()
> 
> Since set_migratetype_isolate() is only used in
> start_isolate_page_range(), which is just used in __offline_pages() and
> alloc_contig_range(). And both of them call drain_all_pages() if every
> check looks good. This means it is not necessary call drain_all_pages()
> in each iteration of set_migratetype_isolate().
>
> By doing so, the logic seems a little bit clearer.
> set_migratetype_isolate() handles pages in Buddy, while
> drain_all_pages() takes care of pages in pcp.

Well.  drain_all_pages() moves pages from pcp to buddy so I'm not sure
that argument holds water.

Can we step back a bit and ask ourselves what all these draining
operations are actually for?  What is the intent behind each callsite? 
Figuring that out (and perhaps even documenting it!) would help us
decide the most appropriate places from which to perform the drain.
Wei Yang Dec. 14, 2018, 7:01 a.m. UTC | #2
On Thu, Dec 13, 2018 at 07:57:12PM -0800, Andrew Morton wrote:
>On Fri, 14 Dec 2018 10:39:12 +0800 Wei Yang <richard.weiyang@gmail.com> wrote:
>
>> Below is a brief call flow for __offline_pages()
>
>Offtopic...
>
>set_migratetype_isolate() has the comment
>
>	/*
>	 * immobile means "not-on-lru" pages. If immobile is larger than
>	 * removable-by-driver pages reported by notifier, we'll fail.
>	 */
>
>what the heck does that mean?  It used to talk about unmovable pages,
>but this was mysteriously changed to use the unique term "immobile" by
>Minchan's ee6f509c32 ("mm: factor out memory isolate functions"). 
>Could someone please take a look?
>

What immobile stands for? I searched the whole kernel tree and just this
place use this terminology.

>
>> and
>> alloc_contig_range():
>> 
>>   __offline_pages()/alloc_contig_range()
>>       start_isolate_page_range()
>>           set_migratetype_isolate()
>>               drain_all_pages()
>>       drain_all_pages()
>> 
>> Since set_migratetype_isolate() is only used in
>> start_isolate_page_range(), which is just used in __offline_pages() and
>> alloc_contig_range(). And both of them call drain_all_pages() if every
>> check looks good. This means it is not necessary call drain_all_pages()
>> in each iteration of set_migratetype_isolate().
>>
>> By doing so, the logic seems a little bit clearer.
>> set_migratetype_isolate() handles pages in Buddy, while
>> drain_all_pages() takes care of pages in pcp.
>
>Well.  drain_all_pages() moves pages from pcp to buddy so I'm not sure
>that argument holds water.
>

You mean the wartermark?

>Can we step back a bit and ask ourselves what all these draining
>operations are actually for?  What is the intent behind each callsite? 
>Figuring that out (and perhaps even documenting it!) would help us
>decide the most appropriate places from which to perform the drain.

That is great. I found myself hard to understand current implementation.
Let me try to write down what I understand now.

Current mm subsystem manage memory with a hierarchic way.

  * Buddy system
  * pcp pageset
  * slub

With this background, we handle pages differently for different layer.

  * set_migratetype_isolate() handle pages still in Buddy system.
  * drain_all_pages() handle pages in pcp pageset.
  * I don't know who handle pages in slub.

While there are still pages out there, eg. page table, file pages, I
don't understand how they are handled during offline. Especially, how to
catch them all in a specific range.

Now go back to this patch. 

   __offline_pages()/alloc_contig_range()
       start_isolate_page_range()
           set_migratetype_isolate()
               drain_all_pages()
       drain_all_pages()

start_isolate_page_range() will iterate a range with pageblock step to
isolate them. Since both __offline_pages() and alloc_contig_range()
require this range to be in the same zone, drain_all_pages() will drain
the pcp pageset of the same zone several times. After that,
drain_all_pages() will be called again to drain pages.

One thing we can notice is after set_migratetype_isolate() for a
particular range, this range's page will not be available for
allocation. But the pages after this range still has a chance to be put
on pcp pageset. And during this process, pages of the same zone but out
of the whole range could be put on the pcp pageset. This means current
implementation would drain those pages several times and may increase
contention for this zone.

This behavior seems suboptimal. And we can do this just in once to drain
all of them.
Wei Yang Dec. 14, 2018, 3:17 p.m. UTC | #3
On Thu, Dec 13, 2018 at 07:57:12PM -0800, Andrew Morton wrote:
>On Fri, 14 Dec 2018 10:39:12 +0800 Wei Yang <richard.weiyang@gmail.com> wrote:
>
>> Below is a brief call flow for __offline_pages()
>
>Offtopic...
>
>set_migratetype_isolate() has the comment
>
>	/*
>	 * immobile means "not-on-lru" pages. If immobile is larger than
>	 * removable-by-driver pages reported by notifier, we'll fail.
>	 */
>
>what the heck does that mean?  It used to talk about unmovable pages,
>but this was mysteriously changed to use the unique term "immobile" by
>Minchan's ee6f509c32 ("mm: factor out memory isolate functions"). 
>Could someone please take a look?
>
>
>> and
>> alloc_contig_range():
>> 
>>   __offline_pages()/alloc_contig_range()
>>       start_isolate_page_range()
>>           set_migratetype_isolate()
>>               drain_all_pages()
>>       drain_all_pages()
>> 
>> Since set_migratetype_isolate() is only used in
>> start_isolate_page_range(), which is just used in __offline_pages() and
>> alloc_contig_range(). And both of them call drain_all_pages() if every
>> check looks good. This means it is not necessary call drain_all_pages()
>> in each iteration of set_migratetype_isolate().
>>
>> By doing so, the logic seems a little bit clearer.
>> set_migratetype_isolate() handles pages in Buddy, while
>> drain_all_pages() takes care of pages in pcp.
>
>Well.  drain_all_pages() moves pages from pcp to buddy so I'm not sure
>that argument holds water.
>
>Can we step back a bit and ask ourselves what all these draining
>operations are actually for?  What is the intent behind each callsite? 
>Figuring that out (and perhaps even documenting it!) would help us
>decide the most appropriate places from which to perform the drain.

With some rethinking we even could take drain_all_pages() out of the
repeat loop. Because after isolation, the page in this range will not be
put to pcp pageset. So we just need to drain pages once.

The change may look like this.

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 6910e0eea074..120e9fdfd055 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1590,6 +1590,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
        if (ret)
                goto failed_removal;

+       drain_all_pages(zone);
        pfn = start_pfn;
 repeat:
        /* start memory hot removal */
@@ -1599,7 +1600,6 @@ static int __ref __offline_pages(unsigned long start_pfn,

        cond_resched();
        lru_add_drain_all();
-       drain_all_pages(zone);

        pfn = scan_movable_pages(start_pfn, end_pfn);
        if (pfn) { /* We have movable pages */
Michal Hocko Dec. 17, 2018, 12:21 p.m. UTC | #4
On Fri 14-12-18 15:17:56, Wei Yang wrote:
> On Thu, Dec 13, 2018 at 07:57:12PM -0800, Andrew Morton wrote:
> >On Fri, 14 Dec 2018 10:39:12 +0800 Wei Yang <richard.weiyang@gmail.com> wrote:
> >
> >> Below is a brief call flow for __offline_pages()
> >
> >Offtopic...
> >
> >set_migratetype_isolate() has the comment
> >
> >	/*
> >	 * immobile means "not-on-lru" pages. If immobile is larger than
> >	 * removable-by-driver pages reported by notifier, we'll fail.
> >	 */
> >
> >what the heck does that mean?  It used to talk about unmovable pages,
> >but this was mysteriously changed to use the unique term "immobile" by
> >Minchan's ee6f509c32 ("mm: factor out memory isolate functions"). 
> >Could someone please take a look?
> >
> >
> >> and
> >> alloc_contig_range():
> >> 
> >>   __offline_pages()/alloc_contig_range()
> >>       start_isolate_page_range()
> >>           set_migratetype_isolate()
> >>               drain_all_pages()
> >>       drain_all_pages()
> >> 
> >> Since set_migratetype_isolate() is only used in
> >> start_isolate_page_range(), which is just used in __offline_pages() and
> >> alloc_contig_range(). And both of them call drain_all_pages() if every
> >> check looks good. This means it is not necessary call drain_all_pages()
> >> in each iteration of set_migratetype_isolate().
> >>
> >> By doing so, the logic seems a little bit clearer.
> >> set_migratetype_isolate() handles pages in Buddy, while
> >> drain_all_pages() takes care of pages in pcp.
> >
> >Well.  drain_all_pages() moves pages from pcp to buddy so I'm not sure
> >that argument holds water.
> >
> >Can we step back a bit and ask ourselves what all these draining
> >operations are actually for?  What is the intent behind each callsite? 
> >Figuring that out (and perhaps even documenting it!) would help us
> >decide the most appropriate places from which to perform the drain.
> 
> With some rethinking we even could take drain_all_pages() out of the
> repeat loop. Because after isolation, the page in this range will not be
> put to pcp pageset. So we just need to drain pages once.
> 
> The change may look like this.

No, this is incorrect. Draining pcp lists before scan_movable_pages is
most likely sub-optimal, because scan_movable_pages will simply ignore
pages being on the pcp lists. But we definitely want to drain before we
terminate the offlining phase because we do not want to have isolated
pages on those lists before we allow the final hotremove.

The way how we retry the migration loop until there is no page in use
just guarantees that drain_all_pages is called. If you put it out of the
loop then you just break that assumption. Moving drain_all_pages down
after the migration is done should work well AFAICS but I didn't really
think through all potential side effects nor have time to do so now.

> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 6910e0eea074..120e9fdfd055 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1590,6 +1590,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
>         if (ret)
>                 goto failed_removal;
> 
> +       drain_all_pages(zone);
>         pfn = start_pfn;
>  repeat:
>         /* start memory hot removal */
> @@ -1599,7 +1600,6 @@ static int __ref __offline_pages(unsigned long start_pfn,
> 
>         cond_resched();
>         lru_add_drain_all();
> -       drain_all_pages(zone);
> 
>         pfn = scan_movable_pages(start_pfn, end_pfn);
>         if (pfn) { /* We have movable pages */
> 
> -- 
> Wei Yang
> Help you, Help me
Michal Hocko Dec. 17, 2018, 12:25 p.m. UTC | #5
On Fri 14-12-18 10:39:12, Wei Yang wrote:
> Below is a brief call flow for __offline_pages() and
> alloc_contig_range():
> 
>   __offline_pages()/alloc_contig_range()
>       start_isolate_page_range()
>           set_migratetype_isolate()
>               drain_all_pages()
>       drain_all_pages()
> 
> Since set_migratetype_isolate() is only used in
> start_isolate_page_range(), which is just used in __offline_pages() and
> alloc_contig_range(). And both of them call drain_all_pages() if every
> check looks good. This means it is not necessary call drain_all_pages()
> in each iteration of set_migratetype_isolate().
> 
> By doing so, the logic seems a little bit clearer.
> set_migratetype_isolate() handles pages in Buddy, while
> drain_all_pages() takes care of pages in pcp.

I have to confess I am not sure about the purpose of the draining here.
I suspect it is to make sure that pages in the pcp lists really get
isolated and if that is the case then it makes sense.

In any case I strongly suggest not touching this code without a very
good explanation on why this is not needed. Callers do XYZ is not a
proper explanation because assumes that all callers will know that this
has to be done. So either we really need to drain and then it is better
to make it here or we don't but that requires some explanation.

> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
> ---
>  mm/page_isolation.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index 43e085608846..f44c0e333bed 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -83,8 +83,6 @@ static int set_migratetype_isolate(struct page *page, int migratetype,
>  	}
>  
>  	spin_unlock_irqrestore(&zone->lock, flags);
> -	if (!ret)
> -		drain_all_pages(zone);
>  	return ret;
>  }
>  
> -- 
> 2.15.1
>
Wei Yang Dec. 17, 2018, 3:08 p.m. UTC | #6
On Mon, Dec 17, 2018 at 01:25:23PM +0100, Michal Hocko wrote:
>On Fri 14-12-18 10:39:12, Wei Yang wrote:
>> Below is a brief call flow for __offline_pages() and
>> alloc_contig_range():
>> 
>>   __offline_pages()/alloc_contig_range()
>>       start_isolate_page_range()
>>           set_migratetype_isolate()
>>               drain_all_pages()
>>       drain_all_pages()
>> 
>> Since set_migratetype_isolate() is only used in
>> start_isolate_page_range(), which is just used in __offline_pages() and
>> alloc_contig_range(). And both of them call drain_all_pages() if every
>> check looks good. This means it is not necessary call drain_all_pages()
>> in each iteration of set_migratetype_isolate().
>> 
>> By doing so, the logic seems a little bit clearer.
>> set_migratetype_isolate() handles pages in Buddy, while
>> drain_all_pages() takes care of pages in pcp.
>
>I have to confess I am not sure about the purpose of the draining here.
>I suspect it is to make sure that pages in the pcp lists really get
>isolated and if that is the case then it makes sense.
>
>In any case I strongly suggest not touching this code without a very
>good explanation on why this is not needed. Callers do XYZ is not a
>proper explanation because assumes that all callers will know that this
>has to be done. So either we really need to drain and then it is better
>to make it here or we don't but that requires some explanation.
>

Yep, let me try to explain what is trying to do.

Based on my understanding, online_pages do two things

    * adjust zone/pgdat status
    * put pages into Buddy

Generally, offline_pages do the reverse

    * take pages out of Buddy
    * adjust zone/pgdat status

While it is not that easy to take pages out of Buddy, since pages are

    * pcp list
    * slub
    * other usage

This means before taking a page out of Buddy, we need to return it first
to Buddy.

Current implementation is interesting by introducing migrate type. By
setting migrate type to MIGRATE_ISOLATE, this range of pages will never
be allocated from Buddy. And every page returned in this range will
never be touched by Buddy.

Function start_isolate_page_range() just do this.

Then let's focus on the pcp list. This is a little bit different
than other allocated pages. These are actually "partially" allocated
pages. They are not counted in Buddy Free pages, either no real use. So
we have two choice to get back those pages:

    * wait until it is allocated to a real user and wait for return
    * or drain them directly

Current implementation take 2nd approach.

Then we can see there are also two way to drain them:

    * drain them range by range
    * drain them in a whole range

Both looks good, but not necessary to do them both. Because after we set
a pageblock migrate type to MIGRATE_ISOLATE, pages in this range will
never be allocated nor be put on pcp list. So after we drain one
particular range, it is not necessary to drain this range again.

The reason why I choose to drain them in a whole range is current
drain_all_pages() just carry zone information. For example, a zone may
have 1G while a pageblock is 128M. The pageblock is 1/8 of this zone.
This means in case there are 8 pages on pcp list, only 1 page drained by
drain_all_pages belongs to this pageblock. But we drain other 7 healthy
pages.

 CPU1 pcp list                            CPU2 pcp list

 +---------------+                        +---------------+ 
 |A1  B3  C8  F6 |                        |E1  G3  D8  B6 |
 +---------------+                        +---------------+


   A         B         C         D         E         F         G
  +---------+---------+---------+---------+---------+---------+---------+
  |012345678|         |         |         |         |         |         |
  +---------+---------+---------+---------+---------+---------+---------+
                                |<-pgblk->|
  |<-                              Zone                               ->|


This is a chart for illustration. In case we want to isolate pgblk D,
while zone pcp list has 8 pages and only one belongs to this pgblk D.
This means the drain on pgblk base has much side effect. And with one
drain on each pgblk, this may increase the contention on this zone.

Well, another approach is to enable drain_all_pages() with exact range
information. But neither approach needs to do them both.
Michal Hocko Dec. 17, 2018, 3:48 p.m. UTC | #7
On Mon 17-12-18 15:08:19, Wei Yang wrote:
> On Mon, Dec 17, 2018 at 01:25:23PM +0100, Michal Hocko wrote:
> >On Fri 14-12-18 10:39:12, Wei Yang wrote:
> >> Below is a brief call flow for __offline_pages() and
> >> alloc_contig_range():
> >> 
> >>   __offline_pages()/alloc_contig_range()
> >>       start_isolate_page_range()
> >>           set_migratetype_isolate()
> >>               drain_all_pages()
> >>       drain_all_pages()
> >> 
> >> Since set_migratetype_isolate() is only used in
> >> start_isolate_page_range(), which is just used in __offline_pages() and
> >> alloc_contig_range(). And both of them call drain_all_pages() if every
> >> check looks good. This means it is not necessary call drain_all_pages()
> >> in each iteration of set_migratetype_isolate().
> >> 
> >> By doing so, the logic seems a little bit clearer.
> >> set_migratetype_isolate() handles pages in Buddy, while
> >> drain_all_pages() takes care of pages in pcp.
> >
> >I have to confess I am not sure about the purpose of the draining here.
> >I suspect it is to make sure that pages in the pcp lists really get
> >isolated and if that is the case then it makes sense.
> >
> >In any case I strongly suggest not touching this code without a very
> >good explanation on why this is not needed. Callers do XYZ is not a
> >proper explanation because assumes that all callers will know that this
> >has to be done. So either we really need to drain and then it is better
> >to make it here or we don't but that requires some explanation.
> >
> 
> Yep, let me try to explain what is trying to do.
> 
> Based on my understanding, online_pages do two things
> 
>     * adjust zone/pgdat status
>     * put pages into Buddy
> 
> Generally, offline_pages do the reverse
> 
>     * take pages out of Buddy
>     * adjust zone/pgdat status
> 
> While it is not that easy to take pages out of Buddy, since pages are
> 
>     * pcp list
>     * slub
>     * other usage
> 
> This means before taking a page out of Buddy, we need to return it first
> to Buddy.
> 
> Current implementation is interesting by introducing migrate type. By
> setting migrate type to MIGRATE_ISOLATE, this range of pages will never
> be allocated from Buddy. And every page returned in this range will
> never be touched by Buddy.
> 
> Function start_isolate_page_range() just do this.
> 
> Then let's focus on the pcp list. This is a little bit different
> than other allocated pages. These are actually "partially" allocated
> pages. They are not counted in Buddy Free pages, either no real use. So
> we have two choice to get back those pages:
> 
>     * wait until it is allocated to a real user and wait for return
>     * or drain them directly
> 
> Current implementation take 2nd approach.
> 
> Then we can see there are also two way to drain them:
> 
>     * drain them range by range
>     * drain them in a whole range
> 
> Both looks good, but not necessary to do them both. Because after we set
> a pageblock migrate type to MIGRATE_ISOLATE, pages in this range will
> never be allocated nor be put on pcp list. So after we drain one
> particular range, it is not necessary to drain this range again.

OK, this is an important point and actually the argument that i am
wrong. I have missed that free_unref_page_commit skips pcp lists for
MIGRATE_ISOLATE (resp. all migrate types above MIGRATE_PCPTYPES).
Then you are right that we are OK to drain the zone only once _after_ we
have isolated the full range.

So please send a new patch with this clarification in the changelog and
I will ack it.
 
> The reason why I choose to drain them in a whole range is current
> drain_all_pages() just carry zone information. For example, a zone may
> have 1G while a pageblock is 128M. The pageblock is 1/8 of this zone.
> This means in case there are 8 pages on pcp list, only 1 page drained by
> drain_all_pages belongs to this pageblock. But we drain other 7 healthy
> pages.
> 
>  CPU1 pcp list                            CPU2 pcp list
> 
>  +---------------+                        +---------------+ 
>  |A1  B3  C8  F6 |                        |E1  G3  D8  B6 |
>  +---------------+                        +---------------+
> 
> 
>    A         B         C         D         E         F         G
>   +---------+---------+---------+---------+---------+---------+---------+
>   |012345678|         |         |         |         |         |         |
>   +---------+---------+---------+---------+---------+---------+---------+
>                                 |<-pgblk->|
>   |<-                              Zone                               ->|
> 
> 
> This is a chart for illustration. In case we want to isolate pgblk D,
> while zone pcp list has 8 pages and only one belongs to this pgblk D.
> This means the drain on pgblk base has much side effect. And with one
> drain on each pgblk, this may increase the contention on this zone.
> 
> Well, another approach is to enable drain_all_pages() with exact range
> information. But neither approach needs to do them both.

Is this actually worth the additional complexity? Have you seen an
actual workload that would benefit from that?
Wei Yang Dec. 18, 2018, 2:44 p.m. UTC | #8
On Mon, Dec 17, 2018 at 04:48:12PM +0100, Michal Hocko wrote:
>On Mon 17-12-18 15:08:19, Wei Yang wrote:
>> On Mon, Dec 17, 2018 at 01:25:23PM +0100, Michal Hocko wrote:
>> >On Fri 14-12-18 10:39:12, Wei Yang wrote:
>> >> Below is a brief call flow for __offline_pages() and
>> >> alloc_contig_range():
>> >> 
>> >>   __offline_pages()/alloc_contig_range()
>> >>       start_isolate_page_range()
>> >>           set_migratetype_isolate()
>> >>               drain_all_pages()
>> >>       drain_all_pages()
>> >> 
>> >> Since set_migratetype_isolate() is only used in
>> >> start_isolate_page_range(), which is just used in __offline_pages() and
>> >> alloc_contig_range(). And both of them call drain_all_pages() if every
>> >> check looks good. This means it is not necessary call drain_all_pages()
>> >> in each iteration of set_migratetype_isolate().
>> >> 
>> >> By doing so, the logic seems a little bit clearer.
>> >> set_migratetype_isolate() handles pages in Buddy, while
>> >> drain_all_pages() takes care of pages in pcp.
>> >
>> >I have to confess I am not sure about the purpose of the draining here.
>> >I suspect it is to make sure that pages in the pcp lists really get
>> >isolated and if that is the case then it makes sense.
>> >
>> >In any case I strongly suggest not touching this code without a very
>> >good explanation on why this is not needed. Callers do XYZ is not a
>> >proper explanation because assumes that all callers will know that this
>> >has to be done. So either we really need to drain and then it is better
>> >to make it here or we don't but that requires some explanation.
>> >
>> 
>> Yep, let me try to explain what is trying to do.
>> 
>> Based on my understanding, online_pages do two things
>> 
>>     * adjust zone/pgdat status
>>     * put pages into Buddy
>> 
>> Generally, offline_pages do the reverse
>> 
>>     * take pages out of Buddy
>>     * adjust zone/pgdat status
>> 
>> While it is not that easy to take pages out of Buddy, since pages are
>> 
>>     * pcp list
>>     * slub
>>     * other usage
>> 
>> This means before taking a page out of Buddy, we need to return it first
>> to Buddy.
>> 
>> Current implementation is interesting by introducing migrate type. By
>> setting migrate type to MIGRATE_ISOLATE, this range of pages will never
>> be allocated from Buddy. And every page returned in this range will
>> never be touched by Buddy.
>> 
>> Function start_isolate_page_range() just do this.
>> 
>> Then let's focus on the pcp list. This is a little bit different
>> than other allocated pages. These are actually "partially" allocated
>> pages. They are not counted in Buddy Free pages, either no real use. So
>> we have two choice to get back those pages:
>> 
>>     * wait until it is allocated to a real user and wait for return
>>     * or drain them directly
>> 
>> Current implementation take 2nd approach.
>> 
>> Then we can see there are also two way to drain them:
>> 
>>     * drain them range by range
>>     * drain them in a whole range
>> 
>> Both looks good, but not necessary to do them both. Because after we set
>> a pageblock migrate type to MIGRATE_ISOLATE, pages in this range will
>> never be allocated nor be put on pcp list. So after we drain one
>> particular range, it is not necessary to drain this range again.
>
>OK, this is an important point and actually the argument that i am
>wrong. I have missed that free_unref_page_commit skips pcp lists for
>MIGRATE_ISOLATE (resp. all migrate types above MIGRATE_PCPTYPES).
>Then you are right that we are OK to drain the zone only once _after_ we
>have isolated the full range.
>
>So please send a new patch with this clarification in the changelog and
>I will ack it.
> 
>> The reason why I choose to drain them in a whole range is current
>> drain_all_pages() just carry zone information. For example, a zone may
>> have 1G while a pageblock is 128M. The pageblock is 1/8 of this zone.
>> This means in case there are 8 pages on pcp list, only 1 page drained by
>> drain_all_pages belongs to this pageblock. But we drain other 7 healthy
>> pages.
>> 
>>  CPU1 pcp list                            CPU2 pcp list
>> 
>>  +---------------+                        +---------------+ 
>>  |A1  B3  C8  F6 |                        |E1  G3  D8  B6 |
>>  +---------------+                        +---------------+
>> 
>> 
>>    A         B         C         D         E         F         G
>>   +---------+---------+---------+---------+---------+---------+---------+
>>   |012345678|         |         |         |         |         |         |
>>   +---------+---------+---------+---------+---------+---------+---------+
>>                                 |<-pgblk->|
>>   |<-                              Zone                               ->|
>> 
>> 
>> This is a chart for illustration. In case we want to isolate pgblk D,
>> while zone pcp list has 8 pages and only one belongs to this pgblk D.
>> This means the drain on pgblk base has much side effect. And with one
>> drain on each pgblk, this may increase the contention on this zone.
>> 
>> Well, another approach is to enable drain_all_pages() with exact range
>> information. But neither approach needs to do them both.
>
>Is this actually worth the additional complexity? Have you seen an
>actual workload that would benefit from that?

No, I just mention the possible approach in my mind. While currently
drain pcp list once is enough.

I will prepare v2 with more detailed changelog with migratetype thing.

>-- 
>Michal Hocko
>SUSE Labs
Wei Yang Dec. 18, 2018, 8:48 p.m. UTC | #9
On Mon, Dec 17, 2018 at 01:21:32PM +0100, Michal Hocko wrote:
>On Fri 14-12-18 15:17:56, Wei Yang wrote:
>> On Thu, Dec 13, 2018 at 07:57:12PM -0800, Andrew Morton wrote:
>> >On Fri, 14 Dec 2018 10:39:12 +0800 Wei Yang <richard.weiyang@gmail.com> wrote:
>> >
>> >> Below is a brief call flow for __offline_pages()
>> >
>> >Offtopic...
>> >
>> >set_migratetype_isolate() has the comment
>> >
>> >	/*
>> >	 * immobile means "not-on-lru" pages. If immobile is larger than
>> >	 * removable-by-driver pages reported by notifier, we'll fail.
>> >	 */
>> >
>> >what the heck does that mean?  It used to talk about unmovable pages,
>> >but this was mysteriously changed to use the unique term "immobile" by
>> >Minchan's ee6f509c32 ("mm: factor out memory isolate functions"). 
>> >Could someone please take a look?
>> >
>> >
>> >> and
>> >> alloc_contig_range():
>> >> 
>> >>   __offline_pages()/alloc_contig_range()
>> >>       start_isolate_page_range()
>> >>           set_migratetype_isolate()
>> >>               drain_all_pages()
>> >>       drain_all_pages()
>> >> 
>> >> Since set_migratetype_isolate() is only used in
>> >> start_isolate_page_range(), which is just used in __offline_pages() and
>> >> alloc_contig_range(). And both of them call drain_all_pages() if every
>> >> check looks good. This means it is not necessary call drain_all_pages()
>> >> in each iteration of set_migratetype_isolate().
>> >>
>> >> By doing so, the logic seems a little bit clearer.
>> >> set_migratetype_isolate() handles pages in Buddy, while
>> >> drain_all_pages() takes care of pages in pcp.
>> >
>> >Well.  drain_all_pages() moves pages from pcp to buddy so I'm not sure
>> >that argument holds water.
>> >
>> >Can we step back a bit and ask ourselves what all these draining
>> >operations are actually for?  What is the intent behind each callsite? 
>> >Figuring that out (and perhaps even documenting it!) would help us
>> >decide the most appropriate places from which to perform the drain.
>> 
>> With some rethinking we even could take drain_all_pages() out of the
>> repeat loop. Because after isolation, the page in this range will not be
>> put to pcp pageset. So we just need to drain pages once.
>> 
>> The change may look like this.
>
>No, this is incorrect. Draining pcp lists before scan_movable_pages is
>most likely sub-optimal, because scan_movable_pages will simply ignore
>pages being on the pcp lists. But we definitely want to drain before we
>terminate the offlining phase because we do not want to have isolated
>pages on those lists before we allow the final hotremove.
>
>The way how we retry the migration loop until there is no page in use
>just guarantees that drain_all_pages is called. If you put it out of the
>loop then you just break that assumption. Moving drain_all_pages down
>after the migration is done should work well AFAICS but I didn't really
>think through all potential side effects nor have time to do so now.
>

After discussion, do you agree with this proposal?

>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 6910e0eea074..120e9fdfd055 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1590,6 +1590,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
>>         if (ret)
>>                 goto failed_removal;
>> 
>> +       drain_all_pages(zone);
>>         pfn = start_pfn;
>>  repeat:
>>         /* start memory hot removal */
>> @@ -1599,7 +1600,6 @@ static int __ref __offline_pages(unsigned long start_pfn,
>> 
>>         cond_resched();
>>         lru_add_drain_all();
>> -       drain_all_pages(zone);
>> 
>>         pfn = scan_movable_pages(start_pfn, end_pfn);
>>         if (pfn) { /* We have movable pages */
>> 
>> -- 
>> Wei Yang
>> Help you, Help me
>
>-- 
>Michal Hocko
>SUSE Labs
diff mbox series

Patch

diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 43e085608846..f44c0e333bed 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -83,8 +83,6 @@  static int set_migratetype_isolate(struct page *page, int migratetype,
 	}
 
 	spin_unlock_irqrestore(&zone->lock, flags);
-	if (!ret)
-		drain_all_pages(zone);
 	return ret;
 }