[RFC,v4,06/13] mm: Allow to offline unmovable PageOffline() pages via MEM_GOING_OFFLINE

Message ID 20191212171137.13872-7-david@redhat.com (mailing list archive)
State New, archived
Series virtio-mem: paravirtualized memory

Commit Message

David Hildenbrand Dec. 12, 2019, 5:11 p.m. UTC
virtio-mem wants to offline memory blocks of which some parts were
unplugged (allocated via alloc_contig_range()), in particular so that
completely unplugged memory blocks can later be offlined and removed. The
important part is that PageOffline() has to remain set until the section is
offline, so these pages will never get accessed (e.g., when dumping). The
pages should not be handed back to the buddy (which would require clearing
PageOffline() and result in issues if offlining fails and the pages are
suddenly in the buddy).

Let's allow that by permitting isolation of any PageOffline() page when
offlining. This way, we can reach the memory hotplug notifier
MEM_GOING_OFFLINE, where the driver can signal that it is fine with
offlining this page by dropping its reference count. PageOffline() pages
with a reference count of 0 can then be skipped when offlining the
pages (as if they were free, although they are not in the buddy).

Anybody who uses PageOffline() pages and does not agree to offline them
(e.g., Hyper-V balloon, XEN balloon, VMware balloon for 2MB pages) will not
decrement the reference count, which makes offlining fail when trying to
migrate such an unmovable page. So there should be no observable change.
The same applies to balloon compaction users (movable PageOffline() pages):
the pages will simply be migrated.

Note 1: If offlining fails, a driver has to increment the reference
	count again in MEM_CANCEL_OFFLINE.

Note 2: A driver that makes use of this has to be aware that re-onlining
	the memory block has to be handled by hooking into onlining code
	(online_page_callback_t), setting the pages PageOffline() again and
	not giving them to the buddy.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/page-flags.h | 10 ++++++++++
 mm/memory_hotplug.c        | 41 ++++++++++++++++++++++++++++----------
 mm/page_alloc.c            | 24 ++++++++++++++++++++++
 mm/page_isolation.c        |  9 +++++++++
 4 files changed, 74 insertions(+), 10 deletions(-)
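
The reference-count handshake described in the commit message can be sketched in plain userspace C. This is only an illustrative model, not kernel code: struct fake_page, driver_mem_notify() and offline_pages_skippable() are hypothetical names, and the two enum values merely mirror the notifier actions named above. It assumes the driver holds exactly one reference on each of its unplugged pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins; these only mirror the kernel names and are hypothetical. */
enum mem_action { MEM_GOING_OFFLINE, MEM_CANCEL_OFFLINE };

struct fake_page {
	bool offline;	/* models PageOffline() */
	int refcount;	/* models page_count() */
};

/*
 * Hypothetical driver callback: drop the driver's reference on its own
 * PageOffline() pages when offlining starts, and take it back if
 * offlining is cancelled.
 */
static void driver_mem_notify(enum mem_action action,
			      struct fake_page *pages, size_t nr)
{
	for (size_t i = 0; i < nr; i++) {
		if (!pages[i].offline)
			continue;
		if (action == MEM_GOING_OFFLINE) {
			assert(pages[i].refcount == 1);
			pages[i].refcount--;
		} else {
			assert(pages[i].refcount == 0);
			pages[i].refcount++;
		}
	}
}

/* True if no PageOffline() page in the range still holds a reference. */
static bool offline_pages_skippable(const struct fake_page *pages, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		if (pages[i].offline && pages[i].refcount != 0)
			return false;
	return true;
}
```

In this model, a driver that never reacts to MEM_GOING_OFFLINE leaves the reference count at 1, so offline_pages_skippable() stays false and offlining would fail as before.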

Comments

Alexander Duyck Feb. 25, 2020, 6:26 p.m. UTC | #1
On Thu, 2019-12-12 at 18:11 +0100, David Hildenbrand wrote:
> virtio-mem wants to offline memory blocks of which some parts were
> unplugged (allocated via alloc_contig_range()), in particular so that
> completely unplugged memory blocks can later be offlined and removed. The
> important part is that PageOffline() has to remain set until the section is
> offline, so these pages will never get accessed (e.g., when dumping). The
> pages should not be handed back to the buddy (which would require clearing
> PageOffline() and result in issues if offlining fails and the pages are
> suddenly in the buddy).
> 
> Let's allow that by permitting isolation of any PageOffline() page when
> offlining. This way, we can reach the memory hotplug notifier
> MEM_GOING_OFFLINE, where the driver can signal that it is fine with
> offlining this page by dropping its reference count. PageOffline() pages
> with a reference count of 0 can then be skipped when offlining the
> pages (as if they were free, although they are not in the buddy).
> 
> Anybody who uses PageOffline() pages and does not agree to offline them
> (e.g., Hyper-V balloon, XEN balloon, VMware balloon for 2MB pages) will not
> decrement the reference count, which makes offlining fail when trying to
> migrate such an unmovable page. So there should be no observable change.
> The same applies to balloon compaction users (movable PageOffline() pages):
> the pages will simply be migrated.
> 
> Note 1: If offlining fails, a driver has to increment the reference
> 	count again in MEM_CANCEL_OFFLINE.
> 
> Note 2: A driver that makes use of this has to be aware that re-onlining
> 	the memory block has to be handled by hooking into onlining code
> 	(online_page_callback_t), setting the pages PageOffline() again and
> 	not giving them to the buddy.
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Qian Cai <cai@lca.pw>
> Cc: Pingfan Liu <kernelfans@gmail.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  include/linux/page-flags.h | 10 ++++++++++
>  mm/memory_hotplug.c        | 41 ++++++++++++++++++++++++++++----------
>  mm/page_alloc.c            | 24 ++++++++++++++++++++++
>  mm/page_isolation.c        |  9 +++++++++
>  4 files changed, 74 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 1bf83c8fcaa7..ac1775082343 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -761,6 +761,16 @@ PAGE_TYPE_OPS(Buddy, buddy)
>   * not onlined when onlining the section).
>   * The content of these pages is effectively stale. Such pages should not
>   * be touched (read/write/dump/save) except by their owner.
> + *
> + * If a driver wants to allow offlining unmovable PageOffline() pages without
> + * putting them back to the buddy, it can do so via the memory notifier by
> + * decrementing the reference count in MEM_GOING_OFFLINE and incrementing the
> + * reference count in MEM_CANCEL_OFFLINE. When offlining, the PageOffline()
> + * pages (now with a reference count of zero) are treated like free pages,
> + * allowing the containing memory block to get offlined. A driver that
> + * relies on this feature is aware that re-onlining the memory block will
> + * require re-setting the pages PageOffline() and not giving them to the
> + * buddy via online_page_callback_t.
>   */
>  PAGE_TYPE_OPS(Offline, offline)
>  
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index fc617ad6f035..da01453a04e6 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1231,11 +1231,15 @@ int test_pages_in_a_zone(unsigned long start_pfn, unsigned long end_pfn,
>  
>  /*
>   * Scan pfn range [start,end) to find movable/migratable pages (LRU pages,
> - * non-lru movable pages and hugepages). We scan pfn because it's much
> - * easier than scanning over linked list. This function returns the pfn
> - * of the first found movable page if it's found, otherwise 0.
> + * non-lru movable pages and hugepages).
> + *
> + * Returns:
> + *	0 in case a movable page is found and movable_pfn was updated.
> + *	-ENOENT in case no movable page was found.
> + *	-EBUSY in case a definitely unmovable page was found.
>   */
> -static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
> +static int scan_movable_pages(unsigned long start, unsigned long end,
> +			      unsigned long *movable_pfn)
>  {
>  	unsigned long pfn;
>  
> @@ -1247,18 +1251,29 @@ static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
>  			continue;
>  		page = pfn_to_page(pfn);
>  		if (PageLRU(page))
> -			return pfn;
> +			goto found;
>  		if (__PageMovable(page))
> -			return pfn;
> +			goto found;
> +
> +		/*
> +		 * Unmovable PageOffline() pages where somebody still holds
> +		 * a reference count (after MEM_GOING_OFFLINE) can definitely
> +		 * not be offlined.
> +		 */
> +		if (PageOffline(page) && page_count(page))
> +			return -EBUSY;

So the comment confused me a bit because technically this function isn't
about offlining memory, it is about finding movable pages. I had to do a
bit of digging to find the only consumer is __offline_pages, but if we are
going to talk about "offlining" instead of "moving" in this function it
might make sense to rename it.

>  
>  		if (!PageHuge(page))
>  			continue;
>  		head = compound_head(page);
>  		if (page_huge_active(head))
> -			return pfn;
> +			goto found;
>  		skip = compound_nr(head) - (page - head);
>  		pfn += skip - 1;
>  	}
> +	return -ENOENT;
> +found:
> +	*movable_pfn = pfn;
>  	return 0;
>  }

So I am looking at this function and it seems like your change completely
changes the behavior. The code before would walk the entire range and if
at least 1 page was available to move you would return the PFN of that
page. Now what seems to happen is that you will return -EBUSY as soon as
you encounter an offline page with a page count. I would think that would
slow down the offlining process since you have made the Unmovable
PageOffline() page a head of line blocker that you have to wait to get
around.

Would it perhaps make more sense to add a return value initialized to
ENOENT, and if you encounter one of these offline pages you change the
return value to EBUSY, and then if you walk through the entire list
without finding a movable page you just return the value?

Otherwise you might want to add a comment explaining why the function
should stall instead of skipping over the unmovable section that will
hopefully become movable later.

> @@ -1528,7 +1543,8 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  	}
>  
>  	do {
> -		for (pfn = start_pfn; pfn;) {
> +		pfn = start_pfn;
> +		do {
>  			if (signal_pending(current)) {
>  				ret = -EINTR;
>  				reason = "signal backoff";
> @@ -1538,14 +1554,19 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  			cond_resched();
>  			lru_add_drain_all();
>  
> -			pfn = scan_movable_pages(pfn, end_pfn);
> -			if (pfn) {
> +			ret = scan_movable_pages(pfn, end_pfn, &pfn);
> +			if (!ret) {
>  				/*
>  				 * TODO: fatal migration failures should bail
>  				 * out
>  				 */
>  				do_migrate_range(pfn, end_pfn);
>  			}
> +		} while (!ret);
> +
> +		if (ret != -ENOENT) {
> +			reason = "unmovable page";
> +			goto failed_removal_isolated;
>  		}
>  
>  		/*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5334decc9e06..840c0bbe2d9f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8256,6 +8256,19 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
>  		if ((flags & MEMORY_OFFLINE) && PageHWPoison(page))
>  			continue;
>  
> +		/*
> +		 * We treat all PageOffline() pages as movable when offlining
> +		 * to give drivers a chance to decrement their reference count
> +		 * in MEM_GOING_OFFLINE in order to signalize that these pages

You can probably just use "signal" or "indicate" instead of "signalize".

> +		 * can be offlined as there are no direct references anymore.
> +		 * For actually unmovable PageOffline() where the driver does
> +		 * not support this, we will fail later when trying to actually
> +		 * move these pages that still have a reference count > 0.
> +		 * (false negatives in this function only)
> +		 */
> +		if ((flags & MEMORY_OFFLINE) && PageOffline(page))
> +			continue;
> +
>  		if (__PageMovable(page))
>  			continue;
>  
> @@ -8683,6 +8696,17 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
>  			offlined_pages++;
>  			continue;
>  		}
> +		/*
> +		 * At this point all remaining PageOffline() pages have a
> +		 * reference count of 0 and can simply be skipped.
> +		 */
> +		if (PageOffline(page)) {
> +			BUG_ON(page_count(page));
> +			BUG_ON(PageBuddy(page));
> +			pfn++;
> +			offlined_pages++;
> +			continue;
> +		}
>  
>  		BUG_ON(page_count(page));
>  		BUG_ON(!PageBuddy(page));
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index 04ee1663cdbe..43b4dabfedc8 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -170,6 +170,7 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>   *			a bit mask)
>   *			MEMORY_OFFLINE - isolate to offline (!allocate) memory
>   *					 e.g., skip over PageHWPoison() pages
> + *					 and PageOffline() pages.
>   *			REPORT_FAILURE - report details about the failure to
>   *			isolate the range
>   *
> @@ -278,6 +279,14 @@ __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn,
>  		else if ((flags & MEMORY_OFFLINE) && PageHWPoison(page))
>  			/* A HWPoisoned page cannot be also PageBuddy */
>  			pfn++;
> +		else if ((flags & MEMORY_OFFLINE) && PageOffline(page) &&
> +			 !page_count(page))
> +			/*
> +			 * The responsible driver agreed to offline
> +			 * PageOffline() pages by dropping its reference in
> +			 * MEM_GOING_OFFLINE.
> +			 */
> +			pfn++;
>  		else
>  			break;
>  	}
David Hildenbrand Feb. 25, 2020, 6:49 p.m. UTC | #2
>>  /*
>>   * Scan pfn range [start,end) to find movable/migratable pages (LRU pages,
>> - * non-lru movable pages and hugepages). We scan pfn because it's much
>> - * easier than scanning over linked list. This function returns the pfn
>> - * of the first found movable page if it's found, otherwise 0.
>> + * non-lru movable pages and hugepages).
>> + *
>> + * Returns:
>> + *	0 in case a movable page is found and movable_pfn was updated.
>> + *	-ENOENT in case no movable page was found.
>> + *	-EBUSY in case a definitely unmovable page was found.
>>   */
>> -static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
>> +static int scan_movable_pages(unsigned long start, unsigned long end,
>> +			      unsigned long *movable_pfn)
>>  {
>>  	unsigned long pfn;
>>  
>> @@ -1247,18 +1251,29 @@ static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
>>  			continue;
>>  		page = pfn_to_page(pfn);
>>  		if (PageLRU(page))
>> -			return pfn;
>> +			goto found;
>>  		if (__PageMovable(page))
>> -			return pfn;
>> +			goto found;
>> +
>> +		/*
>> +		 * Unmovable PageOffline() pages where somebody still holds
>> +		 * a reference count (after MEM_GOING_OFFLINE) can definitely
>> +		 * not be offlined.
>> +		 */
>> +		if (PageOffline(page) && page_count(page))
>> +			return -EBUSY;
> 
> So the comment confused me a bit because technically this function isn't
> about offlining memory, it is about finding movable pages. I had to do a
> bit of digging to find the only consumer is __offline_pages, but if we are
> going to talk about "offlining" instead of "moving" in this function it
> might make sense to rename it.

Well, it's contained in memory_hotplug.c, and the only user of moving
pages around in there is offlining code :) And its job is to locate
movable pages, skip over some (temporarily? unmovable) ones and (now)
indicate definitely unmovable ones.

Any idea for a better name?
scan_movable_pages_and_stop_on_definitely_unmovable() is not so nice :)


> 
>>  
>>  		if (!PageHuge(page))
>>  			continue;
>>  		head = compound_head(page);
>>  		if (page_huge_active(head))
>> -			return pfn;
>> +			goto found;
>>  		skip = compound_nr(head) - (page - head);
>>  		pfn += skip - 1;
>>  	}
>> +	return -ENOENT;
>> +found:
>> +	*movable_pfn = pfn;
>>  	return 0;
>>  }
> 
> So I am looking at this function and it seems like your change completely
> changes the behavior. The code before would walk the entire range and if
> at least 1 page was available to move you would return the PFN of that
> page. Now what seems to happen is that you will return -EBUSY as soon as
> you encounter an offline page with a page count. I would think that would
> slow down the offlining process since you have made the Unmovable
> PageOffline() page a head of line blocker that you have to wait to get
> around.

So, the comment says "Unmovable PageOffline() pages where somebody still
holds a reference count (after MEM_GOING_OFFLINE) can definitely not be
offlined". And the doc "-EBUSY in case a definitely unmovable page was
found."

So why would this make offlining slow? Offlining will be aborted,
because offlining is not possible.

Please note that this is the exact old behavior, where isolating the
page range would have failed directly and offlining would have been
aborted early. In that case, the failure reported by the old offlining
path would have been "failure to isolate range".

Also, note that the users of PageOffline() with unmovable pages are very
rare (only balloon drivers for now).

> 
> Would it perhaps make more sense to add a return value initialized to
> ENOENT, and if you encounter one of these offline pages you change the
> return value to EBUSY, and then if you walk through the entire list
> without finding a movable page you just return the value?

Did you have a look at the context in which this function is used,
especially [1] and [2]?

> 
> Otherwise you might want to add a comment explaining why the function
> should stall instead of skipping over the unmovable section that will
> hopefully become movable later.

So we have "-EBUSY in case a definitely unmovable page was found.". Do
you have a better suggestion?

> 
>> @@ -1528,7 +1543,8 @@ static int __ref __offline_pages(unsigned long start_pfn,
>>  	}
>>  
>>  	do {
>> -		for (pfn = start_pfn; pfn;) {
>> +		pfn = start_pfn;
>> +		do {
>>  			if (signal_pending(current)) {
>>  				ret = -EINTR;
>>  				reason = "signal backoff";
>> @@ -1538,14 +1554,19 @@ static int __ref __offline_pages(unsigned long start_pfn,
>>  			cond_resched();
>>  			lru_add_drain_all();
>>  
>> -			pfn = scan_movable_pages(pfn, end_pfn);
>> -			if (pfn) {
>> +			ret = scan_movable_pages(pfn, end_pfn, &pfn);
>> +			if (!ret) {
>>  				/*
>>  				 * TODO: fatal migration failures should bail
>>  				 * out
>>  				 */
>>  				do_migrate_range(pfn, end_pfn);
>>  			}

[1] we detect a definite offlining blocker and

>> +		} while (!ret);
>> +
>> +		if (ret != -ENOENT) {
>> +			reason = "unmovable page";

[2] we abort offlining

>> +			goto failed_removal_isolated;
>>  		}
>>  
>>  		/*
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 5334decc9e06..840c0bbe2d9f 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -8256,6 +8256,19 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
>>  		if ((flags & MEMORY_OFFLINE) && PageHWPoison(page))
>>  			continue;
>>  
>> +		/*
>> +		 * We treat all PageOffline() pages as movable when offlining
>> +		 * to give drivers a chance to decrement their reference count
>> +		 * in MEM_GOING_OFFLINE in order to signalize that these pages
> 
> You can probably just use "signal" or "indicate" instead of "signalize".

Makes sense, "indicate" it is!

Thanks!
Alexander Duyck Feb. 25, 2020, 9:46 p.m. UTC | #3
On Tue, 2020-02-25 at 19:49 +0100, David Hildenbrand wrote:
> > >  /*
> > >   * Scan pfn range [start,end) to find movable/migratable pages (LRU pages,
> > > - * non-lru movable pages and hugepages). We scan pfn because it's much
> > > - * easier than scanning over linked list. This function returns the pfn
> > > - * of the first found movable page if it's found, otherwise 0.
> > > + * non-lru movable pages and hugepages).
> > > + *
> > > + * Returns:
> > > + *	0 in case a movable page is found and movable_pfn was updated.
> > > + *	-ENOENT in case no movable page was found.
> > > + *	-EBUSY in case a definitely unmovable page was found.
> > >   */
> > > -static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
> > > +static int scan_movable_pages(unsigned long start, unsigned long end,
> > > +			      unsigned long *movable_pfn)
> > >  {
> > >  	unsigned long pfn;
> > >  
> > > @@ -1247,18 +1251,29 @@ static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
> > >  			continue;
> > >  		page = pfn_to_page(pfn);
> > >  		if (PageLRU(page))
> > > -			return pfn;
> > > +			goto found;
> > >  		if (__PageMovable(page))
> > > -			return pfn;
> > > +			goto found;
> > > +
> > > +		/*
> > > +		 * Unmovable PageOffline() pages where somebody still holds
> > > +		 * a reference count (after MEM_GOING_OFFLINE) can definitely
> > > +		 * not be offlined.
> > > +		 */
> > > +		if (PageOffline(page) && page_count(page))
> > > +			return -EBUSY;
> > 
> > So the comment confused me a bit because technically this function isn't
> > about offlining memory, it is about finding movable pages. I had to do a
> > bit of digging to find the only consumer is __offline_pages, but if we are
> > going to talk about "offlining" instead of "moving" in this function it
> > might make sense to rename it.
> 
> Well, it's contained in memory_hotplug.c, and the only user of moving
> pages around in there is offlining code :) And its job is to locate
> movable pages, skip over some (temporarily? unmovable) ones and (now)
> indicate definitely unmovable ones.
> 
> Any idea for a better name?
> scan_movable_pages_and_stop_on_definitely_unmovable() is not so nice :)

I dunno. What I was getting at is that the wording here would be
clearer if you simply stated that these pages "can definitely not be
moved". Saying you cannot offline a page that is PageOffline seems kind of
redundant, then again calling it Unmovable and then saying it cannot be
moved is also redundant I suppose. In the end you don't move them, but
they can be switched to offline if the page count hits 0. When that
happens you simply end up skipping over them in the code for
__test_page_isolated_in_pageblock and __offline_isolated_pages.

> > >  
> > >  		if (!PageHuge(page))
> > >  			continue;
> > >  		head = compound_head(page);
> > >  		if (page_huge_active(head))
> > > -			return pfn;
> > > +			goto found;
> > >  		skip = compound_nr(head) - (page - head);
> > >  		pfn += skip - 1;
> > >  	}
> > > +	return -ENOENT;
> > > +found:
> > > +	*movable_pfn = pfn;
> > >  	return 0;
> > >  }
> > 
> > So I am looking at this function and it seems like your change completely
> > changes the behavior. The code before would walk the entire range and if
> > at least 1 page was available to move you would return the PFN of that
> > page. Now what seems to happen is that you will return -EBUSY as soon as
> > you encounter an offline page with a page count. I would think that would
> > slow down the offlining process since you have made the Unmovable
> > PageOffline() page a head of line blocker that you have to wait to get
> > around.
> 
> So, the comment says "Unmovable PageOffline() pages where somebody still
> holds a reference count (after MEM_GOING_OFFLINE) can definitely not be
> offlined". And the doc "-EBUSY in case a definitely unmovable page was
> found."
> 
> So why would this make offlining slow? Offlining will be aborted,
> because offlining is not possible.

Okay, my reading of the code was a bit off. In my mind I was thinking you
were effectively treating it almost like an EAGAIN and continuing the
loop. I didn't catch the part where you had rewritten the for loop as a
do-while in __offline_pages.

> Please note that this is the exact old behavior, where isolating the
> page range would have failed directly and offlining would have been
> aborted early. In that case, the failure reported by the old offlining
> path would have been "failure to isolate range".
> 
> Also, note that the users of PageOffline() with unmovable pages are very
> rare (only balloon drivers for now).
> 
> > Would it perhaps make more sense to add a return value initialized to
> > ENOENT, and if you encounter one of these offline pages you change the
> > return value to EBUSY, and then if you walk through the entire list
> > without finding a movable page you just return the value?
> 
> Did you have a look at the context in which this function is used,
> especially [1] and [2]?
> 
> > Otherwise you might want to add a comment explaining why the function
> > should stall instead of skipping over the unmovable section that will
> > hopefully become movable later.
> 
> So we have "-EBUSY in case a definitely unmovable page was found.". Do
> you have a better suggestion?
> 
> > > @@ -1528,7 +1543,8 @@ static int __ref __offline_pages(unsigned long start_pfn,
> > >  	}
> > >  
> > >  	do {
> > > -		for (pfn = start_pfn; pfn;) {
> > > +		pfn = start_pfn;
> > > +		do {
> > >  			if (signal_pending(current)) {
> > >  				ret = -EINTR;
> > >  				reason = "signal backoff";
> > > @@ -1538,14 +1554,19 @@ static int __ref __offline_pages(unsigned long start_pfn,
> > >  			cond_resched();
> > >  			lru_add_drain_all();
> > >  
> > > -			pfn = scan_movable_pages(pfn, end_pfn);
> > > -			if (pfn) {
> > > +			ret = scan_movable_pages(pfn, end_pfn, &pfn);
> > > +			if (!ret) {
> > >  				/*
> > >  				 * TODO: fatal migration failures should bail
> > >  				 * out
> > >  				 */
> > >  				do_migrate_range(pfn, end_pfn);
> > >  			}
> 
> [1] we detect a definite offlining blocker and
> 
> > > +		} while (!ret);
> > > +
> > > +		if (ret != -ENOENT) {
> > > +			reason = "unmovable page";
> 
> [2] we abort offlining
> 
> > > +			goto failed_removal_isolated;
> > >  		}
> > >  
> > >  		/*

Yeah, this is the piece I misread.  I knew the loop this was in previously
was looping when returning -ENOENT so for some reason I had it in my head
that you were still looping on -EBUSY.

So the one question I would have is whether at this point we are
guaranteed that the balloon drivers have already taken care of the page
count for all the pages they set to PageOffline? Based on the patch
description I was thinking that this was going to be looping for a while
waiting for the driver to clear the pages and then walking through them at
the end of the loop via check_pages_isolated_cb.

> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 5334decc9e06..840c0bbe2d9f 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -8256,6 +8256,19 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
> > >  		if ((flags & MEMORY_OFFLINE) && PageHWPoison(page))
> > >  			continue;
> > >  
> > > +		/*
> > > +		 * We treat all PageOffline() pages as movable when offlining
> > > +		 * to give drivers a chance to decrement their reference count
> > > +		 * in MEM_GOING_OFFLINE in order to signalize that these pages
> > 
> > You can probably just use "signal" or "indicate" instead of "signalize".
> 
> Makes sense, "indicate" it is!
> 
> Thanks!
> 

No problem.
David Hildenbrand Feb. 25, 2020, 10:19 p.m. UTC | #4
On 25.02.20 22:46, Alexander Duyck wrote:
> On Tue, 2020-02-25 at 19:49 +0100, David Hildenbrand wrote:
>>>>  /*
>>>>   * Scan pfn range [start,end) to find movable/migratable pages (LRU pages,
>>>> - * non-lru movable pages and hugepages). We scan pfn because it's much
>>>> - * easier than scanning over linked list. This function returns the pfn
>>>> - * of the first found movable page if it's found, otherwise 0.
>>>> + * non-lru movable pages and hugepages).
>>>> + *
>>>> + * Returns:
>>>> + *	0 in case a movable page is found and movable_pfn was updated.
>>>> + *	-ENOENT in case no movable page was found.
>>>> + *	-EBUSY in case a definitely unmovable page was found.
>>>>   */
>>>> -static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
>>>> +static int scan_movable_pages(unsigned long start, unsigned long end,
>>>> +			      unsigned long *movable_pfn)
>>>>  {
>>>>  	unsigned long pfn;
>>>>  
>>>> @@ -1247,18 +1251,29 @@ static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
>>>>  			continue;
>>>>  		page = pfn_to_page(pfn);
>>>>  		if (PageLRU(page))
>>>> -			return pfn;
>>>> +			goto found;
>>>>  		if (__PageMovable(page))
>>>> -			return pfn;
>>>> +			goto found;
>>>> +
>>>> +		/*
>>>> +		 * Unmovable PageOffline() pages where somebody still holds
>>>> +		 * a reference count (after MEM_GOING_OFFLINE) can definitely
>>>> +		 * not be offlined.
>>>> +		 */
>>>> +		if (PageOffline(page) && page_count(page))
>>>> +			return -EBUSY;
>>>
>>> So the comment confused me a bit because technically this function isn't
>>> about offlining memory, it is about finding movable pages. I had to do a
>>> bit of digging to find the only consumer is __offline_pages, but if we are
>>> going to talk about "offlining" instead of "moving" in this function it
>>> might make sense to rename it.
>>
>> Well, it's contained in memory_hotplug.c, and the only user of moving
>> pages around in there is offlining code :) And its job is to locate
>> movable pages, skip over some (temporarily? unmovable) ones and (now)
>> indicate definitely unmovable ones.
>>
>> Any idea for a better name?
>> scan_movable_pages_and_stop_on_definitely_unmovable() is not so nice :)
> 
> I dunno. What I was getting at is that the wording here would be
> clearer if you simply stated that these pages "can definitely not be
> moved". Saying you cannot offline a page that is PageOffline seems kind of
> redundant, then again calling it Unmovable and then saying it cannot be
> moved is also redundant I suppose. In the end you don't move them, but

So, in summary, there are
- PageOffline() pages that are movable (balloon compaction).
- PageOffline() pages that cannot be moved and cannot be offlined (e.g.,
  no balloon compaction enabled, XEN, HyperV, ...). page_count(page) > 0.
- PageOffline() pages that cannot be moved, but can be offlined.
  page_count(page) == 0.
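
The three cases in the summary above can be written down as a small decision function. This is a userspace sketch under assumptions: the enum and classify_offline_page() are hypothetical names, "movable" stands in for __PageMovable(), and "refcount" for page_count() as observed after MEM_GOING_OFFLINE.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the three kinds of PageOffline() pages. */
enum offline_page_kind {
	OFFLINE_PAGE_MOVABLE,	/* balloon compaction: gets migrated */
	OFFLINE_PAGE_BLOCKER,	/* unmovable, reference held: offlining fails */
	OFFLINE_PAGE_SKIPPABLE,	/* unmovable, refcount 0: skipped like a free page */
};

/* Classify a PageOffline() page as seen after MEM_GOING_OFFLINE. */
static enum offline_page_kind classify_offline_page(bool movable, int refcount)
{
	assert(refcount >= 0);
	if (movable)
		return OFFLINE_PAGE_MOVABLE;
	return refcount > 0 ? OFFLINE_PAGE_BLOCKER : OFFLINE_PAGE_SKIPPABLE;
}
```

Only the second kind makes offlining fail; the third kind is never moved, it is just skipped.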


> they can be switched to offline if the page count hits 0. When that
> happens you simply end up skipping over them in the code for
> __test_page_isolated_in_pageblock and __offline_isolated_pages.

Yes. The thing with the wording is that pages with (PageOffline(page) &&
!page_count(page)) can also not really be moved, but they can be skipped
when offlining. If we call that "moving them to /dev/null", then yes,
they can be moved to some degree :)

I can certainly do here e.g.,

/*
 * PageOffline() pages that are not marked __PageMovable() and have a
 * reference count > 0 (after MEM_GOING_OFFLINE) are definitely
 * unmovable. If their reference count would be 0, they could be skipped
 * when offlining memory sections.
 */

And maybe I'll add to the function doc, that unmovable pages that are
skipped in this function can include pages that can be skipped when
offlining (moving them to nirvana).

Other suggestions?

[...]

>>
>> [1] we detect a definite offlining blocker and
>>
>>>> +		} while (!ret);
>>>> +
>>>> +		if (ret != -ENOENT) {
>>>> +			reason = "unmovable page";
>>
>> [2] we abort offlining
>>
>>>> +			goto failed_removal_isolated;
>>>>  		}
>>>>  
>>>>  		/*
> 
> Yeah, this is the piece I misread.  I knew the loop this was in previously
> was looping when returning -ENOENT so for some reason I had it in my head
> that you were still looping on -EBUSY.

Ah okay, I see. Yeah, that wouldn't make sense for the use case I have :)

> 
> So the one question I would have is whether at this point we are
> guaranteed that the balloon drivers have already taken care of the page
> count for all the pages they set to PageOffline? Based on the patch
> description I was thinking that this was going to be looping for a while
> waiting for the driver to clear the pages and then walking through them at
> the end of the loop via check_pages_isolated_cb.

So, e.g., the patch description states

"Let's allow that by permitting isolation of any PageOffline() page when
offlining. This way, we can reach the memory hotplug notifier
MEM_GOING_OFFLINE, where the driver can signal that it is fine with
offlining this page by dropping its reference count."

Any balloon driver that does not allow offlining (e.g., XEN, HyperV,
virtio-balloon), will always have a refcount of (at least) 1. Drivers
that want to make use of that (esp. virtio-mem, but eventually also
HyperV), will drop their refcount via the MEM_GOING_OFFLINE call.

So yes, at this point, all applicable users were notified via
MEM_GOING_OFFLINE and had their chance to decrement the refcount. If
they didn't, offlining will be aborted.
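
A minimal userspace sketch of that handshake, assuming the driver holds exactly one reference on each of its PageOffline() pages (all names here, enum model_mem_event, struct model_page, model_driver_notify() and model_can_offline(), are made up and only mirror the notifier events and page_count()):

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the MEM_GOING_OFFLINE / MEM_CANCEL_OFFLINE notifications. */
enum model_mem_event { MODEL_GOING_OFFLINE, MODEL_CANCEL_OFFLINE };

struct model_page {
	int offline;	/* models PageOffline(page) */
	int refcount;	/* models page_count(page) */
};

/*
 * Hypothetical driver callback: drop the driver's reference when offlining
 * starts, and take it back if offlining gets cancelled.
 */
static void model_driver_notify(enum model_mem_event ev,
				struct model_page *pages, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!pages[i].offline)
			continue;
		if (ev == MODEL_GOING_OFFLINE) {
			assert(pages[i].refcount > 0);
			pages[i].refcount--;
		} else {
			pages[i].refcount++;
		}
	}
}

/* Offlining may proceed only if no PageOffline() page is still referenced. */
static int model_can_offline(const struct model_page *pages, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (pages[i].offline && pages[i].refcount)
			return 0;
	return 1;
}
```

A driver that never drops its reference keeps model_can_offline() at 0, which corresponds to the "offlining will be aborted" case.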

Thanks again!
Alexander Duyck Feb. 26, 2020, 4:27 p.m. UTC | #5
On Tue, 2020-02-25 at 23:19 +0100, David Hildenbrand wrote:
> On 25.02.20 22:46, Alexander Duyck wrote:
> > On Tue, 2020-02-25 at 19:49 +0100, David Hildenbrand wrote:
> > > > >  /*
> > > > >   * Scan pfn range [start,end) to find movable/migratable pages (LRU pages,
> > > > > - * non-lru movable pages and hugepages). We scan pfn because it's much
> > > > > - * easier than scanning over linked list. This function returns the pfn
> > > > > - * of the first found movable page if it's found, otherwise 0.
> > > > > + * non-lru movable pages and hugepages).
> > > > > + *
> > > > > + * Returns:
> > > > > + *	0 in case a movable page is found and movable_pfn was updated.
> > > > > + *	-ENOENT in case no movable page was found.
> > > > > + *	-EBUSY in case a definitely unmovable page was found.
> > > > >   */
> > > > > -static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
> > > > > +static int scan_movable_pages(unsigned long start, unsigned long end,
> > > > > +			      unsigned long *movable_pfn)
> > > > >  {
> > > > >  	unsigned long pfn;
> > > > >  
> > > > > @@ -1247,18 +1251,29 @@ static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
> > > > >  			continue;
> > > > >  		page = pfn_to_page(pfn);
> > > > >  		if (PageLRU(page))
> > > > > -			return pfn;
> > > > > +			goto found;
> > > > >  		if (__PageMovable(page))
> > > > > -			return pfn;
> > > > > +			goto found;
> > > > > +
> > > > > +		/*
> > > > > +		 * Unmovable PageOffline() pages where somebody still holds
> > > > > +		 * a reference count (after MEM_GOING_OFFLINE) can definitely
> > > > > +		 * not be offlined.
> > > > > +		 */
> > > > > +		if (PageOffline(page) && page_count(page))
> > > > > +			return -EBUSY;
> > > > 
> > > > So the comment confused me a bit because technically this function isn't
> > > > about offlining memory, it is about finding movable pages. I had to do a
> > > > bit of digging to find the only consumer is __offline_pages, but if we are
> > > > going to talk about "offlining" instead of "moving" in this function it
> > > > might make sense to rename it.
> > > 
> > > Well, it's contained in memory_hotplug.c, and the only user of moving
> > > pages around in there is offlining code :) And its job is to locate
> > > movable pages, skip over some (temporarily?) unmovable ones and (now)
> > > indicate definitely unmovable ones.
> > > 
> > > Any idea for a better name?
> > > scan_movable_pages_and_stop_on_definitely_unmovable() is not so nice :)
> > 
> > I dunno. What I was getting at is that the wording here would make it
> > clearer if you simply stated that these pages "can definitely not be
> > moved". Saying you cannot offline a page that is PageOffline seems kind of
> > redundant, then again calling it an Unmovable and then saying it cannot be
> > moved is also redundant I suppose. In the end you don't move them, but
> 
> So, in summary, there are
> - PageOffline() pages that are movable (balloon compaction).
> - PageOffline() pages that cannot be moved and cannot be offlined (e.g.,
> >   no balloon compaction enabled, XEN, HyperV, ...). page_count(page) >= 1
> - PageOffline() pages that cannot be moved, but can be offlined.
>   page_count(page) == 0.
> 
> 
> > they can be switched to offline if the page count hits 0. When that
> > happens you simply end up skipping over them in the code for
> > __test_page_isolated_in_pageblock and __offline_isolated_pages.
> 
> Yes. The thing with the wording is that pages with (PageOffline(page) &&
> !page_count(page)) can also not really be moved, but they can be skipped
> when offlining. If we call that "moving them to /dev/null", then yes,
> they can be moved to some degree :)
> 
> I can certainly do something like this here, e.g.,
> 
> /*
>  * PageOffline() pages that are not marked __PageMovable() and have a
>  * reference count > 0 (after MEM_GOING_OFFLINE) are definitely
>  * unmovable. If their reference count would be 0, they could be skipped
>  * when offlining memory sections.
>  */
> 
> And maybe I'll add to the function doc, that unmovable pages that are
> skipped in this function can include pages that can be skipped when
> offlining (moving them to nirvana).
> 
> Other suggestions?

No, this sounds good and makes it much clearer.

> [...]
> 
> > > [1] we detect a definite offlining blocker and
> > > 
> > > > > +		} while (!ret);
> > > > > +
> > > > > +		if (ret != -ENOENT) {
> > > > > +			reason = "unmovable page";
> > > 
> > > [2] we abort offlining
> > > 
> > > > > +			goto failed_removal_isolated;
> > > > >  		}
> > > > >  
> > > > >  		/*
> > 
> > Yeah, this is the piece I misread.  I knew the loop this was in previously
> > was looping when returning -ENOENT so for some reason I had it in my head
> > that you were still looping on -EBUSY.
> 
> Ah okay, I see. Yeah, that wouldn't make sense for the use case I have :)
> 
> > So the one question I would have is if at this point are we guaranteed
> > that the balloon drivers have already taken care of the page count for all
> > the pages they set to PageOffline? Based on the patch description I was
> > thinking that this was going to be looping for a while waiting for the
> > driver to clear the pages and then walking through them at the end of the
> > loop via check_pages_isolated_cb.
> 
> So, e.g., the patch description states
> 
> "Let's allow to do that by allowing to isolate any PageOffline() page
> when offlining. This way, we can reach the memory hotplug notifier
> MEM_GOING_OFFLINE, where the driver can signal that he is fine with
> offlining this page by dropping its reference count."
> 
> Any balloon driver that does not allow offlining (e.g., XEN, HyperV,
> virtio-balloon), will always have a refcount of (at least) 1. Drivers
> that want to make use of that (esp. virtio-mem, but eventually also
> HyperV), will drop their refcount via the MEM_GOING_OFFLINE call.
> 
> So yes, at this point, all applicable users were notified via
> MEM_GOING_OFFLINE and had their chance to decrement the refcount. If
> they didn't, offlining will be aborted.
> 
> Thanks again!

Thank you as well. I'm still getting up to speed on the inner workings of
much of this and so discussions such as this usually prove to be quite
beneficial for me.

Patch

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 1bf83c8fcaa7..ac1775082343 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -761,6 +761,16 @@  PAGE_TYPE_OPS(Buddy, buddy)
  * not onlined when onlining the section).
  * The content of these pages is effectively stale. Such pages should not
  * be touched (read/write/dump/save) except by their owner.
+ *
+ * If a driver wants to allow to offline unmovable PageOffline() pages without
+ * putting them back to the buddy, it can do so via the memory notifier by
+ * decrementing the reference count in MEM_GOING_OFFLINE and incrementing the
+ * reference count in MEM_CANCEL_OFFLINE. When offlining, the PageOffline()
+ * pages (now with a reference count of zero) are treated like free pages,
+ * allowing the containing memory block to get offlined. A driver that
+ * relies on this feature must be aware that re-onlining the memory block
+ * requires hooking into onlining code (online_page_callback_t) to set the
+ * pages PageOffline() again instead of giving them to the buddy.
  */
 PAGE_TYPE_OPS(Offline, offline)
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index fc617ad6f035..da01453a04e6 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1231,11 +1231,15 @@  int test_pages_in_a_zone(unsigned long start_pfn, unsigned long end_pfn,
 
 /*
  * Scan pfn range [start,end) to find movable/migratable pages (LRU pages,
- * non-lru movable pages and hugepages). We scan pfn because it's much
- * easier than scanning over linked list. This function returns the pfn
- * of the first found movable page if it's found, otherwise 0.
+ * non-lru movable pages and hugepages).
+ *
+ * Returns:
+ *	0 in case a movable page is found and movable_pfn was updated.
+ *	-ENOENT in case no movable page was found.
+ *	-EBUSY in case a definitely unmovable page was found.
  */
-static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
+static int scan_movable_pages(unsigned long start, unsigned long end,
+			      unsigned long *movable_pfn)
 {
 	unsigned long pfn;
 
@@ -1247,18 +1251,29 @@  static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
 			continue;
 		page = pfn_to_page(pfn);
 		if (PageLRU(page))
-			return pfn;
+			goto found;
 		if (__PageMovable(page))
-			return pfn;
+			goto found;
+
+		/*
+		 * Unmovable PageOffline() pages where somebody still holds
+		 * a reference count (after MEM_GOING_OFFLINE) can definitely
+		 * not be offlined.
+		 */
+		if (PageOffline(page) && page_count(page))
+			return -EBUSY;
 
 		if (!PageHuge(page))
 			continue;
 		head = compound_head(page);
 		if (page_huge_active(head))
-			return pfn;
+			goto found;
 		skip = compound_nr(head) - (page - head);
 		pfn += skip - 1;
 	}
+	return -ENOENT;
+found:
+	*movable_pfn = pfn;
 	return 0;
 }
 
@@ -1528,7 +1543,8 @@  static int __ref __offline_pages(unsigned long start_pfn,
 	}
 
 	do {
-		for (pfn = start_pfn; pfn;) {
+		pfn = start_pfn;
+		do {
 			if (signal_pending(current)) {
 				ret = -EINTR;
 				reason = "signal backoff";
@@ -1538,14 +1554,19 @@  static int __ref __offline_pages(unsigned long start_pfn,
 			cond_resched();
 			lru_add_drain_all();
 
-			pfn = scan_movable_pages(pfn, end_pfn);
-			if (pfn) {
+			ret = scan_movable_pages(pfn, end_pfn, &pfn);
+			if (!ret) {
 				/*
 				 * TODO: fatal migration failures should bail
 				 * out
 				 */
 				do_migrate_range(pfn, end_pfn);
 			}
+		} while (!ret);
+
+		if (ret != -ENOENT) {
+			reason = "unmovable page";
+			goto failed_removal_isolated;
 		}
 
 		/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5334decc9e06..840c0bbe2d9f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8256,6 +8256,19 @@  bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
 		if ((flags & MEMORY_OFFLINE) && PageHWPoison(page))
 			continue;
 
+		/*
+		 * We treat all PageOffline() pages as movable when offlining
+		 * to give drivers a chance to decrement their reference count
+		 * in MEM_GOING_OFFLINE in order to signal that these pages
+		 * can be offlined as there are no direct references anymore.
+		 * For actually unmovable PageOffline() pages where the driver
+		 * does not support this, we will fail later when trying to
+		 * actually move these pages that still have a reference count
+		 * > 0. (false negatives in this function only)
+		 */
+		if ((flags & MEMORY_OFFLINE) && PageOffline(page))
+			continue;
+
 		if (__PageMovable(page))
 			continue;
 
@@ -8683,6 +8696,17 @@  __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 			offlined_pages++;
 			continue;
 		}
+		/*
+		 * At this point all remaining PageOffline() pages have a
+		 * reference count of 0 and can simply be skipped.
+		 */
+		if (PageOffline(page)) {
+			BUG_ON(page_count(page));
+			BUG_ON(PageBuddy(page));
+			pfn++;
+			offlined_pages++;
+			continue;
+		}
 
 		BUG_ON(page_count(page));
 		BUG_ON(!PageBuddy(page));
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 04ee1663cdbe..43b4dabfedc8 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -170,6 +170,7 @@  __first_valid_page(unsigned long pfn, unsigned long nr_pages)
  *			a bit mask)
  *			MEMORY_OFFLINE - isolate to offline (!allocate) memory
  *					 e.g., skip over PageHWPoison() pages
+ *					 and PageOffline() pages.
  *			REPORT_FAILURE - report details about the failure to
  *			isolate the range
  *
@@ -278,6 +279,14 @@  __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn,
 		else if ((flags & MEMORY_OFFLINE) && PageHWPoison(page))
 			/* A HWPoisoned page cannot be also PageBuddy */
 			pfn++;
+		else if ((flags & MEMORY_OFFLINE) && PageOffline(page) &&
+			 !page_count(page))
+			/*
+			 * The responsible driver agreed to offline
+			 * PageOffline() pages by dropping its reference in
+			 * MEM_GOING_OFFLINE.
+			 */
+			pfn++;
 		else
 			break;
 	}