Message ID | 1499863221-16206-7-git-send-email-wei.w.wang@intel.com (mailing list archive) |
---|---|
State | New, archived |
On Wed, Jul 12, 2017 at 08:40:19PM +0800, Wei Wang wrote: > This patch adds support for reporting blocks of pages on the free list > specified by the caller. > > As pages can leave the free list during this call or immediately > afterwards, they are not guaranteed to be free after the function > returns. The only guarantee this makes is that the page was on the free > list at some point in time after the function has been invoked. > > Therefore, it is not safe for caller to use any pages on the returned > block or to discard data that is put there after the function returns. > However, it is safe for caller to discard data that was in one of these > pages before the function was invoked. > > Signed-off-by: Wei Wang <wei.w.wang@intel.com> > Signed-off-by: Liang Li <liang.z.li@intel.com> > --- > include/linux/mm.h | 5 +++ > mm/page_alloc.c | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 101 insertions(+) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 46b9ac5..76cb433 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1835,6 +1835,11 @@ extern void free_area_init_node(int nid, unsigned long * zones_size, > unsigned long zone_start_pfn, unsigned long *zholes_size); > extern void free_initmem(void); > > +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON) > +extern int report_unused_page_block(struct zone *zone, unsigned int order, > + unsigned int migratetype, > + struct page **page); > +#endif > /* > * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK) > * into the buddy system. The freed pages will be poisoned with pattern > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 64b7d82..8b3c9dd 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -4753,6 +4753,102 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) > show_swap_cache_info(); > } > > +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON) > + > +/* > + * Heuristically get a page block in the system that is unused. > + * It is possible that pages from the page block are used immediately after > + * report_unused_page_block() returns. It is the caller's responsibility > + * to either detect or prevent the use of such pages. > + * > + * The free list to check: zone->free_area[order].free_list[migratetype]. > + * > + * If the caller supplied page block (i.e. **page) is on the free list, offer > + * the next page block on the list to the caller. Otherwise, offer the first > + * page block on the list. > + * > + * Note: it is not safe for caller to use any pages on the returned > + * block or to discard data that is put there after the function returns. > + * However, it is safe for caller to discard data that was in one of these > + * pages before the function was invoked. > + * > + * Return 0 when a page block is found on the caller specified free list. Otherwise? > + */ As an alternative, we could have an API that scans free pages and invokes a callback under a lock. Granted, this might end up staying a lot of time under a lock. Is this a big issue? Some benchmarking will tell. It would then be up to the hypervisor to decide whether it wants to play tricks with the dirty bit or just wants to drop pages while VCPU is stopped. 
> +int report_unused_page_block(struct zone *zone, unsigned int order, > + unsigned int migratetype, struct page **page) > +{ > + struct zone *this_zone; > + struct list_head *this_list; > + int ret = 0; > + unsigned long flags; > + > + /* Sanity check */ > + if (zone == NULL || page == NULL || order >= MAX_ORDER || > + migratetype >= MIGRATE_TYPES) > + return -EINVAL; Why do callers this? > + > + /* Zone validity check */ > + for_each_populated_zone(this_zone) { > + if (zone == this_zone) > + break; > + } Why? Will take a long time if there are lots of zones. > + > + /* Got a non-existent zone from the caller? */ > + if (zone != this_zone) > + return -EINVAL; When does this happen? > + > + spin_lock_irqsave(&this_zone->lock, flags); > + > + this_list = &zone->free_area[order].free_list[migratetype]; > + if (list_empty(this_list)) { > + *page = NULL; > + ret = 1; What does this mean? > + goto out; > + } > + > + /* The caller is asking for the first free page block on the list */ > + if ((*page) == NULL) { if (!*page) is shorter and prettier. > + *page = list_first_entry(this_list, struct page, lru); > + ret = 0; > + goto out; > + } > + > + /* > + * The page block passed from the caller is not on this free list > + * anymore (e.g. a 1MB free page block has been split). In this case, > + * offer the first page block on the free list that the caller is > + * asking for. This just might keep giving you same block over and over again. E.g. - get 1st block - get 2nd block - 2nd gets broken up - get 1st block again this way we might never make progress beyond the 1st 2 blocks > + */ > + if (PageBuddy(*page) && order != page_order(*page)) { > + *page = list_first_entry(this_list, struct page, lru); > + ret = 0; > + goto out; > + } > + > + /* > + * The page block passed from the caller has been the last page block > + * on the list. > + */ > + if ((*page)->lru.next == this_list) { > + *page = NULL; > + ret = 1; > + goto out; > + } > + > + /* > + * Finally, fall into the regular case: the page block passed from the > + * caller is still on the free list. Offer the next one. > + */ > + *page = list_next_entry((*page), lru); > + ret = 0; > +out: > + spin_unlock_irqrestore(&this_zone->lock, flags); > + return ret; > +} > +EXPORT_SYMBOL(report_unused_page_block); > + > +#endif > + > static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref) > { > zoneref->zone = zone; > -- > 2.7.4
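To make the intended calling convention concrete, here is a minimal sketch of how a consumer might walk one free list with this API; send_page_hint() stands in for whatever transport the caller uses (for virtio-balloon, the ring discussed later in the thread) and is hypothetical:

```c
/* Hypothetical transport: pass a (pfn, nr_pages) hint to the hypervisor. */
extern void send_page_hint(unsigned long pfn, unsigned long nr_pages);

/*
 * Sketch of a consumer loop for report_unused_page_block(). Starting with
 * *page == NULL asks for the first block on the free list; each subsequent
 * call resumes after the previously returned block. A non-zero return
 * means the list is empty or its end was reached.
 */
static void hint_one_free_list(struct zone *zone, unsigned int order,
			       unsigned int migratetype)
{
	struct page *page = NULL;

	while (!report_unused_page_block(zone, order, migratetype, &page)) {
		/*
		 * The block may be reallocated as soon as the zone lock is
		 * dropped inside the callee; the hint stays usable only
		 * because the hypervisor tracks subsequent writes.
		 */
		send_page_hint(page_to_pfn(page), 1UL << order);
	}
}
```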
On 07/13/2017 08:33 AM, Michael S. Tsirkin wrote: > On Wed, Jul 12, 2017 at 08:40:19PM +0800, Wei Wang wrote: >> This patch adds support for reporting blocks of pages on the free list >> specified by the caller. >> >> As pages can leave the free list during this call or immediately >> afterwards, they are not guaranteed to be free after the function >> returns. The only guarantee this makes is that the page was on the free >> list at some point in time after the function has been invoked. >> >> Therefore, it is not safe for caller to use any pages on the returned >> block or to discard data that is put there after the function returns. >> However, it is safe for caller to discard data that was in one of these >> pages before the function was invoked. >> >> Signed-off-by: Wei Wang <wei.w.wang@intel.com> >> Signed-off-by: Liang Li <liang.z.li@intel.com> >> --- >> include/linux/mm.h | 5 +++ >> mm/page_alloc.c | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ >> 2 files changed, 101 insertions(+) >> >> diff --git a/include/linux/mm.h b/include/linux/mm.h >> index 46b9ac5..76cb433 100644 >> --- a/include/linux/mm.h >> +++ b/include/linux/mm.h >> @@ -1835,6 +1835,11 @@ extern void free_area_init_node(int nid, unsigned long * zones_size, >> unsigned long zone_start_pfn, unsigned long *zholes_size); >> extern void free_initmem(void); >> >> +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON) >> +extern int report_unused_page_block(struct zone *zone, unsigned int order, >> + unsigned int migratetype, >> + struct page **page); >> +#endif >> /* >> * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK) >> * into the buddy system. The freed pages will be poisoned with pattern >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >> index 64b7d82..8b3c9dd 100644 >> --- a/mm/page_alloc.c >> +++ b/mm/page_alloc.c >> @@ -4753,6 +4753,102 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) >> show_swap_cache_info(); >> } >> >> +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON) >> + >> +/* >> + * Heuristically get a page block in the system that is unused. >> + * It is possible that pages from the page block are used immediately after >> + * report_unused_page_block() returns. It is the caller's responsibility >> + * to either detect or prevent the use of such pages. >> + * >> + * The free list to check: zone->free_area[order].free_list[migratetype]. >> + * >> + * If the caller supplied page block (i.e. **page) is on the free list, offer >> + * the next page block on the list to the caller. Otherwise, offer the first >> + * page block on the list. >> + * >> + * Note: it is not safe for caller to use any pages on the returned >> + * block or to discard data that is put there after the function returns. >> + * However, it is safe for caller to discard data that was in one of these >> + * pages before the function was invoked. >> + * >> + * Return 0 when a page block is found on the caller specified free list. > Otherwise? Other values mean that no page block is found. I will add them. > >> + */ > As an alternative, we could have an API that scans free pages > and invokes a callback under a lock. Granted, this might end up > staying a lot of time under a lock. Is this a big issue? > Some benchmarking will tell. > > It would then be up to the hypervisor to decide whether it wants to play > tricks with the dirty bit or just wants to drop pages while VCPU is > stopped. 
> > >> +int report_unused_page_block(struct zone *zone, unsigned int order, >> + unsigned int migratetype, struct page **page) >> +{ >> + struct zone *this_zone; >> + struct list_head *this_list; >> + int ret = 0; >> + unsigned long flags; >> + >> + /* Sanity check */ >> + if (zone == NULL || page == NULL || order >= MAX_ORDER || >> + migratetype >= MIGRATE_TYPES) >> + return -EINVAL; > Why do callers this? > >> + >> + /* Zone validity check */ >> + for_each_populated_zone(this_zone) { >> + if (zone == this_zone) >> + break; >> + } > Why? Will take a long time if there are lots of zones. > >> + >> + /* Got a non-existent zone from the caller? */ >> + if (zone != this_zone) >> + return -EINVAL; > When does this happen? The above lines of code are just sanity check. If not necessary, we can remove them. > >> + >> + spin_lock_irqsave(&this_zone->lock, flags); >> + >> + this_list = &zone->free_area[order].free_list[migratetype]; >> + if (list_empty(this_list)) { >> + *page = NULL; >> + ret = 1; > > What does this mean? Just means the list is empty, and expects the caller to try again in the next list. Probably, use "-EAGAIN" is better? > >> + *page = list_first_entry(this_list, struct page, lru); >> + ret = 0; >> + goto out; >> + } >> + >> + /* >> + * The page block passed from the caller is not on this free list >> + * anymore (e.g. a 1MB free page block has been split). In this case, >> + * offer the first page block on the free list that the caller is >> + * asking for. > This just might keep giving you same block over and over again. > E.g. > - get 1st block > - get 2nd block > - 2nd gets broken up > - get 1st block again > > this way we might never make progress beyond the 1st 2 blocks Not really. I think the pages are allocated in order. If the 2nd block isn't there, then the 1st block must have gone, too. So, the call will return the 3rd one (which is the new first) on the list. Best, Wei
On Wed 12-07-17 20:40:19, Wei Wang wrote: > This patch adds support for reporting blocks of pages on the free list > specified by the caller. > > As pages can leave the free list during this call or immediately > afterwards, they are not guaranteed to be free after the function > returns. The only guarantee this makes is that the page was on the free > list at some point in time after the function has been invoked. > > Therefore, it is not safe for caller to use any pages on the returned > block or to discard data that is put there after the function returns. > However, it is safe for caller to discard data that was in one of these > pages before the function was invoked. I do not understand what is the point of such a function and how it is used because the patch doesn't give us any user (I haven't checked other patches yet). But just from the semantic point of view this sounds like a horrible idea. The only way to get a free block of pages is to call the page allocator. I am tempted to give it Nack right on those grounds but I would like to hear more about what you actually want to achieve. > Signed-off-by: Wei Wang <wei.w.wang@intel.com> > Signed-off-by: Liang Li <liang.z.li@intel.com> > --- > include/linux/mm.h | 5 +++ > mm/page_alloc.c | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 101 insertions(+) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 46b9ac5..76cb433 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1835,6 +1835,11 @@ extern void free_area_init_node(int nid, unsigned long * zones_size, > unsigned long zone_start_pfn, unsigned long *zholes_size); > extern void free_initmem(void); > > +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON) > +extern int report_unused_page_block(struct zone *zone, unsigned int order, > + unsigned int migratetype, > + struct page **page); > +#endif > /* > * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK) > * into the buddy system. The freed pages will be poisoned with pattern > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 64b7d82..8b3c9dd 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -4753,6 +4753,102 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) > show_swap_cache_info(); > } > > +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON) > + > +/* > + * Heuristically get a page block in the system that is unused. > + * It is possible that pages from the page block are used immediately after > + * report_unused_page_block() returns. It is the caller's responsibility > + * to either detect or prevent the use of such pages. > + * > + * The free list to check: zone->free_area[order].free_list[migratetype]. > + * > + * If the caller supplied page block (i.e. **page) is on the free list, offer > + * the next page block on the list to the caller. Otherwise, offer the first > + * page block on the list. > + * > + * Note: it is not safe for caller to use any pages on the returned > + * block or to discard data that is put there after the function returns. > + * However, it is safe for caller to discard data that was in one of these > + * pages before the function was invoked. > + * > + * Return 0 when a page block is found on the caller specified free list. 
> + */ > +int report_unused_page_block(struct zone *zone, unsigned int order, > + unsigned int migratetype, struct page **page) > +{ > + struct zone *this_zone; > + struct list_head *this_list; > + int ret = 0; > + unsigned long flags; > + > + /* Sanity check */ > + if (zone == NULL || page == NULL || order >= MAX_ORDER || > + migratetype >= MIGRATE_TYPES) > + return -EINVAL; > + > + /* Zone validity check */ > + for_each_populated_zone(this_zone) { > + if (zone == this_zone) > + break; > + } > + > + /* Got a non-existent zone from the caller? */ > + if (zone != this_zone) > + return -EINVAL; Huh, what do you check for here? Why don't you simply populated_zone(zone)? > + > + spin_lock_irqsave(&this_zone->lock, flags); > + > + this_list = &zone->free_area[order].free_list[migratetype]; > + if (list_empty(this_list)) { > + *page = NULL; > + ret = 1; > + goto out; > + } > + > + /* The caller is asking for the first free page block on the list */ > + if ((*page) == NULL) { > + *page = list_first_entry(this_list, struct page, lru); > + ret = 0; > + goto out; > + } > + > + /* > + * The page block passed from the caller is not on this free list > + * anymore (e.g. a 1MB free page block has been split). In this case, > + * offer the first page block on the free list that the caller is > + * asking for. > + */ > + if (PageBuddy(*page) && order != page_order(*page)) { > + *page = list_first_entry(this_list, struct page, lru); > + ret = 0; > + goto out; > + } > + > + /* > + * The page block passed from the caller has been the last page block > + * on the list. > + */ > + if ((*page)->lru.next == this_list) { > + *page = NULL; > + ret = 1; > + goto out; > + } > + > + /* > + * Finally, fall into the regular case: the page block passed from the > + * caller is still on the free list. Offer the next one. > + */ > + *page = list_next_entry((*page), lru); > + ret = 0; > +out: > + spin_unlock_irqrestore(&this_zone->lock, flags); > + return ret; > +} > +EXPORT_SYMBOL(report_unused_page_block); > + > +#endif > + > static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref) > { > zoneref->zone = zone; > -- > 2.7.4 > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
On Fri 14-07-17 14:30:23, Michal Hocko wrote: > On Wed 12-07-17 20:40:19, Wei Wang wrote: > > This patch adds support for reporting blocks of pages on the free list > > specified by the caller. > > > > As pages can leave the free list during this call or immediately > > afterwards, they are not guaranteed to be free after the function > > returns. The only guarantee this makes is that the page was on the free > > list at some point in time after the function has been invoked. > > > > Therefore, it is not safe for caller to use any pages on the returned > > block or to discard data that is put there after the function returns. > > However, it is safe for caller to discard data that was in one of these > > pages before the function was invoked. > > I do not understand what is the point of such a function and how it is > used because the patch doesn't give us any user (I haven't checked other > patches yet). > > But just from the semantic point of view this sounds like a horrible > idea. The only way to get a free block of pages is to call the page > allocator. I am tempted to give it Nack right on those grounds but I > would like to hear more about what you actually want to achieve. OK, so I gave it another thought and giving a page which is still on the free list to a random module is just a free ticket to a disaster. Nacked-by: Michal Hocko <mhocko@suse.com> > > > Signed-off-by: Wei Wang <wei.w.wang@intel.com> > > Signed-off-by: Liang Li <liang.z.li@intel.com> > > --- > > include/linux/mm.h | 5 +++ > > mm/page_alloc.c | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ > > 2 files changed, 101 insertions(+) > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > index 46b9ac5..76cb433 100644 > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -1835,6 +1835,11 @@ extern void free_area_init_node(int nid, unsigned long * zones_size, > > unsigned long zone_start_pfn, unsigned long *zholes_size); > > extern void free_initmem(void); > > > > +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON) > > +extern int report_unused_page_block(struct zone *zone, unsigned int order, > > + unsigned int migratetype, > > + struct page **page); > > +#endif > > /* > > * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK) > > * into the buddy system. The freed pages will be poisoned with pattern > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index 64b7d82..8b3c9dd 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -4753,6 +4753,102 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) > > show_swap_cache_info(); > > } > > > > +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON) > > + > > +/* > > + * Heuristically get a page block in the system that is unused. > > + * It is possible that pages from the page block are used immediately after > > + * report_unused_page_block() returns. It is the caller's responsibility > > + * to either detect or prevent the use of such pages. > > + * > > + * The free list to check: zone->free_area[order].free_list[migratetype]. > > + * > > + * If the caller supplied page block (i.e. **page) is on the free list, offer > > + * the next page block on the list to the caller. Otherwise, offer the first > > + * page block on the list. > > + * > > + * Note: it is not safe for caller to use any pages on the returned > > + * block or to discard data that is put there after the function returns. > > + * However, it is safe for caller to discard data that was in one of these > > + * pages before the function was invoked. 
> > + * > > + * Return 0 when a page block is found on the caller specified free list. > > + */ > > +int report_unused_page_block(struct zone *zone, unsigned int order, > > + unsigned int migratetype, struct page **page) > > +{ > > + struct zone *this_zone; > > + struct list_head *this_list; > > + int ret = 0; > > + unsigned long flags; > > + > > + /* Sanity check */ > > + if (zone == NULL || page == NULL || order >= MAX_ORDER || > > + migratetype >= MIGRATE_TYPES) > > + return -EINVAL; > > + > > + /* Zone validity check */ > > + for_each_populated_zone(this_zone) { > > + if (zone == this_zone) > > + break; > > + } > > + > > + /* Got a non-existent zone from the caller? */ > > + if (zone != this_zone) > > + return -EINVAL; > > Huh, what do you check for here? Why don't you simply > populated_zone(zone)? > > > + > > + spin_lock_irqsave(&this_zone->lock, flags); > > + > > + this_list = &zone->free_area[order].free_list[migratetype]; > > + if (list_empty(this_list)) { > > + *page = NULL; > > + ret = 1; > > + goto out; > > + } > > + > > + /* The caller is asking for the first free page block on the list */ > > + if ((*page) == NULL) { > > + *page = list_first_entry(this_list, struct page, lru); > > + ret = 0; > > + goto out; > > + } > > + > > + /* > > + * The page block passed from the caller is not on this free list > > + * anymore (e.g. a 1MB free page block has been split). In this case, > > + * offer the first page block on the free list that the caller is > > + * asking for. > > + */ > > + if (PageBuddy(*page) && order != page_order(*page)) { > > + *page = list_first_entry(this_list, struct page, lru); > > + ret = 0; > > + goto out; > > + } > > + > > + /* > > + * The page block passed from the caller has been the last page block > > + * on the list. > > + */ > > + if ((*page)->lru.next == this_list) { > > + *page = NULL; > > + ret = 1; > > + goto out; > > + } > > + > > + /* > > + * Finally, fall into the regular case: the page block passed from the > > + * caller is still on the free list. Offer the next one. > > + */ > > + *page = list_next_entry((*page), lru); > > + ret = 0; > > +out: > > + spin_unlock_irqrestore(&this_zone->lock, flags); > > + return ret; > > +} > > +EXPORT_SYMBOL(report_unused_page_block); > > + > > +#endif > > + > > static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref) > > { > > zoneref->zone = zone; > > -- > > 2.7.4 > > > > -- > > To unsubscribe, send a message with 'unsubscribe linux-mm' in > > the body to majordomo@kvack.org. For more info on Linux MM, > > see: http://www.linux-mm.org/ . > > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> > > -- > Michal Hocko > SUSE Labs
On Fri, Jul 14, 2017 at 02:54:30PM +0200, Michal Hocko wrote: > On Fri 14-07-17 14:30:23, Michal Hocko wrote: > > On Wed 12-07-17 20:40:19, Wei Wang wrote: > > > This patch adds support for reporting blocks of pages on the free list > > > specified by the caller. > > > > > > As pages can leave the free list during this call or immediately > > > afterwards, they are not guaranteed to be free after the function > > > returns. The only guarantee this makes is that the page was on the free > > > list at some point in time after the function has been invoked. > > > > > > Therefore, it is not safe for caller to use any pages on the returned > > > block or to discard data that is put there after the function returns. > > > However, it is safe for caller to discard data that was in one of these > > > pages before the function was invoked. > > > > I do not understand what is the point of such a function and how it is > > used because the patch doesn't give us any user (I haven't checked other > > patches yet). > > > > But just from the semantic point of view this sounds like a horrible > > idea. The only way to get a free block of pages is to call the page > > allocator. I am tempted to give it Nack right on those grounds but I > > would like to hear more about what you actually want to achieve. > > OK, so I gave it another thought and giving a page which is still on the > free list to a random module is just a free ticket to a disaster. > Nacked-by: Michal Hocko <mhocko@suse.com> I agree it should be EXPORT_SYMBOL_GPL. Too much power to give to non-GPL modules. But pls take a look at the explanation I posted. Any kind of hypervisor hinting will need to do this by definition - best we can do is keep a lock while we do this. > > > > > Signed-off-by: Wei Wang <wei.w.wang@intel.com> > > > Signed-off-by: Liang Li <liang.z.li@intel.com> > > > --- > > > include/linux/mm.h | 5 +++ > > > mm/page_alloc.c | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ > > > 2 files changed, 101 insertions(+) > > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > index 46b9ac5..76cb433 100644 > > > --- a/include/linux/mm.h > > > +++ b/include/linux/mm.h > > > @@ -1835,6 +1835,11 @@ extern void free_area_init_node(int nid, unsigned long * zones_size, > > > unsigned long zone_start_pfn, unsigned long *zholes_size); > > > extern void free_initmem(void); > > > > > > +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON) > > > +extern int report_unused_page_block(struct zone *zone, unsigned int order, > > > + unsigned int migratetype, > > > + struct page **page); > > > +#endif > > > /* > > > * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK) > > > * into the buddy system. The freed pages will be poisoned with pattern > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > > index 64b7d82..8b3c9dd 100644 > > > --- a/mm/page_alloc.c > > > +++ b/mm/page_alloc.c > > > @@ -4753,6 +4753,102 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) > > > show_swap_cache_info(); > > > } > > > > > > +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON) > > > + > > > +/* > > > + * Heuristically get a page block in the system that is unused. > > > + * It is possible that pages from the page block are used immediately after > > > + * report_unused_page_block() returns. It is the caller's responsibility > > > + * to either detect or prevent the use of such pages. > > > + * > > > + * The free list to check: zone->free_area[order].free_list[migratetype]. > > > + * > > > + * If the caller supplied page block (i.e. 
**page) is on the free list, offer > > > + * the next page block on the list to the caller. Otherwise, offer the first > > > + * page block on the list. > > > + * > > > + * Note: it is not safe for caller to use any pages on the returned > > > + * block or to discard data that is put there after the function returns. > > > + * However, it is safe for caller to discard data that was in one of these > > > + * pages before the function was invoked. > > > + * > > > + * Return 0 when a page block is found on the caller specified free list. > > > + */ > > > +int report_unused_page_block(struct zone *zone, unsigned int order, > > > + unsigned int migratetype, struct page **page) > > > +{ > > > + struct zone *this_zone; > > > + struct list_head *this_list; > > > + int ret = 0; > > > + unsigned long flags; > > > + > > > + /* Sanity check */ > > > + if (zone == NULL || page == NULL || order >= MAX_ORDER || > > > + migratetype >= MIGRATE_TYPES) > > > + return -EINVAL; > > > + > > > + /* Zone validity check */ > > > + for_each_populated_zone(this_zone) { > > > + if (zone == this_zone) > > > + break; > > > + } > > > + > > > + /* Got a non-existent zone from the caller? */ > > > + if (zone != this_zone) > > > + return -EINVAL; > > > > Huh, what do you check for here? Why don't you simply > > populated_zone(zone)? > > > > > + > > > + spin_lock_irqsave(&this_zone->lock, flags); > > > + > > > + this_list = &zone->free_area[order].free_list[migratetype]; > > > + if (list_empty(this_list)) { > > > + *page = NULL; > > > + ret = 1; > > > + goto out; > > > + } > > > + > > > + /* The caller is asking for the first free page block on the list */ > > > + if ((*page) == NULL) { > > > + *page = list_first_entry(this_list, struct page, lru); > > > + ret = 0; > > > + goto out; > > > + } > > > + > > > + /* > > > + * The page block passed from the caller is not on this free list > > > + * anymore (e.g. a 1MB free page block has been split). In this case, > > > + * offer the first page block on the free list that the caller is > > > + * asking for. > > > + */ > > > + if (PageBuddy(*page) && order != page_order(*page)) { > > > + *page = list_first_entry(this_list, struct page, lru); > > > + ret = 0; > > > + goto out; > > > + } > > > + > > > + /* > > > + * The page block passed from the caller has been the last page block > > > + * on the list. > > > + */ > > > + if ((*page)->lru.next == this_list) { > > > + *page = NULL; > > > + ret = 1; > > > + goto out; > > > + } > > > + > > > + /* > > > + * Finally, fall into the regular case: the page block passed from the > > > + * caller is still on the free list. Offer the next one. > > > + */ > > > + *page = list_next_entry((*page), lru); > > > + ret = 0; > > > +out: > > > + spin_unlock_irqrestore(&this_zone->lock, flags); > > > + return ret; > > > +} > > > +EXPORT_SYMBOL(report_unused_page_block); > > > + > > > +#endif > > > + > > > static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref) > > > { > > > zoneref->zone = zone; > > > -- > > > 2.7.4 > > > > > > -- > > > To unsubscribe, send a message with 'unsubscribe linux-mm' in > > > the body to majordomo@kvack.org. For more info on Linux MM, > > > see: http://www.linux-mm.org/ . > > > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> > > > > -- > > Michal Hocko > > SUSE Labs > > -- > Michal Hocko > SUSE Labs
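For reference, the change Michael asks for above is a one-line switch of the export macro, so that only GPL-compatible modules can use the symbol:

```c
EXPORT_SYMBOL_GPL(report_unused_page_block);
```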
On Fri, Jul 14, 2017 at 02:30:23PM +0200, Michal Hocko wrote: > On Wed 12-07-17 20:40:19, Wei Wang wrote: > > This patch adds support for reporting blocks of pages on the free list > > specified by the caller. > > > > As pages can leave the free list during this call or immediately > > afterwards, they are not guaranteed to be free after the function > > returns. The only guarantee this makes is that the page was on the free > > list at some point in time after the function has been invoked. > > > > Therefore, it is not safe for caller to use any pages on the returned > > block or to discard data that is put there after the function returns. > > However, it is safe for caller to discard data that was in one of these > > pages before the function was invoked. > > I do not understand what is the point of such a function and how it is > used because the patch doesn't give us any user (I haven't checked other > patches yet). > > But just from the semantic point of view this sounds like a horrible > idea. The only way to get a free block of pages is to call the page > allocator. I am tempted to give it Nack right on those grounds but I > would like to hear more about what you actually want to achieve. Basically it's a performance hint to the hypervisor. For example, these pages would be good candidates to move around as they are not mapped into any running applications. As such, it's important not to slow down other parts of the system too much - otherwise we are speeding up one part of the system while we slow down other parts of it, which is why it's trying to drop the lock as soon a possible. As long as hypervisor does not assume it can drop these pages, and as long it's correct in most cases. we are OK even if the hint is slightly wrong because hypervisor notifications are racing with allocations. There are patches to do more tricks - if hypervisor tracks all memory writes we might actually use this hint to discard data - but that is just implementation detail. > > Signed-off-by: Wei Wang <wei.w.wang@intel.com> > > Signed-off-by: Liang Li <liang.z.li@intel.com> > > --- > > include/linux/mm.h | 5 +++ > > mm/page_alloc.c | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ > > 2 files changed, 101 insertions(+) > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > index 46b9ac5..76cb433 100644 > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -1835,6 +1835,11 @@ extern void free_area_init_node(int nid, unsigned long * zones_size, > > unsigned long zone_start_pfn, unsigned long *zholes_size); > > extern void free_initmem(void); > > > > +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON) > > +extern int report_unused_page_block(struct zone *zone, unsigned int order, > > + unsigned int migratetype, > > + struct page **page); > > +#endif > > /* > > * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK) > > * into the buddy system. The freed pages will be poisoned with pattern > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index 64b7d82..8b3c9dd 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -4753,6 +4753,102 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) > > show_swap_cache_info(); > > } > > > > +#if IS_ENABLED(CONFIG_VIRTIO_BALLOON) > > + > > +/* > > + * Heuristically get a page block in the system that is unused. > > + * It is possible that pages from the page block are used immediately after > > + * report_unused_page_block() returns. 
It is the caller's responsibility > > + * to either detect or prevent the use of such pages. > > + * > > + * The free list to check: zone->free_area[order].free_list[migratetype]. > > + * > > + * If the caller supplied page block (i.e. **page) is on the free list, offer > > + * the next page block on the list to the caller. Otherwise, offer the first > > + * page block on the list. > > + * > > + * Note: it is not safe for caller to use any pages on the returned > > + * block or to discard data that is put there after the function returns. > > + * However, it is safe for caller to discard data that was in one of these > > + * pages before the function was invoked. > > + * > > + * Return 0 when a page block is found on the caller specified free list. > > + */ > > +int report_unused_page_block(struct zone *zone, unsigned int order, > > + unsigned int migratetype, struct page **page) > > +{ > > + struct zone *this_zone; > > + struct list_head *this_list; > > + int ret = 0; > > + unsigned long flags; > > + > > + /* Sanity check */ > > + if (zone == NULL || page == NULL || order >= MAX_ORDER || > > + migratetype >= MIGRATE_TYPES) > > + return -EINVAL; > > + > > + /* Zone validity check */ > > + for_each_populated_zone(this_zone) { > > + if (zone == this_zone) > > + break; > > + } > > + > > + /* Got a non-existent zone from the caller? */ > > + if (zone != this_zone) > > + return -EINVAL; > > Huh, what do you check for here? Why don't you simply > populated_zone(zone)? > > > + > > + spin_lock_irqsave(&this_zone->lock, flags); > > + > > + this_list = &zone->free_area[order].free_list[migratetype]; > > + if (list_empty(this_list)) { > > + *page = NULL; > > + ret = 1; > > + goto out; > > + } > > + > > + /* The caller is asking for the first free page block on the list */ > > + if ((*page) == NULL) { > > + *page = list_first_entry(this_list, struct page, lru); > > + ret = 0; > > + goto out; > > + } > > + > > + /* > > + * The page block passed from the caller is not on this free list > > + * anymore (e.g. a 1MB free page block has been split). In this case, > > + * offer the first page block on the free list that the caller is > > + * asking for. > > + */ > > + if (PageBuddy(*page) && order != page_order(*page)) { > > + *page = list_first_entry(this_list, struct page, lru); > > + ret = 0; > > + goto out; > > + } > > + > > + /* > > + * The page block passed from the caller has been the last page block > > + * on the list. > > + */ > > + if ((*page)->lru.next == this_list) { > > + *page = NULL; > > + ret = 1; > > + goto out; > > + } > > + > > + /* > > + * Finally, fall into the regular case: the page block passed from the > > + * caller is still on the free list. Offer the next one. > > + */ > > + *page = list_next_entry((*page), lru); > > + ret = 0; > > +out: > > + spin_unlock_irqrestore(&this_zone->lock, flags); > > + return ret; > > +} > > +EXPORT_SYMBOL(report_unused_page_block); > > + > > +#endif > > + > > static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref) > > { > > zoneref->zone = zone; > > -- > > 2.7.4 > > > > -- > > To unsubscribe, send a message with 'unsubscribe linux-mm' in > > the body to majordomo@kvack.org. For more info on Linux MM, > > see: http://www.linux-mm.org/ . > > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> > > -- > Michal Hocko > SUSE Labs
On Fri 14-07-17 22:17:13, Michael S. Tsirkin wrote: > On Fri, Jul 14, 2017 at 02:30:23PM +0200, Michal Hocko wrote: > > On Wed 12-07-17 20:40:19, Wei Wang wrote: > > > This patch adds support for reporting blocks of pages on the free list > > > specified by the caller. > > > > > > As pages can leave the free list during this call or immediately > > > afterwards, they are not guaranteed to be free after the function > > > returns. The only guarantee this makes is that the page was on the free > > > list at some point in time after the function has been invoked. > > > > > > Therefore, it is not safe for caller to use any pages on the returned > > > block or to discard data that is put there after the function returns. > > > However, it is safe for caller to discard data that was in one of these > > > pages before the function was invoked. > > > > I do not understand what is the point of such a function and how it is > > used because the patch doesn't give us any user (I haven't checked other > > patches yet). > > > > But just from the semantic point of view this sounds like a horrible > > idea. The only way to get a free block of pages is to call the page > > allocator. I am tempted to give it Nack right on those grounds but I > > would like to hear more about what you actually want to achieve. > > Basically it's a performance hint to the hypervisor. > For example, these pages would be good candidates to > move around as they are not mapped into any running > applications. > > As such, it's important not to slow down other parts of the system too > much - otherwise we are speeding up one part of the system while we slow > down other parts of it, which is why it's trying to drop the lock as > soon a possible. So why cannot you simply allocate those page and then do whatever you need. You can tell the page allocator to do only a lightweight allocation by the gfp_mask - e.g. GFP_NOWAIT or if you even do not want to risk kswapd intervening then 0 mask. > As long as hypervisor does not assume it can drop these pages, and as > long it's correct in most cases. we are OK even if the hint is slightly > wrong because hypervisor notifications are racing with allocations. But the page could have been reused anytime after the lock is dropped and you cannot check for that except for elevating the reference count.
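A rough sketch of the allocation-based alternative Michal proposes, reusing the hypothetical send_page_hint() from earlier: pages are genuinely owned while hinted, so there is no race with reuse, but the memory is temporarily unavailable to the guest. Wei's reply below explains why this was considered unsuitable for the migration use case:

```c
/*
 * Sketch of hinting via the allocator rather than by peeking at free
 * lists. GFP_NOWAIT keeps the allocation opportunistic: it never sleeps
 * and triggers no direct reclaim, so it only drains what is already free.
 */
static void hint_by_allocation(unsigned int order)
{
	struct page *page;
	LIST_HEAD(pages);

	/* Drain free blocks of this order; stop at the first failure. */
	while ((page = alloc_pages(GFP_NOWAIT | __GFP_NOWARN, order)))
		list_add(&page->lru, &pages);

	/* Hint each block while we still own it, then give it back. */
	while (!list_empty(&pages)) {
		page = list_first_entry(&pages, struct page, lru);
		list_del(&page->lru);
		send_page_hint(page_to_pfn(page), 1UL << order);
		__free_pages(page, order);
	}
}
```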
On 07/17/2017 11:24 PM, Michal Hocko wrote: > On Fri 14-07-17 22:17:13, Michael S. Tsirkin wrote: >> On Fri, Jul 14, 2017 at 02:30:23PM +0200, Michal Hocko wrote: >>> On Wed 12-07-17 20:40:19, Wei Wang wrote: >>>> This patch adds support for reporting blocks of pages on the free list >>>> specified by the caller. >>>> >>>> As pages can leave the free list during this call or immediately >>>> afterwards, they are not guaranteed to be free after the function >>>> returns. The only guarantee this makes is that the page was on the free >>>> list at some point in time after the function has been invoked. >>>> >>>> Therefore, it is not safe for caller to use any pages on the returned >>>> block or to discard data that is put there after the function returns. >>>> However, it is safe for caller to discard data that was in one of these >>>> pages before the function was invoked. >>> I do not understand what is the point of such a function and how it is >>> used because the patch doesn't give us any user (I haven't checked other >>> patches yet). >>> >>> But just from the semantic point of view this sounds like a horrible >>> idea. The only way to get a free block of pages is to call the page >>> allocator. I am tempted to give it Nack right on those grounds but I >>> would like to hear more about what you actually want to achieve. >> Basically it's a performance hint to the hypervisor. >> For example, these pages would be good candidates to >> move around as they are not mapped into any running >> applications. >> >> As such, it's important not to slow down other parts of the system too >> much - otherwise we are speeding up one part of the system while we slow >> down other parts of it, which is why it's trying to drop the lock as >> soon a possible. Probably I should have included the introduction of the usage in the log. Hope it is not too late to explain here: Live migration needs to transfer the VM's memory from the source machine to the destination round by round. For the 1st round, all the VM's memory is transferred. From the 2nd round, only the pieces of memory that were written by the guest (after the 1st round) are transferred. One method that is popularly used by the hypervisor to track which part of memory is written is to write-protect all the guest memory. This patch enables the optimization of the 1st round memory transfer - the hypervisor can skip the transfer of guest unused pages in the 1st round. It is not a problem that the pages may be used after they are given to the hypervisor as a hint of unused pages, because they will be tracked by the hypervisor and transferred in the next round if they are used and written. > So why cannot you simply allocate those page and then do whatever you > need. You can tell the page allocator to do only a lightweight > allocation by the gfp_mask - e.g. GFP_NOWAIT or if you even do not want > to risk kswapd intervening then 0 mask. Here are the two reasons we can't get the hint of unused pages by allocating them: 1) It's expected that live migration shouldn't affect the things running inside the VM - taking away all the free pages from the guest would greatly slow down the activities inside the guest (e.g. network transmission may stall due to a lack of sk_buffs). 2) The hint of free pages is used to optimize the 1st round memory transfer, so the hint is expected to reach the hypervisor as quickly as possible. Depending on the memory size of the guest, allocating all the free memory would take too long. 
Hope this clarifies the use case. >> As long as hypervisor does not assume it can drop these pages, and as >> long it's correct in most cases. we are OK even if the hint is slightly wrong because hypervisor notifications are racing with allocations. > But the page could have been reused anytime after the lock is dropped > and you cannot check for that except for elevating the reference count. As explained above, the hypervisor uses a dirty page logging mechanism to track which memory pages are written by the guest once live migration begins. Best, Wei
On Tue 18-07-17 10:12:14, Wei Wang wrote: [...] > Probably I should have included the introduction of the usage in > the log. Hope it is not too late to explain here: Yes, this should have been described in the cover. > >> Live migration needs to transfer the VM's memory from the source >> machine to the destination round by round. For the 1st round, all the VM's >> memory is transferred. From the 2nd round, only the pieces of memory >> that were written by the guest (after the 1st round) are transferred. One >> method that is popularly used by the hypervisor to track which part of >> memory is written is to write-protect all the guest memory. >> >> This patch enables the optimization of the 1st round memory transfer - >> the hypervisor can skip the transfer of guest unused pages in the 1st round. All you should need is the check for the page reference count, no? I assume you do some sort of pfn walk and so you should be able to get an access to the struct page.
On 07/19/2017 04:13 PM, Michal Hocko wrote: > On Tue 18-07-17 10:12:14, Wei Wang wrote: > [...] >> Probably I should have included the introduction of the usage in >> the log. Hope it is not too late to explain here: > Yes, this should have been described in the cover. OK, I will do it in the next version. > >> Live migration needs to transfer the VM's memory from the source >> machine to the destination round by round. For the 1st round, all the VM's >> memory is transferred. From the 2nd round, only the pieces of memory >> that were written by the guest (after the 1st round) are transferred. One >> method that is popularly used by the hypervisor to track which part of >> memory is written is to write-protect all the guest memory. >> >> This patch enables the optimization of the 1st round memory transfer - >> the hypervisor can skip the transfer of guest unused pages in the 1st round. > All you should need is the check for the page reference count, no? I > assume you do some sort of pfn walk and so you should be able to get an > access to the struct page. Not necessarily - the guest struct page is not seen by the hypervisor. The hypervisor only gets those guest pfns which are hinted as unused. From the hypervisor (host) point of view, a guest physical address corresponds to a virtual address of a host process. So, once the hypervisor knows a guest physical page is unused, it knows that the corresponding virtual memory of the process doesn't need to be transferred in the 1st round. Best, Wei
On Wed 19-07-17 20:01:18, Wei Wang wrote: > On 07/19/2017 04:13 PM, Michal Hocko wrote: [... > >All you should need is the check for the page reference count, no? I > >assume you do some sort of pfn walk and so you should be able to get an > >access to the struct page. > > Not necessarily - the guest struct page is not seen by the hypervisor. The > hypervisor only gets those guest pfns which are hinted as unused. From the > hypervisor (host) point of view, a guest physical address corresponds to a > virtual address of a host process. So, once the hypervisor knows a guest > physical page is unsued, it knows that the corresponding virtual memory of > the process doesn't need to be transferred in the 1st round. I am sorry, but I do not understand. Why cannot _guest_ simply check the struct page ref count and send them to the hypervisor? Is there any documentation which describes the workflow or code which would use your new API?
On 07/24/2017 05:00 PM, Michal Hocko wrote: > On Wed 19-07-17 20:01:18, Wei Wang wrote: >> On 07/19/2017 04:13 PM, Michal Hocko wrote: > [... >>> All you should need is the check for the page reference count, no? I >>> assume you do some sort of pfn walk and so you should be able to get an >>> access to the struct page. >> Not necessarily - the guest struct page is not seen by the hypervisor. The >> hypervisor only gets those guest pfns which are hinted as unused. From the >> hypervisor (host) point of view, a guest physical address corresponds to a >> virtual address of a host process. So, once the hypervisor knows a guest >> physical page is unsued, it knows that the corresponding virtual memory of >> the process doesn't need to be transferred in the 1st round. > I am sorry, but I do not understand. Why cannot _guest_ simply check the > struct page ref count and send them to the hypervisor? Were you suggesting the following? 1) get a free page block from the page list using the API; 2) if page->ref_count == 0, send it to the hypervisor Btw, ref_count may also change at any time. > Is there any > documentation which describes the workflow or code which would use your > new API? > It's used in the balloon driver (patch 8). We don't have any docs yet, but I think the high level workflow is the two steps above. Best, Wei
On Tue 25-07-17 17:32:00, Wei Wang wrote: > On 07/24/2017 05:00 PM, Michal Hocko wrote: > >On Wed 19-07-17 20:01:18, Wei Wang wrote: > >>On 07/19/2017 04:13 PM, Michal Hocko wrote: > >[... > >>>All you should need is the check for the page reference count, no? I > >>>assume you do some sort of pfn walk and so you should be able to get an > >>>access to the struct page. > >>Not necessarily - the guest struct page is not seen by the hypervisor. The > >>hypervisor only gets those guest pfns which are hinted as unused. From the > >>hypervisor (host) point of view, a guest physical address corresponds to a > >>virtual address of a host process. So, once the hypervisor knows a guest > >>physical page is unused, it knows that the corresponding virtual memory of > >>the process doesn't need to be transferred in the 1st round. > >I am sorry, but I do not understand. Why cannot _guest_ simply check the > >struct page ref count and send them to the hypervisor? > > Were you suggesting the following? > 1) get a free page block from the page list using the API; No. Use a pfn walk, check the reference count and skip those pages which have 0 ref count. I suspected that you need to do some sort of pfn walk anyway because you somehow have to evaluate the memory to migrate, right? > 2) if page->ref_count == 0, send it to the hypervisor yes > Btw, ref_count may also change at any time. > > >Is there any > >documentation which describes the workflow or code which would use your > >new API? > > > > It's used in the balloon driver (patch 8). We don't have any docs yet, but > I think the high level workflow is the two steps above. I will have a look.
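A minimal sketch of the pfn walk Michal describes, using only the public page_count() check; memory hole and hotplug handling are omitted, and send_page_hint() remains the hypothetical transport from the earlier sketches:

```c
/*
 * Walk a zone's pfn range and hint every page whose refcount is zero.
 * The check is inherently racy, which is tolerable here: a page reused
 * after being hinted is caught by the hypervisor's dirty logging.
 */
static void hint_by_pfn_walk(struct zone *zone)
{
	unsigned long pfn;

	for (pfn = zone->zone_start_pfn; pfn < zone_end_pfn(zone); pfn++) {
		if (!pfn_valid(pfn))
			continue;
		if (page_count(pfn_to_page(pfn)) == 0)
			send_page_hint(pfn, 1);
	}
}
```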
On 07/25/2017 07:25 PM, Michal Hocko wrote: > On Tue 25-07-17 17:32:00, Wei Wang wrote: >> On 07/24/2017 05:00 PM, Michal Hocko wrote: >>> On Wed 19-07-17 20:01:18, Wei Wang wrote: >>>> On 07/19/2017 04:13 PM, Michal Hocko wrote: >>> [... >>>>> All you should need is the check for the page reference count, no? I >>>>> assume you do some sort of pfn walk and so you should be able to get an >>>>> access to the struct page. >>>> Not necessarily - the guest struct page is not seen by the hypervisor. The >>>> hypervisor only gets those guest pfns which are hinted as unused. From the >>>> hypervisor (host) point of view, a guest physical address corresponds to a >>>> virtual address of a host process. So, once the hypervisor knows a guest >>>> physical page is unused, it knows that the corresponding virtual memory of >>>> the process doesn't need to be transferred in the 1st round. >>> I am sorry, but I do not understand. Why cannot _guest_ simply check the >>> struct page ref count and send them to the hypervisor? >> Were you suggesting the following? >> 1) get a free page block from the page list using the API; > No. Use a pfn walk, check the reference count and skip those pages which > have 0 ref count. "pfn walk" - do you mean start from the first pfn, and scan all the pfns that the VM has? > I suspected that you need to do some sort of pfn > walk anyway because you somehow have to evaluate the memory to migrate, > right? We don't need to do the pfn walk in the guest kernel. When the API reports, for example, a 2MB free page block, the API caller offers the base address of the page block, and size=2MB, to the hypervisor. The hypervisor maintains a bitmap of all the guest physical memory (a bit corresponds to a guest pfn). When migrating memory, only the pfns that are set in the bitmap are transferred to the destination machine. So, when the hypervisor receives a 2MB free page block, the corresponding bits in the bitmap are cleared. Best, Wei
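On the hypervisor side, the bookkeeping Wei describes could look roughly like the following; the names are illustrative, not the actual QEMU implementation:

```c
/*
 * One bit per guest pfn; a set bit means "transfer in this round".
 * A hinted free block simply clears its range so the first migration
 * round skips it; if the guest reuses the pages, dirty logging sets
 * the bits again for a later round.
 */
static void migration_skip_free_block(unsigned long *migration_bitmap,
				      uint64_t base_gpa, uint64_t size)
{
	bitmap_clear(migration_bitmap, base_gpa >> PAGE_SHIFT,
		     size >> PAGE_SHIFT);
}
```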
On Tue 25-07-17 19:56:24, Wei Wang wrote: > On 07/25/2017 07:25 PM, Michal Hocko wrote: > >On Tue 25-07-17 17:32:00, Wei Wang wrote: > >>On 07/24/2017 05:00 PM, Michal Hocko wrote: > >>>On Wed 19-07-17 20:01:18, Wei Wang wrote: > >>>>On 07/19/2017 04:13 PM, Michal Hocko wrote: > >>>[... > >>>>>All you should need is the check for the page reference count, no? I > >>>>>assume you do some sort of pfn walk and so you should be able to get an > >>>>>access to the struct page. > >>>>Not necessarily - the guest struct page is not seen by the hypervisor. The > >>>>hypervisor only gets those guest pfns which are hinted as unused. From the > >>>>hypervisor (host) point of view, a guest physical address corresponds to a > >>>>virtual address of a host process. So, once the hypervisor knows a guest > >>>>physical page is unused, it knows that the corresponding virtual memory of > >>>>the process doesn't need to be transferred in the 1st round. > >>>I am sorry, but I do not understand. Why cannot _guest_ simply check the > >>>struct page ref count and send them to the hypervisor? > >>Were you suggesting the following? > >>1) get a free page block from the page list using the API; > >No. Use a pfn walk, check the reference count and skip those pages which > >have 0 ref count. > > > "pfn walk" - do you mean start from the first pfn, and scan all the pfns > that the VM has? yes > >I suspected that you need to do some sort of pfn > >walk anyway because you somehow have to evaluate the memory to migrate, > >right? > > We don't need to do the pfn walk in the guest kernel. When the API > reports, for example, a 2MB free page block, the API caller offers > the base address of the page block, and size=2MB, to > the hypervisor. So you want to skip pfn walks by regularly calling into the page allocator to update your bitmap. If that is the case then would an API that would allow you to update your bitmap via a callback be sufficient? Something like void walk_free_mem(int node, int min_order, void (*visit)(unsigned long pfn, unsigned long nr_pages)) The function will call the given callback for each free memory block on the given node starting from the given min_order. The callback will run in a strictly atomic and very light context. You can update your bitmap from there. This would address my main concern that the allocator internals would get outside of the allocator proper. A nasty callback which would be too expensive could still stall other allocations and cause latencies but the locking will be under mm core control at least. Does that sound useful?
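To make the proposal concrete, a sketch of how virtio-balloon might plug into the suggested walk_free_mem(); queue_free_hint() and the minimum order are made up for illustration:

```c
/* Hypothetical: record a (pfn, nr_pages) range for later transmission. */
extern void queue_free_hint(unsigned long pfn, unsigned long nr_pages);

/*
 * Callback for the proposed walk_free_mem(). It runs atomically under
 * the zone lock, so it must only do lightweight work such as recording
 * the range; the virtio ring operations happen later in process context.
 */
static void balloon_note_free(unsigned long pfn, unsigned long nr_pages)
{
	queue_free_hint(pfn, nr_pages);
}

static void balloon_report_free_pages(void)
{
	int node;

	/* Only bother with blocks of at least 2MB (order 9 on 4K pages). */
	for_each_online_node(node)
		walk_free_mem(node, 9, balloon_note_free);
}
```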
On Tuesday, July 25, 2017 8:42 PM, Michal Hocko wrote: > On Tue 25-07-17 19:56:24, Wei Wang wrote: > > On 07/25/2017 07:25 PM, Michal Hocko wrote: > > >On Tue 25-07-17 17:32:00, Wei Wang wrote: > > >>On 07/24/2017 05:00 PM, Michal Hocko wrote: > > >>>On Wed 19-07-17 20:01:18, Wei Wang wrote: > > >>>>On 07/19/2017 04:13 PM, Michal Hocko wrote: > > >>>[... > > We don't need to do the pfn walk in the guest kernel. When the API > > reports, for example, a 2MB free page block, the API caller offers > > the base address of the page block, and size=2MB, to > > the hypervisor. > > So you want to skip pfn walks by regularly calling into the page allocator to > update your bitmap. If that is the case then would an API that would allow you > to update your bitmap via a callback be sufficient? Something like > void walk_free_mem(int node, int min_order, > void (*visit)(unsigned long pfn, unsigned long nr_pages)) > > The function will call the given callback for each free memory block on the given > node starting from the given min_order. The callback will run in a strictly atomic > and very light context. You can update your bitmap from there. I would need to introduce more about the background here: The hypervisor and the guest live in their own address spaces. The hypervisor's bitmap isn't seen by the guest. I think we also wouldn't be able to pass a callback function from the hypervisor to the guest in this case. > > This would address my main concern that the allocator internals would get > outside of the allocator proper. What issue would exposing the internal for_each_zone() cause? I think new code that calls it would also be strictly reviewed when pushed upstream. Best, Wei
On Tue 25-07-17 14:47:16, Wang, Wei W wrote: > On Tuesday, July 25, 2017 8:42 PM, Michal Hocko wrote: > > On Tue 25-07-17 19:56:24, Wei Wang wrote: > > > On 07/25/2017 07:25 PM, Michal Hocko wrote: > > > >On Tue 25-07-17 17:32:00, Wei Wang wrote: > > > >>On 07/24/2017 05:00 PM, Michal Hocko wrote: > > > >>>On Wed 19-07-17 20:01:18, Wei Wang wrote: > > > >>>>On 07/19/2017 04:13 PM, Michal Hocko wrote: > > > >>>[... > > > We don't need to do the pfn walk in the guest kernel. When the API > > > reports, for example, a 2MB free page block, the API caller offers > > > the base address of the page block, and size=2MB, to > > > the hypervisor. > > > > So you want to skip pfn walks by regularly calling into the page allocator to > > update your bitmap. If that is the case then would an API that would allow you > > to update your bitmap via a callback be sufficient? Something like > > void walk_free_mem(int node, int min_order, > > void (*visit)(unsigned long pfn, unsigned long nr_pages)) > > > > The function will call the given callback for each free memory block on the given > > node starting from the given min_order. The callback will run in a strictly atomic > > and very light context. You can update your bitmap from there. > > I would need to introduce more about the background here: > The hypervisor and the guest live in their own address spaces. The hypervisor's bitmap > isn't seen by the guest. I think we also wouldn't be able to pass a callback function > from the hypervisor to the guest in this case. How did you plan to use your original API, which exports a struct page array, then? > > This would address my main concern that the allocator internals would get > > outside of the allocator proper. > > What issue would exposing the internal for_each_zone() cause? zone is an MM-internal concept. No code outside of the MM proper should really care about zones.
On 07/25/2017 10:53 PM, Michal Hocko wrote:
> On Tue 25-07-17 14:47:16, Wang, Wei W wrote:
>> On Tuesday, July 25, 2017 8:42 PM, Michal Hocko wrote:
>>> On Tue 25-07-17 19:56:24, Wei Wang wrote:
>>>> On 07/25/2017 07:25 PM, Michal Hocko wrote:
>>>>> On Tue 25-07-17 17:32:00, Wei Wang wrote:
>>>>>> On 07/24/2017 05:00 PM, Michal Hocko wrote:
>>>>>>> On Wed 19-07-17 20:01:18, Wei Wang wrote:
>>>>>>>> On 07/19/2017 04:13 PM, Michal Hocko wrote:
>>>>>>> [...]
>>>> We don't need to do the pfn walk in the guest kernel. When the API
>>>> reports, for example, a 2MB free page block, the API caller offers
>>>> the base address of the page block, and size=2MB, to the hypervisor.
>>> So you want to skip pfn walks by regularly calling into the page
>>> allocator to update your bitmap. If that is the case, then would an
>>> API that allows you to update your bitmap via a callback be
>>> sufficient? Something like
>>>
>>> 	void walk_free_mem(int node, int min_order,
>>> 		void (*visit)(unsigned long pfn, unsigned long nr_pages))
>>>
>>> The function will call the given callback for each free memory block
>>> on the given node, starting from the given min_order. The callback
>>> will run in a strictly atomic and very lightweight context. You can
>>> update your bitmap from there.
>> I need to introduce more of the background here: the hypervisor and
>> the guest live in their own address spaces, and the hypervisor's
>> bitmap is not visible to the guest. I also don't think we would be
>> able to pass a callback function from the hypervisor to the guest in
>> this case.
> How did you plan to use your original API, which exports a struct page
> array, then?

That's where the virtio-balloon driver comes in. It uses a shared ring
mechanism to send the guest memory info to the hypervisor.

We didn't expose the struct page array from the guest to the hypervisor.
For example, when a 2MB free page block is reported from the free page
list, the info put on the ring is just (base address of the 2MB
contiguous memory, size=2MB).

>>> This would address my main concern that the allocator internals
>>> would get outside of the allocator proper.
>> What issue would exposing an internal such as for_each_zone() cause?
> zone is an MM-internal concept. No code outside of the MM proper
> should really care about zones.

I think this is also what Andrew suggested in the previous discussion:
https://lkml.org/lkml/2017/3/16/951

Moving the code to virtio-balloon, with a little layering violation,
seems acceptable.

Best,
Wei
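The (base address, size) pairs Wei describes are plain scalars, which is
what lets the shared ring work without exposing struct page. As a rough
illustration only, one entry on that ring might look like the sketch
below; the struct name and field layout here are assumptions made for
this note, not the actual virtio-balloon wire format:

#include <linux/types.h>

/*
 * Hypothetical layout of one entry placed on the shared virtio ring.
 * Only guest-physical addresses and lengths cross the guest/hypervisor
 * boundary; no struct page pointers are exposed.
 */
struct free_page_block_desc {
	__le64 base_gpa;	/* guest-physical base of the free block */
	__le64 size;		/* block size in bytes, e.g. 2MB */
};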
On Wed 26-07-17 10:22:23, Wei Wang wrote:
> On 07/25/2017 10:53 PM, Michal Hocko wrote:
> >On Tue 25-07-17 14:47:16, Wang, Wei W wrote:
> >>On Tuesday, July 25, 2017 8:42 PM, Michal Hocko wrote:
> >>>On Tue 25-07-17 19:56:24, Wei Wang wrote:
> >>>>On 07/25/2017 07:25 PM, Michal Hocko wrote:
> >>>>>On Tue 25-07-17 17:32:00, Wei Wang wrote:
> >>>>>>On 07/24/2017 05:00 PM, Michal Hocko wrote:
> >>>>>>>On Wed 19-07-17 20:01:18, Wei Wang wrote:
> >>>>>>>>On 07/19/2017 04:13 PM, Michal Hocko wrote:
> >>>>>>>[...]
> >>>>We don't need to do the pfn walk in the guest kernel. When the API
> >>>>reports, for example, a 2MB free page block, the API caller offers
> >>>>the base address of the page block, and size=2MB, to the hypervisor.
> >>>So you want to skip pfn walks by regularly calling into the page
> >>>allocator to update your bitmap. If that is the case, then would an
> >>>API that allows you to update your bitmap via a callback be
> >>>sufficient? Something like
> >>>	void walk_free_mem(int node, int min_order,
> >>>		void (*visit)(unsigned long pfn, unsigned long nr_pages))
> >>>
> >>>The function will call the given callback for each free memory block
> >>>on the given node, starting from the given min_order. The callback
> >>>will run in a strictly atomic and very lightweight context. You can
> >>>update your bitmap from there.
> >>I need to introduce more of the background here: the hypervisor and
> >>the guest live in their own address spaces, and the hypervisor's
> >>bitmap is not visible to the guest. I also don't think we would be
> >>able to pass a callback function from the hypervisor to the guest in
> >>this case.
> >How did you plan to use your original API, which exports a struct page
> >array, then?
>
> That's where the virtio-balloon driver comes in. It uses a shared ring
> mechanism to send the guest memory info to the hypervisor.
>
> We didn't expose the struct page array from the guest to the
> hypervisor. For example, when a 2MB free page block is reported from
> the free page list, the info put on the ring is just (base address of
> the 2MB contiguous memory, size=2MB).

So what exactly prevents virtio-balloon from using the above proposed
callback mechanism and exporting what is needed to the hypervisor?

> >>>This would address my main concern that the allocator internals
> >>>would get outside of the allocator proper.
> >>What issue would exposing an internal such as for_each_zone() cause?
> >zone is an MM-internal concept. No code outside of the MM proper
> >should really care about zones.
>
> I think this is also what Andrew suggested in the previous discussion:
> https://lkml.org/lkml/2017/3/16/951
>
> Moving the code to virtio-balloon, with a little layering violation,
> seems acceptable.

Andrew didn't suggest exposing zones to modules. If I read his words
properly, he said that a functionality which "provides a snapshot of the
present system unused pages" is just too specific to the virtio use case
to be a generic function, and as such it should be in virtio. I tend to
agree.

Even the proposed callback API is a layer violation. We should just make
sure that the API is not inherently dangerous. That is why I have chosen
to give only pfn and nr_pages to the caller. Sure, somebody could argue
that the caller could do pfn_to_page and do nasty things. That would be
a fair argument, but nothing really prevents anybody from doing the pfn
walk already, so there shouldn't be inherently more risk.
We can document what we expect from the API user, and that would be much
easier to describe than a struct page API, IMHO.
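To make the pfn/nr_pages contract concrete: a callback written against
it can derive everything the hypervisor needs without ever dereferencing
struct page. A minimal sketch, assuming a hypothetical push_free_range()
consumer (e.g. the virtio-balloon ring producer):

/* Hypothetical consumer; not part of the proposed API. */
extern void push_free_range(void *opaque, u64 base, u64 len);

/*
 * Sketch of a visit callback that stays within the proposed contract:
 * it converts (pfn, nr_pages) into a guest-physical range and forwards
 * it, never touching struct page.
 */
static void visit_free_block(void *opaque, unsigned long pfn,
			     unsigned long nr_pages)
{
	u64 base = (u64)pfn << PAGE_SHIFT;
	u64 len  = (u64)nr_pages << PAGE_SHIFT;

	push_free_range(opaque, base, len);
}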
On 07/26/2017 06:24 PM, Michal Hocko wrote:
> On Wed 26-07-17 10:22:23, Wei Wang wrote:
>> On 07/25/2017 10:53 PM, Michal Hocko wrote:
>>> On Tue 25-07-17 14:47:16, Wang, Wei W wrote:
>>>> On Tuesday, July 25, 2017 8:42 PM, Michal Hocko wrote:
>>>>> On Tue 25-07-17 19:56:24, Wei Wang wrote:
>>>>>> On 07/25/2017 07:25 PM, Michal Hocko wrote:
>>>>>>> On Tue 25-07-17 17:32:00, Wei Wang wrote:
>>>>>>>> On 07/24/2017 05:00 PM, Michal Hocko wrote:
>>>>>>>>> On Wed 19-07-17 20:01:18, Wei Wang wrote:
>>>>>>>>>> On 07/19/2017 04:13 PM, Michal Hocko wrote:
>>>>>>>>> [...]
>>>>>> We don't need to do the pfn walk in the guest kernel. When the API
>>>>>> reports, for example, a 2MB free page block, the API caller offers
>>>>>> the base address of the page block, and size=2MB, to the
>>>>>> hypervisor.
>>>>> So you want to skip pfn walks by regularly calling into the page
>>>>> allocator to update your bitmap. If that is the case, then would an
>>>>> API that allows you to update your bitmap via a callback be
>>>>> sufficient? Something like
>>>>>
>>>>> 	void walk_free_mem(int node, int min_order,
>>>>> 		void (*visit)(unsigned long pfn, unsigned long nr_pages))
>>>>>
>>>>> The function will call the given callback for each free memory
>>>>> block on the given node, starting from the given min_order. The
>>>>> callback will run in a strictly atomic and very lightweight
>>>>> context. You can update your bitmap from there.
>>>> I need to introduce more of the background here: the hypervisor and
>>>> the guest live in their own address spaces, and the hypervisor's
>>>> bitmap is not visible to the guest. I also don't think we would be
>>>> able to pass a callback function from the hypervisor to the guest in
>>>> this case.
>>> How did you plan to use your original API, which exports a struct
>>> page array, then?
>>
>> That's where the virtio-balloon driver comes in. It uses a shared ring
>> mechanism to send the guest memory info to the hypervisor.
>>
>> We didn't expose the struct page array from the guest to the
>> hypervisor. For example, when a 2MB free page block is reported from
>> the free page list, the info put on the ring is just (base address of
>> the 2MB contiguous memory, size=2MB).
> So what exactly prevents virtio-balloon from using the above proposed
> callback mechanism and exporting what is needed to the hypervisor?

I thought about it more. Probably we can use the callback function with
a little change like this:

void walk_free_mem(void *opaque,
		   void (*visit)(void *opaque, unsigned long pfn,
				 unsigned long nr_pages))
{
	...
	for_each_populated_zone(zone) {
		for_each_migratetype_order(order, type) {
			/* from patch 6; loop until the list is exhausted */
			page = NULL;
			while (!report_unused_page_block(zone, order, type,
							 &page)) {
				pfn = page_to_pfn(page);
				visit(opaque, pfn, 1 << order);
			}
		}
	}
}

The above function scans all the free lists and directly sends each free
page block to the hypervisor via the virtio_balloon callback below. No
need to implement a bitmap.

In virtio-balloon, we have the callback:

void virtio_balloon_report_unused_pages(void *opaque, unsigned long pfn,
					unsigned long nr_pages)
{
	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;

	/* ...put the free page block on the ring of vb... */
}

What do you think?

Best,
Wei
On Wed 26-07-17 19:44:23, Wei Wang wrote:
[...]
> I thought about it more. Probably we can use the callback function with
> a little change like this:
>
> void walk_free_mem(void *opaque,
> 		   void (*visit)(void *opaque, unsigned long pfn,
> 				 unsigned long nr_pages))
> {
> 	...
> 	for_each_populated_zone(zone) {
> 		for_each_migratetype_order(order, type) {
> 			/* from patch 6; loop until the list is exhausted */
> 			page = NULL;
> 			while (!report_unused_page_block(zone, order, type,
> 							 &page)) {
> 				pfn = page_to_pfn(page);
> 				visit(opaque, pfn, 1 << order);
> 			}
> 		}
> 	}
> }
>
> The above function scans all the free lists and directly sends each
> free page block to the hypervisor via the virtio_balloon callback
> below. No need to implement a bitmap.
>
> In virtio-balloon, we have the callback:
>
> void virtio_balloon_report_unused_pages(void *opaque, unsigned long pfn,
> 					unsigned long nr_pages)
> {
> 	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
>
> 	/* ...put the free page block on the ring of vb... */
> }
>
> What do you think?

I do not mind conveying a context to the callback. I would still prefer
to keep the original min_order semantic, though. Why? Well, it doesn't
make much sense to scan low-order free blocks all the time because they
are simply too volatile. Larger blocks tend to survive for longer. So I
assume you would only care about larger free blocks. This will also make
the call cheaper.
On Wednesday, July 26, 2017 7:55 PM, Michal Hocko wrote:
> On Wed 26-07-17 19:44:23, Wei Wang wrote:
> [...]
> > I thought about it more. Probably we can use the callback function
> > with a little change like this:
> >
> > void walk_free_mem(void *opaque,
> > 		   void (*visit)(void *opaque, unsigned long pfn,
> > 				 unsigned long nr_pages))
> > {
> > 	...
> > 	for_each_populated_zone(zone) {
> > 		for_each_migratetype_order(order, type) {
> > 			/* from patch 6; loop until the list is exhausted */
> > 			page = NULL;
> > 			while (!report_unused_page_block(zone, order, type,
> > 							 &page)) {
> > 				pfn = page_to_pfn(page);
> > 				visit(opaque, pfn, 1 << order);
> > 			}
> > 		}
> > 	}
> > }
> >
> > The above function scans all the free lists and directly sends each
> > free page block to the hypervisor via the virtio_balloon callback
> > below. No need to implement a bitmap.
> >
> > In virtio-balloon, we have the callback:
> >
> > void virtio_balloon_report_unused_pages(void *opaque, unsigned long pfn,
> > 					unsigned long nr_pages)
> > {
> > 	struct virtio_balloon *vb = (struct virtio_balloon *)opaque;
> >
> > 	/* ...put the free page block on the ring of vb... */
> > }
> >
> > What do you think?
>
> I do not mind conveying a context to the callback. I would still prefer
> to keep the original min_order semantic, though. Why? Well, it doesn't
> make much sense to scan low-order free blocks all the time because they
> are simply too volatile. Larger blocks tend to survive for longer. So I
> assume you would only care about larger free blocks. This will also
> make the call cheaper.

OK, I will keep min_order there in the next version.

Best,
Wei
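Putting the two mails above together, the agreed direction (a
context-carrying callback plus the retained min_order filter) might look
like the sketch below. It reuses report_unused_page_block() from the
patch that follows; this is an illustration of the discussion, not the
code of the next version of the series:

/*
 * Sketch: walk all free lists, skipping orders below min_order, and
 * report each free block to the given callback. A block may be offered
 * more than once if a list mutates between calls.
 */
void walk_free_mem(unsigned int min_order, void *opaque,
		   void (*visit)(void *opaque, unsigned long pfn,
				 unsigned long nr_pages))
{
	struct zone *zone;
	struct page *page;
	unsigned int order, type;

	for_each_populated_zone(zone) {
		for_each_migratetype_order(order, type) {
			/* small blocks are too volatile to be worth it */
			if (order < min_order)
				continue;
			page = NULL;
			while (!report_unused_page_block(zone, order, type,
							 &page))
				visit(opaque, page_to_pfn(page),
				      1UL << order);
		}
	}
}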
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 46b9ac5..76cb433 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1835,6 +1835,11 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
 		unsigned long zone_start_pfn, unsigned long *zholes_size);
 extern void free_initmem(void);
 
+#if IS_ENABLED(CONFIG_VIRTIO_BALLOON)
+extern int report_unused_page_block(struct zone *zone, unsigned int order,
+				    unsigned int migratetype,
+				    struct page **page);
+#endif
 /*
  * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
  * into the buddy system. The freed pages will be poisoned with pattern
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 64b7d82..8b3c9dd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4753,6 +4753,102 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 	show_swap_cache_info();
 }
 
+#if IS_ENABLED(CONFIG_VIRTIO_BALLOON)
+
+/*
+ * Heuristically get a page block in the system that is unused.
+ * It is possible that pages from the page block are used immediately after
+ * report_unused_page_block() returns. It is the caller's responsibility
+ * to either detect or prevent the use of such pages.
+ *
+ * The free list to check: zone->free_area[order].free_list[migratetype].
+ *
+ * If the caller-supplied page block (i.e. **page) is on the free list, offer
+ * the next page block on the list to the caller. Otherwise, offer the first
+ * page block on the list.
+ *
+ * Note: it is not safe for the caller to use any pages on the returned
+ * block or to discard data that is put there after the function returns.
+ * However, it is safe for the caller to discard data that was in one of
+ * these pages before the function was invoked.
+ *
+ * Return 0 when a page block is found on the caller-specified free list.
+ * Return 1 when the free list is empty or the end of the list has been
+ * reached. Return -EINVAL on invalid input.
+ */
+int report_unused_page_block(struct zone *zone, unsigned int order,
+			     unsigned int migratetype, struct page **page)
+{
+	struct zone *this_zone;
+	struct list_head *this_list;
+	int ret = 0;
+	unsigned long flags;
+
+	/* Sanity check */
+	if (zone == NULL || page == NULL || order >= MAX_ORDER ||
+	    migratetype >= MIGRATE_TYPES)
+		return -EINVAL;
+
+	/* Zone validity check */
+	for_each_populated_zone(this_zone) {
+		if (zone == this_zone)
+			break;
+	}
+
+	/* Got a non-existent zone from the caller? */
+	if (zone != this_zone)
+		return -EINVAL;
+
+	spin_lock_irqsave(&this_zone->lock, flags);
+
+	this_list = &zone->free_area[order].free_list[migratetype];
+	if (list_empty(this_list)) {
+		*page = NULL;
+		ret = 1;
+		goto out;
+	}
+
+	/* The caller is asking for the first free page block on the list */
+	if (!*page) {
+		*page = list_first_entry(this_list, struct page, lru);
+		ret = 0;
+		goto out;
+	}
+
+	/*
+	 * The page block passed from the caller is not on this free list
+	 * anymore (e.g. a 1MB free page block has been split). In this case,
+	 * offer the first page block on the free list that the caller is
+	 * asking for.
+	 */
+	if (PageBuddy(*page) && order != page_order(*page)) {
+		*page = list_first_entry(this_list, struct page, lru);
+		ret = 0;
+		goto out;
+	}
+
+	/*
+	 * The page block passed from the caller was the last page block
+	 * on the list.
+	 */
+	if ((*page)->lru.next == this_list) {
+		*page = NULL;
+		ret = 1;
+		goto out;
+	}
+
+	/*
+	 * Finally, fall into the regular case: the page block passed from
+	 * the caller is still on the free list. Offer the next one.
+	 */
+	*page = list_next_entry((*page), lru);
+	ret = 0;
+out:
+	spin_unlock_irqrestore(&this_zone->lock, flags);
+	return ret;
+}
+EXPORT_SYMBOL(report_unused_page_block);
+
+#endif
+
 static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
 {
 	zoneref->zone = zone;
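For reference, a caller of report_unused_page_block() would walk one
free list roughly as in the sketch below; *page acts as a cursor carried
between calls, and do_report() is a hypothetical consumer of the
reported ranges:

/* Hypothetical consumer of the reported (pfn, nr_pages) ranges. */
extern void do_report(unsigned long pfn, unsigned long nr_pages);

/*
 * Sketch: walk zone->free_area[order].free_list[migratetype] block by
 * block. A zero return supplies the next block; a non-zero return means
 * the list is empty or the end was reached. Note that if the saved
 * block is split between calls, the walk restarts from the list head,
 * so a block may be offered more than once.
 */
static void walk_one_free_list(struct zone *zone, unsigned int order,
			       unsigned int migratetype)
{
	struct page *page = NULL;	/* cursor carried between calls */

	while (report_unused_page_block(zone, order, migratetype,
					&page) == 0)
		do_report(page_to_pfn(page), 1UL << order);
}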