
mm: Teach pfn_to_online_page() about ZONE_DEVICE section collisions

Message ID 160990599013.2430134.11556277600719835946.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
State New, archived
Series mm: Teach pfn_to_online_page() about ZONE_DEVICE section collisions

Commit Message

Dan Williams Jan. 6, 2021, 4:07 a.m. UTC
While pfn_to_online_page() is able to determine pfn_valid() at
subsection granularity it is not able to reliably determine if a given
pfn is also online if the section is mixed with ZONE_DEVICE pfns.

Update move_pfn_range_to_zone() to flag (SECTION_TAINT_ZONE_DEVICE) a
section that mixes ZONE_DEVICE pfns with other online pfns. With
SECTION_TAINT_ZONE_DEVICE to delineate, pfn_to_online_page() can fall
back to a slow-path check for ZONE_DEVICE pfns in an online section.

With this implementation of pfn_to_online_page() pfn-walkers mostly only
need to check section metadata to determine pfn validity. In the
rare case of mixed-zone sections the pfn-walker will skip offline
ZONE_DEVICE pfns as expected.
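
For illustration only - a hypothetical pfn-walker, not part of this
patch - the usage pattern stays the same and never inspects section
metadata directly:

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		struct page *page = pfn_to_online_page(pfn);

		/* NULL for offline pfns, including ZONE_DEVICE pfns */
		if (!page)
			continue;
		/* ... operate on a fully initialized, online page ... */
	}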

Other notes:

Because the collision case is rare, and for simplicity, the
SECTION_TAINT_ZONE_DEVICE flag is never cleared once set.

pfn_to_online_page() was already borderline too large to be a macro /
inline function, but the additional logic certainly pushed it over that
threshold, and so it is moved to an out-of-line helper.

Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
Cc: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Michal Hocko <mhocko@suse.com>
Reported-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---

This compiles and passes the nvdimm unit tests, but I have not tested
with pfn walkers in the presence of ZONE_DEVICE collisions.


 include/linux/memory_hotplug.h |   17 +---------
 include/linux/mmzone.h         |   22 ++++++++----
 mm/memory_hotplug.c            |   71 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 87 insertions(+), 23 deletions(-)

Comments

Michal Hocko Jan. 6, 2021, 9:55 a.m. UTC | #1
On Tue 05-01-21 20:07:18, Dan Williams wrote:
> While pfn_to_online_page() is able to determine pfn_valid() at
> subsection granularity it is not able to reliably determine if a given
> pfn is also online if the section is mixed with ZONE_DEVICE pfns.

I would call out the problem more explicitly. E.g. something like
"
This means that pfn_to_online_page() can lead to false positives and
return a struct page which is not fully initialized because it
belongs to ZONE_DEVICE (PUT AN EXAMPLE OF SUCH AN UNINITIALIZED PAGE
HERE). That can lead either to a crash on a PagePoisoned() assertion or
to silently broken page state with debugging disabled.
"

I would also appreciate a more specific note about how the existing HW
can trigger this. You have mentioned 64MB subsection examples in the
other email. It would be great to mention it here as well.

> Update move_pfn_range_to_zone() to flag (SECTION_TAINT_ZONE_DEVICE) a
> section that mixes ZONE_DEVICE pfns with other online pfns. With
> SECTION_TAINT_ZONE_DEVICE to delineate, pfn_to_online_page() can fall
> back to a slow-path check for ZONE_DEVICE pfns in an online section.
> 
> With this implementation of pfn_to_online_page() pfn-walkers mostly only
> need to check section metadata to determine pfn validity. In the
> rare case of mixed-zone sections the pfn-walker will skip offline
> ZONE_DEVICE pfns as expected.

The above paragraph is slightly confusing. You do not require
pfn-walkers to check anything right? Section metadata is something that
is and should be hidden from them. They are asking for an online page
and get it or NULL. Nothing more nothing less.

 
> Other notes:
> 
> Because the collision case is rare, and for simplicity, the
> SECTION_TAINT_ZONE_DEVICE flag is never cleared once set.

I do not see a problem with that.
 
> pfn_to_online_page() was already borderline too large to be a macro /
> inline function, but the additional logic certainly pushed it over that
> threshold, and so it is moved to an out-of-line helper.

Worth a separate patch.

The approach is sensible. Thanks!

I was worried that we do not have sufficient space for a new flag, but
the comment explains we have 6 bits available. I haven't double-checked
that for the current state of the code. The comment is quite recent and
I do not remember any substantial changes in this area. Still, this is
rather fragile, because an unexpected alignment would become a runtime
failure, which is good to stop random corruptions but not ideal.

Is there any way to explicitly test for this? E.g. enforce a shared
section by qemu and then trigger a pfn walk?

> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Reported-by: Michal Hocko <mhocko@suse.com>
> Reported-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

[...]

> +static int zone_id(const struct zone *zone)
> +{
> +	struct pglist_data *pgdat = zone->zone_pgdat;
> +
> +	return zone - pgdat->node_zones;
> +}

We already have zone_idx()
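
i.e. (untested sketch), the open-coded helper could be dropped and the
check written as:

	if (zone_idx(zone) != ZONE_DEVICE)
		return;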

> +
> +static void section_taint_zone_device(struct zone *zone, unsigned long pfn)
> +{
> +	struct mem_section *ms = __nr_to_section(pfn_to_section_nr(pfn));
> +
> +	if (zone_id(zone) != ZONE_DEVICE)
> +		return;
> +
> +	if (IS_ALIGNED(pfn, PAGES_PER_SECTION))
> +		return;
> +
> +	ms->section_mem_map |= SECTION_TAINT_ZONE_DEVICE;
> +}
> +
>  /*
>   * Associate the pfn range with the given zone, initializing the memmaps
>   * and resizing the pgdat/zone data to span the added pages. After this
> @@ -707,6 +769,15 @@ void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
>  	resize_pgdat_range(pgdat, start_pfn, nr_pages);
>  	pgdat_resize_unlock(pgdat, &flags);
>  
> +	/*
> +	 * Subsection population requires care in pfn_to_online_page().
> +	 * Set the taint to enable the slow path detection of
> +	 * ZONE_DEVICE pages in an otherwise  ZONE_{NORMAL,MOVABLE}
> +	 * section.
> +	 */
> +	section_taint_zone_device(zone, start_pfn);
> +	section_taint_zone_device(zone, start_pfn + nr_pages);

I think it would be better to add the checks here and only set the flag
in the called function. SECTION_TAINT_ZONE_DEVICE should go to where we
define helpers for the other flags.
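
E.g. something along these lines (untested sketch, reusing zone_idx() as
noted above), with section_taint_zone_device() reduced to only setting
the flag on the section containing the pfn:

	if (zone_idx(zone) == ZONE_DEVICE) {
		if (!IS_ALIGNED(start_pfn, PAGES_PER_SECTION))
			section_taint_zone_device(start_pfn);
		if (!IS_ALIGNED(start_pfn + nr_pages, PAGES_PER_SECTION))
			section_taint_zone_device(start_pfn + nr_pages);
	}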

> +
>  	/*
>  	 * TODO now we have a visible range of pages which are not associated
>  	 * with their zone properly. Not nice but set_pfnblock_flags_mask
>
David Hildenbrand Jan. 6, 2021, 9:56 a.m. UTC | #2
On 06.01.21 05:07, Dan Williams wrote:
> While pfn_to_online_page() is able to determine pfn_valid() at
> subsection granularity it is not able to reliably determine if a given
> pfn is also online if the section is mixed with ZONE_DEVICE pfns.
> 
> Update move_pfn_range_to_zone() to flag (SECTION_TAINT_ZONE_DEVICE) a
> section that mixes ZONE_DEVICE pfns with other online pfns. With
> SECTION_TAINT_ZONE_DEVICE to delineate, pfn_to_online_page() can fall
> back to a slow-path check for ZONE_DEVICE pfns in an online section.
> 
> With this implementation of pfn_to_online_page() pfn-walkers mostly only
> need to check section metadata to determine pfn validity. In the
> rare case of mixed-zone sections the pfn-walker will skip offline
> ZONE_DEVICE pfns as expected.
> 
> Other notes:
> 
> Because the collision case is rare, and for simplicity, the
> SECTION_TAINT_ZONE_DEVICE flag is never cleared once set.
> 
> pfn_to_online_page() was already borderline too large to be a macro /
> inline function, but the additional logic certainly pushed it over that
> threshold, and so it is moved to an out-of-line helper.
> 
> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Reported-by: Michal Hocko <mhocko@suse.com>
> Reported-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

[...]

> +#define SECTION_MARKED_PRESENT		(1UL<<0)
> +#define SECTION_HAS_MEM_MAP		(1UL<<1)
> +#define SECTION_IS_ONLINE		(1UL<<2)
> +#define SECTION_IS_EARLY		(1UL<<3)
> +#define SECTION_TAINT_ZONE_DEVICE	(1UL<<4)
> +#define SECTION_MAP_LAST_BIT		(1UL<<5)
> +#define SECTION_MAP_MASK		(~(SECTION_MAP_LAST_BIT-1))
> +#define SECTION_NID_SHIFT		3
>  
>  static inline struct page *__section_mem_map_addr(struct mem_section *section)
>  {
> @@ -1318,6 +1319,13 @@ static inline int online_section(struct mem_section *section)
>  	return (section && (section->section_mem_map & SECTION_IS_ONLINE));
>  }
>  
> +static inline int online_device_section(struct mem_section *section)
> +{
> +	unsigned long flags = SECTION_IS_ONLINE | SECTION_TAINT_ZONE_DEVICE;
> +
> +	return section && ((section->section_mem_map & flags) == flags);
> +}
> +
>  static inline int online_section_nr(unsigned long nr)
>  {
>  	return online_section(__nr_to_section(nr));
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index f9d57b9be8c7..9f36968e6188 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -300,6 +300,47 @@ static int check_hotplug_memory_addressable(unsigned long pfn,
>  	return 0;
>  }
>  
> +/*
> + * Return page for the valid pfn only if the page is online. All pfn
> + * walkers which rely on the fully initialized page->flags and others
> + * should use this rather than pfn_valid && pfn_to_page
> + */
> +struct page *pfn_to_online_page(unsigned long pfn)
> +{
> +	unsigned long nr = pfn_to_section_nr(pfn);
> +	struct dev_pagemap *pgmap;
> +	struct mem_section *ms;
> +
> +	if (nr >= NR_MEM_SECTIONS)
> +		return NULL;
> +
> +	ms = __nr_to_section(nr);
> +
> +	if (!online_section(ms))
> +		return NULL;
> +
> +	if (!pfn_valid_within(pfn))
> +		return NULL;
> +
> +	if (!online_device_section(ms))
> +		return pfn_to_page(pfn);
> +
> +	/*
> +	 * Slowpath: when ZONE_DEVICE collides with
> +	 * ZONE_{NORMAL,MOVABLE} within the same section some pfns in
> +	 * the section may be 'offline' but 'valid'. Only
> +	 * get_dev_pagemap() can determine sub-section online status.
> +	 */
> +	pgmap = get_dev_pagemap(pfn, NULL);
> +	put_dev_pagemap(pgmap);
> +
> +	/* The presence of a pgmap indicates ZONE_DEVICE offline pfn */
> +	if (pgmap)
> +		return NULL;
> +	return pfn_to_page(pfn);
> +}
> +EXPORT_SYMBOL_GPL(pfn_to_online_page);

Note that this is not sufficient in the general case. I already
mentioned that we effectively override an already initialized memmap.

---

[        SECTION             ]
Before:
[ ZONE_NORMAL ][    Hole     ]

The hole has some node/zone (currently 0/0, discussions ongoing on how
to optimize that to e.g., ZONE_NORMAL in this example) and is
PG_reserved - looks like an ordinary memory hole.

After memremap:
[ ZONE_NORMAL ][ ZONE_DEVICE ]

The already initialized memmap was converted to ZONE_DEVICE. Your
slowpath will work.

After memunmap (no poisoning):
[ ZONE_NORMAL ][ ZONE_DEVICE ]

The slow path is no longer working. pfn_to_online_page() might return
something that is ZONE_DEVICE.

After memunmap (poisoning):
[ ZONE_NORMAL ][ POISONED    ]

The slow path is no longer working. pfn_to_online_page() might return
something that will BUG_ON via page_to_nid() etc.

---

Reason is that pfn_to_online_page() does not care about sub-sections. And
for now, it didn't have to. If there was an online section, it either was

a) Completely present. The whole memmap is initialized to sane values.
b) Partially present. The whole memmap is initialized to sane values.

memremap/memunmap messes with case b)

We'll have to further tweak pfn_to_online_page(). You'll have to also
check pfn_section_valid() *at least* on the slow path. Less-hacky would
be checking it also in the "somewhat-faster" path - that would cover
silently overriding a memmap that's visible via pfn_to_online_page().
Might slow down things a bit.


Not completely opposed to this, but I would certainly still prefer just
avoiding this corner case completely instead of patching around it. Thanks!
David Hildenbrand Jan. 6, 2021, 10:04 a.m. UTC | #3
> Am 06.01.2021 um 05:08 schrieb Dan Williams <dan.j.williams@intel.com>:
> 
> While pfn_to_online_page() is able to determine pfn_valid() at
> subsection granularity it is not able to reliably determine if a given
> pfn is also online if the section is mixed with ZONE_DEVICE pfns.
> 
> Update move_pfn_range_to_zone() to flag (SECTION_TAINT_ZONE_DEVICE) a
> section that mixes ZONE_DEVICE pfns with other online pfns. With
> SECTION_TAINT_ZONE_DEVICE to delineate, pfn_to_online_page() can fall
> back to a slow-path check for ZONE_DEVICE pfns in an online section.
> 
> With this implementation of pfn_to_online_page() pfn-walkers mostly only
> need to check section metadata to determine pfn validity. In the
> rare case of mixed-zone sections the pfn-walker will skip offline
> ZONE_DEVICE pfns as expected.
> 
> Other notes:
> 
> Because the collision case is rare, and for simplicity, the
> SECTION_TAINT_ZONE_DEVICE flag is never cleared once set.
> 
> pfn_to_online_page() was already borderline too large to be a macro /
> inline function, but the additional logic certainly pushed it over that
> threshold, and so it is moved to an out-of-line helper.
> 
> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Reported-by: Michal Hocko <mhocko@suse.com>
> Reported-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> 
> This compiles and passes the nvdimm unit tests, but I have not tested
> with pfn walkers in the presence of ZONE_DEVICE collisions.
> 
> 
> include/linux/memory_hotplug.h |   17 +---------
> include/linux/mmzone.h         |   22 ++++++++----
> mm/memory_hotplug.c            |   71 ++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 87 insertions(+), 23 deletions(-)
> 
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 15acce5ab106..3d99de0db2dd 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -16,22 +16,7 @@ struct resource;
> struct vmem_altmap;
> 
> #ifdef CONFIG_MEMORY_HOTPLUG
> -/*
> - * Return page for the valid pfn only if the page is online. All pfn
> - * walkers which rely on the fully initialized page->flags and others
> - * should use this rather than pfn_valid && pfn_to_page
> - */
> -#define pfn_to_online_page(pfn)                       \
> -({                                   \
> -    struct page *___page = NULL;                   \
> -    unsigned long ___pfn = pfn;                   \
> -    unsigned long ___nr = pfn_to_section_nr(___pfn);       \
> -                                   \
> -    if (___nr < NR_MEM_SECTIONS && online_section_nr(___nr) && \
> -        pfn_valid_within(___pfn))                   \
> -        ___page = pfn_to_page(___pfn);               \
> -    ___page;                           \
> -})
> +struct page *pfn_to_online_page(unsigned long pfn);
> 
> /*
>  * Types for free bootmem stored in page->lru.next. These have to be in
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index b593316bff3d..0b5c44f730b4 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1273,13 +1273,14 @@ extern size_t mem_section_usage_size(void);
>  *      which results in PFN_SECTION_SHIFT equal 6.
>  * To sum it up, at least 6 bits are available.
>  */
> -#define    SECTION_MARKED_PRESENT    (1UL<<0)
> -#define SECTION_HAS_MEM_MAP    (1UL<<1)
> -#define SECTION_IS_ONLINE    (1UL<<2)
> -#define SECTION_IS_EARLY    (1UL<<3)
> -#define SECTION_MAP_LAST_BIT    (1UL<<4)
> -#define SECTION_MAP_MASK    (~(SECTION_MAP_LAST_BIT-1))
> -#define SECTION_NID_SHIFT    3
> +#define SECTION_MARKED_PRESENT        (1UL<<0)
> +#define SECTION_HAS_MEM_MAP        (1UL<<1)
> +#define SECTION_IS_ONLINE        (1UL<<2)
> +#define SECTION_IS_EARLY        (1UL<<3)
> +#define SECTION_TAINT_ZONE_DEVICE    (1UL<<4)
> +#define SECTION_MAP_LAST_BIT        (1UL<<5)
> +#define SECTION_MAP_MASK        (~(SECTION_MAP_LAST_BIT-1))
> +#define SECTION_NID_SHIFT        3
> 
> static inline struct page *__section_mem_map_addr(struct mem_section *section)
> {
> @@ -1318,6 +1319,13 @@ static inline int online_section(struct mem_section *section)
>    return (section && (section->section_mem_map & SECTION_IS_ONLINE));
> }
> 
> +static inline int online_device_section(struct mem_section *section)
> +{
> +    unsigned long flags = SECTION_IS_ONLINE | SECTION_TAINT_ZONE_DEVICE;
> +
> +    return section && ((section->section_mem_map & flags) == flags);
> +}
> +
> static inline int online_section_nr(unsigned long nr)
> {
>    return online_section(__nr_to_section(nr));
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index f9d57b9be8c7..9f36968e6188 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -300,6 +300,47 @@ static int check_hotplug_memory_addressable(unsigned long pfn,
>    return 0;
> }
> 
> +/*
> + * Return page for the valid pfn only if the page is online. All pfn
> + * walkers which rely on the fully initialized page->flags and others
> + * should use this rather than pfn_valid && pfn_to_page
> + */
> +struct page *pfn_to_online_page(unsigned long pfn)
> +{
> +    unsigned long nr = pfn_to_section_nr(pfn);
> +    struct dev_pagemap *pgmap;
> +    struct mem_section *ms;
> +
> +    if (nr >= NR_MEM_SECTIONS)
> +        return NULL;
> +
> +    ms = __nr_to_section(nr);
> +
> +    if (!online_section(ms))
> +        return NULL;
> +
> +    if (!pfn_valid_within(pfn))
> +        return NULL;
> +
> +    if (!online_device_section(ms))
> +        return pfn_to_page(pfn);
> +
> +    /*
> +     * Slowpath: when ZONE_DEVICE collides with
> +     * ZONE_{NORMAL,MOVABLE} within the same section some pfns in
> +     * the section may be 'offline' but 'valid'. Only
> +     * get_dev_pagemap() can determine sub-section online status.
> +     */
> +    pgmap = get_dev_pagemap(pfn, NULL);
> +    put_dev_pagemap(pgmap);
> +
> +    /* The presence of a pgmap indicates ZONE_DEVICE offline pfn */
> +    if (pgmap)
> +        return NULL;
> +    return pfn_to_page(pfn);
> +}
> +EXPORT_SYMBOL_GPL(pfn_to_online_page);
> +
> /*
>  * Reasonably generic function for adding memory.  It is
>  * expected that archs that support memory hotplug will
> @@ -678,6 +719,27 @@ static void __meminit resize_pgdat_range(struct pglist_data *pgdat, unsigned lon
>    pgdat->node_spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - pgdat->node_start_pfn;
> 
> }
> +
> +static int zone_id(const struct zone *zone)
> +{
> +    struct pglist_data *pgdat = zone->zone_pgdat;
> +
> +    return zone - pgdat->node_zones;
> +}
> +
> +static void section_taint_zone_device(struct zone *zone, unsigned long pfn)
> +{
> +    struct mem_section *ms = __nr_to_section(pfn_to_section_nr(pfn));
> +
> +    if (zone_id(zone) != ZONE_DEVICE)
> +        return;
> +
> +    if (IS_ALIGNED(pfn, PAGES_PER_SECTION))
> +        return;

I think you can simplify this. Just taint any early section here, independent of the pfn.
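
E.g. something like this (untested sketch; "any early section" read as
early_section(), which is an interpretation on my side):

static void section_taint_zone_device(struct zone *zone, unsigned long pfn)
{
	struct mem_section *ms = __nr_to_section(pfn_to_section_nr(pfn));

	if (zone_idx(zone) != ZONE_DEVICE)
		return;

	/* any early section that ZONE_DEVICE pfns land in is mixed */
	if (early_section(ms))
		ms->section_mem_map |= SECTION_TAINT_ZONE_DEVICE;
}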
Michal Hocko Jan. 6, 2021, 10:42 a.m. UTC | #4
On Wed 06-01-21 10:56:19, David Hildenbrand wrote:
[...]
> Note that this is not sufficient in the general case. I already
> mentioned that we effectively override an already initialized memmap.
> 
> ---
> 
> [        SECTION             ]
> Before:
> [ ZONE_NORMAL ][    Hole     ]
> 
> The hole has some node/zone (currently 0/0, discussions ongoing on how
> to optimize that to e.g., ZONE_NORMAL in this example) and is
> PG_reserved - looks like an ordinary memory hole.
> 
> After memremap:
> [ ZONE_NORMAL ][ ZONE_DEVICE ]
> 
> The already initialized memmap was converted to ZONE_DEVICE. Your
> slowpath will work.
> 
> After memunmap (no poisoning):
> [ ZONE_NORMAL ][ ZONE_DEVICE ]
> 
> The slow path is no longer working. pfn_to_online_page() might return
> something that is ZONE_DEVICE.
> 
> After memunmap (poisoning):
> [ ZONE_NORMAL ][ POISONED    ]
> 
> The slow path is no longer working. pfn_to_online_page() might return
> something that will BUG_ON via page_to_nid() etc.
> 
> ---
> 
> Reason is that pfn_to_online_page() does not care about sub-sections. And
> for now, it didn't have to. If there was an online section, it either was
> 
> a) Completely present. The whole memmap is initialized to sane values.
> b) Partially present. The whole memmap is initialized to sane values.
> 
> memremap/memunmap messes with case b)

I do not see we ever clear the newly added flag and my understanding is
that the subsection removed would lead to get_dev_pagemap returning a
NULL. Which would obviously need to be checked for pfn_to_online_page.
Or do I miss anything and the above is not the case and we could still
get false positives?

> We'll have to further tweak pfn_to_online_page(). You'll have to also
> check pfn_section_valid() *at least* on the slow path. Less-hacky would
> be checking it also in the "somewhat-faster" path - that would cover
> silently overriding a memmap that's visible via pfn_to_online_page().
> Might slow down things a bit.
> 
> 
> Not completely opposed to this, but I would certainly still prefer just
> avoiding this corner case completely instead of patching around it. Thanks!

Well, I would love to have no surprises either. So far there was not
actual argument why the pmem reserved space cannot be fully initialized.
On the other hand making sure that pfn_to_online_page sounds like the
right thing to do. And having an explicit check for zone device there in
a slow path makes sense to me.
David Hildenbrand Jan. 6, 2021, 11:22 a.m. UTC | #5
On 06.01.21 11:42, Michal Hocko wrote:
> On Wed 06-01-21 10:56:19, David Hildenbrand wrote:
> [...]
>> Note that this is not sufficient in the general case. I already
>> mentioned that we effectively override an already initialized memmap.
>>
>> ---
>>
>> [        SECTION             ]
>> Before:
>> [ ZONE_NORMAL ][    Hole     ]
>>
>> The hole has some node/zone (currently 0/0, discussions ongoing on how
>> to optimize that to e.g., ZONE_NORMAL in this example) and is
>> PG_reserved - looks like an ordinary memory hole.
>>
>> After memremap:
>> [ ZONE_NORMAL ][ ZONE_DEVICE ]
>>
>> The already initialized memmap was converted to ZONE_DEVICE. Your
>> slowpath will work.
>>
>> After memunmap (no poisoning):
>> [ ZONE_NORMAL ][ ZONE_DEVICE ]
>>
>> The slow path is no longer working. pfn_to_online_page() might return
>> something that is ZONE_DEVICE.
>>
>> After memunmap (poisoning):
>> [ ZONE_NORMAL ][ POISONED    ]
>>
>> The slow path is no longer working. pfn_to_online_page() might return
>> something that will BUG_ON via page_to_nid() etc.
>>
>> ---
>>
>> Reason is that pfn_to_online_page() does not care about sub-sections. And
>> for now, it didn't have to. If there was an online section, it either was
>>
>> a) Completely present. The whole memmap is initialized to sane values.
>> b) Partially present. The whole memmap is initialized to sane values.
>>
>> memremap/memunmap messes with case b)
> 
> I do not see we ever clear the newly added flag and my understanding is
> that the subsection removed would lead to get_dev_pagemap returning a
> NULL. Which would obviously need to be checked for pfn_to_online_page.
> Or do I miss anything and the above is not the case and we could still
> get false positives?

See my example above ("After memunmap").

We're still in the slow path. pfn_to_online_page() will return a struct
page as get_dev_pagemap() is now NULL.

Yet page_zone(page) will either
- BUG_ON (memmap was poisoned)
- return ZONE_DEVICE zone (memmap not poisoned when memunmapping)

As I said, can be tackled by checking for pfn_section_valid() at least
on the slow path. Ideally also on the fast path.

> 
>> We'll have to further tweak pfn_to_online_page(). You'll have to also
>> check pfn_section_valid() *at least* on the slow path. Less-hacky would
>> be checking it also in the "somewhat-faster" path - that would cover
>> silently overriding a memmap that's visible via pfn_to_online_page().
>> Might slow down things a bit.
>>
>>
>> Not completely opposed to this, but I would certainly still prefer just
>> avoiding this corner case completely instead of patching around it. Thanks!
> 
> Well, I would love to have no surprises either. So far there was not
> actual argument why the pmem reserved space cannot be fully initialized.

Yes, I'm still hoping Dan can clarify that.

> On the other hand making sure that pfn_to_online_page sounds like the
> right thing to do. And having an explicit check for zone device there in
> a slow path makes sense to me.

As I said, I'd favor simplifying and just getting rid of the special case,
instead of coming up with increasingly complex ways to deal with it.
pfn_to_online_page() used to be simple; essentially, checking a single
flag was sufficient in most setups.
Michal Hocko Jan. 6, 2021, 11:38 a.m. UTC | #6
On Wed 06-01-21 12:22:57, David Hildenbrand wrote:
> On 06.01.21 11:42, Michal Hocko wrote:
> > On Wed 06-01-21 10:56:19, David Hildenbrand wrote:
> > [...]
> >> Note that this is not sufficient in the general case. I already
> >> mentioned that we effectively override an already initialized memmap.
> >>
> >> ---
> >>
> >> [        SECTION             ]
> >> Before:
> >> [ ZONE_NORMAL ][    Hole     ]
> >>
> >> The hole has some node/zone (currently 0/0, discussions ongoing on how
> >> to optimize that to e.g., ZONE_NORMAL in this example) and is
> >> PG_reserved - looks like an ordinary memory hole.
> >>
> >> After memremap:
> >> [ ZONE_NORMAL ][ ZONE_DEVICE ]
> >>
> >> The already initialized memmap was converted to ZONE_DEVICE. Your
> >> slowpath will work.
> >>
> >> After memunmap (no poisoning):
> >> [ ZONE_NORMAL ][ ZONE_DEVICE ]
> >>
> >> The slow path is no longer working. pfn_to_online_page() might return
> >> something that is ZONE_DEVICE.
> >>
> >> After memunmap (poisoning):
> >> [ ZONE_NORMAL ][ POISONED    ]
> >>
> >> The slow path is no longer working. pfn_to_online_page() might return
> >> something that will BUG_ON via page_to_nid() etc.
> >>
> >> ---
> >>
> >> Reason is that pfn_to_online_page() does not care about sub-sections. And
> >> for now, it didn't have to. If there was an online section, it either was
> >>
> >> a) Completely present. The whole memmap is initialized to sane values.
> >> b) Partially present. The whole memmap is initialized to sane values.
> >>
> >> memremap/memunmap messes with case b)
> > 
> > I do not see we ever clear the newly added flag and my understanding is
> > that the subsection removed would lead to get_dev_pagemap returning a
> > NULL. Which would obviously need to be checked for pfn_to_online_page.
> > Or do I miss anything and the above is not the case and we could still
> > get false positives?
> 
> See my example above ("After memunmap").
> 
> We're still in the slow path. pfn_to_online_page() will return a struct
> page as get_dev_pagemap() is now NULL.

Yeah, my bad. I have clearly misread the patch. We would need some other
means than relying on get_dev_pagemap if it doesn't survive the
memunmap. :/
Dan Williams Jan. 6, 2021, 8:02 p.m. UTC | #7
On Wed, Jan 6, 2021 at 3:23 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 06.01.21 11:42, Michal Hocko wrote:
> > On Wed 06-01-21 10:56:19, David Hildenbrand wrote:
> > [...]
> >> Note that this is not sufficient in the general case. I already
> >> mentioned that we effectively override an already initialized memmap.
> >>
> >> ---
> >>
> >> [        SECTION             ]
> >> Before:
> >> [ ZONE_NORMAL ][    Hole     ]
> >>
> >> The hole has some node/zone (currently 0/0, discussions ongoing on how
> >> to optimize that to e.g., ZONE_NORMAL in this example) and is
> >> PG_reserved - looks like an ordinary memory hole.
> >>
> >> After memremap:
> >> [ ZONE_NORMAL ][ ZONE_DEVICE ]
> >>
> >> The already initialized memmap was converted to ZONE_DEVICE. Your
> >> slowpath will work.
> >>
> >> After memunmap (no poisoning):
> >> [ ZONE_NORMAL ][ ZONE_DEVICE ]
> >>
> >> The slow path is no longer working. pfn_to_online_page() might return
> >> something that is ZONE_DEVICE.
> >>
> >> After memunmap (poisoning):
> >> [ ZONE_NORMAL ][ POISONED    ]
> >>
> >> The slow path is no longer working. pfn_to_online_page() might return
> >> something that will BUG_ON via page_to_nid() etc.
> >>
> >> ---
> >>
> >> Reason is that pfn_to_online_page() does not care about sub-sections. And
> >> for now, it didn't have to. If there was an online section, it either was
> >>
> >> a) Completely present. The whole memmap is initialized to sane values.
> >> b) Partially present. The whole memmap is initialized to sane values.
> >>
> >> memremap/memunmap messes with case b)
> >
> > I do not see we ever clear the newly added flag and my understanding is
> > that the subsection removed would lead to get_dev_pagemap returning a
> > NULL. Which would obviously need to be checked for pfn_to_online_page.
> > Or do I miss anything and the above is not the case and we could still
> > get false positives?
>
> See my example above ("After memunmap").
>
> We're still in the slow path. pfn_to_online_page() will return a struct
> page as get_dev_pagemap() is now NULL.
>
> Yet page_zone(page) will either
> - BUG_ON (memmap was poisoned)
> - return ZONE_DEVICE zone (memmap not poisoned when memunmapping)
>
> As I said, can be tackled by checking for pfn_section_valid() at least
> on the slow path. Ideally also on the fast path.

Good eye, I glazed over that the existing pfn_section_valid() check in
pfn_valid() is obviated by early_section(). I'll respin with a
standalone pfn_section_valid() gate in pfn_to_online_page().
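
Roughly (sketch of the planned respin, not the final patch):

	ms = __nr_to_section(nr);

	if (!online_section(ms))
		return NULL;

	/*
	 * The memmap of a partially populated section may never have
	 * been initialized, or may have been torn down by memunmap(),
	 * so do not trust pfn_to_page() without pfn_section_valid().
	 */
	if (!pfn_section_valid(ms, pfn))
		return NULL;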

>
> >
> >> We'll have to further tweak pfn_to_online_page(). You'll have to also
> >> check pfn_section_valid() *at least* on the slow path. Less-hacky would
> >> be checking it also in the "somewhat-faster" path - that would cover
> >> silently overriding a memmap that's visible via pfn_to_online_page().
> >> Might slow down things a bit.
> >>
> >>
> >> Not completely opposed to this, but I would certainly still prefer just
> >> avoiding this corner case completely instead of patching around it. Thanks!
> >
> > Well, I would love to have no surprises either. So far there was not
> > actual argument why the pmem reserved space cannot be fully initialized.
>
> Yes, I'm still hoping Dan can clarify that.

Complexity and effective utility (once pfn_to_online_page() is fixed)
are the roadblocks in my mind. The altmap is there to allow for PMEM
capacity to be used as memmap space, so there would need to be code to
break that circular dependency and allocate a memmap for the metadata
space from DRAM and the rest of the memmap space for the data capacity
from pmem itself. That memmap-for-pmem-metadata will still represent
offline pages. So once pfn_to_online_page() is fixed, what pfn-walker
is going to be doing pfn_to_page() on PMEM metadata? Secondly, there
is a PMEM namespace mode called "raw" that eschews DAX and 'struct
page' for pmem and just behaves like a memory-backed block device. The
end result is still that pfn walkers need to discover if a PMEM pfn
has a page, so I don't see what "sometimes there's an
memmap-for-pmem-metadata" buys us?

>
> > On the other hand making sure that pfn_to_online_page sounds like the
> > right thing to do. And having an explicit check for zone device there in
> > a slow path makes sense to me.
>
> As I said, I'd favor simplifying and just getting rid of the special case,
> instead of coming up with increasingly complex ways to deal with it.
> pfn_to_online_page() used to be simple; essentially, checking a single
> flag was sufficient in most setups.

I think the logic to throw away System RAM that might collide with
PMEM and soft-reserved memory within a section is on the order of the
same code complexity as the patch proposed here, no? Certainly the
throw-away concept itself is easier to grasp, but I don't think that
would be reflected in the code patch to achieve it... willing to be
proved wrong with a patch.
David Hildenbrand Jan. 7, 2021, 9:15 a.m. UTC | #8
[...]

>>> Well, I would love to have no surprises either. So far there was not
>>> actual argument why the pmem reserved space cannot be fully initialized.
>>
>> Yes, I'm still hoping Dan can clarify that.
> 
> Complexity and effective utility (once pfn_to_online_page() is fixed)
> are the roadblocks in my mind. The altmap is there to allow for PMEM
> capacity to be used as memmap space, so there would need to be code to
> break that circular dependency and allocate a memmap for the metadata
> space from DRAM and the rest of the memmap space for the data capacity
> from pmem itself. That memmap-for-pmem-metadata will still represent
> offline pages. So once pfn_to_online_page() is fixed, what pfn-walker
> is going to be doing pfn_to_page() on PMEM metadata? Secondly, there

Assume I do

pgmap = get_dev_pagemap(pfn, NULL);
if (pgmap)
	return pfn_to_page(pfn);
return NULL;

on a random pfn because I want to inspect ZONE_DEVICE PFNs.

IIUC, the memmap I get might usually be initialized, except we're
hitting a PFN in the reserved altmap space. Correct?

Do we expect this handling to not be valid - i.e., initialization to be
of no utility? (to make it fly we would have to explicitly check for
PFNs in the altmap reserved space, which wouldn't be too problematic)

> is a PMEM namespace mode called "raw" that eschews DAX and 'struct
> page' for pmem and just behaves like a memory-backed block device. The
> end result is still that pfn walkers need to discover if a PMEM pfn
> has a page, so I don't see what "sometimes there's an
> memmap-for-pmem-metadata" buys us?

Right, but that's as easy as doing a pfn_valid() first.
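
i.e., extending the snippet from above (sketch):

	if (!pfn_valid(pfn))
		return NULL;

	pgmap = get_dev_pagemap(pfn, NULL);
	if (pgmap)
		return pfn_to_page(pfn);
	return NULL;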


Let me try to express what I (and I think Michal) mean:

In pagemap_range(), we

1. move_pfn_range_to_zone()->memmap_init_zone(): initialize the memmap
of the PMEM device used as memmap itself ("self host", confusing). We
skip initializing the memmap for the reserved altmap space.

2. memmap_init_zone_device(): initialize the memmap of everything
outside of the altmap space.

IIUC, the memmap of the reserved altmap space remains uninitialized.
What speaks against just initializing that part in e.g., 2. or similarly
after 1.?


I'm planning on documenting the result of this discussion in the code,
so people don't have to scratch their head whenever stumbling over the
altmap reserved space.

> 
>>
>>> On the other hand making sure that pfn_to_online_page sounds like the
>>> right thing to do. And having an explicit check for zone device there in
>>> a slow path makes sense to me.
>>
>> As I said, I'd favor to simplify and just get rid of the special case,
>> instead of coming up with increasingly complex ways to deal with it.
>> pfn_to_online_page() used to be simple, essentially checking a single
>> flag was sufficient in most setups.
> 
> I think the logic to throw away System RAM that might collide with
> PMEM and soft-reserved memory within a section is on the order of the
> same code complexity as the patch proposed here, no? Certainly the
> throw-away concept itself is easier to grasp, but I don't think that
> would be reflected in the code patch to achieve it... willing to be
> proved wrong with a patch.

Well, at least it sounds easier to go over memblock holes and
align-up/down some relevant PFNs to section boundaries, ending up with
no affect to runtime performance later (e.g., pfn_to_online_page()). But
I agree that most probably the devil is in the detail - e.g., handling
different kind of holes (different flavors of "reserved") and syncing up
other data structures (e.g., resource tree).

I don't have time to look into that right now, but might look into it in
the future. For now I'm fine with this approach, assuming we don't
discover other corner cases that make it even more complex. I'm happy
that we are finally talking about it and fixing it!
Dan Williams Jan. 12, 2021, 9:15 a.m. UTC | #9
On Wed, Jan 6, 2021 at 1:55 AM Michal Hocko <mhocko@suse.com> wrote:
>


> On Tue 05-01-21 20:07:18, Dan Williams wrote:
> > While pfn_to_online_page() is able to determine pfn_valid() at
> > subsection granularity it is not able to reliably determine if a given
> > pfn is also online if the section is mixed with ZONE_DEVICE pfns.
>
> I would call out the problem more explicitly. E.g. something like
> "
> This means that pfn_to_online_page() can lead to false positives and
> return a struct page which is not fully initialized because it
> belongs to ZONE_DEVICE (PUT AN EXAMPLE OF SUCH AN UNINITIALIZED PAGE
> HERE). That can lead either to a crash on a PagePoisoned() assertion or
> to silently broken page state with debugging disabled.
> "

Apologies for the long pause in this conversation, I went off and
wrote a test to trigger this condition so I could quote it directly.
It turns out soft_offline_page(), even before fixing
pfn_to_online_page(), is broken as it leaks a page reference.

>
> I would also appreciate a more specific note about how the existing HW
> can trigger this. You have mentioned 64MB subsection examples in the
> other email. It would be great to mention it here as well.

Sure.

>
> > Update move_pfn_range_to_zone() to flag (SECTION_TAINT_ZONE_DEVICE) a
> > section that mixes ZONE_DEVICE pfns with other online pfns. With
> > SECTION_TAINT_ZONE_DEVICE to delineate, pfn_to_online_page() can fall
> > back to a slow-path check for ZONE_DEVICE pfns in an online section.
> >
> > With this implementation of pfn_to_online_page() pfn-walkers mostly only
> > need to check section metadata to determine pfn validity. In the
> > rare case of mixed-zone sections the pfn-walker will skip offline
> > ZONE_DEVICE pfns as expected.
>
> The above paragraph is slightly confusing. You do not require
> pfn-walkers to check anything right? Section metadata is something that
> is and should be hidden from them. They are asking for an online page
> and get it or NULL. Nothing more nothing less.
>

Yeah, I'll drop it. I was describing what service pfn_to_online_page()
performs for a pfn-walker, but it's awkwardly worded.

>
> > Other notes:
> >
> > Because the collision case is rare, and for simplicity, the
> > SECTION_TAINT_ZONE_DEVICE flag is never cleared once set.
>
> I do not see a problem with that.
>
> > pfn_to_online_page() was already borderline too large to be a macro /
> > inline function, but the additional logic certainly pushed it over that
> > threshold, and so it is moved to an out-of-line helper.
>
> Worth a separate patch.
>
> The approach is sensible. Thanks!
>
> I was worried that we do not have sufficient space for a new flag, but
> the comment explains we have 6 bits available. I haven't double-checked
> that for the current state of the code. The comment is quite recent and
> I do not remember any substantial changes in this area. Still, this is
> rather fragile, because an unexpected alignment would become a runtime
> failure, which is good to stop random corruptions but not ideal.
>
> Is there any way to explicitly test for this? E.g. enforce a shared
> section by qemu and then trigger a pfn walk?
>
> > Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Reported-by: Michal Hocko <mhocko@suse.com>
> > Reported-by: David Hildenbrand <david@redhat.com>
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>
> [...]
>
> > +static int zone_id(const struct zone *zone)
> > +{
> > +     struct pglist_data *pgdat = zone->zone_pgdat;
> > +
> > +     return zone - pgdat->node_zones;
> > +}
>
> We already have zone_idx()

Noted.

>
> > +
> > +static void section_taint_zone_device(struct zone *zone, unsigned long pfn)
> > +{
> > +     struct mem_section *ms = __nr_to_section(pfn_to_section_nr(pfn));
> > +
> > +     if (zone_id(zone) != ZONE_DEVICE)
> > +             return;
> > +
> > +     if (IS_ALIGNED(pfn, PAGES_PER_SECTION))
> > +             return;
> > +
> > +     ms->section_mem_map |= SECTION_TAINT_ZONE_DEVICE;
> > +}
> > +
> >  /*
> >   * Associate the pfn range with the given zone, initializing the memmaps
> >   * and resizing the pgdat/zone data to span the added pages. After this
> > @@ -707,6 +769,15 @@ void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
> >       resize_pgdat_range(pgdat, start_pfn, nr_pages);
> >       pgdat_resize_unlock(pgdat, &flags);
> >
> > +     /*
> > +      * Subsection population requires care in pfn_to_online_page().
> > +      * Set the taint to enable the slow path detection of
> > +      * ZONE_DEVICE pages in an otherwise  ZONE_{NORMAL,MOVABLE}
> > +      * section.
> > +      */
> > +     section_taint_zone_device(zone, start_pfn);
> > +     section_taint_zone_device(zone, start_pfn + nr_pages);
>
> I think it would be better to add the checks here and only set the flag
> in the called function. SECTION_TAINT_ZONE_DEVICE should go to where we
> define helpers for the other flags.
>

Done.
Dan Williams Jan. 12, 2021, 9:18 a.m. UTC | #10
On Thu, Jan 7, 2021 at 1:16 AM David Hildenbrand <david@redhat.com> wrote:
>
> [...]
>
> >>> Well, I would love to have no surprises either. So far there was not
> >>> actual argument why the pmem reserved space cannot be fully initialized.
> >>
> >> Yes, I'm still hoping Dan can clarify that.
> >
> > Complexity and effective utility (once pfn_to_online_page() is fixed)
> > are the roadblocks in my mind. The altmap is there to allow for PMEM
> > capacity to be used as memmap space, so there would need to be code to
> > break that circular dependency and allocate a memmap for the metadata
> > space from DRAM and the rest of the memmap space for the data capacity
> > from pmem itself. That memmap-for-pmem-metadata will still represent
> > offline pages. So once pfn_to_online_page() is fixed, what pfn-walker
> > is going to be doing pfn_to_page() on PMEM metadata? Secondly, there
>
> Assume I do
>
> pgmap = get_dev_pagemap(pfn, NULL);
> if (pgmap)
>         return pfn_to_page(pfn);
> return NULL;
>
> on a random pfn because I want to inspect ZONE_DEVICE PFNs.

I keep getting hung up on the motivation to do random pfn inspection?

The problems we have found to date have required different solutions.
The KVM bug didn't use get_dev_pagemap() to inspect the pfn because it
could rely on the fact that the page already had an elevated reference
count. The get_user_pages() path only looks up ZONE_DEVICE pfns when
it sees {pte,pmd,pud}_devmap set in the page table entry. pfn walkers
have been a problem, but with pfn_to_online_page() fixed what is the
remaining motivation to inspect ZONE_DEVICE pfns?

> IIUC, the memmap I get might usually be initialized, except we're
> hitting a PFN in the reserved altmap space. Correct?

get_dev_pagemap() currently succeeds for every pfn in the range,
including those in the altmap.



>
> Do we expect this handling to not be valid - i.e., initialization to be
> of no utility? (to make it fly we would have to explicitly check for
> PFNs in the altmap reserved space, which wouldn't be too problematic)
>
> > is a PMEM namespace mode called "raw" that eschews DAX and 'struct
> > page' for pmem and just behaves like a memory-backed block device. The
> > end result is still that pfn walkers need to discover if a PMEM pfn
> > has a page, so I don't see what "sometimes there's an
> > memmap-for-pmem-metadata" buys us?
>
> Right, but that's as easy as doing a pfn_valid() first.
>
>
> Let me try to express what I (and I think Michal) mean:
>
> In pagemap_range(), we
>
> 1. move_pfn_range_to_zone()->memmap_init_zone(): initialize the memmap
> of the PMEM device used as memmap itself ("self host", confusing). We
> skip initializing the memmap for the reserved altmap space.
>
> 2. memmap_init_zone_device(): initialize the memmap of everything
> outside of the altmap space.
>
> IIUC, the memmap of the reserved altmap space remains uninitialized.
> What speaks against just initializing that part in e.g., 2. or similarly
> after 1.?
>
>
> I'm planning on documenting the result of this discussion in the code,
> so people don't have to scratch their head whenever stumbling over the
> altmap reserved space.
>
> >
> >>
> >>> On the other hand making sure that pfn_to_online_page sounds like the
> >>> right thing to do. And having an explicit check for zone device there in
> >>> a slow path makes sense to me.
> >>
> >> As I said, I'd favor to simplify and just get rid of the special case,
> >> instead of coming up with increasingly complex ways to deal with it.
> >> pfn_to_online_page() used to be simple, essentially checking a single
> >> flag was sufficient in most setups.
> >
> > I think the logic to throw away System RAM that might collide with
> > PMEM and soft-reserved memory within a section is on the order of the
> > same code complexity as the patch proposed here, no? Certainly the
> > throw-away concept itself is easier to grasp, but I don't think that
> > would be reflected in the code patch to achieve it... willing to be
> > proved wrong with a patch.
>
> Well, at least it sounds easier to go over memblock holes and
> align-up/down some relevant PFNs to section boundaries, ending up with
> no effect on runtime performance later (e.g., pfn_to_online_page()). But
> I agree that most probably the devil is in the details - e.g., handling
> different kinds of holes (different flavors of "reserved") and syncing up
> other data structures (e.g., resource tree).
>
> I don't have time to look into that right now, but might look into it in
> the future. For now I'm fine with this approach, assuming we don't
> discover other corner cases that make it even more complex. I'm happy
> that we are finally talking about it and fixing it!
>
>
> --
> Thanks,
>
> David / dhildenb
>
David Hildenbrand Jan. 12, 2021, 9:44 a.m. UTC | #11
On 12.01.21 10:18, Dan Williams wrote:
> On Thu, Jan 7, 2021 at 1:16 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> [...]
>>
>>>>> Well, I would love to have no surprises either. So far there was not
>>>>> actual argument why the pmem reserved space cannot be fully initialized.
>>>>
>>>> Yes, I'm still hoping Dan can clarify that.
>>>
>>> Complexity and effective utility (once pfn_to_online_page() is fixed)
>>> are the roadblocks in my mind. The altmap is there to allow for PMEM
>>> capacity to be used as memmap space, so there would need to be code to
>>> break that circular dependency and allocate a memmap for the metadata
>>> space from DRAM and the rest of the memmap space for the data capacity
>>> from pmem itself. That memmap-for-pmem-metadata will still represent
>>> offline pages. So once pfn_to_online_page() is fixed, what pfn-walker
>>> is going to be doing pfn_to_page() on PMEM metadata? Secondly, there
>>
>> Assume I do
>>
>> pgmap = get_dev_pagemap(pfn, NULL);
>> if (pgmap)
>>         return pfn_to_page(pfn);
>> return NULL;
>>
>> on a random pfn because I want to inspect ZONE_DEVICE PFNs.
> 
> I keep getting hung up on the motivation to do random pfn inspection?
> 
> The problems we have found to date have required different solutions.
> The KVM bug didn't use get_dev_pagemap() to inspect the pfn because it
> could rely on the fact that the page already had an elevated reference
> count. The get_user_pages() path only looks up ZONE_DEVICE pfns when
> it see {pte,pmd,pud}_devmap set in the page table entry. pfn walkers
> have been a problem, but with pfn_to_online_page() fixed what is the
> remaining motivation to inspect ZONE_DEVICE pfns?

1) Let's assume we want to implement zone shrinking
(remove_pfn_range_from_zone()->shrink_zone_span()) for ZONE_DEVICE at
some point.

A simple approach would be going via get_dev_pagemap(pfn,
NULL)->pfn_to_page(pfn), checking for the zone.

If that's not possible, then extending dev_pagemap (e.g., indicating the
nid) might also work (unless there is another way to get the nid).


2) Let's take a look at mm/memory-failure.c:memory_failure_dev_pagemap()

IIUC, we might end up doing pfn_to_page(pfn) on a pfn in the reserved
altmap space, so one with an uninitialized memmap.

E.g., in dax_lock_page() we access page->mapping, which might just be
garbage. dax_mapping() will de-reference garbage.

Most probably I am missing something here.



Question is: what are the expectations regarding the memmap if
get_dev_pagemap() succeeded.

I'm fine documenting that "get_dev_pagemap() does not guarantee that the
"struct page" returned by pfn_to_page() was initialized and can safely
be used. E.g., it might be a pfn in the reserved altmap space, for which
the memmap is never initialized. Accessing it might be dangerous.".

Then, there has to be a check at relevant places (e.g.,
memory_failure_dev_pagemap()), checking somehow if the memmap content
can actually be used.
Dan Williams Jan. 12, 2021, 8:52 p.m. UTC | #12
On Tue, Jan 12, 2021 at 1:44 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 12.01.21 10:18, Dan Williams wrote:
> > On Thu, Jan 7, 2021 at 1:16 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> [...]
> >>
> >>>>> Well, I would love to have no surprises either. So far there was not
> >>>>> actual argument why the pmem reserved space cannot be fully initialized.
> >>>>
> >>>> Yes, I'm still hoping Dan can clarify that.
> >>>
> >>> Complexity and effective utility (once pfn_to_online_page() is fixed)
> >>> are the roadblocks in my mind. The altmap is there to allow for PMEM
> >>> capacity to be used as memmap space, so there would need to be code to
> >>> break that circular dependency and allocate a memmap for the metadata
> >>> space from DRAM and the rest of the memmap space for the data capacity
> >>> from pmem itself. That memmap-for-pmem-metadata will still represent
> >>> offline pages. So once pfn_to_online_page() is fixed, what pfn-walker
> >>> is going to be doing pfn_to_page() on PMEM metadata? Secondly, there
> >>
> >> Assume I do
> >>
> >> pgmap = get_dev_pagemap(pfn, NULL);
> >> if (pgmap)
> >>         return pfn_to_page(pfn);
> >> return NULL;
> >>
> >> on a random pfn because I want to inspect ZONE_DEVICE PFNs.
> >
> > I keep getting hung up on the motivation to do random pfn inspection?
> >
> > The problems we have found to date have required different solutions.
> > The KVM bug didn't use get_dev_pagemap() to inspect the pfn because it
> > could rely on the fact that the page already had an elevated reference
> > count. The get_user_pages() path only looks up ZONE_DEVICE pfns when
> > it sees {pte,pmd,pud}_devmap set in the page table entry. pfn walkers
> > have been a problem, but with pfn_to_online_page() fixed what is the
> > remaining motivation to inspect ZONE_DEVICE pfns?
>
> 1) Let's assume we want to implement zone shrinking
> (remove_pfn_range_from_zone()->shrink_zone_span()) for ZONE_DEVICE at
> some point.

I don't expect that will ever be something the kernel will want to do
given the association of pgmap to the lifetime of a given device
configuration. The mechanism to mutate a ZONE_DEVICE mapping is unbind
device, reconfigure device, bind device to establish a new ZONE_DEVICE
mapping.

>
> A simple approach would be going via get_dev_pagemap(pfn,
> NULL)->pfn_to_page(pfn), checking for the zone.
>
> If that's not possible, then extending dev_pagemap (e.g., indicating the
> nid) might also work (unless there is another way to get the nid).
>
>
> 2) Let's take a look at mm/memory-failure.c:memory_failure_dev_pagemap()
>
> IIUC, we might end up doing pfn_to_page(pfn) on a pfn in the reserved
> altmap space, so one with an uninitialized memmap.
>
> E.g., in dax_lock_page() we access page->mapping, which might just be
> garbage. dax_mapping() will de-reference garbage.
>
> Most probably I am missing something here.

No you're not, this is a real issue because get_dev_pagemap() is valid
for the metadata space. I need to add a patch to validate
get_dev_pagemap() vs the pfns that are data vs metadata.


>
>
>
> Question is: what are the expectations regarding the memmap if
> get_dev_pagemap() succeeded.
>
> I'm fine documenting that "get_dev_pagemap() does not guarantee that the
> "struct page" returned by pfn_to_page() was initialized and can safely
> be used. E.g., it might be a pfn in the reserved altmap space, for which
> the memmap is never initialized. Accessing it might be dangerous.".
>
> Then, there has to be a check at relevant places (e.g.,
> memory_failure_dev_pagemap()), checking somehow if the memmap content
> can actually be used.

Ok, let me audit and fix that up.

Thanks David.

Patch

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 15acce5ab106..3d99de0db2dd 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -16,22 +16,7 @@  struct resource;
 struct vmem_altmap;
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-/*
- * Return page for the valid pfn only if the page is online. All pfn
- * walkers which rely on the fully initialized page->flags and others
- * should use this rather than pfn_valid && pfn_to_page
- */
-#define pfn_to_online_page(pfn)					   \
-({								   \
-	struct page *___page = NULL;				   \
-	unsigned long ___pfn = pfn;				   \
-	unsigned long ___nr = pfn_to_section_nr(___pfn);	   \
-								   \
-	if (___nr < NR_MEM_SECTIONS && online_section_nr(___nr) && \
-	    pfn_valid_within(___pfn))				   \
-		___page = pfn_to_page(___pfn);			   \
-	___page;						   \
-})
+struct page *pfn_to_online_page(unsigned long pfn);
 
 /*
  * Types for free bootmem stored in page->lru.next. These have to be in
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b593316bff3d..0b5c44f730b4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1273,13 +1273,14 @@  extern size_t mem_section_usage_size(void);
  *      which results in PFN_SECTION_SHIFT equal 6.
  * To sum it up, at least 6 bits are available.
  */
-#define	SECTION_MARKED_PRESENT	(1UL<<0)
-#define SECTION_HAS_MEM_MAP	(1UL<<1)
-#define SECTION_IS_ONLINE	(1UL<<2)
-#define SECTION_IS_EARLY	(1UL<<3)
-#define SECTION_MAP_LAST_BIT	(1UL<<4)
-#define SECTION_MAP_MASK	(~(SECTION_MAP_LAST_BIT-1))
-#define SECTION_NID_SHIFT	3
+#define SECTION_MARKED_PRESENT		(1UL<<0)
+#define SECTION_HAS_MEM_MAP		(1UL<<1)
+#define SECTION_IS_ONLINE		(1UL<<2)
+#define SECTION_IS_EARLY		(1UL<<3)
+#define SECTION_TAINT_ZONE_DEVICE	(1UL<<4)
+#define SECTION_MAP_LAST_BIT		(1UL<<5)
+#define SECTION_MAP_MASK		(~(SECTION_MAP_LAST_BIT-1))
+#define SECTION_NID_SHIFT		3
 
 static inline struct page *__section_mem_map_addr(struct mem_section *section)
 {
@@ -1318,6 +1319,13 @@  static inline int online_section(struct mem_section *section)
 	return (section && (section->section_mem_map & SECTION_IS_ONLINE));
 }
 
+static inline int online_device_section(struct mem_section *section)
+{
+	unsigned long flags = SECTION_IS_ONLINE | SECTION_TAINT_ZONE_DEVICE;
+
+	return section && ((section->section_mem_map & flags) == flags);
+}
+
 static inline int online_section_nr(unsigned long nr)
 {
 	return online_section(__nr_to_section(nr));
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index f9d57b9be8c7..9f36968e6188 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -300,6 +300,47 @@  static int check_hotplug_memory_addressable(unsigned long pfn,
 	return 0;
 }
 
+/*
+ * Return page for the valid pfn only if the page is online. All pfn
+ * walkers which rely on the fully initialized page->flags and others
+ * should use this rather than pfn_valid && pfn_to_page
+ */
+struct page *pfn_to_online_page(unsigned long pfn)
+{
+	unsigned long nr = pfn_to_section_nr(pfn);
+	struct dev_pagemap *pgmap;
+	struct mem_section *ms;
+
+	if (nr >= NR_MEM_SECTIONS)
+		return NULL;
+
+	ms = __nr_to_section(nr);
+
+	if (!online_section(ms))
+		return NULL;
+
+	if (!pfn_valid_within(pfn))
+		return NULL;
+
+	if (!online_device_section(ms))
+		return pfn_to_page(pfn);
+
+	/*
+	 * Slowpath: when ZONE_DEVICE collides with
+	 * ZONE_{NORMAL,MOVABLE} within the same section some pfns in
+	 * the section may be 'offline' but 'valid'. Only
+	 * get_dev_pagemap() can determine sub-section online status.
+	 */
+	pgmap = get_dev_pagemap(pfn, NULL);
+	put_dev_pagemap(pgmap);
+
+	/* The presence of a pgmap indicates ZONE_DEVICE offline pfn */
+	if (pgmap)
+		return NULL;
+	return pfn_to_page(pfn);
+}
+EXPORT_SYMBOL_GPL(pfn_to_online_page);
+
 /*
  * Reasonably generic function for adding memory.  It is
  * expected that archs that support memory hotplug will
@@ -678,6 +719,27 @@  static void __meminit resize_pgdat_range(struct pglist_data *pgdat, unsigned lon
 	pgdat->node_spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - pgdat->node_start_pfn;
 
 }
+
+static int zone_id(const struct zone *zone)
+{
+	struct pglist_data *pgdat = zone->zone_pgdat;
+
+	return zone - pgdat->node_zones;
+}
+
+static void section_taint_zone_device(struct zone *zone, unsigned long pfn)
+{
+	struct mem_section *ms = __nr_to_section(pfn_to_section_nr(pfn));
+
+	if (zone_id(zone) != ZONE_DEVICE)
+		return;
+
+	if (IS_ALIGNED(pfn, PAGES_PER_SECTION))
+		return;
+
+	ms->section_mem_map |= SECTION_TAINT_ZONE_DEVICE;
+}
+
 /*
  * Associate the pfn range with the given zone, initializing the memmaps
  * and resizing the pgdat/zone data to span the added pages. After this
@@ -707,6 +769,15 @@  void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
 	resize_pgdat_range(pgdat, start_pfn, nr_pages);
 	pgdat_resize_unlock(pgdat, &flags);
 
+	/*
+	 * Subsection population requires care in pfn_to_online_page().
+	 * Set the taint to enable the slow path detection of
+	 * ZONE_DEVICE pages in an otherwise  ZONE_{NORMAL,MOVABLE}
+	 * section.
+	 */
+	section_taint_zone_device(zone, start_pfn);
+	section_taint_zone_device(zone, start_pfn + nr_pages);
+
 	/*
 	 * TODO now we have a visible range of pages which are not associated
 	 * with their zone properly. Not nice but set_pfnblock_flags_mask