
[v1] mm: initialize struct pages when adding a section

Message ID 20180727165454.27292-1-david@redhat.com (mailing list archive)
State New, archived
Series [v1] mm: initialize struct pages when adding a section

Commit Message

David Hildenbrand July 27, 2018, 4:54 p.m. UTC
Right now, struct pages are initialized when memory is onlined, not
when it is added (since commit d0dc12e86b31 ("mm/memory_hotplug: optimize
memory hotplug")).

remove_memory() will call arch_remove_memory(). Here, we usually access
the struct page to get the zone of the pages.

So effectively, we access stale struct pages in case we remove memory that
was never onlined. So let's simply initialize them earlier, when the
memory is added. We only have to take care of updating the zone once we
know it. We can use a dummy zone for that purpose.

So effectively, all pages will already be initialized and set to
reserved after memory was added but before it was onlined (and even the
memblock is added). We only initialize pages once, to not degrade
performance.

This will also mean that user space dump tools will always see sane
struct pages once a memblock pops up.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oscar Salvador <osalvador@techadventures.net>
Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/base/node.c    |  1 -
 include/linux/memory.h |  1 -
 include/linux/mm.h     | 10 ++++++++++
 mm/memory_hotplug.c    | 27 +++++++++++++++++++--------
 mm/page_alloc.c        | 23 +++++++++++------------
 5 files changed, 40 insertions(+), 22 deletions(-)

Comments

Pavel Tatashin July 27, 2018, 5:25 p.m. UTC | #1
Hi David,

On Fri, Jul 27, 2018 at 12:55 PM David Hildenbrand <david@redhat.com> wrote:
>
> Right now, struct pages are initialized when memory is onlined, not
> when it is added (since commit d0dc12e86b31 ("mm/memory_hotplug: optimize
> memory hotplug")).
>
> remove_memory() will call arch_remove_memory(). Here, we usually access
> the struct page to get the zone of the pages.
>
> So effectively, we access stale struct pages in case we remove memory that
> was never onlined.

Yeah, this is a bug, thank you for catching it.

> So let's simply initialize them earlier, when the
> memory is added. We only have to take care of updating the zone once we
> know it. We can use a dummy zone for that purpose.
>
> So effectively, all pages will already be initialized and set to
> reserved after memory was added but before it was onlined (and even the
> memblock is added). We only initialize pages once, to not degrade
> performance.

Yes, but we still add one more npages loop, so there will be some
performance degradation, but not severe.

There are many conflicts with linux-next, please sync before sending
out next patch.

> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1162,7 +1162,15 @@ static inline void set_page_address(struct page *page, void *address)
>  {
>         page->virtual = address;
>  }
> +static void set_page_virtual(struct page *page, enum zone_type zone)
> +{
> +       /* The shift won't overflow because ZONE_NORMAL is below 4G. */
> +       if (!is_highmem_idx(zone))
> +               set_page_address(page, __va(pfn << PAGE_SHIFT));
> +}
>  #define page_address_init()  do { } while(0)
> +#else
> +#define set_page_virtual(page, zone)  do { } while(0)
>  #endif

Please use inline functions for both the WANT_PAGE_VIRTUAL case and the else case.

>  #if defined(HASHED_PAGE_VIRTUAL)
> @@ -2116,6 +2124,8 @@ extern unsigned long find_min_pfn_with_active_regions(void);
>  extern void free_bootmem_with_active_regions(int nid,
>                                                 unsigned long max_low_pfn);
>  extern void sparse_memory_present_with_active_regions(int nid);
> +extern void __meminit init_single_page(struct page *page, unsigned long pfn,
> +                                      unsigned long zone, int nid);

I do not like making init_single_page() public. There is less chance
it is going to be inlined. I think a better way is to have a new
variant of memmap_init_zone() that will handle the hotplug case.

>
>  #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 7deb49f69e27..3f28ca3c3a33 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -250,6 +250,7 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
>                 struct vmem_altmap *altmap, bool want_memblock)
>  {
>         int ret;
> +       int i;
>
>         if (pfn_valid(phys_start_pfn))
>                 return -EEXIST;
> @@ -258,6 +259,23 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
>         if (ret < 0)
>                 return ret;
>
> +       /*
> +        * Initialize all pages in the section before fully exposing them to the
> +        * system so nobody will stumble over a half-initialized state.
> +        */
> +       for (i = 0; i < PAGES_PER_SECTION; i++) {
> +               unsigned long pfn = phys_start_pfn + i;
> +               struct page *page;
> +
> +               if (!pfn_valid(pfn))
> +                       continue;
> +               page = pfn_to_page(pfn);
> +
> +               /* dummy zone, the actual one will be set when onlining pages */
> +               init_single_page(page, pfn, ZONE_NORMAL, nid);
> +               SetPageReserved(page);
> +       }

Please move all of the above into a new memmap_init_hotplug() that
should be located in page_alloc.c


> @@ -5519,9 +5515,12 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>
>  not_early:
>                 page = pfn_to_page(pfn);
> -               __init_single_page(page, pfn, zone, nid);
> -               if (context == MEMMAP_HOTPLUG)
> -                       SetPageReserved(page);
> +               if (context == MEMMAP_HOTPLUG) {
> +                       /* everything but the zone was initialized */
> +                       set_page_zone(page, zone);
> +                       set_page_virtual(page, zone);
> +               } else
> +                       init_single_page(page, pfn, zone, nid);
>

Please add a new function:
memmap_init_zone_hotplug() that will handle only the zone and virtual
fields for onlined hotplug memory.

Please remove: "enum memmap_context context" from everywhere.

Thank you,
Pavel
David Hildenbrand July 27, 2018, 6:01 p.m. UTC | #2
On 27.07.2018 19:25, Pavel Tatashin wrote:
> Hi David,
> 
> On Fri, Jul 27, 2018 at 12:55 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> Right now, struct pages are initialized when memory is onlined, not
>> when it is added (since commit d0dc12e86b31 ("mm/memory_hotplug: optimize
>> memory hotplug")).
>>
>> remove_memory() will call arch_remove_memory(). Here, we usually access
>> the struct page to get the zone of the pages.
>>
>> So effectively, we access stale struct pages in case we remove memory that
>> was never onlined.
> 
> Yeah, this is a bug, thank you for catching it.
> 
>> So let's simply initialize them earlier, when the
>> memory is added. We only have to take care of updating the zone once we
>> know it. We can use a dummy zone for that purpose.
>>
>> So effectively, all pages will already be initialized and set to
>> reserved after memory was added but before it was onlined (and even the
>> memblock is added). We only initialize pages once, to not degrade
>> performance.
> 
> Yes, but we still add one more npages loop, so there will be some
> performance degradation, but not severe.
> 
> There are many conflicts with linux-next, please sync before sending
> out next patch.

Indeed, although I rebased, I was working on a branch based on Linus'
tree ...

[...]

>>  not_early:
>>                 page = pfn_to_page(pfn);
>> -               __init_single_page(page, pfn, zone, nid);
>> -               if (context == MEMMAP_HOTPLUG)
>> -                       SetPageReserved(page);
>> +               if (context == MEMMAP_HOTPLUG) {
>> +                       /* everything but the zone was initialized */
>> +                       set_page_zone(page, zone);
>> +                       set_page_virtual(page, zone);
>> +               } else
>> +                       init_single_page(page, pfn, zone, nid);
>>
> 
> Please add a new function:
> memmap_init_zone_hotplug() that will handle only the zone and virtual
> fields for onlined hotplug memory.
> 
> Please remove: "enum memmap_context context" from everywhere.

All your comments make sense. Will look into the details next week and
send a new version.

Thanks and enjoy your weekend!
Michal Hocko July 30, 2018, 11:30 a.m. UTC | #3
On Fri 27-07-18 18:54:54, David Hildenbrand wrote:
> Right now, struct pages are initialized when memory is onlined, not
> when it is added (since commit d0dc12e86b31 ("mm/memory_hotplug: optimize
> memory hotplug")).
> 
> remove_memory() will call arch_remove_memory(). Here, we usually access
> the struct page to get the zone of the pages.
> 
> So effectively, we access stale struct pages in case we remove memory that
> was never onlined. So let's simply initialize them earlier, when the
> memory is added. We only have to take care of updating the zone once we
> know it. We can use a dummy zone for that purpose.

I have considered something like this when I was reworking memory
hotplug to not associate struct pages with zone before onlining and I
considered this to be rather fragile. I would really not like to get
back to that again if possible.

> So effectively, all pages will already be initialized and set to
> reserved after memory was added but before it was onlined (and even the
> memblock is added). We only initialize pages once, to not degrade
> performance.

To be honest, I would rather see d0dc12e86b31 reverted. It is late in
the release cycle and if the patch is buggy then it should be reverted
rather than worked around. I found the optimization not really
convincing back then and this is still the case TBH.
David Hildenbrand July 30, 2018, 11:53 a.m. UTC | #4
On 30.07.2018 13:30, Michal Hocko wrote:
> On Fri 27-07-18 18:54:54, David Hildenbrand wrote:
>> Right now, struct pages are initialized when memory is onlined, not
>> when it is added (since commit d0dc12e86b31 ("mm/memory_hotplug: optimize
>> memory hotplug")).
>>
>> remove_memory() will call arch_remove_memory(). Here, we usually access
>> the struct page to get the zone of the pages.
>>
>> So effectively, we access stale struct pages in case we remove memory that
>> was never onlined. So let's simply initialize them earlier, when the
>> memory is added. We only have to take care of updating the zone once we
>> know it. We can use a dummy zone for that purpose.
> 
> I have considered something like this when I was reworking memory
> hotplug to not associate struct pages with zone before onlining and I
> considered this to be rather fragile. I would really not like to get
> back to that again if possible.
> 
>> So effectively, all pages will already be initialized and set to
>> reserved after memory was added but before it was onlined (and even the
>> memblock is added). We only initialize pages once, to not degrade
>> performance.
> 
> To be honest, I would rather see d0dc12e86b31 reverted. It is late in
> the release cycle and if the patch is buggy then it should be reverted
> rather than worked around. I found the optimization not really
> convincing back then and this is still the case TBH.
> 

If I am not wrong, that's already broken in 4.17, no? What about that?

If we don't care about that, then I agree to reverting said commit for
v4.18.
Michal Hocko July 30, 2018, 12:05 p.m. UTC | #5
On Mon 30-07-18 13:53:06, David Hildenbrand wrote:
> On 30.07.2018 13:30, Michal Hocko wrote:
> > On Fri 27-07-18 18:54:54, David Hildenbrand wrote:
> >> Right now, struct pages are initialized when memory is onlined, not
> >> when it is added (since commit d0dc12e86b31 ("mm/memory_hotplug: optimize
> >> memory hotplug")).
> >>
> >> remove_memory() will call arch_remove_memory(). Here, we usually access
> >> the struct page to get the zone of the pages.
> >>
> >> So effectively, we access stale struct pages in case we remove memory that
> >> was never onlined. So let's simply initialize them earlier, when the
> >> memory is added. We only have to take care of updating the zone once we
> >> know it. We can use a dummy zone for that purpose.
> > 
> > I have considered something like this when I was reworking memory
> > hotplug to not associate struct pages with zone before onlining and I
> > considered this to be rather fragile. I would really not like to get
> > back to that again if possible.
> > 
> >> So effectively, all pages will already be initialized and set to
> >> reserved after memory was added but before it was onlined (and even the
> >> memblock is added). We only initialize pages once, to not degrade
> >> performance.
> > 
> > To be honest, I would rather see d0dc12e86b31 reverted. It is late in
> > the release cycle and if the patch is buggy then it should be reverted
> > rather than worked around. I found the optimization not really
> > convincing back then and this is still the case TBH.
> > 
> 
> If I am not wrong, that's already broken in 4.17, no? What about that?

Ohh, I thought this was merged in 4.18.
$ git describe --contains d0dc12e86b31 --match="v*"
v4.17-rc1~99^2~44

proves me wrong. This means that the fix is not so urgent as I thought.
If you can figure out a reasonable fix then it should be preferable to
the revert.

Fake zone sounds too hackish to me though.
David Hildenbrand July 30, 2018, 12:11 p.m. UTC | #6
On 30.07.2018 14:05, Michal Hocko wrote:
> On Mon 30-07-18 13:53:06, David Hildenbrand wrote:
>> On 30.07.2018 13:30, Michal Hocko wrote:
>>> On Fri 27-07-18 18:54:54, David Hildenbrand wrote:
>>>> Right now, struct pages are initialized when memory is onlined, not
>>>> when it is added (since commit d0dc12e86b31 ("mm/memory_hotplug: optimize
>>>> memory hotplug")).
>>>>
>>>> remove_memory() will call arch_remove_memory(). Here, we usually access
>>>> the struct page to get the zone of the pages.
>>>>
>>>> So effectively, we access stale struct pages in case we remove memory that
>>>> was never onlined. So let's simply initialize them earlier, when the
>>>> memory is added. We only have to take care of updating the zone once we
>>>> know it. We can use a dummy zone for that purpose.
>>>
>>> I have considered something like this when I was reworking memory
>>> hotplug to not associate struct pages with zone before onlining and I
>>> considered this to be rather fragile. I would really not like to get
>>> back to that again if possible.
>>>
>>>> So effectively, all pages will already be initialized and set to
>>>> reserved after memory was added but before it was onlined (and even the
>>>> memblock is added). We only initialize pages once, to not degrade
>>>> performance.
>>>
>>> To be honest, I would rather see d0dc12e86b31 reverted. It is late in
>>> the release cycle and if the patch is buggy then it should be reverted
>>> rather than worked around. I found the optimization not really
>>> convincing back then and this is still the case TBH.
>>>
>>
>> If I am not wrong, that's already broken in 4.17, no? What about that?
> 
> Ohh, I thought this was merged in 4.18.
> $ git describe --contains d0dc12e86b31 --match="v*"
> v4.17-rc1~99^2~44
> 
> proves me wrong. This means that the fix is not so urgent as I thought.
> If you can figure out a reasonable fix then it should be preferable to
> the revert.
> 
> Fake zone sounds too hackish to me though.
> 

If I am not wrong, that's the same we had before d0dc12e86b31 but now it
is explicit and only one single value for all kernel configs
("ZONE_NORMAL").

Before d0dc12e86b31, struct pages were initialized to 0. So it was
(depending on the config) ZONE_DMA, ZONE_DMA32 or ZONE_NORMAL.

Now the value is random and might not even be a valid zone.
Pavel Tatashin July 30, 2018, 1:30 p.m. UTC | #7
On Mon, Jul 30, 2018 at 8:11 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 30.07.2018 14:05, Michal Hocko wrote:
> > On Mon 30-07-18 13:53:06, David Hildenbrand wrote:
> >> On 30.07.2018 13:30, Michal Hocko wrote:
> >>> On Fri 27-07-18 18:54:54, David Hildenbrand wrote:
> >>>> Right now, struct pages are initialized when memory is onlined, not
> >>>> when it is added (since commit d0dc12e86b31 ("mm/memory_hotplug: optimize
> >>>> memory hotplug")).
> >>>>
> >>>> remove_memory() will call arch_remove_memory(). Here, we usually access
> >>>> the struct page to get the zone of the pages.
> >>>>
> >>>> So effectively, we access stale struct pages in case we remove memory that
> >>>> was never onlined. So let's simply initialize them earlier, when the
> >>>> memory is added. We only have to take care of updating the zone once we
> >>>> know it. We can use a dummy zone for that purpose.
> >>>
> >>> I have considered something like this when I was reworking memory
> >>> hotplug to not associate struct pages with zone before onlining and I
> >>> considered this to be rather fragile. I would really not like to get
> >>> back to that again if possible.
> >>>
> >>>> So effectively, all pages will already be initialized and set to
> >>>> reserved after memory was added but before it was onlined (and even the
> >>>> memblock is added). We only initialize pages once, to not degrade
> >>>> performance.
> >>>
> >>> To be honest, I would rather see d0dc12e86b31 reverted. It is late in
> >>> the release cycle and if the patch is buggy then it should be reverted
> >>> rather than worked around. I found the optimization not really
> >>> convincing back then and this is still the case TBH.
> >>>
> >>
> >> If I am not wrong, that's already broken in 4.17, no? What about that?
> >
> > Ohh, I thought this was merged in 4.18.
> > $ git describe --contains d0dc12e86b31 --match="v*"
> > v4.17-rc1~99^2~44
> >
> > proves me wrong. This means that the fix is not so urgent as I thought.
> > If you can figure out a reasonable fix then it should be preferable to
> > the revert.
> >
> > Fake zone sounds too hackish to me though.
> >
>
> If I am not wrong, that's the same we had before d0dc12e86b31 but now it
> is explicit and only one single value for all kernel configs
> ("ZONE_NORMAL").
>
> Before d0dc12e86b31, struct pages were initialized to 0. So it was
> (depending on the config) ZONE_DMA, ZONE_DMA32 or ZONE_NORMAL.
>
> Now the value is random and might not even be a valid zone.

Hi David,

Have you figured out why we access struct pages during hot-unplug for
offlined memory? Also, a panic trace would be useful in the patch.

As I understand it, the bug may occur only when hotremove is enabled, and
default onlining of added memory is disabled. Is this correct? I
suspect the reason we have not heard about this bug is that it is rare
to add memory and not to online it.

Thank you,
Pavel
David Hildenbrand July 30, 2018, 1:51 p.m. UTC | #8
On 30.07.2018 15:30, Pavel Tatashin wrote:
> On Mon, Jul 30, 2018 at 8:11 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 30.07.2018 14:05, Michal Hocko wrote:
>>> On Mon 30-07-18 13:53:06, David Hildenbrand wrote:
>>>> On 30.07.2018 13:30, Michal Hocko wrote:
>>>>> On Fri 27-07-18 18:54:54, David Hildenbrand wrote:
>>>>>> Right now, struct pages are initialized when memory is onlined, not
>>>>>> when it is added (since commit d0dc12e86b31 ("mm/memory_hotplug: optimize
>>>>>> memory hotplug")).
>>>>>>
>>>>>> remove_memory() will call arch_remove_memory(). Here, we usually access
>>>>>> the struct page to get the zone of the pages.
>>>>>>
>>>>>> So effectively, we access stale struct pages in case we remove memory that
>>>>>> was never onlined. So let's simply initialize them earlier, when the
>>>>>> memory is added. We only have to take care of updating the zone once we
>>>>>> know it. We can use a dummy zone for that purpose.
>>>>>
>>>>> I have considered something like this when I was reworking memory
>>>>> hotplug to not associate struct pages with zone before onlining and I
>>>>> considered this to be rather fragile. I would really not like to get
>>>>> back to that again if possible.
>>>>>
>>>>>> So effectively, all pages will already be initialized and set to
>>>>>> reserved after memory was added but before it was onlined (and even the
>>>>>> memblock is added). We only initialize pages once, to not degrade
>>>>>> performance.
>>>>>
>>>>> To be honest, I would rather see d0dc12e86b31 reverted. It is late in
>>>>> the release cycle and if the patch is buggy then it should be reverted
>>>>> rather than worked around. I found the optimization not really
>>>>> convincing back then and this is still the case TBH.
>>>>>
>>>>
>>>> If I am not wrong, that's already broken in 4.17, no? What about that?
>>>
>>> Ohh, I thought this was merged in 4.18.
>>> $ git describe --contains d0dc12e86b31 --match="v*"
>>> v4.17-rc1~99^2~44
>>>
>>> proves me wrong. This means that the fix is not so urgent as I thought.
>>> If you can figure out a reasonable fix then it should be preferable to
>>> the revert.
>>>
>>> Fake zone sounds too hackish to me though.
>>>
>>
>> If I am not wrong, that's the same we had before d0dc12e86b31 but now it
>> is explicit and only one single value for all kernel configs
>> ("ZONE_NORMAL").
>>
>> Before d0dc12e86b31, struct pages were initialized to 0. So it was
>> (depending on the config) ZONE_DMA, ZONE_DMA32 or ZONE_NORMAL.
>>
>> Now the value is random and might not even be a valid zone.
> 
> Hi David,
> 
> Have you figured out why we access struct pages during hot-unplug for
> offlined memory? Also, a panic trace would be useful in the patch.

__remove_pages() needs a zone as of now (e.g. to recalculate if the zone
is contiguous). This zone is taken from the first page of memory to be
removed. If the struct pages are uninitialized that value is random and
we might even get an invalid zone.

The zone is also used to locate pgdat.

No stack trace available so far, I'm just reading the code and trying to
understand how this whole memory hotplug/unplug machinery works.

> 
> As I understand it, the bug may occur only when hotremove is enabled, and
> default onlining of added memory is disabled. Is this correct? I

Yes, or if onlining fails.

> suspect the reason we have not heard about this bug is that it is rare
> to add memory and not to online it.

I assume so, most distros online all memory that is available as it is
being added.

> 
> Thank you,
> Pavel
>
Michal Hocko July 30, 2018, 2:10 p.m. UTC | #9
On Mon 30-07-18 15:51:45, David Hildenbrand wrote:
> On 30.07.2018 15:30, Pavel Tatashin wrote:
[...]
> > Hi David,
> > 
> > Have you figured out why we access struct pages during hot-unplug for
> > offlined memory? Also, a panic trace would be useful in the patch.
> 
> __remove_pages() needs a zone as of now (e.g. to recalculate if the zone
> is contiguous). This zone is taken from the first page of memory to be
> removed. If the struct pages are uninitialized that value is random and
> we might even get an invalid zone.
>
> The zone is also used to locate pgdat.
> 
> No stack trace available so far, I'm just reading the code and trying to
> understand how this whole memory hotplug/unplug machinery works.

Yes this is a mess (evolution of the code called otherwise ;) [1].
Functionality has been just added on top of not very well thought
through bases. This is a nice example of it. We are trying to get a zone
to 1) special case zone_device 2) recalculate zone state. The first
shouldn't really be needed, because we should simply rely on the altmap,
whether it is used for zone device or not. 2) shouldn't really be needed
if the section is offline and we can check that trivially.

[1] on the other hand I can see why people were reluctant to understand
the mess and rather tweak their tiny thing on top...
David Hildenbrand July 30, 2018, 2:42 p.m. UTC | #10
On 30.07.2018 16:10, Michal Hocko wrote:
> On Mon 30-07-18 15:51:45, David Hildenbrand wrote:
>> On 30.07.2018 15:30, Pavel Tatashin wrote:
> [...]
>>> Hi David,
>>>
>>> Have you figured out why we access struct pages during hot-unplug for
>>> offlined memory? Also, a panic trace would be useful in the patch.
>>
>> __remove_pages() needs a zone as of now (e.g. to recalculate if the zone
>> is contiguous). This zone is taken from the first page of memory to be
>> removed. If the struct pages are uninitialized that value is random and
>> we might even get an invalid zone.
>>
>> The zone is also used to locate pgdat.
>>
>> No stack trace available so far, I'm just reading the code and trying to
>> understand how this whole memory hotplug/unplug machinery works.
> 
> Yes this is a mess (evolution of the code called otherwise ;) [1].

So I guess I should not feel bad if I am having problems understanding
all the details? ;)

> Functionality has been just added on top of not very well thought
> through bases. This is a nice example of it. We are trying to get a zone
> to 1) special case zone_device 2) recalculate zone state. The first
> shouldn't really be needed, because we should simply rely on the altmap,
> whether it is used for zone device or not. 2) shouldn't really be needed
> if the section is offline and we can check that trivially.
> 

About 2, I am not sure that this is the case, or that it is that easy. To me
it looks more like remove_pages() fixes up things that should be done in
offline_pages(). Especially, if the same memory was onlined/offlined to
different zones we might be in trouble (looking at the code from a very
high-level view).

"if the section is offline and we can check that trivially" is not what
is being used here. It is "if the section was online and is now offline".

Accessing a zone when removing memory sounds very wrong. offline_pages()
should clean up everything that online_pages() did.

Removing memory with pages that are still online should not be allowed.
And I think this is already enforced via check_memblock_offlined_cb().

> [1] on the other hand I can see why people were reluctant to understand
> the mess and rather tweak their tiny thing on top...
>
Michal Hocko July 30, 2018, 2:50 p.m. UTC | #11
On Mon 30-07-18 16:42:27, David Hildenbrand wrote:
> On 30.07.2018 16:10, Michal Hocko wrote:
> > On Mon 30-07-18 15:51:45, David Hildenbrand wrote:
> >> On 30.07.2018 15:30, Pavel Tatashin wrote:
> > [...]
> >>> Hi David,
> >>>
> >>> Have you figured out why we access struct pages during hot-unplug for
> >>> offlined memory? Also, a panic trace would be useful in the patch.
> >>
> >> __remove_pages() needs a zone as of now (e.g. to recalculate if the zone
> >> is contiguous). This zone is taken from the first page of memory to be
> >> removed. If the struct pages are uninitialized that value is random and
> >> we might even get an invalid zone.
> >>
> >> The zone is also used to locate pgdat.
> >>
> >> No stack trace available so far, I'm just reading the code and trying to
> >> understand how this whole memory hotplug/unplug machinery works.
> > 
> > Yes this is a mess (evolution of the code called otherwise ;) [1].
> 
> So I guess I should not feel bad if I am having problems understanding
> all the details? ;)
> 
> > Functionality has been just added on top of not very well thought
> > through bases. This is a nice example of it. We are trying to get a zone
> > to 1) special case zone_device 2) recalculate zone state. The first
> > shouldn't really be needed, because we should simply rely on the altmap,
> > whether it is used for zone device or not. 2) shouldn't really be needed
> > if the section is offline and we can check that trivially.
> > 
> 
> About 2, I am not sure that this is the case, or that it is that easy. To me
> it looks more like remove_pages() fixes up things that should be done in
> offline_pages(). Especially, if the same memory was onlined/offlined to
> different zones we might be in trouble (looking at the code from a very
> high-level view).

Well, this might be possible. Hotplug remove path was on my todo list
for a long time. I didn't get that far TBH. shrink_zone_span begs for
some attention.

> "if the section is offline and we can check that trivially" is not what
> is being used here. It is "if the section was online and is now offline".

yes.

> Accessing a zone when removing memory sounds very wrong. offline_pages()
> should clean up everything that online_pages() did.
> 
> Removing memory with pages that are still online should not be allowed.
> And I think this is already enforced via check_memblock_offlined_cb().

yes.
David Hildenbrand July 30, 2018, 3:03 p.m. UTC | #12
On 30.07.2018 16:50, Michal Hocko wrote:
> On Mon 30-07-18 16:42:27, David Hildenbrand wrote:
>> On 30.07.2018 16:10, Michal Hocko wrote:
>>> On Mon 30-07-18 15:51:45, David Hildenbrand wrote:
>>>> On 30.07.2018 15:30, Pavel Tatashin wrote:
>>> [...]
>>>>> Hi David,
>>>>>
>>>>> Have you figured out why we access struct pages during hot-unplug for
>>>>> offlined memory? Also, a panic trace would be useful in the patch.
>>>>
>>>> __remove_pages() needs a zone as of now (e.g. to recalculate if the zone
>>>> is contiguous). This zone is taken from the first page of memory to be
>>>> removed. If the struct pages are uninitialized that value is random and
>>>> we might even get an invalid zone.
>>>>
>>>> The zone is also used to locate pgdat.
>>>>
>>>> No stack trace available so far, I'm just reading the code and trying to
>>>> understand how this whole memory hotplug/unplug machinery works.
>>>
>>> Yes this is a mess (evolution of the code called otherwise ;) [1].
>>
>> So I guess I should not feel bad if I am having problems understanding
>> all the details? ;)
>>
>>> Functionality has been just added on top of not very well thought
>>> through bases. This is a nice example of it. We are trying to get a zone
>>> to 1) special case zone_device 2) recalculate zone state. The first
>>> shouldn't really be needed, because we should simply rely on the altmap,
>>> whether it is used for zone device or not. 2) shouldn't really be needed
>>> if the section is offline and we can check that trivially.
>>>
>>
>> About 2, I am not sure that this is the case, or that it is that easy. To me
>> it looks more like remove_pages() fixes up things that should be done in
>> offline_pages(). Especially, if the same memory was onlined/offlined to
>> different zones we might be in trouble (looking at the code from a very
>> high-level view).
> 
> Well, this might be possible. Hotplug remove path was on my todo list
> for a long time. I didn't get that far TBH. shrink_zone_span begs for
> some attention.
> 

So I guess we agree that the right fix for this is to not touch struct
pages when removing memory, correct?
Pavel Tatashin July 30, 2018, 3:45 p.m. UTC | #13
>
> So i guess we agree that the right fix for this is to not touch struct
> pages when removing memory, correct?

Yes, in my opinion that would be the correct fix.

Thank you,
Pavel

>
> --
>
> Thanks,
>
> David / dhildenb
>

Patch

diff --git a/drivers/base/node.c b/drivers/base/node.c
index a5e821d09656..3ec78f80afe2 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -408,7 +408,6 @@  int register_mem_sect_under_node(struct memory_block *mem_blk, int nid,
 	if (!mem_blk)
 		return -EFAULT;
 
-	mem_blk->nid = nid;
 	if (!node_online(nid))
 		return 0;
 
diff --git a/include/linux/memory.h b/include/linux/memory.h
index a6ddefc60517..8a0864a65a98 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -33,7 +33,6 @@  struct memory_block {
 	void *hw;			/* optional pointer to fw/hw data */
 	int (*phys_callback)(struct memory_block *);
 	struct device dev;
-	int nid;			/* NID for this memory block */
 };
 
 int arch_get_memory_phys_device(unsigned long start_pfn);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d3a3842316b8..e6bf3527b7a2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1162,7 +1162,15 @@  static inline void set_page_address(struct page *page, void *address)
 {
 	page->virtual = address;
 }
+static inline void set_page_virtual(struct page *page, enum zone_type zone)
+{
+	/* The shift won't overflow because ZONE_NORMAL is below 4G. */
+	if (!is_highmem_idx(zone))
+		set_page_address(page, __va(page_to_pfn(page) << PAGE_SHIFT));
+}
 #define page_address_init()  do { } while(0)
+#else
+#define set_page_virtual(page, zone)  do { } while(0)
 #endif
 
 #if defined(HASHED_PAGE_VIRTUAL)
@@ -2116,6 +2124,8 @@  extern unsigned long find_min_pfn_with_active_regions(void);
 extern void free_bootmem_with_active_regions(int nid,
 						unsigned long max_low_pfn);
 extern void sparse_memory_present_with_active_regions(int nid);
+extern void __meminit init_single_page(struct page *page, unsigned long pfn,
+				       unsigned long zone, int nid);
 
 #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 7deb49f69e27..3f28ca3c3a33 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -250,6 +250,7 @@  static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
 		struct vmem_altmap *altmap, bool want_memblock)
 {
 	int ret;
+	int i;
 
 	if (pfn_valid(phys_start_pfn))
 		return -EEXIST;
@@ -258,6 +259,23 @@  static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
 	if (ret < 0)
 		return ret;
 
+	/*
+	 * Initialize all pages in the section before fully exposing them to the
+	 * system, so nobody will stumble over a half-initialized state.
+	 */
+	for (i = 0; i < PAGES_PER_SECTION; i++) {
+		unsigned long pfn = phys_start_pfn + i;
+		struct page *page;
+
+		if (!pfn_valid(pfn))
+			continue;
+		page = pfn_to_page(pfn);
+
+		/* dummy zone, the actual one will be set when onlining pages */
+		init_single_page(page, pfn, ZONE_NORMAL, nid);
+		SetPageReserved(page);
+	}
+
 	if (!want_memblock)
 		return 0;
 
@@ -891,15 +909,8 @@  int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 	int nid;
 	int ret;
 	struct memory_notify arg;
-	struct memory_block *mem;
-
-	/*
-	 * We can't use pfn_to_nid() because nid might be stored in struct page
-	 * which is not yet initialized. Instead, we find nid from memory block.
-	 */
-	mem = find_memory_block(__pfn_to_section(pfn));
-	nid = mem->nid;
 
+	nid = pfn_to_nid(pfn);
 	/* associate pfn range with the zone */
 	zone = move_pfn_range(online_type, nid, pfn, nr_pages);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a790ef4be74e..8d81df4c40ab 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1168,7 +1168,7 @@  static void free_one_page(struct zone *zone,
 	spin_unlock(&zone->lock);
 }
 
-static void __meminit __init_single_page(struct page *page, unsigned long pfn,
+void __meminit init_single_page(struct page *page, unsigned long pfn,
 				unsigned long zone, int nid)
 {
 	mm_zero_struct_page(page);
@@ -1178,11 +1178,7 @@  static void __meminit __init_single_page(struct page *page, unsigned long pfn,
 	page_cpupid_reset_last(page);
 
 	INIT_LIST_HEAD(&page->lru);
-#ifdef WANT_PAGE_VIRTUAL
-	/* The shift won't overflow because ZONE_NORMAL is below 4G. */
-	if (!is_highmem_idx(zone))
-		set_page_address(page, __va(pfn << PAGE_SHIFT));
-#endif
+	set_page_virtual(page, zone);
 }
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
@@ -1203,7 +1199,7 @@  static void __meminit init_reserved_page(unsigned long pfn)
 		if (pfn >= zone->zone_start_pfn && pfn < zone_end_pfn(zone))
 			break;
 	}
-	__init_single_page(pfn_to_page(pfn), pfn, zid, nid);
+	init_single_page(pfn_to_page(pfn), pfn, zid, nid);
 }
 #else
 static inline void init_reserved_page(unsigned long pfn)
@@ -1520,7 +1516,7 @@  static unsigned long  __init deferred_init_pages(int nid, int zid,
 		} else {
 			page++;
 		}
-		__init_single_page(page, pfn, zid, nid);
+		init_single_page(page, pfn, zid, nid);
 		nr_pages++;
 	}
 	return (nr_pages);
@@ -5519,9 +5515,13 @@  void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 
 not_early:
 		page = pfn_to_page(pfn);
-		__init_single_page(page, pfn, zone, nid);
-		if (context == MEMMAP_HOTPLUG)
-			SetPageReserved(page);
+		if (context == MEMMAP_HOTPLUG) {
+			/* everything but the zone was initialized */
+			set_page_zone(page, zone);
+			set_page_virtual(page, zone);
+		} else {
+			init_single_page(page, pfn, zone, nid);
+		}
 
 		/*
 		 * Mark the block movable so that blocks are reserved for
@@ -6386,7 +6385,7 @@  void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
 #if defined(CONFIG_HAVE_MEMBLOCK) && !defined(CONFIG_FLAT_NODE_MEM_MAP)
 /*
  * Only struct pages that are backed by physical memory are zeroed and
- * initialized by going through __init_single_page(). But, there are some
+ * initialized by going through init_single_page(). But, there are some
  * struct pages which are reserved in memblock allocator and their fields
  * may be accessed (for example page_to_pfn() on some configuration accesses
  * flags). We must explicitly zero those struct pages.