diff mbox series

mm/spase: never partially remove memmap for early section

Message ID 20200623094258.6705-1-richard.weiyang@linux.alibaba.com (mailing list archive)
State New, archived
Headers show
Series mm/spase: never partially remove memmap for early section | expand

Commit Message

Wei Yang June 23, 2020, 9:42 a.m. UTC
For early sections, we assumes its memmap will never be partially
removed. But current behavior breaks this.

Let's correct it.

Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
---
 mm/sparse.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Comments

David Hildenbrand June 23, 2020, 12:44 p.m. UTC | #1
On 23.06.20 11:42, Wei Yang wrote:
> For early sections, we assumes its memmap will never be partially
> removed. But current behavior breaks this.
> 
> Let's correct it.
> 
> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
> Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
> ---
>  mm/sparse.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/sparse.c b/mm/sparse.c
> index b2b9a3e34696..1a0069f492f5 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -825,10 +825,10 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
>  		ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
>  	}
>  
> -	if (section_is_early && memmap)
> -		free_map_bootmem(memmap);
> -	else
> +	if (!section_is_early)
>  		depopulate_section_memmap(pfn, nr_pages, altmap);
> +	else if (memmap)
> +		free_map_bootmem(memmap);
>  
>  	if (empty)
>  		ms->section_mem_map = (unsigned long)NULL;
> 

Agreed, that's what pfn_valid() and section_activate() expect.

"If we hot-add memory into such a section then we do not need to
populate the memmap and can simply reuse what is already there." - this
is the case when hot-adding sub-sections into partially populated early
sections, and has to be the case when re-hot-adding after hot-removing.

Acked-by: David Hildenbrand <david@redhat.com>


I am also not convinced that the complicated sparse_decode_mem_map()
handling in that function is required - ms->section_mem_map &
SECTION_MAP_MASK is sufficient for this use case of removing the memmap
of a full early section once empty.
Wei Yang June 23, 2020, 1:02 p.m. UTC | #2
On Tue, Jun 23, 2020 at 02:44:02PM +0200, David Hildenbrand wrote:
>On 23.06.20 11:42, Wei Yang wrote:
>> For early sections, we assumes its memmap will never be partially
>> removed. But current behavior breaks this.
>> 
>> Let's correct it.
>> 
>> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
>> Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
>> ---
>>  mm/sparse.c | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>> 
>> diff --git a/mm/sparse.c b/mm/sparse.c
>> index b2b9a3e34696..1a0069f492f5 100644
>> --- a/mm/sparse.c
>> +++ b/mm/sparse.c
>> @@ -825,10 +825,10 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
>>  		ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
>>  	}
>>  
>> -	if (section_is_early && memmap)
>> -		free_map_bootmem(memmap);
>> -	else
>> +	if (!section_is_early)
>>  		depopulate_section_memmap(pfn, nr_pages, altmap);
>> +	else if (memmap)
>> +		free_map_bootmem(memmap);
>>  
>>  	if (empty)
>>  		ms->section_mem_map = (unsigned long)NULL;
>> 
>
>Agreed, that's what pfn_valid() and section_activate() expect.
>
>"If we hot-add memory into such a section then we do not need to
>populate the memmap and can simply reuse what is already there." - this
>is the case when hot-adding sub-sections into partially populated early
>sections, and has to be the case when re-hot-adding after hot-removing.
>
>Acked-by: David Hildenbrand <david@redhat.com>
>
>
>I am also not convinced that the complicated sparse_decode_mem_map()
>handling in that function is required - ms->section_mem_map &
>SECTION_MAP_MASK is sufficient for this use case of removing the memmap
>of a full early section once empty.
>

You mean remove this line?

    	memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);

Then what to passed to free_map_bootmem() ?

>-- 
>Thanks,
>
>David / dhildenb
David Hildenbrand June 23, 2020, 1:16 p.m. UTC | #3
On 23.06.20 15:02, Wei Yang wrote:
> On Tue, Jun 23, 2020 at 02:44:02PM +0200, David Hildenbrand wrote:
>> On 23.06.20 11:42, Wei Yang wrote:
>>> For early sections, we assumes its memmap will never be partially
>>> removed. But current behavior breaks this.
>>>
>>> Let's correct it.
>>>
>>> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
>>> Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
>>> ---
>>>  mm/sparse.c | 6 +++---
>>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/mm/sparse.c b/mm/sparse.c
>>> index b2b9a3e34696..1a0069f492f5 100644
>>> --- a/mm/sparse.c
>>> +++ b/mm/sparse.c
>>> @@ -825,10 +825,10 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
>>>  		ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
>>>  	}
>>>  
>>> -	if (section_is_early && memmap)
>>> -		free_map_bootmem(memmap);
>>> -	else
>>> +	if (!section_is_early)
>>>  		depopulate_section_memmap(pfn, nr_pages, altmap);
>>> +	else if (memmap)
>>> +		free_map_bootmem(memmap);
>>>  
>>>  	if (empty)
>>>  		ms->section_mem_map = (unsigned long)NULL;
>>>
>>
>> Agreed, that's what pfn_valid() and section_activate() expect.
>>
>> "If we hot-add memory into such a section then we do not need to
>> populate the memmap and can simply reuse what is already there." - this
>> is the case when hot-adding sub-sections into partially populated early
>> sections, and has to be the case when re-hot-adding after hot-removing.
>>
>> Acked-by: David Hildenbrand <david@redhat.com>
>>
>>
>> I am also not convinced that the complicated sparse_decode_mem_map()
>> handling in that function is required - ms->section_mem_map &
>> SECTION_MAP_MASK is sufficient for this use case of removing the memmap
>> of a full early section once empty.
>>
> 
> You mean remove this line?
> 
>     	memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
> 
> Then what to passed to free_map_bootmem() ?

Never mind, I misread something,  sparse_decode_mem_map() is indeed
necessary.
Michal Hocko June 23, 2020, 3:18 p.m. UTC | #4
On Tue 23-06-20 17:42:58, Wei Yang wrote:
> For early sections, we assumes its memmap will never be partially
> removed. But current behavior breaks this.
> 
> Let's correct it.
> 
> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
> Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>

Can a user trigger this or is this a theoretical bug?

> ---
>  mm/sparse.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/sparse.c b/mm/sparse.c
> index b2b9a3e34696..1a0069f492f5 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -825,10 +825,10 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
>  		ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
>  	}
>  
> -	if (section_is_early && memmap)
> -		free_map_bootmem(memmap);
> -	else
> +	if (!section_is_early)

This begs a comment.

>  		depopulate_section_memmap(pfn, nr_pages, altmap);
> +	else if (memmap)
> +		free_map_bootmem(memmap);
>  
>  	if (empty)
>  		ms->section_mem_map = (unsigned long)NULL;
> -- 
> 2.20.1 (Apple Git-117)
>
Wei Yang June 23, 2020, 9:48 p.m. UTC | #5
On Tue, Jun 23, 2020 at 05:18:28PM +0200, Michal Hocko wrote:
>On Tue 23-06-20 17:42:58, Wei Yang wrote:
>> For early sections, we assumes its memmap will never be partially
>> removed. But current behavior breaks this.
>> 
>> Let's correct it.
>> 
>> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
>> Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
>
>Can a user trigger this or is this a theoretical bug?
>

I don't expect to have a non-section aligned system, so this is a theoretical
bug to me.

>> ---
>>  mm/sparse.c | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>> 
>> diff --git a/mm/sparse.c b/mm/sparse.c
>> index b2b9a3e34696..1a0069f492f5 100644
>> --- a/mm/sparse.c
>> +++ b/mm/sparse.c
>> @@ -825,10 +825,10 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
>>  		ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
>>  	}
>>  
>> -	if (section_is_early && memmap)
>> -		free_map_bootmem(memmap);
>> -	else
>> +	if (!section_is_early)
>
>This begs a comment.
>

Like:

    /* Only depopulate sub-section memmap for non early section. */

Looks good to you?

>>  		depopulate_section_memmap(pfn, nr_pages, altmap);
>> +	else if (memmap)
>> +		free_map_bootmem(memmap);
>>  
>>  	if (empty)
>>  		ms->section_mem_map = (unsigned long)NULL;
>> -- 
>> 2.20.1 (Apple Git-117)
>> 
>
>-- 
>Michal Hocko
>SUSE Labs
Dan Williams June 24, 2020, 12:21 a.m. UTC | #6
On Tue, Jun 23, 2020 at 2:43 AM Wei Yang
<richard.weiyang@linux.alibaba.com> wrote:
>
> For early sections, we assumes its memmap will never be partially
> removed. But current behavior breaks this.

Where do we assume that?

The primary use case for this was mapping pmem that collides with
System-RAM in the same 128MB section. That collision will certainly be
depopulated on-demand depending on the state of the pmem device. So,
I'm not understanding the problem or the benefit of this change.
Wei Yang June 24, 2020, 1:11 a.m. UTC | #7
On Tue, Jun 23, 2020 at 05:21:06PM -0700, Dan Williams wrote:
>On Tue, Jun 23, 2020 at 2:43 AM Wei Yang
><richard.weiyang@linux.alibaba.com> wrote:
>>
>> For early sections, we assumes its memmap will never be partially
>> removed. But current behavior breaks this.
>
>Where do we assume that?
>
>The primary use case for this was mapping pmem that collides with
>System-RAM in the same 128MB section. That collision will certainly be
>depopulated on-demand depending on the state of the pmem device. So,
>I'm not understanding the problem or the benefit of this change.

Hi, Dan

There is a discussion in the thread you just replied:

    mm/shuffle: don't move pages between zones and don't read garbage memmaps

Besides this, the comment in section_activate() says:

    * The early init code does not consider partially populated
    * initial sections, it simply assumes that memory will never be
    * referenced.  If we hot-add memory into such a section then we
    * do not need to populate the memmap and can simply reuse what
    * is already there.

Per my understanding, if we hot-add then hot-remove the sub-section, we may
not have a valid memmep for this part sub-section? Because we depopulate it at
hot-remove while we don't populate it when hot-add.

Is my understanding correct?
Baoquan He June 24, 2020, 1:47 a.m. UTC | #8
On 06/23/20 at 05:21pm, Dan Williams wrote:
> On Tue, Jun 23, 2020 at 2:43 AM Wei Yang
> <richard.weiyang@linux.alibaba.com> wrote:
> >
> > For early sections, we assumes its memmap will never be partially
> > removed. But current behavior breaks this.
> 
> Where do we assume that?
> 
> The primary use case for this was mapping pmem that collides with
> System-RAM in the same 128MB section. That collision will certainly be
> depopulated on-demand depending on the state of the pmem device. So,
> I'm not understanding the problem or the benefit of this change.

I was also confused when review this patch, the patch log is a little
short and simple. From the current code, with SPARSE_VMEMMAP enabled, we
do build memmap for the whole memory section during boot, even though
some of them may be partially populated. We just mark the subsection map
for present pages. 

Later, if pmem device is mapped into the partially boot memory section,
we just fill the relevant subsection map, do return directly, w/o building
the memmap for it, in section_activate(). Because the memmap for the
unpresent RAM part have been there. I guess this is what Wei is trying to 
do to keep the behaviour be consistent for pmem device adding, or
pmem device removing and later adding again.

Please correct me if I am wrong.

To me, fixing it looks good. But a clear doc or code comment is
necessary so that people can understand the code with less time.
Leaving it as is doesn't cause harm. I personally tend to choose
the former.

	paging_init()
	    ->sparse_init()
	        ->sparse_init_nid()
	          {
                      ...
                      for_each_present_section_nr(pnum_begin, pnum) {
                          ...
                          map = __populate_section_memmap(pfn, PAGES_PER_SECTION,
                                     nid, NULL);
                          ...
                      }
                  }
             ...
             ->zone_sizes_init()
                 ->free_area_init()
                   {
                       for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
                           subsection_map_init(start_pfn, end_pfn - start_pfn);
                       }
                   {
		
         __add_pages()
             ->sparse_add_section()
                 ->section_activate()
                   {
                       ...
                       fill_subsection_map();
                       if (nr_pages < PAGES_PER_SECTION && early_section(ms))   <----------*********
                           return pfn_to_page(pfn);
                       ...
                   }
>
Baoquan He June 24, 2020, 2:14 a.m. UTC | #9
On 06/24/20 at 09:47am, Baoquan He wrote:
> On 06/23/20 at 05:21pm, Dan Williams wrote:
> > On Tue, Jun 23, 2020 at 2:43 AM Wei Yang
> > <richard.weiyang@linux.alibaba.com> wrote:
> > >
> > > For early sections, we assumes its memmap will never be partially
> > > removed. But current behavior breaks this.
> > 
> > Where do we assume that?
> > 
> > The primary use case for this was mapping pmem that collides with
> > System-RAM in the same 128MB section. That collision will certainly be
> > depopulated on-demand depending on the state of the pmem device. So,
> > I'm not understanding the problem or the benefit of this change.
> 
> I was also confused when review this patch, the patch log is a little
> short and simple. From the current code, with SPARSE_VMEMMAP enabled, we
> do build memmap for the whole memory section during boot, even though
> some of them may be partially populated. We just mark the subsection map
> for present pages. 
> 
> Later, if pmem device is mapped into the partially boot memory section,
> we just fill the relevant subsection map, do return directly, w/o building
> the memmap for it, in section_activate(). Because the memmap for the
> unpresent RAM part have been there. I guess this is what Wei is trying to 
> do to keep the behaviour be consistent for pmem device adding, or

OK, from Wei's reply I realized this patch is a necessary fix. If we
depoluate the partial memmap for pmem removing, the later pmem re-adding
won't have a valid memmap.

> pmem device removing and later adding again.
> 
> Please correct me if I am wrong.
> 
> To me, fixing it looks good. But a clear doc or code comment is
> necessary so that people can understand the code with less time.
> Leaving it as is doesn't cause harm. I personally tend to choose
> the former.
> 
> 	paging_init()
> 	    ->sparse_init()
> 	        ->sparse_init_nid()
> 	          {
>                       ...
>                       for_each_present_section_nr(pnum_begin, pnum) {
>                           ...
>                           map = __populate_section_memmap(pfn, PAGES_PER_SECTION,
>                                      nid, NULL);
>                           ...
>                       }
>                   }
>              ...
>              ->zone_sizes_init()
>                  ->free_area_init()
>                    {
>                        for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
>                            subsection_map_init(start_pfn, end_pfn - start_pfn);
>                        }
>                    {
> 		
>          __add_pages()
>              ->sparse_add_section()
>                  ->section_activate()
>                    {
>                        ...
>                        fill_subsection_map();
>                        if (nr_pages < PAGES_PER_SECTION && early_section(ms))   <----------*********
>                            return pfn_to_page(pfn);
>                        ...
>                    }
> > 
> 
>
Wei Yang June 24, 2020, 3:46 a.m. UTC | #10
On Wed, Jun 24, 2020 at 09:47:37AM +0800, Baoquan He wrote:
>On 06/23/20 at 05:21pm, Dan Williams wrote:
>> On Tue, Jun 23, 2020 at 2:43 AM Wei Yang
>> <richard.weiyang@linux.alibaba.com> wrote:
>> >
>> > For early sections, we assumes its memmap will never be partially
>> > removed. But current behavior breaks this.
>> 
>> Where do we assume that?
>> 
>> The primary use case for this was mapping pmem that collides with
>> System-RAM in the same 128MB section. That collision will certainly be
>> depopulated on-demand depending on the state of the pmem device. So,
>> I'm not understanding the problem or the benefit of this change.
>
>I was also confused when review this patch, the patch log is a little
>short and simple. From the current code, with SPARSE_VMEMMAP enabled, we
>do build memmap for the whole memory section during boot, even though
>some of them may be partially populated. We just mark the subsection map
>for present pages. 
>
>Later, if pmem device is mapped into the partially boot memory section,
>we just fill the relevant subsection map, do return directly, w/o building
>the memmap for it, in section_activate(). Because the memmap for the
>unpresent RAM part have been there. I guess this is what Wei is trying to 
>do to keep the behaviour be consistent for pmem device adding, or
>pmem device removing and later adding again.
>
>Please correct me if I am wrong.

You are right here.

>
>To me, fixing it looks good. But a clear doc or code comment is
>necessary so that people can understand the code with less time.
>Leaving it as is doesn't cause harm. I personally tend to choose
>the former.
>

The former is to add a clear doc?

>	paging_init()
>	    ->sparse_init()
>	        ->sparse_init_nid()
>	          {
>                      ...
>                      for_each_present_section_nr(pnum_begin, pnum) {
>                          ...
>                          map = __populate_section_memmap(pfn, PAGES_PER_SECTION,
>                                     nid, NULL);
>                          ...
>                      }
>                  }
>             ...
>             ->zone_sizes_init()
>                 ->free_area_init()
>                   {
>                       for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
>                           subsection_map_init(start_pfn, end_pfn - start_pfn);
>                       }
>                   {
>		
>         __add_pages()
>             ->sparse_add_section()
>                 ->section_activate()
>                   {
>                       ...
>                       fill_subsection_map();
>                       if (nr_pages < PAGES_PER_SECTION && early_section(ms))   <----------*********
>                           return pfn_to_page(pfn);
>                       ...
>                   }
>>
Baoquan He June 24, 2020, 3:52 a.m. UTC | #11
On 06/24/20 at 11:46am, Wei Yang wrote:
> On Wed, Jun 24, 2020 at 09:47:37AM +0800, Baoquan He wrote:
> >On 06/23/20 at 05:21pm, Dan Williams wrote:
> >> On Tue, Jun 23, 2020 at 2:43 AM Wei Yang
> >> <richard.weiyang@linux.alibaba.com> wrote:
> >> >
> >> > For early sections, we assumes its memmap will never be partially
> >> > removed. But current behavior breaks this.
> >> 
> >> Where do we assume that?
> >> 
> >> The primary use case for this was mapping pmem that collides with
> >> System-RAM in the same 128MB section. That collision will certainly be
> >> depopulated on-demand depending on the state of the pmem device. So,
> >> I'm not understanding the problem or the benefit of this change.
> >
> >I was also confused when review this patch, the patch log is a little
> >short and simple. From the current code, with SPARSE_VMEMMAP enabled, we
> >do build memmap for the whole memory section during boot, even though
> >some of them may be partially populated. We just mark the subsection map
> >for present pages. 
> >
> >Later, if pmem device is mapped into the partially boot memory section,
> >we just fill the relevant subsection map, do return directly, w/o building
> >the memmap for it, in section_activate(). Because the memmap for the
> >unpresent RAM part have been there. I guess this is what Wei is trying to 
> >do to keep the behaviour be consistent for pmem device adding, or
> >pmem device removing and later adding again.
> >
> >Please correct me if I am wrong.
> 
> You are right here.
> 
> >
> >To me, fixing it looks good. But a clear doc or code comment is
> >necessary so that people can understand the code with less time.
> >Leaving it as is doesn't cause harm. I personally tend to choose
> >the former.
> >
> 
> The former is to add a clear doc?

Sorry for the confusion. The former means the fix in your patch. Maybe a
improved log and some code comment adding can make it more perfect.

> 
> >	paging_init()
> >	    ->sparse_init()
> >	        ->sparse_init_nid()
> >	          {
> >                      ...
> >                      for_each_present_section_nr(pnum_begin, pnum) {
> >                          ...
> >                          map = __populate_section_memmap(pfn, PAGES_PER_SECTION,
> >                                     nid, NULL);
> >                          ...
> >                      }
> >                  }
> >             ...
> >             ->zone_sizes_init()
> >                 ->free_area_init()
> >                   {
> >                       for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
> >                           subsection_map_init(start_pfn, end_pfn - start_pfn);
> >                       }
> >                   {
> >		
> >         __add_pages()
> >             ->sparse_add_section()
> >                 ->section_activate()
> >                   {
> >                       ...
> >                       fill_subsection_map();
> >                       if (nr_pages < PAGES_PER_SECTION && early_section(ms))   <----------*********
> >                           return pfn_to_page(pfn);
> >                       ...
> >                   }
> >> 
> 
> -- 
> Wei Yang
> Help you, Help me
>
Wei Yang June 24, 2020, 3:56 a.m. UTC | #12
On Wed, Jun 24, 2020 at 11:52:36AM +0800, Baoquan He wrote:
>On 06/24/20 at 11:46am, Wei Yang wrote:
>> On Wed, Jun 24, 2020 at 09:47:37AM +0800, Baoquan He wrote:
>> >On 06/23/20 at 05:21pm, Dan Williams wrote:
>> >> On Tue, Jun 23, 2020 at 2:43 AM Wei Yang
>> >> <richard.weiyang@linux.alibaba.com> wrote:
>> >> >
>> >> > For early sections, we assumes its memmap will never be partially
>> >> > removed. But current behavior breaks this.
>> >> 
>> >> Where do we assume that?
>> >> 
>> >> The primary use case for this was mapping pmem that collides with
>> >> System-RAM in the same 128MB section. That collision will certainly be
>> >> depopulated on-demand depending on the state of the pmem device. So,
>> >> I'm not understanding the problem or the benefit of this change.
>> >
>> >I was also confused when review this patch, the patch log is a little
>> >short and simple. From the current code, with SPARSE_VMEMMAP enabled, we
>> >do build memmap for the whole memory section during boot, even though
>> >some of them may be partially populated. We just mark the subsection map
>> >for present pages. 
>> >
>> >Later, if pmem device is mapped into the partially boot memory section,
>> >we just fill the relevant subsection map, do return directly, w/o building
>> >the memmap for it, in section_activate(). Because the memmap for the
>> >unpresent RAM part have been there. I guess this is what Wei is trying to 
>> >do to keep the behaviour be consistent for pmem device adding, or
>> >pmem device removing and later adding again.
>> >
>> >Please correct me if I am wrong.
>> 
>> You are right here.
>> 
>> >
>> >To me, fixing it looks good. But a clear doc or code comment is
>> >necessary so that people can understand the code with less time.
>> >Leaving it as is doesn't cause harm. I personally tend to choose
>> >the former.
>> >
>> 
>> The former is to add a clear doc?
>
>Sorry for the confusion. The former means the fix in your patch. Maybe a
>improved log and some code comment adding can make it more perfect.
>

Sure, I would try to add more log and comments, in case you have some good
suggestion, just let me know :)
Wei Yang June 24, 2020, 6:13 a.m. UTC | #13
On Tue, Jun 23, 2020 at 05:18:28PM +0200, Michal Hocko wrote:
>On Tue 23-06-20 17:42:58, Wei Yang wrote:
>> For early sections, we assumes its memmap will never be partially
>> removed. But current behavior breaks this.
>> 
>> Let's correct it.
>> 
>> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
>> Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
>
>Can a user trigger this or is this a theoretical bug?

Let me rewrite the changelog a little. Look forward any comments.

   For early sections, its memmap is handled specially even sub-section is
   enabled. The memmap could only be populated as a whole.
   
   Quoted from the comment of section_activate():
   
       * The early init code does not consider partially populated
       * initial sections, it simply assumes that memory will never be
       * referenced.  If we hot-add memory into such a section then we
       * do not need to populate the memmap and can simply reuse what
       * is already there.
   
   While current section_deactivate() breaks this rule. When hot-remove a
   sub-section, section_deactivate() would depopulate its memmap. The
   consequence is if we hot-add this subsection again, its memmap never get
   proper populated.

>
>> ---
>>  mm/sparse.c | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>> 
>> diff --git a/mm/sparse.c b/mm/sparse.c
>> index b2b9a3e34696..1a0069f492f5 100644
>> --- a/mm/sparse.c
>> +++ b/mm/sparse.c
>> @@ -825,10 +825,10 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
>>  		ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
>>  	}
>>  
>> -	if (section_is_early && memmap)
>> -		free_map_bootmem(memmap);
>> -	else
>> +	if (!section_is_early)
>
>This begs a comment.
>
>>  		depopulate_section_memmap(pfn, nr_pages, altmap);
>> +	else if (memmap)
>> +		free_map_bootmem(memmap);
>>  
>>  	if (empty)
>>  		ms->section_mem_map = (unsigned long)NULL;
>> -- 
>> 2.20.1 (Apple Git-117)
>> 
>
>-- 
>Michal Hocko
>SUSE Labs
David Hildenbrand June 24, 2020, 7:48 a.m. UTC | #14
On 23.06.20 17:18, Michal Hocko wrote:
> On Tue 23-06-20 17:42:58, Wei Yang wrote:
>> For early sections, we assumes its memmap will never be partially
>> removed. But current behavior breaks this.
>>
>> Let's correct it.
>>
>> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
>> Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
> 
> Can a user trigger this or is this a theoretical bug?

I tried to reproduce it, but somehow I get unexpected behavior.
With a hacked QEMU I can get

$ cat /proc/iomem
[...]
100000000-143ffffff : System RAM
144000000-343dfffff : Persistent Memory
  144000000-1441fffff : namespace0.0
  144200000-144ffffff : dax0.0

After
$ ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax --size=16M

I get

$ cat /proc/iomem
[...]
100000000-143ffffff : System RAM
144000000-343dfffff : Persistent Memory
  144000000-1441fffff : namespace0.0
  144200000-144ffffff : dax0.0

I can trigger remove+re-add via 
$ ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax --size=16M

So we clearly have an overlap between System RAM and dax0.0 within a section.
However, I never get early_section() to trigger in
section_activate()/section_deactivate() ? That's unexpected


Definitely something seems to go wrong after the first "ndctl create-namespace".
Using a random page walker:

[root@localhost ~]# ./page-types 
[  387.019229] general protection fault, probably for non-canonical address 0xfdfdfdfdfdfdfdfc: 0000 [#1] SMP NOPTI
[  387.020172] CPU: 17 PID: 1314 Comm: page-types Kdump: loaded Tainted: G        W         5.8.0-rc2 #20
[  387.021015] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4
[  387.022056] RIP: 0010:stable_page_flags+0x27/0x3f0
[  387.022519] Code: 00 00 00 66 66 66 66 90 48 85 ff 0f 84 d2 03 00 00 41 54 55 48 89 fd 53 48 8b 57 08 48 8b 1f 48f
[  387.024291] RSP: 0018:ffff9f8781057e58 EFLAGS: 00010202
[  387.024775] RAX: fdfdfdfdfdfdfdfc RBX: fdfdfdfdfdfdfdfd RCX: 00007ffc4f4f1f78
[  387.025423] RDX: 0000000000000001 RSI: 0000002000000000 RDI: ffffc590c5100000
[  387.026052] RBP: ffffc590c5100000 R08: 0000000000000001 R09: 0000000000000000
[  387.026696] R10: 0000000000000000 R11: 0000000000000000 R12: 00007ffc4f4f1f80
[  387.027324] R13: 0000000000040000 R14: 00007ffc4f4d1f80 R15: ffffffffa1577ee0
[  387.027974] FS:  00007f91a0f9c580(0000) GS:ffff888b3f7c0000(0000) knlGS:0000000000000000
[  387.028699] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  387.029223] CR2: 0000000000449b00 CR3: 000000007a7fc000 CR4: 00000000000006e0
[  387.029864] Call Trace:
[  387.030108]  kpageflags_read+0xcc/0x160
[  387.030473]  proc_reg_read+0x53/0x80
[  387.030809]  vfs_read+0x9d/0x150
[  387.031114]  ksys_pread64+0x65/0xa0
[  387.031449]  do_syscall_64+0x4d/0x90
[  387.031783]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
David Hildenbrand June 24, 2020, 8:04 a.m. UTC | #15
On 24.06.20 09:48, David Hildenbrand wrote:
> On 23.06.20 17:18, Michal Hocko wrote:
>> On Tue 23-06-20 17:42:58, Wei Yang wrote:
>>> For early sections, we assumes its memmap will never be partially
>>> removed. But current behavior breaks this.
>>>
>>> Let's correct it.
>>>
>>> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
>>> Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
>>
>> Can a user trigger this or is this a theoretical bug?
> 
> I tried to reproduce it, but somehow I get unexpected behavior.
> With a hacked QEMU I can get
> 
> $ cat /proc/iomem
> [...]
> 100000000-143ffffff : System RAM
> 144000000-343dfffff : Persistent Memory
>   144000000-1441fffff : namespace0.0
>   144200000-144ffffff : dax0.0
> 
> After
> $ ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax --size=16M
> 
> I get
> 
> $ cat /proc/iomem
> [...]
> 100000000-143ffffff : System RAM
> 144000000-343dfffff : Persistent Memory
>   144000000-1441fffff : namespace0.0
>   144200000-144ffffff : dax0.0
> 
> I can trigger remove+re-add via 
> $ ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax --size=16M
> 
> So we clearly have an overlap between System RAM and dax0.0 within a section.
> However, I never get early_section() to trigger in
> section_activate()/section_deactivate() ? That's unexpected

... booting the correct kernel I can see the printk's I added ... I'll
continue to poke it with a stick.
Wei Yang June 24, 2020, 8:13 a.m. UTC | #16
On Wed, Jun 24, 2020 at 09:48:43AM +0200, David Hildenbrand wrote:
>On 23.06.20 17:18, Michal Hocko wrote:
>> On Tue 23-06-20 17:42:58, Wei Yang wrote:
>>> For early sections, we assumes its memmap will never be partially
>>> removed. But current behavior breaks this.
>>>
>>> Let's correct it.
>>>
>>> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
>>> Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
>> 
>> Can a user trigger this or is this a theoretical bug?
>
>I tried to reproduce it, but somehow I get unexpected behavior.
>With a hacked QEMU I can get

David,

Thanks for your effort. Would you mind sharing your qemu command line, so that
I can have a try at my side?

>
>$ cat /proc/iomem
>[...]
>100000000-143ffffff : System RAM
>144000000-343dfffff : Persistent Memory
>  144000000-1441fffff : namespace0.0
>  144200000-144ffffff : dax0.0
>
>After
>$ ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax --size=16M
>
>I get
>
>$ cat /proc/iomem
>[...]
>100000000-143ffffff : System RAM
>144000000-343dfffff : Persistent Memory
>  144000000-1441fffff : namespace0.0
>  144200000-144ffffff : dax0.0
>

The memory layout seems not changed.

>I can trigger remove+re-add via 
>$ ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax --size=16M

Do we need to change the mode to force the reconfig?

>
>So we clearly have an overlap between System RAM and dax0.0 within a section.
>However, I never get early_section() to trigger in
>section_activate()/section_deactivate() ? That's unexpected
>
>
>Definitely something seems to go wrong after the first "ndctl create-namespace".
>Using a random page walker:
>
>[root@localhost ~]# ./page-types 

What is this page-types?

>[  387.019229] general protection fault, probably for non-canonical address 0xfdfdfdfdfdfdfdfc: 0000 [#1] SMP NOPTI
>[  387.020172] CPU: 17 PID: 1314 Comm: page-types Kdump: loaded Tainted: G        W         5.8.0-rc2 #20
>[  387.021015] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4
>[  387.022056] RIP: 0010:stable_page_flags+0x27/0x3f0
>[  387.022519] Code: 00 00 00 66 66 66 66 90 48 85 ff 0f 84 d2 03 00 00 41 54 55 48 89 fd 53 48 8b 57 08 48 8b 1f 48f
>[  387.024291] RSP: 0018:ffff9f8781057e58 EFLAGS: 00010202
>[  387.024775] RAX: fdfdfdfdfdfdfdfc RBX: fdfdfdfdfdfdfdfd RCX: 00007ffc4f4f1f78
>[  387.025423] RDX: 0000000000000001 RSI: 0000002000000000 RDI: ffffc590c5100000
>[  387.026052] RBP: ffffc590c5100000 R08: 0000000000000001 R09: 0000000000000000
>[  387.026696] R10: 0000000000000000 R11: 0000000000000000 R12: 00007ffc4f4f1f80
>[  387.027324] R13: 0000000000040000 R14: 00007ffc4f4d1f80 R15: ffffffffa1577ee0
>[  387.027974] FS:  00007f91a0f9c580(0000) GS:ffff888b3f7c0000(0000) knlGS:0000000000000000
>[  387.028699] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>[  387.029223] CR2: 0000000000449b00 CR3: 000000007a7fc000 CR4: 00000000000006e0
>[  387.029864] Call Trace:
>[  387.030108]  kpageflags_read+0xcc/0x160
>[  387.030473]  proc_reg_read+0x53/0x80
>[  387.030809]  vfs_read+0x9d/0x150
>[  387.031114]  ksys_pread64+0x65/0xa0
>[  387.031449]  do_syscall_64+0x4d/0x90
>[  387.031783]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
>
>-- 
>Thanks,
>
>David / dhildenb
David Hildenbrand June 24, 2020, 8:41 a.m. UTC | #17
On 24.06.20 10:13, Wei Yang wrote:
> On Wed, Jun 24, 2020 at 09:48:43AM +0200, David Hildenbrand wrote:
>> On 23.06.20 17:18, Michal Hocko wrote:
>>> On Tue 23-06-20 17:42:58, Wei Yang wrote:
>>>> For early sections, we assumes its memmap will never be partially
>>>> removed. But current behavior breaks this.
>>>>
>>>> Let's correct it.
>>>>
>>>> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
>>>> Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
>>>
>>> Can a user trigger this or is this a theoretical bug?
>>
>> I tried to reproduce it, but somehow I get unexpected behavior.
>> With a hacked QEMU I can get
> 
> David,
> 
> Thanks for your effort. Would you mind sharing your qemu command line, so that
> I can have a try at my side?

The following change to QEMU:

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 51b3050d01..c6a78d83c0 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1010,7 +1010,7 @@ void pc_memory_init(PCMachineState *pcms,
         }
 
         machine->device_memory->base =
-            ROUND_UP(0x100000000ULL + x86ms->above_4g_mem_size, 1 * GiB);
+            0x100000000ULL + x86ms->above_4g_mem_size;
 
         if (pcmc->enforce_aligned_dimm) {
             /* size device region assuming 1G page max alignment per slot */


Then you can use the following QEMU cmdline:

#! /bin/bash
sudo x86_64-softmmu/qemu-system-x86_64 \
    --enable-kvm \
    -m 4160M,maxmem=20G,slots=1 \
    -smp sockets=2,cores=16 \
    -numa node,nodeid=0,cpus=0-1 -numa node,nodeid=1,cpus=2-3 \
    -machine pc,nvdimm \
    -nographic \
    -hda /home/dhildenb/git/Fedora-Cloud-Base-31-1.9.x86_64.qcow2 \
    -nodefaults \
    -net nic -net user \
    -chardev stdio,nosignal,id=serial \
    -device isa-serial,chardev=serial \
    -chardev socket,id=monitor,path=/var/tmp/monitor,server,nowait \
    -mon chardev=monitor,mode=readline \
    -object memory-backend-ram,id=mem0,size=8G \
    -device nvdimm,id=vm0,memdev=mem0,node=0,addr=0x144000000,label-size=128k


> 
>>
>> $ cat /proc/iomem
>> [...]
>> 100000000-143ffffff : System RAM
>> 144000000-343dfffff : Persistent Memory
>>  144000000-1441fffff : namespace0.0
>>  144200000-144ffffff : dax0.0
>>
>> After
>> $ ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax --size=16M
>>
>> I get
>>
>> $ cat /proc/iomem
>> [...]
>> 100000000-143ffffff : System RAM
>> 144000000-343dfffff : Persistent Memory
>>  144000000-1441fffff : namespace0.0
>>  144200000-144ffffff : dax0.0
>>
> 
> The memory layout seems not changed.

Yeah, sorry, int he first example, dax0.0 should be missing (on a fresh
QEMU instance instead of after a reboot).

> 
>> I can trigger remove+re-add via 
>> $ ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax --size=16M
> 
> Do we need to change the mode to force the reconfig?

No, that's sufficient.

> 
>>
>> So we clearly have an overlap between System RAM and dax0.0 within a section.
>> However, I never get early_section() to trigger in
>> section_activate()/section_deactivate() ? That's unexpected
>>
>>
>> Definitely something seems to go wrong after the first "ndctl create-namespace".
>> Using a random page walker:
>>
>> [root@localhost ~]# ./page-types 
> 
> What is this page-types?
> 
>> [  387.019229] general protection fault, probably for non-canonical address 0xfdfdfdfdfdfdfdfc: 0000 [#1] SMP NOPTI
>> [  387.020172] CPU: 17 PID: 1314 Comm: page-types Kdump: loaded Tainted: G        W         5.8.0-rc2 #20
>> [  387.021015] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4
>> [  387.022056] RIP: 0010:stable_page_flags+0x27/0x3f0
>> [  387.022519] Code: 00 00 00 66 66 66 66 90 48 85 ff 0f 84 d2 03 00 00 41 54 55 48 89 fd 53 48 8b 57 08 48 8b 1f 48f
>> [  387.024291] RSP: 0018:ffff9f8781057e58 EFLAGS: 00010202
>> [  387.024775] RAX: fdfdfdfdfdfdfdfc RBX: fdfdfdfdfdfdfdfd RCX: 00007ffc4f4f1f78
>> [  387.025423] RDX: 0000000000000001 RSI: 0000002000000000 RDI: ffffc590c5100000
>> [  387.026052] RBP: ffffc590c5100000 R08: 0000000000000001 R09: 0000000000000000
>> [  387.026696] R10: 0000000000000000 R11: 0000000000000000 R12: 00007ffc4f4f1f80
>> [  387.027324] R13: 0000000000040000 R14: 00007ffc4f4d1f80 R15: ffffffffa1577ee0
>> [  387.027974] FS:  00007f91a0f9c580(0000) GS:ffff888b3f7c0000(0000) knlGS:0000000000000000
>> [  387.028699] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [  387.029223] CR2: 0000000000449b00 CR3: 000000007a7fc000 CR4: 00000000000006e0
>> [  387.029864] Call Trace:
>> [  387.030108]  kpageflags_read+0xcc/0x160
>> [  387.030473]  proc_reg_read+0x53/0x80
>> [  387.030809]  vfs_read+0x9d/0x150
>> [  387.031114]  ksys_pread64+0x65/0xa0
>> [  387.031449]  do_syscall_64+0x4d/0x90
>> [  387.031783]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

It's a tool located in Linux tools/vm/page-types.


So with two printk's, I can spot:

[   24.185558] section_deactivate(): depopulate_section_memmap
[   24.202689] pmem0: detected capacity change from 0 to 16777216
[   24.229747] section_activate(): early_section, not populating memmap
[   24.229777] memmap_init_zone_device initialised 3584 pages in 0ms

But nothing actually breaks .... because *drummroll* we use huge pages in the vmemmap,
so the partial depopulate will not actually depopulate anything here. Huge page is 2M,
the memmap of 128MB sections is exactly 2MB == one hugepages. Trying to depopulate a
fraction (e.g., 16MB) of that won't do anything.


Now, forcing a CPU without hugepages - PSE (QEMU: "-cpu host,pse=off)", I can trigger
via  ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax --size=16M

[   18.044973] pmem0: detected capacity change from 0 to 16777216
[   18.073813] BUG: unable to handle page fault for address: ffffec73c51000b4
[   18.076236] #PF: supervisor write access in kernel mode
[   18.077211] #PF: error_code(0x0002) - not-present page
[   18.077943] PGD 81ff8067 P4D 81ff8067 PUD 81ff7067 PMD 1437cb067 PTE 0
[   18.078551] Oops: 0002 [#1] SMP NOPTI
[   18.078915] CPU: 16 PID: 1348 Comm: ndctl Kdump: loaded Tainted: G        W         5.8.0-rc2+ #24
[   18.079718] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4
[   18.080750] RIP: 0010:memmap_init_zone+0x154/0x1c2
[   18.081213] Code: 77 16 f6 40 10 02 74 10 48 03 48 08 48 89 cb 48 c1 eb 0c e9 3a ff ff ff 48 89 df 48 c1 e7 06 48f
[   18.082902] RSP: 0018:ffffbdc7011a39b0 EFLAGS: 00010282
[   18.083395] RAX: ffffec73c5100088 RBX: 0000000000144002 RCX: 0000000000144000
[   18.084057] RDX: 0000000000000004 RSI: 007ffe0000000000 RDI: ffffec73c5100080
[   18.084704] RBP: 027ffe0000000000 R08: 0000000000000001 R09: ffff9f8d38f6d708
[   18.085369] R10: ffffec73c0000000 R11: 0000000000000000 R12: 0000000000000004
[   18.086020] R13: 0000000000000001 R14: 0000000000144200 R15: 0000000000000000
[   18.086670] FS:  00007efe6b65d780(0000) GS:ffff9f8d3f780000(0000) knlGS:0000000000000000
[   18.087417] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   18.087965] CR2: ffffec73c51000b4 CR3: 000000007d718000 CR4: 0000000000340ee0
[   18.088623] Call Trace:
[   18.088884]  move_pfn_range_to_zone+0x128/0x150
[   18.089313]  memremap_pages+0x4e4/0x5a0
[   18.089681]  devm_memremap_pages+0x1e/0x60
[   18.090081]  dev_dax_probe+0x69/0x160 [device_dax]
[   18.090533]  really_probe+0x298/0x3c0
[   18.090896]  driver_probe_device+0xe1/0x150
[   18.091303]  ? driver_allows_async_probing+0x50/0x50
[   18.091780]  bus_for_each_drv+0x7e/0xc0
[   18.092154]  __device_attach+0xdf/0x160
[   18.092527]  bus_probe_device+0x8e/0xa0
[   18.092916]  device_add+0x3b9/0x740
[   18.093258]  __devm_create_dev_dax+0x127/0x1c0
[   18.093686]  __dax_pmem_probe+0x1f2/0x219 [dax_pmem_core]
[   18.094200]  dax_pmem_probe+0xc/0x1b [dax_pmem]
[   18.094632]  nvdimm_bus_probe+0x69/0x1c0 [libnvdimm]
[   18.095109]  really_probe+0x147/0x3c0
[   18.095470]  driver_probe_device+0xe1/0x150
[   18.095883]  device_driver_attach+0x53/0x60
[   18.096289]  bind_store+0xd1/0x110
[   18.096627]  kernfs_fop_write+0xce/0x1b0
[   18.097017]  vfs_write+0xb6/0x1a0
[   18.097350]  ksys_write+0x5f/0xe0
[   18.097681]  do_syscall_64+0x4d/0x90
[   18.098041]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Michal Hocko June 24, 2020, 8:50 a.m. UTC | #18
On Wed 24-06-20 10:41:18, David Hildenbrand wrote:
[...]
> But nothing actually breaks .... because *drummroll* we use huge pages in the vmemmap,
> so the partial depopulate will not actually depopulate anything here. Huge page is 2M,
> the memmap of 128MB sections is exactly 2MB == one hugepages. Trying to depopulate a
> fraction (e.g., 16MB) of that won't do anything.
> 
> 
> Now, forcing a CPU without hugepages - PSE (QEMU: "-cpu host,pse=off)", I can trigger
> via  ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax --size=16M

OK, now I am following! Thanks a lot. That is something to be mentioned
in the changelog. Small pages might be used if the hotplug happens on a
system with fragmented memory so this is something we do care about.

Thanks a lot David!
David Hildenbrand June 24, 2020, 8:51 a.m. UTC | #19
On 24.06.20 05:56, Wei Yang wrote:
> On Wed, Jun 24, 2020 at 11:52:36AM +0800, Baoquan He wrote:
>> On 06/24/20 at 11:46am, Wei Yang wrote:
>>> On Wed, Jun 24, 2020 at 09:47:37AM +0800, Baoquan He wrote:
>>>> On 06/23/20 at 05:21pm, Dan Williams wrote:
>>>>> On Tue, Jun 23, 2020 at 2:43 AM Wei Yang
>>>>> <richard.weiyang@linux.alibaba.com> wrote:
>>>>>>
>>>>>> For early sections, we assumes its memmap will never be partially
>>>>>> removed. But current behavior breaks this.
>>>>>
>>>>> Where do we assume that?
>>>>>
>>>>> The primary use case for this was mapping pmem that collides with
>>>>> System-RAM in the same 128MB section. That collision will certainly be
>>>>> depopulated on-demand depending on the state of the pmem device. So,
>>>>> I'm not understanding the problem or the benefit of this change.
>>>>
>>>> I was also confused when review this patch, the patch log is a little
>>>> short and simple. From the current code, with SPARSE_VMEMMAP enabled, we
>>>> do build memmap for the whole memory section during boot, even though
>>>> some of them may be partially populated. We just mark the subsection map
>>>> for present pages. 
>>>>
>>>> Later, if pmem device is mapped into the partially boot memory section,
>>>> we just fill the relevant subsection map, do return directly, w/o building
>>>> the memmap for it, in section_activate(). Because the memmap for the
>>>> unpresent RAM part have been there. I guess this is what Wei is trying to 
>>>> do to keep the behaviour be consistent for pmem device adding, or
>>>> pmem device removing and later adding again.
>>>>
>>>> Please correct me if I am wrong.
>>>
>>> You are right here.
>>>
>>>>
>>>> To me, fixing it looks good. But a clear doc or code comment is
>>>> necessary so that people can understand the code with less time.
>>>> Leaving it as is doesn't cause harm. I personally tend to choose
>>>> the former.
>>>>
>>>
>>> The former is to add a clear doc?
>>
>> Sorry for the confusion. The former means the fix in your patch. Maybe a
>> improved log and some code comment adding can make it more perfect.
>>
> 
> Sure, I would try to add more log and comments, in case you have some good
> suggestion, just let me know :)
> 

We have documented this is section_activate() and pfn_valid()
sufficiently. Maybe add a pointer like

/*
 * The memmap of early sections is always fully populated. See
 * section_activate() and pfn_valid() .
 */
Dan Williams June 24, 2020, 4:10 p.m. UTC | #20
On Tue, Jun 23, 2020 at 11:14 PM Wei Yang
<richard.weiyang@linux.alibaba.com> wrote:
>
> On Tue, Jun 23, 2020 at 05:18:28PM +0200, Michal Hocko wrote:
> >On Tue 23-06-20 17:42:58, Wei Yang wrote:
> >> For early sections, we assumes its memmap will never be partially
> >> removed. But current behavior breaks this.
> >>
> >> Let's correct it.
> >>
> >> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
> >> Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
> >
> >Can a user trigger this or is this a theoretical bug?
>
> Let me rewrite the changelog a little. Look forward any comments.
>
>    For early sections, its memmap is handled specially even sub-section is
>    enabled. The memmap could only be populated as a whole.
>
>    Quoted from the comment of section_activate():
>
>        * The early init code does not consider partially populated
>        * initial sections, it simply assumes that memory will never be
>        * referenced.  If we hot-add memory into such a section then we
>        * do not need to populate the memmap and can simply reuse what
>        * is already there.
>
>    While current section_deactivate() breaks this rule. When hot-remove a
>    sub-section, section_deactivate() would depopulate its memmap. The
>    consequence is if we hot-add this subsection again, its memmap never get
>    proper populated.

Ok, forgive the latency as re-fetched this logic into my mental cache.
So what I was remembering was the initial state of the code that
special cased early sections, and that still seems to be the case in
pfn_valid(). IIRC early_sections / bootmem are blocked from being
removed entirely. Partial / subsection removals are ok.
Wei Yang June 24, 2020, 10:05 p.m. UTC | #21
On Wed, Jun 24, 2020 at 09:10:09AM -0700, Dan Williams wrote:
>On Tue, Jun 23, 2020 at 11:14 PM Wei Yang
><richard.weiyang@linux.alibaba.com> wrote:
>>
>> On Tue, Jun 23, 2020 at 05:18:28PM +0200, Michal Hocko wrote:
>> >On Tue 23-06-20 17:42:58, Wei Yang wrote:
>> >> For early sections, we assumes its memmap will never be partially
>> >> removed. But current behavior breaks this.
>> >>
>> >> Let's correct it.
>> >>
>> >> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
>> >> Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
>> >
>> >Can a user trigger this or is this a theoretical bug?
>>
>> Let me rewrite the changelog a little. Look forward any comments.
>>
>>    For early sections, its memmap is handled specially even sub-section is
>>    enabled. The memmap could only be populated as a whole.
>>
>>    Quoted from the comment of section_activate():
>>
>>        * The early init code does not consider partially populated
>>        * initial sections, it simply assumes that memory will never be
>>        * referenced.  If we hot-add memory into such a section then we
>>        * do not need to populate the memmap and can simply reuse what
>>        * is already there.
>>
>>    While current section_deactivate() breaks this rule. When hot-remove a
>>    sub-section, section_deactivate() would depopulate its memmap. The
>>    consequence is if we hot-add this subsection again, its memmap never get
>>    proper populated.
>
>Ok, forgive the latency as re-fetched this logic into my mental cache.
>So what I was remembering was the initial state of the code that
>special cased early sections, and that still seems to be the case in
>pfn_valid(). IIRC early_sections / bootmem are blocked from being
>removed entirely. Partial / subsection removals are ok.

Would you mind giving more words? Partial subsection removal is ok, so no need
to fix this?
Wei Yang June 24, 2020, 10:08 p.m. UTC | #22
On Wed, Jun 24, 2020 at 10:51:08AM +0200, David Hildenbrand wrote:
>On 24.06.20 05:56, Wei Yang wrote:
>> On Wed, Jun 24, 2020 at 11:52:36AM +0800, Baoquan He wrote:
>>> On 06/24/20 at 11:46am, Wei Yang wrote:
>>>> On Wed, Jun 24, 2020 at 09:47:37AM +0800, Baoquan He wrote:
>>>>> On 06/23/20 at 05:21pm, Dan Williams wrote:
>>>>>> On Tue, Jun 23, 2020 at 2:43 AM Wei Yang
>>>>>> <richard.weiyang@linux.alibaba.com> wrote:
>>>>>>>
>>>>>>> For early sections, we assumes its memmap will never be partially
>>>>>>> removed. But current behavior breaks this.
>>>>>>
>>>>>> Where do we assume that?
>>>>>>
>>>>>> The primary use case for this was mapping pmem that collides with
>>>>>> System-RAM in the same 128MB section. That collision will certainly be
>>>>>> depopulated on-demand depending on the state of the pmem device. So,
>>>>>> I'm not understanding the problem or the benefit of this change.
>>>>>
>>>>> I was also confused when review this patch, the patch log is a little
>>>>> short and simple. From the current code, with SPARSE_VMEMMAP enabled, we
>>>>> do build memmap for the whole memory section during boot, even though
>>>>> some of them may be partially populated. We just mark the subsection map
>>>>> for present pages. 
>>>>>
>>>>> Later, if pmem device is mapped into the partially boot memory section,
>>>>> we just fill the relevant subsection map, do return directly, w/o building
>>>>> the memmap for it, in section_activate(). Because the memmap for the
>>>>> unpresent RAM part have been there. I guess this is what Wei is trying to 
>>>>> do to keep the behaviour be consistent for pmem device adding, or
>>>>> pmem device removing and later adding again.
>>>>>
>>>>> Please correct me if I am wrong.
>>>>
>>>> You are right here.
>>>>
>>>>>
>>>>> To me, fixing it looks good. But a clear doc or code comment is
>>>>> necessary so that people can understand the code with less time.
>>>>> Leaving it as is doesn't cause harm. I personally tend to choose
>>>>> the former.
>>>>>
>>>>
>>>> The former is to add a clear doc?
>>>
>>> Sorry for the confusion. The former means the fix in your patch. Maybe a
>>> improved log and some code comment adding can make it more perfect.
>>>
>> 
>> Sure, I would try to add more log and comments, in case you have some good
>> suggestion, just let me know :)
>> 
>
>We have documented this is section_activate() and pfn_valid()
>sufficiently. Maybe add a pointer like
>
>/*
> * The memmap of early sections is always fully populated. See
> * section_activate() and pfn_valid() .
> */

Thanks, I have added this above the "if" check.

>
>-- 
>Thanks,
>
>David / dhildenb
Dan Williams June 24, 2020, 10:20 p.m. UTC | #23
On Wed, Jun 24, 2020 at 3:06 PM Wei Yang
<richard.weiyang@linux.alibaba.com> wrote:
>
> On Wed, Jun 24, 2020 at 09:10:09AM -0700, Dan Williams wrote:
> >On Tue, Jun 23, 2020 at 11:14 PM Wei Yang
> ><richard.weiyang@linux.alibaba.com> wrote:
> >>
> >> On Tue, Jun 23, 2020 at 05:18:28PM +0200, Michal Hocko wrote:
> >> >On Tue 23-06-20 17:42:58, Wei Yang wrote:
> >> >> For early sections, we assumes its memmap will never be partially
> >> >> removed. But current behavior breaks this.
> >> >>
> >> >> Let's correct it.
> >> >>
> >> >> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
> >> >> Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
> >> >
> >> >Can a user trigger this or is this a theoretical bug?
> >>
> >> Let me rewrite the changelog a little. Look forward any comments.
> >>
> >>    For early sections, its memmap is handled specially even sub-section is
> >>    enabled. The memmap could only be populated as a whole.
> >>
> >>    Quoted from the comment of section_activate():
> >>
> >>        * The early init code does not consider partially populated
> >>        * initial sections, it simply assumes that memory will never be
> >>        * referenced.  If we hot-add memory into such a section then we
> >>        * do not need to populate the memmap and can simply reuse what
> >>        * is already there.
> >>
> >>    While current section_deactivate() breaks this rule. When hot-remove a
> >>    sub-section, section_deactivate() would depopulate its memmap. The
> >>    consequence is if we hot-add this subsection again, its memmap never get
> >>    proper populated.
> >
> >Ok, forgive the latency as re-fetched this logic into my mental cache.
> >So what I was remembering was the initial state of the code that
> >special cased early sections, and that still seems to be the case in
> >pfn_valid(). IIRC early_sections / bootmem are blocked from being
> >removed entirely. Partial / subsection removals are ok.
>
> Would you mind giving more words? Partial subsection removal is ok, so no need
> to fix this?

Early sections establish a memmap for the full section. There's
conceptually nothing wrong with unplugging the non-system-RAM portion
of the memmap, but it would need to be careful, at least on x86, to
map the partial section with PTEs instead of PMDs.

So, you are right that there is a mismatch here, but I think the
comprehensive fix is to allow early sections to be partially
depopulated/repopulated rather than have section_activate() and
section_deacticate() special case early sections. The special casing
is problematic in retrospect as section_deactivate() can't be
maintained without understand special rules in section_activate().
Wei Yang June 24, 2020, 10:27 p.m. UTC | #24
On Wed, Jun 24, 2020 at 10:41:18AM +0200, David Hildenbrand wrote:
>On 24.06.20 10:13, Wei Yang wrote:
>> On Wed, Jun 24, 2020 at 09:48:43AM +0200, David Hildenbrand wrote:
>>> On 23.06.20 17:18, Michal Hocko wrote:
>>>> On Tue 23-06-20 17:42:58, Wei Yang wrote:
>>>>> For early sections, we assumes its memmap will never be partially
>>>>> removed. But current behavior breaks this.
>>>>>
>>>>> Let's correct it.
>>>>>
>>>>> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
>>>>> Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
>>>>
>>>> Can a user trigger this or is this a theoretical bug?
>>>
>>> I tried to reproduce it, but somehow I get unexpected behavior.
>>> With a hacked QEMU I can get
>> 
>> David,
>> 
>> Thanks for your effort. Would you mind sharing your qemu command line, so that
>> I can have a try at my side?
>
>The following change to QEMU:
>
>diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>index 51b3050d01..c6a78d83c0 100644
>--- a/hw/i386/pc.c
>+++ b/hw/i386/pc.c
>@@ -1010,7 +1010,7 @@ void pc_memory_init(PCMachineState *pcms,
>         }
> 
>         machine->device_memory->base =
>-            ROUND_UP(0x100000000ULL + x86ms->above_4g_mem_size, 1 * GiB);
>+            0x100000000ULL + x86ms->above_4g_mem_size;
> 
>         if (pcmc->enforce_aligned_dimm) {
>             /* size device region assuming 1G page max alignment per slot */
>
>
>Then you can use the following QEMU cmdline:
>
>#! /bin/bash
>sudo x86_64-softmmu/qemu-system-x86_64 \
>    --enable-kvm \
>    -m 4160M,maxmem=20G,slots=1 \
>    -smp sockets=2,cores=16 \
>    -numa node,nodeid=0,cpus=0-1 -numa node,nodeid=1,cpus=2-3 \
>    -machine pc,nvdimm \
>    -nographic \
>    -hda /home/dhildenb/git/Fedora-Cloud-Base-31-1.9.x86_64.qcow2 \
>    -nodefaults \
>    -net nic -net user \
>    -chardev stdio,nosignal,id=serial \
>    -device isa-serial,chardev=serial \
>    -chardev socket,id=monitor,path=/var/tmp/monitor,server,nowait \
>    -mon chardev=monitor,mode=readline \
>    -object memory-backend-ram,id=mem0,size=8G \
>    -device nvdimm,id=vm0,memdev=mem0,node=0,addr=0x144000000,label-size=128k
>
>
>> 
>>>
>>> $ cat /proc/iomem
>>> [...]
>>> 100000000-143ffffff : System RAM
>>> 144000000-343dfffff : Persistent Memory
>>>  144000000-1441fffff : namespace0.0
>>>  144200000-144ffffff : dax0.0
>>>
>>> After
>>> $ ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax --size=16M
>>>
>>> I get
>>>
>>> $ cat /proc/iomem
>>> [...]
>>> 100000000-143ffffff : System RAM
>>> 144000000-343dfffff : Persistent Memory
>>>  144000000-1441fffff : namespace0.0
>>>  144200000-144ffffff : dax0.0
>>>
>> 
>> The memory layout seems not changed.
>
>Yeah, sorry, int he first example, dax0.0 should be missing (on a fresh
>QEMU instance instead of after a reboot).
>
>> 
>>> I can trigger remove+re-add via 
>>> $ ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax --size=16M
>> 
>> Do we need to change the mode to force the reconfig?
>
>No, that's sufficient.
>
>> 
>>>
>>> So we clearly have an overlap between System RAM and dax0.0 within a section.
>>> However, I never get early_section() to trigger in
>>> section_activate()/section_deactivate() ? That's unexpected
>>>
>>>
>>> Definitely something seems to go wrong after the first "ndctl create-namespace".
>>> Using a random page walker:
>>>
>>> [root@localhost ~]# ./page-types 
>> 
>> What is this page-types?
>> 
>>> [  387.019229] general protection fault, probably for non-canonical address 0xfdfdfdfdfdfdfdfc: 0000 [#1] SMP NOPTI
>>> [  387.020172] CPU: 17 PID: 1314 Comm: page-types Kdump: loaded Tainted: G        W         5.8.0-rc2 #20
>>> [  387.021015] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4
>>> [  387.022056] RIP: 0010:stable_page_flags+0x27/0x3f0
>>> [  387.022519] Code: 00 00 00 66 66 66 66 90 48 85 ff 0f 84 d2 03 00 00 41 54 55 48 89 fd 53 48 8b 57 08 48 8b 1f 48f
>>> [  387.024291] RSP: 0018:ffff9f8781057e58 EFLAGS: 00010202
>>> [  387.024775] RAX: fdfdfdfdfdfdfdfc RBX: fdfdfdfdfdfdfdfd RCX: 00007ffc4f4f1f78
>>> [  387.025423] RDX: 0000000000000001 RSI: 0000002000000000 RDI: ffffc590c5100000
>>> [  387.026052] RBP: ffffc590c5100000 R08: 0000000000000001 R09: 0000000000000000
>>> [  387.026696] R10: 0000000000000000 R11: 0000000000000000 R12: 00007ffc4f4f1f80
>>> [  387.027324] R13: 0000000000040000 R14: 00007ffc4f4d1f80 R15: ffffffffa1577ee0
>>> [  387.027974] FS:  00007f91a0f9c580(0000) GS:ffff888b3f7c0000(0000) knlGS:0000000000000000
>>> [  387.028699] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [  387.029223] CR2: 0000000000449b00 CR3: 000000007a7fc000 CR4: 00000000000006e0
>>> [  387.029864] Call Trace:
>>> [  387.030108]  kpageflags_read+0xcc/0x160
>>> [  387.030473]  proc_reg_read+0x53/0x80
>>> [  387.030809]  vfs_read+0x9d/0x150
>>> [  387.031114]  ksys_pread64+0x65/0xa0
>>> [  387.031449]  do_syscall_64+0x4d/0x90
>>> [  387.031783]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
>It's a tool located in Linux tools/vm/page-types.
>
>
>So with two printk's, I can spot:
>
>[   24.185558] section_deactivate(): depopulate_section_memmap
>[   24.202689] pmem0: detected capacity change from 0 to 16777216
>[   24.229747] section_activate(): early_section, not populating memmap
>[   24.229777] memmap_init_zone_device initialised 3584 pages in 0ms
>
>But nothing actually breaks .... because *drummroll* we use huge pages in the vmemmap,
>so the partial depopulate will not actually depopulate anything here. Huge page is 2M,
>the memmap of 128MB sections is exactly 2MB == one hugepages. Trying to depopulate a
>fraction (e.g., 16MB) of that won't do anything.
>

Thanks David. I miss some knowledge here about the vmemmap and 2MB. Would you
minding pointing me the location which affects this behavior? I don't see how
depopulate_section_memmap would be affect by this.

>
>Now, forcing a CPU without hugepages - PSE (QEMU: "-cpu host,pse=off)", I can trigger
>via  ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax --size=16M
>
>[   18.044973] pmem0: detected capacity change from 0 to 16777216
>[   18.073813] BUG: unable to handle page fault for address: ffffec73c51000b4
>[   18.076236] #PF: supervisor write access in kernel mode
>[   18.077211] #PF: error_code(0x0002) - not-present page
>[   18.077943] PGD 81ff8067 P4D 81ff8067 PUD 81ff7067 PMD 1437cb067 PTE 0
>[   18.078551] Oops: 0002 [#1] SMP NOPTI
>[   18.078915] CPU: 16 PID: 1348 Comm: ndctl Kdump: loaded Tainted: G        W         5.8.0-rc2+ #24
>[   18.079718] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.4
>[   18.080750] RIP: 0010:memmap_init_zone+0x154/0x1c2
>[   18.081213] Code: 77 16 f6 40 10 02 74 10 48 03 48 08 48 89 cb 48 c1 eb 0c e9 3a ff ff ff 48 89 df 48 c1 e7 06 48f
>[   18.082902] RSP: 0018:ffffbdc7011a39b0 EFLAGS: 00010282
>[   18.083395] RAX: ffffec73c5100088 RBX: 0000000000144002 RCX: 0000000000144000
>[   18.084057] RDX: 0000000000000004 RSI: 007ffe0000000000 RDI: ffffec73c5100080
>[   18.084704] RBP: 027ffe0000000000 R08: 0000000000000001 R09: ffff9f8d38f6d708
>[   18.085369] R10: ffffec73c0000000 R11: 0000000000000000 R12: 0000000000000004
>[   18.086020] R13: 0000000000000001 R14: 0000000000144200 R15: 0000000000000000
>[   18.086670] FS:  00007efe6b65d780(0000) GS:ffff9f8d3f780000(0000) knlGS:0000000000000000
>[   18.087417] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>[   18.087965] CR2: ffffec73c51000b4 CR3: 000000007d718000 CR4: 0000000000340ee0
>[   18.088623] Call Trace:
>[   18.088884]  move_pfn_range_to_zone+0x128/0x150
>[   18.089313]  memremap_pages+0x4e4/0x5a0
>[   18.089681]  devm_memremap_pages+0x1e/0x60
>[   18.090081]  dev_dax_probe+0x69/0x160 [device_dax]
>[   18.090533]  really_probe+0x298/0x3c0
>[   18.090896]  driver_probe_device+0xe1/0x150
>[   18.091303]  ? driver_allows_async_probing+0x50/0x50
>[   18.091780]  bus_for_each_drv+0x7e/0xc0
>[   18.092154]  __device_attach+0xdf/0x160
>[   18.092527]  bus_probe_device+0x8e/0xa0
>[   18.092916]  device_add+0x3b9/0x740
>[   18.093258]  __devm_create_dev_dax+0x127/0x1c0
>[   18.093686]  __dax_pmem_probe+0x1f2/0x219 [dax_pmem_core]
>[   18.094200]  dax_pmem_probe+0xc/0x1b [dax_pmem]
>[   18.094632]  nvdimm_bus_probe+0x69/0x1c0 [libnvdimm]
>[   18.095109]  really_probe+0x147/0x3c0
>[   18.095470]  driver_probe_device+0xe1/0x150
>[   18.095883]  device_driver_attach+0x53/0x60
>[   18.096289]  bind_store+0xd1/0x110
>[   18.096627]  kernfs_fop_write+0xce/0x1b0
>[   18.097017]  vfs_write+0xb6/0x1a0
>[   18.097350]  ksys_write+0x5f/0xe0
>[   18.097681]  do_syscall_64+0x4d/0x90
>[   18.098041]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>

I would put this into changelog.

>
>-- 
>Thanks,
>
>David / dhildenb
Wei Yang June 24, 2020, 10:44 p.m. UTC | #25
On Wed, Jun 24, 2020 at 03:20:59PM -0700, Dan Williams wrote:
>On Wed, Jun 24, 2020 at 3:06 PM Wei Yang
><richard.weiyang@linux.alibaba.com> wrote:
>>
>> On Wed, Jun 24, 2020 at 09:10:09AM -0700, Dan Williams wrote:
>> >On Tue, Jun 23, 2020 at 11:14 PM Wei Yang
>> ><richard.weiyang@linux.alibaba.com> wrote:
>> >>
>> >> On Tue, Jun 23, 2020 at 05:18:28PM +0200, Michal Hocko wrote:
>> >> >On Tue 23-06-20 17:42:58, Wei Yang wrote:
>> >> >> For early sections, we assumes its memmap will never be partially
>> >> >> removed. But current behavior breaks this.
>> >> >>
>> >> >> Let's correct it.
>> >> >>
>> >> >> Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug")
>> >> >> Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
>> >> >
>> >> >Can a user trigger this or is this a theoretical bug?
>> >>
>> >> Let me rewrite the changelog a little. Look forward any comments.
>> >>
>> >>    For early sections, its memmap is handled specially even sub-section is
>> >>    enabled. The memmap could only be populated as a whole.
>> >>
>> >>    Quoted from the comment of section_activate():
>> >>
>> >>        * The early init code does not consider partially populated
>> >>        * initial sections, it simply assumes that memory will never be
>> >>        * referenced.  If we hot-add memory into such a section then we
>> >>        * do not need to populate the memmap and can simply reuse what
>> >>        * is already there.
>> >>
>> >>    While current section_deactivate() breaks this rule. When hot-remove a
>> >>    sub-section, section_deactivate() would depopulate its memmap. The
>> >>    consequence is if we hot-add this subsection again, its memmap never get
>> >>    proper populated.
>> >
>> >Ok, forgive the latency as re-fetched this logic into my mental cache.
>> >So what I was remembering was the initial state of the code that
>> >special cased early sections, and that still seems to be the case in
>> >pfn_valid(). IIRC early_sections / bootmem are blocked from being
>> >removed entirely. Partial / subsection removals are ok.
>>
>> Would you mind giving more words? Partial subsection removal is ok, so no need
>> to fix this?
>
>Early sections establish a memmap for the full section. There's
>conceptually nothing wrong with unplugging the non-system-RAM portion
>of the memmap, but it would need to be careful, at least on x86, to
>map the partial section with PTEs instead of PMDs.
>
>So, you are right that there is a mismatch here, but I think the
>comprehensive fix is to allow early sections to be partially
>depopulated/repopulated rather than have section_activate() and
>section_deacticate() special case early sections. The special casing
>is problematic in retrospect as section_deactivate() can't be
>maintained without understand special rules in section_activate().

Hmm... This means we need to adjust pfn_valid() too, which always return true
for early sections.
Dan Williams June 24, 2020, 11:47 p.m. UTC | #26
On Wed, Jun 24, 2020 at 3:44 PM Wei Yang
<richard.weiyang@linux.alibaba.com> wrote:
[..]
> >So, you are right that there is a mismatch here, but I think the
> >comprehensive fix is to allow early sections to be partially
> >depopulated/repopulated rather than have section_activate() and
> >section_deacticate() special case early sections. The special casing
> >is problematic in retrospect as section_deactivate() can't be
> >maintained without understand special rules in section_activate().
>
> Hmm... This means we need to adjust pfn_valid() too, which always return true
> for early sections.

Right, rather than carry workarounds in 3 locations, and the bug that
has resulted from then getting out of sync, just teach early section
mapping to allow for the subsection populate/depopulate.
David Hildenbrand June 25, 2020, 5:53 a.m. UTC | #27
> Am 25.06.2020 um 01:47 schrieb Dan Williams <dan.j.williams@intel.com>:
> 
> On Wed, Jun 24, 2020 at 3:44 PM Wei Yang
> <richard.weiyang@linux.alibaba.com> wrote:
> [..]
>>> So, you are right that there is a mismatch here, but I think the
>>> comprehensive fix is to allow early sections to be partially
>>> depopulated/repopulated rather than have section_activate() and
>>> section_deacticate() special case early sections. The special casing
>>> is problematic in retrospect as section_deactivate() can't be
>>> maintained without understand special rules in section_activate().
>> 
>> Hmm... This means we need to adjust pfn_valid() too, which always return true
>> for early sections.
> 
> Right, rather than carry workarounds in 3 locations, and the bug that
> has resulted from then getting out of sync, just teach early section
> mapping to allow for the subsection populate/depopulate.
> 

I prefer the easy fix first - IOW what we Here here. Especially, pfn_to_online_page() will need changes as well.

At least my ack stands.
Dan Williams June 25, 2020, 7:46 p.m. UTC | #28
On Wed, Jun 24, 2020 at 10:53 PM David Hildenbrand <david@redhat.com> wrote:
>
>
>
> > Am 25.06.2020 um 01:47 schrieb Dan Williams <dan.j.williams@intel.com>:
> >
> > On Wed, Jun 24, 2020 at 3:44 PM Wei Yang
> > <richard.weiyang@linux.alibaba.com> wrote:
> > [..]
> >>> So, you are right that there is a mismatch here, but I think the
> >>> comprehensive fix is to allow early sections to be partially
> >>> depopulated/repopulated rather than have section_activate() and
> >>> section_deacticate() special case early sections. The special casing
> >>> is problematic in retrospect as section_deactivate() can't be
> >>> maintained without understand special rules in section_activate().
> >>
> >> Hmm... This means we need to adjust pfn_valid() too, which always return true
> >> for early sections.
> >
> > Right, rather than carry workarounds in 3 locations, and the bug that
> > has resulted from then getting out of sync, just teach early section
> > mapping to allow for the subsection populate/depopulate.
> >
>
> I prefer the easy fix first - IOW what we Here here. Especially, pfn_to_online_page() will need changes as well.

Agree, yes, let's do the simple fix first for 5.8 and the special-case
elimination work later.
Wei Yang June 25, 2020, 10:29 p.m. UTC | #29
On Thu, Jun 25, 2020 at 12:46:43PM -0700, Dan Williams wrote:
>On Wed, Jun 24, 2020 at 10:53 PM David Hildenbrand <david@redhat.com> wrote:
>>
>>
>>
>> > Am 25.06.2020 um 01:47 schrieb Dan Williams <dan.j.williams@intel.com>:
>> >
>> > On Wed, Jun 24, 2020 at 3:44 PM Wei Yang
>> > <richard.weiyang@linux.alibaba.com> wrote:
>> > [..]
>> >>> So, you are right that there is a mismatch here, but I think the
>> >>> comprehensive fix is to allow early sections to be partially
>> >>> depopulated/repopulated rather than have section_activate() and
>> >>> section_deacticate() special case early sections. The special casing
>> >>> is problematic in retrospect as section_deactivate() can't be
>> >>> maintained without understand special rules in section_activate().
>> >>
>> >> Hmm... This means we need to adjust pfn_valid() too, which always return true
>> >> for early sections.
>> >
>> > Right, rather than carry workarounds in 3 locations, and the bug that
>> > has resulted from then getting out of sync, just teach early section
>> > mapping to allow for the subsection populate/depopulate.
>> >
>>
>> I prefer the easy fix first - IOW what we Here here. Especially, pfn_to_online_page() will need changes as well.
>
>Agree, yes, let's do the simple fix first for 5.8 and the special-case
>elimination work later.

Ok, let me send v2 with detailed change log and a comment in code first.

Thanks all for your time.
Wei Yang June 25, 2020, 10:39 p.m. UTC | #30
On Thu, Jun 25, 2020 at 07:53:37AM +0200, David Hildenbrand wrote:
>
>
>> Am 25.06.2020 um 01:47 schrieb Dan Williams <dan.j.williams@intel.com>:
>> 
>> On Wed, Jun 24, 2020 at 3:44 PM Wei Yang
>> <richard.weiyang@linux.alibaba.com> wrote:
>> [..]
>>>> So, you are right that there is a mismatch here, but I think the
>>>> comprehensive fix is to allow early sections to be partially
>>>> depopulated/repopulated rather than have section_activate() and
>>>> section_deacticate() special case early sections. The special casing
>>>> is problematic in retrospect as section_deactivate() can't be
>>>> maintained without understand special rules in section_activate().
>>> 
>>> Hmm... This means we need to adjust pfn_valid() too, which always return true
>>> for early sections.
>> 
>> Right, rather than carry workarounds in 3 locations, and the bug that
>> has resulted from then getting out of sync, just teach early section
>> mapping to allow for the subsection populate/depopulate.
>> 
>
>I prefer the easy fix first - IOW what we Here here. Especially, pfn_to_online_page() will need changes as well.
>

Hi, David,

Which part of pfn_to_online_page() needs to be changed? pfn_valid_within()
would call pfn_valid() to check the pfn first. This looks enough for me.

I may not follow you at this part.

>At least my ack stands.
David Hildenbrand June 26, 2020, 4:59 a.m. UTC | #31
> Am 26.06.2020 um 00:40 schrieb Wei Yang <richard.weiyang@linux.alibaba.com>:
> 
> On Thu, Jun 25, 2020 at 07:53:37AM +0200, David Hildenbrand wrote:
>> 
>> 
>>>> Am 25.06.2020 um 01:47 schrieb Dan Williams <dan.j.williams@intel.com>:
>>> 
>>> On Wed, Jun 24, 2020 at 3:44 PM Wei Yang
>>> <richard.weiyang@linux.alibaba.com> wrote:
>>> [..]
>>>>> So, you are right that there is a mismatch here, but I think the
>>>>> comprehensive fix is to allow early sections to be partially
>>>>> depopulated/repopulated rather than have section_activate() and
>>>>> section_deacticate() special case early sections. The special casing
>>>>> is problematic in retrospect as section_deactivate() can't be
>>>>> maintained without understand special rules in section_activate().
>>>> 
>>>> Hmm... This means we need to adjust pfn_valid() too, which always return true
>>>> for early sections.
>>> 
>>> Right, rather than carry workarounds in 3 locations, and the bug that
>>> has resulted from then getting out of sync, just teach early section
>>> mapping to allow for the subsection populate/depopulate.
>>> 
>> 
>> I prefer the easy fix first - IOW what we Here here. Especially, pfn_to_online_page() will need changes as well.
>> 
> 
> Hi, David,
> 
> Which part of pfn_to_online_page() needs to be changed? pfn_valid_within()
> would call pfn_valid() to check the pfn first. This looks enough for me.

Not for all configurations. For some (e.g., x86 iirc) it‘s just a nop.
Wei Yang June 29, 2020, 8:34 a.m. UTC | #32
On Thu, Jun 25, 2020 at 12:46:43PM -0700, Dan Williams wrote:
>On Wed, Jun 24, 2020 at 10:53 PM David Hildenbrand <david@redhat.com> wrote:
>>
>>
>>
>> > Am 25.06.2020 um 01:47 schrieb Dan Williams <dan.j.williams@intel.com>:
>> >
>> > On Wed, Jun 24, 2020 at 3:44 PM Wei Yang
>> > <richard.weiyang@linux.alibaba.com> wrote:
>> > [..]
>> >>> So, you are right that there is a mismatch here, but I think the
>> >>> comprehensive fix is to allow early sections to be partially
>> >>> depopulated/repopulated rather than have section_activate() and
>> >>> section_deacticate() special case early sections. The special casing
>> >>> is problematic in retrospect as section_deactivate() can't be
>> >>> maintained without understand special rules in section_activate().
>> >>
>> >> Hmm... This means we need to adjust pfn_valid() too, which always return true
>> >> for early sections.
>> >
>> > Right, rather than carry workarounds in 3 locations, and the bug that
>> > has resulted from then getting out of sync, just teach early section
>> > mapping to allow for the subsection populate/depopulate.
>> >
>>
>> I prefer the easy fix first - IOW what we Here here. Especially, pfn_to_online_page() will need changes as well.
>
>Agree, yes, let's do the simple fix first for 5.8 and the special-case
>elimination work later.

Dan,

A quick test shows this is not a simple task.

First, early sections don't set subsection bitmap, which is necessary for the
hot-add/remove.

To properly set subsection bitmap, we need to know how many subsections in
early section. While current code doesn't has a alignment requirement for
last early section. We mark the whole last early section as present.

I don't find a way to enable this.
Dan Williams June 29, 2020, 10:13 p.m. UTC | #33
On Mon, Jun 29, 2020 at 1:34 AM Wei Yang
<richard.weiyang@linux.alibaba.com> wrote:
>
> On Thu, Jun 25, 2020 at 12:46:43PM -0700, Dan Williams wrote:
> >On Wed, Jun 24, 2020 at 10:53 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >>
> >>
> >> > Am 25.06.2020 um 01:47 schrieb Dan Williams <dan.j.williams@intel.com>:
> >> >
> >> > On Wed, Jun 24, 2020 at 3:44 PM Wei Yang
> >> > <richard.weiyang@linux.alibaba.com> wrote:
> >> > [..]
> >> >>> So, you are right that there is a mismatch here, but I think the
> >> >>> comprehensive fix is to allow early sections to be partially
> >> >>> depopulated/repopulated rather than have section_activate() and
> >> >>> section_deacticate() special case early sections. The special casing
> >> >>> is problematic in retrospect as section_deactivate() can't be
> >> >>> maintained without understand special rules in section_activate().
> >> >>
> >> >> Hmm... This means we need to adjust pfn_valid() too, which always return true
> >> >> for early sections.
> >> >
> >> > Right, rather than carry workarounds in 3 locations, and the bug that
> >> > has resulted from then getting out of sync, just teach early section
> >> > mapping to allow for the subsection populate/depopulate.
> >> >
> >>
> >> I prefer the easy fix first - IOW what we Here here. Especially, pfn_to_online_page() will need changes as well.
> >
> >Agree, yes, let's do the simple fix first for 5.8 and the special-case
> >elimination work later.
>
> Dan,
>
> A quick test shows this is not a simple task.

Thanks for taking a look...

> First, early sections don't set subsection bitmap, which is necessary for the
> hot-add/remove.
>
> To properly set subsection bitmap, we need to know how many subsections in
> early section. While current code doesn't has a alignment requirement for
> last early section. We mark the whole last early section as present.

I was thinking that the subsection map does not need to be accurate on
initial setup, it only needs to be accurate after the first removal.
However, that would result in new special casing that somewhat defeats
the purpose. The hardest part is potentially breaking up a PMD mapping
of the page array into a series of PTE mappings without disturbing
in-flight pfn_to_page() users.

> I don't find a way to enable this.

While I don't like that this bug crept into the mismatched special
casing of early sections, I'm now coming around to the same opinion.
I.e. that making the memmap for early sections permanent is a simpler
mechanism to maintain.
Wei Yang June 29, 2020, 10:58 p.m. UTC | #34
On Mon, Jun 29, 2020 at 03:13:25PM -0700, Dan Williams wrote:
>On Mon, Jun 29, 2020 at 1:34 AM Wei Yang
><richard.weiyang@linux.alibaba.com> wrote:
>>
>> On Thu, Jun 25, 2020 at 12:46:43PM -0700, Dan Williams wrote:
>> >On Wed, Jun 24, 2020 at 10:53 PM David Hildenbrand <david@redhat.com> wrote:
>> >>
>> >>
>> >>
>> >> > Am 25.06.2020 um 01:47 schrieb Dan Williams <dan.j.williams@intel.com>:
>> >> >
>> >> > On Wed, Jun 24, 2020 at 3:44 PM Wei Yang
>> >> > <richard.weiyang@linux.alibaba.com> wrote:
>> >> > [..]
>> >> >>> So, you are right that there is a mismatch here, but I think the
>> >> >>> comprehensive fix is to allow early sections to be partially
>> >> >>> depopulated/repopulated rather than have section_activate() and
>> >> >>> section_deacticate() special case early sections. The special casing
>> >> >>> is problematic in retrospect as section_deactivate() can't be
>> >> >>> maintained without understand special rules in section_activate().
>> >> >>
>> >> >> Hmm... This means we need to adjust pfn_valid() too, which always return true
>> >> >> for early sections.
>> >> >
>> >> > Right, rather than carry workarounds in 3 locations, and the bug that
>> >> > has resulted from then getting out of sync, just teach early section
>> >> > mapping to allow for the subsection populate/depopulate.
>> >> >
>> >>
>> >> I prefer the easy fix first - IOW what we Here here. Especially, pfn_to_online_page() will need changes as well.
>> >
>> >Agree, yes, let's do the simple fix first for 5.8 and the special-case
>> >elimination work later.
>>
>> Dan,
>>
>> A quick test shows this is not a simple task.
>
>Thanks for taking a look...
>
>> First, early sections don't set subsection bitmap, which is necessary for the
>> hot-add/remove.
>>
>> To properly set subsection bitmap, we need to know how many subsections in
>> early section. While current code doesn't has a alignment requirement for
>> last early section. We mark the whole last early section as present.
>
>I was thinking that the subsection map does not need to be accurate on
>initial setup, it only needs to be accurate after the first removal.
>However, that would result in new special casing that somewhat defeats
>the purpose. The hardest part is potentially breaking up a PMD mapping
>of the page array into a series of PTE mappings without disturbing
>in-flight pfn_to_page() users.
>
>> I don't find a way to enable this.
>
>While I don't like that this bug crept into the mismatched special
>casing of early sections, I'm now coming around to the same opinion.
>I.e. that making the memmap for early sections permanent is a simpler
>mechanism to maintain.

I think so ...
David Hildenbrand June 30, 2020, 7:16 a.m. UTC | #35
On 30.06.20 00:58, Wei Yang wrote:
> On Mon, Jun 29, 2020 at 03:13:25PM -0700, Dan Williams wrote:
>> On Mon, Jun 29, 2020 at 1:34 AM Wei Yang
>> <richard.weiyang@linux.alibaba.com> wrote:
>>>
>>> On Thu, Jun 25, 2020 at 12:46:43PM -0700, Dan Williams wrote:
>>>> On Wed, Jun 24, 2020 at 10:53 PM David Hildenbrand <david@redhat.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Am 25.06.2020 um 01:47 schrieb Dan Williams <dan.j.williams@intel.com>:
>>>>>>
>>>>>> On Wed, Jun 24, 2020 at 3:44 PM Wei Yang
>>>>>> <richard.weiyang@linux.alibaba.com> wrote:
>>>>>> [..]
>>>>>>>> So, you are right that there is a mismatch here, but I think the
>>>>>>>> comprehensive fix is to allow early sections to be partially
>>>>>>>> depopulated/repopulated rather than have section_activate() and
>>>>>>>> section_deacticate() special case early sections. The special casing
>>>>>>>> is problematic in retrospect as section_deactivate() can't be
>>>>>>>> maintained without understand special rules in section_activate().
>>>>>>>
>>>>>>> Hmm... This means we need to adjust pfn_valid() too, which always return true
>>>>>>> for early sections.
>>>>>>
>>>>>> Right, rather than carry workarounds in 3 locations, and the bug that
>>>>>> has resulted from then getting out of sync, just teach early section
>>>>>> mapping to allow for the subsection populate/depopulate.
>>>>>>
>>>>>
>>>>> I prefer the easy fix first - IOW what we Here here. Especially, pfn_to_online_page() will need changes as well.
>>>>
>>>> Agree, yes, let's do the simple fix first for 5.8 and the special-case
>>>> elimination work later.
>>>
>>> Dan,
>>>
>>> A quick test shows this is not a simple task.
>>
>> Thanks for taking a look...
>>
>>> First, early sections don't set subsection bitmap, which is necessary for the
>>> hot-add/remove.
>>>
>>> To properly set subsection bitmap, we need to know how many subsections in
>>> early section. While current code doesn't has a alignment requirement for
>>> last early section. We mark the whole last early section as present.
>>
>> I was thinking that the subsection map does not need to be accurate on
>> initial setup, it only needs to be accurate after the first removal.
>> However, that would result in new special casing that somewhat defeats
>> the purpose. The hardest part is potentially breaking up a PMD mapping
>> of the page array into a series of PTE mappings without disturbing
>> in-flight pfn_to_page() users.
>>
>>> I don't find a way to enable this.
>>
>> While I don't like that this bug crept into the mismatched special
>> casing of early sections, I'm now coming around to the same opinion.
>> I.e. that making the memmap for early sections permanent is a simpler
>> mechanism to maintain.
> 
> I think so ...
> 

Yes, and I think having to replace quite some pfn_valid_within() - nops
- by pfn_valid() just to handle one corner case might not be worth it.
At least for now.
diff mbox series

Patch

diff --git a/mm/sparse.c b/mm/sparse.c
index b2b9a3e34696..1a0069f492f5 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -825,10 +825,10 @@  static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
 		ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
 	}
 
-	if (section_is_early && memmap)
-		free_map_bootmem(memmap);
-	else
+	if (!section_is_early)
 		depopulate_section_memmap(pfn, nr_pages, altmap);
+	else if (memmap)
+		free_map_bootmem(memmap);
 
 	if (empty)
 		ms->section_mem_map = (unsigned long)NULL;