diff mbox series

[V3,2/4] arm64/mm: Hold memory hotplug lock while walking for kernel page table dump

Message ID 1557824407-19092-3-git-send-email-anshuman.khandual@arm.com (mailing list archive)
State New, archived
Headers show
Series arm64/mm: Enable memory hot remove | expand

Commit Message

Anshuman Khandual May 14, 2019, 9 a.m. UTC
The arm64 pagetable dump code can race with concurrent modification of the
kernel page tables. When a leaf entries are modified concurrently, the dump
code may log stale or inconsistent information for a VA range, but this is
otherwise not harmful.

When intermediate levels of table are freed, the dump code will continue to
use memory which has been freed and potentially reallocated for another
purpose. In such cases, the dump code may dereference bogus addressses,
leading to a number of potential problems.

Intermediate levels of table may by freed during memory hot-remove, or when
installing a huge mapping in the vmalloc region. To avoid racing with these
cases, take the memory hotplug lock when walking the kernel page table.

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 arch/arm64/mm/ptdump_debugfs.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

David Hildenbrand May 14, 2019, 9:16 a.m. UTC | #1
On 14.05.19 11:00, Anshuman Khandual wrote:
> The arm64 pagetable dump code can race with concurrent modification of the
> kernel page tables. When a leaf entries are modified concurrently, the dump
> code may log stale or inconsistent information for a VA range, but this is
> otherwise not harmful.
> 
> When intermediate levels of table are freed, the dump code will continue to
> use memory which has been freed and potentially reallocated for another
> purpose. In such cases, the dump code may dereference bogus addressses,
> leading to a number of potential problems.
> 
> Intermediate levels of table may by freed during memory hot-remove, or when
> installing a huge mapping in the vmalloc region. To avoid racing with these
> cases, take the memory hotplug lock when walking the kernel page table.
> 
> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> ---
>  arch/arm64/mm/ptdump_debugfs.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c
> index 064163f..80171d1 100644
> --- a/arch/arm64/mm/ptdump_debugfs.c
> +++ b/arch/arm64/mm/ptdump_debugfs.c
> @@ -7,7 +7,10 @@
>  static int ptdump_show(struct seq_file *m, void *v)
>  {
>  	struct ptdump_info *info = m->private;
> +
> +	get_online_mems();

The name of that function is somewhat stale now, it used to refer to
online/offlining of pages only. The underlying lock is the
mem_hotplug_lock. Maybe we should rename that function some day ...

>  	ptdump_walk_pgd(m, info);
> +	put_online_mems();
>  	return 0;

Acked-by: David Hildenbrand <david@redhat.com>

>  }
>  DEFINE_SHOW_ATTRIBUTE(ptdump);
>
Mark Rutland May 14, 2019, 3:40 p.m. UTC | #2
On Tue, May 14, 2019 at 02:30:05PM +0530, Anshuman Khandual wrote:
> The arm64 pagetable dump code can race with concurrent modification of the
> kernel page tables. When a leaf entries are modified concurrently, the dump
> code may log stale or inconsistent information for a VA range, but this is
> otherwise not harmful.
> 
> When intermediate levels of table are freed, the dump code will continue to
> use memory which has been freed and potentially reallocated for another
> purpose. In such cases, the dump code may dereference bogus addressses,
> leading to a number of potential problems.
> 
> Intermediate levels of table may by freed during memory hot-remove, or when
> installing a huge mapping in the vmalloc region. To avoid racing with these
> cases, take the memory hotplug lock when walking the kernel page table.
> 
> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>

Can we please move this after the next patch (which addresses the huge
vmap case), and change the last paragraph to:

  Intermediate levels of table may by freed during memory hot-remove,
  which will be enabled by a subsequent patch. To avoid racing with
  this, take the memory hotplug lock when walking the kernel page table.

With that, this looks good to me.

Thanks,
Mark.

> ---
>  arch/arm64/mm/ptdump_debugfs.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c
> index 064163f..80171d1 100644
> --- a/arch/arm64/mm/ptdump_debugfs.c
> +++ b/arch/arm64/mm/ptdump_debugfs.c
> @@ -7,7 +7,10 @@
>  static int ptdump_show(struct seq_file *m, void *v)
>  {
>  	struct ptdump_info *info = m->private;
> +
> +	get_online_mems();
>  	ptdump_walk_pgd(m, info);
> +	put_online_mems();
>  	return 0;
>  }
>  DEFINE_SHOW_ATTRIBUTE(ptdump);
> -- 
> 2.7.4
>
Anshuman Khandual May 15, 2019, 1:56 a.m. UTC | #3
On 05/14/2019 09:10 PM, Mark Rutland wrote:
> On Tue, May 14, 2019 at 02:30:05PM +0530, Anshuman Khandual wrote:
>> The arm64 pagetable dump code can race with concurrent modification of the
>> kernel page tables. When a leaf entries are modified concurrently, the dump
>> code may log stale or inconsistent information for a VA range, but this is
>> otherwise not harmful.
>>
>> When intermediate levels of table are freed, the dump code will continue to
>> use memory which has been freed and potentially reallocated for another
>> purpose. In such cases, the dump code may dereference bogus addressses,
>> leading to a number of potential problems.
>>
>> Intermediate levels of table may by freed during memory hot-remove, or when
>> installing a huge mapping in the vmalloc region. To avoid racing with these
>> cases, take the memory hotplug lock when walking the kernel page table.
>>
>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> Can we please move this after the next patch (which addresses the huge
> vmap case), and change the last paragraph to:
> 
>   Intermediate levels of table may by freed during memory hot-remove,
>   which will be enabled by a subsequent patch. To avoid racing with
>   this, take the memory hotplug lock when walking the kernel page table.
> 
> With that, this looks good to me.

Sure will do.
Michal Hocko May 15, 2019, 4:58 p.m. UTC | #4
On Tue 14-05-19 14:30:05, Anshuman Khandual wrote:
> The arm64 pagetable dump code can race with concurrent modification of the
> kernel page tables. When a leaf entries are modified concurrently, the dump
> code may log stale or inconsistent information for a VA range, but this is
> otherwise not harmful.
> 
> When intermediate levels of table are freed, the dump code will continue to
> use memory which has been freed and potentially reallocated for another
> purpose. In such cases, the dump code may dereference bogus addressses,
> leading to a number of potential problems.
> 
> Intermediate levels of table may by freed during memory hot-remove, or when
> installing a huge mapping in the vmalloc region. To avoid racing with these
> cases, take the memory hotplug lock when walking the kernel page table.

Why is this a problem only on arm64 and why do we even care for debugfs?
Does anybody rely on this thing to be reliable? Do we even need it? Who
is using the file?

I am asking because I would really love to make mem hotplug locking less
scattered outside of the core MM than more. Most users simply shouldn't
care. Pfn walkers should rely on pfn_to_online_page.
Mark Rutland May 16, 2019, 10:23 a.m. UTC | #5
Hi Michal,

On Wed, May 15, 2019 at 06:58:47PM +0200, Michal Hocko wrote:
> On Tue 14-05-19 14:30:05, Anshuman Khandual wrote:
> > The arm64 pagetable dump code can race with concurrent modification of the
> > kernel page tables. When a leaf entries are modified concurrently, the dump
> > code may log stale or inconsistent information for a VA range, but this is
> > otherwise not harmful.
> > 
> > When intermediate levels of table are freed, the dump code will continue to
> > use memory which has been freed and potentially reallocated for another
> > purpose. In such cases, the dump code may dereference bogus addressses,
> > leading to a number of potential problems.
> > 
> > Intermediate levels of table may by freed during memory hot-remove, or when
> > installing a huge mapping in the vmalloc region. To avoid racing with these
> > cases, take the memory hotplug lock when walking the kernel page table.
> 
> Why is this a problem only on arm64 

It looks like it's not -- I think we're just the first to realise this.

AFAICT x86's debugfs ptdump has the same issue if run conccurently with
memory hot remove. If 32-bit arm supported hot-remove, its ptdump code
would have the same issue.

> and why do we even care for debugfs? Does anybody rely on this thing
> to be reliable? Do we even need it? Who is using the file?

The debugfs part is used intermittently by a few people working on the
arm64 kernel page tables. We use that both to sanity-check that kernel
page tables are created/updated correctly after changes to the arm64 mmu
code, and also to debug issues if/when we encounter issues that appear
to be the result of kernel page table corruption.

So while it's rare to need it, it's really useful to have when we do
need it, and I'd rather not remove it. I'd also rather that it didn't
have latent issues where we can accidentally crash the kernel when using
it, which is what this patch is addressing.

> I am asking because I would really love to make mem hotplug locking less
> scattered outside of the core MM than more. Most users simply shouldn't
> care. Pfn walkers should rely on pfn_to_online_page.

I'm not sure if that would help us here; IIUC pfn_to_online_page() alone
doesn't ensure that the page remains online. Is there a way to achieve
that other than get_online_mems()?

The big problem for the ptdump code is when tables are freed, since the
pages can be reused elsewhere (or hot-removed), causing the ptdump code
to explode.

Thanks,
Mark.
Michal Hocko May 16, 2019, 11:05 a.m. UTC | #6
On Thu 16-05-19 11:23:54, Mark Rutland wrote:
> Hi Michal,
> 
> On Wed, May 15, 2019 at 06:58:47PM +0200, Michal Hocko wrote:
> > On Tue 14-05-19 14:30:05, Anshuman Khandual wrote:
> > > The arm64 pagetable dump code can race with concurrent modification of the
> > > kernel page tables. When a leaf entries are modified concurrently, the dump
> > > code may log stale or inconsistent information for a VA range, but this is
> > > otherwise not harmful.
> > > 
> > > When intermediate levels of table are freed, the dump code will continue to
> > > use memory which has been freed and potentially reallocated for another
> > > purpose. In such cases, the dump code may dereference bogus addressses,
> > > leading to a number of potential problems.
> > > 
> > > Intermediate levels of table may by freed during memory hot-remove, or when
> > > installing a huge mapping in the vmalloc region. To avoid racing with these
> > > cases, take the memory hotplug lock when walking the kernel page table.
> > 
> > Why is this a problem only on arm64 
> 
> It looks like it's not -- I think we're just the first to realise this.
> 
> AFAICT x86's debugfs ptdump has the same issue if run conccurently with
> memory hot remove. If 32-bit arm supported hot-remove, its ptdump code
> would have the same issue.
> 
> > and why do we even care for debugfs? Does anybody rely on this thing
> > to be reliable? Do we even need it? Who is using the file?
> 
> The debugfs part is used intermittently by a few people working on the
> arm64 kernel page tables. We use that both to sanity-check that kernel
> page tables are created/updated correctly after changes to the arm64 mmu
> code, and also to debug issues if/when we encounter issues that appear
> to be the result of kernel page table corruption.

OK, I see. Thanks for the clarification.

> So while it's rare to need it, it's really useful to have when we do
> need it, and I'd rather not remove it. I'd also rather that it didn't
> have latent issues where we can accidentally crash the kernel when using
> it, which is what this patch is addressing.

While I agree, do we rather want to document that you shouldn't be using
the debugging tool while the hotplug is ongoing because you might get a
garbage or crash the kernel in the worst case? In other words is the
absolute correctness worth the additional maint. burden wrt. to future
hotplug changes?

> > I am asking because I would really love to make mem hotplug locking less
> > scattered outside of the core MM than more. Most users simply shouldn't
> > care. Pfn walkers should rely on pfn_to_online_page.
> 
> I'm not sure if that would help us here; IIUC pfn_to_online_page() alone
> doesn't ensure that the page remains online. Is there a way to achieve
> that other than get_online_mems()?

You have to pin the page to make sure the hotplug is not going to
offline it.

> The big problem for the ptdump code is when tables are freed, since the
> pages can be reused elsewhere (or hot-removed), causing the ptdump code
> to explode.

Yes, I see the danger. I am just wondering whether living with that is
reasonable considering this is a debugfs code.
Anshuman Khandual May 16, 2019, 11:06 a.m. UTC | #7
On 05/16/2019 03:53 PM, Mark Rutland wrote:
> Hi Michal,
> 
> On Wed, May 15, 2019 at 06:58:47PM +0200, Michal Hocko wrote:
>> On Tue 14-05-19 14:30:05, Anshuman Khandual wrote:
>>> The arm64 pagetable dump code can race with concurrent modification of the
>>> kernel page tables. When a leaf entries are modified concurrently, the dump
>>> code may log stale or inconsistent information for a VA range, but this is
>>> otherwise not harmful.
>>>
>>> When intermediate levels of table are freed, the dump code will continue to
>>> use memory which has been freed and potentially reallocated for another
>>> purpose. In such cases, the dump code may dereference bogus addressses,
>>> leading to a number of potential problems.
>>>
>>> Intermediate levels of table may by freed during memory hot-remove, or when
>>> installing a huge mapping in the vmalloc region. To avoid racing with these
>>> cases, take the memory hotplug lock when walking the kernel page table.
>>
>> Why is this a problem only on arm64 
> 
> It looks like it's not -- I think we're just the first to realise this.
> 
> AFAICT x86's debugfs ptdump has the same issue if run conccurently with
> memory hot remove. If 32-bit arm supported hot-remove, its ptdump code
> would have the same issue.
> 
>> and why do we even care for debugfs? Does anybody rely on this thing
>> to be reliable? Do we even need it? Who is using the file?
> 
> The debugfs part is used intermittently by a few people working on the
> arm64 kernel page tables. We use that both to sanity-check that kernel
> page tables are created/updated correctly after changes to the arm64 mmu
> code, and also to debug issues if/when we encounter issues that appear
> to be the result of kernel page table corruption.
> 
> So while it's rare to need it, it's really useful to have when we do
> need it, and I'd rather not remove it. I'd also rather that it didn't
> have latent issues where we can accidentally crash the kernel when using
> it, which is what this patch is addressing.
> 
>> I am asking because I would really love to make mem hotplug locking less
>> scattered outside of the core MM than more. Most users simply shouldn't
>> care. Pfn walkers should rely on pfn_to_online_page.
> 
> I'm not sure if that would help us here; IIUC pfn_to_online_page() alone
> doesn't ensure that the page remains online. Is there a way to achieve
> that other than get_online_mems()?

Still wondering how pfn_to_online_page() is applicable here. It validates
a given PFN and whether its online from sparse section mapping perspective
before giving it's struct page. IIUC it is used during a linear scanning
of a physical address range not for a page table walk. So how it can solve
the problem when a struct page which was used as an intermediate level page
table page gets released back to the buddy from another concurrent thread ?
Michal Hocko May 16, 2019, 11:16 a.m. UTC | #8
On Thu 16-05-19 16:36:12, Anshuman Khandual wrote:
> On 05/16/2019 03:53 PM, Mark Rutland wrote:
> > Hi Michal,
> > 
> > On Wed, May 15, 2019 at 06:58:47PM +0200, Michal Hocko wrote:
> >> On Tue 14-05-19 14:30:05, Anshuman Khandual wrote:
> >>> The arm64 pagetable dump code can race with concurrent modification of the
> >>> kernel page tables. When a leaf entries are modified concurrently, the dump
> >>> code may log stale or inconsistent information for a VA range, but this is
> >>> otherwise not harmful.
> >>>
> >>> When intermediate levels of table are freed, the dump code will continue to
> >>> use memory which has been freed and potentially reallocated for another
> >>> purpose. In such cases, the dump code may dereference bogus addressses,
> >>> leading to a number of potential problems.
> >>>
> >>> Intermediate levels of table may by freed during memory hot-remove, or when
> >>> installing a huge mapping in the vmalloc region. To avoid racing with these
> >>> cases, take the memory hotplug lock when walking the kernel page table.
> >>
> >> Why is this a problem only on arm64 
> > 
> > It looks like it's not -- I think we're just the first to realise this.
> > 
> > AFAICT x86's debugfs ptdump has the same issue if run conccurently with
> > memory hot remove. If 32-bit arm supported hot-remove, its ptdump code
> > would have the same issue.
> > 
> >> and why do we even care for debugfs? Does anybody rely on this thing
> >> to be reliable? Do we even need it? Who is using the file?
> > 
> > The debugfs part is used intermittently by a few people working on the
> > arm64 kernel page tables. We use that both to sanity-check that kernel
> > page tables are created/updated correctly after changes to the arm64 mmu
> > code, and also to debug issues if/when we encounter issues that appear
> > to be the result of kernel page table corruption.
> > 
> > So while it's rare to need it, it's really useful to have when we do
> > need it, and I'd rather not remove it. I'd also rather that it didn't
> > have latent issues where we can accidentally crash the kernel when using
> > it, which is what this patch is addressing.
> > 
> >> I am asking because I would really love to make mem hotplug locking less
> >> scattered outside of the core MM than more. Most users simply shouldn't
> >> care. Pfn walkers should rely on pfn_to_online_page.
> > 
> > I'm not sure if that would help us here; IIUC pfn_to_online_page() alone
> > doesn't ensure that the page remains online. Is there a way to achieve
> > that other than get_online_mems()?
> 
> Still wondering how pfn_to_online_page() is applicable here. It validates
> a given PFN and whether its online from sparse section mapping perspective
> before giving it's struct page. IIUC it is used during a linear scanning
> of a physical address range not for a page table walk. So how it can solve
> the problem when a struct page which was used as an intermediate level page
> table page gets released back to the buddy from another concurrent thread ?

Well, my comment about pfn_to_online_page was more generic and it might
not apply to this specific case. I meant to say that the code outside of
the core MM shouldn't really care about the hotplug locking.
Mark Rutland May 22, 2019, 4:42 p.m. UTC | #9
On Thu, May 16, 2019 at 01:05:29PM +0200, Michal Hocko wrote:
> On Thu 16-05-19 11:23:54, Mark Rutland wrote:
> > Hi Michal,
> > 
> > On Wed, May 15, 2019 at 06:58:47PM +0200, Michal Hocko wrote:
> > > On Tue 14-05-19 14:30:05, Anshuman Khandual wrote:
> > > > The arm64 pagetable dump code can race with concurrent modification of the
> > > > kernel page tables. When a leaf entries are modified concurrently, the dump
> > > > code may log stale or inconsistent information for a VA range, but this is
> > > > otherwise not harmful.
> > > > 
> > > > When intermediate levels of table are freed, the dump code will continue to
> > > > use memory which has been freed and potentially reallocated for another
> > > > purpose. In such cases, the dump code may dereference bogus addressses,
> > > > leading to a number of potential problems.
> > > > 
> > > > Intermediate levels of table may by freed during memory hot-remove, or when
> > > > installing a huge mapping in the vmalloc region. To avoid racing with these
> > > > cases, take the memory hotplug lock when walking the kernel page table.
> > > 
> > > Why is this a problem only on arm64 
> > 
> > It looks like it's not -- I think we're just the first to realise this.
> > 
> > AFAICT x86's debugfs ptdump has the same issue if run conccurently with
> > memory hot remove. If 32-bit arm supported hot-remove, its ptdump code
> > would have the same issue.
> > 
> > > and why do we even care for debugfs? Does anybody rely on this thing
> > > to be reliable? Do we even need it? Who is using the file?
> > 
> > The debugfs part is used intermittently by a few people working on the
> > arm64 kernel page tables. We use that both to sanity-check that kernel
> > page tables are created/updated correctly after changes to the arm64 mmu
> > code, and also to debug issues if/when we encounter issues that appear
> > to be the result of kernel page table corruption.
> 
> OK, I see. Thanks for the clarification.
> 
> > So while it's rare to need it, it's really useful to have when we do
> > need it, and I'd rather not remove it. I'd also rather that it didn't
> > have latent issues where we can accidentally crash the kernel when using
> > it, which is what this patch is addressing.
> 
> While I agree, do we rather want to document that you shouldn't be using
> the debugging tool while the hotplug is ongoing because you might get a
> garbage or crash the kernel in the worst case? In other words is the
> absolute correctness worth the additional maint. burden wrt. to future
> hotplug changes?

I don't think that it's reasonable for this code to bring down the
kernel unless the kernel page tables are already corrupt. I agree we
should minimize the impact on other code, and I'm happy to penalize
ptdump so long as it's functional and safe.

I would like it to be possible to use the ptdump code to debug
hot-remove, so I'd rather not make the two mutually exclusive. I'd also
like it to be possible to use this in-the-field, and for that asking an
admin to potentially crash their system isn't likely to fly.

> > > I am asking because I would really love to make mem hotplug locking less
> > > scattered outside of the core MM than more. Most users simply shouldn't
> > > care. Pfn walkers should rely on pfn_to_online_page.

Jut to check, is your plan to limit access to the hotplug lock, or to
redesign the locking scheme?

> > I'm not sure if that would help us here; IIUC pfn_to_online_page() alone
> > doesn't ensure that the page remains online. Is there a way to achieve
> > that other than get_online_mems()?
> 
> You have to pin the page to make sure the hotplug is not going to
> offline it.

I'm not exactly sure how pinning works -- is there a particular set of
functions I should look at for that?

I guess that if/when we allocate the vmemmap from hotpluggable memory
that will require the pinning code to take the hotplug lock internally
to ensure that the struct page is accessible while we attempt to pin it?

Thanks,
Mark.
David Hildenbrand May 23, 2019, 8:40 a.m. UTC | #10
On 16.05.19 13:16, Michal Hocko wrote:
> On Thu 16-05-19 16:36:12, Anshuman Khandual wrote:
>> On 05/16/2019 03:53 PM, Mark Rutland wrote:
>>> Hi Michal,
>>>
>>> On Wed, May 15, 2019 at 06:58:47PM +0200, Michal Hocko wrote:
>>>> On Tue 14-05-19 14:30:05, Anshuman Khandual wrote:
>>>>> The arm64 pagetable dump code can race with concurrent modification of the
>>>>> kernel page tables. When a leaf entries are modified concurrently, the dump
>>>>> code may log stale or inconsistent information for a VA range, but this is
>>>>> otherwise not harmful.
>>>>>
>>>>> When intermediate levels of table are freed, the dump code will continue to
>>>>> use memory which has been freed and potentially reallocated for another
>>>>> purpose. In such cases, the dump code may dereference bogus addressses,
>>>>> leading to a number of potential problems.
>>>>>
>>>>> Intermediate levels of table may by freed during memory hot-remove, or when
>>>>> installing a huge mapping in the vmalloc region. To avoid racing with these
>>>>> cases, take the memory hotplug lock when walking the kernel page table.
>>>>
>>>> Why is this a problem only on arm64 
>>>
>>> It looks like it's not -- I think we're just the first to realise this.
>>>
>>> AFAICT x86's debugfs ptdump has the same issue if run conccurently with
>>> memory hot remove. If 32-bit arm supported hot-remove, its ptdump code
>>> would have the same issue.
>>>
>>>> and why do we even care for debugfs? Does anybody rely on this thing
>>>> to be reliable? Do we even need it? Who is using the file?
>>>
>>> The debugfs part is used intermittently by a few people working on the
>>> arm64 kernel page tables. We use that both to sanity-check that kernel
>>> page tables are created/updated correctly after changes to the arm64 mmu
>>> code, and also to debug issues if/when we encounter issues that appear
>>> to be the result of kernel page table corruption.
>>>
>>> So while it's rare to need it, it's really useful to have when we do
>>> need it, and I'd rather not remove it. I'd also rather that it didn't
>>> have latent issues where we can accidentally crash the kernel when using
>>> it, which is what this patch is addressing.
>>>
>>>> I am asking because I would really love to make mem hotplug locking less
>>>> scattered outside of the core MM than more. Most users simply shouldn't
>>>> care. Pfn walkers should rely on pfn_to_online_page.
>>>
>>> I'm not sure if that would help us here; IIUC pfn_to_online_page() alone
>>> doesn't ensure that the page remains online. Is there a way to achieve
>>> that other than get_online_mems()?
>>
>> Still wondering how pfn_to_online_page() is applicable here. It validates
>> a given PFN and whether its online from sparse section mapping perspective
>> before giving it's struct page. IIUC it is used during a linear scanning
>> of a physical address range not for a page table walk. So how it can solve
>> the problem when a struct page which was used as an intermediate level page
>> table page gets released back to the buddy from another concurrent thread ?
> 
> Well, my comment about pfn_to_online_page was more generic and it might
> not apply to this specific case. I meant to say that the code outside of
> the core MM shouldn't really care about the hotplug locking.
> 

What am I missing, how is it guaranteed that a page doesn't get
offlined/removed without holding a lock here?

We would at least need some RCU mechnism or similar to sync against
pages vanishing.

pfn_to_online_page() assumes that somebody touches a page he doesn't
own. There has to be some way for core-mm to realize this and defer
offlining/removinf.
Anshuman Khandual May 24, 2019, 5:43 a.m. UTC | #11
On 05/23/2019 02:10 PM, David Hildenbrand wrote:
> On 16.05.19 13:16, Michal Hocko wrote:
>> On Thu 16-05-19 16:36:12, Anshuman Khandual wrote:
>>> On 05/16/2019 03:53 PM, Mark Rutland wrote:
>>>> Hi Michal,
>>>>
>>>> On Wed, May 15, 2019 at 06:58:47PM +0200, Michal Hocko wrote:
>>>>> On Tue 14-05-19 14:30:05, Anshuman Khandual wrote:
>>>>>> The arm64 pagetable dump code can race with concurrent modification of the
>>>>>> kernel page tables. When a leaf entries are modified concurrently, the dump
>>>>>> code may log stale or inconsistent information for a VA range, but this is
>>>>>> otherwise not harmful.
>>>>>>
>>>>>> When intermediate levels of table are freed, the dump code will continue to
>>>>>> use memory which has been freed and potentially reallocated for another
>>>>>> purpose. In such cases, the dump code may dereference bogus addressses,
>>>>>> leading to a number of potential problems.
>>>>>>
>>>>>> Intermediate levels of table may by freed during memory hot-remove, or when
>>>>>> installing a huge mapping in the vmalloc region. To avoid racing with these
>>>>>> cases, take the memory hotplug lock when walking the kernel page table.
>>>>>
>>>>> Why is this a problem only on arm64 
>>>>
>>>> It looks like it's not -- I think we're just the first to realise this.
>>>>
>>>> AFAICT x86's debugfs ptdump has the same issue if run conccurently with
>>>> memory hot remove. If 32-bit arm supported hot-remove, its ptdump code
>>>> would have the same issue.
>>>>
>>>>> and why do we even care for debugfs? Does anybody rely on this thing
>>>>> to be reliable? Do we even need it? Who is using the file?
>>>>
>>>> The debugfs part is used intermittently by a few people working on the
>>>> arm64 kernel page tables. We use that both to sanity-check that kernel
>>>> page tables are created/updated correctly after changes to the arm64 mmu
>>>> code, and also to debug issues if/when we encounter issues that appear
>>>> to be the result of kernel page table corruption.
>>>>
>>>> So while it's rare to need it, it's really useful to have when we do
>>>> need it, and I'd rather not remove it. I'd also rather that it didn't
>>>> have latent issues where we can accidentally crash the kernel when using
>>>> it, which is what this patch is addressing.
>>>>
>>>>> I am asking because I would really love to make mem hotplug locking less
>>>>> scattered outside of the core MM than more. Most users simply shouldn't
>>>>> care. Pfn walkers should rely on pfn_to_online_page.
>>>>
>>>> I'm not sure if that would help us here; IIUC pfn_to_online_page() alone
>>>> doesn't ensure that the page remains online. Is there a way to achieve
>>>> that other than get_online_mems()?
>>>
>>> Still wondering how pfn_to_online_page() is applicable here. It validates
>>> a given PFN and whether its online from sparse section mapping perspective
>>> before giving it's struct page. IIUC it is used during a linear scanning
>>> of a physical address range not for a page table walk. So how it can solve
>>> the problem when a struct page which was used as an intermediate level page
>>> table page gets released back to the buddy from another concurrent thread ?
>>
>> Well, my comment about pfn_to_online_page was more generic and it might
>> not apply to this specific case. I meant to say that the code outside of
>> the core MM shouldn't really care about the hotplug locking.
>>
> 
> What am I missing, how is it guaranteed that a page doesn't get
> offlined/removed without holding a lock here?

It is not guaranteed.

> 
> We would at least need some RCU mechnism or similar to sync against
> pages vanishing.

Yes, if we dont take memory_hotplug_lock preventing memory hot remove.

> 
> pfn_to_online_page() assumes that somebody touches a page he doesn't
> own. There has to be some way for core-mm to realize this and defer
> offlining/removinf.

First of all I am not sure yet if Michal really meant that reference
should be taken on all struct pages (while dumping kernel page table)
for each range (minimum hot remove granularity) to prevent them from
being hot removed.
Anshuman Khandual May 24, 2019, 6:06 a.m. UTC | #12
On 05/22/2019 10:12 PM, Mark Rutland wrote:
> On Thu, May 16, 2019 at 01:05:29PM +0200, Michal Hocko wrote:
>> On Thu 16-05-19 11:23:54, Mark Rutland wrote:
>>> Hi Michal,
>>>
>>> On Wed, May 15, 2019 at 06:58:47PM +0200, Michal Hocko wrote:
>>>> On Tue 14-05-19 14:30:05, Anshuman Khandual wrote:
>>>>> The arm64 pagetable dump code can race with concurrent modification of the
>>>>> kernel page tables. When a leaf entries are modified concurrently, the dump
>>>>> code may log stale or inconsistent information for a VA range, but this is
>>>>> otherwise not harmful.
>>>>>
>>>>> When intermediate levels of table are freed, the dump code will continue to
>>>>> use memory which has been freed and potentially reallocated for another
>>>>> purpose. In such cases, the dump code may dereference bogus addressses,
>>>>> leading to a number of potential problems.
>>>>>
>>>>> Intermediate levels of table may by freed during memory hot-remove, or when
>>>>> installing a huge mapping in the vmalloc region. To avoid racing with these
>>>>> cases, take the memory hotplug lock when walking the kernel page table.
>>>>
>>>> Why is this a problem only on arm64 
>>>
>>> It looks like it's not -- I think we're just the first to realise this.
>>>
>>> AFAICT x86's debugfs ptdump has the same issue if run conccurently with
>>> memory hot remove. If 32-bit arm supported hot-remove, its ptdump code
>>> would have the same issue.
>>>
>>>> and why do we even care for debugfs? Does anybody rely on this thing
>>>> to be reliable? Do we even need it? Who is using the file?
>>>
>>> The debugfs part is used intermittently by a few people working on the
>>> arm64 kernel page tables. We use that both to sanity-check that kernel
>>> page tables are created/updated correctly after changes to the arm64 mmu
>>> code, and also to debug issues if/when we encounter issues that appear
>>> to be the result of kernel page table corruption.
>>
>> OK, I see. Thanks for the clarification.
>>
>>> So while it's rare to need it, it's really useful to have when we do
>>> need it, and I'd rather not remove it. I'd also rather that it didn't
>>> have latent issues where we can accidentally crash the kernel when using
>>> it, which is what this patch is addressing.
>>
>> While I agree, do we rather want to document that you shouldn't be using
>> the debugging tool while the hotplug is ongoing because you might get a
>> garbage or crash the kernel in the worst case? In other words is the
>> absolute correctness worth the additional maint. burden wrt. to future
>> hotplug changes?
> 
> I don't think that it's reasonable for this code to bring down the
> kernel unless the kernel page tables are already corrupt. I agree we
> should minimize the impact on other code, and I'm happy to penalize
> ptdump so long as it's functional and safe.
> 
> I would like it to be possible to use the ptdump code to debug
> hot-remove, so I'd rather not make the two mutually exclusive. I'd also
> like it to be possible to use this in-the-field, and for that asking an
> admin to potentially crash their system isn't likely to fly.
> 
>>>> I am asking because I would really love to make mem hotplug locking less
>>>> scattered outside of the core MM than more. Most users simply shouldn't
>>>> care. Pfn walkers should rely on pfn_to_online_page.
> 
> Jut to check, is your plan to limit access to the hotplug lock, or to
> redesign the locking scheme?
> 
>>> I'm not sure if that would help us here; IIUC pfn_to_online_page() alone
>>> doesn't ensure that the page remains online. Is there a way to achieve
>>> that other than get_online_mems()?
>>
>> You have to pin the page to make sure the hotplug is not going to
>> offline it.
> 
> I'm not exactly sure how pinning works -- is there a particular set of
> functions I should look at for that?
> 
> I guess that if/when we allocate the vmemmap from hotpluggable memory
> that will require the pinning code to take the hotplug lock internally
> to ensure that the struct page is accessible while we attempt to pin it?

I am bit confused here.

Which pages are we trying to pin ?

1) init_mm page table pages (vmemmap + linear) for the range to be hot-removed
2) struct pages for the PFN range to be hot-removed

We need hot-remove process to be blocked enough not to free the intermediate
level page table pages which will ensure kernel does not crash during ptdump.

AFAICT

1) Holding reference on (1) prevent freeing of pgtable pages during hot-remove
2) Holding reference on (2) prevent range PFN from being hot removed which in
   turn can prevent forward progress for hot-remove process and hence possibly
   prevent freeing of intermediate level pgtable pages.

But both the above solutions are bit complex and will consume more cycles as
compared to just take a memory_hotplug_lock. In case of (1) it is bit tricker
as ptdump code has to first walk init_mm to get to all the pgtable pages for
taking a reference/lock on them. Just wondering if it is worth the trouble.
Michal Hocko May 27, 2019, 7:20 a.m. UTC | #13
On Wed 22-05-19 17:42:13, Mark Rutland wrote:
> On Thu, May 16, 2019 at 01:05:29PM +0200, Michal Hocko wrote:
> > On Thu 16-05-19 11:23:54, Mark Rutland wrote:
> > > Hi Michal,
> > > 
> > > On Wed, May 15, 2019 at 06:58:47PM +0200, Michal Hocko wrote:
> > > > On Tue 14-05-19 14:30:05, Anshuman Khandual wrote:
> > > > > The arm64 pagetable dump code can race with concurrent modification of the
> > > > > kernel page tables. When a leaf entries are modified concurrently, the dump
> > > > > code may log stale or inconsistent information for a VA range, but this is
> > > > > otherwise not harmful.
> > > > > 
> > > > > When intermediate levels of table are freed, the dump code will continue to
> > > > > use memory which has been freed and potentially reallocated for another
> > > > > purpose. In such cases, the dump code may dereference bogus addressses,
> > > > > leading to a number of potential problems.
> > > > > 
> > > > > Intermediate levels of table may by freed during memory hot-remove, or when
> > > > > installing a huge mapping in the vmalloc region. To avoid racing with these
> > > > > cases, take the memory hotplug lock when walking the kernel page table.
> > > > 
> > > > Why is this a problem only on arm64 
> > > 
> > > It looks like it's not -- I think we're just the first to realise this.
> > > 
> > > AFAICT x86's debugfs ptdump has the same issue if run conccurently with
> > > memory hot remove. If 32-bit arm supported hot-remove, its ptdump code
> > > would have the same issue.
> > > 
> > > > and why do we even care for debugfs? Does anybody rely on this thing
> > > > to be reliable? Do we even need it? Who is using the file?
> > > 
> > > The debugfs part is used intermittently by a few people working on the
> > > arm64 kernel page tables. We use that both to sanity-check that kernel
> > > page tables are created/updated correctly after changes to the arm64 mmu
> > > code, and also to debug issues if/when we encounter issues that appear
> > > to be the result of kernel page table corruption.
> > 
> > OK, I see. Thanks for the clarification.
> > 
> > > So while it's rare to need it, it's really useful to have when we do
> > > need it, and I'd rather not remove it. I'd also rather that it didn't
> > > have latent issues where we can accidentally crash the kernel when using
> > > it, which is what this patch is addressing.
> > 
> > While I agree, do we rather want to document that you shouldn't be using
> > the debugging tool while the hotplug is ongoing because you might get a
> > garbage or crash the kernel in the worst case? In other words is the
> > absolute correctness worth the additional maint. burden wrt. to future
> > hotplug changes?
> 
> I don't think that it's reasonable for this code to bring down the
> kernel unless the kernel page tables are already corrupt. I agree we
> should minimize the impact on other code, and I'm happy to penalize
> ptdump so long as it's functional and safe.
> 
> I would like it to be possible to use the ptdump code to debug
> hot-remove, so I'd rather not make the two mutually exclusive. I'd also
> like it to be possible to use this in-the-field, and for that asking an
> admin to potentially crash their system isn't likely to fly.

OK, fair enough.

> > > > I am asking because I would really love to make mem hotplug locking less
> > > > scattered outside of the core MM than more. Most users simply shouldn't
> > > > care. Pfn walkers should rely on pfn_to_online_page.
> 
> Jut to check, is your plan to limit access to the hotplug lock, or to
> redesign the locking scheme?

To change the locking to lock hotpluged ranges rather than having a
global lock as the operation is inherently pfn range scoped.

> > > I'm not sure if that would help us here; IIUC pfn_to_online_page() alone
> > > doesn't ensure that the page remains online. Is there a way to achieve
> > > that other than get_online_mems()?
> > 
> > You have to pin the page to make sure the hotplug is not going to
> > offline it.
> 
> I'm not exactly sure how pinning works -- is there a particular set of
> functions I should look at for that?

Pinning (get_page) on any page of the range will deffer the hotremove
operation and therefore the page tables cannot go away as well.

That being said, I thought the API is mostly for debugging and "you
should better know what you are doing" kinda thing (based on debugfs
being used here). If this is really useful in its current form and
should be used also while the hotremove is in progress then ok.
Once we actually get to rework the locking then we will have another
spot to handle but that's the life.
Mark Rutland May 28, 2019, 2:09 p.m. UTC | #14
On Mon, May 27, 2019 at 09:20:01AM +0200, Michal Hocko wrote:
> On Wed 22-05-19 17:42:13, Mark Rutland wrote:
> > On Thu, May 16, 2019 at 01:05:29PM +0200, Michal Hocko wrote:
> > > On Thu 16-05-19 11:23:54, Mark Rutland wrote:
> > > > On Wed, May 15, 2019 at 06:58:47PM +0200, Michal Hocko wrote:
> > > > > On Tue 14-05-19 14:30:05, Anshuman Khandual wrote:

> > I don't think that it's reasonable for this code to bring down the
> > kernel unless the kernel page tables are already corrupt. I agree we
> > should minimize the impact on other code, and I'm happy to penalize
> > ptdump so long as it's functional and safe.
> > 
> > I would like it to be possible to use the ptdump code to debug
> > hot-remove, so I'd rather not make the two mutually exclusive. I'd also
> > like it to be possible to use this in-the-field, and for that asking an
> > admin to potentially crash their system isn't likely to fly.
> 
> OK, fair enough.
> 
> > > > > I am asking because I would really love to make mem hotplug locking less
> > > > > scattered outside of the core MM than more. Most users simply shouldn't
> > > > > care. Pfn walkers should rely on pfn_to_online_page.
> > 
> > Jut to check, is your plan to limit access to the hotplug lock, or to
> > redesign the locking scheme?
> 
> To change the locking to lock hotpluged ranges rather than having a
> global lock as the operation is inherently pfn range scoped.

Ok. That sounds like something we could adapt the ptdump code to handle
without too much pain (modulo how much of that you want to expose
outside of the core mm code).

> > > > I'm not sure if that would help us here; IIUC pfn_to_online_page() alone
> > > > doesn't ensure that the page remains online. Is there a way to achieve
> > > > that other than get_online_mems()?
> > > 
> > > You have to pin the page to make sure the hotplug is not going to
> > > offline it.
> > 
> > I'm not exactly sure how pinning works -- is there a particular set of
> > functions I should look at for that?
> 
> Pinning (get_page) on any page of the range will deffer the hotremove
> operation and therefore the page tables cannot go away as well.
> 
> That being said, I thought the API is mostly for debugging and "you
> should better know what you are doing" kinda thing (based on debugfs
> being used here). If this is really useful in its current form and
> should be used also while the hotremove is in progress then ok.
> Once we actually get to rework the locking then we will have another
> spot to handle but that's the life.

Great.

FWIW, I'm happy to rework the ptdump code to help with that.

Thanks,
Mark.
diff mbox series

Patch

diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c
index 064163f..80171d1 100644
--- a/arch/arm64/mm/ptdump_debugfs.c
+++ b/arch/arm64/mm/ptdump_debugfs.c
@@ -7,7 +7,10 @@ 
 static int ptdump_show(struct seq_file *m, void *v)
 {
 	struct ptdump_info *info = m->private;
+
+	get_online_mems();
 	ptdump_walk_pgd(m, info);
+	put_online_mems();
 	return 0;
 }
 DEFINE_SHOW_ATTRIBUTE(ptdump);