
arm64: fix the overlap between the kernel image and vmalloc address

Message ID 58FDFD90.8050300@huawei.com (mailing list archive)
State New, archived
Headers show

Commit Message

zhong jiang April 24, 2017, 1:28 p.m. UTC
On 2017/4/24 18:44, Mark Rutland wrote:
> Hi,
>
> Thanks for reporting the problematic usage of is_vmalloc_addr() and
> vmalloc_to_page() here. That is a real problem that we need to address.
>
> On Mon, Apr 24, 2017 at 05:22:09PM +0800, zhongjiang wrote:
>> From: zhong jiang <zhongjiang@huawei.com>
>>
>> Recently, xiaojun reported the following issue.
>>
>> [ 4544.984139] Unable to handle kernel paging request at virtual address ffff804392800000
>> [ 4544.991995] pgd = ffff80096745f000
>> [ 4544.995369] [ffff804392800000] *pgd=0000000000000000
>> [ 4545.425416] fc00: ffff0000081fdfd0 0000ffffa9c87440
>> [ 4545.430248] [<ffff0000083a1000>] __memcpy+0x100/0x180
>> [ 4545.435253] [<ffff000008270f64>] read_kcore+0x21c/0x3b0
>> [ 4545.440429] [<ffff00000826340c>] proc_reg_read+0x64/0x90
>> [ 4545.445691] [<ffff0000081fb83c>] __vfs_read+0x1c/0x108
>> [ 4545.450779] [<ffff0000081fcb28>] vfs_read+0x80/0x130
>> [ 4545.455696] [<ffff0000081fe014>] SyS_read+0x44/0xa0
>> [ 4545.460528] [<ffff000008082f30>] el0_svc_naked+0x24/0x28
>> [ 4545.465790] Code: d503201f d503201f d503201f d503201f (a8c12027)
>> [ 4545.471852] ---[ end trace 4d1897f94759f461 ]---
>> [ 4545.476435] note: cat[8976] exited with preempt_count 2
>>
>> I found that the issue was introduced by commit f9040773b7bb
>> ("arm64: move kernel image to base of vmalloc area"). That patch
>> makes the kernel image overlap with the vmalloc area, so the vmalloc
>> area can now contain huge (section) page table entries, but
>> vmalloc_to_page() is not aware of that change, and the function is
>> generic code shared by every arch.
> So the issue is that we have the callchain below for a kernel image
> address:
>
> read_kcore()
> ->is_vmalloc_or_module_addr() // returns true
> ->vread()
> -->aligned_vread()
> --->vmalloc_to_page()
>
> In is_vmalloc{,or_module}_addr() we just check the addr against
> VMALLOC_START and VMALLOC_END, so they will return true for a kernel
> image address.
>
> Then, we call vmalloc_to_page(). While this only handles mappings made
> at page granularity, the kernel image mapping may have used sections. So
> this tries a bogus walk to the pte level.
>
> Evidently, we assume that any memory in the vmalloc area (or module
> areas) is mapped at page granularity. Is that always the case?
  I have not seen vmalloc area memory mapped with huge page table entries so far.
> AFAICT, memremap'd memory isn't necessarily, but vread() should skip
> that due to the VM_IOREMAP flag on the vma. The KASAN region should be
> below MODULES_VADDR on arm64. I'm not sure if there's anything else.
> Does it make sense to teach vmalloc_to_page() about section mappings?
   I do not know whether that has any limitations. If it is OK, it is worth trying.
> Should we special-case kernel image handling, e.g. with new
> is_kernel_image_addr() / kernel_image_to_page() helpers?
  Yes, that seems to be the best way to implement it without a performance regression.
> Do we need to shuffle things around such that the kernel image is not
> between VMALLOC_START and VMALLOC_END?
>
>> I fix it by changing the init mapping so that it stays consistent
>> with the vmalloc area mappings.
>>
>> Fixes: f9040773b7bb ("arm64: move kernel image to base of vmalloc area")
>> Reported-by: tan xiaojun <tanxiaojun@huawei.com>
>> Signed-off-by: zhong jiang <zhongjiang@huawei.com>
>> ---
>>  arch/arm64/mm/mmu.c | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index 17243e4..2d8b34d 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -185,7 +185,7 @@ static void alloc_init_pmd(pud_t *pud, unsigned long addr, unsigned long end,
>>
>>               /* try section mapping first */
>>               if (((addr | next | phys) & ~SECTION_MASK) == 0 &&
>> -                   !page_mappings_only) {
>> +                   !page_mappings_only && !is_vmalloc_addr((void *)addr)) {
>>                       /*
>>                        * Set the contiguous bit for the subsequent group of
>>                        * PMDs if its size and alignment are appropriate.
>> @@ -256,7 +256,8 @@ static void alloc_init_pud(pgd_t *pgd, unsigned long addr, unsigned long end,
>>               /*
>>                * For 4K granule only, attempt to put down a 1GB block
>>                */
>> -             if (use_1G_block(addr, next, phys) && !page_mappings_only) {
>> +             if (use_1G_block(addr, next, phys) && !page_mappings_only &&
>> +                                     !is_vmalloc_addr((void *)addr)) {
>>                       pud_set_huge(pud, phys, prot);
>>
> This will force the kernel image mappings to use page granularity, which
> will come at a significant TLB pressure cost, and would be incredibly
> unfortunate.
  You are right, it does seem so. The simplest way is to add an is_kernel_image_addr() helper.
> I would rather we solved this through other means.
>
> Thanks,
> Mark.
>
The following patch is the implementation. Any thoughts?

Signed-off-by: zhong jiang <zhongjiang@huawei.com>

  diff --git a/include/linux/mm.h b/include/linux/mm.h
index b84615b..851ac35 100644

Comments

Mark Rutland April 24, 2017, 3:51 p.m. UTC | #1
On Mon, Apr 24, 2017 at 09:28:48PM +0800, zhong jiang wrote:
> On 2017/4/24 18:44, Mark Rutland wrote:
> > So the issue is that we have the callchain below for a kernel image
> > address:
> >
> > read_kcore()
> > ->is_vmalloc_or_module_addr() // returns true
> > ->vread()
> > -->aligned_vread()
> > --->vmalloc_to_page()
> >
> > In is_vmalloc{,or_module}_addr() we just check the addr against
> > VMALLOC_START and VMALLOC_END, so they will return true for a kernel
> > image address.
> >
> > Then, we call vmalloc_to_page(). While this only handles mappings made
> > at page granularity, the kernel image mapping may have used sections. So
> > this tries a bogus walk to the pte level.

> > Should we special-case kernel image handling, e.g. with new
> > is_kernel_image_addr() / kernel_image_to_page() helpers?

>   Yes, that seems to be the best way to implement it without a performance regression.

> The following patch is the implementation. Any thoughts?
> 
> Signed-off-by: zhong jiang <zhongjiang@huawei.com>
> 
>   diff --git a/include/linux/mm.h b/include/linux/mm.h
> index b84615b..851ac35 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -475,6 +475,15 @@ static inline bool is_vmalloc_addr(const void *x)
>         return false;
>  #endif
>  }
> +
> +static inline bool is_kernel_image_addr(const void *x)
> +{
> +       unsigned long addr = (unsigned long)x;
> +
> +       return addr >= (unsigned long)_stext && addr < (unsigned long)_end;
> +
> +}
> +
>  #ifdef CONFIG_MMU
>  extern int is_vmalloc_or_module_addr(const void *x);
>  #else
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 3ca82d4..9a9ef65 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -229,6 +229,42 @@ int is_vmalloc_or_module_addr(const void *x)
>         return is_vmalloc_addr(x);
>  }
> 
> +static struct page *kernel_image_to_page(const void *kernel_addr, pgd_t *pgd)
> +{
> +       unsigned long addr = (unsigned long)kernel_addr;
> +       struct page *page = NULL;
> +       pud_t *pud;
> +       pmd_t *pmd;
> +       pte_t *pte;
> +
> +       if (pgd_none(*pgd))
> +               goto out;
> +
> +       pud = pud_offset(pgd, addr);
> +       if (pud_none(*pud))
> +               goto out;
> +
> +       if (pud_sect(*pud))
> +               return pud_page(*pud);

The *_sect() helpers are arch-specific, so we cannot use them in generic
code. This would need to be architecture-specific.

Secondly, this will return head page of the section regardless of which
page in the section the address corresponds to

> +
> +       pmd = pmd_offset(*pmd, addr);
> +       if (pmd_none(*pmd))
> +               goto out;
> +
> +       if (pmd_sect(*pmd))
> +               return pmd_page(*pmd);

Likewise on both counts.

> +
> +       pte = pte_offset_kernel(pmd, addr);
> +       if (pte_none(*pte))
> +               goto out;
> +
> +       page = pte_page(*pte);
> +
> +out:
> +       return page;
> +
> +}

Given we know what the address should map to, I don't think we need to
walk the page tables here. I think this can be:

static struct page *kernel_image_to_page(const void *addr)
{
	return virt_to_page(lm_alias(vmalloc_addr));
}

> +
>  /*
>    * Walk a vmap address to the struct page it maps.
>    */
> @@ -244,6 +280,9 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
>          */
>         VIRTUAL_BUG_ON(!is_vmalloc_or_module_addr(vmalloc_addr));
> 
> +       if (is_kernel_image_addr(vmalloc_addr))
> +               return kernel_image_to_page(vmalloc_addr, pgd);

It's not clear to me that this is the right place for this to live.

It might be best to code the kernel image logic directly in kcore (and
kmem), assuming everyone's OK with that approach.

Thanks,
Mark.
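
(A rough sketch, not from the thread, of the per-page correction Mark asks
for above: on a section-mapping hit the walker would need to add the page
offset within the block to the block's head page. The helper names are made
up here, and this would have to be arm64-specific code, since pud_sect() and
pmd_sect() are arch-specific; it also assumes struct pages within a block are
contiguous, as with SPARSEMEM_VMEMMAP.)

static struct page *pud_sect_to_page(pud_t *pud, unsigned long addr)
{
	/* head page of the 1GB block, plus the page offset within it */
	return pud_page(*pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
}

static struct page *pmd_sect_to_page(pmd_t *pmd, unsigned long addr)
{
	/* likewise for a 2MB block */
	return pmd_page(*pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
}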
Laura Abbott April 24, 2017, 5:52 p.m. UTC | #2
On 04/24/2017 08:51 AM, Mark Rutland wrote:
> On Mon, Apr 24, 2017 at 09:28:48PM +0800, zhong jiang wrote:
>> On 2017/4/24 18:44, Mark Rutland wrote:
>>> So the issue is that we have the callchain below for a kernel image
>>> address:
>>>
>>> read_kcore()
>>> ->is_vmalloc_or_module_addr() // returns true
>>> ->vread()
>>> -->aligned_vread()
>>> --->vmalloc_to_page()
>>>
>>> In is_vmalloc{,or_module}_addr() we just check the addr against
>>> VMALLOC_START and VMALLOC_END, so they will return true for a kernel
>>> image address.
>>>
>>> Then, we call vmalloc_to_page(). While this only handles mappings made
>>> at page granularity, the kernel image mapping may have used sections. So
>>> this tries a bogus walk to the pte level.
> 
>>> Should we special-case kernel image handling, e.g. with new
>>> is_kernel_image_addr() / kernel_image_to_page() helpers?
> 
>>   Yes, that seems to be the best way to implement it without a performance regression.
> 
>> The following patch is the implementation. Any thoughts?
>>
>> Signed-off-by: zhong jiang <zhongjiang@huawei.com>
>>
>>   diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index b84615b..851ac35 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -475,6 +475,15 @@ static inline bool is_vmalloc_addr(const void *x)
>>         return false;
>>  #endif
>>  }
>> +
>> +static inline bool is_kernel_image_addr(const void *x)
>> +{
>> +       unsigned long addr = (unsigned long)x;
>> +
>> +       return addr >= (unsigned long)_stext && addr < (unsigned long)_end;
>> +
>> +}
>> +
>>  #ifdef CONFIG_MMU
>>  extern int is_vmalloc_or_module_addr(const void *x);
>>  #else
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index 3ca82d4..9a9ef65 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -229,6 +229,42 @@ int is_vmalloc_or_module_addr(const void *x)
>>         return is_vmalloc_addr(x);
>>  }
>>
>> +static struct page *kernel_image_to_page(const void *kernel_addr, pgd_t *pgd)
>> +{
>> +       unsigned long addr = (unsigned long)kernel_addr;
>> +       struct page *page = NULL;
>> +       pud_t *pud;
>> +       pmd_t *pmd;
>> +       pte_t *pte;
>> +
>> +       if (pgd_none(*pgd))
>> +               goto out;
>> +
>> +       pud = pud_offset(pgd, addr);
>> +       if (pud_none(*pud))
>> +               goto out;
>> +
>> +       if (pud_sect(*pud))
>> +               return pud_page(*pud);
> 
> The *_sect() helpers are arch-specific, so we cannot use them in generic
> code. This would need to be architecture-specific.
> 
> Secondly, this will return head page of the section regardless of which
> page in the section the address corresponds to
> 
>> +
>> +       pmd = pmd_offset(*pmd, addr);
>> +       if (pmd_none(*pmd))
>> +               goto out;
>> +
>> +       if (pmd_sect(*pmd))
>> +               return pmd_page(*pmd);
> 
> Likewise on both counts.
> 
>> +
>> +       pte = pte_offset_kernel(pmd, addr);
>> +       if (pte_none(*pte))
>> +               goto out;
>> +
>> +       page = pte_page(*pte);
>> +
>> +out:
>> +       return page;
>> +
>> +}
> 
> Given we know what the address should map to, I don't think we need to
> walk the page tables here. I think this can be:
> 
> static struct page *kernel_image_to_page(const void *addr)
> {
> 	return virt_to_page(lm_alias(vmalloc_addr));
> }
> 
>> +
>>  /*
>>    * Walk a vmap address to the struct page it maps.
>>    */
>> @@ -244,6 +280,9 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
>>          */
>>         VIRTUAL_BUG_ON(!is_vmalloc_or_module_addr(vmalloc_addr));
>>
>> +       if (is_kernel_image_addr(vmalloc_addr))
>> +               return kernel_image_to_page(vmalloc_addr, pgd);
> 
> It's not clear to me that this is the right place for this to live.
> 
> It might be best to code the kernel image logic directly in kcore (and
> kmem), assuming everyone's OK with that approach.
> 

That will fix kcore and kmem but this will show up in other places too.
We've gone through and made sure that virt_addr_valid returns
true if and only if virt_to_page returns a valid address. I don't know
if we can make as strong a claim about is_vmalloc_addr and
vmalloc_to_page in all cases but is_vmalloc_addr should not return true
for the kernel image. That would at least let kcore fall back to
kern_addr_valid which should correctly handle the kernel image.
The suggestion to move the kernel image out of VMALLOC_START/VMALLOC_END
seems like the best approach although I haven't tried a prototype
at all.

Thanks,
Laura
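
(A minimal sketch of the is_vmalloc_addr() change Laura suggests above, not a
tested patch; it assumes the is_kernel_image_addr() helper proposed earlier in
the thread. With this, kcore would stop treating kernel image addresses as
vmalloc memory and fall back to the kern_addr_valid() path.)

static inline bool is_vmalloc_addr(const void *x)
{
#ifdef CONFIG_MMU
	unsigned long addr = (unsigned long)x;

	/*
	 * On arm64 the kernel image lies between VMALLOC_START and
	 * VMALLOC_END but is not vmalloc memory, so exclude it explicitly.
	 */
	if (is_kernel_image_addr(x))
		return false;

	return addr >= VMALLOC_START && addr < VMALLOC_END;
#else
	return false;
#endif
}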
Ard Biesheuvel April 24, 2017, 5:56 p.m. UTC | #3
On 24 April 2017 at 18:52, Laura Abbott <labbott@redhat.com> wrote:
> On 04/24/2017 08:51 AM, Mark Rutland wrote:
>> On Mon, Apr 24, 2017 at 09:28:48PM +0800, zhong jiang wrote:
>>> On 2017/4/24 18:44, Mark Rutland wrote:
>>>> So the issue is that we have the callchain below for a kernel image
>>>> address:
>>>>
>>>> read_kcore()
>>>> ->is_vmalloc_or_module_addr() // returns true
>>>> ->vread()
>>>> -->aligned_vread()
>>>> --->vmalloc_to_page()
>>>>
>>>> In is_vmalloc{,or_module}_addr() we just check the addr against
>>>> VMALLOC_START and VMALLOC_END, so they will return true for a kernel
>>>> image address.
>>>>
>>>> Then, we call vmalloc_to_page(). While this only handles mappings made
>>>> at page granularity, the kernel image mapping may have used sections. So
>>>> this tries a bogus walk to the pte level.
>>
>>>> Should we special-case kernel image handling, e.g. with new
>>>> is_kernel_image_addr() / kernel_image_to_page() helpers?
>>
>>>   Yes, that seems to be the best way to implement it without a performance regression.
>>
>>> The following patch is the implementation. Any thoughts?
>>>
>>> Signed-off-by: zhong jiang <zhongjiang@huawei.com>
>>>
>>>   diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index b84615b..851ac35 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -475,6 +475,15 @@ static inline bool is_vmalloc_addr(const void *x)
>>>         return false;
>>>  #endif
>>>  }
>>> +
>>> +static inline bool is_kernel_image_addr(const void *x)
>>> +{
>>> +       unsigned long addr = (unsigned long)x;
>>> +
>>> +       return addr >= (unsigned long)_stext && addr < (unsigned long)_end;
>>> +
>>> +}
>>> +
>>>  #ifdef CONFIG_MMU
>>>  extern int is_vmalloc_or_module_addr(const void *x);
>>>  #else
>>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>>> index 3ca82d4..9a9ef65 100644
>>> --- a/mm/vmalloc.c
>>> +++ b/mm/vmalloc.c
>>> @@ -229,6 +229,42 @@ int is_vmalloc_or_module_addr(const void *x)
>>>         return is_vmalloc_addr(x);
>>>  }
>>>
>>> +static struct page *kernel_image_to_page(const void *kernel_addr, pgd_t *pgd)
>>> +{
>>> +       unsigned long addr = (unsigned long)kernel_addr;
>>> +       struct page *page = NULL;
>>> +       pud_t *pud;
>>> +       pmd_t *pmd;
>>> +       pte_t *pte;
>>> +
>>> +       if (pgd_none(*pgd))
>>> +               goto out;
>>> +
>>> +       pud = pud_offset(pgd, addr);
>>> +       if (pud_none(*pud))
>>> +               goto out;
>>> +
>>> +       if (pud_sect(*pud))
>>> +               return pud_page(*pud);
>>
>> The *_sect() helpers are arch-specific, so we cannot use them in generic
>> code. This would need to be architecture-specific.
>>
>> Secondly, this will return head page of the section regardless of which
>> page in the section the address corresponds to
>>
>>> +
>>> +       pmd = pmd_offset(*pmd, addr);
>>> +       if (pmd_none(*pmd))
>>> +               goto out;
>>> +
>>> +       if (pmd_sect(*pmd))
>>> +               return pmd_page(*pmd);
>>
>> Likewise on both counts.
>>
>>> +
>>> +       pte = pte_offset_kernel(pmd, addr);
>>> +       if (pte_none(*pte))
>>> +               goto out;
>>> +
>>> +       page = pte_page(*pte);
>>> +
>>> +out:
>>> +       return page;
>>> +
>>> +}
>>
>> Given we know what the address should map to, I don't think we need to
>> walk the page tables here. I think this can be:
>>
>> static struct page *kernel_image_to_page(const void *addr)
>> {
>>       return virt_to_page(lm_alias(vmalloc_addr));
>> }
>>
>>> +
>>>  /*
>>>    * Walk a vmap address to the struct page it maps.
>>>    */
>>> @@ -244,6 +280,9 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
>>>          */
>>>         VIRTUAL_BUG_ON(!is_vmalloc_or_module_addr(vmalloc_addr));
>>>
>>> +       if (is_kernel_image_addr(vmalloc_addr))
>>> +               return kernel_image_to_page(vmalloc_addr, pgd);
>>
>> It's not clear to me that this is the right place for this to live.
>>
>> It might be best to code the kernel image logic directly in kcore (and
>> kmem), assuming everyone's OK with that approach.
>>
>
> That will fix kcore and kmem but this will show up in other places too.
> We've gone through and made sure that virt_addr_valid returns
> true if and only if virt_to_page returns a valid address. I don't know
> if we can make as strong a claim about is_vmalloc_addr and
> vmalloc_to_page in all cases but is_vmalloc_addr should not return true
> for the kernel image. That would at least let kcore fall back to
> kern_addr_valid which should correctly handle the kernel image.
> The suggestion to move the kernel image out of VMALLOC_START/VMALLOC_END
> seems like the best approach although I haven't tried a prototype
> at all.
>

Moving the kernel into the vmalloc region was kind of the point, for
KASLR. Undoing that means either disabling KASLR, or splitting the
vmalloc region into a KASLR region only for the kernel, and a vmalloc
region like we had when vmlinux lived in the linear region.

In general, I think we should be able to deal with different kinds of
mappings with different granularity in the vmalloc region. If
necessary, we could introduce a VM_xxx flag for the kernel to
distinguish such regions from ordinary VM_MAP regions.
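
(A rough sketch of the VM_xxx idea, under the assumption that a new flag,
called VM_KERNEL here purely for illustration and with an arbitrarily chosen
value, is added next to the existing VM_* flags, and that arm64's map_kernel()
tags the early vm_structs it already registers for the kernel segments.)

#define VM_KERNEL	0x00000200	/* hypothetical: covers the kernel image */

static void __init register_kernel_segment(void *va_start, void *va_end,
					   struct vm_struct *vma)
{
	vma->addr	= va_start;
	vma->size	= (unsigned long)(va_end - va_start);
	vma->flags	= VM_MAP | VM_KERNEL;
	vma->caller	= __builtin_return_address(0);

	vm_area_add_early(vma);
}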
zhong jiang April 25, 2017, 8:13 a.m. UTC | #4
On 2017/4/25 1:56, Ard Biesheuvel wrote:
> On 24 April 2017 at 18:52, Laura Abbott <labbott@redhat.com> wrote:
>> On 04/24/2017 08:51 AM, Mark Rutland wrote:
>>> On Mon, Apr 24, 2017 at 09:28:48PM +0800, zhong jiang wrote:
>>>> On 2017/4/24 18:44, Mark Rutland wrote:
>>>>> So the issue is that we have the callchain below for a kernel image
>>>>> address:
>>>>>
>>>>> read_kcore()
>>>>> ->is_vmalloc_or_module_addr() // returns true
>>>>> ->vread()
>>>>> -->aligned_vread()
>>>>> --->vmalloc_to_page()
>>>>>
>>>>> In is_vmalloc{,or_module}_addr() we just check the addr against
>>>>> VMALLOC_START and VMALLOC_END, so they will return true for a kernel
>>>>> image address.
>>>>>
>>>>> Then, we call vmalloc_to_page(). While this only handles mappings made
>>>>> at page granularity, the kernel image mapping may have used sections. So
>>>>> this tries a bogus walk to the pte level.
>>>>> Should we special-case kernel image handling, e.g. with new
>>>>> is_kernel_image_addr() / kernel_image_to_page() helpers?
>>>>   Yes, that seems to be the best way to implement it without a performance regression.
>>>> The following patch is the implementation. Any thoughts?
>>>>
>>>> Signed-off-by: zhong jiang <zhongjiang@huawei.com>
>>>>
>>>>   diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>> index b84615b..851ac35 100644
>>>> --- a/include/linux/mm.h
>>>> +++ b/include/linux/mm.h
>>>> @@ -475,6 +475,15 @@ static inline bool is_vmalloc_addr(const void *x)
>>>>         return false;
>>>>  #endif
>>>>  }
>>>> +
>>>> +static inline bool is_kernel_image_addr(const void *x)
>>>> +{
>>>> +       unsigned long addr = (unsigned long)x;
>>>> +
>>>> +       return addr >= (unsigned long)_stext && addr < (unsigned long)_end;
>>>> +
>>>> +}
>>>> +
>>>>  #ifdef CONFIG_MMU
>>>>  extern int is_vmalloc_or_module_addr(const void *x);
>>>>  #else
>>>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>>>> index 3ca82d4..9a9ef65 100644
>>>> --- a/mm/vmalloc.c
>>>> +++ b/mm/vmalloc.c
>>>> @@ -229,6 +229,42 @@ int is_vmalloc_or_module_addr(const void *x)
>>>>         return is_vmalloc_addr(x);
>>>>  }
>>>>
>>>> +static struct page *kernel_image_to_page(const void *kernel_addr, pgd_t *pgd)
>>>> +{
>>>> +       unsigned long addr = (unsigned long)kernel_addr;
>>>> +       struct page *page = NULL;
>>>> +       pud_t *pud;
>>>> +       pmd_t *pmd;
>>>> +       pte_t *pte;
>>>> +
>>>> +       if (pgd_none(*pgd))
>>>> +               goto out;
>>>> +
>>>> +       pud = pud_offset(pgd, addr);
>>>> +       if (pud_none(*pud))
>>>> +               goto out;
>>>> +
>>>> +       if (pud_sect(*pud))
>>>> +               return pud_page(*pud);
>>> The *_sect() helpers are arch-specific, so we cannot use them in generic
>>> code. This would need to be architecture-specific.
>>>
>>> Secondly, this will return head page of the section regardless of which
>>> page in the section the address corresponds to
>>>
>>>> +
>>>> +       pmd = pmd_offset(*pmd, addr);
>>>> +       if (pmd_none(*pmd))
>>>> +               goto out;
>>>> +
>>>> +       if (pmd_sect(*pmd))
>>>> +               return pmd_page(*pmd);
>>> Likewise on both counts.
>>>
>>>> +
>>>> +       pte = pte_offset_kernel(pmd, addr);
>>>> +       if (pte_none(*pte))
>>>> +               goto out;
>>>> +
>>>> +       page = pte_page(*pte);
>>>> +
>>>> +out:
>>>> +       return page;
>>>> +
>>>> +}
>>> Given we know what the address should map to, I don't think we need to
>>> walk the page tables here. I think this can be:
>>>
>>> static struct page *kernel_image_to_page(const void *addr)
>>> {
>>>       return virt_to_page(lm_alias(vmalloc_addr));
>>> }
>>>
>>>> +
>>>>  /*
>>>>    * Walk a vmap address to the struct page it maps.
>>>>    */
>>>> @@ -244,6 +280,9 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
>>>>          */
>>>>         VIRTUAL_BUG_ON(!is_vmalloc_or_module_addr(vmalloc_addr));
>>>>
>>>> +       if (is_kernel_image_addr(vmalloc_addr))
>>>> +               return kernel_image_to_page(vmalloc_addr, pgd);
>>> It's not clear to me that this is the right place for this to live.
>>>
>>> It might be best to code the kernel image logic directly in kcore (and
>>> kmem), assuming everyone's OK with that approach.
>>>
>> That will fix kcore and kmem but this will show up in other places too.
>> We've gone through and made sure that virt_addr_valid returns
>> true if and only if virt_to_page returns a valid address. I don't know
>> if we can make as strong a claim about is_vmalloc_addr and
>> vmalloc_to_page in all cases but is_vmalloc_addr should not return true
>> for the kernel image. That would at least let kcore fall back to
>> kern_addr_valid which should correctly handle the kernel image.
>> The suggestion to move the kernel image out of VMALLOC_START/VMALLOC_END
>> seems like the best approach although I haven't tried a prototype
>> at all.
>>
> Moving the kernel into the vmalloc region was kind of the point, for
> KASLR. Undoing that means either disabling KASLR, or splitting the
> vmalloc region into a KASLR region only for the kernel, and a vmalloc
> region like we had when vmlinux lived in the linear region.
>
> In general, I think we should be able to deal with different kinds of
> mappings with different granularity in the vmalloc region. If
> necessary, we could introduce a VM_xxx flag for the kernel to
> distinguish such regions from ordinary VM_MAP regions.
> .
>
 Hi, Ard

 My initial thought was the same as yours. vmalloc_to_page() is generic code
 used by every arch, and only arm64 changes the mapping granularity. If we
 adjusted the function, each arch would have to implement its own version, or
 we would need CONFIG_ARM64-specific code to exclude the other arches, since
 the pud_* helpers are arch-specific.
 
 Mark proposed that adding a kernel_image_to_page() helper would be simpler,
 and I prefer Mark's approach. What do you think? Am I missing anything?
 
 It is up to you and the other maintainers.
 
 Thanks
 zhongjiang
zhong jiang April 25, 2017, 2:11 p.m. UTC | #5
On 2017/4/24 23:51, Mark Rutland wrote:
> On Mon, Apr 24, 2017 at 09:28:48PM +0800, zhong jiang wrote:
>> On 2017/4/24 18:44, Mark Rutland wrote:
>>> So the issue is that we have the callchain below for a kernel image
>>> address:
>>>
>>> read_kcore()
>>> ->is_vmalloc_or_module_addr() // returns true
>>> ->vread()
>>> -->aligned_vread()
>>> --->vmalloc_to_page()
>>>
>>> In is_vmalloc{,or_module}_addr() we just check the addr against
>>> VMALLOC_START and VMALLOC_END, so they will return true for a kernel
>>> image address.
>>>
>>> Then, we call vmalloc_to_page(). While this only handles mappings made
>>> at page granularity, the kernel image mapping may have used sections. So
>>> this tries a bogus walk to the pte level.
>>> Should we special-case kernel image handling, e.g. with new
>>> is_kernel_image_addr() / kernel_image_to_page() helpers?
>>   Yes, that seems to be the best way to implement it without a performance regression.
>> The following patch is the implementation. Any thoughts?
>>
>> Signed-off-by: zhong jiang <zhongjiang@huawei.com>
>>
>>   diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index b84615b..851ac35 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -475,6 +475,15 @@ static inline bool is_vmalloc_addr(const void *x)
>>         return false;
>>  #endif
>>  }
>> +
>> +static inline bool is_kernel_image_addr(const void *x)
>> +{
>> +       unsigned long addr = (unsigned long)x;
>> +
>> +       return addr >= (unsigned long)_stext && addr < (unsigned long)_end;
>> +
>> +}
>> +
>>  #ifdef CONFIG_MMU
>>  extern int is_vmalloc_or_module_addr(const void *x);
>>  #else
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index 3ca82d4..9a9ef65 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -229,6 +229,42 @@ int is_vmalloc_or_module_addr(const void *x)
>>         return is_vmalloc_addr(x);
>>  }
>>
>> +static struct page *kernel_image_to_page(const void *kernel_addr, pgd_t *pgd)
>> +{
>> +       unsigned long addr = (unsigned long)kernel_addr;
>> +       struct page *page = NULL;
>> +       pud_t *pud;
>> +       pmd_t *pmd;
>> +       pte_t *pte;
>> +
>> +       if (pgd_none(*pgd))
>> +               goto out;
>> +
>> +       pud = pud_offset(pgd, addr);
>> +       if (pud_none(*pud))
>> +               goto out;
>> +
>> +       if (pud_sect(*pud))
>> +               return pud_page(*pud);
> The *_sect() helpers are arch-specific, so we cannot use them in generic
> code. This would need to be architecture-specific.
>
> Secondly, this will return head page of the section regardless of which
> page in the section the address corresponds to
  Yes, the code was put together in too much of a hurry and is messy. We will
  send a formal patch later, if you are still OK with the approach.

 Thanks
 zhongjiang
>> +
>> +       pmd = pmd_offset(*pmd, addr);
>> +       if (pmd_none(*pmd))
>> +               goto out;
>> +
>> +       if (pmd_sect(*pmd))
>> +               return pmd_page(*pmd);
> Likewise on both counts.
>
>> +
>> +       pte = pte_offset_kernel(pmd, addr);
>> +       if (pte_none(*pte))
>> +               goto out;
>> +
>> +       page = pte_page(*pte);
>> +
>> +out:
>> +       return page;
>> +
>> +}
> Given we know what the address should map to, I don't think we need to
> walk the page tables here. I think this can be:
>
> static struct page *kernel_image_to_page(const void *addr)
> {
> 	return virt_to_page(lm_alias(vmalloc_addr));
> }
>
>> +
>>  /*
>>    * Walk a vmap address to the struct page it maps.
>>    */
>> @@ -244,6 +280,9 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
>>          */
>>         VIRTUAL_BUG_ON(!is_vmalloc_or_module_addr(vmalloc_addr));
>>
>> +       if (is_kernel_image_addr(vmalloc_addr))
>> +               return kernel_image_to_page(vmalloc_addr, pgd);
> It's not clear to me that this is the right place for this to live.
>
> It might be best to code the kernel image logic directly in kcore (and
> kmem), assuming everyone's OK with that approach.
>
> Thanks,
> Mark.
> .
>
Mark Rutland April 25, 2017, 2:51 p.m. UTC | #6
On Mon, Apr 24, 2017 at 10:52:08AM -0700, Laura Abbott wrote:
> On 04/24/2017 08:51 AM, Mark Rutland wrote:
> > On Mon, Apr 24, 2017 at 09:28:48PM +0800, zhong jiang wrote:

> >>  /*
> >>    * Walk a vmap address to the struct page it maps.
> >>    */
> >> @@ -244,6 +280,9 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
> >>          */
> >>         VIRTUAL_BUG_ON(!is_vmalloc_or_module_addr(vmalloc_addr));
> >>
> >> +       if (is_kernel_image_addr(vmalloc_addr))
> >> +               return kernel_image_to_page(vmalloc_addr, pgd);
> > 
> > It's not clear to me that this is the right place for this to live.
> > 
> > It might be best to code the kernel image logic directly in kcore (and
> > kmem), assuming everyone's OK with that approach.
> > 
> 
> That will fix kcore and kmem but this will show up in other places too.

True.

> We've gone through and made sure that virt_addr_valid returns
> true if and only if virt_to_page returns a valid address. I don't know
> if we can make as strong a claim about is_vmalloc_addr and
> vmalloc_to_page in all cases but is_vmalloc_addr should not return true
> for the kernel image. That would at least let kcore fall back to
> kern_addr_valid which should correctly handle the kernel image.

That would largely be my preference.

My fear is that other users of is_vmalloc_addr() are doing the right
thing for the kernel image today (e.g. not doing virt_to_phys()),
because they see it as a vmalloc addr.

So we might have to audit all of those.

> The suggestion to move the kernel image out of VMALLOC_START/VMALLOC_END
> seems like the best approach although I haven't tried a prototype
> at all.

Given that (AFAICT) we're the only architecture that puts the kernel in
the vmalloc area, I agree that this is likely to be the simplest correct
approach. The interaction with KASLR is somewhat unfortunate.

Thanks,
Mark.
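
(An illustration, not code from the thread, of the virt_to_phys() hazard Mark
mentions above: on arm64, virt_to_phys()/virt_to_page() are only valid for
linear-map addresses, so any caller that stops treating a kernel image address
as vmalloc must first translate it via its linear alias. The helper name is
hypothetical.)

static phys_addr_t kernel_image_addr_to_phys(const void *x)
{
	/*
	 * Calling virt_to_phys(x) directly would be wrong here: x is a
	 * kernel image address, not a linear-map address. lm_alias()
	 * converts it to the equivalent linear-map address first.
	 */
	return virt_to_phys(lm_alias(x));
}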
Mark Rutland April 25, 2017, 3:02 p.m. UTC | #7
On Mon, Apr 24, 2017 at 06:56:43PM +0100, Ard Biesheuvel wrote:
> On 24 April 2017 at 18:52, Laura Abbott <labbott@redhat.com> wrote:
> > On 04/24/2017 08:51 AM, Mark Rutland wrote:
> >> On Mon, Apr 24, 2017 at 09:28:48PM +0800, zhong jiang wrote:

> >> static struct page *kernel_image_to_page(const void *addr)
> >> {
> >>       return virt_to_page(lm_alias(vmalloc_addr));
> >> }
> >>
> >>> +
> >>>  /*
> >>>    * Walk a vmap address to the struct page it maps.
> >>>    */
> >>> @@ -244,6 +280,9 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
> >>>          */
> >>>         VIRTUAL_BUG_ON(!is_vmalloc_or_module_addr(vmalloc_addr));
> >>>
> >>> +       if (is_kernel_image_addr(vmalloc_addr))
> >>> +               return kernel_image_to_page(vmalloc_addr, pgd);
> >>
> >> It's not clear to me that this is the right place for this to live.
> >>
> >> It might be best to code the kernel image logic directly in kcore (and
> >> kmem), assuming everyone's OK with that approach.
> >>
> >
> > That will fix kcore and kmem but this will show up in other places too.
> > We've gone through and made sure that virt_addr_valid returns
> > true if and only if virt_to_page returns a valid address. I don't know
> > if we can make as strong a claim about is_vmalloc_addr and
> > vmalloc_to_page in all cases but is_vmalloc_addr should not return true
> > for the kernel image. That would at least let kcore fall back to
> > kern_addr_valid which should correctly handle the kernel image.
> > The suggestion to move the kernel image out of VMALLOC_START/VMALLOC_END
> > seems like the best approach although I haven't tried a prototype
> > at all.
> 
> Moving the kernel into the vmalloc region was kind of the point, for
> KASLR. Undoing that means either disabling KASLR, or splitting the
> vmalloc region into a KASLR region only for the kernel, and a vmalloc
> region like we had when vmlinux lived in the linear region.

> AFAICT, x86 doesn't place the kernel in the vmalloc region for its KASLR
implementation. Is that just to avoid the issues that we're seeing, or
> are there other constraints on x86?

> In general, I think we should be able to deal with different kinds of
> mappings with different granularity in the vmalloc region. If
> necessary, we could introduce a VM_xxx flag for the kernel to
> distinguish such regions from ordinary VM_MAP regions.

I don't think that vmalloc_to_page() should have to deal with anything
other than the usual page-granular memory mappings in the vmalloc area,
as it currently does. We'd have to write a page table walker per-arch
for that to work, and the only thing it would benefit is arm64's kernel
image mapping.

Adding VM_xxx (e.g. VM_KERNEL) sounds sane to me regardless of anything
else.

That still leaves the question as to what is_vmalloc_addr(addr) (should)
imply about addr, though.

Thanks,
Mark.
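
(If a VM_KERNEL-style flag were added, as sketched earlier, one way generic
code could use it is in vread(), mirroring how VM_IOREMAP areas are already
treated as holes. A sketch under that assumption, not the actual mm/vmalloc.c
code; the helper name is made up.)

static bool vm_region_is_readable(const struct vm_struct *vm)
{
	/*
	 * Neither ioremap space nor the kernel image region is ordinary
	 * page-granular vmalloc memory that aligned_vread() can safely
	 * walk, so vread() would zero-fill both instead of copying.
	 */
	return !(vm->flags & (VM_IOREMAP | VM_KERNEL));
}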
Ard Biesheuvel April 25, 2017, 3:18 p.m. UTC | #8
On 25 April 2017 at 16:02, Mark Rutland <mark.rutland@arm.com> wrote:
> On Mon, Apr 24, 2017 at 06:56:43PM +0100, Ard Biesheuvel wrote:
>> On 24 April 2017 at 18:52, Laura Abbott <labbott@redhat.com> wrote:
>> > On 04/24/2017 08:51 AM, Mark Rutland wrote:
>> >> On Mon, Apr 24, 2017 at 09:28:48PM +0800, zhong jiang wrote:
>
>> >> static struct page *kernel_image_to_page(const void *addr)
>> >> {
>> >>       return virt_to_page(lm_alias(vmalloc_addr));
>> >> }
>> >>
>> >>> +
>> >>>  /*
>> >>>    * Walk a vmap address to the struct page it maps.
>> >>>    */
>> >>> @@ -244,6 +280,9 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
>> >>>          */
>> >>>         VIRTUAL_BUG_ON(!is_vmalloc_or_module_addr(vmalloc_addr));
>> >>>
>> >>> +       if (is_kernel_image_addr(vmalloc_addr))
>> >>> +               return kernel_image_to_page(vmalloc_addr, pgd);
>> >>
>> >> It's not clear to me that this is the right place for this to live.
>> >>
>> >> It might be best to code the kernel image logic directly in kcore (and
>> >> kmem), assuming everyone's OK with that approach.
>> >>
>> >
>> > That will fix kcore and kmem but this will show up in other places too.
>> > We've gone through and made sure that virt_addr_valid returns
>> > true if and only if virt_to_page returns a valid address. I don't know
>> > if we can make as strong a claim about is_vmalloc_addr and
>> > vmalloc_to_page in all cases but is_vmalloc_addr should not return true
>> > for the kernel image. That would at least let kcore fall back to
>> > kern_addr_valid which should correctly handle the kernel image.
>> > The suggestion to move the kernel image out of VMALLOC_START/VMALLOC_END
>> > seems like the best approach although I haven't tried a prototype
>> > at all.
>>
>> Moving the kernel into the vmalloc region was kind of the point, for
>> KASLR. Undoing that means either disabling KASLR, or splitting the
>> vmalloc region into a KASLR region only for the kernel, and a vmalloc
>> region like we had when vmlinux lived in the linear region.
>
> AFAICT, x86 doesn't place the kernel in the vmalloc region for its KASLR
> implementation. Is that just to avoid the issues that we're seeing, or
> are there other constraints on x86?
>

Given how a lot of architectures put kernel modules in the vmalloc
space, I don't think there is generally an issue with putting the
kernel there as well, other than what we are currently experiencing.

As discussed, ioremap regions are also likely to be mapped using block
mappings these days, and also reside between VMALLOC_START and
VMALLOC_END.

So in my opinion, since
a) we already put text and data in the vmalloc region, and
b) we already use block mappings in the vmalloc region

putting the kernel there is only a minor variation of what we already
have to deal with, and so I don't think there is anything wrong with
it. We simply have to tweak some tests here and there, but nothing
fundamental afaict.

>> In general, I think we should be able to deal with different kinds of
>> mappings with different granularity in the vmalloc region. If
>> necessary, we could introduce a VM_xxx flag for the kernel to
>> distinguish such regions from ordinary VM_MAP regions.
>
> I don't think that vmalloc_to_page() should have to deal with anything
> other than the usual page-granular memory mappings in the vmalloc area,
> as it currently does. We'd have to write a page table walker per-arch
> for that to work, and the only thing it would benefit is arm64's kernel
> image mapping.
>
> Adding VM_xxx (e.g. VM_KERNEL) sounds sane to me regardless of anything
> else.
>
> That still leaves the question as to what is_vmalloc_addr(addr) (should)
> imply about addr, though.
>
> Thanks,
> Mark.

Patch

--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -475,6 +475,15 @@  static inline bool is_vmalloc_addr(const void *x)
        return false;
 #endif
 }
+
+static inline bool is_kernel_image_addr(const void *x)
+{
+       unsigned long addr = (unsigned long)x;
+
+       return addr >= (unsigned long)_stext && addr < (unsigned long)_end;
+
+}
+
 #ifdef CONFIG_MMU
 extern int is_vmalloc_or_module_addr(const void *x);
 #else
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 3ca82d4..9a9ef65 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -229,6 +229,42 @@  int is_vmalloc_or_module_addr(const void *x)
        return is_vmalloc_addr(x);
 }

+static struct page *kernel_image_to_page(const void *kernel_addr, pgd_t *pgd)
+{
+       unsigned long addr = (unsigned long)kernel_addr;
+       struct page *page = NULL;
+       pud_t *pud;
+       pmd_t *pmd;
+       pte_t *pte;
+
+       if (pgd_none(*pgd))
+               goto out;
+
+       pud = pud_offset(pgd, addr);
+       if (pud_none(*pud))
+               goto out;
+
+       if (pud_sect(*pud))
+               return pud_page(*pud);
+
+       pmd = pmd_offset(*pmd, addr);
+       if (pmd_none(*pmd))
+               goto out;
+
+       if (pmd_sect(*pmd))
+               return pmd_page(*pmd);
+
+       pte = pte_offset_kernel(pmd, addr);
+       if (pte_none(*pte))
+               goto out;
+
+       page = pte_page(*pte);
+
+out:
+       return page;
+
+}
+
 /*
   * Walk a vmap address to the struct page it maps.
   */
@@ -244,6 +280,9 @@  struct page *vmalloc_to_page(const void *vmalloc_addr)
         */
        VIRTUAL_BUG_ON(!is_vmalloc_or_module_addr(vmalloc_addr));

+       if (is_kernel_image_addr(vmalloc_addr))
+               return kernel_image_to_page(vmalloc_addr, pgd);
+
        if (!pgd_none(*pgd)) {
                pud_t *pud = pud_offset(pgd, addr);
                if (!pud_none(*pud)) {