diff mbox series

[v3,08/17] arm64, trans_pgd: make trans_pgd_map_page generic

Message ID 20190821183204.23576-9-pasha.tatashin@soleen.com (mailing list archive)
State New, archived
Headers show
Series arm64: MMU enabled kexec relocation | expand

Commit Message

Pasha Tatashin Aug. 21, 2019, 6:31 p.m. UTC
Currently, trans_pgd_map_page has assumptions that are relevant to
hibernate. But, to make it generic we must allow it to use any allocator
and also, can't assume that entries do not exist in the page table
already. Also, we can't use init_mm here.

Also, add "flags" for trans_pgd_info, they are going to be used
in copy functions once they are generalized.

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
---
 arch/arm64/include/asm/trans_pgd.h | 39 +++++++++++++-
 arch/arm64/kernel/hibernate.c      | 13 ++++-
 arch/arm64/mm/trans_pgd.c          | 82 +++++++++++++++++++++---------
 3 files changed, 107 insertions(+), 27 deletions(-)

Comments

James Morse Sept. 6, 2019, 3:20 p.m. UTC | #1
Hi Pavel,

On 21/08/2019 19:31, Pavel Tatashin wrote:
> Currently, trans_pgd_map_page has assumptions that are relevant to
> hibernate. But, to make it generic we must allow it to use any allocator

Sounds familiar: you removed this in patch 2.


> and also, can't assume that entries do not exist in the page table
> already.

This thing creates a set of page tables to map one page: the relocation code.
This is mapped in TTBR0_EL1.
It can assume existing entries do not exist, because it creates the single-entry levels as
it goes. Kexec also needs to map precisely one page for relocation. You don't need to
generalise this.

'trans_pgd_create_copy()' is what creates a copy the linear map. This is mapped in TTBR1_EL1.

There is no reason for kexec to behave differently here.


> Also, we can't use init_mm here.

Why not? arm64's pgd_populate() doesn't use the mm. It's only there to make it obvious
this is an EL1 mapping we are creating. We use the kernel-asid with the new mapping.

The __ version is a lot less readable. Please don't use the page tables as an array: this
is what the offset helpers are for.


> Also, add "flags" for trans_pgd_info, they are going to be used
> in copy functions once they are generalized.

You don't need to 'generalize' this to support hypothetical users.
There are only two: hibernate and kexec, both of which are very specialised. Making these
things top-level marionette strings will tangle the logic.

The copy_p?d() functions should decide if they should manipulate _this_ entry based on
_this_ entry and the kernel configuration. This is only really done in _copy_pte(), which
is where it should stay.


> diff --git a/arch/arm64/include/asm/trans_pgd.h b/arch/arm64/include/asm/trans_pgd.h
> index c7b5402b7d87..e3d022b1b526 100644
> --- a/arch/arm64/include/asm/trans_pgd.h
> +++ b/arch/arm64/include/asm/trans_pgd.h
> @@ -11,10 +11,45 @@
>  #include <linux/bits.h>
>  #include <asm/pgtable-types.h>
>  
> +/*
> + * trans_alloc_page
> + *	- Allocator that should return exactly one uninitilaized page, if this
> + *	 allocator fails, trans_pgd returns -ENOMEM error.
> + *
> + * trans_alloc_arg
> + *	- Passed to trans_alloc_page as an argument

This is very familiar.


> + * trans_flags
> + *	- bitmap with flags that control how page table is filled.
> + *	  TRANS_MKWRITE: during page table copy make PTE, PME, and PUD page
> + *			 writeable by removing RDONLY flag from PTE.

Why would you ever keep the read-only flags in a set of page tables that exist to let you
overwrite memory?


> + *	  TRANS_MKVALID: during page table copy, if PTE present, but not valid,
> + *			 make it valid.

Please keep this logic together with the !pte_none(pte) and debug_pagealloc_enabled()
check, where it is today.

Making an entry valid without those checks should never be necessary.


> + *	  TRANS_CHECKPFN: During page table copy, for every PTE entry check that
> + *			  PFN that this PTE points to is valid. Otherwise return
> + *			  -ENXIO

Hibernate does this when inventing a new mapping. This is how we check the kernel
should be able to read/write this page. If !pfn_valid(), the page should not be mapped.

Why do you need to turn this off?

It us only necessary at the leaf level, and only if debug-pagealloc is in use. Please keep
all these bits together, as its much harder to understand why this entry needs inventing
when its split up like this.



> diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
> index 6ee81bbaa37f..17426dc8cb54 100644
> --- a/arch/arm64/kernel/hibernate.c
> +++ b/arch/arm64/kernel/hibernate.c
> @@ -179,6 +179,12 @@ int arch_hibernation_header_restore(void *addr)
>  }
>  EXPORT_SYMBOL(arch_hibernation_header_restore);
>  
> +static void *
> +hibernate_page_alloc(void *arg)
> +{
> +	return (void *)get_safe_page((gfp_t)(unsigned long)arg);
> +}
> +
>  /*
>   * Copies length bytes, starting at src_start into an new page,
>   * perform cache maintentance, then maps it at the specified address low
> @@ -195,6 +201,11 @@ static int create_safe_exec_page(void *src_start, size_t length,
>  				 unsigned long dst_addr,
>  				 phys_addr_t *phys_dst_addr)
>  {
> +	struct trans_pgd_info trans_info = {
> +		.trans_alloc_page	= hibernate_page_alloc,
> +		.trans_alloc_arg	= (void *)GFP_ATOMIC,
> +		.trans_flags		= 0,
> +	};
>  	void *page = (void *)get_safe_page(GFP_ATOMIC);
>  	pgd_t *trans_pgd;
>  	int rc;
> @@ -209,7 +220,7 @@ static int create_safe_exec_page(void *src_start, size_t length,
>  	if (!trans_pgd)
>  		return -ENOMEM;
>  
> -	rc = trans_pgd_map_page(trans_pgd, page, dst_addr,
> +	rc = trans_pgd_map_page(&trans_info, trans_pgd, page, dst_addr,
>  				PAGE_KERNEL_EXEC);
>  	if (rc)
>  		return rc;
> diff --git a/arch/arm64/mm/trans_pgd.c b/arch/arm64/mm/trans_pgd.c
> index 00b62d8640c2..dbabccd78cc4 100644
> --- a/arch/arm64/mm/trans_pgd.c
> +++ b/arch/arm64/mm/trans_pgd.c
> @@ -17,6 +17,16 @@
>  #include <asm/pgtable.h>
>  #include <linux/suspend.h>
>  
> +static void *trans_alloc(struct trans_pgd_info *info)
> +{
> +	void *page = info->trans_alloc_page(info->trans_alloc_arg);
> +
> +	if (page)
> +		clear_page(page);

The hibernate allocator already does this. As your reason for doing this is to make this
faster, it seems odd we do this twice.

If zeroed pages are necessary, the allocator should do it. (It already needs to be a
use-case specific allocator)


> +
> +	return page;
> +}
> +
>  static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
>  {
>  	pte_t pte = READ_ONCE(*src_ptep);
> @@ -172,40 +182,64 @@ int trans_pgd_create_copy(pgd_t **dst_pgdp, unsigned long start,
>  	return rc;
>  }
>  
> -int trans_pgd_map_page(pgd_t *trans_pgd, void *page, unsigned long dst_addr,
> -		       pgprot_t pgprot)
> +int trans_pgd_map_page(struct trans_pgd_info *info, pgd_t *trans_pgd,
> +		       void *page, unsigned long dst_addr, pgprot_t pgprot)
>  {
> -	pgd_t *pgdp;
> -	pud_t *pudp;
> -	pmd_t *pmdp;
> -	pte_t *ptep;
> -
> -	pgdp = pgd_offset_raw(trans_pgd, dst_addr);
> -	if (pgd_none(READ_ONCE(*pgdp))) {
> -		pudp = (void *)get_safe_page(GFP_ATOMIC);
> -		if (!pudp)
> +	int pgd_idx = pgd_index(dst_addr);
> +	int pud_idx = pud_index(dst_addr);
> +	int pmd_idx = pmd_index(dst_addr);
> +	int pte_idx = pte_index(dst_addr);

Yuck.



> +	pgd_t *pgdp = trans_pgd;
> +	pgd_t pgd = READ_ONCE(pgdp[pgd_idx]);
> +	pud_t *pudp, pud;
> +	pmd_t *pmdp, pmd;
> +	pte_t *ptep, pte;
> +
> +	if (pgd_none(pgd)) {
> +		pud_t *t = trans_alloc(info);
> +
> +		if (!t)
>  			return -ENOMEM;

> -		pgd_populate(&init_mm, pgdp, pudp);
> +
> +		__pgd_populate(&pgdp[pgd_idx], __pa(t), PUD_TYPE_TABLE);
> +		pgd = READ_ONCE(pgdp[pgd_idx]);


Please keep the pgd_populate() call. If there is some reason we can't pass init_mm, we can
pass NULL, or a fake mm pointer instead.

Going behind the page table helpers back to play with the table directly is a maintenance
headache.


>  	}
>  


> -	pudp = pud_offset(pgdp, dst_addr);
> -	if (pud_none(READ_ONCE(*pudp))) {
> -		pmdp = (void *)get_safe_page(GFP_ATOMIC);
> -		if (!pmdp)
> +	pudp = __va(pgd_page_paddr(pgd));
> +	pud = READ_ONCE(pudp[pud_idx]);
> +	if (pud_sect(pud)) {
> +		return -ENXIO;
> +	} else if (pud_none(pud) || pud_sect(pud)) {
> +		pmd_t *t = trans_alloc(info);
> +
> +		if (!t)
>  			return -ENOMEM;

Choke on block mappings? This should never happen because this function should only create
the tables necessary to map one page. Not a block mapping in sight.

(see my comments on patch 6)


Thanks,

James
Pasha Tatashin Sept. 6, 2019, 6:58 p.m. UTC | #2
On Fri, Sep 6, 2019 at 11:20 AM James Morse <james.morse@arm.com> wrote:
>
> Hi Pavel,
>
> On 21/08/2019 19:31, Pavel Tatashin wrote:
> > Currently, trans_pgd_map_page has assumptions that are relevant to
> > hibernate. But, to make it generic we must allow it to use any allocator
>
> Sounds familiar: you removed this in patch 2.

Yes, will fix it. Thank  you.

>
>
> > and also, can't assume that entries do not exist in the page table
> > already.
>
> This thing creates a set of page tables to map one page: the relocation code.
> This is mapped in TTBR0_EL1.
> It can assume existing entries do not exist, because it creates the single-entry levels as
> it goes. Kexec also needs to map precisely one page for relocation. You don't need to
> generalise this.
>
> 'trans_pgd_create_copy()' is what creates a copy the linear map. This is mapped in TTBR1_EL1.
>
> There is no reason for kexec to behave differently here.

This is again left over from earlier version where I had a flag for
this assumption. But later redesigned kexec to never have conflicting
mapptings.
I will fix this commit log.

>
>
> > Also, we can't use init_mm here.
>
> Why not? arm64's pgd_populate() doesn't use the mm. It's only there to make it obvious
> this is an EL1 mapping we are creating. We use the kernel-asid with the new mapping.

I understand, and we can use "mm" pointer here, but the problem of
doing so in trans_pdg_* is that it makes the design look ugly. We are
creating page tables for context that runs "mm" because it is between
kernels when everything is overwritten. Yet, relying on "mm" to create
these page tables is odd.

>
> The __ version is a lot less readable.

The only __version of macros I am using is for "populate" calls: for
example, __pmd_populate  instead of pmd_populate etc. I will use non
'__' variants with NULL argument instead of mm.

> Please don't use the page tables as an array: this is what the offset helpers are for.

Sure, I can use:

pte_offset_kernel()
pmd_offset()
pud_offset()
pgd_offset_raw()

The code becomes a little less efficient, because offsets return
pointer to the entry after READ_ONCE, and we need to use another
READ_ONCE() to read its content to parse its value in for example
pud_table(), pud_none() etc . In my case we use READ_ONCE() only one
time  per entry and operate on the content multiple times. Also,
because of unfortunate differences in macro names, the code become a
little less symmetric. Still, I can change the code to use _offsets
here. Please let me know if you still think it is better to use them
here.

>
>
> > Also, add "flags" for trans_pgd_info, they are going to be used
> > in copy functions once they are generalized.
>
> You don't need to 'generalize' this to support hypothetical users.
> There are only two: hibernate and kexec, both of which are very specialised. Making these
> things top-level marionette strings will tangle the logic.

Will do that (see reply below)

>
> The copy_p?d() functions should decide if they should manipulate _this_ entry based on
> _this_ entry and the kernel configuration. This is only really done in _copy_pte(), which
> is where it should stay.

I am sorry, I do not understand this comment. Could you please
elaborate what would you like me to change.

>
>
> > diff --git a/arch/arm64/include/asm/trans_pgd.h b/arch/arm64/include/asm/trans_pgd.h
> > index c7b5402b7d87..e3d022b1b526 100644
> > --- a/arch/arm64/include/asm/trans_pgd.h
> > +++ b/arch/arm64/include/asm/trans_pgd.h
> > @@ -11,10 +11,45 @@
> >  #include <linux/bits.h>
> >  #include <asm/pgtable-types.h>
> >
> > +/*
> > + * trans_alloc_page
> > + *   - Allocator that should return exactly one uninitilaized page, if this
> > + *    allocator fails, trans_pgd returns -ENOMEM error.
> > + *
> > + * trans_alloc_arg
> > + *   - Passed to trans_alloc_page as an argument
>
> This is very familiar.

Sorry, What do you mean?

>
>
> > + * trans_flags
> > + *   - bitmap with flags that control how page table is filled.
> > + *     TRANS_MKWRITE: during page table copy make PTE, PME, and PUD page
> > + *                    writeable by removing RDONLY flag from PTE.
>
> Why would you ever keep the read-only flags in a set of page tables that exist to let you
> overwrite memory?

It meant to take care of this comment, and keep it in hibernate specific code:
329                 /*
330                  * Resume will overwrite areas that may be marked read only
331                  * (code, rodata). Clear the RDONLY bit from the temporary
332                  * mappings we use during restore.
333                  */
334                 .trans_flags            = TRANS_MKWRITE,
335         };

But, sure, this makes sense I will remove this flag, and will do
RDONLY unconditionally.

I re-evaluated "flags", and figured that they are indeed not needed.
So, I will embed them into the code directly.

>
>
> > + *     TRANS_MKVALID: during page table copy, if PTE present, but not valid,
> > + *                    make it valid.
>
> Please keep this logic together with the !pte_none(pte) and debug_pagealloc_enabled()
> check, where it is today.
>
> Making an entry valid without those checks should never be necessary.

Yes, will do that.

>
>
> > + *     TRANS_CHECKPFN: During page table copy, for every PTE entry check that
> > + *                     PFN that this PTE points to is valid. Otherwise return
> > + *                     -ENXIO
>
> Hibernate does this when inventing a new mapping. This is how we check the kernel
> should be able to read/write this page. If !pfn_valid(), the page should not be mapped.
>
> Why do you need to turn this off?
>
> It us only necessary at the leaf level, and only if debug-pagealloc is in use. Please keep
> all these bits together, as its much harder to understand why this entry needs inventing
> when its split up like this.
>
>
>

> > diff --git a/arch/arm64/mm/trans_pgd.c b/arch/arm64/mm/trans_pgd.c
> > index 00b62d8640c2..dbabccd78cc4 100644
> > --- a/arch/arm64/mm/trans_pgd.c
> > +++ b/arch/arm64/mm/trans_pgd.c
> > @@ -17,6 +17,16 @@
> >  #include <asm/pgtable.h>
> >  #include <linux/suspend.h>
> >
> > +static void *trans_alloc(struct trans_pgd_info *info)
> > +{
> > +     void *page = info->trans_alloc_page(info->trans_alloc_arg);
> > +
> > +     if (page)
> > +             clear_page(page);
>
> The hibernate allocator already does this. As your reason for doing this is to make this
> faster, it seems odd we do this twice.
>
> If zeroed pages are necessary, the allocator should do it. (It already needs to be a
> use-case specific allocator)

Makes sense, I will change the requirement for allocator to return
zeroed memory.

>
>
> > +
> > +     return page;
> > +}
> > +
> >  static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
> >  {
> >       pte_t pte = READ_ONCE(*src_ptep);
> > @@ -172,40 +182,64 @@ int trans_pgd_create_copy(pgd_t **dst_pgdp, unsigned long start,
> >       return rc;
> >  }
> >
> > -int trans_pgd_map_page(pgd_t *trans_pgd, void *page, unsigned long dst_addr,
> > -                    pgprot_t pgprot)
> > +int trans_pgd_map_page(struct trans_pgd_info *info, pgd_t *trans_pgd,
> > +                    void *page, unsigned long dst_addr, pgprot_t pgprot)
> >  {
> > -     pgd_t *pgdp;
> > -     pud_t *pudp;
> > -     pmd_t *pmdp;
> > -     pte_t *ptep;
> > -
> > -     pgdp = pgd_offset_raw(trans_pgd, dst_addr);
> > -     if (pgd_none(READ_ONCE(*pgdp))) {
> > -             pudp = (void *)get_safe_page(GFP_ATOMIC);
> > -             if (!pudp)
> > +     int pgd_idx = pgd_index(dst_addr);
> > +     int pud_idx = pud_index(dst_addr);
> > +     int pmd_idx = pmd_index(dst_addr);
> > +     int pte_idx = pte_index(dst_addr);
>
> Yuck.
>

What's wrong with pre-calculating indices? :)

>
>
> > +     pgd_t *pgdp = trans_pgd;
> > +     pgd_t pgd = READ_ONCE(pgdp[pgd_idx]);
> > +     pud_t *pudp, pud;
> > +     pmd_t *pmdp, pmd;
> > +     pte_t *ptep, pte;
> > +
> > +     if (pgd_none(pgd)) {
> > +             pud_t *t = trans_alloc(info);
> > +
> > +             if (!t)
> >                       return -ENOMEM;
>
> > -             pgd_populate(&init_mm, pgdp, pudp);
> > +
> > +             __pgd_populate(&pgdp[pgd_idx], __pa(t), PUD_TYPE_TABLE);
> > +             pgd = READ_ONCE(pgdp[pgd_idx]);
>
>
> Please keep the pgd_populate() call. If there is some reason we can't pass init_mm, we can
> pass NULL, or a fake mm pointer instead.\\

Hm, we could use NULL instead of "mm", I will do that, thanks.

>
> Going behind the page table helpers back to play with the table directly is a maintenance
> headache.
>
>
> >       }
> >
>
>
> > -     pudp = pud_offset(pgdp, dst_addr);
> > -     if (pud_none(READ_ONCE(*pudp))) {
> > -             pmdp = (void *)get_safe_page(GFP_ATOMIC);
> > -             if (!pmdp)
> > +     pudp = __va(pgd_page_paddr(pgd));
> > +     pud = READ_ONCE(pudp[pud_idx]);
> > +     if (pud_sect(pud)) {
> > +             return -ENXIO;
> > +     } else if (pud_none(pud) || pud_sect(pud)) {
> > +             pmd_t *t = trans_alloc(info);
> > +
> > +             if (!t)
> >                       return -ENOMEM;
>
> Choke on block mappings? This should never happen because this function should only create
> the tables necessary to map one page. Not a block mapping in sight.
>
> (see my comments on patch 6)

I can remove this, but what should I replace it with BUG() or silently
ignore, and assume no huge page hre? I thought the idea is not to use
BUG() calls in kernel code, and return errors instead. If, in the
future PUD size mappings are added, how is that going to be detected?

Thank you,
Pasha
James Morse Oct. 11, 2019, 6:15 p.m. UTC | #3
Hi Pavel,

On 06/09/2019 19:58, Pavel Tatashin wrote:
> On Fri, Sep 6, 2019 at 11:20 AM James Morse <james.morse@arm.com> wrote:
>> On 21/08/2019 19:31, Pavel Tatashin wrote:
>>> Currently, trans_pgd_map_page has assumptions that are relevant to
>>> hibernate. But, to make it generic we must allow it to use any allocator

>>> and also, can't assume that entries do not exist in the page table
>>> already.

[...]

>> Please don't use the page tables as an array: this is what the offset helpers are for.
> 
> Sure, I can use:
> 
> pte_offset_kernel()
> pmd_offset()
> pud_offset()
> pgd_offset_raw()

> The code becomes a little less efficient, because offsets return
> pointer to the entry after READ_ONCE, and we need to use another
> READ_ONCE() to read its content to parse its value in for example
> pud_table(), pud_none() etc . In my case we use READ_ONCE() only one
> time  per entry and operate on the content multiple times. Also,
> because of unfortunate differences in macro names, the code become a
> little less symmetric. Still, I can change the code to use _offsets
> here. Please let me know if you still think it is better to use them
> here.

We should make this as clearly readable as possible, that way reviewers can spot the bugs.
Using the helpers makes this more maintainable, as the helpers may be where strange things
like 52bit VA get implemented.

Making it fast is the compilers job. I agree it can't remove READ_ONCE()es, but I think
the difference between one and two READ_ONCE()es per leaf-entry is insignificant when we
go on to copy megabytes worth of data.


>> The copy_p?d() functions should decide if they should manipulate _this_ entry based on
>> _this_ entry and the kernel configuration. This is only really done in _copy_pte(), which
>> is where it should stay.
> 
> I am sorry, I do not understand this comment. Could you please
> elaborate what would you like me to change.

Consider the current _copy_pte():
|	} else if (debug_pagealloc_enabled() && !pte_none(pte)) {
|		/*
|		 * debug_pagealloc will removed the PTE_VALID bit if
|		 * the page isn't in use by the resume kernel. It may have
|		 * been in use by the original kernel, in which case we need
|		 * to put it back in our copy to do the restore.
|		 *
|		 * Before marking this entry valid, check the pfn should
|		 * be mapped.
|		 */
|		BUG_ON(!pfn_valid(pte_pfn(pte)));
|
|		set_pte(dst_ptep, pte_mkpresent(pte_mkwrite(pte)));
|	}

From this it is very obvious that we only put the valid bits back into the page table if
debug_pagealloc is enabled and the not-valid PTE's PFN points to memory that was part of
the linear map.

If this logic gets moved apart, and strung together with global variables, its not at all
clear what happens.


>>> diff --git a/arch/arm64/include/asm/trans_pgd.h b/arch/arm64/include/asm/trans_pgd.h
>>> index c7b5402b7d87..e3d022b1b526 100644
>>> --- a/arch/arm64/include/asm/trans_pgd.h
>>> +++ b/arch/arm64/include/asm/trans_pgd.h
>>> @@ -11,10 +11,45 @@
>>>  #include <linux/bits.h>
>>>  #include <asm/pgtable-types.h>
>>>
>>> +/*
>>> + * trans_alloc_page
>>> + *   - Allocator that should return exactly one uninitilaized page, if this
>>> + *    allocator fails, trans_pgd returns -ENOMEM error.
>>> + *
>>> + * trans_alloc_arg
>>> + *   - Passed to trans_alloc_page as an argument
>>
>> This is very familiar.
> 
> Sorry, What do you mean?

This stuff used to take a pointer to a function that allocates a page, and an argument for
that allocator ... until patch 2 when you squashed it all in... only to undo it here. This
looks like churn.


>>> + * trans_flags

[...]

> I re-evaluated "flags", and figured that they are indeed not needed.
> So, I will embed them into the code directly.

Great!



>>> diff --git a/arch/arm64/mm/trans_pgd.c b/arch/arm64/mm/trans_pgd.c
>>> index 00b62d8640c2..dbabccd78cc4 100644
>>> --- a/arch/arm64/mm/trans_pgd.c
>>> +++ b/arch/arm64/mm/trans_pgd.c
>>> @@ -17,6 +17,16 @@
>>>  #include <asm/pgtable.h>
>>>  #include <linux/suspend.h>

>>>
>>> -int trans_pgd_map_page(pgd_t *trans_pgd, void *page, unsigned long dst_addr,
>>> -                    pgprot_t pgprot)
>>> +int trans_pgd_map_page(struct trans_pgd_info *info, pgd_t *trans_pgd,
>>> +                    void *page, unsigned long dst_addr, pgprot_t pgprot)
>>>  {
>>> -     pgd_t *pgdp;
>>> -     pud_t *pudp;
>>> -     pmd_t *pmdp;
>>> -     pte_t *ptep;
>>> -
>>> -     pgdp = pgd_offset_raw(trans_pgd, dst_addr);
>>> -     if (pgd_none(READ_ONCE(*pgdp))) {
>>> -             pudp = (void *)get_safe_page(GFP_ATOMIC);
>>> -             if (!pudp)
>>> +     int pgd_idx = pgd_index(dst_addr);
>>> +     int pud_idx = pud_index(dst_addr);
>>> +     int pmd_idx = pmd_index(dst_addr);
>>> +     int pte_idx = pte_index(dst_addr);
>>
>> Yuck.
>>
> 
> What's wrong with pre-calculating indices? :)

The only thing to do with them is access the page tables as a C array. This stuff is the
business of the helpers, please use them. Its a maintenance headache if you don't.


>>> -     pudp = pud_offset(pgdp, dst_addr);
>>> -     if (pud_none(READ_ONCE(*pudp))) {
>>> -             pmdp = (void *)get_safe_page(GFP_ATOMIC);
>>> -             if (!pmdp)
>>> +     pudp = __va(pgd_page_paddr(pgd));
>>> +     pud = READ_ONCE(pudp[pud_idx]);
>>> +     if (pud_sect(pud)) {
>>> +             return -ENXIO;
>>> +     } else if (pud_none(pud) || pud_sect(pud)) {
>>> +             pmd_t *t = trans_alloc(info);
>>> +
>>> +             if (!t)
>>>                       return -ENOMEM;
>>
>> Choke on block mappings? This should never happen because this function should only create
>> the tables necessary to map one page. Not a block mapping in sight.
>>
>> (see my comments on patch 6)

> I can remove this, but what should I replace it with BUG() or silently
> ignore, and assume no huge page hre? I thought the idea is not to use
> BUG() calls in kernel code, and return errors instead. If, in the
> future PUD size mappings are added, how is that going to be detected?

...if in the future...

Could your turn RODATA_FULL_DEFAULT_ENABLED off in your kernel config, then check debugfs
kernel_page_tables export. You should see blocks mappings for the large contiguous blocks
of memory.


Thanks,

James
diff mbox series

Patch

diff --git a/arch/arm64/include/asm/trans_pgd.h b/arch/arm64/include/asm/trans_pgd.h
index c7b5402b7d87..e3d022b1b526 100644
--- a/arch/arm64/include/asm/trans_pgd.h
+++ b/arch/arm64/include/asm/trans_pgd.h
@@ -11,10 +11,45 @@ 
 #include <linux/bits.h>
 #include <asm/pgtable-types.h>
 
+/*
+ * trans_alloc_page
+ *	- Allocator that should return exactly one uninitilaized page, if this
+ *	 allocator fails, trans_pgd returns -ENOMEM error.
+ *
+ * trans_alloc_arg
+ *	- Passed to trans_alloc_page as an argument
+ *
+ * trans_flags
+ *	- bitmap with flags that control how page table is filled.
+ *	  TRANS_MKWRITE: during page table copy make PTE, PME, and PUD page
+ *			 writeable by removing RDONLY flag from PTE.
+ *	  TRANS_MKVALID: during page table copy, if PTE present, but not valid,
+ *			 make it valid.
+ *	  TRANS_CHECKPFN: During page table copy, for every PTE entry check that
+ *			  PFN that this PTE points to is valid. Otherwise return
+ *			  -ENXIO
+ */
+
+#define	TRANS_MKWRITE	BIT(0)
+#define	TRANS_MKVALID	BIT(1)
+#define	TRANS_CHECKPFN	BIT(2)
+
+struct trans_pgd_info {
+	void * (*trans_alloc_page)(void *arg);
+	void *trans_alloc_arg;
+	unsigned long trans_flags;
+};
+
 int trans_pgd_create_copy(pgd_t **dst_pgdp, unsigned long start,
 			  unsigned long end);
 
-int trans_pgd_map_page(pgd_t *trans_pgd, void *page, unsigned long dst_addr,
-		       pgprot_t pgprot);
+/*
+ * Add map entry to trans_pgd for a base-size page at PTE level.
+ * page:	page to be mapped.
+ * dst_addr:	new VA address for the pages
+ * pgprot:	protection for the page.
+ */
+int trans_pgd_map_page(struct trans_pgd_info *info, pgd_t *trans_pgd,
+		       void *page, unsigned long dst_addr, pgprot_t pgprot);
 
 #endif /* _ASM_TRANS_TABLE_H */
diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index 6ee81bbaa37f..17426dc8cb54 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -179,6 +179,12 @@  int arch_hibernation_header_restore(void *addr)
 }
 EXPORT_SYMBOL(arch_hibernation_header_restore);
 
+static void *
+hibernate_page_alloc(void *arg)
+{
+	return (void *)get_safe_page((gfp_t)(unsigned long)arg);
+}
+
 /*
  * Copies length bytes, starting at src_start into an new page,
  * perform cache maintentance, then maps it at the specified address low
@@ -195,6 +201,11 @@  static int create_safe_exec_page(void *src_start, size_t length,
 				 unsigned long dst_addr,
 				 phys_addr_t *phys_dst_addr)
 {
+	struct trans_pgd_info trans_info = {
+		.trans_alloc_page	= hibernate_page_alloc,
+		.trans_alloc_arg	= (void *)GFP_ATOMIC,
+		.trans_flags		= 0,
+	};
 	void *page = (void *)get_safe_page(GFP_ATOMIC);
 	pgd_t *trans_pgd;
 	int rc;
@@ -209,7 +220,7 @@  static int create_safe_exec_page(void *src_start, size_t length,
 	if (!trans_pgd)
 		return -ENOMEM;
 
-	rc = trans_pgd_map_page(trans_pgd, page, dst_addr,
+	rc = trans_pgd_map_page(&trans_info, trans_pgd, page, dst_addr,
 				PAGE_KERNEL_EXEC);
 	if (rc)
 		return rc;
diff --git a/arch/arm64/mm/trans_pgd.c b/arch/arm64/mm/trans_pgd.c
index 00b62d8640c2..dbabccd78cc4 100644
--- a/arch/arm64/mm/trans_pgd.c
+++ b/arch/arm64/mm/trans_pgd.c
@@ -17,6 +17,16 @@ 
 #include <asm/pgtable.h>
 #include <linux/suspend.h>
 
+static void *trans_alloc(struct trans_pgd_info *info)
+{
+	void *page = info->trans_alloc_page(info->trans_alloc_arg);
+
+	if (page)
+		clear_page(page);
+
+	return page;
+}
+
 static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
 {
 	pte_t pte = READ_ONCE(*src_ptep);
@@ -172,40 +182,64 @@  int trans_pgd_create_copy(pgd_t **dst_pgdp, unsigned long start,
 	return rc;
 }
 
-int trans_pgd_map_page(pgd_t *trans_pgd, void *page, unsigned long dst_addr,
-		       pgprot_t pgprot)
+int trans_pgd_map_page(struct trans_pgd_info *info, pgd_t *trans_pgd,
+		       void *page, unsigned long dst_addr, pgprot_t pgprot)
 {
-	pgd_t *pgdp;
-	pud_t *pudp;
-	pmd_t *pmdp;
-	pte_t *ptep;
-
-	pgdp = pgd_offset_raw(trans_pgd, dst_addr);
-	if (pgd_none(READ_ONCE(*pgdp))) {
-		pudp = (void *)get_safe_page(GFP_ATOMIC);
-		if (!pudp)
+	int pgd_idx = pgd_index(dst_addr);
+	int pud_idx = pud_index(dst_addr);
+	int pmd_idx = pmd_index(dst_addr);
+	int pte_idx = pte_index(dst_addr);
+	pgd_t *pgdp = trans_pgd;
+	pgd_t pgd = READ_ONCE(pgdp[pgd_idx]);
+	pud_t *pudp, pud;
+	pmd_t *pmdp, pmd;
+	pte_t *ptep, pte;
+
+	if (pgd_none(pgd)) {
+		pud_t *t = trans_alloc(info);
+
+		if (!t)
 			return -ENOMEM;
-		pgd_populate(&init_mm, pgdp, pudp);
+
+		__pgd_populate(&pgdp[pgd_idx], __pa(t), PUD_TYPE_TABLE);
+		pgd = READ_ONCE(pgdp[pgd_idx]);
 	}
 
-	pudp = pud_offset(pgdp, dst_addr);
-	if (pud_none(READ_ONCE(*pudp))) {
-		pmdp = (void *)get_safe_page(GFP_ATOMIC);
-		if (!pmdp)
+	pudp = __va(pgd_page_paddr(pgd));
+	pud = READ_ONCE(pudp[pud_idx]);
+	if (pud_sect(pud)) {
+		return -ENXIO;
+	} else if (pud_none(pud) || pud_sect(pud)) {
+		pmd_t *t = trans_alloc(info);
+
+		if (!t)
 			return -ENOMEM;
-		pud_populate(&init_mm, pudp, pmdp);
+
+		__pud_populate(&pudp[pud_idx], __pa(t), PMD_TYPE_TABLE);
+		pud = READ_ONCE(pudp[pud_idx]);
 	}
 
-	pmdp = pmd_offset(pudp, dst_addr);
-	if (pmd_none(READ_ONCE(*pmdp))) {
-		ptep = (void *)get_safe_page(GFP_ATOMIC);
-		if (!ptep)
+	pmdp = __va(pud_page_paddr(pud));
+	pmd = READ_ONCE(pmdp[pmd_idx]);
+	if (pmd_sect(pmd)) {
+		return -ENXIO;
+	} else if (pmd_none(pmd) || pmd_sect(pmd)) {
+		pte_t *t = trans_alloc(info);
+
+		if (!t)
 			return -ENOMEM;
-		pmd_populate_kernel(&init_mm, pmdp, ptep);
+
+		__pmd_populate(&pmdp[pmd_idx], __pa(t), PTE_TYPE_PAGE);
+		pmd = READ_ONCE(pmdp[pmd_idx]);
 	}
 
-	ptep = pte_offset_kernel(pmdp, dst_addr);
-	set_pte(ptep, pfn_pte(virt_to_pfn(page), PAGE_KERNEL_EXEC));
+	ptep = __va(pmd_page_paddr(pmd));
+	pte = READ_ONCE(ptep[pte_idx]);
+
+	if (!pte_none(pte))
+		return -ENXIO;
+
+	set_pte(&ptep[pte_idx], pfn_pte(virt_to_pfn(page), pgprot));
 
 	return 0;
 }