Message ID | 20240731130026.8467-3-jgross@suse.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | mini-os: mm: use a generic page table walker | expand |
Hello, Juergen Gross, le mer. 31 juil. 2024 15:00:25 +0200, a ecrit: > -pgentry_t *need_pgt(unsigned long va) > +static int need_pgt_func(unsigned long va, unsigned int lvl, bool is_leaf, > + pgentry_t *pte, void *par) > { [...] > + if ( lvl == L1_FRAME || (*pte & _PAGE_PRESENT) ) Did you mean (*pte & _PAGE_PSE)? > { > + *result = pte; > + return 1; > } Samuel
On 31.07.24 23:27, Samuel Thibault wrote: > Hello, > > Juergen Gross, le mer. 31 juil. 2024 15:00:25 +0200, a ecrit: >> -pgentry_t *need_pgt(unsigned long va) >> +static int need_pgt_func(unsigned long va, unsigned int lvl, bool is_leaf, >> + pgentry_t *pte, void *par) >> { > [...] >> + if ( lvl == L1_FRAME || (*pte & _PAGE_PRESENT) ) > > Did you mean (*pte & _PAGE_PSE)? No. I want to bail out if the PTE does not require a page table to be added to it. This is the case if the PTE is valid or if it is at the lowest page table level already. An invalid PTE with PSE set is still invalid, so the PSE bit has no real meaning. Juergen
Jürgen Groß, le jeu. 01 août 2024 07:56:36 +0200, a ecrit: > On 31.07.24 23:27, Samuel Thibault wrote: > > Hello, > > > > Juergen Gross, le mer. 31 juil. 2024 15:00:25 +0200, a ecrit: > > > -pgentry_t *need_pgt(unsigned long va) > > > +static int need_pgt_func(unsigned long va, unsigned int lvl, bool is_leaf, > > > + pgentry_t *pte, void *par) > > > { > > [...] > > > + if ( lvl == L1_FRAME || (*pte & _PAGE_PRESENT) ) > > > > Did you mean (*pte & _PAGE_PSE)? > > No. I don't understand: it doesn't map what I know of need_pgt and what the existing code is doing. AIKI, the point of need_pgt is to make sure there is a L1 page table entry for a VA, and return it, so the caller can put in it at pte for a mfn or such. In the case a PSE is met, we don't go further, and it's up to the caller to decide what it wants to do (most often it's actually unexpected and asserted out). In both cases, the PRESENT bit of the pte whose address is returned does not matter, most often it's the caller which will set it. The existing code for need_pgt thus always adds page table entries down to level1 (except if _PAGE_PSE is met, i.e. a large page was already set up): the termination was: [... walk down to level 2] - if ( tab[offset] & _PAGE_PSE ) - return &tab[offset]; [... walk down to level 1] - return &tab[offset]; i.e. we always return either a PSE L2 entry, or an L1 entry, and never above (while keeping sure that the entries above are present). > I want to bail out if the PTE does not require a page table to be > added to it. This is the case if the PTE is valid There being an entry at e.g. level 3 (without PSE) does not mean that we have an entry at level 1 for the VA, so we have to continue adding levels for the VA in the sparse page table tree. > An invalid PTE with PSE set is still invalid, so the PSE bit has no > real meaning. Mmm, I don't think we ever create entries with PSE but not PRESENT (i.e. we don't "plan" to have PSE entries), but even if we did, if we "plan" to have PSE entries, we should return that entry, and it's up to the caller to fill it (and make it PRESENT) Samuel
On 01.08.24 09:39, Samuel Thibault wrote: > Jürgen Groß, le jeu. 01 août 2024 07:56:36 +0200, a ecrit: >> On 31.07.24 23:27, Samuel Thibault wrote: >>> Hello, >>> >>> Juergen Gross, le mer. 31 juil. 2024 15:00:25 +0200, a ecrit: >>>> -pgentry_t *need_pgt(unsigned long va) >>>> +static int need_pgt_func(unsigned long va, unsigned int lvl, bool is_leaf, >>>> + pgentry_t *pte, void *par) >>>> { >>> [...] >>>> + if ( lvl == L1_FRAME || (*pte & _PAGE_PRESENT) ) >>> >>> Did you mean (*pte & _PAGE_PSE)? >> >> No. > > I don't understand: it doesn't map what I know of need_pgt and what the > existing code is doing. > > AIKI, the point of need_pgt is to make sure there is a L1 page table > entry for a VA, and return it, so the caller can put in it at pte for a > mfn or such. In the case a PSE is met, we don't go further, and it's up > to the caller to decide what it wants to do (most often it's actually > unexpected and asserted out). In both cases, the PRESENT bit of the > pte whose address is returned does not matter, most often it's the > caller which will set it. > > The existing code for need_pgt thus always adds page table entries down > to level1 (except if _PAGE_PSE is met, i.e. a large page was already set > up): the termination was: > > [... walk down to level 2] You have omitted: ASSERT(tab[offset] & _PAGE_PRESENT); > - if ( tab[offset] & _PAGE_PSE ) > - return &tab[offset]; > [... walk down to level 1] > - return &tab[offset]; Juergen
Jürgen Groß, le jeu. 01 août 2024 10:13:07 +0200, a ecrit: > On 01.08.24 09:39, Samuel Thibault wrote: > > Jürgen Groß, le jeu. 01 août 2024 07:56:36 +0200, a ecrit: > > > On 31.07.24 23:27, Samuel Thibault wrote: > > > > Hello, > > > > > > > > Juergen Gross, le mer. 31 juil. 2024 15:00:25 +0200, a ecrit: > > > > > -pgentry_t *need_pgt(unsigned long va) > > > > > +static int need_pgt_func(unsigned long va, unsigned int lvl, bool is_leaf, > > > > > + pgentry_t *pte, void *par) > > > > > { > > > > [...] > > > > > + if ( lvl == L1_FRAME || (*pte & _PAGE_PRESENT) ) > > > > > > > > Did you mean (*pte & _PAGE_PSE)? > > > > > > No. > > > > I don't understand: it doesn't map what I know of need_pgt and what the > > existing code is doing. > > > > AIKI, the point of need_pgt is to make sure there is a L1 page table > > entry for a VA, and return it, so the caller can put in it at pte for a > > mfn or such. In the case a PSE is met, we don't go further, and it's up > > to the caller to decide what it wants to do (most often it's actually > > unexpected and asserted out). In both cases, the PRESENT bit of the > > pte whose address is returned does not matter, most often it's the > > caller which will set it. > > > > The existing code for need_pgt thus always adds page table entries down > > to level1 (except if _PAGE_PSE is met, i.e. a large page was already set > > up): the termination was: > > > > [... walk down to level 2] > > You have omitted: > > ASSERT(tab[offset] & _PAGE_PRESENT); Right, that confirms that we never set PSE without PRESENT. We probably want to keep that assertion in the PSE case. But we still want to continue walking down to level1 (or pse) even when L4/L3/L2 have PRESENT. > > - if ( tab[offset] & _PAGE_PSE ) > > - return &tab[offset]; > > [... walk down to level 1] > > - return &tab[offset]; Samuel
On 01.08.24 11:15, Samuel Thibault wrote: > Jürgen Groß, le jeu. 01 août 2024 10:13:07 +0200, a ecrit: >> On 01.08.24 09:39, Samuel Thibault wrote: >>> Jürgen Groß, le jeu. 01 août 2024 07:56:36 +0200, a ecrit: >>>> On 31.07.24 23:27, Samuel Thibault wrote: >>>>> Hello, >>>>> >>>>> Juergen Gross, le mer. 31 juil. 2024 15:00:25 +0200, a ecrit: >>>>>> -pgentry_t *need_pgt(unsigned long va) >>>>>> +static int need_pgt_func(unsigned long va, unsigned int lvl, bool is_leaf, >>>>>> + pgentry_t *pte, void *par) >>>>>> { >>>>> [...] >>>>>> + if ( lvl == L1_FRAME || (*pte & _PAGE_PRESENT) ) >>>>> >>>>> Did you mean (*pte & _PAGE_PSE)? >>>> >>>> No. >>> >>> I don't understand: it doesn't map what I know of need_pgt and what the >>> existing code is doing. >>> >>> AIKI, the point of need_pgt is to make sure there is a L1 page table >>> entry for a VA, and return it, so the caller can put in it at pte for a >>> mfn or such. In the case a PSE is met, we don't go further, and it's up >>> to the caller to decide what it wants to do (most often it's actually >>> unexpected and asserted out). In both cases, the PRESENT bit of the >>> pte whose address is returned does not matter, most often it's the >>> caller which will set it. >>> >>> The existing code for need_pgt thus always adds page table entries down >>> to level1 (except if _PAGE_PSE is met, i.e. a large page was already set >>> up): the termination was: >>> >>> [... walk down to level 2] >> >> You have omitted: >> >> ASSERT(tab[offset] & _PAGE_PRESENT); > > Right, that confirms that we never set PSE without PRESENT. We probably > want to keep that assertion in the PSE case. Hmm, I'm not sure. The hardware allows to have any bit set in the PTE when PRESENT isn't set. It will never look at any other bit in this case. Why should the software require something different here? > But we still want to continue walking down to level1 (or pse) even when > L4/L3/L2 have PRESENT. We do that. In this case the is_leaf flag won't be set and need_pgt_func() will return early with the return value 0, meaning that the walk will be continued. Juergen
Jürgen Groß, le jeu. 01 août 2024 11:25:31 +0200, a ecrit: > On 01.08.24 11:15, Samuel Thibault wrote: > > Jürgen Groß, le jeu. 01 août 2024 10:13:07 +0200, a ecrit: > > > On 01.08.24 09:39, Samuel Thibault wrote: > > > > Jürgen Groß, le jeu. 01 août 2024 07:56:36 +0200, a ecrit: > > > > > On 31.07.24 23:27, Samuel Thibault wrote: > > > > > > Hello, > > > > > > > > > > > > Juergen Gross, le mer. 31 juil. 2024 15:00:25 +0200, a ecrit: > > > > > > > -pgentry_t *need_pgt(unsigned long va) > > > > > > > +static int need_pgt_func(unsigned long va, unsigned int lvl, bool is_leaf, > > > > > > > + pgentry_t *pte, void *par) > > > > > > > { > > > > > > [...] > > > > > > > + if ( lvl == L1_FRAME || (*pte & _PAGE_PRESENT) ) > > > > > > > > > > > > Did you mean (*pte & _PAGE_PSE)? > > > > > > > > > > No. > > > > > > > > I don't understand: it doesn't map what I know of need_pgt and what the > > > > existing code is doing. > > > > > > > > AIKI, the point of need_pgt is to make sure there is a L1 page table > > > > entry for a VA, and return it, so the caller can put in it at pte for a > > > > mfn or such. In the case a PSE is met, we don't go further, and it's up > > > > to the caller to decide what it wants to do (most often it's actually > > > > unexpected and asserted out). In both cases, the PRESENT bit of the > > > > pte whose address is returned does not matter, most often it's the > > > > caller which will set it. > > > > > > > > The existing code for need_pgt thus always adds page table entries down > > > > to level1 (except if _PAGE_PSE is met, i.e. a large page was already set > > > > up): the termination was: > > > > > > > > [... walk down to level 2] > > > > > > You have omitted: > > > > > > ASSERT(tab[offset] & _PAGE_PRESENT); > > > > Right, that confirms that we never set PSE without PRESENT. We probably > > want to keep that assertion in the PSE case. > > Hmm, I'm not sure. > > The hardware allows to have any bit set in the PTE when PRESENT isn't set. > It will never look at any other bit in this case. Why should the software > require something different here? To document&check what we expect to be producing. > > But we still want to continue walking down to level1 (or pse) even when > > L4/L3/L2 have PRESENT. > > We do that. In this case the is_leaf flag won't be set and need_pgt_func() > will return early with the return value 0, meaning that the walk will be > continued. Ah, I missed that indeed, I had not integrated that is_leaf is 1 also in the case of missing level>1 page. It'd probably be useful to emphasize in patch 1: + * is_leaf: true if PTE doesn't address another page table it should explicitly say (L1 PTE, or PSE, or not present yet) The *pte & _PAGE_PRESENT test alone still looks surprising to the reader, though: we don't return because the page is present, but because we reached the last possible level that we could fill, which is either L1, or an existing PSE. It is true that when is_leaf is 1, if it's not L1 but present, it's necessarily a PSE, but it's clearer to the reader to write what we meant: L1 or PSE, so something like: if ( lvl == L1_FRAME || (*pte & _PAGE_PRESENT) ) { assert ( lvl == L1_FRAME || (*pte & _PAGE_PSE) ); // we are at last possible level [...] } Samuel
diff --git a/arch/x86/mm.c b/arch/x86/mm.c index cc4d41e9..accde291 100644 --- a/arch/x86/mm.c +++ b/arch/x86/mm.c @@ -518,57 +518,39 @@ static pgentry_t *get_pgt(unsigned long va) * return a valid PTE for a given virtual address. If PTE does not exist, * allocate page-table pages. */ -pgentry_t *need_pgt(unsigned long va) +static int need_pgt_func(unsigned long va, unsigned int lvl, bool is_leaf, + pgentry_t *pte, void *par) { + pgentry_t **result = par; unsigned long pt_mfn; - pgentry_t *tab; unsigned long pt_pfn; - unsigned offset; + unsigned int idx; - tab = pt_base; - pt_mfn = virt_to_mfn(pt_base); + if ( !is_leaf ) + return 0; -#if defined(__x86_64__) - offset = l4_table_offset(va); - if ( !(tab[offset] & _PAGE_PRESENT) ) - { - pt_pfn = virt_to_pfn(alloc_page()); - if ( !pt_pfn ) - return NULL; - new_pt_frame(&pt_pfn, pt_mfn, offset, L3_FRAME); - } - ASSERT(tab[offset] & _PAGE_PRESENT); - pt_mfn = pte_to_mfn(tab[offset]); - tab = mfn_to_virt(pt_mfn); -#endif - offset = l3_table_offset(va); - if ( !(tab[offset] & _PAGE_PRESENT) ) - { - pt_pfn = virt_to_pfn(alloc_page()); - if ( !pt_pfn ) - return NULL; - new_pt_frame(&pt_pfn, pt_mfn, offset, L2_FRAME); - } - ASSERT(tab[offset] & _PAGE_PRESENT); - pt_mfn = pte_to_mfn(tab[offset]); - tab = mfn_to_virt(pt_mfn); - offset = l2_table_offset(va); - if ( !(tab[offset] & _PAGE_PRESENT) ) + if ( lvl == L1_FRAME || (*pte & _PAGE_PRESENT) ) { - pt_pfn = virt_to_pfn(alloc_page()); - if ( !pt_pfn ) - return NULL; - new_pt_frame(&pt_pfn, pt_mfn, offset, L1_FRAME); + *result = pte; + return 1; } - ASSERT(tab[offset] & _PAGE_PRESENT); - if ( tab[offset] & _PAGE_PSE ) - return &tab[offset]; - pt_mfn = pte_to_mfn(tab[offset]); - tab = mfn_to_virt(pt_mfn); + pt_mfn = virt_to_mfn(pte); + pt_pfn = virt_to_pfn(alloc_page()); + if ( !pt_pfn ) + return -1; + idx = (va >> ptdata[lvl].shift) & (ptdata[lvl].entries - 1); + new_pt_frame(&pt_pfn, pt_mfn, idx, lvl - 1); - offset = l1_table_offset(va); - return &tab[offset]; + return 0; +} + +pgentry_t *need_pgt(unsigned long va) +{ + pgentry_t *tab = NULL; + + walk_pt(va, va, need_pgt_func, &tab); + return tab; } EXPORT_SYMBOL(need_pgt);
Instead of open coding a page table walk, use walk_pt() in need_pgt(). Signed-off-by: Juergen Gross <jgross@suse.com> --- arch/x86/mm.c | 66 +++++++++++++++++++-------------------------------- 1 file changed, 24 insertions(+), 42 deletions(-)