diff mbox series

[2/3] mini-os: mm: switch need_pgt() to use walk_pt()

Message ID 20240731130026.8467-3-jgross@suse.com (mailing list archive)
State Superseded
Headers show
Series mini-os: mm: use a generic page table walker | expand

Commit Message

Jürgen Groß July 31, 2024, 1 p.m. UTC
Instead of open coding a page table walk, use walk_pt() in need_pgt().

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 arch/x86/mm.c | 66 +++++++++++++++++++--------------------------------
 1 file changed, 24 insertions(+), 42 deletions(-)

Comments

Samuel Thibault July 31, 2024, 9:27 p.m. UTC | #1
Hello,

Juergen Gross, le mer. 31 juil. 2024 15:00:25 +0200, a ecrit:
> -pgentry_t *need_pgt(unsigned long va)
> +static int need_pgt_func(unsigned long va, unsigned int lvl, bool is_leaf,
> +                         pgentry_t *pte, void *par)
>  {
[...]
> +    if ( lvl == L1_FRAME || (*pte & _PAGE_PRESENT) )

Did you mean (*pte & _PAGE_PSE)?

>      {
> +        *result = pte;
> +        return 1;
>      }

Samuel
Jürgen Groß Aug. 1, 2024, 5:56 a.m. UTC | #2
On 31.07.24 23:27, Samuel Thibault wrote:
> Hello,
> 
> Juergen Gross, le mer. 31 juil. 2024 15:00:25 +0200, a ecrit:
>> -pgentry_t *need_pgt(unsigned long va)
>> +static int need_pgt_func(unsigned long va, unsigned int lvl, bool is_leaf,
>> +                         pgentry_t *pte, void *par)
>>   {
> [...]
>> +    if ( lvl == L1_FRAME || (*pte & _PAGE_PRESENT) )
> 
> Did you mean (*pte & _PAGE_PSE)?

No. I want to bail out if the PTE does not require a page table to be
added to it. This is the case if the PTE is valid or if it is at the
lowest page table level already.

An invalid PTE with PSE set is still invalid, so the PSE bit has no
real meaning.


Juergen
Samuel Thibault Aug. 1, 2024, 7:39 a.m. UTC | #3
Jürgen Groß, le jeu. 01 août 2024 07:56:36 +0200, a ecrit:
> On 31.07.24 23:27, Samuel Thibault wrote:
> > Hello,
> > 
> > Juergen Gross, le mer. 31 juil. 2024 15:00:25 +0200, a ecrit:
> > > -pgentry_t *need_pgt(unsigned long va)
> > > +static int need_pgt_func(unsigned long va, unsigned int lvl, bool is_leaf,
> > > +                         pgentry_t *pte, void *par)
> > >   {
> > [...]
> > > +    if ( lvl == L1_FRAME || (*pte & _PAGE_PRESENT) )
> > 
> > Did you mean (*pte & _PAGE_PSE)?
> 
> No.

I don't understand: it doesn't map what I know of need_pgt and what the
existing code is doing.

AIKI, the point of need_pgt is to make sure there is a L1 page table
entry for a VA, and return it, so the caller can put in it at pte for a
mfn or such. In the case a PSE is met, we don't go further, and it's up
to the caller to decide what it wants to do (most often it's actually
unexpected and asserted out). In both cases, the PRESENT bit of the
pte whose address is returned does not matter, most often it's the
caller which will set it.

The existing code for need_pgt thus always adds page table entries down
to level1 (except if _PAGE_PSE is met, i.e. a large page was already set
up): the termination was:

[... walk down to level 2]
-    if ( tab[offset] & _PAGE_PSE )
-        return &tab[offset];
[... walk down to level 1]
-    return &tab[offset];

i.e. we always return either a PSE L2 entry, or an L1 entry, and never
above (while keeping sure that the entries above are present).

> I want to bail out if the PTE does not require a page table to be
> added to it. This is the case if the PTE is valid

There being an entry at e.g. level 3 (without PSE) does not mean that
we have an entry at level 1 for the VA, so we have to continue adding
levels for the VA in the sparse page table tree.

> An invalid PTE with PSE set is still invalid, so the PSE bit has no
> real meaning.

Mmm, I don't think we ever create entries with PSE but not PRESENT (i.e.
we don't "plan" to have PSE entries), but even if we did, if we "plan"
to have PSE entries, we should return that entry, and it's up to the
caller to fill it (and make it PRESENT)

Samuel
Jürgen Groß Aug. 1, 2024, 8:13 a.m. UTC | #4
On 01.08.24 09:39, Samuel Thibault wrote:
> Jürgen Groß, le jeu. 01 août 2024 07:56:36 +0200, a ecrit:
>> On 31.07.24 23:27, Samuel Thibault wrote:
>>> Hello,
>>>
>>> Juergen Gross, le mer. 31 juil. 2024 15:00:25 +0200, a ecrit:
>>>> -pgentry_t *need_pgt(unsigned long va)
>>>> +static int need_pgt_func(unsigned long va, unsigned int lvl, bool is_leaf,
>>>> +                         pgentry_t *pte, void *par)
>>>>    {
>>> [...]
>>>> +    if ( lvl == L1_FRAME || (*pte & _PAGE_PRESENT) )
>>>
>>> Did you mean (*pte & _PAGE_PSE)?
>>
>> No.
> 
> I don't understand: it doesn't map what I know of need_pgt and what the
> existing code is doing.
> 
> AIKI, the point of need_pgt is to make sure there is a L1 page table
> entry for a VA, and return it, so the caller can put in it at pte for a
> mfn or such. In the case a PSE is met, we don't go further, and it's up
> to the caller to decide what it wants to do (most often it's actually
> unexpected and asserted out). In both cases, the PRESENT bit of the
> pte whose address is returned does not matter, most often it's the
> caller which will set it.
> 
> The existing code for need_pgt thus always adds page table entries down
> to level1 (except if _PAGE_PSE is met, i.e. a large page was already set
> up): the termination was:
> 
> [... walk down to level 2]

You have omitted:

        ASSERT(tab[offset] & _PAGE_PRESENT);

> -    if ( tab[offset] & _PAGE_PSE )
> -        return &tab[offset];
> [... walk down to level 1]
> -    return &tab[offset];


Juergen
Samuel Thibault Aug. 1, 2024, 9:15 a.m. UTC | #5
Jürgen Groß, le jeu. 01 août 2024 10:13:07 +0200, a ecrit:
> On 01.08.24 09:39, Samuel Thibault wrote:
> > Jürgen Groß, le jeu. 01 août 2024 07:56:36 +0200, a ecrit:
> > > On 31.07.24 23:27, Samuel Thibault wrote:
> > > > Hello,
> > > > 
> > > > Juergen Gross, le mer. 31 juil. 2024 15:00:25 +0200, a ecrit:
> > > > > -pgentry_t *need_pgt(unsigned long va)
> > > > > +static int need_pgt_func(unsigned long va, unsigned int lvl, bool is_leaf,
> > > > > +                         pgentry_t *pte, void *par)
> > > > >    {
> > > > [...]
> > > > > +    if ( lvl == L1_FRAME || (*pte & _PAGE_PRESENT) )
> > > > 
> > > > Did you mean (*pte & _PAGE_PSE)?
> > > 
> > > No.
> > 
> > I don't understand: it doesn't map what I know of need_pgt and what the
> > existing code is doing.
> > 
> > AIKI, the point of need_pgt is to make sure there is a L1 page table
> > entry for a VA, and return it, so the caller can put in it at pte for a
> > mfn or such. In the case a PSE is met, we don't go further, and it's up
> > to the caller to decide what it wants to do (most often it's actually
> > unexpected and asserted out). In both cases, the PRESENT bit of the
> > pte whose address is returned does not matter, most often it's the
> > caller which will set it.
> > 
> > The existing code for need_pgt thus always adds page table entries down
> > to level1 (except if _PAGE_PSE is met, i.e. a large page was already set
> > up): the termination was:
> > 
> > [... walk down to level 2]
> 
> You have omitted:
> 
>        ASSERT(tab[offset] & _PAGE_PRESENT);

Right, that confirms that we never set PSE without PRESENT. We probably
want to keep that assertion in the PSE case.

But we still want to continue walking down to level1 (or pse) even when
L4/L3/L2 have PRESENT.

> > -    if ( tab[offset] & _PAGE_PSE )
> > -        return &tab[offset];
> > [... walk down to level 1]
> > -    return &tab[offset];

Samuel
Jürgen Groß Aug. 1, 2024, 9:25 a.m. UTC | #6
On 01.08.24 11:15, Samuel Thibault wrote:
> Jürgen Groß, le jeu. 01 août 2024 10:13:07 +0200, a ecrit:
>> On 01.08.24 09:39, Samuel Thibault wrote:
>>> Jürgen Groß, le jeu. 01 août 2024 07:56:36 +0200, a ecrit:
>>>> On 31.07.24 23:27, Samuel Thibault wrote:
>>>>> Hello,
>>>>>
>>>>> Juergen Gross, le mer. 31 juil. 2024 15:00:25 +0200, a ecrit:
>>>>>> -pgentry_t *need_pgt(unsigned long va)
>>>>>> +static int need_pgt_func(unsigned long va, unsigned int lvl, bool is_leaf,
>>>>>> +                         pgentry_t *pte, void *par)
>>>>>>     {
>>>>> [...]
>>>>>> +    if ( lvl == L1_FRAME || (*pte & _PAGE_PRESENT) )
>>>>>
>>>>> Did you mean (*pte & _PAGE_PSE)?
>>>>
>>>> No.
>>>
>>> I don't understand: it doesn't map what I know of need_pgt and what the
>>> existing code is doing.
>>>
>>> AIKI, the point of need_pgt is to make sure there is a L1 page table
>>> entry for a VA, and return it, so the caller can put in it at pte for a
>>> mfn or such. In the case a PSE is met, we don't go further, and it's up
>>> to the caller to decide what it wants to do (most often it's actually
>>> unexpected and asserted out). In both cases, the PRESENT bit of the
>>> pte whose address is returned does not matter, most often it's the
>>> caller which will set it.
>>>
>>> The existing code for need_pgt thus always adds page table entries down
>>> to level1 (except if _PAGE_PSE is met, i.e. a large page was already set
>>> up): the termination was:
>>>
>>> [... walk down to level 2]
>>
>> You have omitted:
>>
>>         ASSERT(tab[offset] & _PAGE_PRESENT);
> 
> Right, that confirms that we never set PSE without PRESENT. We probably
> want to keep that assertion in the PSE case.

Hmm, I'm not sure.

The hardware allows to have any bit set in the PTE when PRESENT isn't set.
It will never look at any other bit in this case. Why should the software
require something different here?

> But we still want to continue walking down to level1 (or pse) even when
> L4/L3/L2 have PRESENT.

We do that. In this case the is_leaf flag won't be set and need_pgt_func()
will return early with the return value 0, meaning that the walk will be
continued.


Juergen
Samuel Thibault Aug. 1, 2024, 10:12 a.m. UTC | #7
Jürgen Groß, le jeu. 01 août 2024 11:25:31 +0200, a ecrit:
> On 01.08.24 11:15, Samuel Thibault wrote:
> > Jürgen Groß, le jeu. 01 août 2024 10:13:07 +0200, a ecrit:
> > > On 01.08.24 09:39, Samuel Thibault wrote:
> > > > Jürgen Groß, le jeu. 01 août 2024 07:56:36 +0200, a ecrit:
> > > > > On 31.07.24 23:27, Samuel Thibault wrote:
> > > > > > Hello,
> > > > > > 
> > > > > > Juergen Gross, le mer. 31 juil. 2024 15:00:25 +0200, a ecrit:
> > > > > > > -pgentry_t *need_pgt(unsigned long va)
> > > > > > > +static int need_pgt_func(unsigned long va, unsigned int lvl, bool is_leaf,
> > > > > > > +                         pgentry_t *pte, void *par)
> > > > > > >     {
> > > > > > [...]
> > > > > > > +    if ( lvl == L1_FRAME || (*pte & _PAGE_PRESENT) )
> > > > > > 
> > > > > > Did you mean (*pte & _PAGE_PSE)?
> > > > > 
> > > > > No.
> > > > 
> > > > I don't understand: it doesn't map what I know of need_pgt and what the
> > > > existing code is doing.
> > > > 
> > > > AIKI, the point of need_pgt is to make sure there is a L1 page table
> > > > entry for a VA, and return it, so the caller can put in it at pte for a
> > > > mfn or such. In the case a PSE is met, we don't go further, and it's up
> > > > to the caller to decide what it wants to do (most often it's actually
> > > > unexpected and asserted out). In both cases, the PRESENT bit of the
> > > > pte whose address is returned does not matter, most often it's the
> > > > caller which will set it.
> > > > 
> > > > The existing code for need_pgt thus always adds page table entries down
> > > > to level1 (except if _PAGE_PSE is met, i.e. a large page was already set
> > > > up): the termination was:
> > > > 
> > > > [... walk down to level 2]
> > > 
> > > You have omitted:
> > > 
> > >         ASSERT(tab[offset] & _PAGE_PRESENT);
> > 
> > Right, that confirms that we never set PSE without PRESENT. We probably
> > want to keep that assertion in the PSE case.
> 
> Hmm, I'm not sure.
> 
> The hardware allows to have any bit set in the PTE when PRESENT isn't set.
> It will never look at any other bit in this case. Why should the software
> require something different here?

To document&check what we expect to be producing.

> > But we still want to continue walking down to level1 (or pse) even when
> > L4/L3/L2 have PRESENT.
> 
> We do that. In this case the is_leaf flag won't be set and need_pgt_func()
> will return early with the return value 0, meaning that the walk will be
> continued.

Ah, I missed that indeed, I had not integrated that is_leaf is 1 also in
the case of missing level>1 page. It'd probably be useful to emphasize
in patch 1:

+ * is_leaf: true if PTE doesn't address another page table

it should explicitly say (L1 PTE, or PSE, or not present yet)

The *pte & _PAGE_PRESENT test alone still looks surprising to the
reader, though: we don't return because the page is present, but because
we reached the last possible level that we could fill, which is either
L1, or an existing PSE. It is true that when is_leaf is 1, if it's not
L1 but present, it's necessarily a PSE, but it's clearer to the reader
to write what we meant: L1 or PSE, so something like:

if ( lvl == L1_FRAME || (*pte & _PAGE_PRESENT) ) {
    assert ( lvl == L1_FRAME || (*pte & _PAGE_PSE) ); // we are at last possible level
    [...]
}

Samuel
diff mbox series

Patch

diff --git a/arch/x86/mm.c b/arch/x86/mm.c
index cc4d41e9..accde291 100644
--- a/arch/x86/mm.c
+++ b/arch/x86/mm.c
@@ -518,57 +518,39 @@  static pgentry_t *get_pgt(unsigned long va)
  * return a valid PTE for a given virtual address. If PTE does not exist,
  * allocate page-table pages.
  */
-pgentry_t *need_pgt(unsigned long va)
+static int need_pgt_func(unsigned long va, unsigned int lvl, bool is_leaf,
+                         pgentry_t *pte, void *par)
 {
+    pgentry_t **result = par;
     unsigned long pt_mfn;
-    pgentry_t *tab;
     unsigned long pt_pfn;
-    unsigned offset;
+    unsigned int idx;
 
-    tab = pt_base;
-    pt_mfn = virt_to_mfn(pt_base);
+    if ( !is_leaf )
+        return 0;
 
-#if defined(__x86_64__)
-    offset = l4_table_offset(va);
-    if ( !(tab[offset] & _PAGE_PRESENT) )
-    {
-        pt_pfn = virt_to_pfn(alloc_page());
-        if ( !pt_pfn )
-            return NULL;
-        new_pt_frame(&pt_pfn, pt_mfn, offset, L3_FRAME);
-    }
-    ASSERT(tab[offset] & _PAGE_PRESENT);
-    pt_mfn = pte_to_mfn(tab[offset]);
-    tab = mfn_to_virt(pt_mfn);
-#endif
-    offset = l3_table_offset(va);
-    if ( !(tab[offset] & _PAGE_PRESENT) ) 
-    {
-        pt_pfn = virt_to_pfn(alloc_page());
-        if ( !pt_pfn )
-            return NULL;
-        new_pt_frame(&pt_pfn, pt_mfn, offset, L2_FRAME);
-    }
-    ASSERT(tab[offset] & _PAGE_PRESENT);
-    pt_mfn = pte_to_mfn(tab[offset]);
-    tab = mfn_to_virt(pt_mfn);
-    offset = l2_table_offset(va);
-    if ( !(tab[offset] & _PAGE_PRESENT) )
+    if ( lvl == L1_FRAME || (*pte & _PAGE_PRESENT) )
     {
-        pt_pfn = virt_to_pfn(alloc_page());
-        if ( !pt_pfn )
-            return NULL;
-        new_pt_frame(&pt_pfn, pt_mfn, offset, L1_FRAME);
+        *result = pte;
+        return 1;
     }
-    ASSERT(tab[offset] & _PAGE_PRESENT);
-    if ( tab[offset] & _PAGE_PSE )
-        return &tab[offset];
 
-    pt_mfn = pte_to_mfn(tab[offset]);
-    tab = mfn_to_virt(pt_mfn);
+    pt_mfn = virt_to_mfn(pte);
+    pt_pfn = virt_to_pfn(alloc_page());
+    if ( !pt_pfn )
+        return -1;
+    idx = (va >> ptdata[lvl].shift) & (ptdata[lvl].entries - 1);
+    new_pt_frame(&pt_pfn, pt_mfn, idx, lvl - 1);
 
-    offset = l1_table_offset(va);
-    return &tab[offset];
+    return 0;
+}
+
+pgentry_t *need_pgt(unsigned long va)
+{
+    pgentry_t *tab = NULL;
+
+    walk_pt(va, va, need_pgt_func, &tab);
+    return tab;
 }
 EXPORT_SYMBOL(need_pgt);