diff mbox series

[v2,13/13] mm/gup: move private gup FOLL_ flags to internal.h

Message ID 13-v2-987e91b59705+36b-gup_tidy_jgg@nvidia.com (mailing list archive)
State New
Headers show
Series Simplify the external interface for GUP | expand

Commit Message

Jason Gunthorpe Jan. 24, 2023, 8:34 p.m. UTC
Move the flags that should not/are not used outside gup.c and related into
mm/internal.h to discourage driver abuse.

To make this more maintainable going forward compact the two FOLL ranges
with new bit numbers from 0 to 11 and 16 to 21, using shifts so it is
explict.

Switch to an enum so the whole thing is easier to read.

Cc: David Howells <dhowells@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 include/linux/mm_types.h | 57 ++++++++++++++++++++++++----------------
 mm/internal.h            | 15 +++++++++++
 2 files changed, 50 insertions(+), 22 deletions(-)

Comments

John Hubbard Jan. 25, 2023, 2:44 a.m. UTC | #1
On 1/24/23 12:34, Jason Gunthorpe wrote:
> Move the flags that should not/are not used outside gup.c and related into
> mm/internal.h to discourage driver abuse.
> 
> To make this more maintainable going forward compact the two FOLL ranges
> with new bit numbers from 0 to 11 and 16 to 21, using shifts so it is
> explict.
> 
> Switch to an enum so the whole thing is easier to read.
> 
> Cc: David Howells <dhowells@redhat.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>   include/linux/mm_types.h | 57 ++++++++++++++++++++++++----------------
>   mm/internal.h            | 15 +++++++++++
>   2 files changed, 50 insertions(+), 22 deletions(-)

Reviewed-by: John Hubbard <jhubbard@nvidia.com>


thanks,
David Hildenbrand Jan. 26, 2023, 12:48 p.m. UTC | #2
On 24.01.23 21:34, Jason Gunthorpe wrote:
> Move the flags that should not/are not used outside gup.c and related into
> mm/internal.h to discourage driver abuse.
> 
> To make this more maintainable going forward compact the two FOLL ranges
> with new bit numbers from 0 to 11 and 16 to 21, using shifts so it is
> explict.
> 
> Switch to an enum so the whole thing is easier to read.

Using a __bitwise type would be even better, but that requires quite 
some adjustments ...

The primary leftover for FOLL_GET seems to be follow_page(). IIRC, there 
is only one caller that doesn't pass FOLL_GET (s390). We could either 
add a new function to "probe" that anything is mapped (IIRC that's the 
use case), or simply ref+unref.

So we might just be able to make FOLL_GET internal fairly easily, too ...

Anyhow

Acked-by: David Hildenbrand <david@redhat.com>
Jason Gunthorpe Jan. 26, 2023, 12:55 p.m. UTC | #3
On Thu, Jan 26, 2023 at 01:48:46PM +0100, David Hildenbrand wrote:
> On 24.01.23 21:34, Jason Gunthorpe wrote:
> > Move the flags that should not/are not used outside gup.c and related into
> > mm/internal.h to discourage driver abuse.
> > 
> > To make this more maintainable going forward compact the two FOLL ranges
> > with new bit numbers from 0 to 11 and 16 to 21, using shifts so it is
> > explict.
> > 
> > Switch to an enum so the whole thing is easier to read.
> 
> Using a __bitwise type would be even better, but that requires quite some
> adjustments ...
> 
> The primary leftover for FOLL_GET seems to be follow_page(). IIRC, there is
> only one caller that doesn't pass FOLL_GET (s390). We could either add a new
> function to "probe" that anything is mapped (IIRC that's the use case), or
> simply ref+unref.

Is that code even safe as written? I don't really understand how it
can safely call lock_page() on something it doesn't have a reference
too ?

So adding the FOLL_GET and put_page seems like a good idea to me? At a
minimum this should get a comment to explain why it is OK.

S390 people?

int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb)
{
[..]
        rc = -ENXIO;
        page = follow_page(vma, uaddr, FOLL_WRITE);
        if (IS_ERR_OR_NULL(page))
                goto out;

        lock_page(page);
        ptep = get_locked_pte(gmap->mm, uaddr, &ptelock);

Jason
David Hildenbrand Jan. 26, 2023, 1:06 p.m. UTC | #4
On 26.01.23 13:55, Jason Gunthorpe wrote:
> On Thu, Jan 26, 2023 at 01:48:46PM +0100, David Hildenbrand wrote:
>> On 24.01.23 21:34, Jason Gunthorpe wrote:
>>> Move the flags that should not/are not used outside gup.c and related into
>>> mm/internal.h to discourage driver abuse.
>>>
>>> To make this more maintainable going forward compact the two FOLL ranges
>>> with new bit numbers from 0 to 11 and 16 to 21, using shifts so it is
>>> explict.
>>>
>>> Switch to an enum so the whole thing is easier to read.
>>
>> Using a __bitwise type would be even better, but that requires quite some
>> adjustments ...
>>
>> The primary leftover for FOLL_GET seems to be follow_page(). IIRC, there is
>> only one caller that doesn't pass FOLL_GET (s390). We could either add a new
>> function to "probe" that anything is mapped (IIRC that's the use case), or
>> simply ref+unref.
> 
> Is that code even safe as written? I don't really understand how it
> can safely call lock_page() on something it doesn't have a reference
> too ?

Let me look into the details ... I remember reviewing that before I got 
to study the beauty of GUP in more detail.

CCin Claudio

> 
> So adding the FOLL_GET and put_page seems like a good idea to me? At a
> minimum this should get a comment to explain why it is OK.
> 

At first sight, it really feels like the right thing to do. Maybe 
another good reason to handle FOLL_GET vs. FOLL_PIN completely internal. 
Harder to abuse.


> S390 people?
> 
> int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb)
> {
> [..]
>          rc = -ENXIO;
>          page = follow_page(vma, uaddr, FOLL_WRITE);
>          if (IS_ERR_OR_NULL(page))
>                  goto out;
> 
>          lock_page(page);
>          ptep = get_locked_pte(gmap->mm, uaddr, &ptelock);
> 
> Jason
>
Claudio Imbrenda Jan. 26, 2023, 2:41 p.m. UTC | #5
On Thu, 26 Jan 2023 08:55:27 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Thu, Jan 26, 2023 at 01:48:46PM +0100, David Hildenbrand wrote:
> > On 24.01.23 21:34, Jason Gunthorpe wrote:  
> > > Move the flags that should not/are not used outside gup.c and related into
> > > mm/internal.h to discourage driver abuse.
> > > 
> > > To make this more maintainable going forward compact the two FOLL ranges
> > > with new bit numbers from 0 to 11 and 16 to 21, using shifts so it is
> > > explict.
> > > 
> > > Switch to an enum so the whole thing is easier to read.  
> > 
> > Using a __bitwise type would be even better, but that requires quite some
> > adjustments ...
> > 
> > The primary leftover for FOLL_GET seems to be follow_page(). IIRC, there is
> > only one caller that doesn't pass FOLL_GET (s390). We could either add a new
> > function to "probe" that anything is mapped (IIRC that's the use case), or
> > simply ref+unref.  
> 
> Is that code even safe as written? I don't really understand how it

yes (surprisingly) it is

> can safely call lock_page() on something it doesn't have a reference
> too ?

the code between lock_page and unlock_page will behave "properly" and
do nothing or at worst cause a tiny performance issue in the rare case
something changes between the follow_page and the page_lock, i.e. if
things are done on the wrong page.

make_secure_pte does some very hacky stuff involving page_ref_freeze,
with expected_page_refs counting how many references are expected.

the code is how it is because, when it was written, FOLL_PIN did
not exist, and FOLL_GET caused the page to be converted to unsecure.

I guess maybe we can make it work with FOLL_GET if expected_page_refs
takes into account the extra reference? (and then maybe we could a
put_page after unlock_page)

> 
> So adding the FOLL_GET and put_page seems like a good idea to me? At a
> minimum this should get a comment to explain why it is OK.
> 
> S390 people?
> 
> int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb)
> {
> [..]
>         rc = -ENXIO;
>         page = follow_page(vma, uaddr, FOLL_WRITE);
>         if (IS_ERR_OR_NULL(page))
>                 goto out;
> 
>         lock_page(page);
>         ptep = get_locked_pte(gmap->mm, uaddr, &ptelock);
> 
> Jason
David Hildenbrand Jan. 26, 2023, 2:46 p.m. UTC | #6
On 26.01.23 15:41, Claudio Imbrenda wrote:
> On Thu, 26 Jan 2023 08:55:27 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
>> On Thu, Jan 26, 2023 at 01:48:46PM +0100, David Hildenbrand wrote:
>>> On 24.01.23 21:34, Jason Gunthorpe wrote:
>>>> Move the flags that should not/are not used outside gup.c and related into
>>>> mm/internal.h to discourage driver abuse.
>>>>
>>>> To make this more maintainable going forward compact the two FOLL ranges
>>>> with new bit numbers from 0 to 11 and 16 to 21, using shifts so it is
>>>> explict.
>>>>
>>>> Switch to an enum so the whole thing is easier to read.
>>>
>>> Using a __bitwise type would be even better, but that requires quite some
>>> adjustments ...
>>>
>>> The primary leftover for FOLL_GET seems to be follow_page(). IIRC, there is
>>> only one caller that doesn't pass FOLL_GET (s390). We could either add a new
>>> function to "probe" that anything is mapped (IIRC that's the use case), or
>>> simply ref+unref.
>>
>> Is that code even safe as written? I don't really understand how it
> 
> yes (surprisingly) it is
> 
>> can safely call lock_page() on something it doesn't have a reference
>> too ?
> 
> the code between lock_page and unlock_page will behave "properly" and
> do nothing or at worst cause a tiny performance issue in the rare case
> something changes between the follow_page and the page_lock, i.e. if
> things are done on the wrong page.

What prevents the page from getting unmapped (MADV_DONTNEED), freed, 
reallocated as a larger folio and the unlock_page() would target the 
wrong bit? I think even while freeing a locked page we might run into 
trouble ...
Jason Gunthorpe Jan. 26, 2023, 3:05 p.m. UTC | #7
On Thu, Jan 26, 2023 at 03:46:09PM +0100, David Hildenbrand wrote:
> On 26.01.23 15:41, Claudio Imbrenda wrote:
> > On Thu, 26 Jan 2023 08:55:27 -0400
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > 
> > > On Thu, Jan 26, 2023 at 01:48:46PM +0100, David Hildenbrand wrote:
> > > > On 24.01.23 21:34, Jason Gunthorpe wrote:
> > > > > Move the flags that should not/are not used outside gup.c and related into
> > > > > mm/internal.h to discourage driver abuse.
> > > > > 
> > > > > To make this more maintainable going forward compact the two FOLL ranges
> > > > > with new bit numbers from 0 to 11 and 16 to 21, using shifts so it is
> > > > > explict.
> > > > > 
> > > > > Switch to an enum so the whole thing is easier to read.
> > > > 
> > > > Using a __bitwise type would be even better, but that requires quite some
> > > > adjustments ...
> > > > 
> > > > The primary leftover for FOLL_GET seems to be follow_page(). IIRC, there is
> > > > only one caller that doesn't pass FOLL_GET (s390). We could either add a new
> > > > function to "probe" that anything is mapped (IIRC that's the use case), or
> > > > simply ref+unref.
> > > 
> > > Is that code even safe as written? I don't really understand how it
> > 
> > yes (surprisingly) it is
> > 
> > > can safely call lock_page() on something it doesn't have a reference
> > > too ?
> > 
> > the code between lock_page and unlock_page will behave "properly" and
> > do nothing or at worst cause a tiny performance issue in the rare case
> > something changes between the follow_page and the page_lock, i.e. if
> > things are done on the wrong page.
> 
> What prevents the page from getting unmapped (MADV_DONTNEED), freed,
> reallocated as a larger folio and the unlock_page() would target the wrong
> bit? I think even while freeing a locked page we might run into trouble ...

Yep. 

The issue is you can't call lock_page() on something you don't have a
ref to.

The worst case would be the memory got unmapped from the VMA and the
entire memory space was hot-unpluged eg it was DAX or something. Now
the page pointer will oops if you call lock_page.

Why not just use the get_locked_pte() exclusively and do -EAGAIN or
-EBUSY if folio_try_lock fails, under the PTL? This already happens
for PageWriteback caes.

Jason
Claudio Imbrenda Jan. 26, 2023, 3:39 p.m. UTC | #8
On Thu, 26 Jan 2023 11:05:27 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Thu, Jan 26, 2023 at 03:46:09PM +0100, David Hildenbrand wrote:
> > On 26.01.23 15:41, Claudio Imbrenda wrote:  
> > > On Thu, 26 Jan 2023 08:55:27 -0400
> > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >   
> > > > On Thu, Jan 26, 2023 at 01:48:46PM +0100, David Hildenbrand wrote:  
> > > > > On 24.01.23 21:34, Jason Gunthorpe wrote:  
> > > > > > Move the flags that should not/are not used outside gup.c and related into
> > > > > > mm/internal.h to discourage driver abuse.
> > > > > > 
> > > > > > To make this more maintainable going forward compact the two FOLL ranges
> > > > > > with new bit numbers from 0 to 11 and 16 to 21, using shifts so it is
> > > > > > explict.
> > > > > > 
> > > > > > Switch to an enum so the whole thing is easier to read.  
> > > > > 
> > > > > Using a __bitwise type would be even better, but that requires quite some
> > > > > adjustments ...
> > > > > 
> > > > > The primary leftover for FOLL_GET seems to be follow_page(). IIRC, there is
> > > > > only one caller that doesn't pass FOLL_GET (s390). We could either add a new
> > > > > function to "probe" that anything is mapped (IIRC that's the use case), or
> > > > > simply ref+unref.  
> > > > 
> > > > Is that code even safe as written? I don't really understand how it  
> > > 
> > > yes (surprisingly) it is
> > >   
> > > > can safely call lock_page() on something it doesn't have a reference
> > > > too ?  
> > > 
> > > the code between lock_page and unlock_page will behave "properly" and
> > > do nothing or at worst cause a tiny performance issue in the rare case
> > > something changes between the follow_page and the page_lock, i.e. if
> > > things are done on the wrong page.  
> > 
> > What prevents the page from getting unmapped (MADV_DONTNEED), freed,
> > reallocated as a larger folio and the unlock_page() would target the wrong
> > bit? I think even while freeing a locked page we might run into trouble ...  
> 
> Yep. 
> 
> The issue is you can't call lock_page() on something you don't have a
> ref to.

so we have been doing this wrong the whole time? oops

> 
> The worst case would be the memory got unmapped from the VMA and the
> entire memory space was hot-unpluged eg it was DAX or something. Now
> the page pointer will oops if you call lock_page.

we do not have memory mapped devices or anything, so this scenario is
highly unlikely (at last this)

> 
> Why not just use the get_locked_pte() exclusively and do -EAGAIN or
> -EBUSY if folio_try_lock fails, under the PTL? This already happens
> for PageWriteback caes.

I think I will need some time to process this sentence


I can tell you that the original goal of that function is to make sure
that there are no extra references. in particular, we want to prevent
I/O of any kind to be ongoing while the page becomes secure. (the I/O
will fail and, depending on which device it was, the whole system might
end up in a rather unhappy state)

transitioning from secure to non-secure instead is much easier

> 
> Jason
Jason Gunthorpe Jan. 26, 2023, 4:35 p.m. UTC | #9
On Thu, Jan 26, 2023 at 04:39:02PM +0100, Claudio Imbrenda wrote:

> I can tell you that the original goal of that function is to make sure
> that there are no extra references. in particular, we want to prevent
> I/O of any kind to be ongoing while the page becomes secure. (the I/O
> will fail and, depending on which device it was, the whole system might
> end up in a rather unhappy state)

Sure, but if there is concurrent IO you just try again right? It
doesn't wait for refs to drop for instance.

So make the lock_page work the same way:

diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index 9f18a4af9c1319..847ee50b8672c6 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -193,20 +193,11 @@ static int expected_page_refs(struct page *page)
 }
 
 static int make_secure_pte(pte_t *ptep, unsigned long addr,
-			   struct page *exp_page, struct uv_cb_header *uvcb)
+			   struct page *page, struct uv_cb_header *uvcb)
 {
 	pte_t entry = READ_ONCE(*ptep);
-	struct page *page;
 	int expected, cc = 0;
 
-	if (!pte_present(entry))
-		return -ENXIO;
-	if (pte_val(entry) & _PAGE_INVALID)
-		return -ENXIO;
-
-	page = pte_page(entry);
-	if (page != exp_page)
-		return -ENXIO;
 	if (PageWriteback(page))
 		return -EAGAIN;
 	expected = expected_page_refs(page);
@@ -304,17 +295,25 @@ int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb)
 		goto out;
 
 	rc = -ENXIO;
-	page = follow_page(vma, uaddr, FOLL_WRITE);
-	if (IS_ERR_OR_NULL(page))
-		goto out;
-
-	lock_page(page);
 	ptep = get_locked_pte(gmap->mm, uaddr, &ptelock);
+
+	if (!pte_present(entry))
+		goto out_unlock_pte;
+	if (pte_val(entry) & _PAGE_INVALID)
+		goto out_unlock_pte;
+	page = pte_page(entry);
+
+	if (!trylock_page(page)) {
+		rc = -EAGAIN;
+		goto out_unlock_pte;
+	}
+
 	if (should_export_before_import(uvcb, gmap->mm))
 		uv_convert_from_secure(page_to_phys(page));
 	rc = make_secure_pte(ptep, uaddr, page, uvcb);
-	pte_unmap_unlock(ptep, ptelock);
 	unlock_page(page);
+out_unlock_pte:
+	pte_unmap_unlock(ptep, ptelock);
 out:
 	mmap_read_unlock(gmap->mm);
Claudio Imbrenda Jan. 26, 2023, 5:24 p.m. UTC | #10
On Thu, 26 Jan 2023 12:35:37 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Thu, Jan 26, 2023 at 04:39:02PM +0100, Claudio Imbrenda wrote:
> 
> > I can tell you that the original goal of that function is to make sure
> > that there are no extra references. in particular, we want to prevent
> > I/O of any kind to be ongoing while the page becomes secure. (the I/O
> > will fail and, depending on which device it was, the whole system might
> > end up in a rather unhappy state)  
> 
> Sure, but if there is concurrent IO you just try again right? It
> doesn't wait for refs to drop for instance.
> 
> So make the lock_page work the same way:

I'll need to think carefully about this. I remember that there were good
reasons to do things the way we did, so I want to make sure we don't
break something by changing things around.

> 
> diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
> index 9f18a4af9c1319..847ee50b8672c6 100644
> --- a/arch/s390/kernel/uv.c
> +++ b/arch/s390/kernel/uv.c
> @@ -193,20 +193,11 @@ static int expected_page_refs(struct page *page)
>  }
>  
>  static int make_secure_pte(pte_t *ptep, unsigned long addr,
> -			   struct page *exp_page, struct uv_cb_header *uvcb)
> +			   struct page *page, struct uv_cb_header *uvcb)
>  {
>  	pte_t entry = READ_ONCE(*ptep);
> -	struct page *page;
>  	int expected, cc = 0;
>  
> -	if (!pte_present(entry))
> -		return -ENXIO;
> -	if (pte_val(entry) & _PAGE_INVALID)
> -		return -ENXIO;
> -
> -	page = pte_page(entry);
> -	if (page != exp_page)
> -		return -ENXIO;
>  	if (PageWriteback(page))
>  		return -EAGAIN;
>  	expected = expected_page_refs(page);
> @@ -304,17 +295,25 @@ int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb)
>  		goto out;
>  
>  	rc = -ENXIO;
> -	page = follow_page(vma, uaddr, FOLL_WRITE);
> -	if (IS_ERR_OR_NULL(page))
> -		goto out;
> -
> -	lock_page(page);
>  	ptep = get_locked_pte(gmap->mm, uaddr, &ptelock);
> +
> +	if (!pte_present(entry))
> +		goto out_unlock_pte;
> +	if (pte_val(entry) & _PAGE_INVALID)
> +		goto out_unlock_pte;
> +	page = pte_page(entry);
> +
> +	if (!trylock_page(page)) {
> +		rc = -EAGAIN;
> +		goto out_unlock_pte;
> +	}
> +
>  	if (should_export_before_import(uvcb, gmap->mm))
>  		uv_convert_from_secure(page_to_phys(page));
>  	rc = make_secure_pte(ptep, uaddr, page, uvcb);
> -	pte_unmap_unlock(ptep, ptelock);
>  	unlock_page(page);
> +out_unlock_pte:
> +	pte_unmap_unlock(ptep, ptelock);
>  out:
>  	mmap_read_unlock(gmap->mm);
>
Claudio Imbrenda Jan. 30, 2023, 6:21 p.m. UTC | #11
On Thu, 26 Jan 2023 12:35:37 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Thu, Jan 26, 2023 at 04:39:02PM +0100, Claudio Imbrenda wrote:
> 
> > I can tell you that the original goal of that function is to make sure
> > that there are no extra references. in particular, we want to prevent
> > I/O of any kind to be ongoing while the page becomes secure. (the I/O
> > will fail and, depending on which device it was, the whole system might
> > end up in a rather unhappy state)  
> 
> Sure, but if there is concurrent IO you just try again right? It
> doesn't wait for refs to drop for instance.
> 
> So make the lock_page work the same way:

the more I look at this, the less I understand why I wrote the code
like that. I do have a comment, though

> 
> diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
> index 9f18a4af9c1319..847ee50b8672c6 100644
> --- a/arch/s390/kernel/uv.c
> +++ b/arch/s390/kernel/uv.c
> @@ -193,20 +193,11 @@ static int expected_page_refs(struct page *page)
>  }
>  
>  static int make_secure_pte(pte_t *ptep, unsigned long addr,
> -			   struct page *exp_page, struct uv_cb_header *uvcb)
> +			   struct page *page, struct uv_cb_header *uvcb)
>  {
>  	pte_t entry = READ_ONCE(*ptep);
> -	struct page *page;
>  	int expected, cc = 0;
>  
> -	if (!pte_present(entry))
> -		return -ENXIO;
> -	if (pte_val(entry) & _PAGE_INVALID)
> -		return -ENXIO;
> -
> -	page = pte_page(entry);
> -	if (page != exp_page)
> -		return -ENXIO;
>  	if (PageWriteback(page))
>  		return -EAGAIN;
>  	expected = expected_page_refs(page);
> @@ -304,17 +295,25 @@ int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb)
>  		goto out;
>  
>  	rc = -ENXIO;
> -	page = follow_page(vma, uaddr, FOLL_WRITE);
> -	if (IS_ERR_OR_NULL(page))
> -		goto out;
> -
> -	lock_page(page);
>  	ptep = get_locked_pte(gmap->mm, uaddr, &ptelock);
> +
> +	if (!pte_present(entry))
> +		goto out_unlock_pte;
> +	if (pte_val(entry) & _PAGE_INVALID)
> +		goto out_unlock_pte;

I guess we also need to make sure the page was writable?
FOLL_WRITE made sure of that

so I guess something like:

if (!pte_write(entry))
	goto out_unlock_pte;

> +	page = pte_page(entry);
> +
> +	if (!trylock_page(page)) {
> +		rc = -EAGAIN;
> +		goto out_unlock_pte;
> +	}
> +
>  	if (should_export_before_import(uvcb, gmap->mm))
>  		uv_convert_from_secure(page_to_phys(page));
>  	rc = make_secure_pte(ptep, uaddr, page, uvcb);
> -	pte_unmap_unlock(ptep, ptelock);
>  	unlock_page(page);
> +out_unlock_pte:
> +	pte_unmap_unlock(ptep, ptelock);
>  out:
>  	mmap_read_unlock(gmap->mm);
>
Jason Gunthorpe Jan. 30, 2023, 6:24 p.m. UTC | #12
On Mon, Jan 30, 2023 at 07:21:04PM +0100, Claudio Imbrenda wrote:
> On Thu, 26 Jan 2023 12:35:37 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Thu, Jan 26, 2023 at 04:39:02PM +0100, Claudio Imbrenda wrote:
> > 
> > > I can tell you that the original goal of that function is to make sure
> > > that there are no extra references. in particular, we want to prevent
> > > I/O of any kind to be ongoing while the page becomes secure. (the I/O
> > > will fail and, depending on which device it was, the whole system might
> > > end up in a rather unhappy state)  
> > 
> > Sure, but if there is concurrent IO you just try again right? It
> > doesn't wait for refs to drop for instance.
> > 
> > So make the lock_page work the same way:
> 
> the more I look at this, the less I understand why I wrote the code
> like that. I do have a comment, though
> 
> > 
> > diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
> > index 9f18a4af9c1319..847ee50b8672c6 100644
> > --- a/arch/s390/kernel/uv.c
> > +++ b/arch/s390/kernel/uv.c
> > @@ -193,20 +193,11 @@ static int expected_page_refs(struct page *page)
> >  }
> >  
> >  static int make_secure_pte(pte_t *ptep, unsigned long addr,
> > -			   struct page *exp_page, struct uv_cb_header *uvcb)
> > +			   struct page *page, struct uv_cb_header *uvcb)
> >  {
> >  	pte_t entry = READ_ONCE(*ptep);
> > -	struct page *page;
> >  	int expected, cc = 0;
> >  
> > -	if (!pte_present(entry))
> > -		return -ENXIO;
> > -	if (pte_val(entry) & _PAGE_INVALID)
> > -		return -ENXIO;
> > -
> > -	page = pte_page(entry);
> > -	if (page != exp_page)
> > -		return -ENXIO;
> >  	if (PageWriteback(page))
> >  		return -EAGAIN;
> >  	expected = expected_page_refs(page);
> > @@ -304,17 +295,25 @@ int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb)
> >  		goto out;
> >  
> >  	rc = -ENXIO;
> > -	page = follow_page(vma, uaddr, FOLL_WRITE);
> > -	if (IS_ERR_OR_NULL(page))
> > -		goto out;
> > -
> > -	lock_page(page);
> >  	ptep = get_locked_pte(gmap->mm, uaddr, &ptelock);
> > +
> > +	if (!pte_present(entry))
> > +		goto out_unlock_pte;
> > +	if (pte_val(entry) & _PAGE_INVALID)
> > +		goto out_unlock_pte;
> 
> I guess we also need to make sure the page was writable?
> FOLL_WRITE made sure of that
> 
> so I guess something like:
> 
> if (!pte_write(entry))
> 	goto out_unlock_pte;

Probably, that looks like an existing race that it wasn't re-checked :\

Jason
Claudio Imbrenda Feb. 7, 2023, 11:31 a.m. UTC | #13
On Mon, 30 Jan 2023 14:24:40 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Jan 30, 2023 at 07:21:04PM +0100, Claudio Imbrenda wrote:
> > On Thu, 26 Jan 2023 12:35:37 -0400
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Thu, Jan 26, 2023 at 04:39:02PM +0100, Claudio Imbrenda wrote:
> > >   
> > > > I can tell you that the original goal of that function is to make sure
> > > > that there are no extra references. in particular, we want to prevent
> > > > I/O of any kind to be ongoing while the page becomes secure. (the I/O
> > > > will fail and, depending on which device it was, the whole system might
> > > > end up in a rather unhappy state)    
> > > 
> > > Sure, but if there is concurrent IO you just try again right? It
> > > doesn't wait for refs to drop for instance.
> > > 
> > > So make the lock_page work the same way:  
> > 
> > the more I look at this, the less I understand why I wrote the code
> > like that. I do have a comment, though
> >   
> > > 
> > > diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
> > > index 9f18a4af9c1319..847ee50b8672c6 100644
> > > --- a/arch/s390/kernel/uv.c
> > > +++ b/arch/s390/kernel/uv.c
> > > @@ -193,20 +193,11 @@ static int expected_page_refs(struct page *page)
> > >  }
> > >  
> > >  static int make_secure_pte(pte_t *ptep, unsigned long addr,
> > > -			   struct page *exp_page, struct uv_cb_header *uvcb)
> > > +			   struct page *page, struct uv_cb_header *uvcb)
> > >  {
> > >  	pte_t entry = READ_ONCE(*ptep);
> > > -	struct page *page;
> > >  	int expected, cc = 0;
> > >  
> > > -	if (!pte_present(entry))
> > > -		return -ENXIO;
> > > -	if (pte_val(entry) & _PAGE_INVALID)
> > > -		return -ENXIO;
> > > -
> > > -	page = pte_page(entry);
> > > -	if (page != exp_page)
> > > -		return -ENXIO;
> > >  	if (PageWriteback(page))
> > >  		return -EAGAIN;
> > >  	expected = expected_page_refs(page);
> > > @@ -304,17 +295,25 @@ int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb)
> > >  		goto out;
> > >  
> > >  	rc = -ENXIO;
> > > -	page = follow_page(vma, uaddr, FOLL_WRITE);
> > > -	if (IS_ERR_OR_NULL(page))
> > > -		goto out;
> > > -
> > > -	lock_page(page);
> > >  	ptep = get_locked_pte(gmap->mm, uaddr, &ptelock);
> > > +
> > > +	if (!pte_present(entry))
> > > +		goto out_unlock_pte;
> > > +	if (pte_val(entry) & _PAGE_INVALID)
> > > +		goto out_unlock_pte;  
> > 
> > I guess we also need to make sure the page was writable?
> > FOLL_WRITE made sure of that
> > 
> > so I guess something like:
> > 
> > if (!pte_write(entry))
> > 	goto out_unlock_pte;  
> 
> Probably, that looks like an existing race that it wasn't re-checked :\
> 
> Jason

how do we want to proceed with this?

I can write a patch based on the above and see if anything breaks, but
it will take some thinking and testing and reviewing before I'll be
comfortable with it.
Jason Gunthorpe Feb. 7, 2023, 12:40 p.m. UTC | #14
On Tue, Feb 07, 2023 at 12:31:12PM +0100, Claudio Imbrenda wrote:
> On Mon, 30 Jan 2023 14:24:40 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Mon, Jan 30, 2023 at 07:21:04PM +0100, Claudio Imbrenda wrote:
> > > On Thu, 26 Jan 2023 12:35:37 -0400
> > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >   
> > > > On Thu, Jan 26, 2023 at 04:39:02PM +0100, Claudio Imbrenda wrote:
> > > >   
> > > > > I can tell you that the original goal of that function is to make sure
> > > > > that there are no extra references. in particular, we want to prevent
> > > > > I/O of any kind to be ongoing while the page becomes secure. (the I/O
> > > > > will fail and, depending on which device it was, the whole system might
> > > > > end up in a rather unhappy state)    
> > > > 
> > > > Sure, but if there is concurrent IO you just try again right? It
> > > > doesn't wait for refs to drop for instance.
> > > > 
> > > > So make the lock_page work the same way:  
> > > 
> > > the more I look at this, the less I understand why I wrote the code
> > > like that. I do have a comment, though
> > >   
> > > > 
> > > > diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
> > > > index 9f18a4af9c1319..847ee50b8672c6 100644
> > > > --- a/arch/s390/kernel/uv.c
> > > > +++ b/arch/s390/kernel/uv.c
> > > > @@ -193,20 +193,11 @@ static int expected_page_refs(struct page *page)
> > > >  }
> > > >  
> > > >  static int make_secure_pte(pte_t *ptep, unsigned long addr,
> > > > -			   struct page *exp_page, struct uv_cb_header *uvcb)
> > > > +			   struct page *page, struct uv_cb_header *uvcb)
> > > >  {
> > > >  	pte_t entry = READ_ONCE(*ptep);
> > > > -	struct page *page;
> > > >  	int expected, cc = 0;
> > > >  
> > > > -	if (!pte_present(entry))
> > > > -		return -ENXIO;
> > > > -	if (pte_val(entry) & _PAGE_INVALID)
> > > > -		return -ENXIO;
> > > > -
> > > > -	page = pte_page(entry);
> > > > -	if (page != exp_page)
> > > > -		return -ENXIO;
> > > >  	if (PageWriteback(page))
> > > >  		return -EAGAIN;
> > > >  	expected = expected_page_refs(page);
> > > > @@ -304,17 +295,25 @@ int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb)
> > > >  		goto out;
> > > >  
> > > >  	rc = -ENXIO;
> > > > -	page = follow_page(vma, uaddr, FOLL_WRITE);
> > > > -	if (IS_ERR_OR_NULL(page))
> > > > -		goto out;
> > > > -
> > > > -	lock_page(page);
> > > >  	ptep = get_locked_pte(gmap->mm, uaddr, &ptelock);
> > > > +
> > > > +	if (!pte_present(entry))
> > > > +		goto out_unlock_pte;
> > > > +	if (pte_val(entry) & _PAGE_INVALID)
> > > > +		goto out_unlock_pte;  
> > > 
> > > I guess we also need to make sure the page was writable?
> > > FOLL_WRITE made sure of that
> > > 
> > > so I guess something like:
> > > 
> > > if (!pte_write(entry))
> > > 	goto out_unlock_pte;  
> > 
> > Probably, that looks like an existing race that it wasn't re-checked :\
> > 
> > Jason
> 
> how do we want to proceed with this?
> 
> I can write a patch based on the above and see if anything breaks, but
> it will take some thinking and testing and reviewing before I'll be
> comfortable with it.

I would be happy if you did that

I think the code is clearly racey as written now

Jason
diff mbox series

Patch

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 518617431f8085..dbfc3a7e9559b4 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1039,9 +1039,6 @@  typedef unsigned int __bitwise zap_flags_t;
  * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
  * other. Here is what they mean, and how to use them:
  *
- * FOLL_LONGTERM indicates that the page will be held for an indefinite time
- * period _often_ under userspace control.  This is in contrast to
- * iov_iter_get_pages(), whose usages are transient.
  *
  * FIXME: For pages which are part of a filesystem, mappings are subject to the
  * lifetime enforced by the filesystem and we need guarantees that longterm
@@ -1085,24 +1082,40 @@  typedef unsigned int __bitwise zap_flags_t;
  * Please see Documentation/core-api/pin_user_pages.rst for more information.
  */
 
-#define FOLL_WRITE	0x01	/* check pte is writable */
-#define FOLL_TOUCH	0x02	/* mark page accessed */
-#define FOLL_GET	0x04	/* do get_page on page */
-#define FOLL_DUMP	0x08	/* give error on hole if it would be zero */
-#define FOLL_FORCE	0x10	/* get_user_pages read/write w/o permission */
-#define FOLL_NOWAIT	0x20	/* if a disk transfer is needed, start the IO
-				 * and return without waiting upon it */
-#define FOLL_NOFAULT	0x80	/* do not fault in pages */
-#define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
-#define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */
-#define FOLL_REMOTE	0x2000	/* we are working on non-current tsk/mm */
-#define FOLL_ANON	0x8000	/* don't do file mappings */
-#define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
-#define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
-#define FOLL_PIN	0x40000	/* pages must be released via unpin_user_page */
-#define FOLL_FAST_ONLY	0x80000	/* gup_fast: prevent fall-back to slow gup */
-#define FOLL_PCI_P2PDMA	0x100000 /* allow returning PCI P2PDMA pages */
-#define FOLL_INTERRUPTIBLE  0x200000 /* allow interrupts from generic signals */
-#define FOLL_UNLOCKABLE	0x400000 /* allow unlocking the mmap lock (internal only) */
+enum {
+	/* check pte is writable */
+	FOLL_WRITE = 1 << 0,
+	/* do get_page on page */
+	FOLL_GET = 1 << 1,
+	/* give error on hole if it would be zero */
+	FOLL_DUMP = 1 << 2,
+	/* get_user_pages read/write w/o permission */
+	FOLL_FORCE = 1 << 3,
+	/*
+	 * if a disk transfer is needed, start the IO and return without waiting
+	 * upon it
+	 */
+	FOLL_NOWAIT = 1 << 4,
+	/* do not fault in pages */
+	FOLL_NOFAULT = 1 << 5,
+	/* check page is hwpoisoned */
+	FOLL_HWPOISON = 1 << 6,
+	/* don't do file mappings */
+	FOLL_ANON = 1 << 7,
+	/*
+	 * FOLL_LONGTERM indicates that the page will be held for an indefinite
+	 * time period _often_ under userspace control.  This is in contrast to
+	 * iov_iter_get_pages(), whose usages are transient.
+	 */
+	FOLL_LONGTERM = 1 << 8,
+	/* split huge pmd before returning */
+	FOLL_SPLIT_PMD = 1 << 9,
+	/* allow returning PCI P2PDMA pages */
+	FOLL_PCI_P2PDMA = 1 << 10,
+	/* allow interrupts from generic signals */
+	FOLL_INTERRUPTIBLE = 1 << 11,
+
+	/* See also internal only FOLL flags in mm/internal.h */
+};
 
 #endif /* _LINUX_MM_TYPES_H */
diff --git a/mm/internal.h b/mm/internal.h
index 5c1310b98db64d..673b4b1d5f4c79 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -854,6 +854,21 @@  int migrate_device_coherent_page(struct page *page);
 struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags);
 int __must_check try_grab_page(struct page *page, unsigned int flags);
 
+enum {
+	/* mark page accessed */
+	FOLL_TOUCH = 1 << 16,
+	/* a retry, previous pass started an IO */
+	FOLL_TRIED = 1 << 17,
+	/* we are working on non-current tsk/mm */
+	FOLL_REMOTE = 1 << 18,
+	/* pages must be released via unpin_user_page */
+	FOLL_PIN = 1 << 19,
+	/* gup_fast: prevent fall-back to slow gup */
+	FOLL_FAST_ONLY = 1 << 20,
+	/* allow unlocking the mmap lock */
+	FOLL_UNLOCKABLE = 1 << 21,
+};
+
 /*
  * Indicates for which pages that are write-protected in the page table,
  * whether GUP has to trigger unsharing via FAULT_FLAG_UNSHARE such that the