diff mbox series

[v7,06/11] filemap: cap PTE range to be created to allowed zero fill in folio_map_range()

Message ID 20240607145902.1137853-7-kernel@pankajraghav.com (mailing list archive)
State Superseded
Headers show
Series enable bs > ps in XFS | expand

Commit Message

Pankaj Raghav (Samsung) June 7, 2024, 2:58 p.m. UTC
From: Pankaj Raghav <p.raghav@samsung.com>

Usually the page cache does not extend beyond the size of the inode,
therefore, no PTEs are created for folios that extend beyond the size.

But with LBS support, we might extend page cache beyond the size of the
inode as we need to guarantee folios of minimum order. Cap the PTE range
to be created for the page cache up to the max allowed zero-fill file
end, which is aligned to the PAGE_SIZE.

An fstests test has been created to trigger this edge case [0].

[0] https://lore.kernel.org/fstests/20240415081054.1782715-1-mcgrof@kernel.org/

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 mm/filemap.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Comments

Matthew Wilcox (Oracle) June 12, 2024, 7:08 p.m. UTC | #1
On Fri, Jun 07, 2024 at 02:58:57PM +0000, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> Usually the page cache does not extend beyond the size of the inode,
> therefore, no PTEs are created for folios that extend beyond the size.
> 
> But with LBS support, we might extend page cache beyond the size of the
> inode as we need to guarantee folios of minimum order. Cap the PTE range
> to be created for the page cache up to the max allowed zero-fill file
> end, which is aligned to the PAGE_SIZE.

I think this is slightly misleading because we might well zero-fill
to the end of the folio.  The issue is that we're supposed to SIGBUS
if userspace accesses pages which lie entirely beyond the end of this
file.  Can you rephrase this?

(from mmap(2))
       SIGBUS Attempted access to a page of the buffer that lies beyond the end
              of the mapped file.  For an explanation of the treatment  of  the
              bytes  in  the  page that corresponds to the end of a mapped file
              that is not a multiple of the page size, see NOTES.


The code is good though.

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>

> An fstests test has been created to trigger this edge case [0].
> 
> [0] https://lore.kernel.org/fstests/20240415081054.1782715-1-mcgrof@kernel.org/
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  mm/filemap.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 8bb0d2bc93c5..0e48491b3d10 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3610,7 +3610,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct file *file = vma->vm_file;
>  	struct address_space *mapping = file->f_mapping;
> -	pgoff_t last_pgoff = start_pgoff;
> +	pgoff_t file_end, last_pgoff = start_pgoff;
>  	unsigned long addr;
>  	XA_STATE(xas, &mapping->i_pages, start_pgoff);
>  	struct folio *folio;
> @@ -3636,6 +3636,10 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>  		goto out;
>  	}
>  
> +	file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
> +	if (end_pgoff > file_end)
> +		end_pgoff = file_end;
> +
>  	folio_type = mm_counter_file(folio);
>  	do {
>  		unsigned long end;
> -- 
> 2.44.1
>
Luis Chamberlain June 13, 2024, 7:57 a.m. UTC | #2
On Wed, Jun 12, 2024 at 08:08:15PM +0100, Matthew Wilcox wrote:
> On Fri, Jun 07, 2024 at 02:58:57PM +0000, Pankaj Raghav (Samsung) wrote:
> > From: Pankaj Raghav <p.raghav@samsung.com>
> > 
> > Usually the page cache does not extend beyond the size of the inode,
> > therefore, no PTEs are created for folios that extend beyond the size.
> > 
> > But with LBS support, we might extend page cache beyond the size of the
> > inode as we need to guarantee folios of minimum order. Cap the PTE range
> > to be created for the page cache up to the max allowed zero-fill file
> > end, which is aligned to the PAGE_SIZE.
> 
> I think this is slightly misleading because we might well zero-fill
> to the end of the folio.  The issue is that we're supposed to SIGBUS
> if userspace accesses pages which lie entirely beyond the end of this
> file.  Can you rephrase this?
> 
> (from mmap(2))
>        SIGBUS Attempted access to a page of the buffer that lies beyond the end
>               of the mapped file.  For an explanation of the treatment  of  the
>               bytes  in  the  page that corresponds to the end of a mapped file
>               that is not a multiple of the page size, see NOTES.
> 
> 
> The code is good though.
> 
> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Since I've been curating the respective fstests test to test for this
POSIX corner case [0] I wanted to enable the test for tmpfs instead of
skipping it as I originally had it, and that meant also realizing mmap(2)
specifically says this now:

Huge page (Huge TLB) mappings
...
       For mmap(), offset must be a multiple of the underlying huge page
       size. The system automatically aligns length to be a multiple of
       the underlying huge page size.

So do we need to adjust this patch with this:

diff --git a/mm/filemap.c b/mm/filemap.c
index ea78963f0956..9c8897ba90ff 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3617,6 +3617,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	vm_fault_t ret = 0;
 	unsigned long rss = 0;
 	unsigned int nr_pages = 0, mmap_miss = 0, mmap_miss_saved, folio_type;
+	unsigned int align = PAGE_SIZE;
 
 	rcu_read_lock();
 	folio = next_uptodate_folio(&xas, mapping, end_pgoff);
@@ -3636,7 +3637,10 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 		goto out;
 	}
 
-	file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
+	if (folio_test_pmd_mappable(folio))
+		align = 1 << folio_order(folio);
+
+	file_end = DIV_ROUND_UP(i_size_read(mapping->host), align) - 1;
 	if (end_pgoff > file_end)
 		end_pgoff = file_end;

[0] https://lore.kernel.org/all/20240611030203.1719072-3-mcgrof@kernel.org/

  Luis
David Hildenbrand June 13, 2024, 8:07 a.m. UTC | #3
On 13.06.24 09:57, Luis Chamberlain wrote:
> On Wed, Jun 12, 2024 at 08:08:15PM +0100, Matthew Wilcox wrote:
>> On Fri, Jun 07, 2024 at 02:58:57PM +0000, Pankaj Raghav (Samsung) wrote:
>>> From: Pankaj Raghav <p.raghav@samsung.com>
>>>
>>> Usually the page cache does not extend beyond the size of the inode,
>>> therefore, no PTEs are created for folios that extend beyond the size.
>>>
>>> But with LBS support, we might extend page cache beyond the size of the
>>> inode as we need to guarantee folios of minimum order. Cap the PTE range
>>> to be created for the page cache up to the max allowed zero-fill file
>>> end, which is aligned to the PAGE_SIZE.
>>
>> I think this is slightly misleading because we might well zero-fill
>> to the end of the folio.  The issue is that we're supposed to SIGBUS
>> if userspace accesses pages which lie entirely beyond the end of this
>> file.  Can you rephrase this?
>>
>> (from mmap(2))
>>         SIGBUS Attempted access to a page of the buffer that lies beyond the end
>>                of the mapped file.  For an explanation of the treatment  of  the
>>                bytes  in  the  page that corresponds to the end of a mapped file
>>                that is not a multiple of the page size, see NOTES.
>>
>>
>> The code is good though.
>>
>> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> 
> Since I've been curating the respective fstests test to test for this
> POSIX corner case [0] I wanted to enable the test for tmpfs instead of
> skipping it as I originally had it, and that meant also realizing mmap(2)
> specifically says this now:
> 
> Huge page (Huge TLB) mappings

Confusion alert: this likely talks about hugetlb (MAP_HUGETLB), not THP 
and friends.

So it might not be required for below changes.

> ...
>         For mmap(), offset must be a multiple of the underlying huge page
>         size. The system automatically aligns length to be a multiple of
>         the underlying huge page size.
> 
> So do we need to adjust this patch with this:
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index ea78963f0956..9c8897ba90ff 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3617,6 +3617,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>   	vm_fault_t ret = 0;
>   	unsigned long rss = 0;
>   	unsigned int nr_pages = 0, mmap_miss = 0, mmap_miss_saved, folio_type;
> +	unsigned int align = PAGE_SIZE;
>   
>   	rcu_read_lock();
>   	folio = next_uptodate_folio(&xas, mapping, end_pgoff);
> @@ -3636,7 +3637,10 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>   		goto out;
>   	}
>   
> -	file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
> +	if (folio_test_pmd_mappable(folio))
> +		align = 1 << folio_order(folio);
> +
> +	file_end = DIV_ROUND_UP(i_size_read(mapping->host), align) - 1;
>   	if (end_pgoff > file_end)
>   		end_pgoff = file_end;
> 
> [0] https://lore.kernel.org/all/20240611030203.1719072-3-mcgrof@kernel.org/
> 
>    Luis
>
Luis Chamberlain June 13, 2024, 8:13 a.m. UTC | #4
On Thu, Jun 13, 2024 at 10:07:15AM +0200, David Hildenbrand wrote:
> On 13.06.24 09:57, Luis Chamberlain wrote:
> > On Wed, Jun 12, 2024 at 08:08:15PM +0100, Matthew Wilcox wrote:
> > > On Fri, Jun 07, 2024 at 02:58:57PM +0000, Pankaj Raghav (Samsung) wrote:
> > > > From: Pankaj Raghav <p.raghav@samsung.com>
> > > > 
> > > > Usually the page cache does not extend beyond the size of the inode,
> > > > therefore, no PTEs are created for folios that extend beyond the size.
> > > > 
> > > > But with LBS support, we might extend page cache beyond the size of the
> > > > inode as we need to guarantee folios of minimum order. Cap the PTE range
> > > > to be created for the page cache up to the max allowed zero-fill file
> > > > end, which is aligned to the PAGE_SIZE.
> > > 
> > > I think this is slightly misleading because we might well zero-fill
> > > to the end of the folio.  The issue is that we're supposed to SIGBUS
> > > if userspace accesses pages which lie entirely beyond the end of this
> > > file.  Can you rephrase this?
> > > 
> > > (from mmap(2))
> > >         SIGBUS Attempted access to a page of the buffer that lies beyond the end
> > >                of the mapped file.  For an explanation of the treatment  of  the
> > >                bytes  in  the  page that corresponds to the end of a mapped file
> > >                that is not a multiple of the page size, see NOTES.
> > > 
> > > 
> > > The code is good though.
> > > 
> > > Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > 
> > Since I've been curating the respective fstests test to test for this
> > POSIX corner case [0] I wanted to enable the test for tmpfs instead of
> > skipping it as I originally had it, and that meant also realizing mmap(2)
> > specifically says this now:
> > 
> > Huge page (Huge TLB) mappings
> 
> Confusion alert: this likely talks about hugetlb (MAP_HUGETLB), not THP and
> friends.
> 
> So it might not be required for below changes.

Thanks, I had to ask as we're dusting off this little obscure corner of
the universe. Reason I ask, is the test fails for tmpfs with huge pages,
and this patch fixes it, but it got me wondering the above applies also
to tmpfs with huge pages.

  Luis
David Hildenbrand June 13, 2024, 8:16 a.m. UTC | #5
On 13.06.24 10:13, Luis Chamberlain wrote:
> On Thu, Jun 13, 2024 at 10:07:15AM +0200, David Hildenbrand wrote:
>> On 13.06.24 09:57, Luis Chamberlain wrote:
>>> On Wed, Jun 12, 2024 at 08:08:15PM +0100, Matthew Wilcox wrote:
>>>> On Fri, Jun 07, 2024 at 02:58:57PM +0000, Pankaj Raghav (Samsung) wrote:
>>>>> From: Pankaj Raghav <p.raghav@samsung.com>
>>>>>
>>>>> Usually the page cache does not extend beyond the size of the inode,
>>>>> therefore, no PTEs are created for folios that extend beyond the size.
>>>>>
>>>>> But with LBS support, we might extend page cache beyond the size of the
>>>>> inode as we need to guarantee folios of minimum order. Cap the PTE range
>>>>> to be created for the page cache up to the max allowed zero-fill file
>>>>> end, which is aligned to the PAGE_SIZE.
>>>>
>>>> I think this is slightly misleading because we might well zero-fill
>>>> to the end of the folio.  The issue is that we're supposed to SIGBUS
>>>> if userspace accesses pages which lie entirely beyond the end of this
>>>> file.  Can you rephrase this?
>>>>
>>>> (from mmap(2))
>>>>          SIGBUS Attempted access to a page of the buffer that lies beyond the end
>>>>                 of the mapped file.  For an explanation of the treatment  of  the
>>>>                 bytes  in  the  page that corresponds to the end of a mapped file
>>>>                 that is not a multiple of the page size, see NOTES.
>>>>
>>>>
>>>> The code is good though.
>>>>
>>>> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>>>
>>> Since I've been curating the respective fstests test to test for this
>>> POSIX corner case [0] I wanted to enable the test for tmpfs instead of
>>> skipping it as I originally had it, and that meant also realizing mmap(2)
>>> specifically says this now:
>>>
>>> Huge page (Huge TLB) mappings
>>
>> Confusion alert: this likely talks about hugetlb (MAP_HUGETLB), not THP and
>> friends.
>>
>> So it might not be required for below changes.
> 
> Thanks, I had to ask as we're dusting off this little obscure corner of
> the universe. Reason I ask, is the test fails for tmpfs with huge pages,
> and this patch fixes it, but it got me wondering the above applies also
> to tmpfs with huge pages.

Is it tmpfs with THP/large folios or shmem with hugetlb? I assume the 
tmpfs with THP. There are not really mmap/munmap restrictions to THP and 
friends (because it's supposed to be "transparent" :) ).
Luis Chamberlain June 13, 2024, 3:27 p.m. UTC | #6
On Thu, Jun 13, 2024 at 10:16:10AM +0200, David Hildenbrand wrote:
> On 13.06.24 10:13, Luis Chamberlain wrote:
> > On Thu, Jun 13, 2024 at 10:07:15AM +0200, David Hildenbrand wrote:
> > > On 13.06.24 09:57, Luis Chamberlain wrote:
> > > > On Wed, Jun 12, 2024 at 08:08:15PM +0100, Matthew Wilcox wrote:
> > > > > On Fri, Jun 07, 2024 at 02:58:57PM +0000, Pankaj Raghav (Samsung) wrote:
> > > > > > From: Pankaj Raghav <p.raghav@samsung.com>
> > > > > > 
> > > > > > Usually the page cache does not extend beyond the size of the inode,
> > > > > > therefore, no PTEs are created for folios that extend beyond the size.
> > > > > > 
> > > > > > But with LBS support, we might extend page cache beyond the size of the
> > > > > > inode as we need to guarantee folios of minimum order. Cap the PTE range
> > > > > > to be created for the page cache up to the max allowed zero-fill file
> > > > > > end, which is aligned to the PAGE_SIZE.
> > > > > 
> > > > > I think this is slightly misleading because we might well zero-fill
> > > > > to the end of the folio.  The issue is that we're supposed to SIGBUS
> > > > > if userspace accesses pages which lie entirely beyond the end of this
> > > > > file.  Can you rephrase this?
> > > > > 
> > > > > (from mmap(2))
> > > > >          SIGBUS Attempted access to a page of the buffer that lies beyond the end
> > > > >                 of the mapped file.  For an explanation of the treatment  of  the
> > > > >                 bytes  in  the  page that corresponds to the end of a mapped file
> > > > >                 that is not a multiple of the page size, see NOTES.
> > > > > 
> > > > > 
> > > > > The code is good though.
> > > > > 
> > > > > Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > > > 
> > > > Since I've been curating the respective fstests test to test for this
> > > > POSIX corner case [0] I wanted to enable the test for tmpfs instead of
> > > > skipping it as I originally had it, and that meant also realizing mmap(2)
> > > > specifically says this now:
> > > > 
> > > > Huge page (Huge TLB) mappings
> > > 
> > > Confusion alert: this likely talks about hugetlb (MAP_HUGETLB), not THP and
> > > friends.
> > > 
> > > So it might not be required for below changes.
> > 
> > Thanks, I had to ask as we're dusting off this little obscure corner of
> > the universe. Reason I ask, is the test fails for tmpfs with huge pages,
> > and this patch fixes it, but it got me wondering the above applies also
> > to tmpfs with huge pages.
> 
> Is it tmpfs with THP/large folios or shmem with hugetlb? I assume the tmpfs
> with THP. There are not really mmap/munmap restrictions to THP and friends
> (because it's supposed to be "transparent" :) ).

The case I tested that failed the test was tmpfs with huge pages (not
large folios). So should we then have this:

diff --git a/mm/filemap.c b/mm/filemap.c
index ea78963f0956..649beb9bbc6b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3617,6 +3617,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	vm_fault_t ret = 0;
 	unsigned long rss = 0;
 	unsigned int nr_pages = 0, mmap_miss = 0, mmap_miss_saved, folio_type;
+	unsigned int align = PAGE_SIZE;
 
 	rcu_read_lock();
 	folio = next_uptodate_folio(&xas, mapping, end_pgoff);
@@ -3636,7 +3637,16 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 		goto out;
 	}
 
-	file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
+	/*
+	 * As per the mmap(2) mmap(), the offset must be a multiple of the
+	 * underlying huge page size. The system automatically aligns length to
+	 * be a multiple of the underlying huge page size.
+	 */
+	if (folio_test_pmd_mappable(folio) &&
+	    (shmem_mapping(mapping) || folio_test_hugetlb(folio)))
+		align = 1 << folio_order(folio);
+
+	file_end = DIV_ROUND_UP(i_size_read(mapping->host), align) - 1;
 	if (end_pgoff > file_end)
 		end_pgoff = file_end;
Matthew Wilcox (Oracle) June 13, 2024, 3:32 p.m. UTC | #7
On Thu, Jun 13, 2024 at 08:27:27AM -0700, Luis Chamberlain wrote:
> The case I tested that failed the test was tmpfs with huge pages (not
> large folios). So should we then have this:

No.
Luis Chamberlain June 13, 2024, 3:38 p.m. UTC | #8
On Thu, Jun 13, 2024 at 04:32:27PM +0100, Matthew Wilcox wrote:
> On Thu, Jun 13, 2024 at 08:27:27AM -0700, Luis Chamberlain wrote:
> > The case I tested that failed the test was tmpfs with huge pages (not
> > large folios). So should we then have this:
> 
> No.

OK so this does have a change for tmpfs with huge pages enabled, do we
take the position then this is a fix for that?

  Luis
Matthew Wilcox (Oracle) June 13, 2024, 3:40 p.m. UTC | #9
On Thu, Jun 13, 2024 at 08:38:15AM -0700, Luis Chamberlain wrote:
> On Thu, Jun 13, 2024 at 04:32:27PM +0100, Matthew Wilcox wrote:
> > On Thu, Jun 13, 2024 at 08:27:27AM -0700, Luis Chamberlain wrote:
> > > The case I tested that failed the test was tmpfs with huge pages (not
> > > large folios). So should we then have this:
> > 
> > No.
> 
> OK so this does have a change for tmpfs with huge pages enabled, do we
> take the position then this is a fix for that?

You literally said it was a fix just a few messages up thread?

Besides, the behaviour changes (currently) depending on whether
you specify "within_size" or "always".  This patch makes it consistent.
Luis Chamberlain June 13, 2024, 7:39 p.m. UTC | #10
On Thu, Jun 13, 2024 at 04:40:28PM +0100, Matthew Wilcox wrote:
> On Thu, Jun 13, 2024 at 08:38:15AM -0700, Luis Chamberlain wrote:
> > On Thu, Jun 13, 2024 at 04:32:27PM +0100, Matthew Wilcox wrote:
> > > On Thu, Jun 13, 2024 at 08:27:27AM -0700, Luis Chamberlain wrote:
> > > > The case I tested that failed the test was tmpfs with huge pages (not
> > > > large folios). So should we then have this:
> > > 
> > > No.
> > 
> > OK so this does have a change for tmpfs with huge pages enabled, do we
> > take the position then this is a fix for that?
> 
> You literally said it was a fix just a few messages up thread?
> 
> Besides, the behaviour changes (currently) depending on whether
> you specify "within_size" or "always".  This patch makes it consistent.

The quoted mmap(2) text made me doubt it, and I was looking for
clarification. It seems clear now based on feedback the text does
not apply to tmpfs with huge pages, and so we'll just annotate it
as a fix for tmpfs with huge pages.

It makes sense to not apply, I mean, why *would* you assume you will
have an extended range zeroed out range to muck around with beyond
PAGE_SIZE just because huge pages were used when the rest of all other
filesystem APIs count on the mmap(2) PAGE_SIZE boundary.

Thanks!

  Luis
diff mbox series

Patch

diff --git a/mm/filemap.c b/mm/filemap.c
index 8bb0d2bc93c5..0e48491b3d10 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3610,7 +3610,7 @@  vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	struct vm_area_struct *vma = vmf->vma;
 	struct file *file = vma->vm_file;
 	struct address_space *mapping = file->f_mapping;
-	pgoff_t last_pgoff = start_pgoff;
+	pgoff_t file_end, last_pgoff = start_pgoff;
 	unsigned long addr;
 	XA_STATE(xas, &mapping->i_pages, start_pgoff);
 	struct folio *folio;
@@ -3636,6 +3636,10 @@  vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 		goto out;
 	}
 
+	file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
+	if (end_pgoff > file_end)
+		end_pgoff = file_end;
+
 	folio_type = mm_counter_file(folio);
 	do {
 		unsigned long end;