[v3,05/11] readahead: allocate folios with mapping_min_order in readahead

Message ID 20240313170253.2324812-6-kernel@pankajraghav.com (mailing list archive)
State: New
Series: enable bs > ps in XFS

Commit Message

Pankaj Raghav (Samsung) March 13, 2024, 5:02 p.m. UTC
From: Pankaj Raghav <p.raghav@samsung.com>

page_cache_ra_unbounded() was allocating single pages (order-0 folios)
if there was no folio found at an index. Allocate mapping_min_order folios
instead, as we need to guarantee the minimum order if it is set.
When read_pages() is triggered and a folio is already present, check
for truncation and advance ractl->_index by mapping_min_nrpages if that
folio was truncated. This ensures we keep the alignment requirement
while adding a folio to the page cache.

page_cache_ra_order() tries to allocate folios to the page cache with a
higher order if the index aligns with that order. Modify it so that the
order does not go below the min_order requirement of the page cache.

When adding new folios to the page cache, we must also ensure that the
index used is aligned to the mapping_min_order, as the page cache requires
the index to be aligned to the order of the folio.

readahead_expand() is called from readahead aops to extend the range of
the readahead, so this function can assume that ractl->_index is aligned
to min_order.
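
For illustration, a minimal sketch of the alignment rule relied on above
(mapping_align_start_index(), used in the diff below, is assumed to behave
like this round_down(); the helper name ra_align_index() is made up for
the example):

/*
 * Sketch only: a folio of the mapping's minimum order must be added at
 * an index that is a multiple of its page count, so readahead start
 * indices are rounded down accordingly.
 */
static pgoff_t ra_align_index(struct address_space *mapping, pgoff_t index)
{
	unsigned long min_nrpages = mapping_min_folio_nrpages(mapping);

	return round_down(index, min_nrpages);
}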

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 mm/readahead.c | 86 ++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 72 insertions(+), 14 deletions(-)

Comments

Matthew Wilcox March 25, 2024, 7 p.m. UTC | #1
On Wed, Mar 13, 2024 at 06:02:47PM +0100, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> page_cache_ra_unbounded() was allocating single pages (0 order folios)
> if there was no folio found in an index. Allocate mapping_min_order folios
> as we need to guarantee the minimum order if it is set.
> When read_pages() is triggered and if a page is already present, check
> for truncation and move the ractl->_index by mapping_min_nrpages if that
> folio was truncated. This is done to ensure we keep the alignment
> requirement while adding a folio to the page cache.
> 
> page_cache_ra_order() tries to allocate folio to the page cache with a
> higher order if the index aligns with that order. Modify it so that the
> order does not go below the min_order requirement of the page cache.

This paragraph doesn't make sense.  We have an assertion that there's no
folio in the page cache with a lower order than the minimum, so this
seems to be describing a situation that can't happen.  Does it need to
be rephrased (because you're actually describing something else) or is
it just stale?

> @@ -239,23 +258,35 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>  			 * not worth getting one just for that.
>  			 */
>  			read_pages(ractl);
> -			ractl->_index += folio_nr_pages(folio);
> +
> +			/*
> +			 * Move the ractl->_index by at least min_pages
> +			 * if the folio got truncated to respect the
> +			 * alignment constraint in the page cache.
> +			 *
> +			 */
> +			if (mapping != folio->mapping)
> +				nr_pages = min_nrpages;
> +
> +			VM_BUG_ON_FOLIO(nr_pages < min_nrpages, folio);
> +			ractl->_index += nr_pages;
>  			i = ractl->_index + ractl->_nr_pages - index;
>  			continue;
>  		}
>  
> -		folio = filemap_alloc_folio(gfp_mask, 0);
> +		folio = filemap_alloc_folio(gfp_mask,
> +					    mapping_min_folio_order(mapping));
>  		if (!folio)
>  			break;
>  		if (filemap_add_folio(mapping, folio, index + i,
>  					gfp_mask) < 0) {
>  			folio_put(folio);
>  			read_pages(ractl);
> -			ractl->_index++;
> +			ractl->_index += min_nrpages;

Hah, you changed this here.  Please move into previous patch.

>  			i = ractl->_index + ractl->_nr_pages - index;
>  			continue;
>  		}
> -		if (i == nr_to_read - lookahead_size)
> +		if (i == mark)
>  			folio_set_readahead(folio);
>  		ractl->_workingset |= folio_test_workingset(folio);
>  		ractl->_nr_pages += folio_nr_pages(folio);
> @@ -489,12 +520,18 @@ void page_cache_ra_order(struct readahead_control *ractl,
>  {
>  	struct address_space *mapping = ractl->mapping;
>  	pgoff_t index = readahead_index(ractl);
> +	unsigned int min_order = mapping_min_folio_order(mapping);
>  	pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
>  	pgoff_t mark = index + ra->size - ra->async_size;
>  	int err = 0;
>  	gfp_t gfp = readahead_gfp_mask(mapping);
> +	unsigned int min_ra_size = max(4, mapping_min_folio_nrpages(mapping));
>  
> -	if (!mapping_large_folio_support(mapping) || ra->size < 4)
> +	/*
> +	 * Fallback when size < min_nrpages as each folio should be
> +	 * at least min_nrpages anyway.
> +	 */
> +	if (!mapping_large_folio_support(mapping) || ra->size < min_ra_size)
>  		goto fallback;
>  
>  	limit = min(limit, index + ra->size - 1);
> @@ -505,9 +542,19 @@ void page_cache_ra_order(struct readahead_control *ractl,
>  			new_order = MAX_PAGECACHE_ORDER;
>  		while ((1 << new_order) > ra->size)
>  			new_order--;
> +		if (new_order < min_order)
> +			new_order = min_order;

I think these are the two lines you're describing in the paragraph that
doesn't make sense to me?  And if so, I think they're useless?

> @@ -515,7 +562,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
>  		if (index & ((1UL << order) - 1))
>  			order = __ffs(index);
>  		/* Don't allocate pages past EOF */
> -		while (index + (1UL << order) - 1 > limit)
> +		while (order > min_order && index + (1UL << order) - 1 > limit)
>  			order--;

This raises an interesting question that I don't know if we have a test
for.  POSIX says that if we mmap, let's say, the first 16kB of a 10kB
file, then we can store into offset 0-12287, but stores to offsets
12288-16383 get a signal (I forget if it's SEGV or BUS).  Thus far,
we've declined to even create folios in the page cache that would let us
create PTEs for offset 12288-16383, so I haven't paid too much attention
to this.  Now we're going to have folios that extend into that range, so
we need to be sure that when we mmap(), we only create PTEs that go as
far as 12287.

Can you check that we have such an fstest, and that we still pass it
with your patches applied and a suitably large block size?
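
Something like this minimal userspace sketch (not an fstest, error
handling omitted, and assuming a 4KiB base page size) is the behaviour
in question:

/* Sketch only: map 16KiB of a 10KiB file; the store at offset 13000 lies
 * beyond the page containing EOF and should raise a signal (SIGBUS or
 * SEGV as discussed above). */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
	char *p;

	ftruncate(fd, 10 * 1024);			/* 10KiB file */
	p = mmap(NULL, 16 * 1024, PROT_READ | PROT_WRITE,
		 MAP_SHARED, fd, 0);

	p[0] = 1;		/* offset 0: fine */
	p[12287] = 1;		/* offset 12287: fine, same page as EOF */
	p[13000] = 1;		/* offset 13000: should signal */

	munmap(p, 16 * 1024);
	close(fd);
	return 0;
}
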
Pankaj Raghav (Samsung) March 26, 2024, 1:08 p.m. UTC | #2
> > 
> > page_cache_ra_unbounded() was allocating single pages (0 order folios)
> > if there was no folio found in an index. Allocate mapping_min_order folios
> > as we need to guarantee the minimum order if it is set.
> > When read_pages() is triggered and if a page is already present, check
> > for truncation and move the ractl->_index by mapping_min_nrpages if that
> > folio was truncated. This is done to ensure we keep the alignment
> > requirement while adding a folio to the page cache.
> > 
> > page_cache_ra_order() tries to allocate folio to the page cache with a
> > higher order if the index aligns with that order. Modify it so that the
> > order does not go below the min_order requirement of the page cache.
> 
> This paragraph doesn't make sense.  We have an assertion that there's no
> folio in the page cache with a lower order than the minimum, so this
> seems to be describing a situation that can't happen.  Does it need to
> be rephrased (because you're actually describing something else) or is
> it just stale?
> 
Hmm, maybe I need to explain better here.

> > @@ -489,12 +520,18 @@ void page_cache_ra_order(struct readahead_control *ractl,
> >  {
> >  	struct address_space *mapping = ractl->mapping;
> >  	pgoff_t index = readahead_index(ractl);
> > +	unsigned int min_order = mapping_min_folio_order(mapping);
> >  	pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
> >  	pgoff_t mark = index + ra->size - ra->async_size;
> >  	int err = 0;
> >  	gfp_t gfp = readahead_gfp_mask(mapping);
> > +	unsigned int min_ra_size = max(4, mapping_min_folio_nrpages(mapping));
> >  
> > -	if (!mapping_large_folio_support(mapping) || ra->size < 4)

I had one question: `4` was used here because we could not support
order-1 folios, I assume? Now that we support order-1 folios, can this
just fall back when ra->size < min_nrpages?

> > +	/*
> > +	 * Fallback when size < min_nrpages as each folio should be
> > +	 * at least min_nrpages anyway.
> > +	 */
> > +	if (!mapping_large_folio_support(mapping) || ra->size < min_ra_size)
> >  		goto fallback;
> >  
> >  	limit = min(limit, index + ra->size - 1);
> > @@ -505,9 +542,19 @@ void page_cache_ra_order(struct readahead_control *ractl,
> >  			new_order = MAX_PAGECACHE_ORDER;
> >  		while ((1 << new_order) > ra->size)
> >  			new_order--;
> > +		if (new_order < min_order)
> > +			new_order = min_order;
> 
> I think these are the two lines you're describing in the paragraph that
Yes. At least I tried to ;)

> doesn't make sense to me?  And if so, I think they're useless?

We need this because the caller might pass new_order = 0 (such as during
mmap read-around), but page_cache_ra_order() will still do the "right"
thing by taking care of the page cache alignment requirements. The
previous revision did this the other way around, and it felt a bit all
over the place.
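
As a concrete example: with bs=64k on a 4k page size system, min_order
is 4. mmap read-around may call in with new_order = 0; the clamp raises
it back to 4 and mapping_align_start_index() rounds ractl->_index down
to a multiple of 16 pages, so the folios we add still satisfy the page
cache alignment.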

One of the main changes I made for this revision is to adapt the
functions that allocate and add a folio to respect the min order
alignment themselves, rather than doing it on the caller side.

The rationale for this approach is threefold:
- the callers of page_cache_ra_order() and page_cache_ra_unbounded()
  request a range for which we need to add folios and read them. They
  don't care about the folio size and alignment.

- file_ra_state also does not need any poking around because we will
  make sure to cover the range [ra->start, ra->start + ra->size) and
  update ra->size if we spilled over due to the min_order requirement
  (like it was done before).

- And this is also consistent with what we do in filemap, where we
  adjust the index before adding the folio to the page cache.

Let me know if this approach makes sense.
> 
> > @@ -515,7 +562,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
> >  		if (index & ((1UL << order) - 1))
> >  			order = __ffs(index);
> >  		/* Don't allocate pages past EOF */
> > -		while (index + (1UL << order) - 1 > limit)
> > +		while (order > min_order && index + (1UL << order) - 1 > limit)
> >  			order--;
> 
> This raises an interesting question that I don't know if we have a test
> for.  POSIX says that if we mmap, let's say, the first 16kB of a 10kB
> file, then we can store into offset 0-12287, but stores to offsets
> 12288-16383 get a signal (I forget if it's SEGV or BUS).  Thus far,
> we've declined to even create folios in the page cache that would let us
> create PTEs for offset 12288-16383, so I haven't paid too much attention
> to this.  Now we're going to have folios that extend into that range, so
> we need to be sure that when we mmap(), we only create PTEs that go as
> far as 12287.
> 
> Can you check that we have such an fstest, and that we still pass it
> with your patches applied and a suitably large block size?

Hmmm, good point.
I think you mean this from the man page:

SIGBUS Attempted access to a page of the buffer that lies beyond the end
of the mapped file.  For an explanation of the treatment of the bytes in
the page that corresponds to the end of a mapped file that is not a 
multiple of the page size, see NOTES.

I will add a test case for this and see what happens. I am not sure if I
need to make some changes or if the current implementation already holds.
Pankaj Raghav (Samsung) April 22, 2024, 11:03 a.m. UTC | #3
> > @@ -515,7 +562,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
> >  		if (index & ((1UL << order) - 1))
> >  			order = __ffs(index);
> >  		/* Don't allocate pages past EOF */
> > -		while (index + (1UL << order) - 1 > limit)
> > +		while (order > min_order && index + (1UL << order) - 1 > limit)
> >  			order--;
> 
> This raises an interesting question that I don't know if we have a test
> for.  POSIX says that if we mmap, let's say, the first 16kB of a 10kB
> file, then we can store into offset 0-12287, but stores to offsets
> 12288-16383 get a signal (I forget if it's SEGV or BUS).  Thus far,
> we've declined to even create folios in the page cache that would let us
> create PTEs for offset 12288-16383, so I haven't paid too much attention
> to this.  Now we're going to have folios that extend into that range, so
> we need to be sure that when we mmap(), we only create PTEs that go as
> far as 12287.
> 
> Can you check that we have such an fstest, and that we still pass it
> with your patches applied and a suitably large block size?
> 

So mmap gives the correct SIGBUS error when we try to do this:
dd if=/dev/zero of=./test bs=10k count=1; 
xfs_io -c "mmap -w 0 16384" -c "mwrite 13000 10" test

Logs on bs=64k ps=4k system:
root@debian:/media/test# dd if=/dev/zero of=./test bs=10k count=1;
root@debian:/media/test# du -sh test 
64K     test
root@debian:/media/test# ls -l --block-size=k test 
-rw-r--r-- 1 root root 10K Apr 22 10:42 test
root@debian:/media/test# xfs_io -c "mmap  0 16384" -c "mwrite 13000 10" test
Bus error

The check in filemap_fault takes care of this:

max_idx = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
if (unlikely(index >= max_idx))
        return VM_FAULT_SIGBUS;
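
(For the 10k file above, max_idx = DIV_ROUND_UP(10240, 4096) = 3 and
offset 13000 lands in page index 3, so 3 >= 3 and the fault returns
VM_FAULT_SIGBUS.)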

The same operation for a read should also give a bus error, but it
didn't. Further investigation showed that fault_around() does not take
this condition into account for an LBS configuration. When I set
fault_around_bytes to 4096, things worked as expected, as we then skip
fault_around for reads.

I have a patch that returns SIGBUS for the following read operation as well:
dd if=/dev/zero of=./test bs=10k count=1; 
xfs_io -c "mmap -r 0 16384" -c "mread 13000 10" test

This is the patch I have for now that fixes the fault_around() logic for
LBS configurations:

diff --git a/mm/filemap.c b/mm/filemap.c
index f0c0cfbbd134..259531dd297b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3600,12 +3600,15 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
        }
        do {
                unsigned long end;
+               unsigned long i_size;
 
                addr += (xas.xa_index - last_pgoff) << PAGE_SHIFT;
                vmf->pte += xas.xa_index - last_pgoff;
                last_pgoff = xas.xa_index;
                end = folio_next_index(folio) - 1;
-               nr_pages = min(end, end_pgoff) - xas.xa_index + 1;
+               i_size = DIV_ROUND_UP(i_size_read(mapping->host),
+                                     PAGE_SIZE) - 1;
+               nr_pages = min3(end, end_pgoff, i_size) - xas.xa_index + 1;
 
                if (!folio_test_large(folio))
                        ret |= filemap_map_order0_folio(vmf,
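
With the same 10k file, i_size above evaluates to
DIV_ROUND_UP(10240, 4096) - 1 = 2, so even though the 64k folio spans
page indices 0-15, fault-around now stops installing PTEs at index 2
(byte offset 12287), matching the filemap_fault() behaviour.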

I will send a new version of the series this week after doing some more
testing.

Patch

diff --git a/mm/readahead.c b/mm/readahead.c
index 37b938f4b54f..650834c033f0 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -206,9 +206,10 @@  void page_cache_ra_unbounded(struct readahead_control *ractl,
 		unsigned long nr_to_read, unsigned long lookahead_size)
 {
 	struct address_space *mapping = ractl->mapping;
-	unsigned long index = readahead_index(ractl);
+	unsigned long index = readahead_index(ractl), ra_folio_index;
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
-	unsigned long i = 0;
+	unsigned long i = 0, mark;
+	unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
 
 	/*
 	 * Partway through the readahead operation, we will have added
@@ -223,6 +224,22 @@  void page_cache_ra_unbounded(struct readahead_control *ractl,
 	unsigned int nofs = memalloc_nofs_save();
 
 	filemap_invalidate_lock_shared(mapping);
+	index = mapping_align_start_index(mapping, index);
+
+	/*
+	 * As iterator `i` is aligned to min_nrpages, round_up the
+	 * difference between nr_to_read and lookahead_size to mark the
+	 * index that only has lookahead or "async_region" to set the
+	 * readahead flag.
+	 */
+	ra_folio_index = round_up(readahead_index(ractl) + nr_to_read - lookahead_size,
+				  min_nrpages);
+	mark = ra_folio_index - index;
+	if (index != readahead_index(ractl)) {
+		nr_to_read += readahead_index(ractl) - index;
+		ractl->_index = index;
+	}
+
 	/*
 	 * Preallocate as many pages as we will need.
 	 */
@@ -230,6 +247,8 @@  void page_cache_ra_unbounded(struct readahead_control *ractl,
 		struct folio *folio = xa_load(&mapping->i_pages, index + i);
 
 		if (folio && !xa_is_value(folio)) {
+			long nr_pages = folio_nr_pages(folio);
+
 			/*
 			 * Page already present?  Kick off the current batch
 			 * of contiguous pages before continuing with the
@@ -239,23 +258,35 @@  void page_cache_ra_unbounded(struct readahead_control *ractl,
 			 * not worth getting one just for that.
 			 */
 			read_pages(ractl);
-			ractl->_index += folio_nr_pages(folio);
+
+			/*
+			 * Move the ractl->_index by at least min_pages
+			 * if the folio got truncated to respect the
+			 * alignment constraint in the page cache.
+			 *
+			 */
+			if (mapping != folio->mapping)
+				nr_pages = min_nrpages;
+
+			VM_BUG_ON_FOLIO(nr_pages < min_nrpages, folio);
+			ractl->_index += nr_pages;
 			i = ractl->_index + ractl->_nr_pages - index;
 			continue;
 		}
 
-		folio = filemap_alloc_folio(gfp_mask, 0);
+		folio = filemap_alloc_folio(gfp_mask,
+					    mapping_min_folio_order(mapping));
 		if (!folio)
 			break;
 		if (filemap_add_folio(mapping, folio, index + i,
 					gfp_mask) < 0) {
 			folio_put(folio);
 			read_pages(ractl);
-			ractl->_index++;
+			ractl->_index += min_nrpages;
 			i = ractl->_index + ractl->_nr_pages - index;
 			continue;
 		}
-		if (i == nr_to_read - lookahead_size)
+		if (i == mark)
 			folio_set_readahead(folio);
 		ractl->_workingset |= folio_test_workingset(folio);
 		ractl->_nr_pages += folio_nr_pages(folio);
@@ -489,12 +520,18 @@  void page_cache_ra_order(struct readahead_control *ractl,
 {
 	struct address_space *mapping = ractl->mapping;
 	pgoff_t index = readahead_index(ractl);
+	unsigned int min_order = mapping_min_folio_order(mapping);
 	pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
 	pgoff_t mark = index + ra->size - ra->async_size;
 	int err = 0;
 	gfp_t gfp = readahead_gfp_mask(mapping);
+	unsigned int min_ra_size = max(4, mapping_min_folio_nrpages(mapping));
 
-	if (!mapping_large_folio_support(mapping) || ra->size < 4)
+	/*
+	 * Fallback when size < min_nrpages as each folio should be
+	 * at least min_nrpages anyway.
+	 */
+	if (!mapping_large_folio_support(mapping) || ra->size < min_ra_size)
 		goto fallback;
 
 	limit = min(limit, index + ra->size - 1);
@@ -505,9 +542,19 @@  void page_cache_ra_order(struct readahead_control *ractl,
 			new_order = MAX_PAGECACHE_ORDER;
 		while ((1 << new_order) > ra->size)
 			new_order--;
+		if (new_order < min_order)
+			new_order = min_order;
 	}
 
 	filemap_invalidate_lock_shared(mapping);
+	/*
+	 * If the new_order is greater than min_order and index is
+	 * already aligned to new_order, then this will be noop as index
+	 * aligned to new_order should also be aligned to min_order.
+	 */
+	ractl->_index = mapping_align_start_index(mapping, index);
+	index = readahead_index(ractl);
+
 	while (index <= limit) {
 		unsigned int order = new_order;
 
@@ -515,7 +562,7 @@  void page_cache_ra_order(struct readahead_control *ractl,
 		if (index & ((1UL << order) - 1))
 			order = __ffs(index);
 		/* Don't allocate pages past EOF */
-		while (index + (1UL << order) - 1 > limit)
+		while (order > min_order && index + (1UL << order) - 1 > limit)
 			order--;
 		err = ra_alloc_folio(ractl, index, mark, order, gfp);
 		if (err)
@@ -778,8 +825,15 @@  void readahead_expand(struct readahead_control *ractl,
 	struct file_ra_state *ra = ractl->ra;
 	pgoff_t new_index, new_nr_pages;
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
+	unsigned long min_nrpages = mapping_min_folio_nrpages(mapping);
+	unsigned int min_order = mapping_min_folio_order(mapping);
 
 	new_index = new_start / PAGE_SIZE;
+	/*
+	 * Readahead code should have aligned the ractl->_index to
+	 * min_nrpages before calling readahead aops.
+	 */
+	VM_BUG_ON(!IS_ALIGNED(ractl->_index, min_nrpages));
 
 	/* Expand the leading edge downwards */
 	while (ractl->_index > new_index) {
@@ -789,9 +843,11 @@  void readahead_expand(struct readahead_control *ractl,
 		if (folio && !xa_is_value(folio))
 			return; /* Folio apparently present */
 
-		folio = filemap_alloc_folio(gfp_mask, 0);
+		folio = filemap_alloc_folio(gfp_mask, min_order);
 		if (!folio)
 			return;
+
+		index = mapping_align_start_index(mapping, index);
 		if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
 			folio_put(folio);
 			return;
@@ -801,7 +857,7 @@  void readahead_expand(struct readahead_control *ractl,
 			ractl->_workingset = true;
 			psi_memstall_enter(&ractl->_pflags);
 		}
-		ractl->_nr_pages++;
+		ractl->_nr_pages += min_nrpages;
 		ractl->_index = folio->index;
 	}
 
@@ -816,9 +872,11 @@  void readahead_expand(struct readahead_control *ractl,
 		if (folio && !xa_is_value(folio))
 			return; /* Folio apparently present */
 
-		folio = filemap_alloc_folio(gfp_mask, 0);
+		folio = filemap_alloc_folio(gfp_mask, min_order);
 		if (!folio)
 			return;
+
+		index = mapping_align_start_index(mapping, index);
 		if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
 			folio_put(folio);
 			return;
@@ -828,10 +886,10 @@  void readahead_expand(struct readahead_control *ractl,
 			ractl->_workingset = true;
 			psi_memstall_enter(&ractl->_pflags);
 		}
-		ractl->_nr_pages++;
+		ractl->_nr_pages += min_nrpages;
 		if (ra) {
-			ra->size++;
-			ra->async_size++;
+			ra->size += min_nrpages;
+			ra->async_size += min_nrpages;
 		}
 	}
 }