
[6/7] mm/filemap: allocate folios with mapping blocksize

Message ID: 20230614114637.89759-7-hare@suse.de (mailing list archive)
State: New, archived
Series: RFC: high-order folio support for I/O

Commit Message

Hannes Reinecke June 14, 2023, 11:46 a.m. UTC
The mapping has an underlying blocksize (by virtue of
mapping->host->i_blkbits), so if the mapping blocksize
is larger than the pagesize we should allocate folios
in the correct order.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 include/linux/pagemap.h | 7 +++++++
 mm/filemap.c            | 7 ++++---
 mm/readahead.c          | 6 +++---
 3 files changed, 14 insertions(+), 6 deletions(-)
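
For concreteness, this is the order computation the patch boils down to
(a sketch of the mapping_get_order() helper added below; assumes a
4k-page kernel, i.e. PAGE_SHIFT == 12):

	/*
	 * Example: a 16k blocksize gives i_blkbits == 14, so folios are
	 * allocated at order 14 - 12 = 2, i.e. four pages per folio.
	 * For blocksize <= pagesize the order stays 0, as today.
	 */
	unsigned int order = 0;

	if (mapping->host->i_blkbits > PAGE_SHIFT)
		order = mapping->host->i_blkbits - PAGE_SHIFT;
	folio = filemap_alloc_folio(gfp, order);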

Comments

Pankaj Raghav June 19, 2023, 8:08 a.m. UTC | #1
Hi Hannes,
On Wed, Jun 14, 2023 at 01:46:36PM +0200, Hannes Reinecke wrote:
> The mapping has an underlying blocksize (by virtue of
> mapping->host->i_blkbits), so if the mapping blocksize
> is larger than the pagesize we should allocate folios
> in the correct order.
> 
Network filesystems such as 9pfs set blkbits to the maximum data size
they want to transfer, leading to unnecessary memory pressure as we will
try to allocate higher-order folios (order 5 in my setup). Isn't it
better for each filesystem to request the minimum folio order it needs
for its page cache early on? Block devices can do the same for their
block cache.

I have a prototype along those lines and I will post it soon. This is also
something willy indicated before in a mailing list conversation.

> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 47afbca1d122..031935b78af7 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -245,7 +245,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>  			continue;
>  		}
>  
> -		folio = filemap_alloc_folio(gfp_mask, 0);
> +		folio = filemap_alloc_folio(gfp_mask, mapping_get_order(mapping));
>  		if (!folio)
>  			break;
>  		if (filemap_add_folio(mapping, folio, index + i,

Did you turn on CONFIG_DEBUG_VM while testing? I don't think we are
incrementing the counter in this function correctly, as this function
assumes order-0 folios. We might need something like this:

-               ractl->_nr_pages++;
+               ractl->_nr_pages += folio_nr_pages(folio);
+               i += folio_nr_pages(folio) - 1;
> @@ -806,7 +806,7 @@ void readahead_expand(struct readahead_control *ractl,
>  		if (folio && !xa_is_value(folio))
>  			return; /* Folio apparently present */
>  
> -		folio = filemap_alloc_folio(gfp_mask, 0);
> +		folio = filemap_alloc_folio(gfp_mask, mapping_get_order(mapping));
>  		if (!folio)
>  			return;
>  		if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
> @@ -833,7 +833,7 @@ void readahead_expand(struct readahead_control *ractl,
>  		if (folio && !xa_is_value(folio))
>  			return; /* Folio apparently present */
Same here:
-               ractl->_nr_pages++;
+               ractl->_nr_pages += folio_nr_pages(folio);

>  
> -		folio = filemap_alloc_folio(gfp_mask, 0);
> +		folio = filemap_alloc_folio(gfp_mask, mapping_get_order(mapping));
>  		if (!folio)
>  			return;
>  		if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
> -- 
> 2.35.3
>
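
For context, a consolidated sketch of how the page_cache_ra_unbounded()
allocation loop would look with Pankaj's accounting fix applied
(abbreviated; this assumes the loop structure of the series under review
and elides the unchanged cache-hit and error paths):

	for (i = 0; i < nr_to_read; i++) {
		struct folio *folio;

		/* ... existing handling of folios already in the cache ... */

		folio = filemap_alloc_folio(gfp_mask,
					    mapping_get_order(mapping));
		if (!folio)
			break;
		if (filemap_add_folio(mapping, folio, index + i,
				      gfp_mask) < 0) {
			folio_put(folio);
			/* ... existing error path ... */
			continue;
		}
		if (i == nr_to_read - lookahead_size)
			folio_set_readahead(folio);
		/* account for every page of the folio, not just one */
		ractl->_nr_pages += folio_nr_pages(folio);
		/* and skip the indices covered by its tail pages */
		i += folio_nr_pages(folio) - 1;
	}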
Hannes Reinecke June 19, 2023, 8:42 a.m. UTC | #2
On 6/19/23 10:08, Pankaj Raghav wrote:
> Hi Hannes,
> On Wed, Jun 14, 2023 at 01:46:36PM +0200, Hannes Reinecke wrote:
>> The mapping has an underlying blocksize (by virtue of
>> mapping->host->i_blkbits), so if the mapping blocksize
>> is larger than the pagesize we should allocate folios
>> in the correct order.
>>
> Network filesystems such as 9pfs set blkbits to the maximum data size
> they want to transfer, leading to unnecessary memory pressure as we will
> try to allocate higher-order folios (order 5 in my setup). Isn't it
> better for each filesystem to request the minimum folio order it needs
> for its page cache early on? Block devices can do the same for their
> block cache.
> 
> I have a prototype along those lines and I will post it soon. This is also
> something willy indicated before in a mailing list conversation.
> 
Well; I _thought_ that's why we had things like optimal I/O size and
maximal I/O size. But this seems to be relegated to request queue limits,
so I guess it's not available from 'struct block_device' or 'struct 
gendisk'.

So I've been thinking of adding a flag somewhere (possibly in
'struct address_space') to indicate that blkbits is a hard limit
and not just an advisory thing.

But indeed, I've seen this with NFS, too, which insists on setting 
blkbits to something like 8.

Cheers,

Hannes
Dave Chinner June 19, 2023, 10:57 p.m. UTC | #3
On Mon, Jun 19, 2023 at 10:42:38AM +0200, Hannes Reinecke wrote:
> On 6/19/23 10:08, Pankaj Raghav wrote:
> > Hi Hannes,
> > On Wed, Jun 14, 2023 at 01:46:36PM +0200, Hannes Reinecke wrote:
> > > The mapping has an underlying blocksize (by virtue of
> > > mapping->host->i_blkbits), so if the mapping blocksize
> > > is larger than the pagesize we should allocate folios
> > > in the correct order.
> > > 
> > Network filesystems such as 9pfs set blkbits to the maximum data size
> > they want to transfer, leading to unnecessary memory pressure as we will
> > try to allocate higher-order folios (order 5 in my setup). Isn't it
> > better for each filesystem to request the minimum folio order it needs
> > for its page cache early on? Block devices can do the same for their
> > block cache.

Folio size is not a "filesystem wide" thing - it's a per-inode
configuration. We can have inodes within a filesystem that have
different "block" sizes. A prime example of this is XFS directories
- they can have 64kB block sizes on a 4kB block size filesystem.

Another example is extent size hints in XFS data files - they
trigger aligned allocation-around similar to using large folios in
the page cache for small writes. Effectively this gives data files a
"block size" of the extent size hint regardless of the filesystem
block size.

Hence in future we might want different sizes of folios for
different types of inodes and so whatever we do we need to support
per-inode folio size configuration for the inode mapping tree.

> > I have a prototype along those lines and I will post it soon. This is also
> > something willy indicated before in a mailing list conversation.
> > 
> Well; I _thought_ that's why we had things like optimal I/O size and
> maximal I/O size. But this seems to be relegated to request queue limits,
> so I guess it's not available from 'struct block_device' or 'struct
> gendisk'.

Yes, those are block device constructs to enable block device based
filesystems to be laid out best for the given block device. They
don't exist for non-block-based filesystems like network
filesystems...

> So I've been thinking of adding a flag somewhere (possibly in
> 'struct address_space') to indicate that blkbits is a hard limit
> and not just an advisory thing.

This still relies on interpreting inode->i_blkbits repeatedly at
runtime in some way, in mm code that really has no business looking
at filesystem block sizes.

What is needed is a field in the mapping that defines the
folio order that all folios allocated for the page cache must be
aligned/sized to, to allow them to be inserted into the mapping.

This means the minimum folio order and alignment are maintained
entirely by the mapping (e.g. it allows truncate to do the right
thing), and the filesystem/device side code does not need to do
anything special (except support large folios) to ensure that the
page cache always contains folios that are block sized and aligned.

We already have mapping_set_large_folios() that we use at
inode/mapping instantiation time to enable large folios in the page
cache for that mapping. What we need is a new
mapping_set_large_folio_order() API to enable the filesystem/device
to set the base folio order for the mapping tree at instantiation
time, and for all the page cache instantiation code to align/size to
the order stored in the mapping...

Cheers,

Dave.
Matthew Wilcox June 20, 2023, midnight UTC | #4
On Tue, Jun 20, 2023 at 08:57:48AM +1000, Dave Chinner wrote:
> What is needed is a field in the mapping that defines the
> folio order that all folios allocated for the page cache must be
> aligned/sized to, to allow them to be inserted into the mapping.

Attached patch from December 2020 ;-)
From 1aeee696f4d322af5f34544e39fc00006c399fb8 Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Date: Tue, 15 Dec 2020 10:57:34 -0500
Subject: [PATCH] fs: Allow fine-grained control of folio sizes

Some filesystems want to be able to limit the maximum size of folios,
and some want to be able to ensure that folios are at least a certain
size.  Add mapping_set_folio_orders() to allow this level of control
(although it is not yet honoured).

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/pagemap.h | 41 +++++++++++++++++++++++++++++++++++++----
 1 file changed, 37 insertions(+), 4 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index cad81db32e61..9cbb8bdbaee7 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -198,9 +198,15 @@ enum mapping_flags {
 	AS_EXITING	= 4, 	/* final truncate in progress */
 	/* writeback related tags are not used */
 	AS_NO_WRITEBACK_TAGS = 5,
-	AS_LARGE_FOLIO_SUPPORT = 6,
+	AS_FOLIO_ORDER_MIN = 8,
+	AS_FOLIO_ORDER_MAX = 13,
+	/* 8-17 are used for FOLIO_ORDER */
 };
 
+#define AS_FOLIO_ORDER_MIN_MASK	0x00001f00
+#define AS_FOLIO_ORDER_MAX_MASK 0x0003e000
+#define AS_FOLIO_ORDER_MASK (AS_FOLIO_ORDER_MIN_MASK | AS_FOLIO_ORDER_MAX_MASK)
+
 /**
  * mapping_set_error - record a writeback error in the address_space
  * @mapping: the mapping in which an error should be set
@@ -290,6 +296,29 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
 	m->gfp_mask = mask;
 }
 
+/**
+ * mapping_set_folio_orders() - Set the range of folio sizes supported.
+ * @mapping: The file.
+ * @min: Minimum folio order (between 0-31 inclusive).
+ * @max: Maximum folio order (between 0-31 inclusive).
+ *
+ * The filesystem should call this function in its inode constructor to
+ * indicate which sizes of folio the VFS can use to cache the contents
+ * of the file.  This should only be used if the filesystem needs special
+ * handling of folio sizes (ie there is something the core cannot know).
+ * Do not tune it based on, eg, i_size.
+ * 
+ * Context: This should not be called while the inode is active as it
+ * is non-atomic.
+ */
+static inline void mapping_set_folio_orders(struct address_space *mapping,
+		unsigned int min, unsigned int max)
+{
+	mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
+			(min << AS_FOLIO_ORDER_MIN) |
+			(max << AS_FOLIO_ORDER_MAX);
+}
+
 /**
  * mapping_set_large_folios() - Indicate the file supports large folios.
  * @mapping: The file.
@@ -303,7 +332,12 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
  */
 static inline void mapping_set_large_folios(struct address_space *mapping)
 {
-	__set_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
+	mapping_set_folio_orders(mapping, 0, 31);
+}
+
+static inline unsigned mapping_max_folio_order(struct address_space *mapping)
+{
+	return (mapping->flags & AS_FOLIO_ORDER_MAX_MASK) >> AS_FOLIO_ORDER_MAX;
 }
 
 /*
@@ -312,8 +346,7 @@ static inline void mapping_set_large_folios(struct address_space *mapping)
  */
 static inline bool mapping_large_folio_support(struct address_space *mapping)
 {
-	return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
-		test_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
+	return mapping_max_folio_order(mapping) > 0;
 }
 
 static inline int filemap_nr_thps(struct address_space *mapping)
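
For illustration, a sketch of how a filesystem could use the above at
inode instantiation time (hypothetical example_fs_init_mapping() helper,
not part of the patch; note the patch itself says the orders are not yet
honoured):

	/*
	 * Hypothetical example: a filesystem with 16k blocks on a 4k-page
	 * kernel pins the minimum folio order of the inode's mapping tree.
	 */
	static void example_fs_init_mapping(struct inode *inode)
	{
		unsigned int min = 0;

		if (inode->i_blkbits > PAGE_SHIFT)
			min = inode->i_blkbits - PAGE_SHIFT;
		/* min..31 mirrors the mapping_set_large_folios() upper bound */
		mapping_set_folio_orders(inode->i_mapping, min, 31);
	}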
Hannes Reinecke June 20, 2023, 5:57 a.m. UTC | #5
On 6/20/23 00:57, Dave Chinner wrote:
> On Mon, Jun 19, 2023 at 10:42:38AM +0200, Hannes Reinecke wrote:
>> On 6/19/23 10:08, Pankaj Raghav wrote:
>>> Hi Hannes,
>>> On Wed, Jun 14, 2023 at 01:46:36PM +0200, Hannes Reinecke wrote:
>>>> The mapping has an underlying blocksize (by virtue of
>>>> mapping->host->i_blkbits), so if the mapping blocksize
>>>> is larger than the pagesize we should allocate folios
>>>> in the correct order.
>>>>
>>> Network filesystems such as 9pfs set blkbits to the maximum data size
>>> they want to transfer, leading to unnecessary memory pressure as we will
>>> try to allocate higher-order folios (order 5 in my setup). Isn't it
>>> better for each filesystem to request the minimum folio order it needs
>>> for its page cache early on? Block devices can do the same for their
>>> block cache.
> 
> Folio size is not a "filesystem wide" thing - it's a per-inode
> configuration. We can have inodes within a filesystem that have
> different "block" sizes. A prime example of this is XFS directories
> - they can have 64kB block sizes on a 4kB block size filesystem.
> 
> Another example is extent size hints in XFS data files - they
> trigger aligned allocation-around similar to using large folios in
> the page cache for small writes. Effectively this gives data files a
> "block size" of the extent size hint regardless of the filesystem
> block size.
> 
> Hence in future we might want different sizes of folios for
> different types of inodes and so whatever we do we need to support
> per-inode folio size configuration for the inode mapping tree.
> 
Sure. Using some mapping tree configuration was what I had in mind, too.

>>> I have a prototype along those lines and I will post it soon. This is also
>>> something willy indicated before in a mailing list conversation.
>>>
>> Well; I _thought_ that's why we had things like optimal I/O size and
>> maximal I/O size. But this seems to be relegated to request queue limits,
>> so I guess it's not available from 'struct block_device' or 'struct
>> gendisk'.
> 
> Yes, those are block device constructs to enable block device based
> filesystems to be laid out best for the given block device. They
> don't exist for non-block-based filesystems like network
> filesystems...
> 
>> So I've been thinking of adding a flag somewhere (possibly in
>> 'struct address_space') to indicate that blkbits is a hard limit
>> and not just an advisory thing.
> 
> This still relies on interpreting inode->i_blkbits repeatedly at
> runtime in some way, in mm code that really has no business looking
> at filesystem block sizes.
> 
> What is needed is a field in the mapping that defines the
> folio order that all folios allocated for the page cache must be
> aligned/sized to, to allow them to be inserted into the mapping.
> 
> This means the minimum folio order and alignment are maintained
> entirely by the mapping (e.g. it allows truncate to do the right
> thing), and the filesystem/device side code does not need to do
> anything special (except support large folios) to ensure that the
> page cache always contains folios that are block sized and aligned.
> 
> We already have mapping_set_large_folios() that we use at
> inode/mapping instantiation time to enable large folios in the page
> cache for that mapping. What we need is a new
> mapping_set_large_folio_order() API to enable the filesystem/device
> to set the base folio order for the mapping tree at instantiation
> time, and for all the page cache instantiation code to align/size to
> the order stored in the mapping...
> 
Having a mapping_set_large_folio_order() (or maybe
mapping_set_folio_order(), since as soon as order > 0 we will have large
folios ...) looks like a good idea.
I'll give it a go.

Cheers,

Hannes

Patch

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 716953ee1ebd..9ea1a9724d64 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -494,6 +494,13 @@  static inline gfp_t readahead_gfp_mask(struct address_space *x)
 	return mapping_gfp_mask(x) | __GFP_NORETRY | __GFP_NOWARN;
 }
 
+static inline int mapping_get_order(struct address_space *x)
+{
+	if (x->host->i_blkbits > PAGE_SHIFT)
+		return x->host->i_blkbits - PAGE_SHIFT;
+	return 0;
+}
+
 typedef int filler_t(struct file *, struct folio *);
 
 pgoff_t page_cache_next_miss(struct address_space *mapping,
diff --git a/mm/filemap.c b/mm/filemap.c
index 4be20e82e4c3..6f08d04995d9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1936,7 +1936,7 @@  struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 			gfp |= GFP_NOWAIT | __GFP_NOWARN;
 		}
 
-		folio = filemap_alloc_folio(gfp, 0);
+		folio = filemap_alloc_folio(gfp, mapping_get_order(mapping));
 		if (!folio)
 			return ERR_PTR(-ENOMEM);
 
@@ -2495,7 +2495,8 @@  static int filemap_create_folio(struct file *file,
 	struct folio *folio;
 	int error;
 
-	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), 0);
+	folio = filemap_alloc_folio(mapping_gfp_mask(mapping),
+				    mapping_get_order(mapping));
 	if (!folio)
 		return -ENOMEM;
 
@@ -3646,7 +3647,7 @@  static struct folio *do_read_cache_folio(struct address_space *mapping,
 repeat:
 	folio = filemap_get_folio(mapping, index);
 	if (IS_ERR(folio)) {
-		folio = filemap_alloc_folio(gfp, 0);
+		folio = filemap_alloc_folio(gfp, mapping_get_order(mapping));
 		if (!folio)
 			return ERR_PTR(-ENOMEM);
 		err = filemap_add_folio(mapping, folio, index, gfp);
diff --git a/mm/readahead.c b/mm/readahead.c
index 47afbca1d122..031935b78af7 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -245,7 +245,7 @@  void page_cache_ra_unbounded(struct readahead_control *ractl,
 			continue;
 		}
 
-		folio = filemap_alloc_folio(gfp_mask, 0);
+		folio = filemap_alloc_folio(gfp_mask, mapping_get_order(mapping));
 		if (!folio)
 			break;
 		if (filemap_add_folio(mapping, folio, index + i,
@@ -806,7 +806,7 @@  void readahead_expand(struct readahead_control *ractl,
 		if (folio && !xa_is_value(folio))
 			return; /* Folio apparently present */
 
-		folio = filemap_alloc_folio(gfp_mask, 0);
+		folio = filemap_alloc_folio(gfp_mask, mapping_get_order(mapping));
 		if (!folio)
 			return;
 		if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
@@ -833,7 +833,7 @@  void readahead_expand(struct readahead_control *ractl,
 		if (folio && !xa_is_value(folio))
 			return; /* Folio apparently present */
 
-		folio = filemap_alloc_folio(gfp_mask, 0);
+		folio = filemap_alloc_folio(gfp_mask, mapping_get_order(mapping));
 		if (!folio)
 			return;
 		if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
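
As a quick sanity sketch of the new helper's expected behaviour (not
part of the patch; assumes PAGE_SHIFT == 12):

	static void check_mapping_get_order(struct address_space *mapping)
	{
		mapping->host->i_blkbits = 14;	/* 16k blocksize */
		WARN_ON(mapping_get_order(mapping) != 2);	/* 4-page folios */

		mapping->host->i_blkbits = 12;	/* 4k blocksize */
		WARN_ON(mapping_get_order(mapping) != 0);	/* unchanged */
	}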