diff mbox series

[v7,11/11] xfs: enable block size larger than page size support

Message ID 20240607145902.1137853-12-kernel@pankajraghav.com (mailing list archive)
State Superseded
Headers show
Series enable bs > ps in XFS | expand

Commit Message

Pankaj Raghav (Samsung) June 7, 2024, 2:59 p.m. UTC
From: Pankaj Raghav <p.raghav@samsung.com>

Page cache now has the ability to have a minimum order when allocating
a folio which is a prerequisite to add support for block size > page
size.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/xfs/libxfs/xfs_ialloc.c |  5 +++++
 fs/xfs/libxfs/xfs_shared.h |  3 +++
 fs/xfs/xfs_icache.c        |  6 ++++--
 fs/xfs/xfs_mount.c         |  1 -
 fs/xfs/xfs_super.c         | 18 ++++++++++--------
 5 files changed, 22 insertions(+), 11 deletions(-)

Comments

Christoph Hellwig June 13, 2024, 8:47 a.m. UTC | #1
On Fri, Jun 07, 2024 at 02:59:02PM +0000, Pankaj Raghav (Samsung) wrote:
> --- a/fs/xfs/libxfs/xfs_ialloc.c
> +++ b/fs/xfs/libxfs/xfs_ialloc.c
> @@ -3019,6 +3019,11 @@ xfs_ialloc_setup_geometry(
>  		igeo->ialloc_align = mp->m_dalign;
>  	else
>  		igeo->ialloc_align = 0;
> +
> +	if (mp->m_sb.sb_blocksize > PAGE_SIZE)
> +		igeo->min_folio_order = mp->m_sb.sb_blocklog - PAGE_SHIFT;
> +	else
> +		igeo->min_folio_order = 0;
>  }

The minimum folio order isn't really part of the inode (allocation)
geometry, is it?
Dave Chinner June 17, 2024, 1:29 a.m. UTC | #2
On Thu, Jun 13, 2024 at 10:47:25AM +0200, Christoph Hellwig wrote:
> On Fri, Jun 07, 2024 at 02:59:02PM +0000, Pankaj Raghav (Samsung) wrote:
> > --- a/fs/xfs/libxfs/xfs_ialloc.c
> > +++ b/fs/xfs/libxfs/xfs_ialloc.c
> > @@ -3019,6 +3019,11 @@ xfs_ialloc_setup_geometry(
> >  		igeo->ialloc_align = mp->m_dalign;
> >  	else
> >  		igeo->ialloc_align = 0;
> > +
> > +	if (mp->m_sb.sb_blocksize > PAGE_SIZE)
> > +		igeo->min_folio_order = mp->m_sb.sb_blocklog - PAGE_SHIFT;
> > +	else
> > +		igeo->min_folio_order = 0;
> >  }
> 
> The minimum folio order isn't really part of the inode (allocation)
> geometry, is it?

I suggested it last time around instead of calculating the same
constant on every inode allocation. We're already storing in-memory
strunct xfs_inode allocation init values in this structure. e.g. in
xfs_inode_alloc() we see things like this:

	ip->i_diflags2 = mp->m_ino_geo.new_diflags2;

So that we only calculate the default values for the filesystem once
instead of on every inode allocation. This isn't unreasonable
because xfs_inode_alloc() is a fairly hot path.

The only other place we might store it is the struct xfs_mount, but
given all the inode allocation constants are already in the embedded
mp->m_ino_geo structure, it just seems like a much better idea to
put it will all the other inode allocation constants than dump it
randomly into the struct xfs_mount....

-Dave.
Christoph Hellwig June 17, 2024, 6:51 a.m. UTC | #3
On Mon, Jun 17, 2024 at 11:29:42AM +1000, Dave Chinner wrote:
> > > +	if (mp->m_sb.sb_blocksize > PAGE_SIZE)
> > > +		igeo->min_folio_order = mp->m_sb.sb_blocklog - PAGE_SHIFT;
> > > +	else
> > > +		igeo->min_folio_order = 0;
> > >  }
> > 
> > The minimum folio order isn't really part of the inode (allocation)
> > geometry, is it?
> 
> I suggested it last time around instead of calculating the same
> constant on every inode allocation. We're already storing in-memory
> strunct xfs_inode allocation init values in this structure. e.g. in
> xfs_inode_alloc() we see things like this:

While new_diflags2 isn't exactly inode geometry, it at least is part
of the inode allocation.  Folio min order for file data has nothing
to do with this at all.

> The only other place we might store it is the struct xfs_mount, but
> given all the inode allocation constants are already in the embedded
> mp->m_ino_geo structure, it just seems like a much better idea to
> put it will all the other inode allocation constants than dump it
> randomly into the struct xfs_mount....

Well, it is very closely elated to say the m_blockmask field in
struct xfs_mount.  The again modern CPUs tend to get a you simple
subtraction for free in most pipelines doing other things, so I'm
not really sure it's worth caching for use in inode allocation to
start with, but I don't care strongly about that.
Pankaj Raghav (Samsung) June 17, 2024, 4:31 p.m. UTC | #4
On Mon, Jun 17, 2024 at 08:51:04AM +0200, Christoph Hellwig wrote:
> On Mon, Jun 17, 2024 at 11:29:42AM +1000, Dave Chinner wrote:
> > > > +	if (mp->m_sb.sb_blocksize > PAGE_SIZE)
> > > > +		igeo->min_folio_order = mp->m_sb.sb_blocklog - PAGE_SHIFT;
> > > > +	else
> > > > +		igeo->min_folio_order = 0;
> > > >  }
> > > 
> > > The minimum folio order isn't really part of the inode (allocation)
> > > geometry, is it?
> > 
> > I suggested it last time around instead of calculating the same
> > constant on every inode allocation. We're already storing in-memory
> > strunct xfs_inode allocation init values in this structure. e.g. in
> > xfs_inode_alloc() we see things like this:
> 
> While new_diflags2 isn't exactly inode geometry, it at least is part
> of the inode allocation.  Folio min order for file data has nothing
> to do with this at all.
> 
> > The only other place we might store it is the struct xfs_mount, but
> > given all the inode allocation constants are already in the embedded
> > mp->m_ino_geo structure, it just seems like a much better idea to
> > put it will all the other inode allocation constants than dump it
> > randomly into the struct xfs_mount....
> 
> Well, it is very closely elated to say the m_blockmask field in
> struct xfs_mount.  The again modern CPUs tend to get a you simple
> subtraction for free in most pipelines doing other things, so I'm
> not really sure it's worth caching for use in inode allocation to
> start with, but I don't care strongly about that.

But there will also be an extra conditional apart from subtraction
right? 

Initially it was something like this:

@@ -73,6 +73,7 @@ xfs_inode_alloc(
 	xfs_ino_t		ino)
 {
 	struct xfs_inode	*ip;
+	int			min_order = 0;
 
 	/*
 	 * XXX: If this didn't occur in transactions, we could drop GFP_NOFAIL
@@ -88,7 +89,8 @@ xfs_inode_alloc(
 	/* VFS doesn't initialise i_mode or i_state! */
 	VFS_I(ip)->i_mode = 0;
 	VFS_I(ip)->i_state = 0;
-	mapping_set_large_folios(VFS_I(ip)->i_mapping);
+	min_order = max(min_order, ilog2(mp->m_sb.sb_blocksize) - PAGE_SHIFT);
+	mapping_set_folio_orders(VFS_I(ip)->i_mapping, min_order, MAX_PAGECACHE_ORDER);
 
 	XFS_STATS_INC(mp, vn_active);
 	ASSERT(atomic_read(&ip->i_pincount) == 0);
@@ -313,6 +315,7 @@ xfs_reinit_inode(
 	dev_t			dev = inode->i_rdev;
 	kuid_t			uid = inode->i_uid;
 	kgid_t			gid = inode->i_gid;
+	int			min_order = 0;
 
 	error = inode_init_always(mp->m_super, inode);
 
@@ -323,7 +326,8 @@ xfs_reinit_inode(
 	inode->i_rdev = dev;
 	inode->i_uid = uid;
 	inode->i_gid = gid;
-	mapping_set_large_folios(inode->i_mapping);
+	min_order = max(min_order, ilog2(mp->m_sb.sb_blocksize) - PAGE_SHIFT);
+	mapping_set_folio_orders(inode->i_mapping, min_order, MAX_PAGECACHE_ORDER);
 	return error;
 }

It does introduce a conditional in the inode allocation hot path so I
went with what Chinner proposed as it is something we use when we
initialize an inode.
Dave Chinner June 17, 2024, 11:18 p.m. UTC | #5
On Mon, Jun 17, 2024 at 08:51:04AM +0200, Christoph Hellwig wrote:
> On Mon, Jun 17, 2024 at 11:29:42AM +1000, Dave Chinner wrote:
> > > > +	if (mp->m_sb.sb_blocksize > PAGE_SIZE)
> > > > +		igeo->min_folio_order = mp->m_sb.sb_blocklog - PAGE_SHIFT;
> > > > +	else
> > > > +		igeo->min_folio_order = 0;
> > > >  }
> > > 
> > > The minimum folio order isn't really part of the inode (allocation)
> > > geometry, is it?
> > 
> > I suggested it last time around instead of calculating the same
> > constant on every inode allocation. We're already storing in-memory
> > strunct xfs_inode allocation init values in this structure. e.g. in
> > xfs_inode_alloc() we see things like this:
> 
> While new_diflags2 isn't exactly inode geometry, it at least is part
> of the inode allocation.  Folio min order for file data has nothing
> to do with this at all.

Yet ip->i_diflags2 is *not* initialised in xfs_init_new_inode()
when we physically allocate and initialise a new inode. It is set
for all inodes when they are allocated in memory, regardless of
their use.

Whether that is the right thing to do or not is a separate issue -
xfs_inode_from_disk() will overwrite it in every inode read case
that isn't a create.

Indeed, We could do the folio order initialisation in
xfs_setup_inode() where we set up the mapping gfp mask, but that
doesn't change the fact we set it up for every inode that is
instantiated in memory or that we want it pre-calculated...

> > The only other place we might store it is the struct xfs_mount, but
> > given all the inode allocation constants are already in the embedded
> > mp->m_ino_geo structure, it just seems like a much better idea to
> > put it will all the other inode allocation constants than dump it
> > randomly into the struct xfs_mount....
> 
> Well, it is very closely elated to say the m_blockmask field in
> struct xfs_mount.

Not really. The block mask is a property of the and used primarily
for manipulating lengths in units of FSB to/from byte counts and
vice versa. It is used all over the place, and the only guaranteed
common structure that all those callers have access to is the
xfs_mount.

OTOH, the folio order is only used for regular files to tell the
page cache how to behave. The scope of the folio order setup is the
same as mapping_set_gfp_mask() - is it only used in one place and
used for inode configuration. I may have called the structure "inode
geometry" because that described what it contained when I first
implemented it, but that doesn't mean that is all that is can
contain. It contains static, precalculated inode configuration
values, and that what we are adding here...

> The again modern CPUs tend to get a you simple
> subtraction for free in most pipelines doing other things, so I'm
> not really sure it's worth caching for use in inode allocation to
> start with, but I don't care strongly about that.

It's not the cost of a subtraction that is the problem -
precalculation is about avoiding a potential branch misprediction in
a hot path that would stall the CPU. If there were no branches, it
wouldn't be an issue, but this value cannot be calculated without at
least one branch in the logic.

-Dave.
diff mbox series

Patch

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 14c81f227c5b..1e76431d75a4 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -3019,6 +3019,11 @@  xfs_ialloc_setup_geometry(
 		igeo->ialloc_align = mp->m_dalign;
 	else
 		igeo->ialloc_align = 0;
+
+	if (mp->m_sb.sb_blocksize > PAGE_SIZE)
+		igeo->min_folio_order = mp->m_sb.sb_blocklog - PAGE_SHIFT;
+	else
+		igeo->min_folio_order = 0;
 }
 
 /* Compute the location of the root directory inode that is laid out by mkfs. */
diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index 34f104ed372c..e67a1c7cc0b0 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -231,6 +231,9 @@  struct xfs_ino_geometry {
 	/* precomputed value for di_flags2 */
 	uint64_t	new_diflags2;
 
+	/* minimum folio order of a page cache allocation */
+	unsigned int	min_folio_order;
+
 };
 
 #endif /* __XFS_SHARED_H__ */
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 0953163a2d84..5ed3dc9e7d90 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -89,7 +89,8 @@  xfs_inode_alloc(
 	/* VFS doesn't initialise i_mode or i_state! */
 	VFS_I(ip)->i_mode = 0;
 	VFS_I(ip)->i_state = 0;
-	mapping_set_large_folios(VFS_I(ip)->i_mapping);
+	mapping_set_folio_min_order(VFS_I(ip)->i_mapping,
+				    M_IGEO(mp)->min_folio_order);
 
 	XFS_STATS_INC(mp, vn_active);
 	ASSERT(atomic_read(&ip->i_pincount) == 0);
@@ -324,7 +325,8 @@  xfs_reinit_inode(
 	inode->i_rdev = dev;
 	inode->i_uid = uid;
 	inode->i_gid = gid;
-	mapping_set_large_folios(inode->i_mapping);
+	mapping_set_folio_min_order(inode->i_mapping,
+				    M_IGEO(mp)->min_folio_order);
 	return error;
 }
 
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 46cb0384143b..a99454208807 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -135,7 +135,6 @@  xfs_sb_validate_fsb_count(
 	uint64_t		max_index;
 	uint64_t		max_bytes;
 
-	ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
 	ASSERT(sbp->sb_blocklog >= BBSHIFT);
 
 	if (check_shl_overflow(nblocks, sbp->sb_blocklog, &max_bytes))
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 27e9f749c4c7..b8a93a8f35ca 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1638,16 +1638,18 @@  xfs_fs_fill_super(
 		goto out_free_sb;
 	}
 
-	/*
-	 * Until this is fixed only page-sized or smaller data blocks work.
-	 */
 	if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
-		xfs_warn(mp,
-		"File system with blocksize %d bytes. "
-		"Only pagesize (%ld) or less will currently work.",
+		if (!xfs_has_crc(mp)) {
+			xfs_warn(mp,
+"V4 Filesystem with blocksize %d bytes. Only pagesize (%ld) or less is supported.",
 				mp->m_sb.sb_blocksize, PAGE_SIZE);
-		error = -ENOSYS;
-		goto out_free_sb;
+			error = -ENOSYS;
+			goto out_free_sb;
+		}
+
+		xfs_warn(mp,
+"EXPERIMENTAL: V5 Filesystem with Large Block Size (%d bytes) enabled.",
+			mp->m_sb.sb_blocksize);
 	}
 
 	/* Ensure this filesystem fits in the page cache limits */