diff mbox

[RFC,v2,1/2] Btrfs: add noi_version option to disable MS_I_VERSION

Message ID

1434527672-5762-1-git-send-email-bo.li.liu@oracle.com (mailing list archive)

State

New, archived

Headers

show

From: Liu Bo <bo.li.liu@oracle.com>
To: linux-btrfs@vger.kernel.org
Cc: fdmanana@suse.com
Subject: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable
	MS_I_VERSION
Date: Wed, 17 Jun 2015 15:54:31 +0800
Message-Id: <1434527672-5762-1-git-send-email-bo.li.liu@oracle.com>
Sender: linux-btrfs-owner@vger.kernel.org
Precedence: bulk

Commit Message

Liu Bo June 17, 2015, 7:54 a.m. UTC

MS_I_VERSION is enabled by default for btrfs, this adds an alternative
option to toggle it off.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
 fs/btrfs/super.c |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

Comments

David Sterba June 17, 2015, 3:33 p.m. UTC | #1

On Wed, Jun 17, 2015 at 03:54:31PM +0800, Liu Bo wrote:
> MS_I_VERSION is enabled by default for btrfs, this adds an alternative
> option to toggle it off.

There's an existing generic iversion/noiversion mount option pair, no
need to extra add it to btrfs.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Liu Bo June 17, 2015, 3:52 p.m. UTC | #2

On Wed, Jun 17, 2015 at 05:33:06PM +0200, David Sterba wrote:
> On Wed, Jun 17, 2015 at 03:54:31PM +0800, Liu Bo wrote:
> > MS_I_VERSION is enabled by default for btrfs, this adds an alternative
> > option to toggle it off.
> 
> There's an existing generic iversion/noiversion mount option pair, no
> need to extra add it to btrfs.

I know, it doesn't work though.

Thanks,

-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

David Sterba June 17, 2015, 5:01 p.m. UTC | #3

On Wed, Jun 17, 2015 at 11:52:36PM +0800, Liu Bo wrote:
> On Wed, Jun 17, 2015 at 05:33:06PM +0200, David Sterba wrote:
> > On Wed, Jun 17, 2015 at 03:54:31PM +0800, Liu Bo wrote:
> > > MS_I_VERSION is enabled by default for btrfs, this adds an alternative
> > > option to toggle it off.
> > 
> > There's an existing generic iversion/noiversion mount option pair, no
> > need to extra add it to btrfs.
> 
> I know, it doesn't work though.

Sigh, I see, btrfs forces MS_I_VERSION flag,
0c4d2d95d06e920e0c61707e62c7fffc9c57f63a. I read 'enabled by default' as
that there's a standard way to override the defaults.

So the right way is not to do that but this will break everyhing that
relies on that behaviour at the moment. This means to add the exception
to the upper layers, either VFS or 'mount', which is not very likely to
happen.

The generic options do not reach the filesystem specific callbacks, so
we can't check it.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Liu Bo June 18, 2015, 2:46 a.m. UTC | #4

On Wed, Jun 17, 2015 at 07:01:18PM +0200, David Sterba wrote:
> On Wed, Jun 17, 2015 at 11:52:36PM +0800, Liu Bo wrote:
> > On Wed, Jun 17, 2015 at 05:33:06PM +0200, David Sterba wrote:
> > > On Wed, Jun 17, 2015 at 03:54:31PM +0800, Liu Bo wrote:
> > > > MS_I_VERSION is enabled by default for btrfs, this adds an alternative
> > > > option to toggle it off.
> > > 
> > > There's an existing generic iversion/noiversion mount option pair, no
> > > need to extra add it to btrfs.
> > 
> > I know, it doesn't work though.
> 
> Sigh, I see, btrfs forces MS_I_VERSION flag,
> 0c4d2d95d06e920e0c61707e62c7fffc9c57f63a. I read 'enabled by default' as
> that there's a standard way to override the defaults.
> 
> So the right way is not to do that but this will break everyhing that
> relies on that behaviour at the moment. This means to add the exception
> to the upper layers, either VFS or 'mount', which is not very likely to
> happen.
> 
> The generic options do not reach the filesystem specific callbacks, so
> we can't check it.

Ext4 also makes its own "i_version" option, so I think we can do the
same thing until more filesystems require to do it in a generic way.

The performance benefit with no_iversion is obvious for fsync
related workloads since we would avoid some expensive log commits.

Thanks,

-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

David Sterba June 18, 2015, 2:38 p.m. UTC | #5

Moving the discussion to fsdevel.

Summary: disabling MS_I_VERSION brings some speedups to btrfs, but the
generic 'noiversion' option cannot be used to achieve that. It is
processed before it reaches btrfs superblock callback, where
MS_I_VERSION is forced.

The proposed fix is to add btrfs-specific i_version/noi_version to btrfs,
to which I object.

Continued below.

On Thu, Jun 18, 2015 at 10:46:09AM +0800, Liu Bo wrote:
> On Wed, Jun 17, 2015 at 07:01:18PM +0200, David Sterba wrote:
> > On Wed, Jun 17, 2015 at 11:52:36PM +0800, Liu Bo wrote:
> > > On Wed, Jun 17, 2015 at 05:33:06PM +0200, David Sterba wrote:
> > > > On Wed, Jun 17, 2015 at 03:54:31PM +0800, Liu Bo wrote:
> > > > > MS_I_VERSION is enabled by default for btrfs, this adds an alternative
> > > > > option to toggle it off.
> > > > 
> > > > There's an existing generic iversion/noiversion mount option pair, no
> > > > need to extra add it to btrfs.
> > > 
> > > I know, it doesn't work though.
> > 
> > Sigh, I see, btrfs forces MS_I_VERSION flag,
> > 0c4d2d95d06e920e0c61707e62c7fffc9c57f63a. I read 'enabled by default' as
> > that there's a standard way to override the defaults.
> > 
> > So the right way is not to do that but this will break everyhing that
> > relies on that behaviour at the moment. This means to add the exception
> > to the upper layers, either VFS or 'mount', which is not very likely to
> > happen.
> > 
> > The generic options do not reach the filesystem specific callbacks, so
> > we can't check it.
> 
> Ext4 also makes its own "i_version" option, so I think we can do the
> same thing until more filesystems require to do it in a generic way.

AFAICS, ext4 had added it's own i_version before iversion was added to
mount:

ext4:

Commit:      25ec56b518257a56d2ff41a941d288e4b5ff9488
Commit date: Mon Jan 28 23:58:27 2008 -0500
Subject:     ext4: Add inode version support in ext4

util-linux:

Commti:      4fa5e73d16828c94234ba0aeafaec2470f79011c
Commit date: Thu Nov 27 12:08:44 2008 +0100
Subject:     mount: add i_version support

I don't know the history, this looks like adding the options was not
coordinated.

> The performance benefit with no_iversion is obvious for fsync
> related workloads since we would avoid some expensive log commits.

It is obviuos, but I'd like to avoid cluttering the mount options
interface further.

xfs also forces I_VERSION if it detects the superblock version 5, so it
could use the same fix that would work for btrfs.

I see two possibilities that pretend to be generic and clean:

1) the filesystem MS_I_* defaults would be exported and processed up the
   mount call stack

2) pass the full mount options to the filesystem (if requested eg. by
   file_system_type::fs_flags bits).

The other ideas contain 'make an exception to ... ' which does not sound
appealing.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Karel Zak June 19, 2015, 11:44 a.m. UTC | #6

On Thu, Jun 18, 2015 at 04:38:56PM +0200, David Sterba wrote:
> AFAICS, ext4 had added it's own i_version before iversion was added to
> mount:

It is not so unusual that some mount option is introduced as
fs-specific and later re-implemented as generic (another example is
MS_LAZYTIME). The mount(8) cares about the generic options only.

> ext4:
> 
> Commit:      25ec56b518257a56d2ff41a941d288e4b5ff9488
> Commit date: Mon Jan 28 23:58:27 2008 -0500
> Subject:     ext4: Add inode version support in ext4
> 
> util-linux:
> 
> Commti:      4fa5e73d16828c94234ba0aeafaec2470f79011c
> Commit date: Thu Nov 27 12:08:44 2008 +0100
> Subject:     mount: add i_version support
> 
> I don't know the history, this looks like adding the options was not
> coordinated.

I don't remember this change, but MS_I_VERSION is part of the kernel
API (include/uapi/linux/fs.h) so I guess it's fine to support it in
mount(8).

    Karel

Dave Chinner June 22, 2015, 8:42 p.m. UTC | #7

On Thu, Jun 18, 2015 at 04:38:56PM +0200, David Sterba wrote:
> Moving the discussion to fsdevel.
> 
> Summary: disabling MS_I_VERSION brings some speedups to btrfs, but the
> generic 'noiversion' option cannot be used to achieve that. It is
> processed before it reaches btrfs superblock callback, where
> MS_I_VERSION is forced.
> 
> The proposed fix is to add btrfs-specific i_version/noi_version to btrfs,
> to which I object.

The issue is that you can't overide IS_I_VERSION(inode) because it
looks at the superblock flag, yes?

So perhaps IS_I_VERSION should become an inode flag, set by the
filesystem at inode instantiation time, and hence filesystems can
choose on a per-inode basis if they want I_VERSION behaviour or not.
At that point, the behaviour of MS_I_VERSION becomes irrelevant to
the discussion, doesn't it?

> xfs also forces I_VERSION if it detects the superblock version 5, so it
> could use the same fix that would work for btrfs.

XFS is a special snowflake - it updates the I_VERSION only when an
inode is otherwise modified in a transaction, so turning it off
saves nothing. (And yes, timestamp updates are transactional in
XFS). Hence XFS behaviour is irrelevant to the discussion, because
we aren't ever going to turn it off....

Cheers,

Dave.

Theodore Ts'o June 23, 2015, 4:32 p.m. UTC | #8

On Thu, Jun 18, 2015 at 04:38:56PM +0200, David Sterba wrote:
> Moving the discussion to fsdevel.
> 
> Summary: disabling MS_I_VERSION brings some speedups to btrfs, but the
> generic 'noiversion' option cannot be used to achieve that. It is
> processed before it reaches btrfs superblock callback, where
> MS_I_VERSION is forced.
> 
> The proposed fix is to add btrfs-specific i_version/noi_version to btrfs,
> to which I object.

I was talking to Mingming about this on today's ext4 conference call,
and one of the reasons why ext4 turns off i_version update by default
is because it does a real number on our performance as well --- and
furthermore, the only real user of the field from what we can tell is
NFSv4, which not all that many ext4 users actually care about.

This has caused pain for the nfsv4 folks since it means that they need
to tell people to use a special mount option for ext4 if they are
actually using this for nfsv4, and I suspect they won't be all that
eager to hear that btrfs is going to go the same way.

This however got us thinking --- even in if NFSv4 is depending on
i_version, it doesn't actually _look_ at that field all that often.
It's only going to look at it in a response to a client's getattr
call, and that in turn is used to so the client can do its local disk
cache invalidation if anby of the data blocks of the inode has changed.

So what if we have a per-inode flag which "don't update I_VERSION",
which is off by default, but after the i_version has been updated at
least once, is set, so the i_version field won't be updated again ---
at least until something has actually looked at the i_version field,
when the "don't update I_VERSOIN" flag will get cleared again.

So basically, if we know there are no microphones in the forest, we
don't need to make the tree fall.  However, if someone has sampled the
i_version field, then the next time the inode gets updated, we will
update the i_version field so the NFSv4 client can hear the sound of
the tree crashing to the forst floor and so it can invalidate its
local cache of the file.  :-)

This should significantly improve the performance of using the
i_version field if the file system is being exported via NFSv4, and if
NFSv4 is not in use, no one will be looking at the i_version field, so
the performance impact will be very slight, and thus we could enable
i_version updates by default for btrfs and ext4.

And this should make the distribution folks happy, since it will unify
the behavior of all file systems, and make life easier for users who
won't need to set certain magic mount options depending on what file
system they are using and whether they are using NFSv4 or not.

Does this sound reasonable?

Cheers,

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Liu Bo June 24, 2015, 8:23 a.m. UTC | #9

On Tue, Jun 23, 2015 at 12:32:41PM -0400, Theodore Ts'o wrote:
> On Thu, Jun 18, 2015 at 04:38:56PM +0200, David Sterba wrote:
> > Moving the discussion to fsdevel.
> > 
> > Summary: disabling MS_I_VERSION brings some speedups to btrfs, but the
> > generic 'noiversion' option cannot be used to achieve that. It is
> > processed before it reaches btrfs superblock callback, where
> > MS_I_VERSION is forced.
> > 
> > The proposed fix is to add btrfs-specific i_version/noi_version to btrfs,
> > to which I object.
> 
> I was talking to Mingming about this on today's ext4 conference call,
> and one of the reasons why ext4 turns off i_version update by default
> is because it does a real number on our performance as well --- and
> furthermore, the only real user of the field from what we can tell is
> NFSv4, which not all that many ext4 users actually care about.
> 
> This has caused pain for the nfsv4 folks since it means that they need
> to tell people to use a special mount option for ext4 if they are
> actually using this for nfsv4, and I suspect they won't be all that
> eager to hear that btrfs is going to go the same way.
> 
> This however got us thinking --- even in if NFSv4 is depending on
> i_version, it doesn't actually _look_ at that field all that often.
> It's only going to look at it in a response to a client's getattr
> call, and that in turn is used to so the client can do its local disk
> cache invalidation if anby of the data blocks of the inode has changed.
> 
> So what if we have a per-inode flag which "don't update I_VERSION",
> which is off by default, but after the i_version has been updated at
> least once, is set, so the i_version field won't be updated again ---
> at least until something has actually looked at the i_version field,
> when the "don't update I_VERSOIN" flag will get cleared again.
> 
> So basically, if we know there are no microphones in the forest, we
> don't need to make the tree fall.  However, if someone has sampled the
> i_version field, then the next time the inode gets updated, we will
> update the i_version field so the NFSv4 client can hear the sound of
> the tree crashing to the forst floor and so it can invalidate its
> local cache of the file.  :-)
> 
> This should significantly improve the performance of using the
> i_version field if the file system is being exported via NFSv4, and if
> NFSv4 is not in use, no one will be looking at the i_version field, so
> the performance impact will be very slight, and thus we could enable
> i_version updates by default for btrfs and ext4.
> 
> And this should make the distribution folks happy, since it will unify
> the behavior of all file systems, and make life easier for users who
> won't need to set certain magic mount options depending on what file
> system they are using and whether they are using NFSv4 or not.
> 
> Does this sound reasonable?

I agree, this's a promising way to fix the whole thing.

Regarding to client's getattr, I found that inode->i_version is not read by
calling generic_fillattr(), so I'm a bit missing on how we get to
change the flag...

Thanks,

-liubo

> 
> Cheers,
> 
> 						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

David Sterba June 24, 2015, 5:28 p.m. UTC | #10

On Tue, Jun 23, 2015 at 06:42:15AM +1000, Dave Chinner wrote:
> On Thu, Jun 18, 2015 at 04:38:56PM +0200, David Sterba wrote:
> > Moving the discussion to fsdevel.
> > 
> > Summary: disabling MS_I_VERSION brings some speedups to btrfs, but the
> > generic 'noiversion' option cannot be used to achieve that. It is
> > processed before it reaches btrfs superblock callback, where
> > MS_I_VERSION is forced.
> > 
> > The proposed fix is to add btrfs-specific i_version/noi_version to btrfs,
> > to which I object.
> 
> The issue is that you can't overide IS_I_VERSION(inode) because it
> looks at the superblock flag, yes?

Effectively, yes.

> So perhaps IS_I_VERSION should become an inode flag, set by the
> filesystem at inode instantiation time, and hence filesystems can
> choose on a per-inode basis if they want I_VERSION behaviour or not.

Sounds good, I like that. Looking at the proposed usecase again, the
performance speedup needs the NODATACOW bit set as well, so setting
one more bit is not a big deal.

Besides, the global 'noi_version' does not have the expected effect
because inode::i_version is incremented unconditionally everywhere
(except 1 call site).  From that perspective I think that the
inode-specific bit is the right approach.

> At that point, the behaviour of MS_I_VERSION becomes irrelevant to
> the discussion, doesn't it?

Agreed.

> > xfs also forces I_VERSION if it detects the superblock version 5, so it
> > could use the same fix that would work for btrfs.
> 
> XFS is a special snowflake - it updates the I_VERSION only when an
> inode is otherwise modified in a transaction, so turning it off
> saves nothing. (And yes, timestamp updates are transactional in
> XFS). Hence XFS behaviour is irrelevant to the discussion, because
> we aren't ever going to turn it off....

Understood.

Thanks for the feedback.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

David Sterba June 24, 2015, 6:02 p.m. UTC | #11

On Tue, Jun 23, 2015 at 12:32:41PM -0400, Theodore Ts'o wrote:
> This has caused pain for the nfsv4 folks since it means that they need
> to tell people to use a special mount option for ext4 if they are
> actually using this for nfsv4, and I suspect they won't be all that
> eager to hear that btrfs is going to go the same way.

I did not mean to change the default behaviour with respect to nfs, that
would be a regression.

> This however got us thinking --- even in if NFSv4 is depending on
> i_version, it doesn't actually _look_ at that field all that often.
> It's only going to look at it in a response to a client's getattr
> call, and that in turn is used to so the client can do its local disk
> cache invalidation if anby of the data blocks of the inode has changed.
> 
> So what if we have a per-inode flag which "don't update I_VERSION",
> which is off by default, but after the i_version has been updated at
> least once, is set, so the i_version field won't be updated again ---
> at least until something has actually looked at the i_version field,
> when the "don't update I_VERSOIN" flag will get cleared again.

This sounds similar to what Dave proposed, a per-inode I_VERSION
attribute that can be changed through chattr. Though the negated meaning
of the flag could be confusing, I had to reread the paragraph again.

> This should significantly improve the performance of using the
> i_version field if the file system is being exported via NFSv4, and if
> NFSv4 is not in use, no one will be looking at the i_version field, so
> the performance impact will be very slight, and thus we could enable
> i_version updates by default for btrfs and ext4.

Btrfs default is to update i_version and the uscecase gets fixed by the
per-inode attribute. But from your description above I think that this
might not be enough for ext4. The reason I see are the different
defaults. You want to turn it on by default but not impose any
performance penalty for that, while for our usecase it's sufficient to
selectively turn it off.

I've tried to locate the source of performance drop for ext4 + iversion,
but was not successful so I don't know if the proposed fix is complete.

> And this should make the distribution folks happy, since it will unify
> the behavior of all file systems, and make life easier for users who
> won't need to set certain magic mount options depending on what file
> system they are using and whether they are using NFSv4 or not.
> 
> Does this sound reasonable?

It does, the unified behaviour wrt NFS is definitely a good thing.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Theodore Ts'o June 24, 2015, 11:17 p.m. UTC | #12

On Wed, Jun 24, 2015 at 08:02:15PM +0200, David Sterba wrote:
> 
> This sounds similar to what Dave proposed, a per-inode I_VERSION
> attribute that can be changed through chattr. Though the negated meaning
> of the flag could be confusing, I had to reread the paragraph again.

Dave did not specify an I_VERSION attribute that would be stored on
disk.  Instead he talked about a inode flag that would be set when the
struct inode is created by the file system.

This would allow file systems to permanently configure (on a per-inode
basis) whether or not a particular inode would require a forced
i_version update any time the inode's data or metadata is modified.  I
suppose you could initialized the inode flag from an on-disk
attribute, but that wasn't implied by Dave's proposal, at least as I
understood it.

> > This should significantly improve the performance of using the
> > i_version field if the file system is being exported via NFSv4, and if
> > NFSv4 is not in use, no one will be looking at the i_version field, so
> > the performance impact will be very slight, and thus we could enable
> > i_version updates by default for btrfs and ext4.
> 
> Btrfs default is to update i_version and the uscecase gets fixed by the
> per-inode attribute. But from your description above I think that this
> might not be enough for ext4. The reason I see are the different
> defaults. You want to turn it on by default but not impose any
> performance penalty for that, while for our usecase it's sufficient to
> selectively turn it off.

The problem with selectively turning it off is that the user might
decide for a particular file which is getting lots of updates to turn
off i_version updates --- but then at some subsequent time, that file
is part of a file system which is exported NFSv4, clients will
mysteriously break because i_version was manually turned off.

So this is going to be a potential support nightmare for enterprise
distro help desks --- the user will report that a particular file is
constantly getting corrupted, due to the NFSv4 cache invalidation
getting broken, and it might not be obvious why this is only happening
with this one file, and it's because with btrfs, the i_version update
for particular file was selectively turned off.  I don't think it's a
good design where it is easy for the user to set a flag which breaks
functionality, and in a potentially confusing way, especially when the
net result is potential data corruption.

This is why I would much rather have the default be on, but with
minimal (preferably not measurable) performance overhead.  It's the
best of both worlds.

						- Ted

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Dave Chinner June 24, 2015, 11:59 p.m. UTC | #13

On Wed, Jun 24, 2015 at 07:17:50PM -0400, Theodore Ts'o wrote:
> On Wed, Jun 24, 2015 at 08:02:15PM +0200, David Sterba wrote:
> > 
> > This sounds similar to what Dave proposed, a per-inode I_VERSION
> > attribute that can be changed through chattr. Though the negated meaning
> > of the flag could be confusing, I had to reread the paragraph again.
> 
> Dave did not specify an I_VERSION attribute that would be stored on
> disk.  Instead he talked about a inode flag that would be set when the
> struct inode is created by the file system.

Right.

> This would allow file systems to permanently configure (on a per-inode
> basis) whether or not a particular inode would require a forced
> i_version update any time the inode's data or metadata is modified.  I
> suppose you could initialized the inode flag from an on-disk
> attribute, but that wasn't implied by Dave's proposal, at least as I
> understood it.

It enables filesystems to do this. If btrfs want to add an on-disk
flag to turn off I_VERSION on a per-inode basis, or imply it from
some other on-disk flag, then they are welcome to do so and the
above infrastructure change will support it.

Cheers,

Dave.

J. Bruce Fields June 25, 2015, 6:46 p.m. UTC | #14

On Tue, Jun 23, 2015 at 12:32:41PM -0400, Theodore Ts'o wrote:
> On Thu, Jun 18, 2015 at 04:38:56PM +0200, David Sterba wrote:
> > Moving the discussion to fsdevel.
> > 
> > Summary: disabling MS_I_VERSION brings some speedups to btrfs, but the
> > generic 'noiversion' option cannot be used to achieve that. It is
> > processed before it reaches btrfs superblock callback, where
> > MS_I_VERSION is forced.
> > 
> > The proposed fix is to add btrfs-specific i_version/noi_version to btrfs,
> > to which I object.
> 
> I was talking to Mingming about this on today's ext4 conference call,
> and one of the reasons why ext4 turns off i_version update by default
> is because it does a real number on our performance as well --- and
> furthermore, the only real user of the field from what we can tell is
> NFSv4, which not all that many ext4 users actually care about.
> 
> This has caused pain for the nfsv4 folks since it means that they need
> to tell people to use a special mount option for ext4 if they are
> actually using this for nfsv4, and I suspect they won't be all that
> eager to hear that btrfs is going to go the same way.

Yes, thanks for looking into this!

> This however got us thinking --- even in if NFSv4 is depending on
> i_version, it doesn't actually _look_ at that field all that often.

Most clients will query it on every write.  (I just took a quick look at
the code and I believe the Linux client's requesting it immediately
after every write, except in the O_DIRECT and delegated cases.)

> It's only going to look at it in a response to a client's getattr
> call, and that in turn is used to so the client can do its local disk
> cache invalidation if anby of the data blocks of the inode has changed.
> 
> So what if we have a per-inode flag which "don't update I_VERSION",
> which is off by default, but after the i_version has been updated at
> least once, is set, so the i_version field won't be updated again ---
> at least until something has actually looked at the i_version field,
> when the "don't update I_VERSOIN" flag will get cleared again.
> 
> So basically, if we know there are no microphones in the forest, we
> don't need to make the tree fall.  However, if someone has sampled the
> i_version field, then the next time the inode gets updated, we will
> update the i_version field so the NFSv4 client can hear the sound of
> the tree crashing to the forst floor and so it can invalidate its
> local cache of the file.  :-)
> 
> This should significantly improve the performance of using the
> i_version field if the file system is being exported via NFSv4, and if
> NFSv4 is not in use, no one will be looking at the i_version field, so
> the performance impact will be very slight, and thus we could enable
> i_version updates by default for btrfs and ext4.
> 
> And this should make the distribution folks happy, since it will unify
> the behavior of all file systems, and make life easier for users who
> won't need to set certain magic mount options depending on what file
> system they are using and whether they are using NFSv4 or not.
> 
> Does this sound reasonable?

Just to make sure I understand, the logic is something like:

	to read the i_version:

		inode->i_version_seen = true;
		return inode->i_version

	to update the i_version:

		/*
		 * If nobody's seen this value of i_version then we can
		 * keep using it, otherwise we need a new one:
		 */
		if (inode->i_version_seen)
			inode->i_version++;
		inode->i_version_seen = false;

Looks OK to me.  As I say I'd expect i_version_seen == true to end up
being the common case in a lot of v4 workloads, so I'm more skeptical of
the claim of a performance improvement in the v4 case.

Could maintaining the new flag be a significant drag in itself?  If not,
then I guess we're not making things any worse there, so fine.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Theodore Ts'o June 25, 2015, 10:12 p.m. UTC | #15

On Thu, Jun 25, 2015 at 02:46:44PM -0400, J. Bruce Fields wrote:
> > Does this sound reasonable?
> 
> Just to make sure I understand, the logic is something like:
> 
> 	to read the i_version:
> 
> 		inode->i_version_seen = true;
> 		return inode->i_version
> 
> 	to update the i_version:
> 
> 		/*
> 		 * If nobody's seen this value of i_version then we can
> 		 * keep using it, otherwise we need a new one:
> 		 */
> 		if (inode->i_version_seen)
> 			inode->i_version++;
> 		inode->i_version_seen = false;

Yep, that's what I was proposing.

> Looks OK to me.  As I say I'd expect i_version_seen == true to end up
> being the common case in a lot of v4 workloads, so I'm more skeptical of
> the claim of a performance improvement in the v4 case.

Well, so long as we require i_version to be committed to disk on every
single disk write, we're going to be trading off:

	* client-side performance of the advanced NFSv4 cacheing for reads
	* server-side performance for writes
	* data robustness in case of the server crashing and the client-side cache
	  getting out of sync with the server after the crash

I don't see any way around that.  (So for example, with lazy mtime
updates we wouldn't be updating the inode after every single
non-allocating write; enabling i_version updates will trash that
optimization.)

I just want to reduce to a bare minimum the performance hit in the
case where NFSv4 exports are not being used (since that is true in a
very *large* number of ext4 deployments --- i.e., every single Android
handset using ext4 :-), such that we can leave i_version updates
turned on by default.

> Could maintaining the new flag be a significant drag in itself?  If not,
> then I guess we're not making things any worse there, so fine.

I don't think so; it's a bit in the in-memory inode, so I don't think
that should be an issue.

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

J. Bruce Fields June 26, 2015, 1:32 p.m. UTC | #16

On Thu, Jun 25, 2015 at 06:12:57PM -0400, Theodore Ts'o wrote:
> On Thu, Jun 25, 2015 at 02:46:44PM -0400, J. Bruce Fields wrote:
> > Looks OK to me.  As I say I'd expect i_version_seen == true to end up
> > being the common case in a lot of v4 workloads, so I'm more skeptical of
> > the claim of a performance improvement in the v4 case.
> 
> Well, so long as we require i_version to be committed to disk on every
> single disk write, we're going to be trading off:
> 
> 	* client-side performance of the advanced NFSv4 cacheing for reads
> 	* server-side performance for writes
> 	* data robustness in case of the server crashing and the client-side cache
> 	  getting out of sync with the server after the crash
> 
> I don't see any way around that.  (So for example, with lazy mtime
> updates we wouldn't be updating the inode after every single
> non-allocating write; enabling i_version updates will trash that
> optimization.)
> 
> I just want to reduce to a bare minimum the performance hit in the
> case where NFSv4 exports are not being used (since that is true in a
> very *large* number of ext4 deployments --- i.e., every single Android
> handset using ext4 :-), such that we can leave i_version updates
> turned on by default.

Definitely understood.  I think it's a good idea.

> > Could maintaining the new flag be a significant drag in itself?  If not,
> > then I guess we're not making things any worse there, so fine.
> 
> I don't think so; it's a bit in the in-memory inode, so I don't think
> that should be an issue.

OK!

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

diff mbox

Patch

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 05fef19..e610e3e 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -324,7 +324,7 @@  enum {
 	Opt_check_integrity_print_mask, Opt_fatal_errors, Opt_rescan_uuid_tree,
 	Opt_commit_interval, Opt_barrier, Opt_nodefrag, Opt_nodiscard,
 	Opt_noenospc_debug, Opt_noflushoncommit, Opt_acl, Opt_datacow,
-	Opt_datasum, Opt_treelog, Opt_noinode_cache,
+	Opt_datasum, Opt_treelog, Opt_noinode_cache, Opt_noi_version,
 	Opt_err,
 };
 
@@ -351,6 +351,7 @@  static match_table_t tokens = {
 	{Opt_nossd, "nossd"},
 	{Opt_acl, "acl"},
 	{Opt_noacl, "noacl"},
+	{Opt_noi_version, "noi_version"},
 	{Opt_notreelog, "notreelog"},
 	{Opt_treelog, "treelog"},
 	{Opt_flushoncommit, "flushoncommit"},
@@ -593,6 +594,10 @@  int btrfs_parse_options(struct btrfs_root *root, char *options)
 		case Opt_noacl:
 			root->fs_info->sb->s_flags &= ~MS_POSIXACL;
 			break;
+		case Opt_noi_version:
+			root->fs_info->sb->s_flags &= ~MS_I_VERSION;
+			btrfs_info(root->fs_info, "disable i_version");
+			break;
 		case Opt_notreelog:
 			btrfs_set_and_info(root, NOTREELOG,
 					   "disabling tree log");