[PATCH/RFC] NFS: add nostatflush mount option.

Message ID	87k1xgkct1.fsf@notabene.neil.brown.name (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> From: NeilBrown <neilb@suse.com> To: Trond Myklebust <trond.myklebust@primarydata.com>, Anna Schumaker <Anna.Schumaker@netapp.com> Date: Thu, 21 Dec 2017 13:57:46 +1100 Cc: lkml <linux-kernel@vger.kernel.org>, Linux NFS Mailing List <linux-nfs@vger.kernel.org>, linux-fsdevel@vger.kernel.org Subject: [PATCH/RFC] NFS: add nostatflush mount option. Message-ID: <87k1xgkct1.fsf@notabene.neil.brown.name> MIME-Version: 1.0 Content-Type: multipart/signed; boundary="=-=-="; micalg=pgp-sha256; protocol="application/pgp-signature" Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk

NeilBrown Dec. 21, 2017, 2:57 a.m. UTC

When an i_op->getattr() call is made on an NFS file
(typically from a 'stat' family system call), NFS
will first flush any dirty data to the server.

This ensures that the mtime reported is correct and stable,
but has a performance penalty.  'stat' is normally thought
to be a quick operation, and imposing this cost can be
surprising.

I have seen problems when one process is writing a large
file and another process performs "ls -l" on the containing
directory and is blocked for as long as it take to flush
all the dirty data to the server, which can be minutes.

I have also seen a legacy application which frequently calls
"fstat" on a file that it is writing to.  On a local
filesystem (and in the Solaris implementation of NFS) this
fstat call is cheap.  On Linux/NFS, the causes a noticeable
decrease in throughput.

The only circumstances where an application calling 'stat()'
might get an mtime which is not stable are times when some
other process is writing to the file and the two processes
are not using locking to ensure consistency, or when the one
process is both writing and stating.  In neither of these
cases is it reasonable to expect the mtime to be stable.

In the most common cases where mtime is important
(e.g. make), no other process has the file open, so there
will be no dirty data and the mtime will be stable.

Rather than unilaterally changing this behavior of 'stat',
this patch adds a "nosyncflush" mount option to allow
sysadmins to have applications which are hurt by the current
behavior to disable it.

Note that this option should probably *not* be used together
with "nocto".  In that case, mtime could be unstable even
when no process has the file open.

Signed-off-by: NeilBrown <neilb@suse.com>
---
 fs/nfs/inode.c                 |  3 ++-
 fs/nfs/super.c                 | 10 ++++++++++
 include/uapi/linux/nfs_mount.h |  6 ++++--
 3 files changed, 16 insertions(+), 3 deletions(-)

Chuck Lever Dec. 21, 2017, 3:39 p.m. UTC | #1

Hi Neil-


> On Dec 20, 2017, at 9:57 PM, NeilBrown <neilb@suse.com> wrote:
> 
> 
> When an i_op->getattr() call is made on an NFS file
> (typically from a 'stat' family system call), NFS
> will first flush any dirty data to the server.
> 
> This ensures that the mtime reported is correct and stable,
> but has a performance penalty.  'stat' is normally thought
> to be a quick operation, and imposing this cost can be
> surprising.

To be clear, this behavior is a POSIX requirement.


> I have seen problems when one process is writing a large
> file and another process performs "ls -l" on the containing
> directory and is blocked for as long as it take to flush
> all the dirty data to the server, which can be minutes.

Yes, a well-known annoyance that cannot be addressed
even with a write delegation.


> I have also seen a legacy application which frequently calls
> "fstat" on a file that it is writing to.  On a local
> filesystem (and in the Solaris implementation of NFS) this
> fstat call is cheap.  On Linux/NFS, the causes a noticeable
> decrease in throughput.

If the preceding write is small, Linux could be using
a FILE_SYNC write, but Solaris could be using UNSTABLE.


> The only circumstances where an application calling 'stat()'
> might get an mtime which is not stable are times when some
> other process is writing to the file and the two processes
> are not using locking to ensure consistency, or when the one
> process is both writing and stating.  In neither of these
> cases is it reasonable to expect the mtime to be stable.

I'm not convinced this is a strong enough rationale
for claiming it is safe to disable the existing
behavior.

You've explained cases where the new behavior is
reasonable, but do you have any examples where the
new behavior would be a problem? There must be a
reason why POSIX explicitly requires an up-to-date
mtime.

What guidance would nfs(5) give on when it is safe
to specify the new mount option?


> In the most common cases where mtime is important
> (e.g. make), no other process has the file open, so there
> will be no dirty data and the mtime will be stable.

Isn't it also the case that make is a multi-process
workload where one process modifies a file, then
closes it (which triggers a flush), and then another
process stats the file? The new mount option does
not change the behavior of close(2), does it?


> Rather than unilaterally changing this behavior of 'stat',
> this patch adds a "nosyncflush" mount option to allow
> sysadmins to have applications which are hurt by the current
> behavior to disable it.

IMO a mount option is at the wrong granularity. A
mount point will be shared between applications that
can tolerate the non-POSIX behavior and those that
cannot, for instance.

I would rather see us address the underlying cause
of the delay, which is that the GETATTR gets starved
by an ongoing stream of WRITE requests. You could:

- Make active writers wait for the GETATTR so the
GETATTR is not starved

- Start flushing dirty pages earlier so there is
less to flush when a stat(2) is done

- Ensure that GETATTR is not also waiting for a
COMMIT

Or maybe there's some other problem?

I recall nfs_getattr used to grab i_mutex to hold
up active writers. But i_mutex is gone now. Is
there some other mechanism that can block writers
while the client flushes the file and handles the
GETATTR request?


> Note that this option should probably *not* be used together
> with "nocto".  In that case, mtime could be unstable even
> when no process has the file open.
> 
> Signed-off-by: NeilBrown <neilb@suse.com>
> ---
> fs/nfs/inode.c                 |  3 ++-
> fs/nfs/super.c                 | 10 ++++++++++
> include/uapi/linux/nfs_mount.h |  6 ++++--
> 3 files changed, 16 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
> index b992d2382ffa..16629a34dd62 100644
> --- a/fs/nfs/inode.c
> +++ b/fs/nfs/inode.c
> @@ -740,7 +740,8 @@ int nfs_getattr(const struct path *path, struct kstat *stat,
> 
> 	trace_nfs_getattr_enter(inode);
> 	/* Flush out writes to the server in order to update c/mtime.  */
> -	if (S_ISREG(inode->i_mode)) {
> +	if (S_ISREG(inode->i_mode) &&
> +	    !(NFS_SERVER(inode)->flags & NFS_MOUNT_NOSTATFLUSH)) {
> 		err = filemap_write_and_wait(inode->i_mapping);
> 		if (err)
> 			goto out;
> diff --git a/fs/nfs/super.c b/fs/nfs/super.c
> index 29bacdc56f6a..2351c0be98f5 100644
> --- a/fs/nfs/super.c
> +++ b/fs/nfs/super.c
> @@ -90,6 +90,7 @@ enum {
> 	Opt_resvport, Opt_noresvport,
> 	Opt_fscache, Opt_nofscache,
> 	Opt_migration, Opt_nomigration,
> +	Opt_statflush, Opt_nostatflush,
> 
> 	/* Mount options that take integer arguments */
> 	Opt_port,
> @@ -151,6 +152,8 @@ static const match_table_t nfs_mount_option_tokens = {
> 	{ Opt_nofscache, "nofsc" },
> 	{ Opt_migration, "migration" },
> 	{ Opt_nomigration, "nomigration" },
> +	{ Opt_statflush, "statflush" },
> +	{ Opt_nostatflush, "nostatflush" },
> 
> 	{ Opt_port, "port=%s" },
> 	{ Opt_rsize, "rsize=%s" },
> @@ -637,6 +640,7 @@ static void nfs_show_mount_options(struct seq_file *m, struct nfs_server *nfss,
> 		{ NFS_MOUNT_NORDIRPLUS, ",nordirplus", "" },
> 		{ NFS_MOUNT_UNSHARED, ",nosharecache", "" },
> 		{ NFS_MOUNT_NORESVPORT, ",noresvport", "" },
> +		{ NFS_MOUNT_NOSTATFLUSH, ",nostatflush", "" },
> 		{ 0, NULL, NULL }
> 	};
> 	const struct proc_nfs_info *nfs_infop;
> @@ -1334,6 +1338,12 @@ static int nfs_parse_mount_options(char *raw,
> 		case Opt_nomigration:
> 			mnt->options &= ~NFS_OPTION_MIGRATION;
> 			break;
> +		case Opt_statflush:
> +			mnt->flags &= ~NFS_MOUNT_NOSTATFLUSH;
> +			break;
> +		case Opt_nostatflush:
> +			mnt->flags |= NFS_MOUNT_NOSTATFLUSH;
> +			break;
> 
> 		/*
> 		 * options that take numeric values
> diff --git a/include/uapi/linux/nfs_mount.h b/include/uapi/linux/nfs_mount.h
> index e44e00616ab5..d7c6f809d25d 100644
> --- a/include/uapi/linux/nfs_mount.h
> +++ b/include/uapi/linux/nfs_mount.h
> @@ -72,7 +72,9 @@ struct nfs_mount_data {
> #define NFS_MOUNT_NORESVPORT		0x40000
> #define NFS_MOUNT_LEGACY_INTERFACE	0x80000
> 
> -#define NFS_MOUNT_LOCAL_FLOCK	0x100000
> -#define NFS_MOUNT_LOCAL_FCNTL	0x200000
> +#define NFS_MOUNT_LOCAL_FLOCK		0x100000
> +#define NFS_MOUNT_LOCAL_FCNTL		0x200000
> +
> +#define NFS_MOUNT_NOSTATFLUSH		0x400000
> 
> #endif
> -- 
> 2.14.0.rc0.dirty
> 

--
Chuck Lever

Trond Myklebust Dec. 21, 2017, 3:54 p.m. UTC | #2

On Thu, 2017-12-21 at 10:39 -0500, Chuck Lever wrote:
> Hi Neil-

> 

> 

> > On Dec 20, 2017, at 9:57 PM, NeilBrown <neilb@suse.com> wrote:

> > 

> > 

> > When an i_op->getattr() call is made on an NFS file

> > (typically from a 'stat' family system call), NFS

> > will first flush any dirty data to the server.

> > 

> > This ensures that the mtime reported is correct and stable,

> > but has a performance penalty.  'stat' is normally thought

> > to be a quick operation, and imposing this cost can be

> > surprising.

> 

> To be clear, this behavior is a POSIX requirement.

> 

> 

> > I have seen problems when one process is writing a large

> > file and another process performs "ls -l" on the containing

> > directory and is blocked for as long as it take to flush

> > all the dirty data to the server, which can be minutes.

> 

> Yes, a well-known annoyance that cannot be addressed

> even with a write delegation.

> 

> 

> > I have also seen a legacy application which frequently calls

> > "fstat" on a file that it is writing to.  On a local

> > filesystem (and in the Solaris implementation of NFS) this

> > fstat call is cheap.  On Linux/NFS, the causes a noticeable

> > decrease in throughput.

> 

> If the preceding write is small, Linux could be using

> a FILE_SYNC write, but Solaris could be using UNSTABLE.

> 

> 

> > The only circumstances where an application calling 'stat()'

> > might get an mtime which is not stable are times when some

> > other process is writing to the file and the two processes

> > are not using locking to ensure consistency, or when the one

> > process is both writing and stating.  In neither of these

> > cases is it reasonable to expect the mtime to be stable.

> 

> I'm not convinced this is a strong enough rationale

> for claiming it is safe to disable the existing

> behavior.

> 

> You've explained cases where the new behavior is

> reasonable, but do you have any examples where the

> new behavior would be a problem? There must be a

> reason why POSIX explicitly requires an up-to-date

> mtime.

> 

> What guidance would nfs(5) give on when it is safe

> to specify the new mount option?

> 

> 

> > In the most common cases where mtime is important

> > (e.g. make), no other process has the file open, so there

> > will be no dirty data and the mtime will be stable.

> 

> Isn't it also the case that make is a multi-process

> workload where one process modifies a file, then

> closes it (which triggers a flush), and then another

> process stats the file? The new mount option does

> not change the behavior of close(2), does it?

> 

> 

> > Rather than unilaterally changing this behavior of 'stat',

> > this patch adds a "nosyncflush" mount option to allow

> > sysadmins to have applications which are hurt by the current

> > behavior to disable it.

> 

> IMO a mount option is at the wrong granularity. A

> mount point will be shared between applications that

> can tolerate the non-POSIX behavior and those that

> cannot, for instance.


Agreed. 

The other thing to note here is that we now have an embryonic statx()
system call, which allows the application itself to decide whether or
not it needs up to date values for the atime/ctime/mtime. While we
haven't yet plumbed in the NFS side, the intention was always to use
that information to turn off the writeback flushing when possible.

Cheers
  Trond

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@primarydata.com

NeilBrown Dec. 21, 2017, 8:51 p.m. UTC | #3

On Thu, Dec 21 2017, Chuck Lever wrote:

> Hi Neil-
>
>
>> On Dec 20, 2017, at 9:57 PM, NeilBrown <neilb@suse.com> wrote:
>> 
>> 
>> When an i_op->getattr() call is made on an NFS file
>> (typically from a 'stat' family system call), NFS
>> will first flush any dirty data to the server.
>> 
>> This ensures that the mtime reported is correct and stable,
>> but has a performance penalty.  'stat' is normally thought
>> to be a quick operation, and imposing this cost can be
>> surprising.
>
> To be clear, this behavior is a POSIX requirement.

Ah, that would be:

 http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_09

which says:

  All timestamps that are marked for update shall be updated when the
  file ceases to be open by any process or before a fstat(), fstatat(),
  fsync(), futimens(), lstat(), stat(), utime(), utimensat(), or
  utimes() is successfully performed on the file.

Suppose that when handling a stat(), if we find there are dirty pages,
we send a SETATTR to the server to set the mtime to the current time,
and use the result for the mtime?  That would avoid the costly flush, but still
give the appearance of the mtime being correctly updated.

>
>
>> I have seen problems when one process is writing a large
>> file and another process performs "ls -l" on the containing
>> directory and is blocked for as long as it take to flush
>> all the dirty data to the server, which can be minutes.
>
> Yes, a well-known annoyance that cannot be addressed
> even with a write delegation.
>
>
>> I have also seen a legacy application which frequently calls
>> "fstat" on a file that it is writing to.  On a local
>> filesystem (and in the Solaris implementation of NFS) this
>> fstat call is cheap.  On Linux/NFS, the causes a noticeable
>> decrease in throughput.
>
> If the preceding write is small, Linux could be using
> a FILE_SYNC write, but Solaris could be using UNSTABLE.

No, what is actually happening is that Linux is flushing out data and
Solaris isn't.  There are differences in stability and in write size,
but they aren't the main contributors to the performance difference.

>
>
>> The only circumstances where an application calling 'stat()'
>> might get an mtime which is not stable are times when some
>> other process is writing to the file and the two processes
>> are not using locking to ensure consistency, or when the one
>> process is both writing and stating.  In neither of these
>> cases is it reasonable to expect the mtime to be stable.
>
> I'm not convinced this is a strong enough rationale
> for claiming it is safe to disable the existing
> behavior.
>
> You've explained cases where the new behavior is
> reasonable, but do you have any examples where the
> new behavior would be a problem? There must be a
> reason why POSIX explicitly requires an up-to-date
> mtime.

I honestly cannot think of any credible scenario where the current
behaviour would be required.
If I write to a file from one NFS client, and request a stat() from
another NFS client, the stat doesn't flush any data (though it could
with NFSv4 if write delegations were being given).  So depending on a
stat() returning precise timestamps when there is no other co-ordination
between processes will already fail.  Why do we need to provide a
guarantee to processes running on the same client that we don't provide
to processes running on different clients?

>
> What guidance would nfs(5) give on when it is safe
> to specify the new mount option?

That is an excellent question to which I don't have an excellent answer.

 nostatflush:  A strict reading of the Posix specification requires that
  any cached writes be flushed to the server before responding to
  stat() and related operations, so that the timestamps are accurate
  and stable. NFS does not guarantee this between applications on
  different clients, but does when the writer and the caller of stat()
  are on the same host.  This flush can sometimes negatively impact
  performance.  Specifying the nostatflush option causes NFS to ignore
  this requirement of Posix.  A stat call when there is dirty data may
  report a different modify timestamp to the one that would be reported
  after that data was subsequently flushed to the server.  There are no
  known use cases where this inconsistency would cause a problem, but as
  avoiding it is a Posix requirement, the default behavior to force a
  flush on every stat() call.

???

>
>
>> In the most common cases where mtime is important
>> (e.g. make), no other process has the file open, so there
>> will be no dirty data and the mtime will be stable.
>
> Isn't it also the case that make is a multi-process
> workload where one process modifies a file, then
> closes it (which triggers a flush), and then another
> process stats the file? The new mount option does
> not change the behavior of close(2), does it?

No, the mount option doesn't change the behaviour of close(2).
A separate mount option (nocto) can do that.

I think your point is that close() can be delayed by flush in the same
way that stat() can.  I think that is true, but not very relevant.
stat() can be called more often that close() is likely to be, and
close() only imposes a delay on the process that did the writing, after
it has finished writing.  stat() can impose a delay on arbitrary other
processes and at other times.

>
>
>> Rather than unilaterally changing this behavior of 'stat',
>> this patch adds a "nosyncflush" mount option to allow
>> sysadmins to have applications which are hurt by the current
>> behavior to disable it.
>
> IMO a mount option is at the wrong granularity. A
> mount point will be shared between applications that
> can tolerate the non-POSIX behavior and those that
> cannot, for instance.

It is a better granularity than a module parameter, which was my first
approach ;-)

Yes, if we could create a control-group which avoided flush-on-stat, and
ran the problematic programs in that control group, that might be ideal
... for some values of "ideal".

You *could* mount the same filesystem in two places with nosharecache
and with only one having nosyncflush.  But you probably wouldn't.

>
> I would rather see us address the underlying cause
> of the delay, which is that the GETATTR gets starved
> by an ongoing stream of WRITE requests. You could:

I don't think starving is the issue.
Of the two specific cases that I have seen,
in one the issue was that the dirty_ratio multiplied by the amount of
memory, divided by the throughput to the server, resulted in many
minutes to flush out all the dirty data.  Having "ls -l" wait for that
flush was quite inconvenient. (We asked the customer to tune dirty_ratio
way down).

In the other (more recent) case the file being written was some sort of
data-base being restored from a dump (or something like that).  So some
blocks (presumably indexing blocks) were being written multiple times.
When we disabled flush-on-sync the total number of bytes written went
down by about 20% (vague memory .. certainly a non-trivial amount)
because these index block were only written to the server once instead
of multiple times.  So that flush-on-stat slowed down throughput in part
by pushing more data over the wire.

>
> - Make active writers wait for the GETATTR so the
> GETATTR is not starved
>
> - Start flushing dirty pages earlier so there is
> less to flush when a stat(2) is done
>
> - Ensure that GETATTR is not also waiting for a
> COMMIT
>
> Or maybe there's some other problem?
>
> I recall nfs_getattr used to grab i_mutex to hold
> up active writers. But i_mutex is gone now. Is
> there some other mechanism that can block writers
> while the client flushes the file and handles the
> GETATTR request?

I vaguely remember when i_mutex locking was added to nfs_getattr to
avoid the starvation :-)
Today nfs_getattr() calls filemap_write_and_wait() which first triggers
writes on all dirty pages, then waits for all writeback pages.
The first stage should be fairly quick, and writes during the second
stage shouldn't slow it down.  If some process were calling fsync often
(e.g. using O_SYNC) that might starve nfs_getattr(), but I don't think
normal usage would.

I do agree that a "real" fix would be better than a mount option.
The only path to a "real" fix that I can think of is to provide some
alternate source of producing a credible mtime without flushing all the
data.
Might a SETATTR using SET_TO_SERVER_TIME be a workable way forward?

Thank for your thoughts!

NeilBrown

>
>
>> Note that this option should probably *not* be used together
>> with "nocto".  In that case, mtime could be unstable even
>> when no process has the file open.
>> 
>> Signed-off-by: NeilBrown <neilb@suse.com>
>> ---
>> fs/nfs/inode.c                 |  3 ++-
>> fs/nfs/super.c                 | 10 ++++++++++
>> include/uapi/linux/nfs_mount.h |  6 ++++--
>> 3 files changed, 16 insertions(+), 3 deletions(-)
>> 
>> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
>> index b992d2382ffa..16629a34dd62 100644
>> --- a/fs/nfs/inode.c
>> +++ b/fs/nfs/inode.c
>> @@ -740,7 +740,8 @@ int nfs_getattr(const struct path *path, struct kstat *stat,
>> 
>> 	trace_nfs_getattr_enter(inode);
>> 	/* Flush out writes to the server in order to update c/mtime.  */
>> -	if (S_ISREG(inode->i_mode)) {
>> +	if (S_ISREG(inode->i_mode) &&
>> +	    !(NFS_SERVER(inode)->flags & NFS_MOUNT_NOSTATFLUSH)) {
>> 		err = filemap_write_and_wait(inode->i_mapping);
>> 		if (err)
>> 			goto out;
>> diff --git a/fs/nfs/super.c b/fs/nfs/super.c
>> index 29bacdc56f6a..2351c0be98f5 100644
>> --- a/fs/nfs/super.c
>> +++ b/fs/nfs/super.c
>> @@ -90,6 +90,7 @@ enum {
>> 	Opt_resvport, Opt_noresvport,
>> 	Opt_fscache, Opt_nofscache,
>> 	Opt_migration, Opt_nomigration,
>> +	Opt_statflush, Opt_nostatflush,
>> 
>> 	/* Mount options that take integer arguments */
>> 	Opt_port,
>> @@ -151,6 +152,8 @@ static const match_table_t nfs_mount_option_tokens = {
>> 	{ Opt_nofscache, "nofsc" },
>> 	{ Opt_migration, "migration" },
>> 	{ Opt_nomigration, "nomigration" },
>> +	{ Opt_statflush, "statflush" },
>> +	{ Opt_nostatflush, "nostatflush" },
>> 
>> 	{ Opt_port, "port=%s" },
>> 	{ Opt_rsize, "rsize=%s" },
>> @@ -637,6 +640,7 @@ static void nfs_show_mount_options(struct seq_file *m, struct nfs_server *nfss,
>> 		{ NFS_MOUNT_NORDIRPLUS, ",nordirplus", "" },
>> 		{ NFS_MOUNT_UNSHARED, ",nosharecache", "" },
>> 		{ NFS_MOUNT_NORESVPORT, ",noresvport", "" },
>> +		{ NFS_MOUNT_NOSTATFLUSH, ",nostatflush", "" },
>> 		{ 0, NULL, NULL }
>> 	};
>> 	const struct proc_nfs_info *nfs_infop;
>> @@ -1334,6 +1338,12 @@ static int nfs_parse_mount_options(char *raw,
>> 		case Opt_nomigration:
>> 			mnt->options &= ~NFS_OPTION_MIGRATION;
>> 			break;
>> +		case Opt_statflush:
>> +			mnt->flags &= ~NFS_MOUNT_NOSTATFLUSH;
>> +			break;
>> +		case Opt_nostatflush:
>> +			mnt->flags |= NFS_MOUNT_NOSTATFLUSH;
>> +			break;
>> 
>> 		/*
>> 		 * options that take numeric values
>> diff --git a/include/uapi/linux/nfs_mount.h b/include/uapi/linux/nfs_mount.h
>> index e44e00616ab5..d7c6f809d25d 100644
>> --- a/include/uapi/linux/nfs_mount.h
>> +++ b/include/uapi/linux/nfs_mount.h
>> @@ -72,7 +72,9 @@ struct nfs_mount_data {
>> #define NFS_MOUNT_NORESVPORT		0x40000
>> #define NFS_MOUNT_LEGACY_INTERFACE	0x80000
>> 
>> -#define NFS_MOUNT_LOCAL_FLOCK	0x100000
>> -#define NFS_MOUNT_LOCAL_FCNTL	0x200000
>> +#define NFS_MOUNT_LOCAL_FLOCK		0x100000
>> +#define NFS_MOUNT_LOCAL_FCNTL		0x200000
>> +
>> +#define NFS_MOUNT_NOSTATFLUSH		0x400000
>> 
>> #endif
>> -- 
>> 2.14.0.rc0.dirty
>> 
>
> --
> Chuck Lever

NeilBrown Dec. 21, 2017, 8:59 p.m. UTC | #4

On Thu, Dec 21 2017, Trond Myklebust wrote:

> On Thu, 2017-12-21 at 10:39 -0500, Chuck Lever wrote:
>> Hi Neil-
>> 
>> 
>> > On Dec 20, 2017, at 9:57 PM, NeilBrown <neilb@suse.com> wrote:
>> > 
>> > 
>> > When an i_op->getattr() call is made on an NFS file
>> > (typically from a 'stat' family system call), NFS
>> > will first flush any dirty data to the server.
>> > 
>> > This ensures that the mtime reported is correct and stable,
>> > but has a performance penalty.  'stat' is normally thought
>> > to be a quick operation, and imposing this cost can be
>> > surprising.
>> 
>> To be clear, this behavior is a POSIX requirement.
>> 
>> 
>> > I have seen problems when one process is writing a large
>> > file and another process performs "ls -l" on the containing
>> > directory and is blocked for as long as it take to flush
>> > all the dirty data to the server, which can be minutes.
>> 
>> Yes, a well-known annoyance that cannot be addressed
>> even with a write delegation.
>> 
>> 
>> > I have also seen a legacy application which frequently calls
>> > "fstat" on a file that it is writing to.  On a local
>> > filesystem (and in the Solaris implementation of NFS) this
>> > fstat call is cheap.  On Linux/NFS, the causes a noticeable
>> > decrease in throughput.
>> 
>> If the preceding write is small, Linux could be using
>> a FILE_SYNC write, but Solaris could be using UNSTABLE.
>> 
>> 
>> > The only circumstances where an application calling 'stat()'
>> > might get an mtime which is not stable are times when some
>> > other process is writing to the file and the two processes
>> > are not using locking to ensure consistency, or when the one
>> > process is both writing and stating.  In neither of these
>> > cases is it reasonable to expect the mtime to be stable.
>> 
>> I'm not convinced this is a strong enough rationale
>> for claiming it is safe to disable the existing
>> behavior.
>> 
>> You've explained cases where the new behavior is
>> reasonable, but do you have any examples where the
>> new behavior would be a problem? There must be a
>> reason why POSIX explicitly requires an up-to-date
>> mtime.
>> 
>> What guidance would nfs(5) give on when it is safe
>> to specify the new mount option?
>> 
>> 
>> > In the most common cases where mtime is important
>> > (e.g. make), no other process has the file open, so there
>> > will be no dirty data and the mtime will be stable.
>> 
>> Isn't it also the case that make is a multi-process
>> workload where one process modifies a file, then
>> closes it (which triggers a flush), and then another
>> process stats the file? The new mount option does
>> not change the behavior of close(2), does it?
>> 
>> 
>> > Rather than unilaterally changing this behavior of 'stat',
>> > this patch adds a "nosyncflush" mount option to allow
>> > sysadmins to have applications which are hurt by the current
>> > behavior to disable it.
>> 
>> IMO a mount option is at the wrong granularity. A
>> mount point will be shared between applications that
>> can tolerate the non-POSIX behavior and those that
>> cannot, for instance.
>
> Agreed. 
>
> The other thing to note here is that we now have an embryonic statx()
> system call, which allows the application itself to decide whether or
> not it needs up to date values for the atime/ctime/mtime. While we
> haven't yet plumbed in the NFS side, the intention was always to use
> that information to turn off the writeback flushing when possible.

Yes, if statx() were actually working, we could change the application
to avoid the flush.  But then if changing the application were an
option, I suspect that - for my current customer issue - we could just
remove the fstat() calls.  I doubt they are really necessary.
I think programmers often think of stat() (and particularly fstat()) as
fairly cheap and so they use it whenever convenient.  Only NFS violates
this expectation.

Also statx() is only a real solution if/when it gets widely used.  Will
"ls -l" default to AT_STATX_DONT_SYNC ??

Apart from the Posix requirement (which only requires that the
timestamps be updated, not that the data be flushed), do you know of any
value gained from flushing data before stat()?

Thanks,
NeilBrown

Trond Myklebust Dec. 21, 2017, 9:39 p.m. UTC | #5

On Fri, 2017-12-22 at 07:59 +1100, NeilBrown wrote:
> On Thu, Dec 21 2017, Trond Myklebust wrote:
> 
> > On Thu, 2017-12-21 at 10:39 -0500, Chuck Lever wrote:
> > > Hi Neil-
> > > 
> > > 
> > > > On Dec 20, 2017, at 9:57 PM, NeilBrown <neilb@suse.com> wrote:
> > > > 
> > > > 
> > > > When an i_op->getattr() call is made on an NFS file
> > > > (typically from a 'stat' family system call), NFS
> > > > will first flush any dirty data to the server.
> > > > 
> > > > This ensures that the mtime reported is correct and stable,
> > > > but has a performance penalty.  'stat' is normally thought
> > > > to be a quick operation, and imposing this cost can be
> > > > surprising.
> > > 
> > > To be clear, this behavior is a POSIX requirement.
> > > 
> > > 
> > > > I have seen problems when one process is writing a large
> > > > file and another process performs "ls -l" on the containing
> > > > directory and is blocked for as long as it take to flush
> > > > all the dirty data to the server, which can be minutes.
> > > 
> > > Yes, a well-known annoyance that cannot be addressed
> > > even with a write delegation.
> > > 
> > > 
> > > > I have also seen a legacy application which frequently calls
> > > > "fstat" on a file that it is writing to.  On a local
> > > > filesystem (and in the Solaris implementation of NFS) this
> > > > fstat call is cheap.  On Linux/NFS, the causes a noticeable
> > > > decrease in throughput.
> > > 
> > > If the preceding write is small, Linux could be using
> > > a FILE_SYNC write, but Solaris could be using UNSTABLE.
> > > 
> > > 
> > > > The only circumstances where an application calling 'stat()'
> > > > might get an mtime which is not stable are times when some
> > > > other process is writing to the file and the two processes
> > > > are not using locking to ensure consistency, or when the one
> > > > process is both writing and stating.  In neither of these
> > > > cases is it reasonable to expect the mtime to be stable.
> > > 
> > > I'm not convinced this is a strong enough rationale
> > > for claiming it is safe to disable the existing
> > > behavior.
> > > 
> > > You've explained cases where the new behavior is
> > > reasonable, but do you have any examples where the
> > > new behavior would be a problem? There must be a
> > > reason why POSIX explicitly requires an up-to-date
> > > mtime.
> > > 
> > > What guidance would nfs(5) give on when it is safe
> > > to specify the new mount option?
> > > 
> > > 
> > > > In the most common cases where mtime is important
> > > > (e.g. make), no other process has the file open, so there
> > > > will be no dirty data and the mtime will be stable.
> > > 
> > > Isn't it also the case that make is a multi-process
> > > workload where one process modifies a file, then
> > > closes it (which triggers a flush), and then another
> > > process stats the file? The new mount option does
> > > not change the behavior of close(2), does it?
> > > 
> > > 
> > > > Rather than unilaterally changing this behavior of 'stat',
> > > > this patch adds a "nosyncflush" mount option to allow
> > > > sysadmins to have applications which are hurt by the current
> > > > behavior to disable it.
> > > 
> > > IMO a mount option is at the wrong granularity. A
> > > mount point will be shared between applications that
> > > can tolerate the non-POSIX behavior and those that
> > > cannot, for instance.
> > 
> > Agreed. 
> > 
> > The other thing to note here is that we now have an embryonic
> > statx()
> > system call, which allows the application itself to decide whether
> > or
> > not it needs up to date values for the atime/ctime/mtime. While we
> > haven't yet plumbed in the NFS side, the intention was always to
> > use
> > that information to turn off the writeback flushing when possible.
> 
> Yes, if statx() were actually working, we could change the
> application
> to avoid the flush.  But then if changing the application were an
> option, I suspect that - for my current customer issue - we could
> just
> remove the fstat() calls.  I doubt they are really necessary.
> I think programmers often think of stat() (and particularly fstat())
> as
> fairly cheap and so they use it whenever convenient.  Only NFS
> violates
> this expectation.
> 
> Also statx() is only a real solution if/when it gets widely
> used.  Will
> "ls -l" default to AT_STATX_DONT_SYNC ??
> 
> Apart from the Posix requirement (which only requires that the
> timestamps be updated, not that the data be flushed), do you know of
> any
> value gained from flushing data before stat()?
> 

POSIX requires that timestamps change as part of the read() or write()
system call.

IOW: if I do the following

   1. fd = open("foo", O_WRONLY)
   2. fstat(fd, &buf1);
   3. write(fd, "bar", 3)
   4. fstat(fd, &buf2);
   5. fstat(fd, &buf3);

then the requirement translates into the following:
 * The timestamps returned in 'buf1' should differ from the timestamps
   returned in 'buf2'.
 * However 'buf2' and 'buf3' must contain the same values.

The flush performed by the NFS client in order to service the stat() in
(4) ensures that the timestamps get updated on the server (as part of
servicing the WRITE rpc call), which normally ensures that their values
won't change later on.

Note, however, that the client does cheat a little by only flushing the
WRITEs, and by not waiting for a COMMIT to be issued. That is generally
safe, but it means that the server could reboot before we have time to
persist the data+metadata. If that were to happen, then the replay of
the WRITEs might indeed cause the timestamps to change in (5) (in
contravention of POSIX rules).

Cheers
  Trond

NeilBrown Dec. 21, 2017, 10:35 p.m. UTC | #6

On Thu, Dec 21 2017, Trond Myklebust wrote:

> On Fri, 2017-12-22 at 07:59 +1100, NeilBrown wrote:
>> On Thu, Dec 21 2017, Trond Myklebust wrote:
>> 
>> > On Thu, 2017-12-21 at 10:39 -0500, Chuck Lever wrote:
>> > > Hi Neil-
>> > > 
>> > > 
>> > > > On Dec 20, 2017, at 9:57 PM, NeilBrown <neilb@suse.com> wrote:
>> > > > 
>> > > > 
>> > > > When an i_op->getattr() call is made on an NFS file
>> > > > (typically from a 'stat' family system call), NFS
>> > > > will first flush any dirty data to the server.
>> > > > 
>> > > > This ensures that the mtime reported is correct and stable,
>> > > > but has a performance penalty.  'stat' is normally thought
>> > > > to be a quick operation, and imposing this cost can be
>> > > > surprising.
>> > > 
>> > > To be clear, this behavior is a POSIX requirement.
>> > > 
>> > > 
>> > > > I have seen problems when one process is writing a large
>> > > > file and another process performs "ls -l" on the containing
>> > > > directory and is blocked for as long as it take to flush
>> > > > all the dirty data to the server, which can be minutes.
>> > > 
>> > > Yes, a well-known annoyance that cannot be addressed
>> > > even with a write delegation.
>> > > 
>> > > 
>> > > > I have also seen a legacy application which frequently calls
>> > > > "fstat" on a file that it is writing to.  On a local
>> > > > filesystem (and in the Solaris implementation of NFS) this
>> > > > fstat call is cheap.  On Linux/NFS, the causes a noticeable
>> > > > decrease in throughput.
>> > > 
>> > > If the preceding write is small, Linux could be using
>> > > a FILE_SYNC write, but Solaris could be using UNSTABLE.
>> > > 
>> > > 
>> > > > The only circumstances where an application calling 'stat()'
>> > > > might get an mtime which is not stable are times when some
>> > > > other process is writing to the file and the two processes
>> > > > are not using locking to ensure consistency, or when the one
>> > > > process is both writing and stating.  In neither of these
>> > > > cases is it reasonable to expect the mtime to be stable.
>> > > 
>> > > I'm not convinced this is a strong enough rationale
>> > > for claiming it is safe to disable the existing
>> > > behavior.
>> > > 
>> > > You've explained cases where the new behavior is
>> > > reasonable, but do you have any examples where the
>> > > new behavior would be a problem? There must be a
>> > > reason why POSIX explicitly requires an up-to-date
>> > > mtime.
>> > > 
>> > > What guidance would nfs(5) give on when it is safe
>> > > to specify the new mount option?
>> > > 
>> > > 
>> > > > In the most common cases where mtime is important
>> > > > (e.g. make), no other process has the file open, so there
>> > > > will be no dirty data and the mtime will be stable.
>> > > 
>> > > Isn't it also the case that make is a multi-process
>> > > workload where one process modifies a file, then
>> > > closes it (which triggers a flush), and then another
>> > > process stats the file? The new mount option does
>> > > not change the behavior of close(2), does it?
>> > > 
>> > > 
>> > > > Rather than unilaterally changing this behavior of 'stat',
>> > > > this patch adds a "nosyncflush" mount option to allow
>> > > > sysadmins to have applications which are hurt by the current
>> > > > behavior to disable it.
>> > > 
>> > > IMO a mount option is at the wrong granularity. A
>> > > mount point will be shared between applications that
>> > > can tolerate the non-POSIX behavior and those that
>> > > cannot, for instance.
>> > 
>> > Agreed. 
>> > 
>> > The other thing to note here is that we now have an embryonic
>> > statx()
>> > system call, which allows the application itself to decide whether
>> > or
>> > not it needs up to date values for the atime/ctime/mtime. While we
>> > haven't yet plumbed in the NFS side, the intention was always to
>> > use
>> > that information to turn off the writeback flushing when possible.
>> 
>> Yes, if statx() were actually working, we could change the
>> application
>> to avoid the flush.  But then if changing the application were an
>> option, I suspect that - for my current customer issue - we could
>> just
>> remove the fstat() calls.  I doubt they are really necessary.
>> I think programmers often think of stat() (and particularly fstat())
>> as
>> fairly cheap and so they use it whenever convenient.  Only NFS
>> violates
>> this expectation.
>> 
>> Also statx() is only a real solution if/when it gets widely
>> used.  Will
>> "ls -l" default to AT_STATX_DONT_SYNC ??
>> 
>> Apart from the Posix requirement (which only requires that the
>> timestamps be updated, not that the data be flushed), do you know of
>> any
>> value gained from flushing data before stat()?
>> 
>
> POSIX requires that timestamps change as part of the read() or write()
> system call.

Does it require that they don't change at other times?

I see the (arguable) deviation from POSIX that I propose to be
completely inline with the close-to-open consistency semantics that NFS
already uses, which are weaker than what POSIX might suggest.

As you didn't exactly answer the question, would it be fair to say that
you don't know of any reason to flush-before-stat except that POSIX
seems to require it?

Thanks,
NeilBrown


>
> IOW: if I do the following
>
>    1. fd = open("foo", O_WRONLY)
>    2. fstat(fd, &buf1);
>    3. write(fd, "bar", 3)
>    4. fstat(fd, &buf2);
>    5. fstat(fd, &buf3);
>
> then the requirement translates into the following:
>  * The timestamps returned in 'buf1' should differ from the timestamps
>    returned in 'buf2'.
>  * However 'buf2' and 'buf3' must contain the same values.
>
> The flush performed by the NFS client in order to service the stat() in
> (4) ensures that the timestamps get updated on the server (as part of
> servicing the WRITE rpc call), which normally ensures that their values
> won't change later on.
>
> Note, however, that the client does cheat a little by only flushing the
> WRITEs, and by not waiting for a COMMIT to be issued. That is generally
> safe, but it means that the server could reboot before we have time to
> persist the data+metadata. If that were to happen, then the replay of
> the WRITEs might indeed cause the timestamps to change in (5) (in
> contravention of POSIX rules).
>
> Cheers
>   Trond
>
> -- 
> Trond Myklebust
> Linux NFS client maintainer, PrimaryData
> trond.myklebust@primarydata.com

Trond Myklebust Dec. 22, 2017, 3:17 a.m. UTC | #7

On Fri, 2017-12-22 at 09:35 +1100, NeilBrown wrote:
> On Thu, Dec 21 2017, Trond Myklebust wrote:
> 
> > On Fri, 2017-12-22 at 07:59 +1100, NeilBrown wrote:
> > > On Thu, Dec 21 2017, Trond Myklebust wrote:
> > > 
> > > > On Thu, 2017-12-21 at 10:39 -0500, Chuck Lever wrote:
> > > > > Hi Neil-
> > > > > 
> > > > > 
> > > > > > On Dec 20, 2017, at 9:57 PM, NeilBrown <neilb@suse.com>
> > > > > > wrote:
> > > > > > 
> > > > > > 
> > > > > > When an i_op->getattr() call is made on an NFS file
> > > > > > (typically from a 'stat' family system call), NFS
> > > > > > will first flush any dirty data to the server.
> > > > > > 
> > > > > > This ensures that the mtime reported is correct and stable,
> > > > > > but has a performance penalty.  'stat' is normally thought
> > > > > > to be a quick operation, and imposing this cost can be
> > > > > > surprising.
> > > > > 
> > > > > To be clear, this behavior is a POSIX requirement.
> > > > > 
> > > > > 
> > > > > > I have seen problems when one process is writing a large
> > > > > > file and another process performs "ls -l" on the containing
> > > > > > directory and is blocked for as long as it take to flush
> > > > > > all the dirty data to the server, which can be minutes.
> > > > > 
> > > > > Yes, a well-known annoyance that cannot be addressed
> > > > > even with a write delegation.
> > > > > 
> > > > > 
> > > > > > I have also seen a legacy application which frequently
> > > > > > calls
> > > > > > "fstat" on a file that it is writing to.  On a local
> > > > > > filesystem (and in the Solaris implementation of NFS) this
> > > > > > fstat call is cheap.  On Linux/NFS, the causes a noticeable
> > > > > > decrease in throughput.
> > > > > 
> > > > > If the preceding write is small, Linux could be using
> > > > > a FILE_SYNC write, but Solaris could be using UNSTABLE.
> > > > > 
> > > > > 
> > > > > > The only circumstances where an application calling
> > > > > > 'stat()'
> > > > > > might get an mtime which is not stable are times when some
> > > > > > other process is writing to the file and the two processes
> > > > > > are not using locking to ensure consistency, or when the
> > > > > > one
> > > > > > process is both writing and stating.  In neither of these
> > > > > > cases is it reasonable to expect the mtime to be stable.
> > > > > 
> > > > > I'm not convinced this is a strong enough rationale
> > > > > for claiming it is safe to disable the existing
> > > > > behavior.
> > > > > 
> > > > > You've explained cases where the new behavior is
> > > > > reasonable, but do you have any examples where the
> > > > > new behavior would be a problem? There must be a
> > > > > reason why POSIX explicitly requires an up-to-date
> > > > > mtime.
> > > > > 
> > > > > What guidance would nfs(5) give on when it is safe
> > > > > to specify the new mount option?
> > > > > 
> > > > > 
> > > > > > In the most common cases where mtime is important
> > > > > > (e.g. make), no other process has the file open, so there
> > > > > > will be no dirty data and the mtime will be stable.
> > > > > 
> > > > > Isn't it also the case that make is a multi-process
> > > > > workload where one process modifies a file, then
> > > > > closes it (which triggers a flush), and then another
> > > > > process stats the file? The new mount option does
> > > > > not change the behavior of close(2), does it?
> > > > > 
> > > > > 
> > > > > > Rather than unilaterally changing this behavior of 'stat',
> > > > > > this patch adds a "nosyncflush" mount option to allow
> > > > > > sysadmins to have applications which are hurt by the
> > > > > > current
> > > > > > behavior to disable it.
> > > > > 
> > > > > IMO a mount option is at the wrong granularity. A
> > > > > mount point will be shared between applications that
> > > > > can tolerate the non-POSIX behavior and those that
> > > > > cannot, for instance.
> > > > 
> > > > Agreed. 
> > > > 
> > > > The other thing to note here is that we now have an embryonic
> > > > statx()
> > > > system call, which allows the application itself to decide
> > > > whether
> > > > or
> > > > not it needs up to date values for the atime/ctime/mtime. While
> > > > we
> > > > haven't yet plumbed in the NFS side, the intention was always
> > > > to
> > > > use
> > > > that information to turn off the writeback flushing when
> > > > possible.
> > > 
> > > Yes, if statx() were actually working, we could change the
> > > application
> > > to avoid the flush.  But then if changing the application were an
> > > option, I suspect that - for my current customer issue - we could
> > > just
> > > remove the fstat() calls.  I doubt they are really necessary.
> > > I think programmers often think of stat() (and particularly
> > > fstat())
> > > as
> > > fairly cheap and so they use it whenever convenient.  Only NFS
> > > violates
> > > this expectation.
> > > 
> > > Also statx() is only a real solution if/when it gets widely
> > > used.  Will
> > > "ls -l" default to AT_STATX_DONT_SYNC ??
> > > 
> > > Apart from the Posix requirement (which only requires that the
> > > timestamps be updated, not that the data be flushed), do you know
> > > of
> > > any
> > > value gained from flushing data before stat()?
> > > 
> > 
> > POSIX requires that timestamps change as part of the read() or
> > write()
> > system call.
> 
> Does it require that they don't change at other times?

Yes.

> I see the (arguable) deviation from POSIX that I propose to be
> completely inline with the close-to-open consistency semantics that
> NFS
> already uses, which are weaker than what POSIX might suggest.
> 
> As you didn't exactly answer the question, would it be fair to say
> that
> you don't know of any reason to flush-before-stat except that POSIX
> seems to require it?

The reason is to emulate POSIX semantics. That is the only reason, and
that is why when we added a statx() function, I asked that we allow the
application to specify that it doesn't need these POSIX semantics.

Chuck Lever Dec. 22, 2017, 4:38 p.m. UTC | #8

> On Dec 21, 2017, at 3:51 PM, NeilBrown <neilb@suse.com> wrote:
> 
> On Thu, Dec 21 2017, Chuck Lever wrote:
> 
>> Hi Neil-
>> 
>> 
>>> On Dec 20, 2017, at 9:57 PM, NeilBrown <neilb@suse.com> wrote:
>>> 
>>> 
>>> When an i_op->getattr() call is made on an NFS file
>>> (typically from a 'stat' family system call), NFS
>>> will first flush any dirty data to the server.
>>> 
>>> This ensures that the mtime reported is correct and stable,
>>> but has a performance penalty.  'stat' is normally thought
>>> to be a quick operation, and imposing this cost can be
>>> surprising.
>> 
>> To be clear, this behavior is a POSIX requirement.
> 
> Ah, that would be:
> 
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_09
> 
> which says:
> 
>  All timestamps that are marked for update shall be updated when the
>  file ceases to be open by any process or before a fstat(), fstatat(),
>  fsync(), futimens(), lstat(), stat(), utime(), utimensat(), or
>  utimes() is successfully performed on the file.
> 
> Suppose that when handling a stat(), if we find there are dirty pages,
> we send a SETATTR to the server to set the mtime to the current time,
> and use the result for the mtime?  That would avoid the costly flush, but still
> give the appearance of the mtime being correctly updated.
> 
>> 
>> 
>>> I have seen problems when one process is writing a large
>>> file and another process performs "ls -l" on the containing
>>> directory and is blocked for as long as it take to flush
>>> all the dirty data to the server, which can be minutes.
>> 
>> Yes, a well-known annoyance that cannot be addressed
>> even with a write delegation.
>> 
>> 
>>> I have also seen a legacy application which frequently calls
>>> "fstat" on a file that it is writing to.  On a local
>>> filesystem (and in the Solaris implementation of NFS) this
>>> fstat call is cheap.  On Linux/NFS, the causes a noticeable
>>> decrease in throughput.
>> 
>> If the preceding write is small, Linux could be using
>> a FILE_SYNC write, but Solaris could be using UNSTABLE.
> 
> No, what is actually happening is that Linux is flushing out data and
> Solaris isn't.  There are differences in stability and in write size,
> but they aren't the main contributors to the performance difference.
> 
>> 
>> 
>>> The only circumstances where an application calling 'stat()'
>>> might get an mtime which is not stable are times when some
>>> other process is writing to the file and the two processes
>>> are not using locking to ensure consistency, or when the one
>>> process is both writing and stating.  In neither of these
>>> cases is it reasonable to expect the mtime to be stable.
>> 
>> I'm not convinced this is a strong enough rationale
>> for claiming it is safe to disable the existing
>> behavior.
>> 
>> You've explained cases where the new behavior is
>> reasonable, but do you have any examples where the
>> new behavior would be a problem? There must be a
>> reason why POSIX explicitly requires an up-to-date
>> mtime.
> 
> I honestly cannot think of any credible scenario where the current
> behaviour would be required.

I can't either, but:
- We have years of this behavior in place
- It is required by an aged standard

I don't mind creative thinking, but perhaps we do
need a higher bar than "can you remember why we do
this? <shrug>" before mitigating this particular
behavior. :-)


> If I write to a file from one NFS client, and request a stat() from
> another NFS client, the stat doesn't flush any data (though it could
> with NFSv4 if write delegations were being given).  So depending on a
> stat() returning precise timestamps when there is no other co-ordination
> between processes will already fail.  Why do we need to provide a
> guarantee to processes running on the same client that we don't provide
> to processes running on different clients?

When multiple clients are in play, as you pointed out
earlier, we do require explicit locking for correct
behavior. This usage scenario is clearly one that can
be done only in the context of a distributed file
system (ie., not on non-clustered local file systems).
Thus explicit locking is a tolerable and expected
requirement.


>> What guidance would nfs(5) give on when it is safe
>> to specify the new mount option?
> 
> That is an excellent question to which I don't have an excellent answer.
> 
> nostatflush:  A strict reading of the Posix specification requires that
>  any cached writes be flushed to the server before responding to
>  stat() and related operations, so that the timestamps are accurate
>  and stable. NFS does not guarantee this between applications on
>  different clients, but does when the writer and the caller of stat()
>  are on the same host.  This flush can sometimes negatively impact
>  performance.  Specifying the nostatflush option causes NFS to ignore
>  this requirement of Posix.  A stat call when there is dirty data may
>  report a different modify timestamp to the one that would be reported
>  after that data was subsequently flushed to the server.  There are no
>  known use cases where this inconsistency would cause a problem, but as
>  avoiding it is a Posix requirement, the default behavior to force a
>  flush on every stat() call.
> 
> ???
> 
> 
> 
>> 
>> 
>>> In the most common cases where mtime is important
>>> (e.g. make), no other process has the file open, so there
>>> will be no dirty data and the mtime will be stable.
>> 
>> Isn't it also the case that make is a multi-process
>> workload where one process modifies a file, then
>> closes it (which triggers a flush), and then another
>> process stats the file? The new mount option does
>> not change the behavior of close(2), does it?
> 
> No, the mount option doesn't change the behaviour of close(2).
> A separate mount option (nocto) can do that.
> 
> I think your point is that close() can be delayed by flush in the same
> way that stat() can.  I think that is true, but not very relevant.

Agreed, it isn't relevant. So I'm wondering why we
should consider "make" at all in this case.


> stat() can be called more often that close() is likely to be, and
> close() only imposes a delay on the process that did the writing, after
> it has finished writing.

I don't believe that's true. The file's data cache is
shared across all applications on a client, and a
writeback on close is so that a subsequent GETATTR
reflects information about the file can be used to
verify the data cached the next time an application
on this client opens this file (close-to-open).

For NFSv4, of course, the client must flush dirty
data before it relinquishes the valid state ID at
CLOSE time.


> stat() can impose a delay on arbitrary other
> processes and at other times.

Eliminating unexpected latency is a noble goal. However,
write(2) can occasionally take a long time when the
application has unintentionally exceeded a dirty limit
for example.

More frequent writeback is part of the "sharing tax"
that network file systems have to pay. It can be
reduced somewhat in NFSv4 with delegation, though a
file's current mtime is still managed by the physical
file system on the NFS server.


>>> Rather than unilaterally changing this behavior of 'stat',
>>> this patch adds a "nosyncflush" mount option to allow
>>> sysadmins to have applications which are hurt by the current
>>> behavior to disable it.
>> 
>> IMO a mount option is at the wrong granularity. A
>> mount point will be shared between applications that
>> can tolerate the non-POSIX behavior and those that
>> cannot, for instance.
> 
> It is a better granularity than a module parameter, which was my first
> approach ;-)
> 
> Yes, if we could create a control-group which avoided flush-on-stat, and
> ran the problematic programs in that control group, that might be ideal
> ... for some values of "ideal".
> 
> You *could* mount the same filesystem in two places with nosharecache
> and with only one having nosyncflush.  But you probably wouldn't.

>> I would rather see us address the underlying cause
>> of the delay, which is that the GETATTR gets starved
>> by an ongoing stream of WRITE requests. You could:
> 
> I don't think starving is the issue.
> Of the two specific cases that I have seen,
> in one the issue was that the dirty_ratio multiplied by the amount of
> memory, divided by the throughput to the server, resulted in many
> minutes to flush out all the dirty data.  Having "ls -l" wait for that
> flush was quite inconvenient. (We asked the customer to tune dirty_ratio
> way down).

This is the classic remedy for this problem. The default
dirty_ratio setting is antique for contemporary hardware.


> In the other (more recent) case the file being written was some sort of
> data-base being restored from a dump (or something like that).  So some
> blocks (presumably indexing blocks) were being written multiple times.
> When we disabled flush-on-sync

flush-on-sync? That seems like a good thing! Do you mean
flush-on-getattr?


> the total number of bytes written went
> down by about 20% (vague memory .. certainly a non-trivial amount)
> because these index block were only written to the server once instead
> of multiple times.  So that flush-on-stat slowed down throughput in part
> by pushing more data over the wire.

That behavior is disappointing, but arguments that the
application should be modified to address this issue (use
statx or avoid stat(2)) are convincing to me.


>> - Make active writers wait for the GETATTR so the
>> GETATTR is not starved
>> 
>> - Start flushing dirty pages earlier so there is
>> less to flush when a stat(2) is done
>> 
>> - Ensure that GETATTR is not also waiting for a
>> COMMIT
>> 
>> Or maybe there's some other problem?
>> 
>> I recall nfs_getattr used to grab i_mutex to hold
>> up active writers. But i_mutex is gone now. Is
>> there some other mechanism that can block writers
>> while the client flushes the file and handles the
>> GETATTR request?
> 
> I vaguely remember when i_mutex locking was added to nfs_getattr to
> avoid the starvation :-)
> Today nfs_getattr() calls filemap_write_and_wait() which first triggers
> writes on all dirty pages, then waits for all writeback pages.
> The first stage should be fairly quick, and writes during the second
> stage shouldn't slow it down.  If some process were calling fsync often
> (e.g. using O_SYNC) that might starve nfs_getattr(), but I don't think
> normal usage would.
> 
> I do agree that a "real" fix would be better than a mount option.
> The only path to a "real" fix that I can think of is to provide some
> alternate source of producing a credible mtime without flushing all the
> data.

Perhaps, but flushing sooner will get us most of the way
there IMO, and seems more straightforward then playing
merry mtime games.

Jens Axboe had some writeback improvements that might
make it possible to start writeback sooner, depending on
the aggregate writeback throughput limit of the backing
storage. Not sure these were ever applied to the NFS
client.

The downside to starting writeback sooner is that an
application that uses a file as temporary rather than
persistent storage might be penalized (similar to the
second scenario you mention above).


> Might a SETATTR using SET_TO_SERVER_TIME be a workable way forward?

I thought about this, but here we risk the following odd
behavior:

1. The application calls write(2), and the client caches
the dirtied pages.

2. The application calls stat(2), the client explicitly
updates the mtime, then does a GETATTR returning mtime "A"
to the application.

3. The client later flushes the dirty data, and the
server updates the mtime again to mtime "B"

4. The application does another stat(2), the client
emits another GETATTR, and sees mtime "B" even though
there have been no additional write(2) calls on that
file.

That might be surprising to applications that, like the
client itself, rely on the file's mtime to detect outside
modification of the file. And it certainly violates POSIX.

Perhaps we could add a mechanism to the NFS protocol that
allows a client to send a WRITE that does not update the
file's mtime while holding a write delegation. In other
words, introduce a mode where the client rather than the
server manages a file's mtime.

Maybe pNFS block layout would allow this already?


> Thank for your thoughts!
> 
> NeilBrown
> 
>> 
>> 
>>> Note that this option should probably *not* be used together
>>> with "nocto".  In that case, mtime could be unstable even
>>> when no process has the file open.
>>> 
>>> Signed-off-by: NeilBrown <neilb@suse.com>
>>> ---
>>> fs/nfs/inode.c                 |  3 ++-
>>> fs/nfs/super.c                 | 10 ++++++++++
>>> include/uapi/linux/nfs_mount.h |  6 ++++--
>>> 3 files changed, 16 insertions(+), 3 deletions(-)
>>> 
>>> diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
>>> index b992d2382ffa..16629a34dd62 100644
>>> --- a/fs/nfs/inode.c
>>> +++ b/fs/nfs/inode.c
>>> @@ -740,7 +740,8 @@ int nfs_getattr(const struct path *path, struct kstat *stat,
>>> 
>>> 	trace_nfs_getattr_enter(inode);
>>> 	/* Flush out writes to the server in order to update c/mtime.  */
>>> -	if (S_ISREG(inode->i_mode)) {
>>> +	if (S_ISREG(inode->i_mode) &&
>>> +	    !(NFS_SERVER(inode)->flags & NFS_MOUNT_NOSTATFLUSH)) {
>>> 		err = filemap_write_and_wait(inode->i_mapping);
>>> 		if (err)
>>> 			goto out;
>>> diff --git a/fs/nfs/super.c b/fs/nfs/super.c
>>> index 29bacdc56f6a..2351c0be98f5 100644
>>> --- a/fs/nfs/super.c
>>> +++ b/fs/nfs/super.c
>>> @@ -90,6 +90,7 @@ enum {
>>> 	Opt_resvport, Opt_noresvport,
>>> 	Opt_fscache, Opt_nofscache,
>>> 	Opt_migration, Opt_nomigration,
>>> +	Opt_statflush, Opt_nostatflush,
>>> 
>>> 	/* Mount options that take integer arguments */
>>> 	Opt_port,
>>> @@ -151,6 +152,8 @@ static const match_table_t nfs_mount_option_tokens = {
>>> 	{ Opt_nofscache, "nofsc" },
>>> 	{ Opt_migration, "migration" },
>>> 	{ Opt_nomigration, "nomigration" },
>>> +	{ Opt_statflush, "statflush" },
>>> +	{ Opt_nostatflush, "nostatflush" },
>>> 
>>> 	{ Opt_port, "port=%s" },
>>> 	{ Opt_rsize, "rsize=%s" },
>>> @@ -637,6 +640,7 @@ static void nfs_show_mount_options(struct seq_file *m, struct nfs_server *nfss,
>>> 		{ NFS_MOUNT_NORDIRPLUS, ",nordirplus", "" },
>>> 		{ NFS_MOUNT_UNSHARED, ",nosharecache", "" },
>>> 		{ NFS_MOUNT_NORESVPORT, ",noresvport", "" },
>>> +		{ NFS_MOUNT_NOSTATFLUSH, ",nostatflush", "" },
>>> 		{ 0, NULL, NULL }
>>> 	};
>>> 	const struct proc_nfs_info *nfs_infop;
>>> @@ -1334,6 +1338,12 @@ static int nfs_parse_mount_options(char *raw,
>>> 		case Opt_nomigration:
>>> 			mnt->options &= ~NFS_OPTION_MIGRATION;
>>> 			break;
>>> +		case Opt_statflush:
>>> +			mnt->flags &= ~NFS_MOUNT_NOSTATFLUSH;
>>> +			break;
>>> +		case Opt_nostatflush:
>>> +			mnt->flags |= NFS_MOUNT_NOSTATFLUSH;
>>> +			break;
>>> 
>>> 		/*
>>> 		 * options that take numeric values
>>> diff --git a/include/uapi/linux/nfs_mount.h b/include/uapi/linux/nfs_mount.h
>>> index e44e00616ab5..d7c6f809d25d 100644
>>> --- a/include/uapi/linux/nfs_mount.h
>>> +++ b/include/uapi/linux/nfs_mount.h
>>> @@ -72,7 +72,9 @@ struct nfs_mount_data {
>>> #define NFS_MOUNT_NORESVPORT		0x40000
>>> #define NFS_MOUNT_LEGACY_INTERFACE	0x80000
>>> 
>>> -#define NFS_MOUNT_LOCAL_FLOCK	0x100000
>>> -#define NFS_MOUNT_LOCAL_FCNTL	0x200000
>>> +#define NFS_MOUNT_LOCAL_FLOCK		0x100000
>>> +#define NFS_MOUNT_LOCAL_FCNTL		0x200000
>>> +
>>> +#define NFS_MOUNT_NOSTATFLUSH		0x400000
>>> 
>>> #endif
>>> -- 
>>> 2.14.0.rc0.dirty
>>> 
>> 
>> --
>> Chuck Lever

--
Chuck Lever

Jeff Layton Dec. 23, 2017, 1:16 p.m. UTC | #9

On Fri, 2017-12-22 at 07:59 +1100, NeilBrown wrote:
> On Thu, Dec 21 2017, Trond Myklebust wrote:
> 
> > On Thu, 2017-12-21 at 10:39 -0500, Chuck Lever wrote:
> > > Hi Neil-
> > > 
> > > 
> > > > On Dec 20, 2017, at 9:57 PM, NeilBrown <neilb@suse.com> wrote:
> > > > 
> > > > 
> > > > When an i_op->getattr() call is made on an NFS file
> > > > (typically from a 'stat' family system call), NFS
> > > > will first flush any dirty data to the server.
> > > > 
> > > > This ensures that the mtime reported is correct and stable,
> > > > but has a performance penalty.  'stat' is normally thought
> > > > to be a quick operation, and imposing this cost can be
> > > > surprising.
> > > 
> > > To be clear, this behavior is a POSIX requirement.
> > > 
> > > 
> > > > I have seen problems when one process is writing a large
> > > > file and another process performs "ls -l" on the containing
> > > > directory and is blocked for as long as it take to flush
> > > > all the dirty data to the server, which can be minutes.
> > > 
> > > Yes, a well-known annoyance that cannot be addressed
> > > even with a write delegation.
> > > 
> > > 
> > > > I have also seen a legacy application which frequently calls
> > > > "fstat" on a file that it is writing to.  On a local
> > > > filesystem (and in the Solaris implementation of NFS) this
> > > > fstat call is cheap.  On Linux/NFS, the causes a noticeable
> > > > decrease in throughput.
> > > 
> > > If the preceding write is small, Linux could be using
> > > a FILE_SYNC write, but Solaris could be using UNSTABLE.
> > > 
> > > 
> > > > The only circumstances where an application calling 'stat()'
> > > > might get an mtime which is not stable are times when some
> > > > other process is writing to the file and the two processes
> > > > are not using locking to ensure consistency, or when the one
> > > > process is both writing and stating.  In neither of these
> > > > cases is it reasonable to expect the mtime to be stable.
> > > 
> > > I'm not convinced this is a strong enough rationale
> > > for claiming it is safe to disable the existing
> > > behavior.
> > > 
> > > You've explained cases where the new behavior is
> > > reasonable, but do you have any examples where the
> > > new behavior would be a problem? There must be a
> > > reason why POSIX explicitly requires an up-to-date
> > > mtime.
> > > 
> > > What guidance would nfs(5) give on when it is safe
> > > to specify the new mount option?
> > > 
> > > 
> > > > In the most common cases where mtime is important
> > > > (e.g. make), no other process has the file open, so there
> > > > will be no dirty data and the mtime will be stable.
> > > 
> > > Isn't it also the case that make is a multi-process
> > > workload where one process modifies a file, then
> > > closes it (which triggers a flush), and then another
> > > process stats the file? The new mount option does
> > > not change the behavior of close(2), does it?
> > > 
> > > 
> > > > Rather than unilaterally changing this behavior of 'stat',
> > > > this patch adds a "nosyncflush" mount option to allow
> > > > sysadmins to have applications which are hurt by the current
> > > > behavior to disable it.
> > > 
> > > IMO a mount option is at the wrong granularity. A
> > > mount point will be shared between applications that
> > > can tolerate the non-POSIX behavior and those that
> > > cannot, for instance.
> > 
> > Agreed. 
> > 
> > The other thing to note here is that we now have an embryonic statx()
> > system call, which allows the application itself to decide whether or
> > not it needs up to date values for the atime/ctime/mtime. While we
> > haven't yet plumbed in the NFS side, the intention was always to use
> > that information to turn off the writeback flushing when possible.
> 
> Yes, if statx() were actually working, we could change the application
> to avoid the flush.  But then if changing the application were an
> option, I suspect that - for my current customer issue - we could just
> remove the fstat() calls.  I doubt they are really necessary.
> I think programmers often think of stat() (and particularly fstat()) as
> fairly cheap and so they use it whenever convenient.  Only NFS violates
> this expectation.
> 
> Also statx() is only a real solution if/when it gets widely used.  Will
> "ls -l" default to AT_STATX_DONT_SYNC ??
> 

Maybe. Eventually, I could see glibc converting normal stat/fstat/etc.
to use a statx() syscall under the hood (similar to how stat syscalls on
32-bit arches will use stat64 in most cases).

With that, we could look at any number of ways to sneak a "don't flush"
flag into the call. Maybe an environment variable that causes the stat
syscall wrapper to add it? I think there are possibilities there that
don't necessarily require recompiling applications.

> Apart from the Posix requirement (which only requires that the
> timestamps be updated, not that the data be flushed), do you know of any
> value gained from flushing data before stat()?
>

NeilBrown Jan. 1, 2018, 11:29 p.m. UTC | #10

On Sat, Dec 23 2017, Jeff Layton wrote:

> On Fri, 2017-12-22 at 07:59 +1100, NeilBrown wrote:
>> On Thu, Dec 21 2017, Trond Myklebust wrote:
>> 
>> > On Thu, 2017-12-21 at 10:39 -0500, Chuck Lever wrote:
>> > > Hi Neil-
>> > > 
>> > > 
>> > > > On Dec 20, 2017, at 9:57 PM, NeilBrown <neilb@suse.com> wrote:
>> > > > 
>> > > > 
>> > > > When an i_op->getattr() call is made on an NFS file
>> > > > (typically from a 'stat' family system call), NFS
>> > > > will first flush any dirty data to the server.
>> > > > 
>> > > > This ensures that the mtime reported is correct and stable,
>> > > > but has a performance penalty.  'stat' is normally thought
>> > > > to be a quick operation, and imposing this cost can be
>> > > > surprising.
>> > > 
>> > > To be clear, this behavior is a POSIX requirement.
>> > > 
>> > > 
>> > > > I have seen problems when one process is writing a large
>> > > > file and another process performs "ls -l" on the containing
>> > > > directory and is blocked for as long as it take to flush
>> > > > all the dirty data to the server, which can be minutes.
>> > > 
>> > > Yes, a well-known annoyance that cannot be addressed
>> > > even with a write delegation.
>> > > 
>> > > 
>> > > > I have also seen a legacy application which frequently calls
>> > > > "fstat" on a file that it is writing to.  On a local
>> > > > filesystem (and in the Solaris implementation of NFS) this
>> > > > fstat call is cheap.  On Linux/NFS, the causes a noticeable
>> > > > decrease in throughput.
>> > > 
>> > > If the preceding write is small, Linux could be using
>> > > a FILE_SYNC write, but Solaris could be using UNSTABLE.
>> > > 
>> > > 
>> > > > The only circumstances where an application calling 'stat()'
>> > > > might get an mtime which is not stable are times when some
>> > > > other process is writing to the file and the two processes
>> > > > are not using locking to ensure consistency, or when the one
>> > > > process is both writing and stating.  In neither of these
>> > > > cases is it reasonable to expect the mtime to be stable.
>> > > 
>> > > I'm not convinced this is a strong enough rationale
>> > > for claiming it is safe to disable the existing
>> > > behavior.
>> > > 
>> > > You've explained cases where the new behavior is
>> > > reasonable, but do you have any examples where the
>> > > new behavior would be a problem? There must be a
>> > > reason why POSIX explicitly requires an up-to-date
>> > > mtime.
>> > > 
>> > > What guidance would nfs(5) give on when it is safe
>> > > to specify the new mount option?
>> > > 
>> > > 
>> > > > In the most common cases where mtime is important
>> > > > (e.g. make), no other process has the file open, so there
>> > > > will be no dirty data and the mtime will be stable.
>> > > 
>> > > Isn't it also the case that make is a multi-process
>> > > workload where one process modifies a file, then
>> > > closes it (which triggers a flush), and then another
>> > > process stats the file? The new mount option does
>> > > not change the behavior of close(2), does it?
>> > > 
>> > > 
>> > > > Rather than unilaterally changing this behavior of 'stat',
>> > > > this patch adds a "nosyncflush" mount option to allow
>> > > > sysadmins to have applications which are hurt by the current
>> > > > behavior to disable it.
>> > > 
>> > > IMO a mount option is at the wrong granularity. A
>> > > mount point will be shared between applications that
>> > > can tolerate the non-POSIX behavior and those that
>> > > cannot, for instance.
>> > 
>> > Agreed. 
>> > 
>> > The other thing to note here is that we now have an embryonic statx()
>> > system call, which allows the application itself to decide whether or
>> > not it needs up to date values for the atime/ctime/mtime. While we
>> > haven't yet plumbed in the NFS side, the intention was always to use
>> > that information to turn off the writeback flushing when possible.
>> 
>> Yes, if statx() were actually working, we could change the application
>> to avoid the flush.  But then if changing the application were an
>> option, I suspect that - for my current customer issue - we could just
>> remove the fstat() calls.  I doubt they are really necessary.
>> I think programmers often think of stat() (and particularly fstat()) as
>> fairly cheap and so they use it whenever convenient.  Only NFS violates
>> this expectation.
>> 
>> Also statx() is only a real solution if/when it gets widely used.  Will
>> "ls -l" default to AT_STATX_DONT_SYNC ??
>> 
>
> Maybe. Eventually, I could see glibc converting normal stat/fstat/etc.
> to use a statx() syscall under the hood (similar to how stat syscalls on
> 32-bit arches will use stat64 in most cases).
>
> With that, we could look at any number of ways to sneak a "don't flush"
> flag into the call. Maybe an environment variable that causes the stat
> syscall wrapper to add it? I think there are possibilities there that
> don't necessarily require recompiling applications.

Thanks - interesting ideas.

One possibility would be an LD_PRELOAD which implements fstat() using
statx().
That doesn't address the "ls -l is needlessly slow" problem, but it
would address the "legacy application calls fstat too often" problem.

This isn't an option for the "enterprise kernel" the customer is using
(statx?  what is statx?), but having a clear view of a credible
upstream solution is very helpful.

So thanks - and thanks a lot to Trond and Chuck for your input.  It
helped clarify my thoughts a lot.

Is anyone working on proper statx support for NFS, or is it a case of
"that shouldn't be hard and we should do that, but it isn't a high
priority for anyone" ??

Thanks,
NeilBrown


>
>> Apart from the Posix requirement (which only requires that the
>> timestamps be updated, not that the data be flushed), do you know of any
>> value gained from flushing data before stat()?
>> 
> -- 
> Jeff Layton <jlayton@kernel.org>

[PATCH/RFC] NFS: add nostatflush mount option.

Commit Message

Comments

Patch