
nfs-backed mmap file results in 1000s of WRITEs per second

Message ID 1378411320.5450.27.camel@leira.trondhjem.org (mailing list archive)
State New, archived

Commit Message

Trond Myklebust Sept. 5, 2013, 8:02 p.m. UTC
On Thu, 2013-09-05 at 14:11 -0500, Quentin Barnes wrote:
> On Thu, Sep 05, 2013 at 12:03:03PM -0500, Malahal Naineni wrote:
> > Neil Brown posted a patch couple days ago for this!
> > 
> > http://thread.gmane.org/gmane.linux.nfs/58473
> 
> I tried Neil's patch on a v3.11 kernel.  The rebuilt kernel still
> exhibited the same 1000s of WRITEs/sec problem.
> 
> Any other ideas?

Yes. Please try the attached patch.

> > Regards, Malahal.
> > 
> > Quentin Barnes [qbarnes@gmail.com] wrote:
> > > If two (or more) processes are doing nothing more than writing to
> > > the memory addresses of an mmapped shared file on an NFS mounted
> > > file system, it results in the kernel scribbling WRITEs to the
> > > server as fast as it can (1000s per second) even while no syscalls
> > > are going on.
> > > 
> > > The problems happens on NFS clients mounting NFSv3 or NFSv4.  I've
> > > reproduced this on the 3.11 kernel, and it happens as far back as
> > > RHEL6 (2.6.32 based), however, it is not a problem on RHEL5 (2.6.18
> > > based).  (All x86_64 systems.)  I didn't try anything in between.
> > > 
> > > I've created a self-contained program below that will demonstrate
> > > the problem (call it "t1").  Assuming /mnt has an NFS file system:
> > > 
> > >   $ t1 /mnt/mynfsfile 1    # Fork 1 writer, kernel behaves normally
> > >   $ t1 /mnt/mynfsfile 2    # Fork 2 writers, kernel goes crazy WRITEing
> > > 
> > > Just run "watch -d nfsstat" in another window while running the two
> > > writer test and watch the WRITE count explode.
> > > 
> > > I don't see anything particularly wrong with what the example code
> > > is doing with its use of mmap.  Is there anything undefined about
> > > the code that would explain this behavior, or is this a NFS bug
> > > that's really lived this long?
> > > 
> > > Quentin
> > > 
> > > 
> > > 
> > > #include <sys/stat.h>
> > > #include <sys/mman.h>
> > > #include <sys/stat.h>
> > > #include <sys/wait.h>
> > > #include <errno.h>
> > > #include <fcntl.h>
> > > #include <stdio.h>
> > > #include <stdlib.h>
> > > #include <signal.h>
> > > #include <string.h>
> > > #include <unistd.h>
> > > 
> > > int
> > > kill_children()
> > > {
> > > 	int		cnt = 0;
> > > 	siginfo_t	infop;
> > > 
> > > 	signal(SIGINT, SIG_IGN);
> > > 	kill(0, SIGINT);
> > > 	while (waitid(P_ALL, 0, &infop, WEXITED) != -1) ++cnt;
> > > 
> > > 	return cnt;
> > > }
> > > 
> > > void
> > > sighandler(int sig)
> > > {
> > > 	printf("Cleaning up all children.\n");
> > > 	int cnt = kill_children();
> > > 	printf("Cleaned up %d child%s.\n", cnt, cnt == 1 ? "" : "ren");
> > > 
> > > 	exit(0);
> > > }
> > > 
> > > int
> > > do_child(volatile int *iaddr)
> > > {
> > > 	while (1) *iaddr = 1;
> > > }
> > > 
> > > int
> > > main(int argc, char **argv)
> > > {
> > > 	const char	*path;
> > > 	int		fd;
> > > 	ssize_t		wlen;
> > > 	int		*ip;
> > > 	int		fork_count = 1;
> > > 
> > > 	if (argc == 1) {
> > > 		fprintf(stderr, "Usage: %s <path> [fork_count].\n",
> > > 			argv[0]);
> > > 		return 1;
> > > 	}
> > > 
> > > 	path = argv[1];
> > > 
> > > 	if (argc > 2) {
> > > 		int fc = atoi(argv[2]);
> > > 		if (fc >= 0)
> > > 			fork_count = fc;
> > > 	}
> > > 
> > > 	fd = open(path, O_CREAT|O_TRUNC|O_RDWR|O_APPEND, S_IRUSR|S_IWUSR);
> > > 	if (fd < 0) {
> > > 		fprintf(stderr, "Open of '%s' failed: %s (%d)\n",
> > > 			path, strerror(errno), errno);
> > > 		return 1;
> > > 	}
> > > 
> > > 	wlen = write(fd, &(int){0}, sizeof(int));
> > > 	if (wlen != sizeof(int)) {
> > > 		if (wlen < 0)
> > > 			fprintf(stderr, "Write of '%s' failed: %s (%d)\n",
> > > 				path, strerror(errno), errno);
> > > 		else
> > > 			fprintf(stderr, "Short write to '%s'\n", path);
> > > 		return 1;
> > > 	}
> > > 
> > > 	ip = (int *)mmap(NULL, sizeof(int), PROT_READ|PROT_WRITE,
> > > 			   MAP_SHARED, fd, 0);
> > > 	if (ip == MAP_FAILED) {
> > > 		fprintf(stderr, "Mmap of '%s' failed: %s (%d)\n",
> > > 			path, strerror(errno), errno);
> > > 		return 1;
> > > 	}
> > > 
> > > 	signal(SIGINT, sighandler);
> > > 
> > > 	while (fork_count-- > 0) {
> > > 		switch(fork()) {
> > > 		case -1:
> > > 			fprintf(stderr, "Fork failed: %s (%d)\n",
> > > 				strerror(errno), errno);
> > > 			kill_children();
> > > 			return 1;
> > > 		case 0:   /* child  */
> > > 			signal(SIGINT, SIG_DFL);
> > > 			do_child(ip);
> > > 			break;
> > > 		default:  /* parent */
> > > 			break;
> > > 		}
> > > 	}
> > > 
> > > 	printf("Press ^C to terminate test.\n");
> > > 	pause();
> > > 
> > > 	return 0;
> > > }
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > 
> 
> Quentin


-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

Comments

Quentin Barnes Sept. 5, 2013, 9:36 p.m. UTC | #1
On Thu, Sep 05, 2013 at 08:02:01PM +0000, Myklebust, Trond wrote:
> On Thu, 2013-09-05 at 14:11 -0500, Quentin Barnes wrote:
> > On Thu, Sep 05, 2013 at 12:03:03PM -0500, Malahal Naineni wrote:
> > > Neil Brown posted a patch couple days ago for this!
> > > 
> > > http://thread.gmane.org/gmane.linux.nfs/58473
> > 
> > I tried Neil's patch on a v3.11 kernel.  The rebuilt kernel still
> > exhibited the same 1000s of WRITEs/sec problem.
> > 
> > Any other ideas?
> 
> Yes. Please try the attached patch.

Great!  That did the trick!

Do you feel this patch is worthy of being pushed upstream in its
current state, or was it just meant to verify a theory?


In comparing the nfs_flush_incompatible() implementations between
RHEL5 and v3.11 (without your patch), the guts of the algorithm seem
more or less logically equivalent to me in deciding whether or not
to flush the page.  Also, when and where nfs_flush_incompatible() is
invoked seems the same.  Would you provide a very brief pointer to
clue me in as to why this problem didn't also manifest back in the
2.6.18 days?

Quentin



> > > Regards, Malahal.
> > > 
> > > Quentin Barnes [qbarnes@gmail.com] wrote:
> > > > If two (or more) processes are doing nothing more than writing to
> > > > the memory addresses of an mmapped shared file on an NFS mounted
> > > > file system, it results in the kernel scribbling WRITEs to the
> > > > server as fast as it can (1000s per second) even while no syscalls
> > > > are going on.
> > > > 
> > > > The problems happens on NFS clients mounting NFSv3 or NFSv4.  I've
> > > > reproduced this on the 3.11 kernel, and it happens as far back as
> > > > RHEL6 (2.6.32 based), however, it is not a problem on RHEL5 (2.6.18
> > > > based).  (All x86_64 systems.)  I didn't try anything in between.
> > > > 
> > > > I've created a self-contained program below that will demonstrate
> > > > the problem (call it "t1").  Assuming /mnt has an NFS file system:
> > > > 
> > > >   $ t1 /mnt/mynfsfile 1    # Fork 1 writer, kernel behaves normally
> > > >   $ t1 /mnt/mynfsfile 2    # Fork 2 writers, kernel goes crazy WRITEing
> > > > 
> > > > Just run "watch -d nfsstat" in another window while running the two
> > > > writer test and watch the WRITE count explode.
> > > > 
> > > > I don't see anything particularly wrong with what the example code
> > > > is doing with its use of mmap.  Is there anything undefined about
> > > > the code that would explain this behavior, or is this a NFS bug
> > > > that's really lived this long?
> > > > 
> > > > Quentin
> > > > 
> > > > 
> > > > 
> > > > #include <sys/stat.h>
> > > > #include <sys/mman.h>
> > > > #include <sys/stat.h>
> > > > #include <sys/wait.h>
> > > > #include <errno.h>
> > > > #include <fcntl.h>
> > > > #include <stdio.h>
> > > > #include <stdlib.h>
> > > > #include <signal.h>
> > > > #include <string.h>
> > > > #include <unistd.h>
> > > > 
> > > > int
> > > > kill_children()
> > > > {
> > > > 	int		cnt = 0;
> > > > 	siginfo_t	infop;
> > > > 
> > > > 	signal(SIGINT, SIG_IGN);
> > > > 	kill(0, SIGINT);
> > > > 	while (waitid(P_ALL, 0, &infop, WEXITED) != -1) ++cnt;
> > > > 
> > > > 	return cnt;
> > > > }
> > > > 
> > > > void
> > > > sighandler(int sig)
> > > > {
> > > > 	printf("Cleaning up all children.\n");
> > > > 	int cnt = kill_children();
> > > > 	printf("Cleaned up %d child%s.\n", cnt, cnt == 1 ? "" : "ren");
> > > > 
> > > > 	exit(0);
> > > > }
> > > > 
> > > > int
> > > > do_child(volatile int *iaddr)
> > > > {
> > > > 	while (1) *iaddr = 1;
> > > > }
> > > > 
> > > > int
> > > > main(int argc, char **argv)
> > > > {
> > > > 	const char	*path;
> > > > 	int		fd;
> > > > 	ssize_t		wlen;
> > > > 	int		*ip;
> > > > 	int		fork_count = 1;
> > > > 
> > > > 	if (argc == 1) {
> > > > 		fprintf(stderr, "Usage: %s <path> [fork_count].\n",
> > > > 			argv[0]);
> > > > 		return 1;
> > > > 	}
> > > > 
> > > > 	path = argv[1];
> > > > 
> > > > 	if (argc > 2) {
> > > > 		int fc = atoi(argv[2]);
> > > > 		if (fc >= 0)
> > > > 			fork_count = fc;
> > > > 	}
> > > > 
> > > > 	fd = open(path, O_CREAT|O_TRUNC|O_RDWR|O_APPEND, S_IRUSR|S_IWUSR);
> > > > 	if (fd < 0) {
> > > > 		fprintf(stderr, "Open of '%s' failed: %s (%d)\n",
> > > > 			path, strerror(errno), errno);
> > > > 		return 1;
> > > > 	}
> > > > 
> > > > 	wlen = write(fd, &(int){0}, sizeof(int));
> > > > 	if (wlen != sizeof(int)) {
> > > > 		if (wlen < 0)
> > > > 			fprintf(stderr, "Write of '%s' failed: %s (%d)\n",
> > > > 				path, strerror(errno), errno);
> > > > 		else
> > > > 			fprintf(stderr, "Short write to '%s'\n", path);
> > > > 		return 1;
> > > > 	}
> > > > 
> > > > 	ip = (int *)mmap(NULL, sizeof(int), PROT_READ|PROT_WRITE,
> > > > 			   MAP_SHARED, fd, 0);
> > > > 	if (ip == MAP_FAILED) {
> > > > 		fprintf(stderr, "Mmap of '%s' failed: %s (%d)\n",
> > > > 			path, strerror(errno), errno);
> > > > 		return 1;
> > > > 	}
> > > > 
> > > > 	signal(SIGINT, sighandler);
> > > > 
> > > > 	while (fork_count-- > 0) {
> > > > 		switch(fork()) {
> > > > 		case -1:
> > > > 			fprintf(stderr, "Fork failed: %s (%d)\n",
> > > > 				strerror(errno), errno);
> > > > 			kill_children();
> > > > 			return 1;
> > > > 		case 0:   /* child  */
> > > > 			signal(SIGINT, SIG_DFL);
> > > > 			do_child(ip);
> > > > 			break;
> > > > 		default:  /* parent */
> > > > 			break;
> > > > 		}
> > > > 	}
> > > > 
> > > > 	printf("Press ^C to terminate test.\n");
> > > > 	pause();
> > > > 
> > > > 	return 0;
> > > > }
> > > 
> > 
> > Quentin
> 
> -- 
> Trond Myklebust
> Linux NFS client maintainer
> 
> NetApp
> Trond.Myklebust@netapp.com
> www.netapp.com

> From 903ebaeefae78e6e03f3719aafa8fd5dd22d3288 Mon Sep 17 00:00:00 2001
> From: Trond Myklebust <Trond.Myklebust@netapp.com>
> Date: Thu, 5 Sep 2013 15:52:51 -0400
> Subject: [PATCH] NFS: Don't check lock owner compatibility in writes unless
>  file is locked
> 
> If we're doing buffered writes, and there is no file locking involved,
> then we don't have to worry about whether or not the lock owner information
> is identical.
> By relaxing this check, we ensure that fork()ed child processes can write
> to a page without having to first sync dirty data that was written
> by the parent to disk.
> 
> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
> ---
>  fs/nfs/write.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index 40979e8..ac1dc33 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -863,7 +863,7 @@ int nfs_flush_incompatible(struct file *file, struct page *page)
>  			return 0;
>  		l_ctx = req->wb_lock_context;
>  		do_flush = req->wb_page != page || req->wb_context != ctx;
> -		if (l_ctx) {
> +		if (l_ctx && ctx->dentry->d_inode->i_flock != NULL) {
>  			do_flush |= l_ctx->lockowner.l_owner != current->files
>  				|| l_ctx->lockowner.l_pid != current->tgid;
>  		}
> -- 
> 1.8.3.1
> 
Trond Myklebust Sept. 5, 2013, 9:57 p.m. UTC | #2
On Thu, 2013-09-05 at 16:36 -0500, Quentin Barnes wrote:
> On Thu, Sep 05, 2013 at 08:02:01PM +0000, Myklebust, Trond wrote:
> > On Thu, 2013-09-05 at 14:11 -0500, Quentin Barnes wrote:
> > > On Thu, Sep 05, 2013 at 12:03:03PM -0500, Malahal Naineni wrote:
> > > > Neil Brown posted a patch couple days ago for this!
> > > > 
> > > > http://thread.gmane.org/gmane.linux.nfs/58473
> > > 
> > > I tried Neil's patch on a v3.11 kernel.  The rebuilt kernel still
> > > exhibited the same 1000s of WRITEs/sec problem.
> > > 
> > > Any other ideas?
> > 
> > Yes. Please try the attached patch.
> 
> Great!  That did the trick!
> 
> Do you feel this patch could be worthy of pushing it upstream in its
> current state or was it just to verify a theory?
> 
> 
> In comparing the nfs_flush_incompatible() implementations between
> RHEL5 and v3.11 (without your patch), the guts of the algorithm seem
> more or less logically equivalent to me on whether or not to flush
> the page.  Also, when and where nfs_flush_incompatible() is invoked
> seems the same.  Would you provide a very brief pointer to clue me
> in as to why this problem didn't also manifest circa 2.6.18 days?

There was no nfs_vm_page_mkwrite() to handle page faults in the 2.6.18
days, and so the risk was that your mmapped writes could end up being
sent with the wrong credentials.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
Trond Myklebust Sept. 5, 2013, 10:07 p.m. UTC | #3
On Thu, 2013-09-05 at 16:36 -0500, Quentin Barnes wrote:
> On Thu, Sep 05, 2013 at 08:02:01PM +0000, Myklebust, Trond wrote:
> > On Thu, 2013-09-05 at 14:11 -0500, Quentin Barnes wrote:
> > > On Thu, Sep 05, 2013 at 12:03:03PM -0500, Malahal Naineni wrote:
> > > > Neil Brown posted a patch couple days ago for this!
> > > > 
> > > > http://thread.gmane.org/gmane.linux.nfs/58473
> > > 
> > > I tried Neil's patch on a v3.11 kernel.  The rebuilt kernel still
> > > exhibited the same 1000s of WRITEs/sec problem.
> > > 
> > > Any other ideas?
> > 
> > Yes. Please try the attached patch.
> 
> Great!  That did the trick!
> 
> Do you feel this patch could be worthy of pushing it upstream in its
> current state or was it just to verify a theory?

It should be safe to merge this. It is a valid optimisation in the case
where there are no locks applied.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
Quentin Barnes Sept. 5, 2013, 10:34 p.m. UTC | #4
On Thu, Sep 05, 2013 at 09:57:24PM +0000, Myklebust, Trond wrote:
> On Thu, 2013-09-05 at 16:36 -0500, Quentin Barnes wrote:
> > On Thu, Sep 05, 2013 at 08:02:01PM +0000, Myklebust, Trond wrote:
> > > On Thu, 2013-09-05 at 14:11 -0500, Quentin Barnes wrote:
> > > > On Thu, Sep 05, 2013 at 12:03:03PM -0500, Malahal Naineni wrote:
> > > > > Neil Brown posted a patch couple days ago for this!
> > > > > 
> > > > > http://thread.gmane.org/gmane.linux.nfs/58473
> > > > 
> > > > I tried Neil's patch on a v3.11 kernel.  The rebuilt kernel still
> > > > exhibited the same 1000s of WRITEs/sec problem.
> > > > 
> > > > Any other ideas?
> > > 
> > > Yes. Please try the attached patch.
> > 
> > Great!  That did the trick!
> > 
> > Do you feel this patch could be worthy of pushing it upstream in its
> > current state or was it just to verify a theory?
> > 
> > 
> > In comparing the nfs_flush_incompatible() implementations between
> > RHEL5 and v3.11 (without your patch), the guts of the algorithm seem
> > more or less logically equivalent to me on whether or not to flush
> > the page.  Also, when and where nfs_flush_incompatible() is invoked
> > seems the same.  Would you provide a very brief pointer to clue me
> > in as to why this problem didn't also manifest circa 2.6.18 days?
> 
> There was no nfs_vm_page_mkwrite() to handle page faults in the 2.6.18
> days, and so the risk was that your mmapped writes could end up being
> sent with the wrong credentials.

Ah!  You're right that nfs_vm_page_mkwrite() was missing from
the original 2.6.18, so that makes sense; however, Red Hat had
backported that function starting with their RHEL5.9(*) kernels,
yet the problem doesn't manifest on RHEL5.9.  Maybe the answer lies
somewhere in RHEL5.9's do_wp_page(), or up that call path, though
glancing through it, it all looks pretty close.


(*) That was the source I was using when comparing with the 3.11
source when studying your patch, since it was the last kernel known
to me without the problem.

> -- 
> Trond Myklebust
> Linux NFS client maintainer
> 
> NetApp
> Trond.Myklebust@netapp.com
> www.netapp.com

Quentin
Jeff Layton Sept. 6, 2013, 1:36 p.m. UTC | #5
On Thu, 5 Sep 2013 17:34:20 -0500
Quentin Barnes <qbarnes@gmail.com> wrote:

> On Thu, Sep 05, 2013 at 09:57:24PM +0000, Myklebust, Trond wrote:
> > On Thu, 2013-09-05 at 16:36 -0500, Quentin Barnes wrote:
> > > On Thu, Sep 05, 2013 at 08:02:01PM +0000, Myklebust, Trond wrote:
> > > > On Thu, 2013-09-05 at 14:11 -0500, Quentin Barnes wrote:
> > > > > On Thu, Sep 05, 2013 at 12:03:03PM -0500, Malahal Naineni wrote:
> > > > > > Neil Brown posted a patch couple days ago for this!
> > > > > > 
> > > > > > http://thread.gmane.org/gmane.linux.nfs/58473
> > > > > 
> > > > > I tried Neil's patch on a v3.11 kernel.  The rebuilt kernel still
> > > > > exhibited the same 1000s of WRITEs/sec problem.
> > > > > 
> > > > > Any other ideas?
> > > > 
> > > > Yes. Please try the attached patch.
> > > 
> > > Great!  That did the trick!
> > > 
> > > Do you feel this patch could be worthy of pushing it upstream in its
> > > current state or was it just to verify a theory?
> > > 
> > > 
> > > In comparing the nfs_flush_incompatible() implementations between
> > > RHEL5 and v3.11 (without your patch), the guts of the algorithm seem
> > > more or less logically equivalent to me on whether or not to flush
> > > the page.  Also, when and where nfs_flush_incompatible() is invoked
> > > seems the same.  Would you provide a very brief pointer to clue me
> > > in as to why this problem didn't also manifest circa 2.6.18 days?
> > 
> > There was no nfs_vm_page_mkwrite() to handle page faults in the 2.6.18
> > days, and so the risk was that your mmapped writes could end up being
> > sent with the wrong credentials.
> 
> Ah!  You're right that nfs_vm_page_mkwrite() was missing from
> the original 2.6.18, so that makes sense, however, Red Hat had
> backported that function starting with their RHEL5.9(*) kernels,
> yet the problem doesn't manifest on RHEL5.9.  Maybe the answer lies
> somewhere in RHEL5.9's do_wp_page(), or up that call path, but
> glancing through it, it all looks pretty close though.
> 
> 
> (*) That was the source I using when comparing with the 3.11 source
> when studying your patch since it was the last kernel known to me
> without the problem.
> 

I'm pretty sure RHEL5 has a similar problem, but it's unclear to me why
you're not seeing it there. I have a RHBZ open vs. RHEL5 but it's marked
private at the moment (I'll see about opening it up). I brought this up
upstream about a year ago with this strawman patch:

    http://article.gmane.org/gmane.linux.nfs/51240

...at the time Trond said he was working on a set of patches to track
the open/lock stateid on a per-req basis. Did that approach not pan
out?

Also, do you need to do a similar fix to nfs_can_coalesce_requests?

Thanks,
-- 
Jeff Layton <jlayton@redhat.com>
Trond Myklebust Sept. 6, 2013, 3 p.m. UTC | #6
On Fri, 2013-09-06 at 09:36 -0400, Jeff Layton wrote:
> On Thu, 5 Sep 2013 17:34:20 -0500
> Quentin Barnes <qbarnes@gmail.com> wrote:
> 
> > On Thu, Sep 05, 2013 at 09:57:24PM +0000, Myklebust, Trond wrote:
> > > On Thu, 2013-09-05 at 16:36 -0500, Quentin Barnes wrote:
> > > > On Thu, Sep 05, 2013 at 08:02:01PM +0000, Myklebust, Trond wrote:
> > > > > On Thu, 2013-09-05 at 14:11 -0500, Quentin Barnes wrote:
> > > > > > On Thu, Sep 05, 2013 at 12:03:03PM -0500, Malahal Naineni wrote:
> > > > > > > Neil Brown posted a patch couple days ago for this!
> > > > > > > 
> > > > > > > http://thread.gmane.org/gmane.linux.nfs/58473
> > > > > > 
> > > > > > I tried Neil's patch on a v3.11 kernel.  The rebuilt kernel still
> > > > > > exhibited the same 1000s of WRITEs/sec problem.
> > > > > > 
> > > > > > Any other ideas?
> > > > > 
> > > > > Yes. Please try the attached patch.
> > > > 
> > > > Great!  That did the trick!
> > > > 
> > > > Do you feel this patch could be worthy of pushing it upstream in its
> > > > current state or was it just to verify a theory?
> > > > 
> > > > 
> > > > In comparing the nfs_flush_incompatible() implementations between
> > > > RHEL5 and v3.11 (without your patch), the guts of the algorithm seem
> > > > more or less logically equivalent to me on whether or not to flush
> > > > the page.  Also, when and where nfs_flush_incompatible() is invoked
> > > > seems the same.  Would you provide a very brief pointer to clue me
> > > > in as to why this problem didn't also manifest circa 2.6.18 days?
> > > 
> > > There was no nfs_vm_page_mkwrite() to handle page faults in the 2.6.18
> > > days, and so the risk was that your mmapped writes could end up being
> > > sent with the wrong credentials.
> > 
> > Ah!  You're right that nfs_vm_page_mkwrite() was missing from
> > the original 2.6.18, so that makes sense, however, Red Hat had
> > backported that function starting with their RHEL5.9(*) kernels,
> > yet the problem doesn't manifest on RHEL5.9.  Maybe the answer lies
> > somewhere in RHEL5.9's do_wp_page(), or up that call path, but
> > glancing through it, it all looks pretty close though.
> > 
> > 
> > (*) That was the source I using when comparing with the 3.11 source
> > when studying your patch since it was the last kernel known to me
> > without the problem.
> > 
> 
> I'm pretty sure RHEL5 has a similar problem, but it's unclear to me why
> you're not seeing it there. I have a RHBZ open vs. RHEL5 but it's marked
> private at the moment (I'll see about opening it up). I brought this up
> upstream about a year ago with this strawman patch:
> 
>     http://article.gmane.org/gmane.linux.nfs/51240
> 
> ...at the time Trond said he was working on a set of patches to track
> the open/lock stateid on a per-req basis. Did that approach not pan
> out?

We've achieved what we wanted to do (Neil's lock recovery patch) without
that machinery, so for now, we're dropping that.

> Also, do you need to do a similar fix to nfs_can_coalesce_requests?

Yes. Good point!

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
Jeff Layton Sept. 6, 2013, 3:04 p.m. UTC | #7
On Fri, 6 Sep 2013 15:00:56 +0000
"Myklebust, Trond" <Trond.Myklebust@netapp.com> wrote:

> On Fri, 2013-09-06 at 09:36 -0400, Jeff Layton wrote:
> > On Thu, 5 Sep 2013 17:34:20 -0500
> > Quentin Barnes <qbarnes@gmail.com> wrote:
> > 
> > > On Thu, Sep 05, 2013 at 09:57:24PM +0000, Myklebust, Trond wrote:
> > > > On Thu, 2013-09-05 at 16:36 -0500, Quentin Barnes wrote:
> > > > > On Thu, Sep 05, 2013 at 08:02:01PM +0000, Myklebust, Trond wrote:
> > > > > > On Thu, 2013-09-05 at 14:11 -0500, Quentin Barnes wrote:
> > > > > > > On Thu, Sep 05, 2013 at 12:03:03PM -0500, Malahal Naineni wrote:
> > > > > > > > Neil Brown posted a patch couple days ago for this!
> > > > > > > > 
> > > > > > > > http://thread.gmane.org/gmane.linux.nfs/58473
> > > > > > > 
> > > > > > > I tried Neil's patch on a v3.11 kernel.  The rebuilt kernel still
> > > > > > > exhibited the same 1000s of WRITEs/sec problem.
> > > > > > > 
> > > > > > > Any other ideas?
> > > > > > 
> > > > > > Yes. Please try the attached patch.
> > > > > 
> > > > > Great!  That did the trick!
> > > > > 
> > > > > Do you feel this patch could be worthy of pushing it upstream in its
> > > > > current state or was it just to verify a theory?
> > > > > 
> > > > > 
> > > > > In comparing the nfs_flush_incompatible() implementations between
> > > > > RHEL5 and v3.11 (without your patch), the guts of the algorithm seem
> > > > > more or less logically equivalent to me on whether or not to flush
> > > > > the page.  Also, when and where nfs_flush_incompatible() is invoked
> > > > > seems the same.  Would you provide a very brief pointer to clue me
> > > > > in as to why this problem didn't also manifest circa 2.6.18 days?
> > > > 
> > > > There was no nfs_vm_page_mkwrite() to handle page faults in the 2.6.18
> > > > days, and so the risk was that your mmapped writes could end up being
> > > > sent with the wrong credentials.
> > > 
> > > Ah!  You're right that nfs_vm_page_mkwrite() was missing from
> > > the original 2.6.18, so that makes sense, however, Red Hat had
> > > backported that function starting with their RHEL5.9(*) kernels,
> > > yet the problem doesn't manifest on RHEL5.9.  Maybe the answer lies
> > > somewhere in RHEL5.9's do_wp_page(), or up that call path, but
> > > glancing through it, it all looks pretty close though.
> > > 
> > > 
> > > (*) That was the source I using when comparing with the 3.11 source
> > > when studying your patch since it was the last kernel known to me
> > > without the problem.
> > > 
> > 
> > I'm pretty sure RHEL5 has a similar problem, but it's unclear to me why
> > you're not seeing it there. I have a RHBZ open vs. RHEL5 but it's marked
> > private at the moment (I'll see about opening it up). I brought this up
> > upstream about a year ago with this strawman patch:
> > 
> >     http://article.gmane.org/gmane.linux.nfs/51240
> > 
> > ...at the time Trond said he was working on a set of patches to track
> > the open/lock stateid on a per-req basis. Did that approach not pan
> > out?
> 
> We've achieved what we wanted to do (Neil's lock recovery patch) without
> that machinery, so for now, we're dropping that.
> 
> > Also, do you need to do a similar fix to nfs_can_coalesce_requests?
> 
> Yes. Good point!
> 

Cool. FWIW, here's the original bug that was opened against RHEL5:

    https://bugzilla.redhat.com/show_bug.cgi?id=736578

...the reproducer that Max cooked up is not doing mmapped I/O, so there
may be a difference there, but I haven't looked closely at why that is.
Quentin Barnes Sept. 6, 2013, 4:48 p.m. UTC | #8
Jeff, can you try out my test program from the base note on your
RHEL5.9 or later RHEL5.x kernels?

I reverified that running the test on a 2.6.18-348.16.1.el5 x86_64
kernel (latest released RHEL5.9) does not show the problem for me.
Based on what you and Trond have said in this thread though, I'm
really curious why it doesn't have the problem.

On Fri, Sep 6, 2013 at 8:36 AM, Jeff Layton <jlayton@redhat.com> wrote:
> On Thu, 5 Sep 2013 17:34:20 -0500
> Quentin Barnes <qbarnes@gmail.com> wrote:
>
>> On Thu, Sep 05, 2013 at 09:57:24PM +0000, Myklebust, Trond wrote:
>> > On Thu, 2013-09-05 at 16:36 -0500, Quentin Barnes wrote:
>> > > On Thu, Sep 05, 2013 at 08:02:01PM +0000, Myklebust, Trond wrote:
>> > > > On Thu, 2013-09-05 at 14:11 -0500, Quentin Barnes wrote:
>> > > > > On Thu, Sep 05, 2013 at 12:03:03PM -0500, Malahal Naineni wrote:
>> > > > > > Neil Brown posted a patch couple days ago for this!
>> > > > > >
>> > > > > > http://thread.gmane.org/gmane.linux.nfs/58473
>> > > > >
>> > > > > I tried Neil's patch on a v3.11 kernel.  The rebuilt kernel still
>> > > > > exhibited the same 1000s of WRITEs/sec problem.
>> > > > >
>> > > > > Any other ideas?
>> > > >
>> > > > Yes. Please try the attached patch.
>> > >
>> > > Great!  That did the trick!
>> > >
>> > > Do you feel this patch could be worthy of pushing it upstream in its
>> > > current state or was it just to verify a theory?
>> > >
>> > >
>> > > In comparing the nfs_flush_incompatible() implementations between
>> > > RHEL5 and v3.11 (without your patch), the guts of the algorithm seem
>> > > more or less logically equivalent to me on whether or not to flush
>> > > the page.  Also, when and where nfs_flush_incompatible() is invoked
>> > > seems the same.  Would you provide a very brief pointer to clue me
>> > > in as to why this problem didn't also manifest circa 2.6.18 days?
>> >
>> > There was no nfs_vm_page_mkwrite() to handle page faults in the 2.6.18
>> > days, and so the risk was that your mmapped writes could end up being
>> > sent with the wrong credentials.
>>
>> Ah!  You're right that nfs_vm_page_mkwrite() was missing from
>> the original 2.6.18, so that makes sense, however, Red Hat had
>> backported that function starting with their RHEL5.9(*) kernels,
>> yet the problem doesn't manifest on RHEL5.9.  Maybe the answer lies
>> somewhere in RHEL5.9's do_wp_page(), or up that call path, but
>> glancing through it, it all looks pretty close though.
>>
>>
>> (*) That was the source I using when comparing with the 3.11 source
>> when studying your patch since it was the last kernel known to me
>> without the problem.
>>
>
> I'm pretty sure RHEL5 has a similar problem, but it's unclear to me why
> you're not seeing it there. I have a RHBZ open vs. RHEL5 but it's marked
> private at the moment (I'll see about opening it up). I brought this up
> upstream about a year ago with this strawman patch:
>
>     http://article.gmane.org/gmane.linux.nfs/51240
>
> ...at the time Trond said he was working on a set of patches to track
> the open/lock stateid on a per-req basis. Did that approach not pan
> out?
>
> Also, do you need to do a similar fix to nfs_can_coalesce_requests?
>
> Thanks,
> --
> Jeff Layton <jlayton@redhat.com>
Jeff Layton Sept. 7, 2013, 2:51 p.m. UTC | #9
On Fri, 6 Sep 2013 11:48:45 -0500
Quentin Barnes <qbarnes@gmail.com> wrote:

> Jeff, can your try out my test program in the base note on your
> RHEL5.9 or later RHEL5.x kernels?
> 
> I reverified that running the test on a 2.6.18-348.16.1.el5 x86_64
> kernel (latest released RHEL5.9) does not show the problem for me.
> Based on what you and Trond have said in this thread though, I'm
> really curious why it doesn't have the problem.
> 
> On Fri, Sep 6, 2013 at 8:36 AM, Jeff Layton <jlayton@redhat.com> wrote:
> > On Thu, 5 Sep 2013 17:34:20 -0500
> > Quentin Barnes <qbarnes@gmail.com> wrote:
> >
> >> On Thu, Sep 05, 2013 at 09:57:24PM +0000, Myklebust, Trond wrote:
> >> > On Thu, 2013-09-05 at 16:36 -0500, Quentin Barnes wrote:
> >> > > On Thu, Sep 05, 2013 at 08:02:01PM +0000, Myklebust, Trond wrote:
> >> > > > On Thu, 2013-09-05 at 14:11 -0500, Quentin Barnes wrote:
> >> > > > > On Thu, Sep 05, 2013 at 12:03:03PM -0500, Malahal Naineni wrote:
> >> > > > > > Neil Brown posted a patch couple days ago for this!
> >> > > > > >
> >> > > > > > http://thread.gmane.org/gmane.linux.nfs/58473
> >> > > > >
> >> > > > > I tried Neil's patch on a v3.11 kernel.  The rebuilt kernel still
> >> > > > > exhibited the same 1000s of WRITEs/sec problem.
> >> > > > >
> >> > > > > Any other ideas?
> >> > > >
> >> > > > Yes. Please try the attached patch.
> >> > >
> >> > > Great!  That did the trick!
> >> > >
> >> > > Do you feel this patch could be worthy of pushing it upstream in its
> >> > > current state or was it just to verify a theory?
> >> > >
> >> > >
> >> > > In comparing the nfs_flush_incompatible() implementations between
> >> > > RHEL5 and v3.11 (without your patch), the guts of the algorithm seem
> >> > > more or less logically equivalent to me on whether or not to flush
> >> > > the page.  Also, when and where nfs_flush_incompatible() is invoked
> >> > > seems the same.  Would you provide a very brief pointer to clue me
> >> > > in as to why this problem didn't also manifest circa 2.6.18 days?
> >> >
> >> > There was no nfs_vm_page_mkwrite() to handle page faults in the 2.6.18
> >> > days, and so the risk was that your mmapped writes could end up being
> >> > sent with the wrong credentials.
> >>
> >> Ah!  You're right that nfs_vm_page_mkwrite() was missing from
> >> the original 2.6.18, so that makes sense, however, Red Hat had
> >> backported that function starting with their RHEL5.9(*) kernels,
> >> yet the problem doesn't manifest on RHEL5.9.  Maybe the answer lies
> >> somewhere in RHEL5.9's do_wp_page(), or up that call path, but
> >> glancing through it, it all looks pretty close though.
> >>
> >>
> >> (*) That was the source I using when comparing with the 3.11 source
> >> when studying your patch since it was the last kernel known to me
> >> without the problem.
> >>
> >
> > I'm pretty sure RHEL5 has a similar problem, but it's unclear to me why
> > you're not seeing it there. I have a RHBZ open vs. RHEL5 but it's marked
> > private at the moment (I'll see about opening it up). I brought this up
> > upstream about a year ago with this strawman patch:
> >
> >     http://article.gmane.org/gmane.linux.nfs/51240
> >
> > ...at the time Trond said he was working on a set of patches to track
> > the open/lock stateid on a per-req basis. Did that approach not pan
> > out?
> >
> > Also, do you need to do a similar fix to nfs_can_coalesce_requests?
> >

Yes, I see the same behavior you do. With a recent kernel I see a ton
of WRITE requests go out, with RHEL5 hardly any.

I guess I'm a little confused as to the reverse question. Why are we
seeing this data get flushed out so quickly in recent kernels from just
changes to the mmaped pages?

My understanding has always been that when a page is cleaned, we set
the WP bit on it, and then when it goes dirty we clear it and also
call page_mkwrite (not necessarily in that order).

So here we have two processes that mmap the same page, and then are
furiously writing to it. The kernel shouldn't really care or be aware
of that thrashing until that page gets flushed out for some reason
(msync() call or VM pressure).

IOW, RHEL5 behaves the way I'd expect. What's unclear to me is why more
recent kernels don't behave that way.
Trond Myklebust Sept. 7, 2013, 3 p.m. UTC | #10
On Sat, 2013-09-07 at 10:51 -0400, Jeff Layton wrote:
> Yes, I see the same behavior you do. With a recent kernel I see a ton
> of WRITE requests go out, with RHEL5 hardly any.
> 
> I guess I'm a little confused as to the reverse question. Why are we
> seeing this data get flushed out so quickly in recent kernels from just
> changes to the mmaped pages?
> 
> My understanding has always been that when a page is cleaned, we set
> the WP bit on it, and then when it goes dirty we clear it and also
> call page_mkwrite (not necessarily in that order).
> 
> So here we have two processes that mmap the same page, and then are
> furiously writing to it. The kernel shouldn't really care or be aware
> of that thrashing until that page gets flushed out for some reason
> (msync() call or VM pressure).

fork() is not supposed to share page tables between parent and child
process. Shouldn't that also imply that the page write protect bits are
not shared?

IOW: a write protect page fault in the parent process that sets
page_mkwrite() should not prevent a similar write protect page fault in
the child process (and subsequent call to page_mkwrite()).

...or is my understanding of the page fault semantics wrong?

> IOW, RHEL5 behaves the way I'd expect. What's unclear to me is why more
> recent kernels don't behave that way.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
Jeff Layton Sept. 9, 2013, 1:04 p.m. UTC | #11
On Fri, 6 Sep 2013 11:48:45 -0500
Quentin Barnes <qbarnes@gmail.com> wrote:

> Jeff, can you try out my test program in the base note on your
> RHEL5.9 or later RHEL5.x kernels?
> 
> I reverified that running the test on a 2.6.18-348.16.1.el5 x86_64
> kernel (latest released RHEL5.9) does not show the problem for me.
> Based on what you and Trond have said in this thread though, I'm
> really curious why it doesn't have the problem.
> 

I can confirm what you see on RHEL5. One difference is that RHEL5's
page_mkwrite handler does not do wait_on_page_writeback. That was added
as part of the stable pages work that went in a while back, so that may 
be the main difference. Adding that in doesn't seem to materially
change things though.

In any case, what I see is that the initial program just ends up with
two calls to nfs_vm_page_mkwrite(). They both push out a WRITE and then
things settle down (likely because the page is still marked dirty).

Eventually, another write occurs and the dirty page gets pushed out to
the server in a small flurry of WRITEs to the same range. Then, things
settle down again until there's another small flurry of activity.

My suspicion is that there is a race condition involved here, but I'm
unclear on where it is. I'm not 100% convinced this is a bug, but page
fault semantics aren't my strong suit.

You may want to consider opening a "formal" RH support case if you have
interest in getting Trond's patch backported, and/or following up on
why RHEL5 behaves the way it does.


> On Fri, Sep 6, 2013 at 8:36 AM, Jeff Layton <jlayton@redhat.com> wrote:
> > On Thu, 5 Sep 2013 17:34:20 -0500
> > Quentin Barnes <qbarnes@gmail.com> wrote:
> >
> >> On Thu, Sep 05, 2013 at 09:57:24PM +0000, Myklebust, Trond wrote:
> >> > On Thu, 2013-09-05 at 16:36 -0500, Quentin Barnes wrote:
> >> > > On Thu, Sep 05, 2013 at 08:02:01PM +0000, Myklebust, Trond wrote:
> >> > > > On Thu, 2013-09-05 at 14:11 -0500, Quentin Barnes wrote:
> >> > > > > On Thu, Sep 05, 2013 at 12:03:03PM -0500, Malahal Naineni wrote:
> >> > > > > > Neil Brown posted a patch couple days ago for this!
> >> > > > > >
> >> > > > > > http://thread.gmane.org/gmane.linux.nfs/58473
> >> > > > >
> >> > > > > I tried Neil's patch on a v3.11 kernel.  The rebuilt kernel still
> >> > > > > exhibited the same 1000s of WRITEs/sec problem.
> >> > > > >
> >> > > > > Any other ideas?
> >> > > >
> >> > > > Yes. Please try the attached patch.
> >> > >
> >> > > Great!  That did the trick!
> >> > >
> >> > > Do you feel this patch could be worthy of pushing it upstream in its
> >> > > current state or was it just to verify a theory?
> >> > >
> >> > >
> >> > > In comparing the nfs_flush_incompatible() implementations between
> >> > > RHEL5 and v3.11 (without your patch), the guts of the algorithm seem
> >> > > more or less logically equivalent to me on whether or not to flush
> >> > > the page.  Also, when and where nfs_flush_incompatible() is invoked
> >> > > seems the same.  Would you provide a very brief pointer to clue me
> >> > > in as to why this problem didn't also manifest circa 2.6.18 days?
> >> >
> >> > There was no nfs_vm_page_mkwrite() to handle page faults in the 2.6.18
> >> > days, and so the risk was that your mmapped writes could end up being
> >> > sent with the wrong credentials.
> >>
> >> Ah!  You're right that nfs_vm_page_mkwrite() was missing from
> >> the original 2.6.18, so that makes sense, however, Red Hat had
> >> backported that function starting with their RHEL5.9(*) kernels,
> >> yet the problem doesn't manifest on RHEL5.9.  Maybe the answer lies
> >> somewhere in RHEL5.9's do_wp_page(), or up that call path, but
> >> glancing through it, it all looks pretty close though.
> >>
> >>
> >> (*) That was the source I using when comparing with the 3.11 source
> >> when studying your patch since it was the last kernel known to me
> >> without the problem.
> >>
> >
> > I'm pretty sure RHEL5 has a similar problem, but it's unclear to me why
> > you're not seeing it there. I have a RHBZ open vs. RHEL5 but it's marked
> > private at the moment (I'll see about opening it up). I brought this up
> > upstream about a year ago with this strawman patch:
> >
> >     http://article.gmane.org/gmane.linux.nfs/51240
> >
> > ...at the time Trond said he was working on a set of patches to track
> > the open/lock stateid on a per-req basis. Did that approach not pan
> > out?
> >
> > Also, do you need to do a similar fix to nfs_can_coalesce_requests?
> >
> > Thanks,
> > --
> > Jeff Layton <jlayton@redhat.com>
Quentin Barnes Sept. 9, 2013, 5:32 p.m. UTC | #12
On Mon, Sep 09, 2013 at 09:04:24AM -0400, Jeff Layton wrote:
> On Fri, 6 Sep 2013 11:48:45 -0500
> Quentin Barnes <qbarnes@gmail.com> wrote:
> 
> > Jeff, can you try out my test program in the base note on your
> > RHEL5.9 or later RHEL5.x kernels?
> > 
> > I reverified that running the test on a 2.6.18-348.16.1.el5 x86_64
> > kernel (latest released RHEL5.9) does not show the problem for me.
> > Based on what you and Trond have said in this thread though, I'm
> > really curious why it doesn't have the problem.
> 
> I can confirm what you see on RHEL5. One difference is that RHEL5's
> page_mkwrite handler does not do wait_on_page_writeback. That was added
> as part of the stable pages work that went in a while back, so that may 
> be the main difference. Adding that in doesn't seem to materially
> change things though.

Good to know you confirmed the behavior I saw on RHEL5 (just so that
I know it's not some random variable in play I had overlooked).

> > In any case, what I see is that the initial program just ends up with
> two calls to nfs_vm_page_mkwrite(). They both push out a WRITE and then
> things settle down (likely because the page is still marked dirty).
> 
> Eventually, another write occurs and the dirty page gets pushed out to
> > the server in a small flurry of WRITEs to the same range. Then, things
> settle down again until there's another small flurry of activity.
> 
> My suspicion is that there is a race condition involved here, but I'm
> unclear on where it is. I'm not 100% convinced this is a bug, but page
> fault semantics aren't my strong suit.

As a test on RHEL6, I made a trivial systemtap script for kprobing
nfs_vm_page_mkwrite() and nfs_flush_incompatible().  I wanted to
make sure this bug was limited to just the nfs module and was not a
result of some mm behavior change.

With the bug unfixed running the test program, nfs_vm_page_mkwrite()
and nfs_flush_incompatible() are called repeatedly at a very high rate
(hence all the WRITEs).

After Trond's patch, the two functions are called just at the
program's initialization and then called only every 30 seconds or
so.

It looks like to me from the code flow that there must be something
nfs_wb_page() does that resets the need for mm to keep reinvoking
nfs_vm_page_mkwrite().  I didn't look any deeper than that though
for now.  Maybe a race in how nfs_wb_page() updates status you're
thinking of?

> You may want to consider opening a "formal" RH support case if you have
> interest in getting Trond's patch backported, and/or following up on
> why RHEL5 behaves the way it does.

Yes, I'll be doing that.  When I do, I'll send you an email with the
case ticket.  Before filing it though, I want to hear back from the
group that had the original problem to make sure Trond's patch fully
addresses their problem (besides just the trivial test program).

> -- 
> Jeff Layton <jlayton@redhat.com>

Quentin
Trond Myklebust Sept. 9, 2013, 5:47 p.m. UTC | #13
On Mon, 2013-09-09 at 12:32 -0500, Quentin Barnes wrote:
> On Mon, Sep 09, 2013 at 09:04:24AM -0400, Jeff Layton wrote:
> > On Fri, 6 Sep 2013 11:48:45 -0500
> > Quentin Barnes <qbarnes@gmail.com> wrote:
> > 
> > > Jeff, can you try out my test program in the base note on your
> > > RHEL5.9 or later RHEL5.x kernels?
> > > 
> > > I reverified that running the test on a 2.6.18-348.16.1.el5 x86_64
> > > kernel (latest released RHEL5.9) does not show the problem for me.
> > > Based on what you and Trond have said in this thread though, I'm
> > > really curious why it doesn't have the problem.
> > 
> > I can confirm what you see on RHEL5. One difference is that RHEL5's
> > page_mkwrite handler does not do wait_on_page_writeback. That was added
> > as part of the stable pages work that went in a while back, so that may
> > be the main difference. Adding that in doesn't seem to materially
> > change things though.
> 
> Good to know you confirmed the behavior I saw on RHEL5 (just so that
> I know it's not some random variable in play I had overlooked).
> 
> > In any case, what I see is that the initial program just ends up with
> > two calls to nfs_vm_page_mkwrite(). They both push out a WRITE and then
> > things settle down (likely because the page is still marked dirty).
> > 
> > Eventually, another write occurs and the dirty page gets pushed out to
> > the server in a small flurry of WRITEs to the same range. Then, things
> > settle down again until there's another small flurry of activity.
> > 
> > My suspicion is that there is a race condition involved here, but I'm
> > unclear on where it is. I'm not 100% convinced this is a bug, but page
> > fault semantics aren't my strong suit.
> 
> As a test on RHEL6, I made a trivial systemtap script for kprobing
> nfs_vm_page_mkwrite() and nfs_flush_incompatible().  I wanted to
> make sure this bug was limited to just the nfs module and was not a
> result of some mm behavior change.
> 
> With the bug unfixed running the test program, nfs_vm_page_mkwrite()
> and nfs_flush_incompatible() are called repeatedly at a very high rate
> (hence all the WRITEs).
> 
> After Trond's patch, the two functions are called just at the
> program's initialization and then called only every 30 seconds or
> so.
> 
> It looks like to me from the code flow that there must be something
> nfs_wb_page() does that resets the need for mm to keep reinvoking
> nfs_vm_page_mkwrite().  I didn't look any deeper than that though
> for now.  Maybe a race in how nfs_wb_page() updates status you're
> thinking of?

In RHEL-5, nfs_wb_page() is just a wrapper to nfs_sync_inode_wait(),
which does _not_ call clear_page_dirty_for_io() (and hence does not call
page_mkclean()).

That would explain it...

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
Jeff Layton Sept. 9, 2013, 6:21 p.m. UTC | #14
On Mon, 9 Sep 2013 17:47:48 +0000
"Myklebust, Trond" <Trond.Myklebust@netapp.com> wrote:

> On Mon, 2013-09-09 at 12:32 -0500, Quentin Barnes wrote:
> > On Mon, Sep 09, 2013 at 09:04:24AM -0400, Jeff Layton wrote:
> > > On Fri, 6 Sep 2013 11:48:45 -0500
> > > Quentin Barnes <qbarnes@gmail.com> wrote:
> > > 
> > > > Jeff, can you try out my test program in the base note on your
> > > > RHEL5.9 or later RHEL5.x kernels?
> > > > 
> > > > I reverified that running the test on a 2.6.18-348.16.1.el5 x86_64
> > > > kernel (latest released RHEL5.9) does not show the problem for me.
> > > > Based on what you and Trond have said in this thread though, I'm
> > > > really curious why it doesn't have the problem.
> > > 
> > > I can confirm what you see on RHEL5. One difference is that RHEL5's
> > > page_mkwrite handler does not do wait_on_page_writeback. That was added
> > > as part of the stable pages work that went in a while back, so that may 
> > > be the main difference. Adding that in doesn't seem to materially
> > > change things though.
> > 
> > Good to know you confirmed the behavior I saw on RHEL5 (just so that
> > I know it's not some random variable in play I had overlooked).
> > 
> > > In any case, what I see is that the initial program just ends up with
> > > two calls to nfs_vm_page_mkwrite(). They both push out a WRITE and then
> > > things settle down (likely because the page is still marked dirty).
> > > 
> > > Eventually, another write occurs and the dirty page gets pushed out to
> > > the server in a small flurry of WRITEs to the same range. Then, things
> > > settle down again until there's another small flurry of activity.
> > > 
> > > My suspicion is that there is a race condition involved here, but I'm
> > > unclear on where it is. I'm not 100% convinced this is a bug, but page
> > > fault semantics aren't my strong suit.
> > 
> > As a test on RHEL6, I made a trivial systemtap script for kprobing
> > nfs_vm_page_mkwrite() and nfs_flush_incompatible().  I wanted to
> > make sure this bug was limited to just the nfs module and was not a
> > result of some mm behavior change.
> > 
> > With the bug unfixed running the test program, nfs_vm_page_mkwrite()
> > and nfs_flush_incompatible() are called repeatedly at a very high rate
> > (hence all the WRITEs).
> > 
> > After Trond's patch, the two functions are called just at the
> > program's initialization and then called only every 30 seconds or
> > so.
> > 
> > It looks like to me from the code flow that there must be something
> > nfs_wb_page() does that resets the need for mm to keep reinvoking
> > nfs_vm_page_mkwrite().  I didn't look any deeper than that though
> > for now.  Maybe a race in how nfs_wb_page() updates status you're
> > thinking of?
> 
> In RHEL-5, nfs_wb_page() is just a wrapper to nfs_sync_inode_wait(),
> which does _not_ call clear_page_dirty_for_io() (and hence does not call
> page_mkclean()).
> 
> That would explain it...
> 

Thanks Trond, that does explain it.

FWIW, at this point in the RHEL5 lifecycle I'd be disinclined to make
any changes to that code without some strong justification. Backporting
Trond's recent patch for RHEL6 and making sure that RHEL7 has it sounds
quite reasonable though.
diff mbox

Patch

From 903ebaeefae78e6e03f3719aafa8fd5dd22d3288 Mon Sep 17 00:00:00 2001
From: Trond Myklebust <Trond.Myklebust@netapp.com>
Date: Thu, 5 Sep 2013 15:52:51 -0400
Subject: [PATCH] NFS: Don't check lock owner compatibility in writes unless
 file is locked

If we're doing buffered writes, and there is no file locking involved,
then we don't have to worry about whether or not the lock owner information
is identical.
By relaxing this check, we ensure that fork()ed child processes can write
to a page without having to first sync dirty data that was written
by the parent to disk.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
---
 fs/nfs/write.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 40979e8..ac1dc33 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -863,7 +863,7 @@  int nfs_flush_incompatible(struct file *file, struct page *page)
 			return 0;
 		l_ctx = req->wb_lock_context;
 		do_flush = req->wb_page != page || req->wb_context != ctx;
-		if (l_ctx) {
+		if (l_ctx && ctx->dentry->d_inode->i_flock != NULL) {
 			do_flush |= l_ctx->lockowner.l_owner != current->files
 				|| l_ctx->lockowner.l_pid != current->tgid;
 		}
-- 
1.8.3.1