Message ID | 1378411320.5450.27.camel@leira.trondhjem.org (mailing list archive)
---|---
State | New, archived
On Thu, Sep 05, 2013 at 08:02:01PM +0000, Myklebust, Trond wrote:
> On Thu, 2013-09-05 at 14:11 -0500, Quentin Barnes wrote:
> > On Thu, Sep 05, 2013 at 12:03:03PM -0500, Malahal Naineni wrote:
> > > Neil Brown posted a patch couple days ago for this!
> > >
> > > http://thread.gmane.org/gmane.linux.nfs/58473
> >
> > I tried Neil's patch on a v3.11 kernel.  The rebuilt kernel still
> > exhibited the same 1000s of WRITEs/sec problem.
> >
> > Any other ideas?
>
> Yes. Please try the attached patch.

Great!  That did the trick!

Do you feel this patch could be worthy of pushing upstream in its
current state, or was it just to verify a theory?

In comparing the nfs_flush_incompatible() implementations between
RHEL5 and v3.11 (without your patch), the guts of the algorithm seem
more or less logically equivalent to me on whether or not to flush
the page.  Also, when and where nfs_flush_incompatible() is invoked
seems the same.  Would you provide a very brief pointer to clue me
in as to why this problem didn't also manifest circa 2.6.18 days?

Quentin

> > > Regards, Malahal.
> > >
> > > Quentin Barnes [qbarnes@gmail.com] wrote:
> > > > If two (or more) processes are doing nothing more than writing to
> > > > the memory addresses of an mmapped shared file on an NFS mounted
> > > > file system, it results in the kernel scribbling WRITEs to the
> > > > server as fast as it can (1000s per second) even while no syscalls
> > > > are going on.
> > > >
> > > > The problem happens on NFS clients mounting NFSv3 or NFSv4.  I've
> > > > reproduced this on the 3.11 kernel, and it happens as far back as
> > > > RHEL6 (2.6.32 based), however, it is not a problem on RHEL5 (2.6.18
> > > > based).  (All x86_64 systems.)  I didn't try anything in between.
> > > >
> > > > I've created a self-contained program below that will demonstrate
> > > > the problem (call it "t1").  Assuming /mnt has an NFS file system:
> > > >
> > > >   $ t1 /mnt/mynfsfile 1   # Fork 1 writer, kernel behaves normally
> > > >   $ t1 /mnt/mynfsfile 2   # Fork 2 writers, kernel goes crazy WRITEing
> > > >
> > > > Just run "watch -d nfsstat" in another window while running the
> > > > two-writer test and watch the WRITE count explode.
> > > >
> > > > I don't see anything particularly wrong with what the example code
> > > > is doing with its use of mmap.  Is there anything undefined about
> > > > the code that would explain this behavior, or is this an NFS bug
> > > > that's really lived this long?
> > > >
> > > > Quentin
> > > >
> > > >
> > > > #include <sys/stat.h>
> > > > #include <sys/mman.h>
> > > > #include <sys/wait.h>
> > > > #include <errno.h>
> > > > #include <fcntl.h>
> > > > #include <stdio.h>
> > > > #include <stdlib.h>
> > > > #include <signal.h>
> > > > #include <string.h>
> > > > #include <unistd.h>
> > > >
> > > > int
> > > > kill_children()
> > > > {
> > > > 	int cnt = 0;
> > > > 	siginfo_t infop;
> > > >
> > > > 	signal(SIGINT, SIG_IGN);
> > > > 	kill(0, SIGINT);
> > > > 	while (waitid(P_ALL, 0, &infop, WEXITED) != -1)
> > > > 		++cnt;
> > > >
> > > > 	return cnt;
> > > > }
> > > >
> > > > void
> > > > sighandler(int sig)
> > > > {
> > > > 	printf("Cleaning up all children.\n");
> > > > 	int cnt = kill_children();
> > > > 	printf("Cleaned up %d child%s.\n", cnt, cnt == 1 ? "" : "ren");
> > > >
> > > > 	exit(0);
> > > > }
> > > >
> > > > int
> > > > do_child(volatile int *iaddr)
> > > > {
> > > > 	while (1)
> > > > 		*iaddr = 1;
> > > > }
> > > >
> > > > int
> > > > main(int argc, char **argv)
> > > > {
> > > > 	const char *path;
> > > > 	int fd;
> > > > 	ssize_t wlen;
> > > > 	int *ip;
> > > > 	int fork_count = 1;
> > > >
> > > > 	if (argc == 1) {
> > > > 		fprintf(stderr, "Usage: %s <path> [fork_count].\n",
> > > > 			argv[0]);
> > > > 		return 1;
> > > > 	}
> > > >
> > > > 	path = argv[1];
> > > >
> > > > 	if (argc > 2) {
> > > > 		int fc = atoi(argv[2]);
> > > > 		if (fc >= 0)
> > > > 			fork_count = fc;
> > > > 	}
> > > >
> > > > 	fd = open(path, O_CREAT|O_TRUNC|O_RDWR|O_APPEND, S_IRUSR|S_IWUSR);
> > > > 	if (fd < 0) {
> > > > 		fprintf(stderr, "Open of '%s' failed: %s (%d)\n",
> > > > 			path, strerror(errno), errno);
> > > > 		return 1;
> > > > 	}
> > > >
> > > > 	wlen = write(fd, &(int){0}, sizeof(int));
> > > > 	if (wlen != sizeof(int)) {
> > > > 		if (wlen < 0)
> > > > 			fprintf(stderr, "Write of '%s' failed: %s (%d)\n",
> > > > 				path, strerror(errno), errno);
> > > > 		else
> > > > 			fprintf(stderr, "Short write to '%s'\n", path);
> > > > 		return 1;
> > > > 	}
> > > >
> > > > 	ip = (int *)mmap(NULL, sizeof(int), PROT_READ|PROT_WRITE,
> > > > 			 MAP_SHARED, fd, 0);
> > > > 	if (ip == MAP_FAILED) {
> > > > 		fprintf(stderr, "Mmap of '%s' failed: %s (%d)\n",
> > > > 			path, strerror(errno), errno);
> > > > 		return 1;
> > > > 	}
> > > >
> > > > 	signal(SIGINT, sighandler);
> > > >
> > > > 	while (fork_count-- > 0) {
> > > > 		switch (fork()) {
> > > > 		case -1:
> > > > 			fprintf(stderr, "Fork failed: %s (%d)\n",
> > > > 				strerror(errno), errno);
> > > > 			kill_children();
> > > > 			return 1;
> > > > 		case 0:	/* child */
> > > > 			signal(SIGINT, SIG_DFL);
> > > > 			do_child(ip);
> > > > 			break;
> > > > 		default:	/* parent */
> > > > 			break;
> > > > 		}
> > > > 	}
> > > >
> > > > 	printf("Press ^C to terminate test.\n");
> > > > 	pause();
> > > >
> > > > 	return 0;
> > > > }
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> > > > the body of a message to majordomo@vger.kernel.org
> > > > More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> Trond.Myklebust@netapp.com
> www.netapp.com

> From 903ebaeefae78e6e03f3719aafa8fd5dd22d3288 Mon Sep 17 00:00:00 2001
> From: Trond Myklebust <Trond.Myklebust@netapp.com>
> Date: Thu, 5 Sep 2013 15:52:51 -0400
> Subject: [PATCH] NFS: Don't check lock owner compatibility in writes unless
>  file is locked
>
> If we're doing buffered writes, and there is no file locking involved,
> then we don't have to worry about whether or not the lock owner information
> is identical.
> By relaxing this check, we ensure that fork()ed child processes can write
> to a page without having to first sync dirty data that was written
> by the parent to disk.
>
> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
> ---
>  fs/nfs/write.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index 40979e8..ac1dc33 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -863,7 +863,7 @@ int nfs_flush_incompatible(struct file *file, struct page *page)
>  		return 0;
>  	l_ctx = req->wb_lock_context;
>  	do_flush = req->wb_page != page || req->wb_context != ctx;
> -	if (l_ctx) {
> +	if (l_ctx && ctx->dentry->d_inode->i_flock != NULL) {
>  		do_flush |= l_ctx->lockowner.l_owner != current->files
>  			|| l_ctx->lockowner.l_pid != current->tgid;
>  	}
> --
> 1.8.3.1
On Thu, 2013-09-05 at 16:36 -0500, Quentin Barnes wrote:
> [...]
> In comparing the nfs_flush_incompatible() implementations between
> RHEL5 and v3.11 (without your patch), the guts of the algorithm seem
> more or less logically equivalent to me on whether or not to flush
> the page.  Also, when and where nfs_flush_incompatible() is invoked
> seems the same.  Would you provide a very brief pointer to clue me
> in as to why this problem didn't also manifest circa 2.6.18 days?

There was no nfs_vm_page_mkwrite() to handle page faults in the 2.6.18
days, and so the risk was that your mmapped writes could end up being
sent with the wrong credentials.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
On Thu, 2013-09-05 at 16:36 -0500, Quentin Barnes wrote:
> [...]
> Great!  That did the trick!
>
> Do you feel this patch could be worthy of pushing it upstream in its
> current state or was it just to verify a theory?

It should be safe to merge this. It is a valid optimisation in the case
where there are no locks applied.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
On Thu, Sep 05, 2013 at 09:57:24PM +0000, Myklebust, Trond wrote:
> [...]
> There was no nfs_vm_page_mkwrite() to handle page faults in the 2.6.18
> days, and so the risk was that your mmapped writes could end up being
> sent with the wrong credentials.

Ah!  You're right that nfs_vm_page_mkwrite() was missing from
the original 2.6.18, so that makes sense; however, Red Hat had
backported that function starting with their RHEL5.9(*) kernels,
yet the problem doesn't manifest on RHEL5.9.  Maybe the answer lies
somewhere in RHEL5.9's do_wp_page(), or up that call path, but
glancing through it, it all looks pretty close.

(*) That was the source I was using when comparing with the 3.11
source while studying your patch, since it was the last kernel known
to me without the problem.

Quentin
On Thu, 5 Sep 2013 17:34:20 -0500
Quentin Barnes <qbarnes@gmail.com> wrote:
> [...]
> Ah!  You're right that nfs_vm_page_mkwrite() was missing from
> the original 2.6.18, so that makes sense, however, Red Hat had
> backported that function starting with their RHEL5.9(*) kernels,
> yet the problem doesn't manifest on RHEL5.9.  Maybe the answer lies
> somewhere in RHEL5.9's do_wp_page(), or up that call path, but
> glancing through it, it all looks pretty close though.
>
> (*) That was the source I using when comparing with the 3.11 source
> when studying your patch since it was the last kernel known to me
> without the problem.

I'm pretty sure RHEL5 has a similar problem, but it's unclear to me why
you're not seeing it there. I have a RHBZ open vs. RHEL5 but it's marked
private at the moment (I'll see about opening it up). I brought this up
upstream about a year ago with this strawman patch:

    http://article.gmane.org/gmane.linux.nfs/51240

...at the time Trond said he was working on a set of patches to track
the open/lock stateid on a per-req basis. Did that approach not pan
out?

Also, do you need to do a similar fix to nfs_can_coalesce_requests?

Thanks,
On Fri, 2013-09-06 at 09:36 -0400, Jeff Layton wrote:
> [...]
> ...at the time Trond said he was working on a set of patches to track
> the open/lock stateid on a per-req basis. Did that approach not pan
> out?

We've achieved what we wanted to do (Neil's lock recovery patch) without
that machinery, so for now, we're dropping that.

> Also, do you need to do a similar fix to nfs_can_coalesce_requests?

Yes. Good point!

--
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
On Fri, 6 Sep 2013 15:00:56 +0000
"Myklebust, Trond" <Trond.Myklebust@netapp.com> wrote:
> [...]
> We've achieved what we wanted to do (Neil's lock recovery patch) without
> that machinery, so for now, we're dropping that.
>
> > Also, do you need to do a similar fix to nfs_can_coalesce_requests?
>
> Yes. Good point!

Cool. FWIW, here's the original bug that was opened against RHEL5:

    https://bugzilla.redhat.com/show_bug.cgi?id=736578

...the reproducer that Max cooked up is not doing mmapped I/O, so there
may be a difference there, but I haven't looked closely at why that is.
Jeff, can you try out my test program in the base note on your
RHEL5.9 or later RHEL5.x kernels?

I reverified that running the test on a 2.6.18-348.16.1.el5 x86_64
kernel (the latest released RHEL5.9) does not show the problem for me.
Based on what you and Trond have said in this thread, though, I'm
really curious why it doesn't have the problem.

On Fri, Sep 6, 2013 at 8:36 AM, Jeff Layton <jlayton@redhat.com> wrote:
> I'm pretty sure RHEL5 has a similar problem, but it's unclear to me why
> you're not seeing it there. I have a RHBZ open vs. RHEL5 but it's marked
> private at the moment (I'll see about opening it up). I brought this up
> upstream about a year ago with this strawman patch:
>
>     http://article.gmane.org/gmane.linux.nfs/51240
>
> ...at the time Trond said he was working on a set of patches to track
> the open/lock stateid on a per-req basis. Did that approach not pan
> out?
>
> Also, do you need to do a similar fix to nfs_can_coalesce_requests?
>
> Thanks,
> --
> Jeff Layton <jlayton@redhat.com>
On Fri, 6 Sep 2013 11:48:45 -0500
Quentin Barnes <qbarnes@gmail.com> wrote:

> Jeff, can your try out my test program in the base note on your
> RHEL5.9 or later RHEL5.x kernels?
>
> I reverified that running the test on a 2.6.18-348.16.1.el5 x86_64
> kernel (latest released RHEL5.9) does not show the problem for me.
> Based on what you and Trond have said in this thread though, I'm
> really curious why it doesn't have the problem.
> [...]

Yes, I see the same behavior you do. With a recent kernel I see a ton
of WRITE requests go out, with RHEL5 hardly any.

I guess I'm a little confused as to the reverse question. Why are we
seeing this data get flushed out so quickly in recent kernels from just
changes to the mmapped pages?

My understanding has always been that when a page is cleaned, we set
the WP bit on it, and then when it goes dirty we clear it and also
call page_mkwrite (not necessarily in that order).

So here we have two processes that mmap the same page, and then are
furiously writing to it. The kernel shouldn't really care or be aware
of that thrashing until that page gets flushed out for some reason
(msync() call or VM pressure).

IOW, RHEL5 behaves the way I'd expect. What's unclear to me is why more
recent kernels don't behave that way.
On Sat, 2013-09-07 at 10:51 -0400, Jeff Layton wrote:
> [...]
> My understanding has always been that when a page is cleaned, we set
> the WP bit on it, and then when it goes dirty we clear it and also
> call page_mkwrite (not necessarily in that order).
>
> So here we have two processes that mmap the same page, and then are
> furiously writing to it. The kernel shouldn't really care or be aware
> of that thrashing until that page gets flushed out for some reason
> (msync() call or VM pressure).

fork() is not supposed to share page tables between parent and child
process. Shouldn't that also imply that the page write protect bits are
not shared?

IOW: a write protect page fault in the parent process that sets
page_mkwrite() should not prevent a similar write protect page fault in
the child process (and subsequent call to page_mkwrite()).

...or is my understanding of the page fault semantics wrong?

> IOW, RHEL5 behaves the way I'd expect. What's unclear to me is why more
> recent kernels don't behave that way.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
On Fri, 6 Sep 2013 11:48:45 -0500
Quentin Barnes <qbarnes@gmail.com> wrote:

> Jeff, can you try out my test program in the base note on your
> RHEL5.9 or later RHEL5.x kernels?
>
> I reverified that running the test on a 2.6.18-348.16.1.el5 x86_64
> kernel (latest released RHEL5.9) does not show the problem for me.
> Based on what you and Trond have said in this thread though, I'm
> really curious why it doesn't have the problem.
>

I can confirm what you see on RHEL5. One difference is that RHEL5's
page_mkwrite handler does not do wait_on_page_writeback. That was added
as part of the stable pages work that went in a while back, so that may
be the main difference. Adding that in doesn't seem to materially
change things though.

In any case, what I see is that the initial program just ends up with
two calls to nfs_vm_page_mkwrite(). They both push out a WRITE and then
things settle down (likely because the page is still marked dirty).

Eventually, another write occurs and the dirty page gets pushed out to
the server in a small flurry of WRITEs to the same range. Then, things
settle down again until there's another small flurry of activity.

My suspicion is that there is a race condition involved here, but I'm
unclear on where it is. I'm not 100% convinced this is a bug, but page
fault semantics aren't my strong suit.

You may want to consider opening a "formal" RH support case if you have
interest in getting Trond's patch backported, and/or following up on
why RHEL5 behaves the way it does.
> On Fri, Sep 6, 2013 at 8:36 AM, Jeff Layton <jlayton@redhat.com> wrote:
> > On Thu, 5 Sep 2013 17:34:20 -0500
> > Quentin Barnes <qbarnes@gmail.com> wrote:
> >
> >> On Thu, Sep 05, 2013 at 09:57:24PM +0000, Myklebust, Trond wrote:
> >> > On Thu, 2013-09-05 at 16:36 -0500, Quentin Barnes wrote:
> >> > > On Thu, Sep 05, 2013 at 08:02:01PM +0000, Myklebust, Trond wrote:
> >> > > > On Thu, 2013-09-05 at 14:11 -0500, Quentin Barnes wrote:
> >> > > > > On Thu, Sep 05, 2013 at 12:03:03PM -0500, Malahal Naineni wrote:
> >> > > > > > Neil Brown posted a patch couple days ago for this!
> >> > > > > >
> >> > > > > > http://thread.gmane.org/gmane.linux.nfs/58473
> >> > > > >
> >> > > > > I tried Neil's patch on a v3.11 kernel.  The rebuilt kernel still
> >> > > > > exhibited the same 1000s of WRITEs/sec problem.
> >> > > > >
> >> > > > > Any other ideas?
> >> > > >
> >> > > > Yes. Please try the attached patch.
> >> > >
> >> > > Great!  That did the trick!
> >> > >
> >> > > Do you feel this patch could be worthy of pushing it upstream in its
> >> > > current state or was it just to verify a theory?
> >> > >
> >> > > In comparing the nfs_flush_incompatible() implementations between
> >> > > RHEL5 and v3.11 (without your patch), the guts of the algorithm seem
> >> > > more or less logically equivalent to me on whether or not to flush
> >> > > the page.  Also, when and where nfs_flush_incompatible() is invoked
> >> > > seems the same.  Would you provide a very brief pointer to clue me
> >> > > in as to why this problem didn't also manifest circa 2.6.18 days?
> >> >
> >> > There was no nfs_vm_page_mkwrite() to handle page faults in the 2.6.18
> >> > days, and so the risk was that your mmapped writes could end up being
> >> > sent with the wrong credentials.
> >>
> >> Ah!  You're right that nfs_vm_page_mkwrite() was missing from
> >> the original 2.6.18, so that makes sense, however, Red Hat had
> >> backported that function starting with their RHEL5.9(*) kernels,
> >> yet the problem doesn't manifest on RHEL5.9.  Maybe the answer lies
> >> somewhere in RHEL5.9's do_wp_page(), or up that call path, but
> >> glancing through it, it all looks pretty close though.
> >>
> >> (*) That was the source I was using when comparing with the 3.11 source
> >> when studying your patch since it was the last kernel known to me
> >> without the problem.
> >>
> >
> > I'm pretty sure RHEL5 has a similar problem, but it's unclear to me why
> > you're not seeing it there. I have a RHBZ open vs. RHEL5 but it's marked
> > private at the moment (I'll see about opening it up). I brought this up
> > upstream about a year ago with this strawman patch:
> >
> >     http://article.gmane.org/gmane.linux.nfs/51240
> >
> > ...at the time Trond said he was working on a set of patches to track
> > the open/lock stateid on a per-req basis. Did that approach not pan
> > out?
> >
> > Also, do you need to do a similar fix to nfs_can_coalesce_requests?
> >
> > Thanks,
> > --
> > Jeff Layton <jlayton@redhat.com>
On Mon, Sep 09, 2013 at 09:04:24AM -0400, Jeff Layton wrote:
> On Fri, 6 Sep 2013 11:48:45 -0500
> Quentin Barnes <qbarnes@gmail.com> wrote:
>
> > Jeff, can you try out my test program in the base note on your
> > RHEL5.9 or later RHEL5.x kernels?
> >
> > I reverified that running the test on a 2.6.18-348.16.1.el5 x86_64
> > kernel (latest released RHEL5.9) does not show the problem for me.
> > Based on what you and Trond have said in this thread though, I'm
> > really curious why it doesn't have the problem.
>
> I can confirm what you see on RHEL5. One difference is that RHEL5's
> page_mkwrite handler does not do wait_on_page_writeback. That was added
> as part of the stable pages work that went in a while back, so that may
> be the main difference. Adding that in doesn't seem to materially
> change things though.

Good to know you confirmed the behavior I saw on RHEL5 (just so that
I know it's not some random variable in play I had overlooked).

> In any case, what I see is that the initial program just ends up with
> two calls to nfs_vm_page_mkwrite(). They both push out a WRITE and then
> things settle down (likely because the page is still marked dirty).
>
> Eventually, another write occurs and the dirty page gets pushed out to
> the server in a small flurry of WRITEs to the same range. Then, things
> settle down again until there's another small flurry of activity.
>
> My suspicion is that there is a race condition involved here, but I'm
> unclear on where it is. I'm not 100% convinced this is a bug, but page
> fault semantics aren't my strong suit.

As a test on RHEL6, I made a trivial systemtap script for kprobing
nfs_vm_page_mkwrite() and nfs_flush_incompatible().  I wanted to
make sure this bug was limited to just the nfs module and was not a
result of some mm behavior change.

With the bug unfixed, running the test program has nfs_vm_page_mkwrite()
and nfs_flush_incompatible() called repeatedly at a very high rate
(hence all the WRITEs).

After Trond's patch, the two functions are called just at the
program's initialization and then only every 30 seconds or so.

It looks to me from the code flow that there must be something
nfs_wb_page() does that resets the need for mm to keep reinvoking
nfs_vm_page_mkwrite().  I didn't look any deeper than that for
now.  Maybe a race in how nfs_wb_page() updates status you're
thinking of?

> You may want to consider opening a "formal" RH support case if you have
> interest in getting Trond's patch backported, and/or following up on
> why RHEL5 behaves the way it does.

Yes, I'll be doing that.  When I do, I'll send you an email with the
case ticket.  Before filing it though, I want to hear back from the
group that had the original problem to make sure Trond's patch fully
addresses their problem (besides just the trivial test program).

> --
> Jeff Layton <jlayton@redhat.com>

Quentin
On Mon, 2013-09-09 at 12:32 -0500, Quentin Barnes wrote:
> On Mon, Sep 09, 2013 at 09:04:24AM -0400, Jeff Layton wrote:
> > On Fri, 6 Sep 2013 11:48:45 -0500
> > Quentin Barnes <qbarnes@gmail.com> wrote:
> >
> > > Jeff, can you try out my test program in the base note on your
> > > RHEL5.9 or later RHEL5.x kernels?
> > >
> > > I reverified that running the test on a 2.6.18-348.16.1.el5 x86_64
> > > kernel (latest released RHEL5.9) does not show the problem for me.
> > > Based on what you and Trond have said in this thread though, I'm
> > > really curious why it doesn't have the problem.
> >
> > I can confirm what you see on RHEL5. One difference is that RHEL5's
> > page_mkwrite handler does not do wait_on_page_writeback. That was added
> > as part of the stable pages work that went in a while back, so that may
> > be the main difference. Adding that in doesn't seem to materially
> > change things though.
>
> Good to know you confirmed the behavior I saw on RHEL5 (just so that
> I know it's not some random variable in play I had overlooked).
>
> > In any case, what I see is that the initial program just ends up with
> > two calls to nfs_vm_page_mkwrite(). They both push out a WRITE and then
> > things settle down (likely because the page is still marked dirty).
> >
> > Eventually, another write occurs and the dirty page gets pushed out to
> > the server in a small flurry of WRITEs to the same range. Then, things
> > settle down again until there's another small flurry of activity.
> >
> > My suspicion is that there is a race condition involved here, but I'm
> > unclear on where it is. I'm not 100% convinced this is a bug, but page
> > fault semantics aren't my strong suit.
>
> As a test on RHEL6, I made a trivial systemtap script for kprobing
> nfs_vm_page_mkwrite() and nfs_flush_incompatible().  I wanted to
> make sure this bug was limited to just the nfs module and was not a
> result of some mm behavior change.
>
> With the bug unfixed, running the test program has nfs_vm_page_mkwrite()
> and nfs_flush_incompatible() called repeatedly at a very high rate
> (hence all the WRITEs).
>
> After Trond's patch, the two functions are called just at the
> program's initialization and then only every 30 seconds or so.
>
> It looks to me from the code flow that there must be something
> nfs_wb_page() does that resets the need for mm to keep reinvoking
> nfs_vm_page_mkwrite().  I didn't look any deeper than that for
> now.  Maybe a race in how nfs_wb_page() updates status you're
> thinking of?

In RHEL-5, nfs_wb_page() is just a wrapper around nfs_sync_inode_wait(),
which does _not_ call clear_page_dirty_for_io() (and hence does not call
page_mkclean()).

That would explain it...

--
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
On Mon, 9 Sep 2013 17:47:48 +0000 "Myklebust, Trond" <Trond.Myklebust@netapp.com> wrote: > On Mon, 2013-09-09 at 12:32 -0500, Quentin Barnes wrote: > > On Mon, Sep 09, 2013 at 09:04:24AM -0400, Jeff Layton wrote: > > > On Fri, 6 Sep 2013 11:48:45 -0500 > > > Quentin Barnes <qbarnes@gmail.com> wrote: > > > > > > > Jeff, can your try out my test program in the base note on your > > > > RHEL5.9 or later RHEL5.x kernels? > > > > > > > > I reverified that running the test on a 2.6.18-348.16.1.el5 x86_64 > > > > kernel (latest released RHEL5.9) does not show the problem for me. > > > > Based on what you and Trond have said in this thread though, I'm > > > > really curious why it doesn't have the problem. > > > > > > I can confirm what you see on RHEL5. One difference is that RHEL5's > > > page_mkwrite handler does not do wait_on_page_writeback. That was added > > > as part of the stable pages work that went in a while back, so that may > > > be the main difference. Adding that in doesn't seem to materially > > > change things though. > > > > Good to know you confirmed the behavior I saw on RHEL5 (just so that > > I know it's not some random variable in play I had overlooked). > > > > > In any case, what I see is that the initial program just ends up with a > > > two calls to nfs_vm_page_mkwrite(). They both push out a WRITE and then > > > things settle down (likely because the page is still marked dirty). > > > > > > Eventually, another write occurs and the dirty page gets pushed out to > > > the server in a small flurry of WRITEs to the same range.Then, things > > > settle down again until there's another small flurry of activity. > > > > > > My suspicion is that there is a race condition involved here, but I'm > > > unclear on where it is. I'm not 100% convinced this is a bug, but page > > > fault semantics aren't my strong suit. > > > > As a test on RHEL6, I made a trivial systemtap script for kprobing > > nfs_vm_page_mkwrite() and nfs_flush_incompatible(). 
I wanted to > > make sure this bug was limited to just the nfs module and was not a > > result of some mm behavior change. > > > > With the bug unfixed running the test program, nfs_vm_page_mkwrite() > > and nfs_flush_incompatible() are called repeatedly at a very high rate > > (hence all the WRITEs). > > > > After Trond's patch, the two functions are called just at the > > program's initialization and then called only every 30 seconds or > > so. > > > > It looks like to me from the code flow that there must be something > > nfs_wb_page() does that resets the need for mm to keeping reinvoking > > nfs_vm_page_mkwrite(). I didn't look any deeper than that though > > for now. Maybe a race in how nfs_wb_page() updates status you're > > thinking of? > > In RHEL-5, nfs_wb_page() is just a wrapper to nfs_sync_inode_wait(), > which does _not_ call clear_page_dirty_for_io() (and hence does not call > page_mkclean()). > > That would explain it... > Thanks Trond, that does explain it. FWIW, at this point in the RHEL5 lifecycle I'd be disinclined to make any changes to that code without some strong justification. Backporting Trond's recent patch for RHEL6 and making sure that RHEL7 has it sounds quite reasonable though.
From 903ebaeefae78e6e03f3719aafa8fd5dd22d3288 Mon Sep 17 00:00:00 2001 From: Trond Myklebust <Trond.Myklebust@netapp.com> Date: Thu, 5 Sep 2013 15:52:51 -0400 Subject: [PATCH] NFS: Don't check lock owner compatibility in writes unless file is locked If we're doing buffered writes, and there is no file locking involved, then we don't have to worry about whether or not the lock owner information is identical. By relaxing this check, we ensure that fork()ed child processes can write to a page without having to first sync dirty data that was written by the parent to disk. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> --- fs/nfs/write.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/nfs/write.c b/fs/nfs/write.c index 40979e8..ac1dc33 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -863,7 +863,7 @@ int nfs_flush_incompatible(struct file *file, struct page *page) return 0; l_ctx = req->wb_lock_context; do_flush = req->wb_page != page || req->wb_context != ctx; - if (l_ctx) { + if (l_ctx && ctx->dentry->d_inode->i_flock != NULL) { do_flush |= l_ctx->lockowner.l_owner != current->files || l_ctx->lockowner.l_pid != current->tgid; } -- 1.8.3.1