diff mbox

Bugs / Patch in nfsd

Message ID 20131118124406.GA46678@colin.muc.de (mailing list archive)
State New, archived
Headers show

Commit Message

Albert Fluegel Nov. 18, 2013, 12:44 p.m. UTC
Hello,

i posted Bug 1028439 to bugzilla.redhat.com and was asked to post the patch
to this mailing list for discussion and possibly upstream fix.

As most of the followers here probably do not have access to the redhat
bugzilla, i'll repeat the most important parts of the report here. A proposed
patch is attached. Sorry, this is not short, but it's not a simle topic

Description of problem:
When a Solaris 2.5, 2.6, 7 or Solaris 8 client uses a Linux
NFS (version 3) server, the directories are messed up over NFS,
many files not found. As an example, GNU make 3.81 is tried to build.
After the configure step running make the Makefile is not found.
Some lines from truss showing an inconsistency:
...
stat(".", 0xFFBEE628)                           = 0
open64(".", O_RDONLY|O_NDELAY)                  = 3
fcntl(3, F_SETFD, 0x00000001)                   = 0
fstat64(3, 0xFFBEE4F8)                          = 0
getdents64(3, 0x00054FE0, 1048)                 = 1024
close(3)                                        = 0
stat("GNUmakefile", 0xFFBEE708)                 Err#2 ENOENT
stat("makefile", 0xFFBEE708)                    Err#2 ENOENT
stat("Makefile", 0xFFBEE708)                    = 0
makewrite(2, " m a k e", 4)                             = 4
: *** write(2, " :   * * *  ", 6)                       = 6
No targets specified and no makefile foundwrite(2, " N o   t a r g e t s   s".., 42)    = 42

prompt% ls Makefile
Makefile

The problem does not show up with Linux or e.g. SunOS-4 NFS clients.

What i've seen on the network is, that the
Linux NFS server replies among other things to a "Check access permission"
the following:

NFS:    File type = 2 (Directory)
NFS:    Mode = 040755

A netapp server replies here:
NFS:    File type = 2 (Directory)
NFS:    Mode = 0755

The RFC 1813 i read:
   fattr3

      struct fattr3 {
         ftype3     type;
         mode3      mode;
         uint32     nlink;
...
For the mode bits only the lowest 9 are defined in the RFC

The problem occurs with the kernels 2.6.18-348.3.1.el5 upward for RHEL5,
2.6.32-358.18.1.el6 upward and some versions earlier and with Fedora kernel
3.9.10-100 on the NFS server

There seem to be several issues, all caused by the 64 bit cookies enabled,
directly or indirectly.
One change:
diff -r kernel-2.6.18-308/linux-2.6.18-308.11.1.el5.i386/fs/nfsd/vfs.c kernel-2.6.18-348/linux-2.6.18-348.3.1.el5.i386/fs/nfsd/vfs.c
725a727,733
> 	else {
> 		if (may_flags & NFSD_MAY_64BIT_COOKIE)
> 			(*filp)->f_mode |= FMODE_64BITHASH;
> 		else
> 			(*filp)->f_mode |= FMODE_32BITHASH;
> 	}
> 

makes bits set in the mode field of the
RPC reply, that are used internally by the kernel.
They should really not go into the RPC reply.
imo these internally used bits should be set in an own component of
the struct and not in f_mode (like the f_type, which is separate).
Probably for NFSv4 this must be fixed, too.

The other problem is, that the nfsd_readdir seems not to find cookies
or at least does not position the read pointer correctly and starts
reading the directory anew, causing the (Solaris) client to be in an
endless loop. The cookie returned in a "Read Directory" reply is actually
32 bit and with the next query issued with this (identical) cookie
the Linux NFS server replies with the directories started anew.
I don't know, in how far the cookies depend on the client. However,
with a Solaris client i consider it worth noting, that in the reply
to the directory read the upper 32 bits are either all 0 or all 1
(0xffffffff). With a Linux client, they are either 0 or have some
random value, but not constantly 0xffffffff .

How to definitely correctly fix this cookie oddity is imo up to the
maintainers. In the meantime i propose the attached patch to mask
out the internally used FMODE_*BITHASH flags from the mode in the
RPC reply and to use only 32 bit cookies.

Regarding the cookie thing i don't think the clients misbehave.
Linux clients seem to evaluate an entire reply to a "Read Directory" and
use the cookie of the last received entry for the next query. Solaris in
my test case with the unpacked GNU make 3.81 uses about 60% of the entries
and puts the cookie of the next one into the next "Read Directory"
NFS query to the server. As far as i can see in the wireshark evaluating
the network trace, the cookie is 100 % correct, but the Linux NFS server
starts with the beginning of the directory again in the next reply.
Could be, all cookies except the last one of the query are actually
unusable and the problem is not seen on a Linux NFS client, because it
always takes the last cookie for the next query.

The attached patches fix the problem with Solaris clients. With HP-UX
there's still trouble (stale filehandle all the time).
See the names of the patch files regarding for what kernel version they fit.
The first part of each fixes the higher bits problem in f_mode. The second
part disabled 64 bit cookies.

Thank you for looking into this.
Best Regards,
 Albert Fluegel

Comments

Christoph Hellwig Nov. 18, 2013, 1 p.m. UTC | #1
On Mon, Nov 18, 2013 at 01:44:06PM +0100, Albert Fluegel wrote:
> directly or indirectly.
> One change:
> diff -r kernel-2.6.18-308/linux-2.6.18-308.11.1.el5.i386/fs/nfsd/vfs.c kernel-2.6.18-348/linux-2.6.18-348.3.1.el5.i386/fs/nfsd/vfs.c
> 725a727,733
> > 	else {
> > 		if (may_flags & NFSD_MAY_64BIT_COOKIE)
> > 			(*filp)->f_mode |= FMODE_64BITHASH;
> > 		else
> > 			(*filp)->f_mode |= FMODE_32BITHASH;
> > 	}
> > 
> 
> makes bits set in the mode field of the
> RPC reply, that are used internally by the kernel.
>
> They should really not go into the RPC reply.
> imo these internally used bits should be set in an own component of
> the struct and not in f_mode (like the f_type, which is separate).
> Probably for NFSv4 this must be fixed, too.

I can't see how NFSD ever exports f_mode.  This is a kernel-internal
field that doesn't make a whole lot of sense over the wire.  What gets
sent in nfsd GETATTR operations is the mode field from the stat (or more
specificly the kstat) structure, which contain entirely different
constants.

I have to admit that I still don't quite understand how the f_mode
access in nfsd_open() is supposed to work race-free, though.

> The other problem is, that the nfsd_readdir seems not to find cookies
> or at least does not position the read pointer correctly and starts
> reading the directory anew, causing the (Solaris) client to be in an
> endless loop. The cookie returned in a "Read Directory" reply is actually
> 32 bit and with the next query issued with this (identical) cookie
> the Linux NFS server replies with the directories started anew.
> I don't know, in how far the cookies depend on the client. However,
> with a Solaris client i consider it worth noting, that in the reply
> to the directory read the upper 32 bits are either all 0 or all 1
> (0xffffffff). With a Linux client, they are either 0 or have some
> random value, but not constantly 0xffffffff .

Just curious, do you see the same issues if not using ext3/ext4 which
have a the 64-bit directory cookie issue but XFS instead?

>  	*p++ = htonl(nfs3_ftypes[(stat->mode & S_IFMT) >> 12]);
> -	*p++ = htonl((u32) stat->mode);
> +	*p++ = htonl((u32) (stat->mode & (S_IRWXU | S_IRWXG | S_IRWXO | S_ISUID | S_ISGID | S_ISVTX)));

Should be S_IALLUGO instead of manually repeating the flags.  While this
is a good practice and I'm in favour of it, I'd love to know what other
value you see in the mode field.  Do you have a wireshark dump or a
trace?

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
J. Bruce Fields Nov. 18, 2013, 5:01 p.m. UTC | #2
On Mon, Nov 18, 2013 at 05:00:26AM -0800, Christoph Hellwig wrote:
> On Mon, Nov 18, 2013 at 01:44:06PM +0100, Albert Fluegel wrote:
> >  	*p++ = htonl(nfs3_ftypes[(stat->mode & S_IFMT) >> 12]);
> > -	*p++ = htonl((u32) stat->mode);
> > +	*p++ = htonl((u32) (stat->mode & (S_IRWXU | S_IRWXG | S_IRWXO | S_ISUID | S_ISGID | S_ISVTX)));
> 
> Should be S_IALLUGO instead of manually repeating the flags.  While this
> is a good practice and I'm in favour of it, I'd love to know what other
> value you see in the mode field.  Do you have a wireshark dump or a
> trace?

Just tracing a v3 mount I can confirm that S_IFDIR (0x4000, 040000)
gets set in GETATTR replies.

Which definitely looks wrong.  But we've been doing this forever (as far
as I can see nfs3xdr.c got added in 2.1.32, and the code was the same
way there), so I wonder if there's any undocumented lore about use of
these high bits.

Also if a client's actually behaving differently depending on how those
high bits are set, that might be worth reporting to the client
implementers, as it may represent an unintentional failure to sanitize
the returned bits.

Anyway, absent objections my default is to queue this up for 3.14 (using
S_IALLUGO).

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Nov. 18, 2013, 5:23 p.m. UTC | #3
On Mon, Nov 18, 2013 at 12:01:32PM -0500, J. Bruce Fields wrote:
> Just tracing a v3 mount I can confirm that S_IFDIR (0x4000, 040000)
> gets set in GETATTR replies.

Oh. got it.  We set the stat file type, nothing related to the 32-bit
readdir cookie.

> Anyway, absent objections my default is to queue this up for 3.14 (using
> S_IALLUGO).

Yeah, that part looks good to me.  But it seem Albert still has an issue
with the directory offsets.  Might be worth to get the ext4 developes
in the loop for that, as I doubt it's an issue with the nfsd code.

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
J. Bruce Fields Nov. 18, 2013, 5:28 p.m. UTC | #4
On Mon, Nov 18, 2013 at 01:44:06PM +0100, Albert Fluegel wrote:
> Hello,
> 
> i posted Bug 1028439 to bugzilla.redhat.com and was asked to post the patch
> to this mailing list for discussion and possibly upstream fix.
> 
> As most of the followers here probably do not have access to the redhat
> bugzilla, i'll repeat the most important parts of the report here. A proposed
> patch is attached. Sorry, this is not short, but it's not a simle topic
> 
> Description of problem:
> When a Solaris 2.5, 2.6, 7 or Solaris 8 client uses a Linux
> NFS (version 3) server, the directories are messed up over NFS,
> many files not found. As an example, GNU make 3.81 is tried to build.
> After the configure step running make the Makefile is not found.
> Some lines from truss showing an inconsistency:
> ...
> stat(".", 0xFFBEE628)                           = 0
> open64(".", O_RDONLY|O_NDELAY)                  = 3
> fcntl(3, F_SETFD, 0x00000001)                   = 0
> fstat64(3, 0xFFBEE4F8)                          = 0
> getdents64(3, 0x00054FE0, 1048)                 = 1024
> close(3)                                        = 0
> stat("GNUmakefile", 0xFFBEE708)                 Err#2 ENOENT
> stat("makefile", 0xFFBEE708)                    Err#2 ENOENT
> stat("Makefile", 0xFFBEE708)                    = 0
> makewrite(2, " m a k e", 4)                             = 4
> : *** write(2, " :   * * *  ", 6)                       = 6
> No targets specified and no makefile foundwrite(2, " N o   t a r g e t s   s".., 42)    = 42
> 
> prompt% ls Makefile
> Makefile
> 
> The problem does not show up with Linux or e.g. SunOS-4 NFS clients.
> 
> What i've seen on the network is, that the
> Linux NFS server replies among other things to a "Check access permission"
> the following:
> 
> NFS:    File type = 2 (Directory)
> NFS:    Mode = 040755
> 
> A netapp server replies here:
> NFS:    File type = 2 (Directory)
> NFS:    Mode = 0755
> 
> The RFC 1813 i read:
>    fattr3
> 
>       struct fattr3 {
>          ftype3     type;
>          mode3      mode;
>          uint32     nlink;
> ...
> For the mode bits only the lowest 9 are defined in the RFC
> 
> The problem occurs with the kernels 2.6.18-348.3.1.el5 upward for RHEL5,
> 2.6.32-358.18.1.el6 upward and some versions earlier and with Fedora kernel
> 3.9.10-100 on the NFS server
> 
> There seem to be several issues, all caused by the 64 bit cookies enabled,
> directly or indirectly.
> One change:
> diff -r kernel-2.6.18-308/linux-2.6.18-308.11.1.el5.i386/fs/nfsd/vfs.c kernel-2.6.18-348/linux-2.6.18-348.3.1.el5.i386/fs/nfsd/vfs.c
> 725a727,733
> > 	else {
> > 		if (may_flags & NFSD_MAY_64BIT_COOKIE)
> > 			(*filp)->f_mode |= FMODE_64BITHASH;
> > 		else
> > 			(*filp)->f_mode |= FMODE_32BITHASH;
> > 	}
> > 
> 
> makes bits set in the mode field of the
> RPC reply, that are used internally by the kernel.

It appears to me that you're just confused by the naming of the field;
"f_mode" has nothing to do with a file's mode bits.

Have you actually checked that turning off FMODE_64BITHASH changes the
returned mode bits?  If so, that would be interesting (and extremely
surprising).

> The other problem is, that the nfsd_readdir seems not to find cookies
> or at least does not position the read pointer correctly and starts
> reading the directory anew, causing the (Solaris) client to be in an
> endless loop. The cookie returned in a "Read Directory" reply is actually
> 32 bit and with the next query issued with this (identical) cookie
> the Linux NFS server replies with the directories started anew.
> I don't know, in how far the cookies depend on the client. However,
> with a Solaris client i consider it worth noting, that in the reply
> to the directory read the upper 32 bits are either all 0 or all 1
> (0xffffffff). With a Linux client, they are either 0 or have some
> random value, but not constantly 0xffffffff .

Apologies, I don't understand this description.  Best would be if you
could send a packet capture showing the described behavior.  (When you
say "in the reply to the directory read", you're referring to cookies in
the READDIR reply?  Those should be identical regardless of which client
requests them.)

> Regarding the cookie thing i don't think the clients misbehave.
> Linux clients seem to evaluate an entire reply to a "Read Directory" and
> use the cookie of the last received entry for the next query. Solaris in
> my test case with the unpacked GNU make 3.81 uses about 60% of the entries
> and puts the cookie of the next one into the next "Read Directory"
> NFS query to the server. As far as i can see in the wireshark evaluating
> the network trace, the cookie is 100 % correct, but the Linux NFS server
> starts with the beginning of the directory again in the next reply.
> Could be, all cookies except the last one of the query are actually
> unusable and the problem is not seen on a Linux NFS client, because it
> always takes the last cookie for the next query.

There was indeed a known RHEL-only bug which corrupted all but the final
cookie in a READDIR reply, fixed in RHEL5 kernel -353.  This should not
be reproduceable upstream.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
J. Bruce Fields Nov. 18, 2013, 5:37 p.m. UTC | #5
On Mon, Nov 18, 2013 at 09:23:15AM -0800, Christoph Hellwig wrote:
> On Mon, Nov 18, 2013 at 12:01:32PM -0500, J. Bruce Fields wrote:
> > Just tracing a v3 mount I can confirm that S_IFDIR (0x4000, 040000)
> > gets set in GETATTR replies.
> 
> Oh. got it.  We set the stat file type, nothing related to the 32-bit
> readdir cookie.
> 
> > Anyway, absent objections my default is to queue this up for 3.14 (using
> > S_IALLUGO).
> 
> Yeah, that part looks good to me.  But it seem Albert still has an issue
> with the directory offsets.  Might be worth to get the ext4 developes
> in the loop for that, as I doubt it's an issue with the nfsd code.

One problem he's seeing was RHEL5-specific, the other is the known ext4
problem that's been discussed before.

(Basically, ext4 has a tradeoff between correctness, lookup performance,
and compatibility with some buggy old clients:

	1. turn off dir_index and performance on large directories may
	   suffer, but it's correct and any client will be happy.
	2. turn on dir_index and return 32-bit cookies: now you get
	   directory loops on large directories due to random hash
	   collisions.
	3. turn on dir_index and return 64-bit cookies: some clients seem
	   to then return errors to 32-bit applications doing readdirs.
	   Cookies have been 64-bit since NFSv3 and 32-bit Linux clients
	   deal with this fine (it fakes up its own small integer offsets
	   to return to applications), but apparently some other clients
	   return errors on readdir.

So currently we default to 3 and if people complain, tell them to turn
off dir_index and complain to their client vendor....)

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Nov. 18, 2013, 5:47 p.m. UTC | #6
On Mon, Nov 18, 2013 at 12:37:31PM -0500, J. Bruce Fields wrote:
> One problem he's seeing was RHEL5-specific, the other is the known ext4
> problem that's been discussed before.

Ok, didn't know there was a RHEL bug.

> (Basically, ext4 has a tradeoff between correctness, lookup performance,
> and compatibility with some buggy old clients:
> 
> 	1. turn off dir_index and performance on large directories may
> 	   suffer, but it's correct and any client will be happy.
> 	2. turn on dir_index and return 32-bit cookies: now you get
> 	   directory loops on large directories due to random hash
> 	   collisions.
> 	3. turn on dir_index and return 64-bit cookies: some clients seem
> 	   to then return errors to 32-bit applications doing readdirs.
> 	   Cookies have been 64-bit since NFSv3 and 32-bit Linux clients
> 	   deal with this fine (it fakes up its own small integer offsets
> 	   to return to applications), but apparently some other clients
> 	   return errors on readdir.
> 
> So currently we default to 3 and if people complain, tell them to turn
> off dir_index and complain to their client vendor....)

Or people could use a filesystem taking care of that.  In addition to
taking care of the directory index XFS also has other advantages like
much better support to get the metadata operations to disk a lot more
efficiently.


... working hard on my marketing skills..

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
J. Bruce Fields Nov. 20, 2013, 4:45 p.m. UTC | #7
On Wed, Nov 20, 2013 at 05:28:10PM +0100, Albert Fluegel wrote:
> On Mon, Nov 18, 2013 at 12:37:31PM -0500, J. Bruce Fields wrote:
> > > > Anyway, absent objections my default is to queue this up for 3.14 (using
> > > > S_IALLUGO).
> This is great ! Thank you.
> 
> > > > ...
> > One problem he's seeing was RHEL5-specific, the other is the known ext4
> > problem that's been discussed before.
> > 
> > (Basically, ext4 has a tradeoff between correctness, lookup performance,
> > and compatibility with some buggy old clients:
> > 
> > 	1. turn off dir_index and performance on large directories may
> > 	   suffer, but it's correct and any client will be happy.
> > 	2. turn on dir_index and return 32-bit cookies: now you get
> > 	   directory loops on large directories due to random hash
> > 	   collisions.
> > 	3. turn on dir_index and return 64-bit cookies: some clients seem
> > 	   to then return errors to 32-bit applications doing readdirs.
> > 	   Cookies have been 64-bit since NFSv3 and 32-bit Linux clients
> > 	   deal with this fine (it fakes up its own small integer offsets
> > 	   to return to applications), but apparently some other clients
> > 	   return errors on readdir.
> > 
> > So currently we default to 3 and if people complain, tell them to turn
> > off dir_index and complain to their client vendor....)
> I agree with that. Did some "research" in the meantime. It's a real abyss.
> I think it does not make much sense to continue this thread. Thanks to
> all contributors bringing more light into this.
> 
> So this is for the records:
> With current RHEL5/6 + ext3 there is no problem over NFS. With ext4 + dir_index
> Solaris-8 fails with EOVERFLOW on a directory read. Solaris-2.5.1 complains
> (RPC: Can't decode result). There are 2 differences when turning off dir_index:
> The cookies have very low values then (in contrast to using all 64 bits with
> dir_index on) and the order returned by readdir is different (does not start
> with . and ..) Don't know, which one makes which Solaris fail.
> HP-UX fails differently on a ext4, even with dir_index turned off, but not
> always. If in the reply of a getattr the nanoseconds are not 0, HPUX fails
> with "stale file handle".

This is in the ctime/mtime/atime fields?

> Could it be, it mixes some of these bytes into the handle ?

More likely some sort of bug when they try to fill their attribute cache
for the new file.

Anyway, sounds like a pretty egregious client bug if that's accurate.  I
don't know if there's any easy way to force ext4 to truncate those
times.  Probably the only workaround is to stick to ext3.

--b.

> If in the reply the nanoseconds are all 0, HPUX works even
> with 64 bit cookies (dir_index on) on an ext4. On a xfs they all work.
> In the NFS replies on an xfs i've seen all nanoseconds set to 0, so this is
> consistent and the faulty behaviour seems definitely on the client side.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff -ru linux-2.6.32-358.23.2.el6.x86_64/fs/nfsd/nfs3xdr.c linux-2.6.32-358.23.2.el6.x86_64.paf/fs/nfsd/nfs3xdr.c
--- linux-2.6.32-358.23.2.el6.x86_64.paf/fs/nfsd/nfs3xdr.c	2013-11-15 15:54:26.648622688 +0100
+++ linux-2.6.32-358.23.2.el6.x86_64/fs/nfsd/nfs3xdr.c	2013-09-14 10:52:32.000000000 +0200
@@ -163,7 +163,7 @@ 
 	      struct kstat *stat)
 {
 	*p++ = htonl(nfs3_ftypes[(stat->mode & S_IFMT) >> 12]);
-	*p++ = htonl((u32) stat->mode);
+	*p++ = htonl((u32) (stat->mode & (S_IRWXU | S_IRWXG | S_IRWXO | S_ISUID | S_ISGID | S_ISVTX)));
 	*p++ = htonl((u32) stat->nlink);
 	*p++ = htonl((u32) nfsd_ruid(rqstp, stat->uid));
 	*p++ = htonl((u32) nfsd_rgid(rqstp, stat->gid));
diff -ru linux-2.6.32-358.23.2.el6.x86_64.paf/fs/nfsd/vfs.c linux-2.6.32-358.23.2.el6.x86_64/fs/nfsd/vfs.c
--- linux-2.6.32-358.23.2.el6.x86_64/fs/nfsd/vfs.c	2013-11-15 15:57:10.802262655 +0100
+++ linux-2.6.32-358.23.2.el6.x86_64.paf/fs/nfsd/vfs.c	2013-09-14 10:53:17.000000000 +0200
@@ -784,7 +784,7 @@ 
 	else {
 		host_err = ima_file_check(*filp, may_flags);
 
-		if (may_flags & NFSD_MAY_64BIT_COOKIE)
+		if (may_flags & NFSD_MAY_64BIT_COOKIE && 0)
 			(*filp)->f_mode |= FMODE_64BITHASH;
 		else
 			(*filp)->f_mode |= FMODE_32BITHASH;