diff mbox

/etc/mtab read ~900 times by rpc.mountd

Message ID 871spj1pfr.fsf@notabene.neil.brown.name (mailing list archive)
State New, archived
Headers show

Commit Message

NeilBrown July 14, 2017, 6:51 a.m. UTC
On Wed, Jul 12 2017, Phil Kauffman wrote:

> On 07/11/2017 07:46 PM, NeilBrown wrote:
>> So the new data shows about 7 seconds for a login, which is probably
>> a little longer than you would like, but might be acceptable?
> Unfortunately, the delay is not acceptable.
>
> The ideal would be to achieve performance parity with when one is not 
> forced to use the 'crossmnt' option.
>
> My current setup (home directory server) does not require 'crossmnt' and 
> does not incur a delay. It is a standard nfs server using mdadm, lvm, 
> and xfs.
>
> While my current setup is probably "normal" and using nested datasets 
> with the 'crossmnt' option might be "weird" now; nested mounts will 
> probably only become more common as BTRFS, ZFS, and other filesystems 
> with similar features gain traction on linux.
>
>
>> That is more than a 2-line patch.  I might have a go later this week.
> That would be super! Thanks for taking a look.

Please try this (against a clean nfs-utils.  i.e. remove the previous
patch).
It is a hack and would need to be totally re-written, but hopely the
measurements you make and strace that you report could be useful.
Also, for the strace, please use "-ttt" rather than "-tt" like I asked
before.  It is easier to find the difference between two times with
-ttt.
And add -T as well.

Thanks,
NeilBrown


From f15571d747bb6272ec539e46b6b3aee3ccc53b30 Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.com>
Date: Fri, 14 Jul 2017 16:43:37 +1000
Subject: [PATCH] testing

Signed-off-by: NeilBrown <neilb@suse.com>
---
 utils/mountd/cache.c | 105 +++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 81 insertions(+), 24 deletions(-)

Comments

Phil Kauffman July 25, 2017, 9:12 p.m. UTC | #1
On 07/14/2017 01:51 AM, NeilBrown wrote:
> Please try this (against a clean nfs-utils.  i.e. remove the previous
> patch).
> It is a hack and would need to be totally re-written, but hopely the
> measurements you make and strace that you report could be useful.
> Also, for the strace, please use "-ttt" rather than "-tt" like I asked
> before.  It is easier to find the difference between two times with
> -ttt. And add -T as well.

Sorry this took so long but I had to be sure of my results.

With your latest patch applied I am unable to mount my NFS shares and 
thus continue with the tests from before.

Since the patch you provided cannot be applied cleanly to the source 
used to build '1:1.2.8-9ubuntu12.1' for Ubuntu Xenial I chose to build a 
couple of VMs to continue this testing. They should be identical to the 
machines I was using before except they are under powered and not using 
nfs-common, nfs-kernel-server package version 1:1.2.8-9ubuntu12.1. Most 
importantly the zfs datasets and the files in /etc/exports.d are the same.

I found that utils/mountd/cache.c was last modified in 1.3.4, so I 
tested both 1.3.4 and 2.1.1 without your patch as a baseline. I was able 
to mount my shares in both versions.

I then reverted my VMs to a state before the baseline install and 
installed again with your patch applied.

I get the following with your patch applied on 1.3.4 and 2.1.1:

root@nfsclient:~# mount -avv
/                        : ignored
/boot                    : already mounted
none                     : ignored
mount.nfs4: timeout set for Tue Jul 25 15:43:13 2017
mount.nfs4: trying text-based options 
'ac,hard,rsize=16384,wsize=16384,vers=4.2,addr=10.135.24.23,clientaddr=10.135.24.22'
mount.nfs4: mount(2): No such file or directory
mount.nfs4: trying text-based options 
'ac,hard,rsize=16384,wsize=16384,addr=10.135.24.23'
mount.nfs4: prog 100003, trying vers=3, prot=6
mount.nfs4: trying 10.135.24.23 prog 100003 vers 3 prot TCP port 2049
mount.nfs4: prog 100005, trying vers=3, prot=17
mount.nfs4: trying 10.135.24.23 prog 100005 vers 3 prot UDP port 41598
mount.nfs4: mount(2): Permission denied
mount.nfs4: access denied by server while mounting 
nfsserver.cs.uchicago.edu:/homes
mount.nfs4: timeout set for Tue Jul 25 15:43:14 2017
mount.nfs4: trying text-based options 
'ac,hard,rsize=16384,wsize=16384,vers=4.2,addr=10.135.24.23,clientaddr=10.135.24.22'
mount.nfs4: mount(2): No such file or directory
mount.nfs4: trying text-based options 
'ac,hard,rsize=16384,wsize=16384,addr=10.135.24.23'
mount.nfs4: prog 100003, trying vers=3, prot=6
mount.nfs4: trying 10.135.24.23 prog 100003 vers 3 prot TCP port 2049
mount.nfs4: prog 100005, trying vers=3, prot=17
mount.nfs4: trying 10.135.24.23 prog 100005 vers 3 prot UDP port 41598
mount.nfs4: mount(2): Permission denied
mount.nfs4: access denied by server while mounting 
nfsserver.cs.uchicago.edu:/stage


Attached are my install notes and the strace during the failed mount above.
NeilBrown July 27, 2017, 4:37 a.m. UTC | #2
On Tue, Jul 25 2017, Phil Kauffman wrote:

> On 07/14/2017 01:51 AM, NeilBrown wrote:
>> Please try this (against a clean nfs-utils.  i.e. remove the previous
>> patch).
>> It is a hack and would need to be totally re-written, but hopely the
>> measurements you make and strace that you report could be useful.
>> Also, for the strace, please use "-ttt" rather than "-tt" like I asked
>> before.  It is easier to find the difference between two times with
>> -ttt. And add -T as well.
>
> Sorry this took so long but I had to be sure of my results.

It is worth being thorough.

>
> With your latest patch applied I am unable to mount my NFS shares and 
> thus continue with the tests from before.

Weird.

You are experiencing a bug that was very recently fixed, where if
mount.nfs4 gets the error ENOENT from the server, it falls back to
NFSv3.
That explains some of the noise, but doesn't explain why you get ENOENT
for the v4 mount.

The strace output you've provided doesn't even show any attempts to read
/etc/mtab, which my patch doesn't change at all.  So it seems like the
context is different in some way.

Your nfs_test_notes.txt doesn't show /etc/export.d getting filled in
... maybe that it done automatically somehow...

Can you try with unpatches 2.1.1?
Also maybe provide an strace starting before any attempt to mount
anything, and with an extra option "-s 1000"..

Thanks,
NeilBrown
Phil Kauffman July 27, 2017, 9:40 p.m. UTC | #3
On 07/26/2017 11:37 PM, NeilBrown wrote:
> Weird.
I thought so as well.

> So it seems like the context is different in some way.
It is in the sense that I am now using 1.3.4 and 2.1.1 vanilla on Ubuntu Xenial(provides 1.2.8) without any of debian's patches or niceties. 

 
> Your nfs_test_notes.txt doesn't show /etc/export.d getting filled in
> ... maybe that it done automatically somehow...
I copied the files from the original NFS server into /etc/exports.d  as part of the VM snapshot I revert back to. My notes say 'mkdir -p /etc/exports.d', but this step is not necessary.

root@nfsserver:~# ls /etc/exports.d/ | wc -l
972

I only added 3 extra files so that 'nfsclient' would be able to mount from the testing server (nfsserver).

The ZFS datasets were also precreated:

root@nfsserver:~# zfs list | wc -l
6062


> Can you try with unpatches 2.1.1?
I assume you mean without any patches. vanilla 2.1.1

> Also maybe provide an strace starting before any attempt to mount
> anything, and with an extra option "-s 1000"..

This is all that happens *before* trying to mount from the client.
root@nfsserver:~# systemctl restart nfs-server; exportfs -f; m=$(pgrep rpc.mountd); strace -s 1000 -ttt -T -p $m 2>&1 | tee strace_rpc.mountd2.txt
strace: Process 2517 attached
1501190436.659800 select(1024, [3 4 5 7 8 9 10 11 12 13 14 15 16 17 18], NULL, NULL, NULL

Stracing rpc.nfsd:
root@nfsserver:~# exportfs -f; strace -s1000 -ttt -T /usr/sbin/rpc.nfsd 0 2>&1| tee strace_nfsd.txt; /usr/sbin/sm-notify
http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/take3/strace_rpc.nfsd.txt

When mounting from the client I get a strace that is 1.3G (it seems '-s' was not respected for some reason): http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/take3/strace_rpc.mountd.txt
NeilBrown July 31, 2017, 4:23 a.m. UTC | #4
On Thu, Jul 27 2017, Phil Kauffman wrote:

> On 07/26/2017 11:37 PM, NeilBrown wrote:
>> Weird.
> I thought so as well.
>
>> So it seems like the context is different in some way.
> It is in the sense that I am now using 1.3.4 and 2.1.1 vanilla on Ubuntu Xenial(provides 1.2.8) without any of debian's patches or niceties. 
>
>  
>> Your nfs_test_notes.txt doesn't show /etc/export.d getting filled in
>> ... maybe that it done automatically somehow...
> I copied the files from the original NFS server into /etc/exports.d  as part of the VM snapshot I revert back to. My notes say 'mkdir -p /etc/exports.d', but this step is not necessary.
>
> root@nfsserver:~# ls /etc/exports.d/ | wc -l
> 972
>
> I only added 3 extra files so that 'nfsclient' would be able to mount from the testing server (nfsserver).
>
> The ZFS datasets were also precreated:
>
> root@nfsserver:~# zfs list | wc -l
> 6062
>
>
>> Can you try with unpatches 2.1.1?
> I assume you mean without any patches. vanilla 2.1.1
>
>> Also maybe provide an strace starting before any attempt to mount
>> anything, and with an extra option "-s 1000"..
>
> This is all that happens *before* trying to mount from the client.
> root@nfsserver:~# systemctl restart nfs-server; exportfs -f; m=$(pgrep rpc.mountd); strace -s 1000 -ttt -T -p $m 2>&1 | tee strace_rpc.mountd2.txt
> strace: Process 2517 attached
> 1501190436.659800 select(1024, [3 4 5 7 8 9 10 11 12 13 14 15 16 17 18], NULL, NULL, NULL
>
> Stracing rpc.nfsd:
> root@nfsserver:~# exportfs -f; strace -s1000 -ttt -T /usr/sbin/rpc.nfsd 0 2>&1| tee strace_nfsd.txt; /usr/sbin/sm-notify
> http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/take3/strace_rpc.nfsd.txt
>
> When mounting from the client I get a strace that is 1.3G (it seems '-s' was not respected for some reason): http://people.cs.uchicago.edu/~kauffman/nfs-kernel-server/take3/strace_rpc.mountd.txt 
>


I'm sorry but I cannot get anything useful from these.  It isn't crystal
clear what was being tested in each case....
Could you try:

1/ build the current 'master' branch from the nfs-utils git tree, with
   no extra patches.
2/ Install that and perform the same test which produce the 1.3G strace
   file above.  Provide the resulting strace of rpc.mountd.
3/ Tell me what happened when you mounted on the client.  Did it work as
   expected?
4/ Apply the patch I provided on top of the 'master' branch but change

		hcreate_r(1000, &hdata);

to

		hcreate_r(20000, &hdata);

5/ Install that and perform the same test as in step to, producing
   a new strace of rpc.mounted.  Presumably it will be difference.
6/ Tell me what happened when you mounted on the client.  Did it work
   as expected, or did it fail like I think it did last time.

If one worked and the other didn't, I can compare the strace and try to
find a difference.
If both worked, then we should have useful information about how much
speed up the patch provides (if any).
If neither worked, then something weird is happening...

NeilBrown
diff mbox

Patch

diff --git a/utils/mountd/cache.c b/utils/mountd/cache.c
index ca6c84f4d93d..db89a4feb7ae 100644
--- a/utils/mountd/cache.c
+++ b/utils/mountd/cache.c
@@ -18,6 +18,7 @@ 
 #include <time.h>
 #include <netinet/in.h>
 #include <arpa/inet.h>
+#include <sys/sysmacros.h>
 #include <unistd.h>
 #include <fcntl.h>
 #include <errno.h>
@@ -299,7 +300,7 @@  static const long int nonblkid_filesystems[] = {
     0        /* last */
 };
 
-static int uuid_by_path(char *path, int type, size_t uuidlen, char *uuid)
+static int uuid_by_path(char *path, struct statfs64 *stp, int type, size_t uuidlen, char *uuid)
 {
 	/* get a uuid for the filesystem found at 'path'.
 	 * There are several possible ways of generating the
@@ -334,12 +335,17 @@  static int uuid_by_path(char *path, int type, size_t uuidlen, char *uuid)
 	const char *val;
 	int rc;
 
-	rc = statfs64(path, &st);
+	if (stp)
+		rc = 0;
+	else {
+		stp = &st;
+		rc= statfs64(path, stp);
+	}
 
 	if (type == 0 && rc == 0) {
 		const long int *bad;
 		for (bad = nonblkid_filesystems; *bad; bad++) {
-			if (*bad == st.f_type)
+			if (*bad == stp->f_type)
 				break;
 		}
 		if (*bad == 0)
@@ -347,9 +353,9 @@  static int uuid_by_path(char *path, int type, size_t uuidlen, char *uuid)
 	}
 
 	if (rc == 0 &&
-	    (st.f_fsid.__val[0] || st.f_fsid.__val[1]))
+	    (stp->f_fsid.__val[0] || stp->f_fsid.__val[1]))
 		snprintf(fsid_val, 17, "%08x%08x",
-			 st.f_fsid.__val[0], st.f_fsid.__val[1]);
+			 stp->f_fsid.__val[0], stp->f_fsid.__val[1]);
 	else
 		fsid_val[0] = 0;
 
@@ -603,25 +609,64 @@  static int parse_fsid(int fsidtype, int fsidlen, char *fsid,
 	return 0;
 }
 
+#define entry FOO
+#include <search.h>
+struct pinfo {
+	unsigned int stb:1, mntp:1, stfs:1;
+	unsigned int mountpoint:1;
+	time_t valid;
+	struct stat statbuf;
+	struct statfs64 statfsbuf;
+};
+static struct hsearch_data hdata;
+static int hdata_init = 0;
+
+
 static bool match_fsid(struct parsed_fsid *parsed, nfs_export *exp, char *path)
 {
-	struct stat stb;
+	struct stat *stb;
 	int type;
 	char u[16];
+	struct pinfo *pi;
+	ENTRY ent, *ret;
 
-	if (stat(path, &stb) != 0)
-		return false;
-	if (!S_ISDIR(stb.st_mode) && !S_ISREG(stb.st_mode))
+	if (!hdata_init) {
+		hcreate_r(1000, &hdata);
+		hdata_init = 1;
+	}
+	ent.key = path;
+	if (hsearch_r(ent, FIND, &ret, &hdata) == 0) {
+		ent.key = strdup(path);
+		pi = xmalloc(sizeof(*pi));
+		pi->stb = pi->mntp = pi->stfs = 0;
+		pi->valid = 0;
+		ent.data = pi;
+		hsearch_r(ent, ENTER, &ret, &hdata);
+	} else
+		pi = ret->data;
+
+	if (pi->valid < time(NULL)+120) {
+		pi->stb = pi->mntp = pi->stfs = 0;
+		pi->valid = time(NULL);
+	}
+
+	stb = &pi->statbuf;
+	if (!pi->stb) {
+		if (stat(path, stb) != 0)
+			return false;
+		pi->stb = 1;
+	}
+	if (!S_ISDIR(stb->st_mode) && !S_ISREG(stb->st_mode))
 		return false;
 
 	switch (parsed->fsidtype) {
 	case FSID_DEV:
 	case FSID_MAJOR_MINOR:
 	case FSID_ENCODE_DEV:
-		if (stb.st_ino != parsed->inode)
+		if (stb->st_ino != parsed->inode)
 			return false;
-		if (parsed->major != major(stb.st_dev) ||
-		    parsed->minor != minor(stb.st_dev))
+		if (parsed->major != major(stb->st_dev) ||
+		    parsed->minor != minor(stb->st_dev))
 			return false;
 		return true;
 	case FSID_NUM:
@@ -631,12 +676,16 @@  static bool match_fsid(struct parsed_fsid *parsed, nfs_export *exp, char *path)
 		return true;
 	case FSID_UUID4_INUM:
 	case FSID_UUID16_INUM:
-		if (stb.st_ino != parsed->inode)
+		if (stb->st_ino != parsed->inode)
 			return false;
 		goto check_uuid;
 	case FSID_UUID8:
 	case FSID_UUID16:
-		if (!is_mountpoint(path))
+		if (!pi->mntp) {
+			pi->mountpoint = is_mountpoint(path);
+			pi->mntp = 1;
+		}
+		if (!pi->mountpoint)
 			return false;
 	check_uuid:
 		if (exp->m_export.e_uuid) {
@@ -644,12 +693,18 @@  static bool match_fsid(struct parsed_fsid *parsed, nfs_export *exp, char *path)
 			if (memcmp(u, parsed->fhuuid, parsed->uuidlen) == 0)
 				return true;
 		}
-		else
+		else {
+			if (!pi->stfs) {
+				if (statfs64(path, &pi->statfsbuf) != 0)
+					return false;
+				pi->stfs = 1;
+			}
 			for (type = 0;
-			     uuid_by_path(path, type, parsed->uuidlen, u);
+			     uuid_by_path(path, &pi->statfsbuf, type, parsed->uuidlen, u);
 			     type++)
 				if (memcmp(u, parsed->fhuuid, parsed->uuidlen) == 0)
 					return true;
+		}
 	}
 	/* Well, unreachable, actually: */
 	return false;
@@ -727,10 +782,18 @@  static void nfsd_fh(int f)
 		for (exp = exportlist[i].p_head; exp; exp = next_exp) {
 			char *path;
 
+			if (!is_ipaddr_client(dom)) {
+				if (!namelist_client_matches(exp, dom))
+					continue;
+			} else {
+				if (!ipaddr_client_matches(exp, ai))
+					continue;
+			}
+
 			if (exp->m_export.e_flags & NFSEXP_CROSSMOUNT) {
 				static nfs_export *prev = NULL;
 				static void *mnt = NULL;
-				
+
 				if (prev == exp) {
 					/* try a submount */
 					path = next_mnt(&mnt, exp->m_export.e_path);
@@ -751,9 +814,6 @@  static void nfsd_fh(int f)
 				next_exp = exp->m_next;
 			}
 
-			if (!is_ipaddr_client(dom)
-					&& !namelist_client_matches(exp, dom))
-				continue;
 			if (exp->m_export.e_mountpoint &&
 			    !is_mountpoint(exp->m_export.e_mountpoint[0]?
 					   exp->m_export.e_mountpoint:
@@ -762,9 +822,6 @@  static void nfsd_fh(int f)
 
 			if (!match_fsid(&parsed, exp, path))
 				continue;
-			if (is_ipaddr_client(dom)
-					&& !ipaddr_client_matches(exp, ai))
-				continue;
 			if (!found || subexport(&exp->m_export, found)) {
 				found = &exp->m_export;
 				free(found_path);
@@ -906,7 +963,7 @@  static int dump_to_cache(int f, char *buf, int buflen, char *domain,
 		write_secinfo(&bp, &blen, exp, flag_mask);
 		if (exp->e_uuid == NULL || different_fs) {
 			char u[16];
-			if (uuid_by_path(path, 0, 16, u)) {
+			if (uuid_by_path(path, NULL, 0, 16, u)) {
 				qword_add(&bp, &blen, "uuid");
 				qword_addhex(&bp, &blen, u, 16);
 			}