From patchwork Mon Oct 29 01:59:14 2012
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: NeilBrown
X-Patchwork-Id: 1660501
Return-Path:
X-Original-To: patchwork-linux-nfs@patchwork.kernel.org
Delivered-To: patchwork-process-083081@patchwork1.kernel.org
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by patchwork1.kernel.org (Postfix) with ESMTP id 6F3993FCF7
	for ; Mon, 29 Oct 2012 01:58:57 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753814Ab2J2B64 (ORCPT );
	Sun, 28 Oct 2012 21:58:56 -0400
Received: from cantor2.suse.de ([195.135.220.15]:38556 "EHLO mx2.suse.de"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1753504Ab2J2B6z (ORCPT );
	Sun, 28 Oct 2012 21:58:55 -0400
Received: from relay1.suse.de (unknown [195.135.220.254])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by mx2.suse.de (Postfix) with ESMTP id 2E56C98E46;
	Mon, 29 Oct 2012 02:58:54 +0100 (CET)
Date: Mon, 29 Oct 2012 12:59:14 +1100
From: NeilBrown
To: Chuck Lever
Cc: Linux NFS Mailing List
Subject: Re: Legacy NFS client DNS resolver fails since 2.6.37
Message-ID: <20121029125914.506eb0fc@notabene.brown>
In-Reply-To: <6F448C67-E729-41E7-A09C-A49D15B50D5E@oracle.com>
References: <6F448C67-E729-41E7-A09C-A49D15B50D5E@oracle.com>
X-Mailer: Claws Mail 3.7.10 (GTK+ 2.24.7; x86_64-suse-linux-gnu)
Mime-Version: 1.0
Sender: linux-nfs-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: linux-nfs@vger.kernel.org

On Sun, 28 Oct 2012 21:03:45 -0400 Chuck Lever wrote:

> Hi Neil-
>
> To use the legacy DNS resolver for resolving hostnames in NFSv4 referrals, I've installed the /sbin/nfs_cache_getent script on my NFS client "degas."  I've confirmed it works with a 2.6.36 kernel.
>
> However, since 2.6.37 commit c5b29f885afe890f953f7f23424045cdad31d3e4 "sunrpc: use seconds since boot in expiry cache" the legacy DNS resolver appears not to work.  When attempting to follow a referral that uses a server hostname, the client fails 100% of the time to mount the referred to server with an error such as:
>
> [cel@degas example.net]$ ls home
> ls: cannot open directory home: No such file or directory
>
> The contents of the dns_resolve cache appear to indicate that there are resolution results in the cache, but the CACHE_VALID flag is not set for that entry:
>
> [root@degas dns_resolve]# cat content
> # ip address hostname ttl
> # , klimt.example.net 48
>
> klimt.example.net is the hostname that is contained in the referral.
>
> I have a second referral called "ip-address" in the same directory (domainroot), with the same content except the IP address of klimt is used instead of its hostname.  Following that second referral always works.
>
> I've tried every stable.0 release up to 3.6.0, and the behavior is roughly the same for each, which suggests that there is no upstream fix for this issue thus far.
>
> Since I've never seen a problem like this reported, I'm wondering if anyone else can confirm this issue.
>
> I have a narrow interest in fixing the legacy DNS server in stable kernels, but there may also be a latent problem with the RPC cache implementation that could spell trouble for other consumers, even post-3.6.
>
> A rough outline of how you might reproduce this:
>
> + Build and install a 2.6.37 or later kernel for your NFS client with CONFIG_NFS_USE_LEGACY_DNS=y.
>
> + Set up an NFS server with "refer=" exports.  man exports(5)
>
> + On your client, mount the server directory that contains the exports, then try to "cd" through one of the referrals.
>
> If you don't feel up to replicating the above arrangement, can you suggest cache debugging instrumentation that can be added to my client to help nail this?  Thanks for any advice!
>

Hi Chuck,
 looks like I messed up.

Every other cache uses absolute timestamps for expiry time.  The dns resolver differs from this and uses relative time stamps (ttl).  I obviously didn't understand this properly when I wrote the patch that broke things.

In particular, using get_expiry() is inappropriate in this context.

Something like this should fix it.

NeilBrown

Tested-by: Chuck Lever

diff --git a/fs/nfs/dns_resolve.c b/fs/nfs/dns_resolve.c
index 31c26c4..d9415a2 100644
--- a/fs/nfs/dns_resolve.c
+++ b/fs/nfs/dns_resolve.c
@@ -217,7 +217,7 @@ static int nfs_dns_parse(struct cache_detail *cd, char *buf, int buflen)
 {
 	char buf1[NFS_DNS_HOSTNAME_MAXLEN+1];
 	struct nfs_dns_ent key, *item;
-	unsigned long ttl;
+	unsigned int ttl;
 	ssize_t len;
 	int ret = -EINVAL;
 
@@ -240,7 +240,8 @@ static int nfs_dns_parse(struct cache_detail *cd, char *buf, int buflen)
 	key.namelen = len;
 	memset(&key.h, 0, sizeof(key.h));
 
-	ttl = get_expiry(&buf);
+	if (get_int(&buf, &ttl) < 0)
+		goto out;
 	if (ttl == 0)
 		goto out;
 	key.h.expiry_time = ttl + seconds_since_boot();
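
For illustration only, here is a minimal userspace sketch of why passing a relative ttl through a helper meant for absolute timestamps could leave the entry looking expired straight away. It is not kernel code. The names model_get_expiry(), boot_wallclock and uptime are hypothetical stand-ins, the numbers are made up, and the assumption that the helper converts by subtracting the boot time is a simplification of the behaviour described above.

```c
#include <stdio.h>

/* Hypothetical stand-ins for kernel state; the values are illustrative. */
static const long boot_wallclock = 1351400000L; /* wall-clock seconds at boot */
static const long uptime = 3600L;               /* seconds since boot, "now"  */

/*
 * Assumed model of a helper that expects an ABSOLUTE wall-clock expiry
 * and converts it to seconds since boot.
 */
long model_get_expiry(long parsed)
{
	return parsed - boot_wallclock;
}

/* Old behaviour: the relative ttl is wrongly treated as an absolute time. */
long expiry_buggy(long ttl)
{
	return model_get_expiry(ttl) + uptime;
}

/* Patched behaviour: the ttl is used as parsed, relative to "now". */
long expiry_fixed(long ttl)
{
	return ttl + uptime;
}

/* An entry counts as expired once its expiry time lies in the past. */
int is_expired(long expiry)
{
	return expiry < uptime;
}
```

With the ttl of 48 from the cache dump, expiry_buggy(48) lands far in the past in this model, so is_expired() reports true, while expiry_fixed(48) is 48 seconds ahead of the modelled uptime.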