diff mbox

Question on nfs40_discover_server_trunking.

Message ID 50F9CDDF.4070609@candelatech.com (mailing list archive)
State New, archived
Headers show

Commit Message

Ben Greear Jan. 18, 2013, 10:34 p.m. UTC
On 01/18/2013 02:29 PM, Chuck Lever wrote:
>
> On Jan 18, 2013, at 4:59 PM, Ben Greear <greearb@candelatech.com> wrote:
>
>> On 01/18/2013 01:33 PM, Chuck Lever wrote:
>>>
>>> On Jan 18, 2013, at 4:28 PM, Ben Greear <greearb@candelatech.com> wrote:
>>>
>>>> Any chance the STALE_CLIENTID case needs a 'break'?
>>>
>>> I don't think so.  LEASE_CONFIRM is set, and we want to wake the state renewal thread.
>>>
>>>>
>>>> Twice I've seen kernel crashes after the nfs40_walk_client_list
>>>> failed (though code comments say it should never fail).
>>>
>>> nfs40_walk_client_list() is looking for an nfs_client that is supposed to already be in the nfs_client list.  If the search fails, that's a bug.
>>>
>>> Eyeball the contents of your nfs_client list.  You should find an appropriate nfs_client in there, and then figure out why the search doesn't find it.
>>
>> Ok, I think I found another issue.
>>
>> nfs4_client_init does not initialize 'old', but passes it to nfs4_discover_server_trunking.
>>
>> That gets passed to the detect_trunking operation, which is nfs40_discover_server_trunking
>> or nfs41_discover_server_trunking.
>
> Yes, *result is an output parameter, not an input parameter.
>
>> This will call walk_client_list, which also may not ever assign a value to
>> 'result'.
>
> It should always assign *result in the SUCCESS case.
>
>> The code in walk_client_list always dereferences result, however.
>>
>> So, that is probably why my system blows up shortly after the 'impossible'
>> error message...
>
> In particular, nfs4_walk_client_list() does not set *result when it returns NFS4ERR_STALE_CLIENTID.  In nfs4_discover_server_trunking(), should we therefore have this:
>
>       status = nfs40_walk_client_list(clp, result, cred);
>       switch (status) {
>       case -NFS4ERR_STALE_CLIENTID:
>               set_bit(NFS4CLNT_LEASE_CONFIRM, &clp->cl_state);
>>>            nfs4_schedule_state_renewal(clp);
>>>            break;
>       case 0:
>
> I'm not sure the nfs4_schedule_state_renewal() call is necessary here.


I'm not sure I follow you entirely..but I'm going to start testing this patch in
a minute.  Looks like it should fix my crash, and maybe give clues about
why we get to the error case in the first place.

Comments

Chuck Lever Jan. 18, 2013, 10:43 p.m. UTC | #1
On Jan 18, 2013, at 5:34 PM, Ben Greear <greearb@candelatech.com> wrote:

> On 01/18/2013 02:29 PM, Chuck Lever wrote:
>> 
>> On Jan 18, 2013, at 4:59 PM, Ben Greear <greearb@candelatech.com> wrote:
>> 
>>> On 01/18/2013 01:33 PM, Chuck Lever wrote:
>>>> 
>>>> On Jan 18, 2013, at 4:28 PM, Ben Greear <greearb@candelatech.com> wrote:
>>>> 
>>>>> Any chance the STALE_CLIENTID case needs a 'break'?
>>>> 
>>>> I don't think so.  LEASE_CONFIRM is set, and we want to wake the state renewal thread.
>>>> 
>>>>> 
>>>>> Twice I've seen kernel crashes after the nfs40_walk_client_list
>>>>> failed (though code comments say it should never fail).
>>>> 
>>>> nfs40_walk_client_list() is looking for an nfs_client that is supposed to already be in the nfs_client list.  If the search fails, that's a bug.
>>>> 
>>>> Eyeball the contents of your nfs_client list.  You should find an appropriate nfs_client in there, and then figure out why the search doesn't find it.
>>> 
>>> Ok, I think I found another issue.
>>> 
>>> nfs4_client_init does not initialize 'old', but passes it to nfs4_discover_server_trunking.
>>> 
>>> That gets passed to the detect_trunking operation, which is nfs40_discover_server_trunking
>>> or nfs41_discover_server_trunking.
>> 
>> Yes, *result is an output parameter, not an input parameter.
>> 
>>> This will call walk_client_list, which also may not ever assign a value to
>>> 'result'.
>> 
>> It should always assign *result in the SUCCESS case.
>> 
>>> The code in walk_client_list always dereferences result, however.
>>> 
>>> So, that is probably why my system blows up shortly after the 'impossible'
>>> error message...
>> 
>> In particular, nfs4_walk_client_list() does not set *result when it returns NFS4ERR_STALE_CLIENTID.  In nfs4_discover_server_trunking(), should we therefore have this:
>> 
>>      status = nfs40_walk_client_list(clp, result, cred);
>>      switch (status) {
>>      case -NFS4ERR_STALE_CLIENTID:
>>              set_bit(NFS4CLNT_LEASE_CONFIRM, &clp->cl_state);
>>>>           nfs4_schedule_state_renewal(clp);
>>>>           break;
>>      case 0:
>> 
>> I'm not sure the nfs4_schedule_state_renewal() call is necessary here.
> 
> 
> I'm not sure I follow you entirely..but I'm going to start testing this patch in
> a minute.  Looks like it should fix my crash, and maybe give clues about
> why we get to the error case in the first place.
> 
> 
> diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
> index d6b39a9..3188283 100644
> --- a/fs/nfs/nfs4client.c
> +++ b/fs/nfs/nfs4client.c
> @@ -307,6 +307,8 @@ int nfs40_walk_client_list(struct nfs_client *new,
>        };
>        int status;
> 
> +       *result = NULL;
> +
>        spin_lock(&nn->nfs_client_lock);
>        list_for_each_entry_safe(pos, n, &nn->nfs_client_list, cl_share_link) {
>                /* If "pos" isn't marked ready, we can't trust the
> @@ -364,6 +366,7 @@ int nfs40_walk_client_list(struct nfs_client *new,
>                nfs_put_client(prev);
>        spin_unlock(&nn->nfs_client_lock);
>        pr_err("NFS: %s Error: no matching nfs_client found\n", __func__);
> +       WARN_ON(1);
>        return -NFS4ERR_STALE_CLIENTID;
> }
> 
> @@ -433,6 +436,8 @@ int nfs41_walk_client_list(struct nfs_client *new,
>        struct nfs_client *pos, *n, *prev = NULL;
>        int error;
> 
> +       *result = NULL;
> +

The design of this code is that output parameters should never be touched except in the "success" case.

>        spin_lock(&nn->nfs_client_lock);
>        list_for_each_entry_safe(pos, n, &nn->nfs_client_list, cl_share_link) {
>                /* If "pos" isn't marked ready, we can't trust the
> @@ -489,6 +494,7 @@ int nfs41_walk_client_list(struct nfs_client *new,
>         */
>        spin_unlock(&nn->nfs_client_lock);
>        pr_err("NFS: %s Error: no matching nfs_client found\n", __func__);
> +       WARN_ON(1);
>        return -NFS4ERR_STALE_CLIENTID;
> }
> #endif /* CONFIG_NFS_V4_1 */
> diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
> index c351e6b..54d29e2 100644
> --- a/fs/nfs/nfs4state.c
> +++ b/fs/nfs/nfs4state.c
> @@ -142,7 +142,8 @@ int nfs40_discover_server_trunking(struct nfs_client *clp,
>        case 0:
>                /* Sustain the lease, even if it's empty.  If the clientid4
>                 * goes stale it's of no use for trunking discovery. */
> -               nfs4_schedule_state_renewal(*result);
> +               if (*result)
> +                       nfs4_schedule_state_renewal(*result);

In the STALE_CLIENTID case, this would do state renewal on the wrong nfs_client.

>                break;
>        }

Let's look at this again:

 138         status = nfs40_walk_client_list(clp, result, cred);
 139         switch (status) {
 140         case -NFS4ERR_STALE_CLIENTID:
 141                 set_bit(NFS4CLNT_LEASE_CONFIRM, &clp->cl_state);

This arm of the switch operates on "clp" not on "*result".  So we want to add:

       nfs4_schedule_state_renewal(clp);
       break;

That makes initializing *result above completely unnecessary.

 142         case 0:
 143                 /* Sustain the lease, even if it's empty.  If the clientid4
 144                  * goes stale it's of no use for trunking discovery. */
 145                 nfs4_schedule_state_renewal(*result);
 146                 break;
 147         }
Ben Greear Jan. 18, 2013, 10:51 p.m. UTC | #2
On 01/18/2013 02:43 PM, Chuck Lever wrote:
>
> On Jan 18, 2013, at 5:34 PM, Ben Greear <greearb@candelatech.com> wrote:
>
>> On 01/18/2013 02:29 PM, Chuck Lever wrote:
>>>
>>> On Jan 18, 2013, at 4:59 PM, Ben Greear <greearb@candelatech.com> wrote:
>>>
>>>> On 01/18/2013 01:33 PM, Chuck Lever wrote:
>>>>>
>>>>> On Jan 18, 2013, at 4:28 PM, Ben Greear <greearb@candelatech.com> wrote:
>>>>>
>>>>>> Any chance the STALE_CLIENTID case needs a 'break'?
>>>>>
>>>>> I don't think so.  LEASE_CONFIRM is set, and we want to wake the state renewal thread.
>>>>>
>>>>>>
>>>>>> Twice I've seen kernel crashes after the nfs40_walk_client_list
>>>>>> failed (though code comments say it should never fail).
>>>>>
>>>>> nfs40_walk_client_list() is looking for an nfs_client that is supposed to already be in the nfs_client list.  If the search fails, that's a bug.
>>>>>
>>>>> Eyeball the contents of your nfs_client list.  You should find an appropriate nfs_client in there, and then figure out why the search doesn't find it.
>>>>
>>>> Ok, I think I found another issue.
>>>>
>>>> nfs4_client_init does not initialize 'old', but passes it to nfs4_discover_server_trunking.
>>>>
>>>> That gets passed to the detect_trunking operation, which is nfs40_discover_server_trunking
>>>> or nfs41_discover_server_trunking.
>>>
>>> Yes, *result is an output parameter, not an input parameter.
>>>
>>>> This will call walk_client_list, which also may not ever assign a value to
>>>> 'result'.
>>>
>>> It should always assign *result in the SUCCESS case.
>>>
>>>> The code in walk_client_list always dereferences result, however.
>>>>
>>>> So, that is probably why my system blows up shortly after the 'impossible'
>>>> error message...
>>>
>>> In particular, nfs4_walk_client_list() does not set *result when it returns NFS4ERR_STALE_CLIENTID.  In nfs4_discover_server_trunking(), should we therefore have this:
>>>
>>>       status = nfs40_walk_client_list(clp, result, cred);
>>>       switch (status) {
>>>       case -NFS4ERR_STALE_CLIENTID:
>>>               set_bit(NFS4CLNT_LEASE_CONFIRM, &clp->cl_state);
>>>>>            nfs4_schedule_state_renewal(clp);
>>>>>            break;
>>>       case 0:
>>>
>>> I'm not sure the nfs4_schedule_state_renewal() call is necessary here.
>>
>>
>> I'm not sure I follow you entirely..but I'm going to start testing this patch in
>> a minute.  Looks like it should fix my crash, and maybe give clues about
>> why we get to the error case in the first place.
>>
>>
>> diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
>> index d6b39a9..3188283 100644
>> --- a/fs/nfs/nfs4client.c
>> +++ b/fs/nfs/nfs4client.c
>> @@ -307,6 +307,8 @@ int nfs40_walk_client_list(struct nfs_client *new,
>>         };
>>         int status;
>>
>> +       *result = NULL;
>> +
>>         spin_lock(&nn->nfs_client_lock);
>>         list_for_each_entry_safe(pos, n, &nn->nfs_client_list, cl_share_link) {
>>                 /* If "pos" isn't marked ready, we can't trust the
>> @@ -364,6 +366,7 @@ int nfs40_walk_client_list(struct nfs_client *new,
>>                 nfs_put_client(prev);
>>         spin_unlock(&nn->nfs_client_lock);
>>         pr_err("NFS: %s Error: no matching nfs_client found\n", __func__);
>> +       WARN_ON(1);
>>         return -NFS4ERR_STALE_CLIENTID;
>> }
>>
>> @@ -433,6 +436,8 @@ int nfs41_walk_client_list(struct nfs_client *new,
>>         struct nfs_client *pos, *n, *prev = NULL;
>>         int error;
>>
>> +       *result = NULL;
>> +
>
> The design of this code is that output parameters should never be touched except in the "success" case.
>
>>         spin_lock(&nn->nfs_client_lock);
>>         list_for_each_entry_safe(pos, n, &nn->nfs_client_list, cl_share_link) {
>>                 /* If "pos" isn't marked ready, we can't trust the
>> @@ -489,6 +494,7 @@ int nfs41_walk_client_list(struct nfs_client *new,
>>          */
>>         spin_unlock(&nn->nfs_client_lock);
>>         pr_err("NFS: %s Error: no matching nfs_client found\n", __func__);
>> +       WARN_ON(1);
>>         return -NFS4ERR_STALE_CLIENTID;
>> }
>> #endif /* CONFIG_NFS_V4_1 */
>> diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
>> index c351e6b..54d29e2 100644
>> --- a/fs/nfs/nfs4state.c
>> +++ b/fs/nfs/nfs4state.c
>> @@ -142,7 +142,8 @@ int nfs40_discover_server_trunking(struct nfs_client *clp,
>>         case 0:
>>                 /* Sustain the lease, even if it's empty.  If the clientid4
>>                  * goes stale it's of no use for trunking discovery. */
>> -               nfs4_schedule_state_renewal(*result);
>> +               if (*result)
>> +                       nfs4_schedule_state_renewal(*result);
>
> In the STALE_CLIENTID case, this would do state renewal on the wrong nfs_client.
>
>>                 break;
>>         }
>
> Let's look at this again:
>
>   138         status = nfs40_walk_client_list(clp, result, cred);
>   139         switch (status) {
>   140         case -NFS4ERR_STALE_CLIENTID:
>   141                 set_bit(NFS4CLNT_LEASE_CONFIRM, &clp->cl_state);
>
> This arm of the switch operates on "clp" not on "*result".  So we want to add:
>
>         nfs4_schedule_state_renewal(clp);
>         break;
>
> That makes initializing *result above completely unnecessary.

Ok, I'll test that, but I still think the result pointer should be initialized
somewhere.  If a bug like this ever creeps in again, dereferencing a NULL
pointer should be a lot more obvious than dereferencing some random thing.


>
>   142         case 0:
>   143                 /* Sustain the lease, even if it's empty.  If the clientid4
>   144                  * goes stale it's of no use for trunking discovery. */
>   145                 nfs4_schedule_state_renewal(*result);
>   146                 break;
>   147         }
>
diff mbox

Patch

diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
index d6b39a9..3188283 100644
--- a/fs/nfs/nfs4client.c
+++ b/fs/nfs/nfs4client.c
@@ -307,6 +307,8 @@  int nfs40_walk_client_list(struct nfs_client *new,
         };
         int status;

+       *result = NULL;
+
         spin_lock(&nn->nfs_client_lock);
         list_for_each_entry_safe(pos, n, &nn->nfs_client_list, cl_share_link) {
                 /* If "pos" isn't marked ready, we can't trust the
@@ -364,6 +366,7 @@  int nfs40_walk_client_list(struct nfs_client *new,
                 nfs_put_client(prev);
         spin_unlock(&nn->nfs_client_lock);
         pr_err("NFS: %s Error: no matching nfs_client found\n", __func__);
+       WARN_ON(1);
         return -NFS4ERR_STALE_CLIENTID;
  }

@@ -433,6 +436,8 @@  int nfs41_walk_client_list(struct nfs_client *new,
         struct nfs_client *pos, *n, *prev = NULL;
         int error;

+       *result = NULL;
+
         spin_lock(&nn->nfs_client_lock);
         list_for_each_entry_safe(pos, n, &nn->nfs_client_list, cl_share_link) {
                 /* If "pos" isn't marked ready, we can't trust the
@@ -489,6 +494,7 @@  int nfs41_walk_client_list(struct nfs_client *new,
          */
         spin_unlock(&nn->nfs_client_lock);
         pr_err("NFS: %s Error: no matching nfs_client found\n", __func__);
+       WARN_ON(1);
         return -NFS4ERR_STALE_CLIENTID;
  }
  #endif /* CONFIG_NFS_V4_1 */
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index c351e6b..54d29e2 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -142,7 +142,8 @@  int nfs40_discover_server_trunking(struct nfs_client *clp,
         case 0:
                 /* Sustain the lease, even if it's empty.  If the clientid4
                  * goes stale it's of no use for trunking discovery. */
-               nfs4_schedule_state_renewal(*result);
+               if (*result)
+                       nfs4_schedule_state_renewal(*result);
                 break;
         }