diff mbox

not picking a delegation stateid for IO when delegation stateid is being returned

Message ID CAN-5tyFpbM5GE2fAcx8uMG6MO2JC-R4ASN_jrnmYH8idie_H+w@mail.gmail.com (mailing list archive)
State New, archived
Headers show

Commit Message

Olga Kornievskaia Dec. 3, 2014, 7:54 p.m. UTC
Hi folks,

I would like an opinion about changing code in such as way that we
don't select a delegation stateid for an IO operation when this
particular delegation is being returned.

The reason it's some what problematic is that we send out a
DELEG_RETURN and then we don't remove the stateid until we receive a
reply. In the mean while, an IO operation can be happening and in
nfs4_select_rw_stateid() it sees a delegation stateid and uses it.
Well, at the server, it finishes processing DELEG_RETURN before
getting an IO op and by that time the server is finished with the
stateid and can error an IO operation with BAD_STATEID.

                nfs_mark_delegation_referenced(delegation);
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Trond Myklebust Dec. 3, 2014, 8:59 p.m. UTC | #1
Hi Olga,

On Wed, Dec 3, 2014 at 2:54 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
> Hi folks,
>
> I would like an opinion about changing code in such as way that we
> don't select a delegation stateid for an IO operation when this
> particular delegation is being returned.
>
> The reason it's some what problematic is that we send out a
> DELEG_RETURN and then we don't remove the stateid until we receive a
> reply. In the mean while, an IO operation can be happening and in
> nfs4_select_rw_stateid() it sees a delegation stateid and uses it.
> Well, at the server, it finishes processing DELEG_RETURN before
> getting an IO op and by that time the server is finished with the
> stateid and can error an IO operation with BAD_STATEID.
>
> diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
> index 7f3f606..4f6f6c9 100644
> --- a/fs/nfs/delegation.c
> +++ b/fs/nfs/delegation.c
> @@ -854,7 +854,8 @@ bool nfs4_copy_delegation_stateid(nfs4_stateid
> *dst, struct inode *in
>         flags &= FMODE_READ|FMODE_WRITE;
>         rcu_read_lock();
>         delegation = rcu_dereference(nfsi->delegation);
> -       ret = (delegation != NULL && (delegation->type & flags) == flags);
> +       ret = (delegation != NULL && (delegation->type & flags) == flags &&
> +               !test_bit(NFS_DELEGATION_RETURNING, &delegation->flags));
>         if (ret) {
>                 nfs4_stateid_copy(dst, &delegation->stateid);
>                 nfs_mark_delegation_referenced(delegation);

The above cannot eliminate the possibility that we won't use a
delegation while it is being returned. It will at best just reduce the
window of opportunity.

So, why is this being considered to be a problem in the first place?
Are you seeing a measurable performance impact on a real life workload
(as opposed to some 1-in-a-billion occurrence from a QA test :-))?

Cheers
  Trond
Olga Kornievskaia Dec. 3, 2014, 10:59 p.m. UTC | #2
On Wed, Dec 3, 2014 at 3:59 PM, Trond Myklebust
<trond.myklebust@primarydata.com> wrote:
> Hi Olga,
>
> On Wed, Dec 3, 2014 at 2:54 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
>> Hi folks,
>>
>> I would like an opinion about changing code in such as way that we
>> don't select a delegation stateid for an IO operation when this
>> particular delegation is being returned.
>>
>> The reason it's some what problematic is that we send out a
>> DELEG_RETURN and then we don't remove the stateid until we receive a
>> reply. In the mean while, an IO operation can be happening and in
>> nfs4_select_rw_stateid() it sees a delegation stateid and uses it.
>> Well, at the server, it finishes processing DELEG_RETURN before
>> getting an IO op and by that time the server is finished with the
>> stateid and can error an IO operation with BAD_STATEID.
>>
>> diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
>> index 7f3f606..4f6f6c9 100644
>> --- a/fs/nfs/delegation.c
>> +++ b/fs/nfs/delegation.c
>> @@ -854,7 +854,8 @@ bool nfs4_copy_delegation_stateid(nfs4_stateid
>> *dst, struct inode *in
>>         flags &= FMODE_READ|FMODE_WRITE;
>>         rcu_read_lock();
>>         delegation = rcu_dereference(nfsi->delegation);
>> -       ret = (delegation != NULL && (delegation->type & flags) == flags);
>> +       ret = (delegation != NULL && (delegation->type & flags) == flags &&
>> +               !test_bit(NFS_DELEGATION_RETURNING, &delegation->flags));
>>         if (ret) {
>>                 nfs4_stateid_copy(dst, &delegation->stateid);
>>                 nfs_mark_delegation_referenced(delegation);
>
> The above cannot eliminate the possibility that we won't use a
> delegation while it is being returned. It will at best just reduce the
> window of opportunity.
>

You are right this are still problems. Actually, we might set the bit
but not yet get the open stateid from the open with deleg_cur and
that's not good. It would be good to know we got the open stateid and
then pick that.

> So, why is this being considered to be a problem in the first place?
> Are you seeing a measurable performance impact on a real life workload
> (as opposed to some 1-in-a-billion occurrence from a QA test :-))?

Unfortunately, this problem is quite common and I hit it all the time
on my setup. This leads to client seizing IO on that file and
returning EIO. It's an unrecoverable error. I'm trying to figure out
how to eliminate getting to that state.


>
> Cheers
>   Trond
> --
> Trond Myklebust
>
> Linux NFS client maintainer, PrimaryData
>
> trond.myklebust@primarydata.com
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust Dec. 3, 2014, 11:13 p.m. UTC | #3
On Wed, Dec 3, 2014 at 5:59 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
> On Wed, Dec 3, 2014 at 3:59 PM, Trond Myklebust
> <trond.myklebust@primarydata.com> wrote:
>> Hi Olga,
>>
>> On Wed, Dec 3, 2014 at 2:54 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
>>> Hi folks,
>>>
>>> I would like an opinion about changing code in such as way that we
>>> don't select a delegation stateid for an IO operation when this
>>> particular delegation is being returned.
>>>
>>> The reason it's some what problematic is that we send out a
>>> DELEG_RETURN and then we don't remove the stateid until we receive a
>>> reply. In the mean while, an IO operation can be happening and in
>>> nfs4_select_rw_stateid() it sees a delegation stateid and uses it.
>>> Well, at the server, it finishes processing DELEG_RETURN before
>>> getting an IO op and by that time the server is finished with the
>>> stateid and can error an IO operation with BAD_STATEID.
>>>
>>> diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
>>> index 7f3f606..4f6f6c9 100644
>>> --- a/fs/nfs/delegation.c
>>> +++ b/fs/nfs/delegation.c
>>> @@ -854,7 +854,8 @@ bool nfs4_copy_delegation_stateid(nfs4_stateid
>>> *dst, struct inode *in
>>>         flags &= FMODE_READ|FMODE_WRITE;
>>>         rcu_read_lock();
>>>         delegation = rcu_dereference(nfsi->delegation);
>>> -       ret = (delegation != NULL && (delegation->type & flags) == flags);
>>> +       ret = (delegation != NULL && (delegation->type & flags) == flags &&
>>> +               !test_bit(NFS_DELEGATION_RETURNING, &delegation->flags));
>>>         if (ret) {
>>>                 nfs4_stateid_copy(dst, &delegation->stateid);
>>>                 nfs_mark_delegation_referenced(delegation);
>>
>> The above cannot eliminate the possibility that we won't use a
>> delegation while it is being returned. It will at best just reduce the
>> window of opportunity.
>>
>
> You are right this are still problems. Actually, we might set the bit
> but not yet get the open stateid from the open with deleg_cur and
> that's not good. It would be good to know we got the open stateid and
> then pick that.
>
>> So, why is this being considered to be a problem in the first place?
>> Are you seeing a measurable performance impact on a real life workload
>> (as opposed to some 1-in-a-billion occurrence from a QA test :-))?
>
> Unfortunately, this problem is quite common and I hit it all the time
> on my setup. This leads to client seizing IO on that file and
> returning EIO. It's an unrecoverable error. I'm trying to figure out
> how to eliminate getting to that state.
>

It definitely isn't intended to be an irrecoverable error. The client
is supposed to just replay the write after updating the stateid.
Olga Kornievskaia Dec. 3, 2014, 11:21 p.m. UTC | #4
On Wed, Dec 3, 2014 at 6:13 PM, Trond Myklebust
<trond.myklebust@primarydata.com> wrote:
> On Wed, Dec 3, 2014 at 5:59 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
>> On Wed, Dec 3, 2014 at 3:59 PM, Trond Myklebust
>> <trond.myklebust@primarydata.com> wrote:
>>> Hi Olga,
>>>
>>> On Wed, Dec 3, 2014 at 2:54 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>> Hi folks,
>>>>
>>>> I would like an opinion about changing code in such as way that we
>>>> don't select a delegation stateid for an IO operation when this
>>>> particular delegation is being returned.
>>>>
>>>> The reason it's some what problematic is that we send out a
>>>> DELEG_RETURN and then we don't remove the stateid until we receive a
>>>> reply. In the mean while, an IO operation can be happening and in
>>>> nfs4_select_rw_stateid() it sees a delegation stateid and uses it.
>>>> Well, at the server, it finishes processing DELEG_RETURN before
>>>> getting an IO op and by that time the server is finished with the
>>>> stateid and can error an IO operation with BAD_STATEID.
>>>>
>>>> diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
>>>> index 7f3f606..4f6f6c9 100644
>>>> --- a/fs/nfs/delegation.c
>>>> +++ b/fs/nfs/delegation.c
>>>> @@ -854,7 +854,8 @@ bool nfs4_copy_delegation_stateid(nfs4_stateid
>>>> *dst, struct inode *in
>>>>         flags &= FMODE_READ|FMODE_WRITE;
>>>>         rcu_read_lock();
>>>>         delegation = rcu_dereference(nfsi->delegation);
>>>> -       ret = (delegation != NULL && (delegation->type & flags) == flags);
>>>> +       ret = (delegation != NULL && (delegation->type & flags) == flags &&
>>>> +               !test_bit(NFS_DELEGATION_RETURNING, &delegation->flags));
>>>>         if (ret) {
>>>>                 nfs4_stateid_copy(dst, &delegation->stateid);
>>>>                 nfs_mark_delegation_referenced(delegation);
>>>
>>> The above cannot eliminate the possibility that we won't use a
>>> delegation while it is being returned. It will at best just reduce the
>>> window of opportunity.
>>>
>>
>> You are right this are still problems. Actually, we might set the bit
>> but not yet get the open stateid from the open with deleg_cur and
>> that's not good. It would be good to know we got the open stateid and
>> then pick that.
>>
>>> So, why is this being considered to be a problem in the first place?
>>> Are you seeing a measurable performance impact on a real life workload
>>> (as opposed to some 1-in-a-billion occurrence from a QA test :-))?
>>
>> Unfortunately, this problem is quite common and I hit it all the time
>> on my setup. This leads to client seizing IO on that file and
>> returning EIO. It's an unrecoverable error. I'm trying to figure out
>> how to eliminate getting to that state.
>>
>
> It definitely isn't intended to be an irrecoverable error. The client
> is supposed to just replay the write after updating the stateid.

open(deleg_cur) call / reply
lock() call/reply
deleg_return() call
write(with deluge_stateid) call gets BAD_STATEID
state recovery code marks lock state lost -> EIO.

:-/

>
> --
> Trond Myklebust
>
> Linux NFS client maintainer, PrimaryData
>
> trond.myklebust@primarydata.com
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Olga Kornievskaia Dec. 3, 2014, 11:50 p.m. UTC | #5
On Wed, Dec 3, 2014 at 6:34 PM, Trond Myklebust
<trond.myklebust@primarydata.com> wrote:
>
> On Dec 3, 2014 6:21 PM, "Olga Kornievskaia" <aglo@umich.edu> wrote:
>>
>> On Wed, Dec 3, 2014 at 6:13 PM, Trond Myklebust
>> <trond.myklebust@primarydata.com> wrote:
>> > On Wed, Dec 3, 2014 at 5:59 PM, Olga Kornievskaia <aglo@umich.edu>
>> > wrote:
>> >> On Wed, Dec 3, 2014 at 3:59 PM, Trond Myklebust
>> >> <trond.myklebust@primarydata.com> wrote:
>> >>> Hi Olga,
>> >>>
>> >>> On Wed, Dec 3, 2014 at 2:54 PM, Olga Kornievskaia <aglo@umich.edu>
>> >>> wrote:
>> >>>> Hi folks,
>> >>>>
>> >>>> I would like an opinion about changing code in such as way that we
>> >>>> don't select a delegation stateid for an IO operation when this
>> >>>> particular delegation is being returned.
>> >>>>
>> >>>> The reason it's some what problematic is that we send out a
>> >>>> DELEG_RETURN and then we don't remove the stateid until we receive a
>> >>>> reply. In the mean while, an IO operation can be happening and in
>> >>>> nfs4_select_rw_stateid() it sees a delegation stateid and uses it.
>> >>>> Well, at the server, it finishes processing DELEG_RETURN before
>> >>>> getting an IO op and by that time the server is finished with the
>> >>>> stateid and can error an IO operation with BAD_STATEID.
>> >>>>
>> >>>> diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
>> >>>> index 7f3f606..4f6f6c9 100644
>> >>>> --- a/fs/nfs/delegation.c
>> >>>> +++ b/fs/nfs/delegation.c
>> >>>> @@ -854,7 +854,8 @@ bool nfs4_copy_delegation_stateid(nfs4_stateid
>> >>>> *dst, struct inode *in
>> >>>>         flags &= FMODE_READ|FMODE_WRITE;
>> >>>>         rcu_read_lock();
>> >>>>         delegation = rcu_dereference(nfsi->delegation);
>> >>>> -       ret = (delegation != NULL && (delegation->type & flags) ==
>> >>>> flags);
>> >>>> +       ret = (delegation != NULL && (delegation->type & flags) ==
>> >>>> flags &&
>> >>>> +               !test_bit(NFS_DELEGATION_RETURNING,
>> >>>> &delegation->flags));
>> >>>>         if (ret) {
>> >>>>                 nfs4_stateid_copy(dst, &delegation->stateid);
>> >>>>                 nfs_mark_delegation_referenced(delegation);
>> >>>
>> >>> The above cannot eliminate the possibility that we won't use a
>> >>> delegation while it is being returned. It will at best just reduce the
>> >>> window of opportunity.
>> >>>
>> >>
>> >> You are right this are still problems. Actually, we might set the bit
>> >> but not yet get the open stateid from the open with deleg_cur and
>> >> that's not good. It would be good to know we got the open stateid and
>> >> then pick that.
>> >>
>> >>> So, why is this being considered to be a problem in the first place?
>> >>> Are you seeing a measurable performance impact on a real life workload
>> >>> (as opposed to some 1-in-a-billion occurrence from a QA test :-))?
>> >>
>> >> Unfortunately, this problem is quite common and I hit it all the time
>> >> on my setup. This leads to client seizing IO on that file and
>> >> returning EIO. It's an unrecoverable error. I'm trying to figure out
>> >> how to eliminate getting to that state.
>> >>
>> >
>> > It definitely isn't intended to be an irrecoverable error. The client
>> > is supposed to just replay the write after updating the stateid.
>>
>> open(deleg_cur) call / reply
>> lock() call/reply
>> deleg_return() call
>> write(with deluge_stateid) call gets BAD_STATEID
>> state recovery code marks lock state lost -> EIO.
>
> Why is it marking the lock as lost? If the recovery succeeded, it should
> notice that the stateid has changed and instead retry.

I'll get you a better explanation tomorrow besides saying "that's what
I see when I run the code".

> What kernel is this?

This is upstream.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Olga Kornievskaia Dec. 4, 2014, 5:36 p.m. UTC | #6
On Wed, Dec 3, 2014 at 6:50 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
> On Wed, Dec 3, 2014 at 6:34 PM, Trond Myklebust
> <trond.myklebust@primarydata.com> wrote:
>>
>> On Dec 3, 2014 6:21 PM, "Olga Kornievskaia" <aglo@umich.edu> wrote:
>>>
>>> On Wed, Dec 3, 2014 at 6:13 PM, Trond Myklebust
>>> <trond.myklebust@primarydata.com> wrote:
>>> > On Wed, Dec 3, 2014 at 5:59 PM, Olga Kornievskaia <aglo@umich.edu>
>>> > wrote:
>>> >> On Wed, Dec 3, 2014 at 3:59 PM, Trond Myklebust
>>> >> <trond.myklebust@primarydata.com> wrote:
>>> >>> Hi Olga,
>>> >>>
>>> >>> On Wed, Dec 3, 2014 at 2:54 PM, Olga Kornievskaia <aglo@umich.edu>
>>> >>> wrote:
>>> >>>> Hi folks,
>>> >>>>
>>> >>>> I would like an opinion about changing code in such as way that we
>>> >>>> don't select a delegation stateid for an IO operation when this
>>> >>>> particular delegation is being returned.
>>> >>>>
>>> >>>> The reason it's some what problematic is that we send out a
>>> >>>> DELEG_RETURN and then we don't remove the stateid until we receive a
>>> >>>> reply. In the mean while, an IO operation can be happening and in
>>> >>>> nfs4_select_rw_stateid() it sees a delegation stateid and uses it.
>>> >>>> Well, at the server, it finishes processing DELEG_RETURN before
>>> >>>> getting an IO op and by that time the server is finished with the
>>> >>>> stateid and can error an IO operation with BAD_STATEID.
>>> >>>>
>>> >>>> diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
>>> >>>> index 7f3f606..4f6f6c9 100644
>>> >>>> --- a/fs/nfs/delegation.c
>>> >>>> +++ b/fs/nfs/delegation.c
>>> >>>> @@ -854,7 +854,8 @@ bool nfs4_copy_delegation_stateid(nfs4_stateid
>>> >>>> *dst, struct inode *in
>>> >>>>         flags &= FMODE_READ|FMODE_WRITE;
>>> >>>>         rcu_read_lock();
>>> >>>>         delegation = rcu_dereference(nfsi->delegation);
>>> >>>> -       ret = (delegation != NULL && (delegation->type & flags) ==
>>> >>>> flags);
>>> >>>> +       ret = (delegation != NULL && (delegation->type & flags) ==
>>> >>>> flags &&
>>> >>>> +               !test_bit(NFS_DELEGATION_RETURNING,
>>> >>>> &delegation->flags));
>>> >>>>         if (ret) {
>>> >>>>                 nfs4_stateid_copy(dst, &delegation->stateid);
>>> >>>>                 nfs_mark_delegation_referenced(delegation);
>>> >>>
>>> >>> The above cannot eliminate the possibility that we won't use a
>>> >>> delegation while it is being returned. It will at best just reduce the
>>> >>> window of opportunity.
>>> >>>
>>> >>
>>> >> You are right this are still problems. Actually, we might set the bit
>>> >> but not yet get the open stateid from the open with deleg_cur and
>>> >> that's not good. It would be good to know we got the open stateid and
>>> >> then pick that.
>>> >>
>>> >>> So, why is this being considered to be a problem in the first place?
>>> >>> Are you seeing a measurable performance impact on a real life workload
>>> >>> (as opposed to some 1-in-a-billion occurrence from a QA test :-))?
>>> >>
>>> >> Unfortunately, this problem is quite common and I hit it all the time
>>> >> on my setup. This leads to client seizing IO on that file and
>>> >> returning EIO. It's an unrecoverable error. I'm trying to figure out
>>> >> how to eliminate getting to that state.
>>> >>
>>> >
>>> > It definitely isn't intended to be an irrecoverable error. The client
>>> > is supposed to just replay the write after updating the stateid.
>>>
>>> open(deleg_cur) call / reply
>>> lock() call/reply
>>> deleg_return() call
>>> write(with deluge_stateid) call gets BAD_STATEID
>>> state recovery code marks lock state lost -> EIO.
>>
>> Why is it marking the lock as lost? If the recovery succeeded, it should
>> notice that the stateid has changed and instead retry.
>
> I'll get you a better explanation tomorrow besides saying "that's what
> I see when I run the code".

nfs4_async_handle_error() initiates state recovery
nfs4_reclaim_open_state() eventually calls  nfs4_reclaim_locks() which
marks the lock LOST. state is delegated so the kernel logs "lock
reclaim failed".
write retries and in nfs4_copy_ lock_stateid() the lock is marked LOST
and the nfs4_select_rw_stateid() fails with EIO.

>
>> What kernel is this?
>
> This is upstream.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust Dec. 4, 2014, 6:23 p.m. UTC | #7
On Thu, Dec 4, 2014 at 12:36 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
> On Wed, Dec 3, 2014 at 6:50 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
>> On Wed, Dec 3, 2014 at 6:34 PM, Trond Myklebust
>> <trond.myklebust@primarydata.com> wrote:
>>>
>>> On Dec 3, 2014 6:21 PM, "Olga Kornievskaia" <aglo@umich.edu> wrote:
>>>>
>>>> On Wed, Dec 3, 2014 at 6:13 PM, Trond Myklebust
>>>> <trond.myklebust@primarydata.com> wrote:
>>>> > On Wed, Dec 3, 2014 at 5:59 PM, Olga Kornievskaia <aglo@umich.edu>
>>>> > wrote:
>>>> >> On Wed, Dec 3, 2014 at 3:59 PM, Trond Myklebust
>>>> >> <trond.myklebust@primarydata.com> wrote:
>>>> >>> Hi Olga,
>>>> >>>
>>>> >>> On Wed, Dec 3, 2014 at 2:54 PM, Olga Kornievskaia <aglo@umich.edu>
>>>> >>> wrote:
>>>> >>>> Hi folks,
>>>> >>>>
>>>> >>>> I would like an opinion about changing code in such as way that we
>>>> >>>> don't select a delegation stateid for an IO operation when this
>>>> >>>> particular delegation is being returned.
>>>> >>>>
>>>> >>>> The reason it's some what problematic is that we send out a
>>>> >>>> DELEG_RETURN and then we don't remove the stateid until we receive a
>>>> >>>> reply. In the mean while, an IO operation can be happening and in
>>>> >>>> nfs4_select_rw_stateid() it sees a delegation stateid and uses it.
>>>> >>>> Well, at the server, it finishes processing DELEG_RETURN before
>>>> >>>> getting an IO op and by that time the server is finished with the
>>>> >>>> stateid and can error an IO operation with BAD_STATEID.
>>>> >>>>
>>>> >>>> diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
>>>> >>>> index 7f3f606..4f6f6c9 100644
>>>> >>>> --- a/fs/nfs/delegation.c
>>>> >>>> +++ b/fs/nfs/delegation.c
>>>> >>>> @@ -854,7 +854,8 @@ bool nfs4_copy_delegation_stateid(nfs4_stateid
>>>> >>>> *dst, struct inode *in
>>>> >>>>         flags &= FMODE_READ|FMODE_WRITE;
>>>> >>>>         rcu_read_lock();
>>>> >>>>         delegation = rcu_dereference(nfsi->delegation);
>>>> >>>> -       ret = (delegation != NULL && (delegation->type & flags) ==
>>>> >>>> flags);
>>>> >>>> +       ret = (delegation != NULL && (delegation->type & flags) ==
>>>> >>>> flags &&
>>>> >>>> +               !test_bit(NFS_DELEGATION_RETURNING,
>>>> >>>> &delegation->flags));
>>>> >>>>         if (ret) {
>>>> >>>>                 nfs4_stateid_copy(dst, &delegation->stateid);
>>>> >>>>                 nfs_mark_delegation_referenced(delegation);
>>>> >>>
>>>> >>> The above cannot eliminate the possibility that we won't use a
>>>> >>> delegation while it is being returned. It will at best just reduce the
>>>> >>> window of opportunity.
>>>> >>>
>>>> >>
>>>> >> You are right this are still problems. Actually, we might set the bit
>>>> >> but not yet get the open stateid from the open with deleg_cur and
>>>> >> that's not good. It would be good to know we got the open stateid and
>>>> >> then pick that.
>>>> >>
>>>> >>> So, why is this being considered to be a problem in the first place?
>>>> >>> Are you seeing a measurable performance impact on a real life workload
>>>> >>> (as opposed to some 1-in-a-billion occurrence from a QA test :-))?
>>>> >>
>>>> >> Unfortunately, this problem is quite common and I hit it all the time
>>>> >> on my setup. This leads to client seizing IO on that file and
>>>> >> returning EIO. It's an unrecoverable error. I'm trying to figure out
>>>> >> how to eliminate getting to that state.
>>>> >>
>>>> >
>>>> > It definitely isn't intended to be an irrecoverable error. The client
>>>> > is supposed to just replay the write after updating the stateid.
>>>>
>>>> open(deleg_cur) call / reply
>>>> lock() call/reply
>>>> deleg_return() call
>>>> write(with deluge_stateid) call gets BAD_STATEID
>>>> state recovery code marks lock state lost -> EIO.
>>>
>>> Why is it marking the lock as lost? If the recovery succeeded, it should
>>> notice that the stateid has changed and instead retry.
>>
>> I'll get you a better explanation tomorrow besides saying "that's what
>> I see when I run the code".
>
> nfs4_async_handle_error() initiates state recovery
> nfs4_reclaim_open_state() eventually calls  nfs4_reclaim_locks() which
> marks the lock LOST. state is delegated so the kernel logs "lock
> reclaim failed".
> write retries and in nfs4_copy_ lock_stateid() the lock is marked LOST
> and the nfs4_select_rw_stateid() fails with EIO.
>
>>
>>> What kernel is this?
>>
>> This is upstream.

So why isn't nfs4_write_stateid_changed() catching the change before
we even get to nfs4_async_handle_error()? That's where this race is
supposed to get resolved.
Olga Kornievskaia Dec. 4, 2014, 6:58 p.m. UTC | #8
On Thu, Dec 4, 2014 at 1:23 PM, Trond Myklebust
<trond.myklebust@primarydata.com> wrote:
> On Thu, Dec 4, 2014 at 12:36 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
>> On Wed, Dec 3, 2014 at 6:50 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
>>> On Wed, Dec 3, 2014 at 6:34 PM, Trond Myklebust
>>> <trond.myklebust@primarydata.com> wrote:
>>>>
>>>> On Dec 3, 2014 6:21 PM, "Olga Kornievskaia" <aglo@umich.edu> wrote:
>>>>>
>>>>> On Wed, Dec 3, 2014 at 6:13 PM, Trond Myklebust
>>>>> <trond.myklebust@primarydata.com> wrote:
>>>>> > On Wed, Dec 3, 2014 at 5:59 PM, Olga Kornievskaia <aglo@umich.edu>
>>>>> > wrote:
>>>>> >> On Wed, Dec 3, 2014 at 3:59 PM, Trond Myklebust
>>>>> >> <trond.myklebust@primarydata.com> wrote:
>>>>> >>> Hi Olga,
>>>>> >>>
>>>>> >>> On Wed, Dec 3, 2014 at 2:54 PM, Olga Kornievskaia <aglo@umich.edu>
>>>>> >>> wrote:
>>>>> >>>> Hi folks,
>>>>> >>>>
>>>>> >>>> I would like an opinion about changing code in such as way that we
>>>>> >>>> don't select a delegation stateid for an IO operation when this
>>>>> >>>> particular delegation is being returned.
>>>>> >>>>
>>>>> >>>> The reason it's some what problematic is that we send out a
>>>>> >>>> DELEG_RETURN and then we don't remove the stateid until we receive a
>>>>> >>>> reply. In the mean while, an IO operation can be happening and in
>>>>> >>>> nfs4_select_rw_stateid() it sees a delegation stateid and uses it.
>>>>> >>>> Well, at the server, it finishes processing DELEG_RETURN before
>>>>> >>>> getting an IO op and by that time the server is finished with the
>>>>> >>>> stateid and can error an IO operation with BAD_STATEID.
>>>>> >>>>
>>>>> >>>> diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
>>>>> >>>> index 7f3f606..4f6f6c9 100644
>>>>> >>>> --- a/fs/nfs/delegation.c
>>>>> >>>> +++ b/fs/nfs/delegation.c
>>>>> >>>> @@ -854,7 +854,8 @@ bool nfs4_copy_delegation_stateid(nfs4_stateid
>>>>> >>>> *dst, struct inode *in
>>>>> >>>>         flags &= FMODE_READ|FMODE_WRITE;
>>>>> >>>>         rcu_read_lock();
>>>>> >>>>         delegation = rcu_dereference(nfsi->delegation);
>>>>> >>>> -       ret = (delegation != NULL && (delegation->type & flags) ==
>>>>> >>>> flags);
>>>>> >>>> +       ret = (delegation != NULL && (delegation->type & flags) ==
>>>>> >>>> flags &&
>>>>> >>>> +               !test_bit(NFS_DELEGATION_RETURNING,
>>>>> >>>> &delegation->flags));
>>>>> >>>>         if (ret) {
>>>>> >>>>                 nfs4_stateid_copy(dst, &delegation->stateid);
>>>>> >>>>                 nfs_mark_delegation_referenced(delegation);
>>>>> >>>
>>>>> >>> The above cannot eliminate the possibility that we won't use a
>>>>> >>> delegation while it is being returned. It will at best just reduce the
>>>>> >>> window of opportunity.
>>>>> >>>
>>>>> >>
>>>>> >> You are right this are still problems. Actually, we might set the bit
>>>>> >> but not yet get the open stateid from the open with deleg_cur and
>>>>> >> that's not good. It would be good to know we got the open stateid and
>>>>> >> then pick that.
>>>>> >>
>>>>> >>> So, why is this being considered to be a problem in the first place?
>>>>> >>> Are you seeing a measurable performance impact on a real life workload
>>>>> >>> (as opposed to some 1-in-a-billion occurrence from a QA test :-))?
>>>>> >>
>>>>> >> Unfortunately, this problem is quite common and I hit it all the time
>>>>> >> on my setup. This leads to client seizing IO on that file and
>>>>> >> returning EIO. It's an unrecoverable error. I'm trying to figure out
>>>>> >> how to eliminate getting to that state.
>>>>> >>
>>>>> >
>>>>> > It definitely isn't intended to be an irrecoverable error. The client
>>>>> > is supposed to just replay the write after updating the stateid.
>>>>>
>>>>> open(deleg_cur) call / reply
>>>>> lock() call/reply
>>>>> deleg_return() call
>>>>> write(with deluge_stateid) call gets BAD_STATEID
>>>>> state recovery code marks lock state lost -> EIO.
>>>>
>>>> Why is it marking the lock as lost? If the recovery succeeded, it should
>>>> notice that the stateid has changed and instead retry.
>>>
>>> I'll get you a better explanation tomorrow besides saying "that's what
>>> I see when I run the code".
>>
>> nfs4_async_handle_error() initiates state recovery
>> nfs4_reclaim_open_state() eventually calls  nfs4_reclaim_locks() which
>> marks the lock LOST. state is delegated so the kernel logs "lock
>> reclaim failed".
>> write retries and in nfs4_copy_ lock_stateid() the lock is marked LOST
>> and the nfs4_select_rw_stateid() fails with EIO.
>>
>>>
>>>> What kernel is this?
>>>
>>> This is upstream.
>
> So why isn't nfs4_write_stateid_changed() catching the change before
> we even get to nfs4_async_handle_error()? That's where this race is
> supposed to get resolved.

Probably because nfs4_stateid_is_current() returns true because
nfs4_set_rw_stateid() calls the nfs4_select_rw_stateid() and finds the
lock lost?

>
> --
> Trond Myklebust
>
> Linux NFS client maintainer, PrimaryData
>
> trond.myklebust@primarydata.com
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust Dec. 4, 2014, 7:08 p.m. UTC | #9
On Thu, Dec 4, 2014 at 1:58 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
> On Thu, Dec 4, 2014 at 1:23 PM, Trond Myklebust
> <trond.myklebust@primarydata.com> wrote:
>> On Thu, Dec 4, 2014 at 12:36 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
>>> On Wed, Dec 3, 2014 at 6:50 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>> On Wed, Dec 3, 2014 at 6:34 PM, Trond Myklebust
>>>> <trond.myklebust@primarydata.com> wrote:
>>>>>
>>>>> On Dec 3, 2014 6:21 PM, "Olga Kornievskaia" <aglo@umich.edu> wrote:
>>>>>>
>>>>>> On Wed, Dec 3, 2014 at 6:13 PM, Trond Myklebust
>>>>>> <trond.myklebust@primarydata.com> wrote:
>>>>>> > On Wed, Dec 3, 2014 at 5:59 PM, Olga Kornievskaia <aglo@umich.edu>
>>>>>> > wrote:
>>>>>> >> On Wed, Dec 3, 2014 at 3:59 PM, Trond Myklebust
>>>>>> >> <trond.myklebust@primarydata.com> wrote:
>>>>>> >>> Hi Olga,
>>>>>> >>>
>>>>>> >>> On Wed, Dec 3, 2014 at 2:54 PM, Olga Kornievskaia <aglo@umich.edu>
>>>>>> >>> wrote:
>>>>>> >>>> Hi folks,
>>>>>> >>>>
>>>>>> >>>> I would like an opinion about changing code in such as way that we
>>>>>> >>>> don't select a delegation stateid for an IO operation when this
>>>>>> >>>> particular delegation is being returned.
>>>>>> >>>>
>>>>>> >>>> The reason it's some what problematic is that we send out a
>>>>>> >>>> DELEG_RETURN and then we don't remove the stateid until we receive a
>>>>>> >>>> reply. In the mean while, an IO operation can be happening and in
>>>>>> >>>> nfs4_select_rw_stateid() it sees a delegation stateid and uses it.
>>>>>> >>>> Well, at the server, it finishes processing DELEG_RETURN before
>>>>>> >>>> getting an IO op and by that time the server is finished with the
>>>>>> >>>> stateid and can error an IO operation with BAD_STATEID.
>>>>>> >>>>
>>>>>> >>>> diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
>>>>>> >>>> index 7f3f606..4f6f6c9 100644
>>>>>> >>>> --- a/fs/nfs/delegation.c
>>>>>> >>>> +++ b/fs/nfs/delegation.c
>>>>>> >>>> @@ -854,7 +854,8 @@ bool nfs4_copy_delegation_stateid(nfs4_stateid
>>>>>> >>>> *dst, struct inode *in
>>>>>> >>>>         flags &= FMODE_READ|FMODE_WRITE;
>>>>>> >>>>         rcu_read_lock();
>>>>>> >>>>         delegation = rcu_dereference(nfsi->delegation);
>>>>>> >>>> -       ret = (delegation != NULL && (delegation->type & flags) ==
>>>>>> >>>> flags);
>>>>>> >>>> +       ret = (delegation != NULL && (delegation->type & flags) ==
>>>>>> >>>> flags &&
>>>>>> >>>> +               !test_bit(NFS_DELEGATION_RETURNING,
>>>>>> >>>> &delegation->flags));
>>>>>> >>>>         if (ret) {
>>>>>> >>>>                 nfs4_stateid_copy(dst, &delegation->stateid);
>>>>>> >>>>                 nfs_mark_delegation_referenced(delegation);
>>>>>> >>>
>>>>>> >>> The above cannot eliminate the possibility that we won't use a
>>>>>> >>> delegation while it is being returned. It will at best just reduce the
>>>>>> >>> window of opportunity.
>>>>>> >>>
>>>>>> >>
>>>>>> >> You are right this are still problems. Actually, we might set the bit
>>>>>> >> but not yet get the open stateid from the open with deleg_cur and
>>>>>> >> that's not good. It would be good to know we got the open stateid and
>>>>>> >> then pick that.
>>>>>> >>
>>>>>> >>> So, why is this being considered to be a problem in the first place?
>>>>>> >>> Are you seeing a measurable performance impact on a real life workload
>>>>>> >>> (as opposed to some 1-in-a-billion occurrence from a QA test :-))?
>>>>>> >>
>>>>>> >> Unfortunately, this problem is quite common and I hit it all the time
>>>>>> >> on my setup. This leads to client seizing IO on that file and
>>>>>> >> returning EIO. It's an unrecoverable error. I'm trying to figure out
>>>>>> >> how to eliminate getting to that state.
>>>>>> >>
>>>>>> >
>>>>>> > It definitely isn't intended to be an irrecoverable error. The client
>>>>>> > is supposed to just replay the write after updating the stateid.
>>>>>>
>>>>>> open(deleg_cur) call / reply
>>>>>> lock() call/reply
>>>>>> deleg_return() call
>>>>>> write(with deluge_stateid) call gets BAD_STATEID
>>>>>> state recovery code marks lock state lost -> EIO.
>>>>>
>>>>> Why is it marking the lock as lost? If the recovery succeeded, it should
>>>>> notice that the stateid has changed and instead retry.
>>>>
>>>> I'll get you a better explanation tomorrow besides saying "that's what
>>>> I see when I run the code".
>>>
>>> nfs4_async_handle_error() initiates state recovery
>>> nfs4_reclaim_open_state() eventually calls  nfs4_reclaim_locks() which
>>> marks the lock LOST. state is delegated so the kernel logs "lock
>>> reclaim failed".
>>> write retries and in nfs4_copy_ lock_stateid() the lock is marked LOST
>>> and the nfs4_select_rw_stateid() fails with EIO.
>>>
>>>>
>>>>> What kernel is this?
>>>>
>>>> This is upstream.
>>
>> So why isn't nfs4_write_stateid_changed() catching the change before
>> we even get to nfs4_async_handle_error()? That's where this race is
>> supposed to get resolved.
>
> Probably because nfs4_stateid_is_current() returns true because
> nfs4_set_rw_stateid() calls the nfs4_select_rw_stateid() and finds the
> lock lost?

I don't understand why the lock would already be marked as lost.
Remember that we're calling from inside nfs4_write_done() and before
calling nfs4_async_handle_error().
Olga Kornievskaia Dec. 4, 2014, 8:49 p.m. UTC | #10
On Thu, Dec 4, 2014 at 2:08 PM, Trond Myklebust
<trond.myklebust@primarydata.com> wrote:
> On Thu, Dec 4, 2014 at 1:58 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
>> On Thu, Dec 4, 2014 at 1:23 PM, Trond Myklebust
>> <trond.myklebust@primarydata.com> wrote:
>>> On Thu, Dec 4, 2014 at 12:36 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>> On Wed, Dec 3, 2014 at 6:50 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>>> On Wed, Dec 3, 2014 at 6:34 PM, Trond Myklebust
>>>>> <trond.myklebust@primarydata.com> wrote:
>>>>>>
>>>>>> On Dec 3, 2014 6:21 PM, "Olga Kornievskaia" <aglo@umich.edu> wrote:
>>>>>>>
>>>>>>> On Wed, Dec 3, 2014 at 6:13 PM, Trond Myklebust
>>>>>>> <trond.myklebust@primarydata.com> wrote:
>>>>>>> > On Wed, Dec 3, 2014 at 5:59 PM, Olga Kornievskaia <aglo@umich.edu>
>>>>>>> > wrote:
>>>>>>> >> On Wed, Dec 3, 2014 at 3:59 PM, Trond Myklebust
>>>>>>> >> <trond.myklebust@primarydata.com> wrote:
>>>>>>> >>> Hi Olga,
>>>>>>> >>>
>>>>>>> >>> On Wed, Dec 3, 2014 at 2:54 PM, Olga Kornievskaia <aglo@umich.edu>
>>>>>>> >>> wrote:
>>>>>>> >>>> Hi folks,
>>>>>>> >>>>
>>>>>>> >>>> I would like an opinion about changing code in such as way that we
>>>>>>> >>>> don't select a delegation stateid for an IO operation when this
>>>>>>> >>>> particular delegation is being returned.
>>>>>>> >>>>
>>>>>>> >>>> The reason it's some what problematic is that we send out a
>>>>>>> >>>> DELEG_RETURN and then we don't remove the stateid until we receive a
>>>>>>> >>>> reply. In the mean while, an IO operation can be happening and in
>>>>>>> >>>> nfs4_select_rw_stateid() it sees a delegation stateid and uses it.
>>>>>>> >>>> Well, at the server, it finishes processing DELEG_RETURN before
>>>>>>> >>>> getting an IO op and by that time the server is finished with the
>>>>>>> >>>> stateid and can error an IO operation with BAD_STATEID.
>>>>>>> >>>>
>>>>>>> >>>> diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
>>>>>>> >>>> index 7f3f606..4f6f6c9 100644
>>>>>>> >>>> --- a/fs/nfs/delegation.c
>>>>>>> >>>> +++ b/fs/nfs/delegation.c
>>>>>>> >>>> @@ -854,7 +854,8 @@ bool nfs4_copy_delegation_stateid(nfs4_stateid
>>>>>>> >>>> *dst, struct inode *in
>>>>>>> >>>>         flags &= FMODE_READ|FMODE_WRITE;
>>>>>>> >>>>         rcu_read_lock();
>>>>>>> >>>>         delegation = rcu_dereference(nfsi->delegation);
>>>>>>> >>>> -       ret = (delegation != NULL && (delegation->type & flags) ==
>>>>>>> >>>> flags);
>>>>>>> >>>> +       ret = (delegation != NULL && (delegation->type & flags) ==
>>>>>>> >>>> flags &&
>>>>>>> >>>> +               !test_bit(NFS_DELEGATION_RETURNING,
>>>>>>> >>>> &delegation->flags));
>>>>>>> >>>>         if (ret) {
>>>>>>> >>>>                 nfs4_stateid_copy(dst, &delegation->stateid);
>>>>>>> >>>>                 nfs_mark_delegation_referenced(delegation);
>>>>>>> >>>
>>>>>>> >>> The above cannot eliminate the possibility that we won't use a
>>>>>>> >>> delegation while it is being returned. It will at best just reduce the
>>>>>>> >>> window of opportunity.
>>>>>>> >>>
>>>>>>> >>
>>>>>>> >> You are right this are still problems. Actually, we might set the bit
>>>>>>> >> but not yet get the open stateid from the open with deleg_cur and
>>>>>>> >> that's not good. It would be good to know we got the open stateid and
>>>>>>> >> then pick that.
>>>>>>> >>
>>>>>>> >>> So, why is this being considered to be a problem in the first place?
>>>>>>> >>> Are you seeing a measurable performance impact on a real life workload
>>>>>>> >>> (as opposed to some 1-in-a-billion occurrence from a QA test :-))?
>>>>>>> >>
>>>>>>> >> Unfortunately, this problem is quite common and I hit it all the time
>>>>>>> >> on my setup. This leads to client seizing IO on that file and
>>>>>>> >> returning EIO. It's an unrecoverable error. I'm trying to figure out
>>>>>>> >> how to eliminate getting to that state.
>>>>>>> >>
>>>>>>> >
>>>>>>> > It definitely isn't intended to be an irrecoverable error. The client
>>>>>>> > is supposed to just replay the write after updating the stateid.
>>>>>>>
>>>>>>> open(deleg_cur) call / reply
>>>>>>> lock() call/reply
>>>>>>> deleg_return() call
>>>>>>> write(with deluge_stateid) call gets BAD_STATEID
>>>>>>> state recovery code marks lock state lost -> EIO.
>>>>>>
>>>>>> Why is it marking the lock as lost? If the recovery succeeded, it should
>>>>>> notice that the stateid has changed and instead retry.
>>>>>
>>>>> I'll get you a better explanation tomorrow besides saying "that's what
>>>>> I see when I run the code".
>>>>
>>>> nfs4_async_handle_error() initiates state recovery
>>>> nfs4_reclaim_open_state() eventually calls  nfs4_reclaim_locks() which
>>>> marks the lock LOST. state is delegated so the kernel logs "lock
>>>> reclaim failed".
>>>> write retries and in nfs4_copy_ lock_stateid() the lock is marked LOST
>>>> and the nfs4_select_rw_stateid() fails with EIO.
>>>>
>>>>>
>>>>>> What kernel is this?
>>>>>
>>>>> This is upstream.
>>>
>>> So why isn't nfs4_write_stateid_changed() catching the change before
>>> we even get to nfs4_async_handle_error()? That's where this race is
>>> supposed to get resolved.
>>
>> Probably because nfs4_stateid_is_current() returns true because
>> nfs4_set_rw_stateid() calls the nfs4_select_rw_stateid() and finds the
>> lock lost?
>
> I don't understand why the lock would already be marked as lost.
> Remember that we're calling from inside nfs4_write_done() and before
> calling nfs4_async_handle_error().

Sigh. I'm so confused...

On November 30th I posted "is receiving BAD_STATEID during IO on
delegated stateid unrecoverable?" and you've agreed that it would lead
to the EIO.

There are a couple of cases i'm juggling and I'm just getting confused
if they are different or same. 1 is using delegation stateid for the
IO while we are returning the stateid and thus getting BAD_STATEID. 2
is getting a BAD_STATEID on a delegated stateid on IO in without a
deleg_return. It seems like 2nd case can also lead to the EIO. I have
lossy traces from QA so it's possible that it's the 1st case and known
to get the EIO.


>
> --
> Trond Myklebust
>
> Linux NFS client maintainer, PrimaryData
>
> trond.myklebust@primarydata.com
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Olga Kornievskaia Dec. 4, 2014, 8:57 p.m. UTC | #11
On Thu, Dec 4, 2014 at 3:49 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
> On Thu, Dec 4, 2014 at 2:08 PM, Trond Myklebust
> <trond.myklebust@primarydata.com> wrote:
>> On Thu, Dec 4, 2014 at 1:58 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
>>> On Thu, Dec 4, 2014 at 1:23 PM, Trond Myklebust
>>> <trond.myklebust@primarydata.com> wrote:
>>>> On Thu, Dec 4, 2014 at 12:36 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>>> On Wed, Dec 3, 2014 at 6:50 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
>>>>>> On Wed, Dec 3, 2014 at 6:34 PM, Trond Myklebust
>>>>>> <trond.myklebust@primarydata.com> wrote:
>>>>>>>
>>>>>>> On Dec 3, 2014 6:21 PM, "Olga Kornievskaia" <aglo@umich.edu> wrote:
>>>>>>>>
>>>>>>>> On Wed, Dec 3, 2014 at 6:13 PM, Trond Myklebust
>>>>>>>> <trond.myklebust@primarydata.com> wrote:
>>>>>>>> > On Wed, Dec 3, 2014 at 5:59 PM, Olga Kornievskaia <aglo@umich.edu>
>>>>>>>> > wrote:
>>>>>>>> >> On Wed, Dec 3, 2014 at 3:59 PM, Trond Myklebust
>>>>>>>> >> <trond.myklebust@primarydata.com> wrote:
>>>>>>>> >>> Hi Olga,
>>>>>>>> >>>
>>>>>>>> >>> On Wed, Dec 3, 2014 at 2:54 PM, Olga Kornievskaia <aglo@umich.edu>
>>>>>>>> >>> wrote:
>>>>>>>> >>>> Hi folks,
>>>>>>>> >>>>
>>>>>>>> >>>> I would like an opinion about changing code in such as way that we
>>>>>>>> >>>> don't select a delegation stateid for an IO operation when this
>>>>>>>> >>>> particular delegation is being returned.
>>>>>>>> >>>>
>>>>>>>> >>>> The reason it's some what problematic is that we send out a
>>>>>>>> >>>> DELEG_RETURN and then we don't remove the stateid until we receive a
>>>>>>>> >>>> reply. In the mean while, an IO operation can be happening and in
>>>>>>>> >>>> nfs4_select_rw_stateid() it sees a delegation stateid and uses it.
>>>>>>>> >>>> Well, at the server, it finishes processing DELEG_RETURN before
>>>>>>>> >>>> getting an IO op and by that time the server is finished with the
>>>>>>>> >>>> stateid and can error an IO operation with BAD_STATEID.
>>>>>>>> >>>>
>>>>>>>> >>>> diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
>>>>>>>> >>>> index 7f3f606..4f6f6c9 100644
>>>>>>>> >>>> --- a/fs/nfs/delegation.c
>>>>>>>> >>>> +++ b/fs/nfs/delegation.c
>>>>>>>> >>>> @@ -854,7 +854,8 @@ bool nfs4_copy_delegation_stateid(nfs4_stateid
>>>>>>>> >>>> *dst, struct inode *in
>>>>>>>> >>>>         flags &= FMODE_READ|FMODE_WRITE;
>>>>>>>> >>>>         rcu_read_lock();
>>>>>>>> >>>>         delegation = rcu_dereference(nfsi->delegation);
>>>>>>>> >>>> -       ret = (delegation != NULL && (delegation->type & flags) ==
>>>>>>>> >>>> flags);
>>>>>>>> >>>> +       ret = (delegation != NULL && (delegation->type & flags) ==
>>>>>>>> >>>> flags &&
>>>>>>>> >>>> +               !test_bit(NFS_DELEGATION_RETURNING,
>>>>>>>> >>>> &delegation->flags));
>>>>>>>> >>>>         if (ret) {
>>>>>>>> >>>>                 nfs4_stateid_copy(dst, &delegation->stateid);
>>>>>>>> >>>>                 nfs_mark_delegation_referenced(delegation);
>>>>>>>> >>>
>>>>>>>> >>> The above cannot eliminate the possibility that we won't use a
>>>>>>>> >>> delegation while it is being returned. It will at best just reduce the
>>>>>>>> >>> window of opportunity.
>>>>>>>> >>>
>>>>>>>> >>
>>>>>>>> >> You are right this are still problems. Actually, we might set the bit
>>>>>>>> >> but not yet get the open stateid from the open with deleg_cur and
>>>>>>>> >> that's not good. It would be good to know we got the open stateid and
>>>>>>>> >> then pick that.
>>>>>>>> >>
>>>>>>>> >>> So, why is this being considered to be a problem in the first place?
>>>>>>>> >>> Are you seeing a measurable performance impact on a real life workload
>>>>>>>> >>> (as opposed to some 1-in-a-billion occurrence from a QA test :-))?
>>>>>>>> >>
>>>>>>>> >> Unfortunately, this problem is quite common and I hit it all the time
>>>>>>>> >> on my setup. This leads to client seizing IO on that file and
>>>>>>>> >> returning EIO. It's an unrecoverable error. I'm trying to figure out
>>>>>>>> >> how to eliminate getting to that state.
>>>>>>>> >>
>>>>>>>> >
>>>>>>>> > It definitely isn't intended to be an irrecoverable error. The client
>>>>>>>> > is supposed to just replay the write after updating the stateid.
>>>>>>>>
>>>>>>>> open(deleg_cur) call / reply
>>>>>>>> lock() call/reply
>>>>>>>> deleg_return() call
>>>>>>>> write(with deluge_stateid) call gets BAD_STATEID
>>>>>>>> state recovery code marks lock state lost -> EIO.
>>>>>>>
>>>>>>> Why is it marking the lock as lost? If the recovery succeeded, it should
>>>>>>> notice that the stateid has changed and instead retry.
>>>>>>
>>>>>> I'll get you a better explanation tomorrow besides saying "that's what
>>>>>> I see when I run the code".
>>>>>
>>>>> nfs4_async_handle_error() initiates state recovery
>>>>> nfs4_reclaim_open_state() eventually calls  nfs4_reclaim_locks() which
>>>>> marks the lock LOST. state is delegated so the kernel logs "lock
>>>>> reclaim failed".
>>>>> write retries and in nfs4_copy_ lock_stateid() the lock is marked LOST
>>>>> and the nfs4_select_rw_stateid() fails with EIO.
>>>>>
>>>>>>
>>>>>>> What kernel is this?
>>>>>>
>>>>>> This is upstream.
>>>>
>>>> So why isn't nfs4_write_stateid_changed() catching the change before
>>>> we even get to nfs4_async_handle_error()? That's where this race is
>>>> supposed to get resolved.
>>>
>>> Probably because nfs4_stateid_is_current() returns true because
>>> nfs4_set_rw_stateid() calls the nfs4_select_rw_stateid() and finds the
>>> lock lost?
>>
>> I don't understand why the lock would already be marked as lost.
>> Remember that we're calling from inside nfs4_write_done() and before
>> calling nfs4_async_handle_error().
>
> Sigh. I'm so confused...
>
> On November 30th I posted "is receiving BAD_STATEID during IO on
> delegated stateid unrecoverable?" and you've agreed that it would lead
> to the EIO.
>
> There are a couple of cases i'm juggling and I'm just getting confused
> if they are different or same. 1 is using delegation stateid for the
> IO while we are returning the stateid and thus getting BAD_STATEID. 2
> is getting a BAD_STATEID on a delegated stateid on IO in without a
> deleg_return. It seems like 2nd case can also lead to the EIO. I have
> lossy traces from QA so it's possible that it's the 1st case and known
> to get the EIO.

got my cases mixed up. 2nd case was the november's message and agreed
to be unrecoverable. 1st case is what i'm also looking into (maybe).

>
>
>>
>> --
>> Trond Myklebust
>>
>> Linux NFS client maintainer, PrimaryData
>>
>> trond.myklebust@primarydata.com
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
index 7f3f606..4f6f6c9 100644
--- a/fs/nfs/delegation.c
+++ b/fs/nfs/delegation.c
@@ -854,7 +854,8 @@  bool nfs4_copy_delegation_stateid(nfs4_stateid
*dst, struct inode *in
        flags &= FMODE_READ|FMODE_WRITE;
        rcu_read_lock();
        delegation = rcu_dereference(nfsi->delegation);
-       ret = (delegation != NULL && (delegation->type & flags) == flags);
+       ret = (delegation != NULL && (delegation->type & flags) == flags &&
+               !test_bit(NFS_DELEGATION_RETURNING, &delegation->flags));
        if (ret) {
                nfs4_stateid_copy(dst, &delegation->stateid);