
NFSv4.1: Remove a bogus BUG_ON() in nfs4_layoutreturn_done

Message ID 1344457310-26442-1-git-send-email-Trond.Myklebust@netapp.com (mailing list archive)
State New, archived

Commit Message

Trond Myklebust Aug. 8, 2012, 8:21 p.m. UTC
Ever since commit 0a57cdac3f (NFSv4.1 send layoutreturn to fence
disconnected data server) we've been sending layoutreturn calls
while there is potentially still outstanding I/O to the data
servers. The reason we do this is to avoid races between replayed
writes to the MDS and the original writes to the DS.

When this happens, the BUG_ON() in nfs4_layoutreturn_done can
be triggered because it assumes that we would never call
layoutreturn without knowing that all I/O to the DS is
finished. The fix is to remove the BUG_ON() now that the
assumptions behind the test are obsolete.

Reported-by: Boaz Harrosh <bharrosh@panasas.com>
Reported-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org [>=3.5]
---
 fs/nfs/nfs4proc.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)
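
The patch body itself is not reproduced in this archive view. As a rough
sketch only, reconstructed from the description and the
2-insertions/6-deletions diffstat rather than copied from the commit, the
change to nfs4_layoutreturn_done() is of roughly this shape:

    	spin_lock(&lo->plh_inode->i_lock);
    -	if (task->tk_status == 0) {
    -		if (lrp->res.lrs_present) {
    -			pnfs_set_layout_stateid(lo, &lrp->res.stateid, true);
    -		} else
    -			BUG_ON(!list_empty(&lo->plh_segs));
    -	}
    +	if (task->tk_status == 0 && lrp->res.lrs_present)
    +		pnfs_set_layout_stateid(lo, &lrp->res.stateid, true);
    	lo->plh_block_lgets--;
    	spin_unlock(&lo->plh_inode->i_lock);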

Comments

Peng Tao Aug. 9, 2012, 2:30 p.m. UTC | #1
On Thu, Aug 9, 2012 at 4:21 AM, Trond Myklebust
<Trond.Myklebust@netapp.com> wrote:
> Ever since commit 0a57cdac3f (NFSv4.1 send layoutreturn to fence
> disconnected data server) we've been sending layoutreturn calls
> while there is potentially still outstanding I/O to the data
> servers. The reason we do this is to avoid races between replayed
> writes to the MDS and the original writes to the DS.
>
> When this happens, the BUG_ON() in nfs4_layoutreturn_done can
> be triggered because it assumes that we would never call
> layoutreturn without knowing that all I/O to the DS is
> finished. The fix is to remove the BUG_ON() now that the
> assumptions behind the test are obsolete.
>
Isn't MDS supposed to recall the layout if races are possible between
outstanding write-to-DS and write-through-MDS?

And it causes data corruption for blocklayout if client returns layout
while there is in-flight disk IO...
Trond Myklebust Aug. 9, 2012, 2:36 p.m. UTC | #2
On Thu, 2012-08-09 at 22:30 +0800, Peng Tao wrote:
> On Thu, Aug 9, 2012 at 4:21 AM, Trond Myklebust
> <Trond.Myklebust@netapp.com> wrote:
> > Ever since commit 0a57cdac3f (NFSv4.1 send layoutreturn to fence
> > disconnected data server) we've been sending layoutreturn calls
> > while there is potentially still outstanding I/O to the data
> > servers. The reason we do this is to avoid races between replayed
> > writes to the MDS and the original writes to the DS.
> >
> > When this happens, the BUG_ON() in nfs4_layoutreturn_done can
> > be triggered because it assumes that we would never call
> > layoutreturn without knowing that all I/O to the DS is
> > finished. The fix is to remove the BUG_ON() now that the
> > assumptions behind the test are obsolete.
> >
> Isn't MDS supposed to recall the layout if races are possible between
> outstanding write-to-DS and write-through-MDS?

Where do you read that in RFC5661?

> And it causes data corruption for blocklayout if client returns layout
> while there is in-flight disk IO...

Then it needs to turn off fast failover to write-through-MDS.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
Peng Tao Aug. 9, 2012, 3:01 p.m. UTC | #3
On Thu, Aug 9, 2012 at 10:36 PM, Myklebust, Trond
<Trond.Myklebust@netapp.com> wrote:
> On Thu, 2012-08-09 at 22:30 +0800, Peng Tao wrote:
>> On Thu, Aug 9, 2012 at 4:21 AM, Trond Myklebust
>> <Trond.Myklebust@netapp.com> wrote:
>> > Ever since commit 0a57cdac3f (NFSv4.1 send layoutreturn to fence
>> > disconnected data server) we've been sending layoutreturn calls
>> > while there is potentially still outstanding I/O to the data
>> > servers. The reason we do this is to avoid races between replayed
>> > writes to the MDS and the original writes to the DS.
>> >
>> > When this happens, the BUG_ON() in nfs4_layoutreturn_done can
>> > be triggered because it assumes that we would never call
>> > layoutreturn without knowing that all I/O to the DS is
>> > finished. The fix is to remove the BUG_ON() now that the
>> > assumptions behind the test are obsolete.
>> >
>> Isn't MDS supposed to recall the layout if races are possible between
>> outstanding write-to-DS and write-through-MDS?
>
> Where do you read that in RFC5661?
>
That's my (maybe mis-)understanding of how server works... But looking
at rfc5661 section 18.44.3. layoutreturn implementation.
"
After this call,
   the client MUST NOT use the returned layout(s) and the associated
   storage protocol to access the file data.
"
And given commit 0a57cdac3f, client is using the layout even after
layoutreturn, which IMHO is a violation of rfc5661.

>> And it causes data corruption for blocklayout if client returns layout
>> while there is in-flight disk IO...
>
> Then it needs to turn off fast failover to write-through-MDS.
>
If you still consider it following rfc5661, I'd choose to disable
layoutreturn in before write-through-MDS for blocklayout, by adding
some flag like PNFS_NO_LAYOUTRET_ON_FALLTHRU similar to objects'
PNFS_LAYOUTRET_ON_SETATTR.
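
As an illustration of the flag proposed above: a minimal sketch, modelled on
the existing PNFS_LAYOUTRET_ON_SETATTR layoutdriver policy bit, might look as
follows. The PNFS_NO_LAYOUTRET_ON_FALLTHRU name comes from the mail above;
the enum placement and the helper called from the pnfs_ld_write_done() error
path are assumptions for the sketch, not code from any posted patch.

    enum layoutdriver_policy_flags {
    	PNFS_LAYOUTRET_ON_SETATTR	= 1 << 0,	/* existing, used by objects */
    	PNFS_NO_LAYOUTRET_ON_FALLTHRU	= 1 << 1,	/* proposed: blocklayout would set this */
    };

    /* sketch: decide whether to fence via LAYOUTRETURN before resending
     * the failed writes through the MDS; pnfs_ld_write_done()'s error
     * path would call pnfs_return_layout(inode) only when this is true */
    static bool want_layoutreturn_on_fallthru(struct inode *inode)
    {
    	return !(NFS_SERVER(inode)->pnfs_curr_ld->flags &
    		 PNFS_NO_LAYOUTRET_ON_FALLTHRU);
    }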
Trond Myklebust Aug. 9, 2012, 3:39 p.m. UTC | #4
On Thu, 2012-08-09 at 23:01 +0800, Peng Tao wrote:
> On Thu, Aug 9, 2012 at 10:36 PM, Myklebust, Trond
> <Trond.Myklebust@netapp.com> wrote:
> > On Thu, 2012-08-09 at 22:30 +0800, Peng Tao wrote:
> >> On Thu, Aug 9, 2012 at 4:21 AM, Trond Myklebust
> >> <Trond.Myklebust@netapp.com> wrote:
> >> > Ever since commit 0a57cdac3f (NFSv4.1 send layoutreturn to fence
> >> > disconnected data server) we've been sending layoutreturn calls
> >> > while there is potentially still outstanding I/O to the data
> >> > servers. The reason we do this is to avoid races between replayed
> >> > writes to the MDS and the original writes to the DS.
> >> >
> >> > When this happens, the BUG_ON() in nfs4_layoutreturn_done can
> >> > be triggered because it assumes that we would never call
> >> > layoutreturn without knowing that all I/O to the DS is
> >> > finished. The fix is to remove the BUG_ON() now that the
> >> > assumptions behind the test are obsolete.
> >> >
> >> Isn't MDS supposed to recall the layout if races are possible between
> >> outstanding write-to-DS and write-through-MDS?
> >
> > Where do you read that in RFC5661?
> >
> That's my (maybe mis-)understanding of how server works... But looking
> at rfc5661 section 18.44.3. layoutreturn implementation.
> "
> After this call,
>    the client MUST NOT use the returned layout(s) and the associated
>    storage protocol to access the file data.
> "
> And given commit 0a57cdac3f, client is using the layout even after
> layoutreturn, which IMHO is a violation of rfc5661.

No. It is using the layoutreturn to tell the MDS to fence off I/O to a
data server that is not responding. It isn't attempting to use the
layout after the layoutreturn: the whole point is that we are attempting
write-through-MDS after the attempt to write through the DS timed out.

> >> And it causes data corruption for blocklayout if client returns layout
> >> while there is in-flight disk IO...
> >
> > Then it needs to turn off fast failover to write-through-MDS.
> >
> If you still consider it following rfc5661, I'd choose to disable
> layoutreturn in before write-through-MDS for blocklayout, by adding
> some flag like PNFS_NO_LAYOUTRET_ON_FALLTHRU similar to objects'
> PNFS_LAYOUTRET_ON_SETATTR.

I don't see how that will prevent corruption.

In fact, IIRC, in NFSv4.2 Sorin's proposed changes specifically use the
layoutreturn to communicate to the MDS that the DS is timing out via an
error code (like the object layout has done all the time). How can you
reconcile that change with a flag such as the one you propose?

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
Peng Tao Aug. 9, 2012, 4:22 p.m. UTC | #5
On Thu, Aug 9, 2012 at 11:39 PM, Myklebust, Trond
<Trond.Myklebust@netapp.com> wrote:
> On Thu, 2012-08-09 at 23:01 +0800, Peng Tao wrote:
>> On Thu, Aug 9, 2012 at 10:36 PM, Myklebust, Trond
>> <Trond.Myklebust@netapp.com> wrote:
>> > On Thu, 2012-08-09 at 22:30 +0800, Peng Tao wrote:
>> >> On Thu, Aug 9, 2012 at 4:21 AM, Trond Myklebust
>> >> <Trond.Myklebust@netapp.com> wrote:
>> >> > Ever since commit 0a57cdac3f (NFSv4.1 send layoutreturn to fence
>> >> > disconnected data server) we've been sending layoutreturn calls
>> >> > while there is potentially still outstanding I/O to the data
>> >> > servers. The reason we do this is to avoid races between replayed
>> >> > writes to the MDS and the original writes to the DS.
>> >> >
>> >> > When this happens, the BUG_ON() in nfs4_layoutreturn_done can
>> >> > be triggered because it assumes that we would never call
>> >> > layoutreturn without knowing that all I/O to the DS is
>> >> > finished. The fix is to remove the BUG_ON() now that the
>> >> > assumptions behind the test are obsolete.
>> >> >
>> >> Isn't MDS supposed to recall the layout if races are possible between
>> >> outstanding write-to-DS and write-through-MDS?
>> >
>> > Where do you read that in RFC5661?
>> >
>> That's my (maybe mis-)understanding of how server works... But looking
>> at rfc5661 section 18.44.3. layoutreturn implementation.
>> "
>> After this call,
>>    the client MUST NOT use the returned layout(s) and the associated
>>    storage protocol to access the file data.
>> "
>> And given commit 0a57cdac3f, client is using the layout even after
>> layoutreturn, which IMHO is a violation of rfc5661.
>
> No. It is using the layoutreturn to tell the MDS to fence off I/O to a
> data server that is not responding. It isn't attempting to use the
> layout after the layoutreturn: the whole point is that we are attempting
> write-through-MDS after the attempt to write through the DS timed out.
>
But it is RFC violation that there is in-flight DS IO when client
sends layoutreturn, right? Not just in-flight, client is well possible
to send IO to DS _after_ layoutreturn because some thread can hold
lseg reference and not yet send IO.

>> >> And it causes data corruption for blocklayout if client returns layout
>> >> while there is in-flight disk IO...
>> >
>> > Then it needs to turn off fast failover to write-through-MDS.
>> >
>> If you still consider it following rfc5661, I'd choose to disable
>> layoutreturn in before write-through-MDS for blocklayout, by adding
>> some flag like PNFS_NO_LAYOUTRET_ON_FALLTHRU similar to objects'
>> PNFS_LAYOUTRET_ON_SETATTR.
>
> I don't see how that will prevent corruption.
>
> In fact, IIRC, in NFSv4.2 Sorin's proposed changes specifically use the
> layoutreturn to communicate to the MDS that the DS is timing out via an
> error code (like the object layout has done all the time). How can you
> reconcile that change with a flag such as the one you propose?
I just intend to use the flag to disable layoutreturn in
pnfs_ld_write_done. block extents are data access permissions per
rfc5663. When we don't layoutreturn in pnfs_ld_write_done(), block
layout works correctly because server can decide if there is data
access race and if there is, MDS can recall the layout from client
before applying the MDS writes.

Sorin's proposed error code is just a client indication to server that
there is disk access error. It is not intended to solve the data race
between write-through-MDS and write-through-DS.

Thanks,
Tao
Trond Myklebust Aug. 9, 2012, 4:29 p.m. UTC | #6
On Fri, 2012-08-10 at 00:22 +0800, Peng Tao wrote:
> On Thu, Aug 9, 2012 at 11:39 PM, Myklebust, Trond
> <Trond.Myklebust@netapp.com> wrote:
> > On Thu, 2012-08-09 at 23:01 +0800, Peng Tao wrote:
> >> On Thu, Aug 9, 2012 at 10:36 PM, Myklebust, Trond
> >> <Trond.Myklebust@netapp.com> wrote:
> >> > On Thu, 2012-08-09 at 22:30 +0800, Peng Tao wrote:
> >> >> On Thu, Aug 9, 2012 at 4:21 AM, Trond Myklebust
> >> >> <Trond.Myklebust@netapp.com> wrote:
> >> >> > Ever since commit 0a57cdac3f (NFSv4.1 send layoutreturn to fence
> >> >> > disconnected data server) we've been sending layoutreturn calls
> >> >> > while there is potentially still outstanding I/O to the data
> >> >> > servers. The reason we do this is to avoid races between replayed
> >> >> > writes to the MDS and the original writes to the DS.
> >> >> >
> >> >> > When this happens, the BUG_ON() in nfs4_layoutreturn_done can
> >> >> > be triggered because it assumes that we would never call
> >> >> > layoutreturn without knowing that all I/O to the DS is
> >> >> > finished. The fix is to remove the BUG_ON() now that the
> >> >> > assumptions behind the test are obsolete.
> >> >> >
> >> >> Isn't MDS supposed to recall the layout if races are possible between
> >> >> outstanding write-to-DS and write-through-MDS?
> >> >
> >> > Where do you read that in RFC5661?
> >> >
> >> That's my (maybe mis-)understanding of how server works... But looking
> >> at rfc5661 section 18.44.3. layoutreturn implementation.
> >> "
> >> After this call,
> >>    the client MUST NOT use the returned layout(s) and the associated
> >>    storage protocol to access the file data.
> >> "
> >> And given commit 0a57cdac3f, client is using the layout even after
> >> layoutreturn, which IMHO is a violation of rfc5661.
> >
> > No. It is using the layoutreturn to tell the MDS to fence off I/O to a
> > data server that is not responding. It isn't attempting to use the
> > layout after the layoutreturn: the whole point is that we are attempting
> > write-through-MDS after the attempt to write through the DS timed out.
> >
> But it is RFC violation that there is in-flight DS IO when client
> sends layoutreturn, right? Not just in-flight, client is well possible
> to send IO to DS _after_ layoutreturn because some thread can hold
> lseg reference and not yet send IO.

Once the write has been sent, how do you know that it is no longer
'in-flight' unless the DS responds?

> >> >> And it causes data corruption for blocklayout if client returns layout
> >> >> while there is in-flight disk IO...
> >> >
> >> > Then it needs to turn off fast failover to write-through-MDS.
> >> >
> >> If you still consider it following rfc5661, I'd choose to disable
> >> layoutreturn in before write-through-MDS for blocklayout, by adding
> >> some flag like PNFS_NO_LAYOUTRET_ON_FALLTHRU similar to objects'
> >> PNFS_LAYOUTRET_ON_SETATTR.
> >
> > I don't see how that will prevent corruption.
> >
> > In fact, IIRC, in NFSv4.2 Sorin's proposed changes specifically use the
> > layoutreturn to communicate to the MDS that the DS is timing out via an
> > error code (like the object layout has done all the time). How can you
> > reconcile that change with a flag such as the one you propose?
> I just intend to use the flag to disable layoutreturn in
> pnfs_ld_write_done. block extents are data access permissions per
> rfc5663. When we don't layoutreturn in pnfs_ld_write_done(), block
> layout works correctly because server can decide if there is data
> access race and if there is, MDS can recall the layout from client
> before applying the MDS writes.
>
> Sorin's proposed error code is just a client indication to server that
> there is disk access error. It is not intended to solve the data race
> between write-through-MDS and write-through-DS.

Then how do you solve that race on a block device?

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
Peng Tao Aug. 9, 2012, 4:40 p.m. UTC | #7
On Fri, Aug 10, 2012 at 12:29 AM, Myklebust, Trond
<Trond.Myklebust@netapp.com> wrote:
> On Fri, 2012-08-10 at 00:22 +0800, Peng Tao wrote:
>> On Thu, Aug 9, 2012 at 11:39 PM, Myklebust, Trond
>> <Trond.Myklebust@netapp.com> wrote:
>> > On Thu, 2012-08-09 at 23:01 +0800, Peng Tao wrote:
>> >> On Thu, Aug 9, 2012 at 10:36 PM, Myklebust, Trond
>> >> <Trond.Myklebust@netapp.com> wrote:
>> >> > On Thu, 2012-08-09 at 22:30 +0800, Peng Tao wrote:
>> >> >> On Thu, Aug 9, 2012 at 4:21 AM, Trond Myklebust
>> >> >> <Trond.Myklebust@netapp.com> wrote:
>> >> >> > Ever since commit 0a57cdac3f (NFSv4.1 send layoutreturn to fence
>> >> >> > disconnected data server) we've been sending layoutreturn calls
>> >> >> > while there is potentially still outstanding I/O to the data
>> >> >> > servers. The reason we do this is to avoid races between replayed
>> >> >> > writes to the MDS and the original writes to the DS.
>> >> >> >
>> >> >> > When this happens, the BUG_ON() in nfs4_layoutreturn_done can
>> >> >> > be triggered because it assumes that we would never call
>> >> >> > layoutreturn without knowing that all I/O to the DS is
>> >> >> > finished. The fix is to remove the BUG_ON() now that the
>> >> >> > assumptions behind the test are obsolete.
>> >> >> >
>> >> >> Isn't MDS supposed to recall the layout if races are possible between
>> >> >> outstanding write-to-DS and write-through-MDS?
>> >> >
>> >> > Where do you read that in RFC5661?
>> >> >
>> >> That's my (maybe mis-)understanding of how server works... But looking
>> >> at rfc5661 section 18.44.3. layoutreturn implementation.
>> >> "
>> >> After this call,
>> >>    the client MUST NOT use the returned layout(s) and the associated
>> >>    storage protocol to access the file data.
>> >> "
>> >> And given commit 0a57cdac3f, client is using the layout even after
>> >> layoutreturn, which IMHO is a violation of rfc5661.
>> >
>> > No. It is using the layoutreturn to tell the MDS to fence off I/O to a
>> > data server that is not responding. It isn't attempting to use the
>> > layout after the layoutreturn: the whole point is that we are attempting
>> > write-through-MDS after the attempt to write through the DS timed out.
>> >
>> But it is RFC violation that there is in-flight DS IO when client
>> sends layoutreturn, right? Not just in-flight, client is well possible
>> to send IO to DS _after_ layoutreturn because some thread can hold
>> lseg reference and not yet send IO.
>
> Once the write has been sent, how do you know that it is no longer
> 'in-flight' unless the DS responds?
>
>> >> >> And it causes data corruption for blocklayout if client returns layout
>> >> >> while there is in-flight disk IO...
>> >> >
>> >> > Then it needs to turn off fast failover to write-through-MDS.
>> >> >
>> >> If you still consider it following rfc5661, I'd choose to disable
>> >> layoutreturn in before write-through-MDS for blocklayout, by adding
>> >> some flag like PNFS_NO_LAYOUTRET_ON_FALLTHRU similar to objects'
>> >> PNFS_LAYOUTRET_ON_SETATTR.
>> >
>> > I don't see how that will prevent corruption.
>> >
>> > In fact, IIRC, in NFSv4.2 Sorin's proposed changes specifically use the
>> > layoutreturn to communicate to the MDS that the DS is timing out via an
>> > error code (like the object layout has done all the time). How can you
>> > reconcile that change with a flag such as the one you propose?
>> I just intend to use the flag to disable layoutreturn in
>> pnfs_ld_write_done. block extents are data access permissions per
>> rfc5663. When we don't layoutreturn in pnfs_ld_write_done(), block
>> layout works correctly because server can decide if there is data
>> access race and if there is, MDS can recall the layout from client
>> before applying the MDS writes.
>>
>> Sorin's proposed error code is just a client indication to server that
>> there is disk access error. It is not intended to solve the data race
>> between write-through-MDS and write-through-DS.
>
> Then how do you solve that race on a block device?
As mentioned above, block extents are permissions per RFC5663. So if
MDS needs to access the disk, it needs the permission as well. So if
there is data access race, MDS must recall the layout from client
before processing the MDS writes. We've been dealing with the problem
for years in MPFS and it works perfectly to rely on MDS's decisions.

Thanks,
Tao
Peng Tao Aug. 9, 2012, 5:06 p.m. UTC | #8
On Fri, Aug 10, 2012 at 12:29 AM, Myklebust, Trond
<Trond.Myklebust@netapp.com> wrote:
> On Fri, 2012-08-10 at 00:22 +0800, Peng Tao wrote:
>> On Thu, Aug 9, 2012 at 11:39 PM, Myklebust, Trond
>> <Trond.Myklebust@netapp.com> wrote:
>> > On Thu, 2012-08-09 at 23:01 +0800, Peng Tao wrote:
>> >> On Thu, Aug 9, 2012 at 10:36 PM, Myklebust, Trond
>> >> <Trond.Myklebust@netapp.com> wrote:
>> >> > On Thu, 2012-08-09 at 22:30 +0800, Peng Tao wrote:
>> >> >> On Thu, Aug 9, 2012 at 4:21 AM, Trond Myklebust
>> >> >> <Trond.Myklebust@netapp.com> wrote:
>> >> >> > Ever since commit 0a57cdac3f (NFSv4.1 send layoutreturn to fence
>> >> >> > disconnected data server) we've been sending layoutreturn calls
>> >> >> > while there is potentially still outstanding I/O to the data
>> >> >> > servers. The reason we do this is to avoid races between replayed
>> >> >> > writes to the MDS and the original writes to the DS.
>> >> >> >
>> >> >> > When this happens, the BUG_ON() in nfs4_layoutreturn_done can
>> >> >> > be triggered because it assumes that we would never call
>> >> >> > layoutreturn without knowing that all I/O to the DS is
>> >> >> > finished. The fix is to remove the BUG_ON() now that the
>> >> >> > assumptions behind the test are obsolete.
>> >> >> >
>> >> >> Isn't MDS supposed to recall the layout if races are possible between
>> >> >> outstanding write-to-DS and write-through-MDS?
>> >> >
>> >> > Where do you read that in RFC5661?
>> >> >
>> >> That's my (maybe mis-)understanding of how server works... But looking
>> >> at rfc5661 section 18.44.3. layoutreturn implementation.
>> >> "
>> >> After this call,
>> >>    the client MUST NOT use the returned layout(s) and the associated
>> >>    storage protocol to access the file data.
>> >> "
>> >> And given commit 0a57cdac3f, client is using the layout even after
>> >> layoutreturn, which IMHO is a violation of rfc5661.
>> >
>> > No. It is using the layoutreturn to tell the MDS to fence off I/O to a
>> > data server that is not responding. It isn't attempting to use the
>> > layout after the layoutreturn: the whole point is that we are attempting
>> > write-through-MDS after the attempt to write through the DS timed out.
>> >
>> But it is RFC violation that there is in-flight DS IO when client
>> sends layoutreturn, right? Not just in-flight, client is well possible
>> to send IO to DS _after_ layoutreturn because some thread can hold
>> lseg reference and not yet send IO.
>
> Once the write has been sent, how do you know that it is no longer
> 'in-flight' unless the DS responds?
RFC5663 provides a way.
"
"blh_maximum_io_time" is the maximum
   time it can take for a client I/O to the storage system to either
   complete or fail
"
It is not a perfect solution but still serves as a best effort. It
solves the in-flight IO question for the current writing thread.

For in-flight IO from other concurrent threads, the lseg reference is the
source that we can rely on. And I think that the BUG_ON can be
triggered much more easily when there are concurrent writing threads and
one of them fails DS writes.
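
The lseg-reference argument above amounts to a drain-then-return ordering:
each in-flight I/O holds a reference, and LAYOUTRETURN is only sent once the
count drops to zero. A self-contained userspace sketch of that ordering
(plain C with pthreads and invented names, not kernel code):

    #include <pthread.h>
    #include <stdio.h>

    struct layout {
    	int		io_refs;	/* references held by in-flight I/O */
    	pthread_mutex_t	lock;		/* init with PTHREAD_MUTEX_INITIALIZER */
    	pthread_cond_t	idle;		/* init with PTHREAD_COND_INITIALIZER */
    };

    static void io_start(struct layout *lo)
    {
    	pthread_mutex_lock(&lo->lock);
    	lo->io_refs++;
    	pthread_mutex_unlock(&lo->lock);
    }

    static void io_done(struct layout *lo)	/* completion or timeout */
    {
    	pthread_mutex_lock(&lo->lock);
    	if (--lo->io_refs == 0)
    		pthread_cond_broadcast(&lo->idle);
    	pthread_mutex_unlock(&lo->lock);
    }

    static void return_layout_when_idle(struct layout *lo)
    {
    	pthread_mutex_lock(&lo->lock);
    	while (lo->io_refs > 0)
    		pthread_cond_wait(&lo->idle, &lo->lock);
    	pthread_mutex_unlock(&lo->lock);
    	printf("all DS I/O drained; LAYOUTRETURN can be sent now\n");
    }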
Boaz Harrosh Aug. 12, 2012, 5:36 p.m. UTC | #9
On 08/09/2012 06:39 PM, Myklebust, Trond wrote:
> If the problem is that the DS is failing to respond, how does the client
> know that the in-flight I/O has ended?

For the client, the above DS in question, has timed-out, we have reset
it's session and closed it's sockets. And all it's RPC requests have
been, or are being, ended with a timeout-error. So the timed-out
DS is a no-op. All it's IO request will end very soon, if not already.

A DS time-out is just a very valid, and meaningful response, just like
an op-done-with-error. This was what Andy added to the RFC's errata
which I agree with.

> 
> No. It is using the layoutreturn to tell the MDS to fence off I/O to a
> data server that is not responding. It isn't attempting to use the
> layout after the layoutreturn: 

> the whole point is that we are attempting
> write-through-MDS after the attempt to write through the DS timed out.
> 

Trond STOP!!! this is pure bullshit. You guys took the opportunity of
me being in Hospital, and the rest of the bunch not having a clue. And
snuck in a patch that is totally wrong for everyone, not taking care of
any other LD *crashes* . And especially when this patch is wrong even for
files layout.

This above here is where you are wrong!! You don't understand my point,
and ignore my comments. So let me state it as clear as I can.

(Lets assume files layout, for blocks and objects it's a bit different
 but mostly the same.)

- Heavy IO is going on, the device_id in question has *3* DSs in it's
  device topography. Say DS1, DS2, DS3

- We have been queuing IO, and all queues are full. (we have 3 queues in
  in question, right? What is the maximum Q depth per files-DS? I know
  that in blocks and objects we usually have, I think, something like 128.
  This is a *tunable* in the block-layer's request-queue. Is it not some
  negotiated parameter with the NFS servers?)

- Now, boom DS2 has timed-out. The Linux-client resets the session and
  internally closes all sockets of that session. All the RPCs that
  belong to DS2 are being returned up with a timeout error. This one
  is just the first of all those belonging to this DS2. They will
  be decrementing the reference for this layout very, very soon.

- But what about DS1, and DS3 RPCs. What should we do with those?
  This is where you guys (Trond and Andy) are wrong. We must also
  wait for these RPC's as well. And opposite to what you think, this
  should not take long. Let me explain:

  We don't know anything about DS1 and DS3, each might be, either,
  "Having the same communication problem, like DS2". Or "is just working
  fine". So lets say for example that DS3 will also time-out in the
  future, and that DS1 is just fine and is writing as usual.

  * DS1 - Since it's working, it has most probably already done
    with all it's IO, because the NFS timeout is usually much longer
    then the normal RPC time, and since we are queuing evenly on
    all 3 DSs, at this point must probably, all of DS1 RPCs are
    already done. (And layout has been de-referenced).

  * DS3 - Will timeout in the future, when will that be?
    So let me start with, saying:
    (1). We could enhance our code and proactively, 
        "cancel/abort" all RPCs that belong to DS3 (more on this
         below)
    (2). Or We can prove that DS3's RPCs will timeout at worst
         case 1 x NFS-timeout after above DS2 timeout event, or
         2 x NFS-timeout after the queuing of the first timed-out
         RPC. And statistically in the average case DS3 will timeout
         very near the time DS2 timed-out.

         This is easy since the last IO we queued was the one that
         made DS2's queue to be full, and it was kept full because
         DS2 stopped responding and nothing emptied the queue.

     So the easiest we can do is wait for DS3 to timeout, soon
     enough, and once that will happen, session will be reset and all
     RPCs will end with an error.

So in the worst case scenario we can recover 2 x NFS-timeout after
a network partition, which is just 1 x NFS-timeout, after your
schizophrenic FENCE_ME_OFF, newly invented operation.

What we can do to enhance our code to reduce error recovery to
1 x NFS-timeout:

- DS3 above:
  (As I said DS1's queues are now empty, because it was working fine,
   So DS3 is a representation of all DS's that have RPCs at the
   time DS2 timed-out, which belong to this layout)

  We can proactively abort all RPCs belonging to DS3. If there is
  a way to internally abort RPC's use that. Else just reset it's
  session and all sockets will close (and reopen), and all RPC's
  will end with a disconnect error.

- Both DS2 that timed-out, and DS3 that was aborted. Should be
  marked with a flag. When new IO that belong to some other
  inode through some other layout+device_id encounters a flagged
  device, it should abort and turn to MDS IO, with also invalidating
  it's layout, and hens, soon enough the device_id for DS2&3 will be
  de-referenced and be removed from device cache. (And all referencing
  layouts are now gone)
  So we do not continue queuing new IO to dead devices. And since most
  probably MDS will not give us dead servers in new layout, we should be
  good.

In summery.
- FENCE_ME_OFF is a new operation, and is not === LAYOUT_RETURN. Client
  *must not* skb-send a single byte belonging to a layout, after the send
  of LAYOUT_RETURN.
  (It need not wait for OPT_DONE from DS to do that, it just must make
   sure, that all it's internal, or on-the-wire request, are aborted
   by easily closing the sockets they belong too, and/or waiting for
   healthy DS's IO to be OPT_DONE . So the client is not dependent on
   any DS response, it is only dependent on it's internal state being
   *clean* from any more skb-send(s))

- The proper implementation of LAYOUT_RETURN on error for fast turnover
  is not hard, and does not involve a new invented NFS operation such
  as FENCE_ME_OFF. Proper codded client, independently, without
  the aid of any FENCE_ME_OFF operation, can achieve a faster turnaround
  by actively returning all layouts that belong to a bad DS, and not
  waiting for a fence-off of a single layout, then encountering just
  the same error with all other layouts that have the same DS     

- And I know that just as you did not read my emails from before
  me going to Hospital, you will continue to not understand this
  one, or what I'm trying to explain, and will most probably ignore
  all of it. But please note one thing:

    YOU have sabotaged the NFS 4.1 Linux client, which is now totally
    not STD complaint, and have introduced CRASHs. And for no good
    reason.

No thanks
Boaz
Trond Myklebust Aug. 13, 2012, 4:26 p.m. UTC | #10
On Sun, 2012-08-12 at 20:36 +0300, Boaz Harrosh wrote:
> On 08/09/2012 06:39 PM, Myklebust, Trond wrote:
> > If the problem is that the DS is failing to respond, how does the client
> > know that the in-flight I/O has ended?
> 
> For the client, the above DS in question, has timed-out, we have reset
> it's session and closed it's sockets. And all it's RPC requests have
> been, or are being, ended with a timeout-error. So the timed-out
> DS is a no-op. All it's IO request will end very soon, if not already.
> 
> A DS time-out is just a very valid, and meaningful response, just like
> an op-done-with-error. This was what Andy added to the RFC's errata
> which I agree with.
> 
> >
> > No. It is using the layoutreturn to tell the MDS to fence off I/O to a
> > data server that is not responding. It isn't attempting to use the
> > layout after the layoutreturn:
> 
> > the whole point is that we are attempting
> > write-through-MDS after the attempt to write through the DS timed out.
> >
> 
> Trond STOP!!! this is pure bullshit. You guys took the opportunity of
> me being in Hospital, and the rest of the bunch not having a clue. And
> snuck in a patch that is totally wrong for everyone, not taking care of
> any other LD *crashes* . And especially when this patch is wrong even for
> files layout.
> 
> This above here is where you are wrong!! You don't understand my point,
> and ignore my comments. So let me state it as clear as I can.

YOU are ignoring the reality of SunRPC. There is no abort/cancel/timeout
for an RPC call once it has started. This is why we need fencing
_specifically_ for the pNFS files client.

> (Lets assume files layout, for blocks and objects it's a bit different
>  but mostly the same.)

That, and the fact that fencing hasn't been implemented for blocks and
objects. The commit in question is 82c7c7a5a (NFSv4.1 return the LAYOUT
for each file with failed DS connection I/O) and touches only
fs/nfs/nfs4filelayout.c. It cannot be affecting blocks and objects.

> - Heavy IO is going on, the device_id in question has *3* DSs in it's
>   device topography. Say DS1, DS2, DS3
> 
> - We have been queuing IO, and all queues are full. (we have 3 queues in
>   in question, right? What is the maximum Q depth per files-DS? I know
>   that in blocks and objects we usually have, I think, something like 128.
>   This is a *tunable* in the block-layer's request-queue. Is it not some
>   negotiated parameter with the NFS servers?)
> 
> - Now, boom DS2 has timed-out. The Linux-client resets the session and
>   internally closes all sockets of that session. All the RPCs that
>   belong to DS2 are being returned up with a timeout error. This one
>   is just the first of all those belonging to this DS2. They will
>   be decrementing the reference for this layout very, very soon.
> 
> - But what about DS1, and DS3 RPCs. What should we do with those?
>   This is where you guys (Trond and Andy) are wrong. We must also
>   wait for these RPC's as well. And opposite to what you think, this
>   should not take long. Let me explain:
> 
>   We don't know anything about DS1 and DS3, each might be, either,
>   "Having the same communication problem, like DS2". Or "is just working
>   fine". So lets say for example that DS3 will also time-out in the
>   future, and that DS1 is just fine and is writing as usual.
> 
>   * DS1 - Since it's working, it has most probably already done
>     with all it's IO, because the NFS timeout is usually much longer
>     then the normal RPC time, and since we are queuing evenly on
>     all 3 DSs, at this point must probably, all of DS1 RPCs are
>     already done. (And layout has been de-referenced).
> 
>   * DS3 - Will timeout in the future, when will that be?
>     So let me start with, saying:
>     (1). We could enhance our code and proactively,
>         "cancel/abort" all RPCs that belong to DS3 (more on this
>          below)

Which makes the race _WORSE_. As I said above, there is no 'cancel RPC'
operation in SUNRPC. Once your RPC call is launched, it cannot be
recalled. All your discussion above is about the client side, and
ignores what may be happening on the data server side. The fencing is
what is needed to deal with the data server picture.

>     (2). Or We can prove that DS3's RPCs will timeout at worst
>          case 1 x NFS-timeout after above DS2 timeout event, or
>          2 x NFS-timeout after the queuing of the first timed-out
>          RPC. And statistically in the average case DS3 will timeout
>          very near the time DS2 timed-out.
> 
>          This is easy since the last IO we queued was the one that
>          made DS2's queue to be full, and it was kept full because
>          DS2 stopped responding and nothing emptied the queue.
> 
>      So the easiest we can do is wait for DS3 to timeout, soon
>      enough, and once that will happen, session will be reset and all
>      RPCs will end with an error.

You are still only discussing the client side.

Read my lips: Sun RPC OPERATIONS DO NOT TIMEOUT AND CANNOT BE ABORTED OR
CANCELED. Fencing is the closest we can come to an abort operation.

> So in the worst case scenario we can recover 2 x NFS-timeout after
> a network partition, which is just 1 x NFS-timeout, after your
> schizophrenic FENCE_ME_OFF, newly invented operation.
> 
> What we can do to enhance our code to reduce error recovery to
> 1 x NFS-timeout:
> 
> - DS3 above:
>   (As I said DS1's queues are now empty, because it was working fine,
>    So DS3 is a representation of all DS's that have RPCs at the
>    time DS2 timed-out, which belong to this layout)
> 
>   We can proactively abort all RPCs belonging to DS3. If there is
>   a way to internally abort RPC's use that. Else just reset it's
>   session and all sockets will close (and reopen), and all RPC's
>   will end with a disconnect error.

Not on most servers that I'm aware of. If you close or reset the socket
on the client, then the Linux server will happily continue to process
those RPC calls; it just won't be able to send a reply.
Furthermore, if the problem is that the data server isn't responding,
then a socket close/reset tells you nothing either.

> - Both DS2 that timed-out, and DS3 that was aborted. Should be
>   marked with a flag. When new IO that belong to some other
>   inode through some other layout+device_id encounters a flagged
>   device, it should abort and turn to MDS IO, with also invalidating
>   it's layout, and hens, soon enough the device_id for DS2&3 will be
>   de-referenced and be removed from device cache. (And all referencing
>   layouts are now gone)

There is no RPC abort functionality Sun RPC. Again, this argument relies
on functionality that _doesn't_ exist.

>   So we do not continue queuing new IO to dead devices. And since most
>   probably MDS will not give us dead servers in new layout, we should be
>   good.
> In summery.
> - FENCE_ME_OFF is a new operation, and is not === LAYOUT_RETURN. Client
>   *must not* skb-send a single byte belonging to a layout, after the send
>   of LAYOUT_RETURN.
>   (It need not wait for OPT_DONE from DS to do that, it just must make
>    sure, that all it's internal, or on-the-wire request, are aborted
>    by easily closing the sockets they belong too, and/or waiting for
>    healthy DS's IO to be OPT_DONE . So the client is not dependent on
>    any DS response, it is only dependent on it's internal state being
>    *clean* from any more skb-send(s))

Ditto

> - The proper implementation of LAYOUT_RETURN on error for fast turnover
>   is not hard, and does not involve a new invented NFS operation such
>   as FENCE_ME_OFF. Proper codded client, independently, without
>   the aid of any FENCE_ME_OFF operation, can achieve a faster turnaround
>   by actively returning all layouts that belong to a bad DS, and not
>   waiting for a fence-off of a single layout, then encountering just
>   the same error with all other layouts that have the same DS

What do you mean by "all layouts that belong to a bad DS"? Layouts don't
belong to a DS, and so there is no way to get from a DS to a layout.

> - And I know that just as you did not read my emails from before
>   me going to Hospital, you will continue to not understand this
>   one, or what I'm trying to explain, and will most probably ignore
>   all of it. But please note one thing:

I read them, but just as now, they continue to ignore the reality about
timeouts: timeouts mean _nothing_ in an RPC failover situation. There is
no RPC abort functionality that you can rely on other than fencing.

>     YOU have sabotaged the NFS 4.1 Linux client, which is now totally
>     not STD complaint, and have introduced CRASHs. And for no good
>     reason.

See above.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
Trond Myklebust Aug. 13, 2012, 4:58 p.m. UTC | #11
On Mon, 2012-08-13 at 12:26 -0400, Trond Myklebust wrote:
> On Sun, 2012-08-12 at 20:36 +0300, Boaz Harrosh wrote:
> >   We can proactively abort all RPCs belonging to DS3. If there is
> >   a way to internally abort RPC's use that. Else just reset it's
> >   session and all sockets will close (and reopen), and all RPC's
> >   will end with a disconnect error.
> 
> Not on most servers that I'm aware of. If you close or reset the socket
> on the client, then the Linux server will happily continue to process
> those RPC calls; it just won't be able to send a reply.

One small correction here:
_If_ we are using NFSv4.2, and _if_ the client requests the
EXCHGID4_FLAG_SUPP_FENCE_OPS in the EXCHANGE_ID operation, and _if_ the
data server replies that it supports that, and _if_ the client gets a
successful reply to a DESTROY_SESSION call to the data server, _then_ it
can know that all RPC calls have completed.

However, we're not supporting NFSv4.2 yet.

> Furthermore, if the problem is that the data server isn't responding,
> then a socket close/reset tells you nothing either.

...and we still have no solution for this case.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
Boaz Harrosh Aug. 13, 2012, 11:39 p.m. UTC | #12
On 08/13/2012 07:26 PM, Myklebust, Trond wrote:

>> This above here is where you are wrong!! You don't understand my point,
>> and ignore my comments. So let me state it as clear as I can.
> 
> YOU are ignoring the reality of SunRPC. There is no abort/cancel/timeout
> for an RPC call once it has started. This is why we need fencing
> _specifically_ for the pNFS files client.
> 


Again we have a communication problem between us. I say some words and
mean one thing, and you say and hear the same words but attach different
meanings to them. This is no one's fault; it just is.

Lets do an experiment, mount a regular NFS4 in -o soft mode and start
writing to server, say with dd. Now disconnect the cable. After some timeout
the dd will return with "IO error", and will stop writing to file.

This is the timeout I mean. Surely some RPC-requests did not complete and
returned to NFS core with some kind of error.

With RPC-requests I do not mean the RPC protocol on the wire, I mean that
entity inside the Linux Kernel which represents an RPC. Surely some
linux-RPC-request objects were released not due to a server "rpc-done"
being received, but due to an internal mechanism that called the "release"
method because of a communication timeout.

So this is what I call "returned with a timeout". It does exist and is used
every day.
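
A small self-contained program illustrating the behaviour described above
(not code from the thread; the mount point is an example path): on a soft
mount, the write/fsync eventually fails with EIO once the server stops
responding, instead of blocking forever.

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
    	static char buf[1 << 20];	/* 1 MiB of zeroes */
    	int fd = open("/mnt/softnfs/testfile", O_WRONLY | O_CREAT, 0644);

    	if (fd < 0)
    		return 1;
    	for (;;) {
    		if (write(fd, buf, sizeof(buf)) < 0 || fsync(fd) < 0) {
    			/* with -o soft this is where EIO surfaces */
    			printf("I/O failed: %s\n", strerror(errno));
    			break;
    		}
    	}
    	close(fd);
    	return 0;
    }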

Even better if I don't disconnect the wire but do an if_down or halt on the
server, the dd's IO error will happen immediately, not even wait for
any timeout. This is because the socket is orderly closed and all
sends/receives will return quickly with a "disconnect-error".

When I use a single Server like the nfs4 above. Then there is one fact
in above scenario that I want to point out:

    At some point in the NFS-Core state, no more requests are issued,
    all old requests have been released, and an error is returned to the
    application. At that point the client will not call skb-send, and
    will not try further communication with the Server.

This is what must happen with ALL DSs that belong to a layout, before
client should be LAYOUT_RETURN(ing). The client can only do it's job. That is:

   STOP any skb-send, to any of the DSs in a layout.
   Only then it is complying to the RFC.

So this is what I mean by "return with a timeout below"

>> (Lets assume files layout, for blocks and objects it's a bit different
>>  but mostly the same.)
> 
> That, and the fact that fencing hasn't been implemented for blocks and
> objects. 


That's not true. At Panasas and both at EMC there is fencing in place and
it is used every day. This is why I insist that it is very much
the same for all of us.

> The commit in question is 82c7c7a5a (NFSv4.1 return the LAYOUT
> for each file with failed DS connection I/O) and touches only
> fs/nfs/nfs4filelayout.c. It cannot be affecting blocks and objects.
> 


OK I had in mind the patches that Andy sent. I'll look again for what
actually went in. (It was all while I was unavailable)

>> - Heavy IO is going on, the device_id in question has *3* DSs in it's
>>   device topography. Say DS1, DS2, DS3
>>
>> - We have been queuing IO, and all queues are full. (we have 3 queues in
>>   in question, right? What is the maximum Q depth per files-DS? I know
>>   that in blocks and objects we usually have, I think, something like 128.
>>   This is a *tunable* in the block-layer's request-queue. Is it not some
>>   negotiated parameter with the NFS servers?)
>>
>> - Now, boom DS2 has timed-out. The Linux-client resets the session and
>>   internally closes all sockets of that session. All the RPCs that
>>   belong to DS2 are being returned up with a timeout error. This one
>>   is just the first of all those belonging to this DS2. They will
>>   be decrementing the reference for this layout very, very soon.
>>
>> - But what about DS1, and DS3 RPCs. What should we do with those?
>>   This is where you guys (Trond and Andy) are wrong. We must also
>>   wait for these RPC's as well. And opposite to what you think, this
>>   should not take long. Let me explain:
>>
>>   We don't know anything about DS1 and DS3, each might be, either,
>>   "Having the same communication problem, like DS2". Or "is just working
>>   fine". So lets say for example that DS3 will also time-out in the
>>   future, and that DS1 is just fine and is writing as usual.
>>
>>   * DS1 - Since it's working, it has most probably already done
>>     with all it's IO, because the NFS timeout is usually much longer
>>     then the normal RPC time, and since we are queuing evenly on
>>     all 3 DSs, at this point must probably, all of DS1 RPCs are
>>     already done. (And layout has been de-referenced).
>>
>>   * DS3 - Will timeout in the future, when will that be?
>>     So let me start with, saying:
>>     (1). We could enhance our code and proactively, 
>>         "cancel/abort" all RPCs that belong to DS3 (more on this
>>          below)
> 
> Which makes the race _WORSE_. As I said above, there is no 'cancel RPC'
> operation in SUNRPC. Once your RPC call is launched, it cannot be
> recalled. All your discussion above is about the client side, and
> ignores what may be happening on the data server side. The fencing is
> what is needed to deal with the data server picture.
> 


Again, some misunderstanding. I never said we should not send
a LAYOUT_RETURN before writing through the MDS. The opposite is true:
I think it is a novel idea and gives you the kind of barrier that
will harden the system and make it more robust.

   WHAT I'm saying is that this cannot happen while the schizophrenic
   client is still busily and actively skb-sending more and more bytes
   to all the other DSs in the layout, LONG AFTER THE LAYOUT_RETURN
   HAS BEEN SENT AND RESPONDED TO.

So what you are saying does not at all contradict what I want.

   "The fencing is what is needed to deal with the data server picture"
    
    Fine But ONLY after the client has really stopped all sends.
    (Each one will do it's job)

BTW: The server does not *need* the client to send a LAYOUT_RETURN.
     It's just a nice-to-have, which I'm fine with.
     Both Panasas and EMC, when IO is sent through the MDS, will first
     recall overlapping layouts, and only then proceed with
     MDS processing. (This is some deeply rooted mechanism inside
     the FS, the MDS being just another client.)

     So this is a known problem that is taken care of. But I totally
     agree with you: the client LAYOUT_RETURN(ing) the layout will save
     lots of protocol time by avoiding the recalls.
     Now you understand why in objects we mandated this LAYOUT_RETURN
     on errors, and while at it we want the exact error reported.

>>     (2). Or We can prove that DS3's RPCs will timeout at worst
>>          case 1 x NFS-timeout after above DS2 timeout event, or
>>          2 x NFS-timeout after the queuing of the first timed-out
>>          RPC. And statistically in the average case DS3 will timeout
>>          very near the time DS2 timed-out.
>>
>>          This is easy since the last IO we queued was the one that
>>          made DS2's queue to be full, and it was kept full because
>>          DS2 stopped responding and nothing emptied the queue.
>>
>>      So the easiest we can do is wait for DS3 to timeout, soon
>>      enough, and once that will happen, session will be reset and all
>>      RPCs will end with an error.
> 
> 
> You are still only discussing the client side.
> 
> Read my lips: Sun RPC OPERATIONS DO NOT TIMEOUT AND CANNOT BE ABORTED OR
> CANCELED. Fencing is the closest we can come to an abort operation.
> 


Again, I did not mean the "Sun RPC OPERATIONS" on the wire. I meant
the Linux request entity which, while it exists, has the potential to be
submitted for skb-send. As seen above, these entities do time out
in "-o soft" mode, and once released they remove the potential of any more
future skb-sends on the wire.

BUT what I do not understand is: in the above example we are talking
about DS3. We assumed that DS3 has a communication problem. So no amount
of "fencing" or voodoo or any other kind of operation can ever affect
the client regarding DS3, because even if the pending requests from the
client on DS3 are fenced and discarded on the server, these errors will not
be communicated back to the client. The client will sit idle on DS3
communication until the end of the timeout, regardless.

Actually, what I propose for DS3 in the most robust client is to
destroy DS3's sessions and therefore cause all Linux request entities
to return much, much faster than if *just waiting* for the timeout to expire.

>> So in the worst case scenario we can recover 2 x NFS-timeout after
>> a network partition, which is just 1 x NFS-timeout, after your
>> schizophrenic FENCE_ME_OFF, newly invented operation.
>>
>> What we can do to enhance our code to reduce error recovery to
>> 1 x NFS-timeout:
>>
>> - DS3 above:
>>   (As I said DS1's queues are now empty, because it was working fine,
>>    So DS3 is a representation of all DS's that have RPCs at the
>>    time DS2 timed-out, which belong to this layout)
>>
>>   We can proactively abort all RPCs belonging to DS3. If there is
>>   a way to internally abort RPC's use that. Else just reset it's
>>   session and all sockets will close (and reopen), and all RPC's
>>   will end with a disconnect error.
> 
> Not on most servers that I'm aware of. If you close or reset the socket
> on the client, then the Linux server will happily continue to process
> those RPC calls; it just won't be able to send a reply.
> Furthermore, if the problem is that the data server isn't responding,
> then a socket close/reset tells you nothing either.
> 


Again, I'm talking about the NFS-internal request entities; these will
be released, thereby guaranteeing that no more threads will use any of
them to send any more bytes over to any DSs.

AND yes, yes. Once the client has done its job and stopped any future
skb-sends to *all* DSs in question, only then can it report to the MDS:
  "Hey, I'm done sending on all other routes, here is a LAYOUT_RETURN"
  (Now fencing happens on the servers)

  and the client goes on and says

  "Hey, MDS, can you please also write this data"

Which is perfect for the MDS, because otherwise, if it wants to make sure,
it will need to recall all outstanding layouts, exactly for your
reason: concern about the data corruption that can happen.

>> - Both DS2 that timed-out, and DS3 that was aborted. Should be
>>   marked with a flag. When new IO that belong to some other
>>   inode through some other layout+device_id encounters a flagged
>>   device, it should abort and turn to MDS IO, with also invalidating
>>   it's layout, and hens, soon enough the device_id for DS2&3 will be
>>   de-referenced and be removed from device cache. (And all referencing
>>   layouts are now gone)
> 
> There is no RPC abort functionality Sun RPC. Again, this argument relies
> on functionality that _doesn't_ exist.
> 


Again, I mean internally at the client. For example, closing the socket will
have the effect I want. (There are some other tricks too; we can talk about
those later, let's agree about the principle first.)

>>   So we do not continue queuing new IO to dead devices. And since most
>>   probably MDS will not give us dead servers in new layout, we should be
>>   good.
>> In summery.
>> - FENCE_ME_OFF is a new operation, and is not === LAYOUT_RETURN. Client
>>   *must not* skb-send a single byte belonging to a layout, after the send
>>   of LAYOUT_RETURN.
>>   (It need not wait for OPT_DONE from DS to do that, it just must make
>>    sure, that all it's internal, or on-the-wire request, are aborted
>>    by easily closing the sockets they belong too, and/or waiting for
>>    healthy DS's IO to be OPT_DONE . So the client is not dependent on
>>    any DS response, it is only dependent on it's internal state being
>>    *clean* from any more skb-send(s))
> 
> Ditto
> 
>> - The proper implementation of LAYOUT_RETURN on error for fast turnover
>>   is not hard, and does not involve a new invented NFS operation such
>>   as FENCE_ME_OFF. Proper codded client, independently, without
>>   the aid of any FENCE_ME_OFF operation, can achieve a faster turnaround
>>   by actively returning all layouts that belong to a bad DS, and not
>>   waiting for a fence-off of a single layout, then encountering just
>>   the same error with all other layouts that have the same DS     
> 
> What do you mean by "all layouts that belong to a bad DS"? Layouts don't
> belong to a DS, and so there is no way to get from a DS to a layout.
> 


Why, sure: loop over all layouts and ask whether each one references a specific DS.
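
A rough sketch of that loop, for illustration only. The list names
(server->layouts, plh_layouts, plh_segs, pls_list) are the ones used by the
fs/nfs pnfs headers of this era; the demo_ helpers are hypothetical, the
per-segment device test is layout-type specific, and the cl_lock/i_lock
locking the real code would need is elided:

/* Builds against the fs/nfs internal "pnfs.h" header. */
static bool demo_lseg_uses_deviceid(struct pnfs_layout_segment *lseg,
				    const struct nfs4_deviceid *devid)
{
	return false;	/* placeholder: layout-type specific test */
}

static void demo_schedule_layoutreturn(struct pnfs_layout_hdr *lo)
{
	/* placeholder: mark the layout so a LAYOUTRETURN is sent for it */
}

/* Walk every layout held under one nfs_server and hand the ones that
 * reference the given data server to the "return this layout" helper. */
static void demo_return_layouts_for_ds(struct nfs_server *server,
				       const struct nfs4_deviceid *devid)
{
	struct pnfs_layout_hdr *lo;
	struct pnfs_layout_segment *lseg;

	/* locking elided for brevity */
	list_for_each_entry(lo, &server->layouts, plh_layouts) {
		list_for_each_entry(lseg, &lo->plh_segs, pls_list) {
			if (!demo_lseg_uses_deviceid(lseg, devid))
				continue;
			demo_schedule_layoutreturn(lo);
			break;
		}
	}
}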

>> - And I know that just as you did not read my emails from before
>>   me going to Hospital, you will continue to not understand this
>>   one, or what I'm trying to explain, and will most probably ignore
>>   all of it. But please note one thing:
> 
> I read them, but just as now, they continue to ignore the reality about
> timeouts: timeouts mean _nothing_ in an RPC failover situation. There is
> no RPC abort functionality that you can rely on other than fencing.
> 


I hope I have explained this by now. If not, please, please, let's organize
a phone call. We can use the Panasas conference number whenever you are
available. I think we communicate better in person.

Everyone else is also invited.

BUT there is one most important point for me:

   As stated by the RFC, the client must guarantee that no more bytes will be
   sent to any DSs in a layout once LAYOUT_RETURN is sent. This is the
   only definition of LAYOUT_RETURN, and of NO_MATCHING_LAYOUT as a response
   to a LAYOUT_RECALL. Which is:
   the client has indicated no more future sends on a layout. (And the server
   will enforce it with fencing.)

>>     YOU have sabotaged the NFS 4.1 Linux client, which is now totally
>>     not STD complaint, and have introduced CRASHs. And for no good
>>     reason.
> 
> See above.
> 


OK, we'll have to see about these crashes; let's talk about them.

Thanks
Boaz
Trond Myklebust Aug. 14, 2012, 12:16 a.m. UTC | #13
T24gVHVlLCAyMDEyLTA4LTE0IGF0IDAyOjM5ICswMzAwLCBCb2F6IEhhcnJvc2ggd3JvdGU6DQo+
IE9uIDA4LzEzLzIwMTIgMDc6MjYgUE0sIE15a2xlYnVzdCwgVHJvbmQgd3JvdGU6DQo+IA0KPiA+
PiBUaGlzIGFib3ZlIGhlcmUgaXMgd2hlcmUgeW91IGFyZSB3cm9uZyEhIFlvdSBkb24ndCB1bmRl
cnN0YW5kIG15IHBvaW50LA0KPiA+PiBhbmQgaWdub3JlIG15IGNvbW1lbnRzLiBTbyBsZXQgbWUg
c3RhdGUgaXQgYXMgY2xlYXIgYXMgSSBjYW4uDQo+ID4gDQo+ID4gWU9VIGFyZSBpZ25vcmluZyB0
aGUgcmVhbGl0eSBvZiBTdW5SUEMuIFRoZXJlIGlzIG5vIGFib3J0L2NhbmNlbC90aW1lb3V0DQo+
ID4gZm9yIGFuIFJQQyBjYWxsIG9uY2UgaXQgaGFzIHN0YXJ0ZWQuIFRoaXMgaXMgd2h5IHdlIG5l
ZWQgZmVuY2luZw0KPiA+IF9zcGVjaWZpY2FsbHlfIGZvciB0aGUgcE5GUyBmaWxlcyBjbGllbnQu
DQo+ID4gDQo+IA0KPiANCj4gQWdhaW4gd2UgaGF2ZSBhIGNvbW11bmljYXRpb24gcHJvYmxlbSBi
ZXR3ZWVuIHVzLiBJIHNheSBzb21lIHdvcmRzIGFuZA0KPiBtZWFuIG9uZSB0aGluZywgYW5kIHlv
dSBzYXkgYW5kIGhlYXIgdGhlIHNhbWUgd29yZHMgYnV0IGF0dGFjaCBkaWZmZXJlbnQNCj4gbWVh
bmluZ3MgdG8gdGhlbS4gVGhpcyBpcyBubyBvbmUncyBmYXVsdCBpdCdzIGp1c3QgaXMuDQo+IA0K
PiBMZXRzIGRvIGFuIGV4cGVyaW1lbnQsIG1vdW50IGEgcmVndWxhciBORlM0IGluIC1vIHNvZnQg
bW9kZSBhbmQgc3RhcnQNCj4gd3JpdGluZyB0byBzZXJ2ZXIsIHNheSB3aXRoIGRkLiBOb3cgZGlz
Y29ubmVjdCB0aGUgY2FibGUuIEFmdGVyIHNvbWUgdGltZW91dA0KPiB0aGUgZGQgd2lsbCByZXR1
cm4gd2l0aCAiSU8gZXJyb3IiLCBhbmQgd2lsbCBzdG9wIHdyaXRpbmcgdG8gZmlsZS4NCj4gDQo+
IFRoaXMgaXMgdGhlIHRpbWVvdXQgSSBtZWFuLiBTdXJlbHkgc29tZSBSUEMtcmVxdWVzdHMgZGlk
IG5vdCBjb21wbGV0ZSBhbmQNCj4gcmV0dXJuZWQgdG8gTkZTIGNvcmUgd2l0aCBzb21lIGtpbmQg
b2YgZXJyb3IuDQo+IA0KPiBXaXRoIFJQQy1yZXF1ZXN0cyBJIGRvIG5vdCBtZWFuIHRoZSBSUEMg
cHJvdG9jb2wgb24gdGhlIHdpcmUsIEkgbWVhbiB0aGF0DQo+IGVudGl0eSBpbnNpZGUgdGhlIExp
bnV4IEtlcm5lbCB3aGljaCByZXByZXNlbnRzIGFuIFJQQy4gU3VybHkgc29tZQ0KPiBsaW51eC1S
UEMtcmVxdWVzdHMgb2JqZWN0cyB3ZXJlIG5vdCByZWxlYXNlZCBkbyB0byBhIHNlcnZlciAicnBj
LWRvbmUiDQo+IHJlY2VpdmVkLiBCdXQgZG8gdG8gYW4gaW50ZXJuYWwgbWVjaGFuaXNtIHRoYXQg
Y2FsbGVkIHRoZSAicmVsZWFzZSIgbWV0aG9kIGRvIHRvDQo+IGEgY29tbXVuaWNhdGlvbiB0aW1l
b3V0Lg0KPiANCj4gU28gdGhpcyBpcyB3aGF0IEkgY2FsbCAicmV0dXJuZWQgd2l0aCBhIHRpbWVv
dXQiLiBJdCBkb2VzIGV4aXN0IGFuZCB1c2VkDQo+IGV2ZXJ5IGRheS4NCj4gDQo+IEV2ZW4gYmV0
dGVyIGlmIEkgZG9uJ3QgZGlzY29ubmVjdCB0aGUgd2lyZSBidXQgZG8gYW4gaWZfZG93biBvciBo
YWx0IG9uIHRoZQ0KPiBzZXJ2ZXIsIHRoZSBkZCdzIElPIGVycm9yIHdpbGwgaGFwcGVuIGltbWVk
aWF0ZWx5LCBub3QgZXZlbiB3YWl0IGZvcg0KPiBhbnkgdGltZW91dC4gVGhpcyBpcyBiZWNhdXNl
IHRoZSBzb2NrZXQgaXMgb3JkZXJseSBjbG9zZWQgYW5kIGFsbA0KPiBzZW5kcy9yZWNlaXZlcyB3
aWxsIHJldHVybiBxdWlja2x5IHdpdGggYSAiZGlzY29ubmVjdC1lcnJvciIuDQo+IA0KPiBXaGVu
IEkgdXNlIGEgc2luZ2xlIFNlcnZlciBsaWtlIHRoZSBuZnM0IGFib3ZlLiBUaGVuIHRoZXJlIGlz
IG9uZSBmYWN0DQo+IGluIGFib3ZlIHNjZW5hcmlvIHRoYXQgSSB3YW50IHRvIHBvaW50IG91dDoN
Cj4gDQo+ICAgICBBdCBzb21lIHBvaW50IGluIHRoZSBORlMtQ29yZSBzdGF0ZS4gVGhlcmUgaXMg
YSBwb2ludCB0aGF0IG5vIG1vcmUNCj4gICAgIHJlcXVlc3RzIGFyZSBpc3N1ZWQsIGFsbCBvbGQg
cmVxdWVzdCBoYXZlIHJlbGVhc2VkLCBhbmQgYW4gZXJyb3IgaXMNCj4gICAgIHJldHVybmVkIHRv
IHRoZSBhcHBsaWNhdGlvbi4gQXQgdGhhdCBwb2ludCB0aGUgY2xpZW50IHdpbGwgbm90IGNhbGwN
Cj4gICAgIHNrYi1zZW5kLCBhbmQgd2lsbCBub3QgdHJ5IGZhcnRoZXIgY29tbXVuaWNhdGlvbiB3
aXRoIHRoZSBTZXJ2ZXIuDQo+IA0KPiBUaGlzIGlzIHdoYXQgbXVzdCBoYXBwZW4gd2l0aCBBTEwg
RFNzIHRoYXQgYmVsb25nIHRvIGEgbGF5b3V0LCBiZWZvcmUNCj4gY2xpZW50IHNob3VsZCBiZSBM
QVlPVVRfUkVUVVJOKGluZykuIFRoZSBjbGllbnQgY2FuIG9ubHkgZG8gaXQncyBqb2IuIFRoYXQg
aXM6DQo+IA0KPiAgICBTVE9QIGFueSBza2Itc2VuZCwgdG8gYW55IG9mIHRoZSBEU3MgaW4gYSBs
YXlvdXQuDQo+ICAgIE9ubHkgdGhlbiBpdCBpcyBjb21wbHlpbmcgdG8gdGhlIFJGQy4NCj4gDQo+
IFNvIHRoaXMgaXMgd2hhdCBJIG1lYW4gYnkgInJldHVybiB3aXRoIGEgdGltZW91dCBiZWxvdyIN
Cg0KDQpJIGhlYXIgeW91LCBub3cgbGlzdGVuIHRvIG1lLg0KDQpXaG8gX2NhcmVzXyBpZiB0aGUg
Y2xpZW50IHNlbmRzIGFuIFJQQyBjYWxsIGFmdGVyIHRoZSBsYXlvdXRyZXR1cm4/IEluDQp0aGUg
Y2FzZSBvZiBhbiB1bnJlc3BvbnNpdmUgZGF0YSBzZXJ2ZXIgdGhlIGNsaWVudCBjYW4ndCBndWFy
YW50ZWUgdGhhdA0KdGhpcyB3b24ndCBoYXBwZW4gZXZlbiBpZiBpdCBkb2VzIHdhaXQuDQpBIHBO
RlMgZmlsZXMgc2VydmVyIHRoYXQgZG9lc24ndCBwcm9wYWdhdGUgdGhlIGxheW91dHJldHVybiB0
byB0aGUgZGF0YQ0Kc2VydmVyIGluIGEgdGltZWx5IGZhc2hpb24gaXMgZnVuZGFtZW50YWxseSBf
YnJva2VuXyBpbiB0aGUgY2FzZSB3aGVyZQ0KdGhlIGNvbW11bmljYXRpb24gYmV0d2VlbiB0aGUg
ZGF0YSBzZXJ2ZXIgYW5kIGNsaWVudCBpcyBkb3duLiBJdCBjYW5ub3QNCm9mZmVyIGFueSBkYXRh
IGludGVncml0eSBndWFyYW50ZWVzIHdoZW4gdGhlIGNsaWVudCB0cmllcyB0byB3cml0ZQ0KdGhy
b3VnaCBNRFMsIGJlY2F1c2UgdGhlIERTZXMgbWF5IHN0aWxsIGJlIHByb2Nlc3Npbmcgb2xkIHdy
aXRlIFJQQw0KY2FsbHMuDQoNCj4gPj4gKExldHMgYXNzdW1lIGZpbGVzIGxheW91dCwgZm9yIGJs
b2NrcyBhbmQgb2JqZWN0cyBpdCdzIGEgYml0IGRpZmZlcmVudA0KPiA+PiAgYnV0IG1vc3RseSB0
aGUgc2FtZS4pDQo+ID4gDQo+ID4gVGhhdCwgYW5kIHRoZSBmYWN0IHRoYXQgZmVuY2luZyBoYXNu
J3QgYmVlbiBpbXBsZW1lbnRlZCBmb3IgYmxvY2tzIGFuZA0KPiA+IG9iamVjdHMuIA0KPiANCj4g
DQo+IFRoYXQncyBub3QgdHJ1ZS4gQXQgUGFuYXNhcyBhbmQgYm90aCBhdCBFTUMgdGhlcmUgaXMg
ZmVuY2luZyBpbiBwbGFjZSBhbmQNCj4gaXQgaXMgdXNlZCBldmVyeSBkYXkuIFRoaXMgaXMgd2h5
IEkgaW5zaXN0IHRoYXQgaXQgaXMgdmVyeSBtdWNoDQo+IHRoZSBzYW1lIGZvciBhbGwgb2YgdXMu
DQoNCkknbSB0YWxraW5nIGFib3V0IHRoZSB1c2Ugb2YgbGF5b3V0cmV0dXJuIGZvciBjbGllbnQg
ZmVuY2luZywgd2hpY2ggaXMNCm9ubHkgaW1wbGVtZW50ZWQgZm9yIGZpbGVzLg0KDQpIb3dldmVy
IFRhbyBhZG1pdHRlZCB0aGF0IHRoZSBibG9ja3MgY2xpZW50IGhhcyBub3QgeWV0IGltcGxlbWVu
dGVkIHRoZQ0KdGltZWQtbGVhc2UgZmVuY2luZyBhcyBkZXNjcmliZWQgaW4gUkZDNTY2Mywgc28g
dGhlcmUgaXMgc3RpbGwgd29yayB0bw0KYmUgZG9uZSB0aGVyZS4NCg0KSSd2ZSBubyBpZGVhIHdo
YXQgdGhlIG9iamVjdCBjbGllbnQgaXMgZG9pbmcuDQoNCj4gPiBUaGUgY29tbWl0IGluIHF1ZXN0
aW9uIGlzIDgyYzdjN2E1YSAoTkZTdjQuMSByZXR1cm4gdGhlIExBWU9VVA0KPiA+IGZvciBlYWNo
IGZpbGUgd2l0aCBmYWlsZWQgRFMgY29ubmVjdGlvbiBJL08pIGFuZCB0b3VjaGVzIG9ubHkNCj4g
PiBmcy9uZnMvbmZzNGZpbGVsYXlvdXQuYy4gSXQgY2Fubm90IGJlIGFmZmVjdGluZyBibG9ja3Mg
YW5kIG9iamVjdHMuDQo+ID4gDQo+IA0KPiANCj4gT0sgSSBoYWQgaW4gbWluZCB0aGUgcGF0Y2hl
cyB0aGF0IEFuZHkgc2VudC4gSSdsbCBsb29rIGFnYWluIGZvciB3aGF0DQo+IGFjdHVhbGx5IHdl
bnQgaW4uIChJdCB3YXMgYWxsIHdoaWxlIEkgd2FzIHVuYXZhaWxhYmxlKQ0KDQpIZSBzZW50IGEg
cmV2aXNlZCBwYXRjaCBzZXQsIHdoaWNoIHNob3VsZCBvbmx5IGFmZmVjdCB0aGUgZmlsZXMgbGF5
b3V0Lg0KDQo+ID4+IC0gSGVhdnkgSU8gaXMgZ29pbmcgb24sIHRoZSBkZXZpY2VfaWQgaW4gcXVl
c3Rpb24gaGFzICozKiBEU3MgaW4gaXQncw0KPiA+PiAgIGRldmljZSB0b3BvZ3JhcGh5LiBTYXkg
RFMxLCBEUzIsIERTMw0KPiA+Pg0KPiA+PiAtIFdlIGhhdmUgYmVlbiBxdWV1aW5nIElPLCBhbmQg
YWxsIHF1ZXVlcyBhcmUgZnVsbC4gKHdlIGhhdmUgMyBxdWV1ZXMgaW4NCj4gPj4gICBpbiBxdWVz
dGlvbiwgcmlnaHQ/IFdoYXQgaXMgdGhlIG1heGltdW0gUSBkZXB0aCBwZXIgZmlsZXMtRFM/IEkg
a25vdw0KPiA+PiAgIHRoYXQgaW4gYmxvY2tzIGFuZCBvYmplY3RzIHdlIHVzdWFsbHkgaGF2ZSwg
SSB0aGluaywgc29tZXRoaW5nIGxpa2UgMTI4Lg0KPiA+PiAgIFRoaXMgaXMgYSAqdHVuYWJsZSog
aW4gdGhlIGJsb2NrLWxheWVyJ3MgcmVxdWVzdC1xdWV1ZS4gSXMgaXQgbm90IHNvbWUNCj4gPj4g
ICBuZWdvdGlhdGVkIHBhcmFtZXRlciB3aXRoIHRoZSBORlMgc2VydmVycz8pDQo+ID4+DQo+ID4+
IC0gTm93LCBib29tIERTMiBoYXMgdGltZWQtb3V0LiBUaGUgTGludXgtY2xpZW50IHJlc2V0cyB0
aGUgc2Vzc2lvbiBhbmQNCj4gPj4gICBpbnRlcm5hbGx5IGNsb3NlcyBhbGwgc29ja2V0cyBvZiB0
aGF0IHNlc3Npb24uIEFsbCB0aGUgUlBDcyB0aGF0DQo+ID4+ICAgYmVsb25nIHRvIERTMiBhcmUg
YmVpbmcgcmV0dXJuZWQgdXAgd2l0aCBhIHRpbWVvdXQgZXJyb3IuIFRoaXMgb25lDQo+ID4+ICAg
aXMganVzdCB0aGUgZmlyc3Qgb2YgYWxsIHRob3NlIGJlbG9uZ2luZyB0byB0aGlzIERTMi4gVGhl
eSB3aWxsDQo+ID4+ICAgYmUgZGVjcmVtZW50aW5nIHRoZSByZWZlcmVuY2UgZm9yIHRoaXMgbGF5
b3V0IHZlcnksIHZlcnkgc29vbi4NCj4gPj4NCj4gPj4gLSBCdXQgd2hhdCBhYm91dCBEUzEsIGFu
ZCBEUzMgUlBDcy4gV2hhdCBzaG91bGQgd2UgZG8gd2l0aCB0aG9zZT8NCj4gPj4gICBUaGlzIGlz
IHdoZXJlIHlvdSBndXlzIChUcm9uZCBhbmQgQW5keSkgYXJlIHdyb25nLiBXZSBtdXN0IGFsc28N
Cj4gPj4gICB3YWl0IGZvciB0aGVzZSBSUEMncyBhcyB3ZWxsLiBBbmQgb3Bwb3NpdGUgdG8gd2hh
dCB5b3UgdGhpbmssIHRoaXMNCj4gPj4gICBzaG91bGQgbm90IHRha2UgbG9uZy4gTGV0IG1lIGV4
cGxhaW46DQo+ID4+DQo+ID4+ICAgV2UgZG9uJ3Qga25vdyBhbnl0aGluZyBhYm91dCBEUzEgYW5k
IERTMywgZWFjaCBtaWdodCBiZSwgZWl0aGVyLA0KPiA+PiAgICJIYXZpbmcgdGhlIHNhbWUgY29t
bXVuaWNhdGlvbiBwcm9ibGVtLCBsaWtlIERTMiIuIE9yICJpcyBqdXN0IHdvcmtpbmcNCj4gPj4g
ICBmaW5lIi4gU28gbGV0cyBzYXkgZm9yIGV4YW1wbGUgdGhhdCBEUzMgd2lsbCBhbHNvIHRpbWUt
b3V0IGluIHRoZQ0KPiA+PiAgIGZ1dHVyZSwgYW5kIHRoYXQgRFMxIGlzIGp1c3QgZmluZSBhbmQg
aXMgd3JpdGluZyBhcyB1c3VhbC4NCj4gPj4NCj4gPj4gICAqIERTMSAtIFNpbmNlIGl0J3Mgd29y
a2luZywgaXQgaGFzIG1vc3QgcHJvYmFibHkgYWxyZWFkeSBkb25lDQo+ID4+ICAgICB3aXRoIGFs
bCBpdCdzIElPLCBiZWNhdXNlIHRoZSBORlMgdGltZW91dCBpcyB1c3VhbGx5IG11Y2ggbG9uZ2Vy
DQo+ID4+ICAgICB0aGVuIHRoZSBub3JtYWwgUlBDIHRpbWUsIGFuZCBzaW5jZSB3ZSBhcmUgcXVl
dWluZyBldmVubHkgb24NCj4gPj4gICAgIGFsbCAzIERTcywgYXQgdGhpcyBwb2ludCBtdXN0IHBy
b2JhYmx5LCBhbGwgb2YgRFMxIFJQQ3MgYXJlDQo+ID4+ICAgICBhbHJlYWR5IGRvbmUuIChBbmQg
bGF5b3V0IGhhcyBiZWVuIGRlLXJlZmVyZW5jZWQpLg0KPiA+Pg0KPiA+PiAgICogRFMzIC0gV2ls
bCB0aW1lb3V0IGluIHRoZSBmdXR1cmUsIHdoZW4gd2lsbCB0aGF0IGJlPw0KPiA+PiAgICAgU28g
bGV0IG1lIHN0YXJ0IHdpdGgsIHNheWluZzoNCj4gPj4gICAgICgxKS4gV2UgY291bGQgZW5oYW5j
ZSBvdXIgY29kZSBhbmQgcHJvYWN0aXZlbHksIA0KPiA+PiAgICAgICAgICJjYW5jZWwvYWJvcnQi
IGFsbCBSUENzIHRoYXQgYmVsb25nIHRvIERTMyAobW9yZSBvbiB0aGlzDQo+ID4+ICAgICAgICAg
IGJlbG93KQ0KPiA+IA0KPiA+IFdoaWNoIG1ha2VzIHRoZSByYWNlIF9XT1JTRV8uIEFzIEkgc2Fp
ZCBhYm92ZSwgdGhlcmUgaXMgbm8gJ2NhbmNlbCBSUEMnDQo+ID4gb3BlcmF0aW9uIGluIFNVTlJQ
Qy4gT25jZSB5b3VyIFJQQyBjYWxsIGlzIGxhdW5jaGVkLCBpdCBjYW5ub3QgYmUNCj4gPiByZWNh
bGxlZC4gQWxsIHlvdXIgZGlzY3Vzc2lvbiBhYm92ZSBpcyBhYm91dCB0aGUgY2xpZW50IHNpZGUs
IGFuZA0KPiA+IGlnbm9yZXMgd2hhdCBtYXkgYmUgaGFwcGVuaW5nIG9uIHRoZSBkYXRhIHNlcnZl
ciBzaWRlLiBUaGUgZmVuY2luZyBpcw0KPiA+IHdoYXQgaXMgbmVlZGVkIHRvIGRlYWwgd2l0aCB0
aGUgZGF0YSBzZXJ2ZXIgcGljdHVyZS4NCj4gPiANCj4gDQo+IA0KPiBBZ2Fpbiwgc29tZSBtaXNz
IHVuZGVyc3RhbmRpbmcuIEkgbmV2ZXIgc2FpZCB3ZSBzaG91bGQgbm90IHNlbmQNCj4gYSBMQVlP
VVRfUkVUVVJOIGJlZm9yZSB3cml0aW5nIHRocm91Z2ggTURTLiBUaGUgb3Bwb3NpdGUgaXMgdHJ1
ZSwNCj4gSSB0aGluayBpdCBpcyBhIG5vdmVsIGlkZWEgYW5kIGdpdmVzIHlvdSB0aGUga2luZCBv
ZiBiYXJyaWVyIHRoYXQNCj4gd2lsbCBoYXJkZW4gYW5kIHJvYnVzdCB0aGUgc3lzdGVtLg0KPiAN
Cj4gICAgV0hBVCBJJ20gc2F5aW5nIGlzIHRoYXQgdGhpcyBjYW5ub3QgaGFwcGVuIHdoaWxlIHRo
ZSBzY2hpem9waHJlbmljDQo+ICAgIGNsaWVudCBpcyBidXNpbHkgc3RpbGwgYWN0aXZlbHkgc2ti
LXNlbmRpbmcgbW9yZSBhbmQgbW9yZSBieXRlcw0KPiAgICB0byBhbGwgdGhlIG90aGVyIERTcyBp
biB0aGUgbGF5b3V0LiBMT05HIEFGVEVSIFRIRSBMQVlPVVRfUkVUVVJODQo+ICAgIEhBUyBCRUVO
IFNFTlQgQU5EIFJFU1BPTkRFRC4NCj4gDQo+IFNvIHdoYXQgeW91IGFyZSBzYXlpbmcgZG9lcyBu
b3QgYXQgYWxsIGNvbnRyYWRpY3RzIHdoYXQgSSB3YW50Lg0KPiANCj4gICAgIlRoZSBmZW5jaW5n
IGlzIHdoYXQgaXMgbmVlZGVkIHRvIGRlYWwgd2l0aCB0aGUgZGF0YSBzZXJ2ZXIgcGljdHVyZSIN
Cj4gICAgIA0KPiAgICAgRmluZSBCdXQgT05MWSBhZnRlciB0aGUgY2xpZW50IGhhcyByZWFsbHkg
c3RvcHBlZCBhbGwgc2VuZHMuDQo+ICAgICAoRWFjaCBvbmUgd2lsbCBkbyBpdCdzIGpvYikNCj4g
DQo+IEJUVzogVGhlIFNlcnZlciBkb2VzIG5vdCAqbmVlZCogdGhlIENsaWVudCB0byBzZW5kIGEg
TEFZT1VUX1JFVFVSTg0KPiAgICAgIEl0J3MganVzdCBhIG5pY2UtdG8taGF2ZSwgd2hpY2ggSSdt
IGZpbmUgd2l0aC4NCj4gICAgICBCb3RoIFBhbmFzYXMgYW5kIEVNQyB3aGVuIElPIGlzIHNlbnQg
dGhyb3VnaCBNRFMsIHdpbGwgZmlyc3QNCj4gICAgICByZWNhbGwsIG92ZXJsYXBwaW5nIGxheW91
dHMsIGFuZCBvbmx5IHRoZW4gcHJvY2VlZCB3aXRoDQo+ICAgICAgTURTIHByb2Nlc3NpbmcuIChU
aGlzIGlzIHNvbWUgZGVlcGx5IHJvb3RlZCBtZWNoYW5pc20gaW5zaWRlDQo+ICAgICAgdGhlIEZT
LCBhbiBNRFMgYmVpbmcganVzdCBhbm90aGVyIGNsaWVudCkNCg0KQWdhaW4sIHdlJ3JlIG5vdCB0
YWxraW5nIGFib3V0IGJsb2NrcyBvciBvYmplY3RzLg0KYQ0KPiAgICAgIFNvIHRoaXMgaXMgYSBr
bm93biBwcm9ibGVtIHRoYXQgaXMgdGFrZW4gY2FyZSBvZi4gQnV0IEkgdG90YWxseQ0KPiAgICAg
IGFncmVlIHdpdGggeW91LCB0aGUgY2xpZW50IExBWU9VVF9SRVRVUk4oaW5nKSB0aGUgbGF5b3V0
IHdpbGwgc2F2ZQ0KPiAgICAgIGxvdHMgb2YgcHJvdG9jb2wgdGltZSBieSBhdm9pZGluZyB0aGUg
cmVjYWxscy4NCj4gICAgICBOb3cgeW91IHVuZGVyc3RhbmQgd2h5IGluIE9iamVjdHMgd2UgbWFu
ZGF0ZWQgdGhpcyBMQVlPVVRfUkVUVVJODQo+ICAgICAgb24gZXJyb3JzLiBBbmQgd2hpbGUgYXQg
aXQgd2Ugd2FudCB0aGUgZXhhY3QgZXJyb3IgcmVwb3J0ZWQuDQo+IA0KPiA+PiAgICAgKDIpLiBP
ciBXZSBjYW4gcHJvdmUgdGhhdCBEUzMncyBSUENzIHdpbGwgdGltZW91dCBhdCB3b3JzdA0KPiA+
PiAgICAgICAgICBjYXNlIDEgeCBORlMtdGltZW91dCBhZnRlciBhYm92ZSBEUzIgdGltZW91dCBl
dmVudCwgb3INCj4gPj4gICAgICAgICAgMiB4IE5GUy10aW1lb3V0IGFmdGVyIHRoZSBxdWV1aW5n
IG9mIHRoZSBmaXJzdCB0aW1lZC1vdXQNCj4gPj4gICAgICAgICAgUlBDLiBBbmQgc3RhdGlzdGlj
YWxseSBpbiB0aGUgYXZlcmFnZSBjYXNlIERTMyB3aWxsIHRpbWVvdXQNCj4gPj4gICAgICAgICAg
dmVyeSBuZWFyIHRoZSB0aW1lIERTMiB0aW1lZC1vdXQuDQo+ID4+DQo+ID4+ICAgICAgICAgIFRo
aXMgaXMgZWFzeSBzaW5jZSB0aGUgbGFzdCBJTyB3ZSBxdWV1ZWQgd2FzIHRoZSBvbmUgdGhhdA0K
PiA+PiAgICAgICAgICBtYWRlIERTMidzIHF1ZXVlIHRvIGJlIGZ1bGwsIGFuZCBpdCB3YXMga2Vw
dCBmdWxsIGJlY2F1c2UNCj4gPj4gICAgICAgICAgRFMyIHN0b3BwZWQgcmVzcG9uZGluZyBhbmQg
bm90aGluZyBlbXB0aWVkIHRoZSBxdWV1ZS4NCj4gPj4NCj4gPj4gICAgICBTbyB0aGUgZWFzaWVz
dCB3ZSBjYW4gZG8gaXMgd2FpdCBmb3IgRFMzIHRvIHRpbWVvdXQsIHNvb24NCj4gPj4gICAgICBl
bm91Z2gsIGFuZCBvbmNlIHRoYXQgd2lsbCBoYXBwZW4sIHNlc3Npb24gd2lsbCBiZSByZXNldCBh
bmQgYWxsDQo+ID4+ICAgICAgUlBDcyB3aWxsIGVuZCB3aXRoIGFuIGVycm9yLg0KPiA+IA0KPiA+
IA0KPiA+IFlvdSBhcmUgc3RpbGwgb25seSBkaXNjdXNzaW5nIHRoZSBjbGllbnQgc2lkZS4NCj4g
PiANCj4gPiBSZWFkIG15IGxpcHM6IFN1biBSUEMgT1BFUkFUSU9OUyBETyBOT1QgVElNRU9VVCBB
TkQgQ0FOTk9UIEJFIEFCT1JURUQgT1INCj4gPiBDQU5DRUxFRC4gRmVuY2luZyBpcyB0aGUgY2xv
c2VzdCB3ZSBjYW4gY29tZSB0byBhbiBhYm9ydCBvcGVyYXRpb24uDQo+ID4gDQo+IA0KPiANCj4g
QWdhaW4gSSBkaWQgbm90IG1lYW4gdGhlICJTdW4gUlBDIE9QRVJBVElPTlMiIG9uIHRoZSB3aXJl
LiBJIG1lYW50DQo+IHRoZSBMaW51eC1yZXF1ZXN0LWVudGl0eSB3aGljaCB3aGlsZSBleGlzdCBo
YXMgYSBwb3RlbnRpYWwgdG8gYmUNCj4gc3VibWl0dGVkIGZvciBza2Itc2VuZC4gQXMgc2VlbiBh
Ym92ZSB0aGVzZSBlbnRpdGllcyBkbyB0aW1lb3V0DQo+IGluICItbyBzb2Z0IiBtb2RlIGFuZCBv
bmNlIHJlbGVhc2VkIHJlbW92ZSB0aGUgcG90ZW50aWFsIG9mIGFueSBtb3JlDQo+IGZ1dHVyZSBz
a2Itc2VuZHMgb24gdGhlIHdpcmUuDQo+IA0KPiBCVVQgd2hhdCBJIGRvIG5vdCB1bmRlcnN0YW5k
IGlzOiBJbiBhYm92ZSBleGFtcGxlIHdlIGFyZSB0YWxraW5nDQo+IGFib3V0IERTMy4gV2UgYXNz
dW1lZCB0aGF0IERTMyBoYXMgYSBjb21tdW5pY2F0aW9uIHByb2JsZW0uIFNvIG5vIGFtb3VudA0K
PiBvZiAiZmVuY2luZyIgb3IgdnVkdSBvciBhbnkgb3RoZXIga2luZCBvZiBvcGVyYXRpb24gY2Fu
IGV2ZXIgYWZmZWN0DQo+IHRoZSBjbGllbnQgcmVnYXJkaW5nIERTMy4gQmVjYXVzZSBldmVuIGlm
IE9uLXRoZS1zZXJ2ZXIgcGVuZGluZyByZXF1ZXN0cw0KPiBmcm9tIGNsaWVudCBvbiBEUzMgYXJl
IGZlbmNlZCBhbmQgZGlzY2FyZGVkIHRoZXNlIGVycm9ycyB3aWxsIG5vdA0KPiBiZSBjb21tdW5p
Y2F0ZWQgYmFjayB0aGUgY2xpZW50LiBUaGUgY2xpZW50IHdpbGwgc2l0IGlkbGUgb24gRFMzDQo+
IGNvbW11bmljYXRpb24gdW50aWwgdGhlIGVuZCBvZiB0aGUgdGltZW91dCwgcmVnYXJkbGVzcy4N
Cg0KV2UgZG9uJ3QgY2FyZSBhYm91dCBhbnkgcmVjZWl2aW5nIHRoZSBlcnJvcnMuIFdlJ3ZlIHRp
bWVkIG91dC4gQWxsIHdlDQp3YW50IHRvIGRvIGlzIHRvIGZlbmNlIG9mZiB0aGUgZGFtbmVkIHdy
aXRlcyB0aGF0IGhhdmUgYWxyZWFkeSBiZWVuIHNlbnQNCnRvIHRoZSBib3JrZW4gRFMgYW5kIHRo
ZW4gcmVzZW5kIHRoZW0gdGhyb3VnaCB0aGUgTURTLg0KDQo+IEFjdHVhbGx5IHdoYXQgSSBwcm9w
b3NlIGZvciBEUzMgaW4gdGhlIGJlc3Qgcm9idXN0IGNsaWVudCBpcyB0bw0KPiBkZXN0cm95IGl0
J3MgRFMzJ3Mgc2Vzc2lvbnMgYW5kIHRoZXJlZm9yIGNhdXNlIGFsbCBMaW51eC1yZXF1ZXN0LWVu
dGl0aWVzDQo+IHRvIHJldHVybiBtdWNoIG11Y2ggZmFzdGVyLCB0aGVuIGlmICpqdXN0IHdhaXRp
bmcqIGZvciB0aGUgdGltZW91dCB0byBleHBpcmUuDQoNCkkgcmVwZWF0OiBkZXN0cm95aW5nIHRo
ZSBzZXNzaW9uIG9uIHRoZSBjbGllbnQgZG9lcyBOT1RISU5HIHRvIGhlbHAgeW91Lg0KDQo+ID4+
IFNvIGluIHRoZSB3b3JzdCBjYXNlIHNjZW5hcmlvIHdlIGNhbiByZWNvdmVyIDIgeCBORlMtdGlt
ZW91dCBhZnRlcg0KPiA+PiBhIG5ldHdvcmsgcGFydGl0aW9uLCB3aGljaCBpcyBqdXN0IDEgeCBO
RlMtdGltZW91dCwgYWZ0ZXIgeW91cg0KPiA+PiBzY2hpem9waHJlbmljIEZFTkNFX01FX09GRiwg
bmV3bHkgaW52ZW50ZWQgb3BlcmF0aW9uLg0KPiA+Pg0KPiA+PiBXaGF0IHdlIGNhbiBkbyB0byBl
bmhhbmNlIG91ciBjb2RlIHRvIHJlZHVjZSBlcnJvciByZWNvdmVyeSB0bw0KPiA+PiAxIHggTkZT
LXRpbWVvdXQ6DQo+ID4+DQo+ID4+IC0gRFMzIGFib3ZlOg0KPiA+PiAgIChBcyBJIHNhaWQgRFMx
J3MgcXVldWVzIGFyZSBub3cgZW1wdHksIGJlY2F1c2UgaXQgd2FzIHdvcmtpbmcgZmluZSwNCj4g
Pj4gICAgU28gRFMzIGlzIGEgcmVwcmVzZW50YXRpb24gb2YgYWxsIERTJ3MgdGhhdCBoYXZlIFJQ
Q3MgYXQgdGhlDQo+ID4+ICAgIHRpbWUgRFMyIHRpbWVkLW91dCwgd2hpY2ggYmVsb25nIHRvIHRo
aXMgbGF5b3V0KQ0KPiA+Pg0KPiA+PiAgIFdlIGNhbiBwcm9hY3RpdmVseSBhYm9ydCBhbGwgUlBD
cyBiZWxvbmdpbmcgdG8gRFMzLiBJZiB0aGVyZSBpcw0KPiA+PiAgIGEgd2F5IHRvIGludGVybmFs
bHkgYWJvcnQgUlBDJ3MgdXNlIHRoYXQuIEVsc2UganVzdCByZXNldCBpdCdzDQo+ID4+ICAgc2Vz
c2lvbiBhbmQgYWxsIHNvY2tldHMgd2lsbCBjbG9zZSAoYW5kIHJlb3BlbiksIGFuZCBhbGwgUlBD
J3MNCj4gPj4gICB3aWxsIGVuZCB3aXRoIGEgZGlzY29ubmVjdCBlcnJvci4NCj4gPiANCj4gPiBO
b3Qgb24gbW9zdCBzZXJ2ZXJzIHRoYXQgSSdtIGF3YXJlIG9mLiBJZiB5b3UgY2xvc2Ugb3IgcmVz
ZXQgdGhlIHNvY2tldA0KPiA+IG9uIHRoZSBjbGllbnQsIHRoZW4gdGhlIExpbnV4IHNlcnZlciB3
aWxsIGhhcHBpbHkgY29udGludWUgdG8gcHJvY2Vzcw0KPiA+IHRob3NlIFJQQyBjYWxsczsgaXQg
anVzdCB3b24ndCBiZSBhYmxlIHRvIHNlbmQgYSByZXBseS4NCj4gPiBGdXJ0aGVybW9yZSwgaWYg
dGhlIHByb2JsZW0gaXMgdGhhdCB0aGUgZGF0YSBzZXJ2ZXIgaXNuJ3QgcmVzcG9uZGluZywNCj4g
PiB0aGVuIGEgc29ja2V0IGNsb3NlL3Jlc2V0IHRlbGxzIHlvdSBub3RoaW5nIGVpdGhlci4NCj4g
PiANCj4gDQo+IA0KPiBBZ2FpbiBJJ20gdGFsa2luZyBhYm91dCB0aGUgTkZTLUludGVybmFsLXJl
cXVlc3QtZW50aXRpZXMgdGhlc2Ugd2lsbA0KPiBiZSByZWxlYXNlZCwgdGhvdWdoIGd1YXJhbnR5
aW5nIHRoYXQgbm8gbW9yZSB0aHJlYWRzIHdpbGwgdXNlIGFueSBvZg0KPiB0aGVzZSB0byBzZW5k
IGFueSBtb3JlIGJ5dGVzIG92ZXIgdG8gYW55IERTcy4NCj4gDQo+IEFORCB5ZXMsIHllcy4gT25j
ZSB0aGUgY2xpZW50IGhhcyBkb25lIGl0J3Mgam9iIGFuZCBzdG9wcGVkIGFueSBmdXR1cmUNCj4g
c2tiLXNlbmRzIHRvICphbGwqIERTcyBpbiBxdWVzdGlvbiwgb25seSB0aGVuIGl0IGNhbiByZXBv
cnQgdG8gTURTOg0KPiAgICJIZXkgSSdtIGRvbmUgc2VuZGluZyBpbiBhbGwgb3RoZXIgcm91dHMg
aGVyZSBMQVlPVVRfUkVUVVJOIg0KPiAgIChOb3cgZmVuY2luZyBoYXBwZW5zIG9uIHNlcnZlcnMp
DQo+ICAgDQo+ICAgYW5kIGNsaWVudCBnb2VzIG9uIGFuZCBzYXlzDQo+ICAgDQo+ICAgIkhleSBj
YW4geW91IE1EUyBwbGVhc2UgYWxzbyB3cml0ZSB0aGlzIGRhdGEiDQo+IA0KPiBXaGljaCBpcyBw
ZXJmZWN0IGZvciBNRFMgYmVjYXVzZSBvdGhlcndpc2UgaWYgaXQgd2FudHMgdG8gbWFrZSBzdXJl
LA0KPiBpdCB3aWxsIG5lZWQgdG8gcmVjYWxsIGFsbCBvdXRzdGFuZGluZyBsYXlvdXRzLCBleGFj
dGx5IGZvciB5b3VyDQo+IHJlYXNvbiwgZm9yIGNvbmNlcm4gZm9yIHRoZSBkYXRhIGNvcnJ1cHRp
b24gdGhhdCBjYW4gaGFwcGVuLg0KDQpTbyBpdCByZWNhbGxzIHRoZSBsYXlvdXRzLCBhbmQgdGhl
bi4uLiBpdCBfc3RpbGxfIGhhcyB0byBmZW5jZSBvZmYgYW55DQp3cml0ZXMgdGhhdCBhcmUgaW4g
cHJvZ3Jlc3Mgb24gdGhlIGJyb2tlbiBEUy4NCg0KQWxsIHlvdSd2ZSBkb25lIGlzIGFkZCBhIHJl
Y2FsbCB0byB0aGUgd2hvbGUgcHJvY2Vzcy4gV2h5Pw0KDQo+ID4+IC0gQm90aCBEUzIgdGhhdCB0
aW1lZC1vdXQsIGFuZCBEUzMgdGhhdCB3YXMgYWJvcnRlZC4gU2hvdWxkIGJlDQo+ID4+ICAgbWFy
a2VkIHdpdGggYSBmbGFnLiBXaGVuIG5ldyBJTyB0aGF0IGJlbG9uZyB0byBzb21lIG90aGVyDQo+
ID4+ICAgaW5vZGUgdGhyb3VnaCBzb21lIG90aGVyIGxheW91dCtkZXZpY2VfaWQgZW5jb3VudGVy
cyBhIGZsYWdnZWQNCj4gPj4gICBkZXZpY2UsIGl0IHNob3VsZCBhYm9ydCBhbmQgdHVybiB0byBN
RFMgSU8sIHdpdGggYWxzbyBpbnZhbGlkYXRpbmcNCj4gPj4gICBpdCdzIGxheW91dCwgYW5kIGhl
bnMsIHNvb24gZW5vdWdoIHRoZSBkZXZpY2VfaWQgZm9yIERTMiYzIHdpbGwgYmUNCj4gPj4gICBk
ZS1yZWZlcmVuY2VkIGFuZCBiZSByZW1vdmVkIGZyb20gZGV2aWNlIGNhY2hlLiAoQW5kIGFsbCBy
ZWZlcmVuY2luZw0KPiA+PiAgIGxheW91dHMgYXJlIG5vdyBnb25lKQ0KPiA+IA0KPiA+IFRoZXJl
IGlzIG5vIFJQQyBhYm9ydCBmdW5jdGlvbmFsaXR5IFN1biBSUEMuIEFnYWluLCB0aGlzIGFyZ3Vt
ZW50IHJlbGllcw0KPiA+IG9uIGZ1bmN0aW9uYWxpdHkgdGhhdCBfZG9lc24ndF8gZXhpc3QuDQo+
ID4gDQo+IA0KPiANCj4gQWdhaW4gSSBtZWFuIGludGVybmFsbHkgYXQgdGhlIGNsaWVudC4gRm9y
IGV4YW1wbGUgY2xvc2luZyB0aGUgc29ja2V0IHdpbGwNCj4gaGF2ZSB0aGUgZWZmZWN0IEkgd2Fu
dC4gKEFuZCBzb21lIG90aGVyIHRyaWNrcyB3ZSBjYW4gdGFsayBhYm91dCB0aG9zZQ0KPiBsYXRl
ciwgbGV0cyBhZ3JlZSBhYm91dCB0aGUgcHJpbmNpcGFsIGZpcnN0KQ0KDQpUaW1pbmcgb3V0IHdp
bGwgcHJldmVudCB0aGUgZGFtbmVkIGNsaWVudCBmcm9tIHNlbmRpbmcgbW9yZSBkYXRhLiBTbw0K
d2hhdD8NCg0KPiA+PiAgIFNvIHdlIGRvIG5vdCBjb250aW51ZSBxdWV1aW5nIG5ldyBJTyB0byBk
ZWFkIGRldmljZXMuIEFuZCBzaW5jZSBtb3N0DQo+ID4+ICAgcHJvYmFibHkgTURTIHdpbGwgbm90
IGdpdmUgdXMgZGVhZCBzZXJ2ZXJzIGluIG5ldyBsYXlvdXQsIHdlIHNob3VsZCBiZQ0KPiA+PiAg
IGdvb2QuDQo+ID4+IEluIHN1bW1lcnkuDQo+ID4+IC0gRkVOQ0VfTUVfT0ZGIGlzIGEgbmV3IG9w
ZXJhdGlvbiwgYW5kIGlzIG5vdCA9PT0gTEFZT1VUX1JFVFVSTi4gQ2xpZW50DQo+ID4+ICAgKm11
c3Qgbm90KiBza2Itc2VuZCBhIHNpbmdsZSBieXRlIGJlbG9uZ2luZyB0byBhIGxheW91dCwgYWZ0
ZXIgdGhlIHNlbmQNCj4gPj4gICBvZiBMQVlPVVRfUkVUVVJOLg0KPiA+PiAgIChJdCBuZWVkIG5v
dCB3YWl0IGZvciBPUFRfRE9ORSBmcm9tIERTIHRvIGRvIHRoYXQsIGl0IGp1c3QgbXVzdCBtYWtl
DQo+ID4+ICAgIHN1cmUsIHRoYXQgYWxsIGl0J3MgaW50ZXJuYWwsIG9yIG9uLXRoZS13aXJlIHJl
cXVlc3QsIGFyZSBhYm9ydGVkDQo+ID4+ICAgIGJ5IGVhc2lseSBjbG9zaW5nIHRoZSBzb2NrZXRz
IHRoZXkgYmVsb25nIHRvbywgYW5kL29yIHdhaXRpbmcgZm9yDQo+ID4+ICAgIGhlYWx0aHkgRFMn
cyBJTyB0byBiZSBPUFRfRE9ORSAuIFNvIHRoZSBjbGllbnQgaXMgbm90IGRlcGVuZGVudCBvbg0K
PiA+PiAgICBhbnkgRFMgcmVzcG9uc2UsIGl0IGlzIG9ubHkgZGVwZW5kZW50IG9uIGl0J3MgaW50
ZXJuYWwgc3RhdGUgYmVpbmcNCj4gPj4gICAgKmNsZWFuKiBmcm9tIGFueSBtb3JlIHNrYi1zZW5k
KHMpKQ0KPiA+IA0KPiA+IERpdHRvDQo+ID4gDQo+ID4+IC0gVGhlIHByb3BlciBpbXBsZW1lbnRh
dGlvbiBvZiBMQVlPVVRfUkVUVVJOIG9uIGVycm9yIGZvciBmYXN0IHR1cm5vdmVyDQo+ID4+ICAg
aXMgbm90IGhhcmQsIGFuZCBkb2VzIG5vdCBpbnZvbHZlIGEgbmV3IGludmVudGVkIE5GUyBvcGVy
YXRpb24gc3VjaA0KPiA+PiAgIGFzIEZFTkNFX01FX09GRi4gUHJvcGVyIGNvZGRlZCBjbGllbnQs
IGluZGVwZW5kZW50bHksIHdpdGhvdXQNCj4gPj4gICB0aGUgYWlkIG9mIGFueSBGRU5DRV9NRV9P
RkYgb3BlcmF0aW9uLCBjYW4gYWNoaWV2ZSBhIGZhc3RlciB0dXJuYXJvdW5kDQo+ID4+ICAgYnkg
YWN0aXZlbHkgcmV0dXJuaW5nIGFsbCBsYXlvdXRzIHRoYXQgYmVsb25nIHRvIGEgYmFkIERTLCBh
bmQgbm90DQo+ID4+ICAgd2FpdGluZyBmb3IgYSBmZW5jZS1vZmYgb2YgYSBzaW5nbGUgbGF5b3V0
LCB0aGVuIGVuY291bnRlcmluZyBqdXN0DQo+ID4+ICAgdGhlIHNhbWUgZXJyb3Igd2l0aCBhbGwg
b3RoZXIgbGF5b3V0cyB0aGF0IGhhdmUgdGhlIHNhbWUgRFMgICAgIA0KPiA+IA0KPiA+IFdoYXQg
ZG8geW91IG1lYW4gYnkgImFsbCBsYXlvdXRzIHRoYXQgYmVsb25nIHRvIGEgYmFkIERTIj8gTGF5
b3V0cyBkb24ndA0KPiA+IGJlbG9uZyB0byBhIERTLCBhbmQgc28gdGhlcmUgaXMgbm8gd2F5IHRv
IGdldCBmcm9tIGEgRFMgdG8gYSBsYXlvdXQuDQo+ID4gDQo+IA0KPiANCj4gV2h5LCBzdXJlLiBs
b29wIG9uIGFsbCBsYXlvdXRzIGFuZCBhc2sgaWYgaXQgaGFzIGEgc3BlY2lmaWMgRFMuDQoNCg0K
DQo+ID4+IC0gQW5kIEkga25vdyB0aGF0IGp1c3QgYXMgeW91IGRpZCBub3QgcmVhZCBteSBlbWFp
bHMgZnJvbSBiZWZvcmUNCj4gPj4gICBtZSBnb2luZyB0byBIb3NwaXRhbCwgeW91IHdpbGwgY29u
dGludWUgdG8gbm90IHVuZGVyc3RhbmQgdGhpcw0KPiA+PiAgIG9uZSwgb3Igd2hhdCBJJ20gdHJ5
aW5nIHRvIGV4cGxhaW4sIGFuZCB3aWxsIG1vc3QgcHJvYmFibHkgaWdub3JlDQo+ID4+ICAgYWxs
IG9mIGl0LiBCdXQgcGxlYXNlIG5vdGUgb25lIHRoaW5nOg0KPiA+IA0KPiA+IEkgcmVhZCB0aGVt
LCBidXQganVzdCBhcyBub3csIHRoZXkgY29udGludWUgdG8gaWdub3JlIHRoZSByZWFsaXR5IGFi
b3V0DQo+ID4gdGltZW91dHM6IHRpbWVvdXRzIG1lYW4gX25vdGhpbmdfIGluIGFuIFJQQyBmYWls
b3ZlciBzaXR1YXRpb24uIFRoZXJlIGlzDQo+ID4gbm8gUlBDIGFib3J0IGZ1bmN0aW9uYWxpdHkg
dGhhdCB5b3UgY2FuIHJlbHkgb24gb3RoZXIgdGhhbiBmZW5jaW5nLg0KPiA+IA0KPiANCj4gDQo+
IEkgaG9wZSBJIGV4cGxhaW5lZCB0aGlzIGJ5IG5vdy4gSWYgbm90IHBsZWFzZSwgcGxlYXNlLCBs
ZXRzIG9yZ2FuaXplDQo+IGEgcGhvbmUgY2FsbC4gV2UgY2FuIHVzZSBQYW5hc2FzIGNvbmZlcmVu
Y2UgbnVtYmVyIHdoZW5ldmVyIHlvdSBhcmUNCj4gYXZhaWxhYmxlLiBJIHRoaW5rIHdlIGNvbW11
bmljYXRlIGJldHRlciBpbiBwZXJzb24uDQo+IA0KPiBFdmVyeW9uZSBlbHNlIGlzIGFsc28gaW52
aXRlZC4NCj4gDQo+IEJVVCB0aGVyZSBpcyBvbmUgbW9zdCBpbXBvcnRhbnQgcG9pbnQgZm9yIG1l
Og0KPiANCj4gICAgQXMgc3RhdGVkIGJ5IHRoZSBSRkMuIENsaWVudCBtdXN0IGd1YXJhbnR5IHRo
YXQgbm8gbW9yZSBieXRlcyB3aWxsIGJlDQo+ICAgIHNlbnQgdG8gYW55IERTcyBpbiBhIGxheW91
dCwgb25jZSBMQVlPVVRfUkVUVVJOIGlzIHNlbnQuIFRoaXMgaXMgdGhlDQo+ICAgIG9ubHkgZGVm
aW5pdGlvbiBvZiBMQVlPVVRfUkVUVVJOLCBhbmQgTk9fTUFUQ0hJTkdfTEFZT1VUIGFzIHJlc3Bv
bnNlDQo+ICAgIHRvIGEgTEFZT1VUX1JFQ0FMTC4gV2hpY2ggaXM6DQo+ICAgIENsaWVudCBoYXMg
aW5kaWNhdGVkIG5vIG1vcmUgZnV0dXJlIHNlbmRzIG9uIGEgbGF5b3V0LiAoQW5kIHNlcnZlciB3
aWxsDQo+ICAgIGVuZm9yY2UgaXQgd2l0aCBhIGZlbmNpbmcpDQoNClRoZSBjbGllbnQgY2FuJ3Qg
Z3VhcmFudGVlIHRoYXQuIFRoZSBwcm90b2NvbCBvZmZlcnMgbm8gd2F5IGZvciBpdCB0byBkbw0K
c28sIG5vIG1hdHRlciB3aGF0IHRoZSBwTkZTIHRleHQgbWF5IGNob29zZSB0byBzYXkuDQoNCj4g
Pj4gICAgIFlPVSBoYXZlIHNhYm90YWdlZCB0aGUgTkZTIDQuMSBMaW51eCBjbGllbnQsIHdoaWNo
IGlzIG5vdyB0b3RhbGx5DQo+ID4+ICAgICBub3QgU1REIGNvbXBsYWludCwgYW5kIGhhdmUgaW50
cm9kdWNlZCBDUkFTSHMuIEFuZCBmb3Igbm8gZ29vZA0KPiA+PiAgICAgcmVhc29uLg0KPiA+IA0K
PiA+IFNlZSBhYm92ZS4NCj4gPiANCj4gDQo+IA0KPiBPSyBXZSdsbCBoYXZlIHRvIHNlZSBhYm91
dCB0aGVzZSBjcmFzaGVzLCBsZXRzIHRhbGsgYWJvdXQgdGhlbS4NCj4gDQo+IFRoYW5rcw0KPiBC
b2F6DQoNCi0tIA0KVHJvbmQgTXlrbGVidXN0DQpMaW51eCBORlMgY2xpZW50IG1haW50YWluZXIN
Cg0KTmV0QXBwDQpUcm9uZC5NeWtsZWJ1c3RAbmV0YXBwLmNvbQ0Kd3d3Lm5ldGFwcC5jb20NCg0K
Boaz Harrosh Aug. 14, 2012, 12:28 a.m. UTC | #14
On 08/14/2012 03:16 AM, Myklebust, Trond wrote:

> 
> The client can't guarantee that. The protocol offers no way for it to do
> so, no matter what the pNFS text may choose to say.
> 


What? Why not? All the client needs to do is stop sending bytes.
What is not guaranteed? I completely do not understand what you
are saying. How does stopping any sends not guarantee that?

Please explain

Thanks
Boaz
 
Trond Myklebust Aug. 14, 2012, 12:49 a.m. UTC | #15
On Tue, 2012-08-14 at 03:28 +0300, Boaz Harrosh wrote:
> On 08/14/2012 03:16 AM, Myklebust, Trond wrote:
> 
> > 
> > The client can't guarantee that. The protocol offers no way for it to do
> > so, no matter what the pNFS text may choose to say.
> > 
> 
> 
> What? Why not? All the client needs to do is stop sending bytes.
> What is not guaranteed? I completely do not understand what you
> are saying. How does stopping any sends not guarantee that?


If the client has lost control of the transport, then it has no control
over what the data server sees and when. The data server can process the
client's write RPC 5 minutes from now, and the client will never know.

THAT is what is absurd about the whole "the client MUST NOT send..."
shebang.
THAT is why there is no guarantee.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
Benny Halevy Aug. 14, 2012, 7:48 a.m. UTC | #16
On 2012-08-09 18:39, Myklebust, Trond wrote:
> On Thu, 2012-08-09 at 23:01 +0800, Peng Tao wrote:
>> On Thu, Aug 9, 2012 at 10:36 PM, Myklebust, Trond
>> <Trond.Myklebust@netapp.com> wrote:
>>> On Thu, 2012-08-09 at 22:30 +0800, Peng Tao wrote:
>>>> On Thu, Aug 9, 2012 at 4:21 AM, Trond Myklebust
>>>> <Trond.Myklebust@netapp.com> wrote:
>>>>> Ever since commit 0a57cdac3f (NFSv4.1 send layoutreturn to fence
>>>>> disconnected data server) we've been sending layoutreturn calls
>>>>> while there is potentially still outstanding I/O to the data
>>>>> servers. The reason we do this is to avoid races between replayed
>>>>> writes to the MDS and the original writes to the DS.
>>>>>
>>>>> When this happens, the BUG_ON() in nfs4_layoutreturn_done can
>>>>> be triggered because it assumes that we would never call
>>>>> layoutreturn without knowing that all I/O to the DS is
>>>>> finished. The fix is to remove the BUG_ON() now that the
>>>>> assumptions behind the test are obsolete.
>>>>>
>>>> Isn't MDS supposed to recall the layout if races are possible between
>>>> outstanding write-to-DS and write-through-MDS?
>>>
>>> Where do you read that in RFC5661?
>>>
>> That's my (maybe mis-)understanding of how server works... But looking
>> at rfc5661 section 18.44.3. layoutreturn implementation.
>> "
>> After this call,
>>    the client MUST NOT use the returned layout(s) and the associated
>>    storage protocol to access the file data.
>> "
>> And given commit 0a57cdac3f, client is using the layout even after
>> layoutreturn, which IMHO is a violation of rfc5661.
> 
> No. It is using the layoutreturn to tell the MDS to fence off I/O to a
> data server that is not responding. It isn't attempting to use the
> layout after the layoutreturn: the whole point is that we are attempting
> write-through-MDS after the attempt to write through the DS timed out.
> 

I hear you, but this use case is valid after a time out / disconnect
(which will translate to PNFS_OSD_ERR_UNREACHABLE for the objects layout)
In other cases, I/Os to the DS might obviously be in flight and the BUG_ON
indicates that.

IMO, the right way to implement that is to initially mark the lsegs invalid
and increment plh_block_lgets, as we do today in _pnfs_return_layout
but actually send the layout return only when the last segment is dereferenced.
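
A minimal sketch of that idea, for illustration only: the pls_*/plh_* fields
are the real ones in the fs/nfs pnfs headers, while the
DEMO_LAYOUT_RETURN_NEEDED bit and the demo_ helpers are hypothetical, and
freeing of the segment itself is omitted. The layout is flagged when we
decide to bail out, and the LAYOUTRETURN is only issued by whoever drops the
last segment reference:

#define DEMO_LAYOUT_RETURN_NEEDED 7	/* hypothetical plh_flags bit */

static void demo_send_layoutreturn(struct inode *inode)
{
	/* placeholder: issue the actual LAYOUTRETURN RPC */
}

static void demo_put_lseg(struct pnfs_layout_segment *lseg)
{
	struct pnfs_layout_hdr *lo = lseg->pls_layout;
	struct inode *inode = lo->plh_inode;
	bool send_return = false;

	spin_lock(&inode->i_lock);
	if (!atomic_dec_and_test(&lseg->pls_refcount)) {
		spin_unlock(&inode->i_lock);
		return;
	}
	list_del_init(&lseg->pls_list);
	/* Last reference to the last segment: no further DS I/O can be
	 * issued through this layout, so the return is now safe. */
	if (list_empty(&lo->plh_segs) &&
	    test_and_clear_bit(DEMO_LAYOUT_RETURN_NEEDED, &lo->plh_flags))
		send_return = true;
	spin_unlock(&inode->i_lock);

	if (send_return)
		demo_send_layoutreturn(inode);
}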

Benny
Trond Myklebust Aug. 14, 2012, 1:45 p.m. UTC | #17
On Tue, 2012-08-14 at 10:48 +0300, Benny Halevy wrote:
> On 2012-08-09 18:39, Myklebust, Trond wrote:
> > On Thu, 2012-08-09 at 23:01 +0800, Peng Tao wrote:
> >> On Thu, Aug 9, 2012 at 10:36 PM, Myklebust, Trond
> >> <Trond.Myklebust@netapp.com> wrote:
> >>> On Thu, 2012-08-09 at 22:30 +0800, Peng Tao wrote:
> >>>> On Thu, Aug 9, 2012 at 4:21 AM, Trond Myklebust
> >>>> <Trond.Myklebust@netapp.com> wrote:
> >>>>> Ever since commit 0a57cdac3f (NFSv4.1 send layoutreturn to fence
> >>>>> disconnected data server) we've been sending layoutreturn calls
> >>>>> while there is potentially still outstanding I/O to the data
> >>>>> servers. The reason we do this is to avoid races between replayed
> >>>>> writes to the MDS and the original writes to the DS.
> >>>>>
> >>>>> When this happens, the BUG_ON() in nfs4_layoutreturn_done can
> >>>>> be triggered because it assumes that we would never call
> >>>>> layoutreturn without knowing that all I/O to the DS is
> >>>>> finished. The fix is to remove the BUG_ON() now that the
> >>>>> assumptions behind the test are obsolete.
> >>>>>
> >>>> Isn't MDS supposed to recall the layout if races are possible between
> >>>> outstanding write-to-DS and write-through-MDS?
> >>>
> >>> Where do you read that in RFC5661?
> >>>
> >> That's my (maybe mis-)understanding of how server works... But looking
> >> at rfc5661 section 18.44.3. layoutreturn implementation.
> >> "
> >> After this call,
> >>    the client MUST NOT use the returned layout(s) and the associated
> >>    storage protocol to access the file data.
> >> "
> >> And given commit 0a57cdac3f, client is using the layout even after
> >> layoutreturn, which IMHO is a violation of rfc5661.
> > 
> > No. It is using the layoutreturn to tell the MDS to fence off I/O to a
> > data server that is not responding. It isn't attempting to use the
> > layout after the layoutreturn: the whole point is that we are attempting
> > write-through-MDS after the attempt to write through the DS timed out.
> > 
> 
> I hear you, but this use case is valid after a time out / disconnect
> (which will translate to PNFS_OSD_ERR_UNREACHABLE for the objects layout)
> In other cases, I/Os to the DS might obviously be in flight and the BUG_ON
> indicates that.
> 
> IMO, the right way to implement that is to initially mark the lsegs invalid
> and increment plh_block_lgets, as we do today in _pnfs_return_layout
> but actually send the layout return only when the last segment is dereferenced.


This is what we do for object and block layout types, so your
objects-specific objection is unfounded.

As I understand it, iSCSI has different semantics w.r.t. disconnect and
timeout, which means that the client can in principle rely on a timeout
leaving the DS in a known state. Ditto for FCP.
I've no idea about other block/object transport types, but I assume
those that support multi-pathing implement similar devices.

The problem is that RPC does not, so the files layout needs to be
treated differently.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
Peng Tao Aug. 14, 2012, 2:30 p.m. UTC | #18
On Tue, Aug 14, 2012 at 9:45 PM, Myklebust, Trond
<Trond.Myklebust@netapp.com> wrote:
> On Tue, 2012-08-14 at 10:48 +0300, Benny Halevy wrote:
>> On 2012-08-09 18:39, Myklebust, Trond wrote:
>> > On Thu, 2012-08-09 at 23:01 +0800, Peng Tao wrote:
>> >> On Thu, Aug 9, 2012 at 10:36 PM, Myklebust, Trond
>> >> <Trond.Myklebust@netapp.com> wrote:
>> >>> On Thu, 2012-08-09 at 22:30 +0800, Peng Tao wrote:
>> >>>> On Thu, Aug 9, 2012 at 4:21 AM, Trond Myklebust
>> >>>> <Trond.Myklebust@netapp.com> wrote:
>> >>>>> Ever since commit 0a57cdac3f (NFSv4.1 send layoutreturn to fence
>> >>>>> disconnected data server) we've been sending layoutreturn calls
>> >>>>> while there is potentially still outstanding I/O to the data
>> >>>>> servers. The reason we do this is to avoid races between replayed
>> >>>>> writes to the MDS and the original writes to the DS.
>> >>>>>
>> >>>>> When this happens, the BUG_ON() in nfs4_layoutreturn_done can
>> >>>>> be triggered because it assumes that we would never call
>> >>>>> layoutreturn without knowing that all I/O to the DS is
>> >>>>> finished. The fix is to remove the BUG_ON() now that the
>> >>>>> assumptions behind the test are obsolete.
>> >>>>>
>> >>>> Isn't MDS supposed to recall the layout if races are possible between
>> >>>> outstanding write-to-DS and write-through-MDS?
>> >>>
>> >>> Where do you read that in RFC5661?
>> >>>
>> >> That's my (maybe mis-)understanding of how server works... But looking
>> >> at rfc5661 section 18.44.3. layoutreturn implementation.
>> >> "
>> >> After this call,
>> >>    the client MUST NOT use the returned layout(s) and the associated
>> >>    storage protocol to access the file data.
>> >> "
>> >> And given commit 0a57cdac3f, client is using the layout even after
>> >> layoutreturn, which IMHO is a violation of rfc5661.
>> >
>> > No. It is using the layoutreturn to tell the MDS to fence off I/O to a
>> > data server that is not responding. It isn't attempting to use the
>> > layout after the layoutreturn: the whole point is that we are attempting
>> > write-through-MDS after the attempt to write through the DS timed out.
>> >
>>
>> I hear you, but this use case is valid after a time out / disconnect
>> (which will translate to PNFS_OSD_ERR_UNREACHABLE for the objects layout)
>> In other cases, I/Os to the DS might obviously be in flight and the BUG_ON
>> indicates that.
>>
>> IMO, the right way to implement that is to initially mark the lsegs invalid
>> and increment plh_block_lgets, as we do today in _pnfs_return_layout
>> but actually send the layout return only when the last segment is dereferenced.
>
> This is what we do for object and block layout types, so your
> objects-specific objection is unfounded.
>
object layout is also doing layout return on IO error (commit
fe0fe83585f8). And it doesn't take care of draining concurrent
in-flight IO. I guess that's why Boaz saw the same BUG_ON.
Trond Myklebust Aug. 14, 2012, 2:53 p.m. UTC | #19
On Tue, 2012-08-14 at 22:30 +0800, Peng Tao wrote:
> On Tue, Aug 14, 2012 at 9:45 PM, Myklebust, Trond
> <Trond.Myklebust@netapp.com> wrote:
> > On Tue, 2012-08-14 at 10:48 +0300, Benny Halevy wrote:
> >> On 2012-08-09 18:39, Myklebust, Trond wrote:
> >> > On Thu, 2012-08-09 at 23:01 +0800, Peng Tao wrote:
> >> >> On Thu, Aug 9, 2012 at 10:36 PM, Myklebust, Trond
> >> >> <Trond.Myklebust@netapp.com> wrote:
> >> >>> On Thu, 2012-08-09 at 22:30 +0800, Peng Tao wrote:
> >> >>>> On Thu, Aug 9, 2012 at 4:21 AM, Trond Myklebust
> >> >>>> <Trond.Myklebust@netapp.com> wrote:
> >> >>>>> Ever since commit 0a57cdac3f (NFSv4.1 send layoutreturn to fence
> >> >>>>> disconnected data server) we've been sending layoutreturn calls
> >> >>>>> while there is potentially still outstanding I/O to the data
> >> >>>>> servers. The reason we do this is to avoid races between replayed
> >> >>>>> writes to the MDS and the original writes to the DS.
> >> >>>>>
> >> >>>>> When this happens, the BUG_ON() in nfs4_layoutreturn_done can
> >> >>>>> be triggered because it assumes that we would never call
> >> >>>>> layoutreturn without knowing that all I/O to the DS is
> >> >>>>> finished. The fix is to remove the BUG_ON() now that the
> >> >>>>> assumptions behind the test are obsolete.
> >> >>>>>
> >> >>>> Isn't MDS supposed to recall the layout if races are possible between
> >> >>>> outstanding write-to-DS and write-through-MDS?
> >> >>>
> >> >>> Where do you read that in RFC5661?
> >> >>>
> >> >> That's my (maybe mis-)understanding of how server works... But looking
> >> >> at rfc5661 section 18.44.3. layoutreturn implementation.
> >> >> "
> >> >> After this call,
> >> >>    the client MUST NOT use the returned layout(s) and the associated
> >> >>    storage protocol to access the file data.
> >> >> "
> >> >> And given commit 0a57cdac3f, client is using the layout even after
> >> >> layoutreturn, which IMHO is a violation of rfc5661.
> >> >
> >> > No. It is using the layoutreturn to tell the MDS to fence off I/O to a
> >> > data server that is not responding. It isn't attempting to use the
> >> > layout after the layoutreturn: the whole point is that we are attempting
> >> > write-through-MDS after the attempt to write through the DS timed out.
> >> >
> >>
> >> I hear you, but this use case is valid after a time out / disconnect
> >> (which will translate to PNFS_OSD_ERR_UNREACHABLE for the objects layout)
> >> In other cases, I/Os to the DS might obviously be in flight and the BUG_ON
> >> indicates that.
> >>
> >> IMO, the right way to implement that is to initially mark the lsegs invalid
> >> and increment plh_block_lgets, as we do today in _pnfs_return_layout
> >> but actually send the layout return only when the last segment is dereferenced.
> >
> > This is what we do for object and block layout types, so your
> > objects-specific objection is unfounded.
> >
> object layout is also doing layout return on IO error (commit
> fe0fe83585f8). And it doesn't take care of draining concurrent
> in-flight IO. I guess that's why Boaz saw the same BUG_ON.

Yes. I did notice that code when I was looking into this. However that's
Boaz's own patch, and it _only_ applies to the objects layout type. I
assumed that he had tested it when I applied it...

One way to fix that would be to keep a count of "outstanding
read/writes" in the layout, so that when the error occurs, and we want
to fall back to MDS, we just increment plh_block_lgets, invalidate the
layout, and then let the outstanding read/writes fall to zero before
sending the layoutreturn.
If the objects layout wants to do that, then I have no objection. As
I've said multiple times, though, I'm not convinced we want to do that
for the files layout.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
Benny Halevy Aug. 15, 2012, 11:50 a.m. UTC | #20
On 2012-08-14 17:53, Myklebust, Trond wrote:
> On Tue, 2012-08-14 at 22:30 +0800, Peng Tao wrote:
>> On Tue, Aug 14, 2012 at 9:45 PM, Myklebust, Trond
>> <Trond.Myklebust@netapp.com> wrote:
>>> On Tue, 2012-08-14 at 10:48 +0300, Benny Halevy wrote:
>>>> On 2012-08-09 18:39, Myklebust, Trond wrote:
>>>>> On Thu, 2012-08-09 at 23:01 +0800, Peng Tao wrote:
>>>>>> On Thu, Aug 9, 2012 at 10:36 PM, Myklebust, Trond
>>>>>> <Trond.Myklebust@netapp.com> wrote:
>>>>>>> On Thu, 2012-08-09 at 22:30 +0800, Peng Tao wrote:
>>>>>>>> On Thu, Aug 9, 2012 at 4:21 AM, Trond Myklebust
>>>>>>>> <Trond.Myklebust@netapp.com> wrote:
>>>>>>>>> Ever since commit 0a57cdac3f (NFSv4.1 send layoutreturn to fence
>>>>>>>>> disconnected data server) we've been sending layoutreturn calls
>>>>>>>>> while there is potentially still outstanding I/O to the data
>>>>>>>>> servers. The reason we do this is to avoid races between replayed
>>>>>>>>> writes to the MDS and the original writes to the DS.
>>>>>>>>>
>>>>>>>>> When this happens, the BUG_ON() in nfs4_layoutreturn_done can
>>>>>>>>> be triggered because it assumes that we would never call
>>>>>>>>> layoutreturn without knowing that all I/O to the DS is
>>>>>>>>> finished. The fix is to remove the BUG_ON() now that the
>>>>>>>>> assumptions behind the test are obsolete.
>>>>>>>>>
>>>>>>>> Isn't MDS supposed to recall the layout if races are possible between
>>>>>>>> outstanding write-to-DS and write-through-MDS?
>>>>>>>
>>>>>>> Where do you read that in RFC5661?
>>>>>>>
>>>>>> That's my (maybe mis-)understanding of how server works... But looking
>>>>>> at rfc5661 section 18.44.3. layoutreturn implementation.
>>>>>> "
>>>>>> After this call,
>>>>>>    the client MUST NOT use the returned layout(s) and the associated
>>>>>>    storage protocol to access the file data.
>>>>>> "
>>>>>> And given commit 0a57cdac3f, client is using the layout even after
>>>>>> layoutreturn, which IMHO is a violation of rfc5661.
>>>>>
>>>>> No. It is using the layoutreturn to tell the MDS to fence off I/O to a
>>>>> data server that is not responding. It isn't attempting to use the
>>>>> layout after the layoutreturn: the whole point is that we are attempting
>>>>> write-through-MDS after the attempt to write through the DS timed out.
>>>>>
>>>>
>>>> I hear you, but this use case is valid after a time out / disconnect
>>>> (which will translate to PNFS_OSD_ERR_UNREACHABLE for the objects layout)
>>>> In other cases, I/Os to the DS might obviously be in flight and the BUG_ON
>>>> indicates that.
>>>>
>>>> IMO, the right way to implement that is to initially mark the lsegs invalid
>>>> and increment plh_block_lgets, as we do today in _pnfs_return_layout
>>>> but actually send the layout return only when the last segment is dereferenced.
>>>
>>> This is what we do for object and block layout types, so your
>>> objects-specific objection is unfounded.
>>>
>> object layout is also doing layout return on IO error (commit
>> fe0fe83585f8). And it doesn't take care of draining concurrent
>> in-flight IO. I guess that's why Boaz saw the same BUG_ON.
> 
> Yes. I did notice that code when I was looking into this. However that's
> Boaz's own patch, and it _only_ applies to the objects layout type. I
> assumed that he had tested it when I applied it...
> 
> One way to fix that would be to keep a count of "outstanding
> read/writes" in the layout, so that when the error occurs, and we want
> to fall back to MDS, we just increment plh_block_lgets, invalidate the
> layout, and then let the outstanding read/writes fall to zero before
> sending the layoutreturn.

Sounds reasonable to me too.
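
For reference, a rough sketch of the counter scheme described in the quoted
paragraph. plh_block_lgets is the real field mentioned in this thread; the
plh_io_outstanding counter would be a new atomic_t added to the layout
header, and the DEMO_LAYOUT_RETURN_ON_IDLE bit and the demo_ helpers are
hypothetical placeholders:

#define DEMO_LAYOUT_RETURN_ON_IDLE 8	/* hypothetical plh_flags bit */

static void demo_send_layoutreturn(struct inode *inode)
{
	/* placeholder: issue the actual LAYOUTRETURN RPC */
}

static void demo_invalidate_lsegs(struct pnfs_layout_hdr *lo)
{
	/* placeholder: mark the existing layout segments invalid */
}

/* Taken around every read/write issued to a DS through this layout. */
static void demo_begin_ds_io(struct pnfs_layout_hdr *lo)
{
	atomic_inc(&lo->plh_io_outstanding);	/* hypothetical counter */
}

static void demo_end_ds_io(struct pnfs_layout_hdr *lo)
{
	if (atomic_dec_and_test(&lo->plh_io_outstanding) &&
	    test_and_clear_bit(DEMO_LAYOUT_RETURN_ON_IDLE, &lo->plh_flags))
		demo_send_layoutreturn(lo->plh_inode);
}

/* On a DS error: block new layoutgets, invalidate the segments, and defer
 * the LAYOUTRETURN until the last outstanding read/write has completed. */
static void demo_fall_back_to_mds(struct pnfs_layout_hdr *lo)
{
	struct inode *inode = lo->plh_inode;

	/* Temporary reference so the "last I/O done" transition cannot
	 * fire while we are still setting up. */
	atomic_inc(&lo->plh_io_outstanding);

	spin_lock(&inode->i_lock);
	lo->plh_block_lgets++;
	demo_invalidate_lsegs(lo);
	set_bit(DEMO_LAYOUT_RETURN_ON_IDLE, &lo->plh_flags);
	spin_unlock(&inode->i_lock);

	/* Drop the temporary reference; if nothing is in flight this
	 * sends the LAYOUTRETURN right away. */
	demo_end_ds_io(lo);
}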

> If the objects layout wants to do that, then I have no objection. As
> I've said multiple times, though, I'm not convinced we want to do that
> for the files layout.
> 

I just fear that removing the BUG_ON will prevent us from detecting cases
where a LAYOUTRETURN is sent while there are layout segments in use
in the error free or non-timeout case.
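
One way to keep that detection without the crash would be to downgrade the
assertion to a one-shot warning. This is a sketch of an alternative body for
the hunk in the patch at the bottom of this page, not what was actually
posted; all identifiers are taken from that patch except WARN_ON_ONCE():

	spin_lock(&lo->plh_inode->i_lock);
	if (task->tk_status == 0) {
		if (lrp->res.lrs_present)
			pnfs_set_layout_stateid(lo, &lrp->res.stateid, true);
		else
			/* keep the detection, but without taking the
			 * machine down when it fires */
			WARN_ON_ONCE(!list_empty(&lo->plh_segs));
	}
	lo->plh_block_lgets--;
	spin_unlock(&lo->plh_inode->i_lock);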

Benny
diff mbox

Patch

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index f94f6b3..c77d296 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -6359,12 +6359,8 @@  static void nfs4_layoutreturn_done(struct rpc_task *task, void *calldata)
 		return;
 	}
 	spin_lock(&lo->plh_inode->i_lock);
-	if (task->tk_status == 0) {
-		if (lrp->res.lrs_present) {
-			pnfs_set_layout_stateid(lo, &lrp->res.stateid, true);
-		} else
-			BUG_ON(!list_empty(&lo->plh_segs));
-	}
+	if (task->tk_status == 0 && lrp->res.lrs_present)
+		pnfs_set_layout_stateid(lo, &lrp->res.stateid, true);
 	lo->plh_block_lgets--;
 	spin_unlock(&lo->plh_inode->i_lock);
 	dprintk("<-- %s\n", __func__);