
CIFS: Decrease reconnection delay when switching nics

Message ID 1361831310-24260-1-git-send-email-chiluk@canonical.com (mailing list archive)
State New, archived

Commit Message

Dave Chiluk Feb. 25, 2013, 10:28 p.m. UTC
When messages are currently in queue awaiting a response, decrease the amount of
time before attempting cifs_reconnect to SMB_MAX_RTT = 10 seconds. The wait time
before attempting to reconnect is currently 2*SMB_ECHO_INTERVAL (120 seconds)
since the last response was received.  This does not take into account the fact
that messages waiting for a response should be serviced within a reasonable
round trip time.

This fixes the issue where a user moves from wired to wireless or vice versa,
causing the mount to hang for 120 seconds when it could reconnect considerably
faster.  After this fix it will take SMB_MAX_RTT (10 seconds) from the last
time the user attempted to access the volume, or SMB_MAX_RTT after the last
echo.  The worst case of the latter scenario is
2*SMB_ECHO_INTERVAL + SMB_MAX_RTT + a small scheduling delay (about 130 seconds).
Statistically speaking it would normally reconnect sooner.  However, in the best
case, where the user changes nics and immediately tries to access the cifs
share, it will take SMB_MAX_RTT = 10 seconds.

BugLink: http://bugs.launchpad.net/bugs/1017622

Signed-off-by: Dave Chiluk <chiluk@canonical.com>
---
 fs/cifs/cifsglob.h |   15 +++++++------
 fs/cifs/connect.c  |   61 +++++++++++++++++++++++++++++++++++++++++++---------
 2 files changed, 59 insertions(+), 17 deletions(-)
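
For reference, a rough sketch of the two constants the description refers to.
The actual cifsglob.h hunk is not reproduced in this archive, so the exact
definitions below are an assumption based on the text (SMB_ECHO_INTERVAL is
the existing 60-second echo interval, SMB_MAX_RTT the proposed 10-second
bound):

    /* fs/cifs/cifsglob.h (sketch only, not the actual hunk) */
    #define SMB_ECHO_INTERVAL (60 * HZ)   /* existing periodic echo interval */
    #define SMB_MAX_RTT       (10 * HZ)   /* proposed max wait for a pending reply */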

Comments

Stefan Metzmacher Feb. 27, 2013, 11:06 a.m. UTC | #1
Hi Dave,

> When messages are currently in queue awaiting a response, decrease amount of
> time before attempting cifs_reconnect to SMB_MAX_RTT = 10 seconds. The current
> wait time before attempting to reconnect is currently 2*SMB_ECHO_INTERVAL(120
> seconds) since the last response was recieved.  This does not take into account
> the fact that messages waiting for a response should be serviced within a
> reasonable round trip time.

Wouldn't that mean that the client will disconnect a good connection
if the server doesn't respond within 10 seconds?
Reads and Writes can take longer than 10 seconds...

> This fixes the issue where user moves from wired to wireless or vice versa
> causing the mount to hang for 120 seconds, when it could reconnect considerably
> faster.  After this fix it will take SMB_MAX_RTT (10 seconds) from the last
> time the user attempted to access the volume or SMB_MAX_RTT after the last
> echo.  The worst case of the latter scenario being
> 2*SMB_ECHO_INTERVAL+SMB_MAX_RTT+small scheduling delay (about 130 seconds).
> Statistically speaking it would normally reconnect sooner.  However in the best
> case where the user changes nics, and immediately tries to access the cifs
> share it will take SMB_MAX_RTT=10 seconds.

I think it would be better to detect the broken connection
by using an AF_NETLINK socket listening for RTM_DELADDR
messages?

metze
Jeff Layton Feb. 27, 2013, 4:34 p.m. UTC | #2
On Wed, 27 Feb 2013 12:06:14 +0100
"Stefan (metze) Metzmacher" <metze@samba.org> wrote:

> Hi Dave,
> 
> > When messages are currently in queue awaiting a response, decrease amount of
> > time before attempting cifs_reconnect to SMB_MAX_RTT = 10 seconds. The current
> > wait time before attempting to reconnect is currently 2*SMB_ECHO_INTERVAL(120
> > seconds) since the last response was recieved.  This does not take into account
> > the fact that messages waiting for a response should be serviced within a
> > reasonable round trip time.
> 
> Wouldn't that mean that the client will disconnect a good connection,
> if the server doesn't response within 10 seconds?
> Reads and Writes can take longer than 10 seconds...
> 

Where does this magic value of 10s come from? Note that a slow server
can take *minutes* to respond to writes that are long past the EOF.

> > This fixes the issue where user moves from wired to wireless or vice versa
> > causing the mount to hang for 120 seconds, when it could reconnect considerably
> > faster.  After this fix it will take SMB_MAX_RTT (10 seconds) from the last
> > time the user attempted to access the volume or SMB_MAX_RTT after the last
> > echo.  The worst case of the latter scenario being
> > 2*SMB_ECHO_INTERVAL+SMB_MAX_RTT+small scheduling delay (about 130 seconds).
> > Statistically speaking it would normally reconnect sooner.  However in the best
> > case where the user changes nics, and immediately tries to access the cifs
> > share it will take SMB_MAX_RTT=10 seconds.
> 
> I think it would be better to detect the broken connection
> by using an AF_NETLINK socket listening for RTM_DELADDR
> messages?
> 
> metze
> 

Ick -- that sounds horrid ;)

Dave, this problem sounds very similar to the one that your colleague
Chris J Arges was trying to solve several months ago. You may want to
go back and review that thread. Perhaps you can solve both problems at
the same time here...
Dave Chiluk Feb. 27, 2013, 10:24 p.m. UTC | #3
On 02/27/2013 10:34 AM, Jeff Layton wrote:
> On Wed, 27 Feb 2013 12:06:14 +0100
> "Stefan (metze) Metzmacher" <metze@samba.org> wrote:
> 
>> Hi Dave,
>>
>>> When messages are currently in queue awaiting a response, decrease amount of
>>> time before attempting cifs_reconnect to SMB_MAX_RTT = 10 seconds. The current
>>> wait time before attempting to reconnect is currently 2*SMB_ECHO_INTERVAL(120
>>> seconds) since the last response was recieved.  This does not take into account
>>> the fact that messages waiting for a response should be serviced within a
>>> reasonable round trip time.
>>
>> Wouldn't that mean that the client will disconnect a good connection,
>> if the server doesn't response within 10 seconds?
>> Reads and Writes can take longer than 10 seconds...
>>
> 
> Where does this magic value of 10s come from? Note that a slow server
> can take *minutes* to respond to writes that are long past the EOF.
It comes from the desire to decrease the reconnection delay to something
better than a random number between 60 and 120 seconds.  I am not
committed to this number, and it is open for discussion.  Additionally,
if you look closely at the logic, it's not 10 seconds per request; rather,
when requests have been in flight for more than 10 seconds, we make sure
we've heard from the server in the last 10 seconds.

Can you explain more fully your use case of writes that are long past
the EOF?  Perhaps with a test case or script that I can test?  As far as
I know, writes long past EOF will just result in a sparse file and
return in a reasonable round trip time (that's at least what I'm seeing
in my testing).  dd if=/dev/zero of=/mnt/cifs/a bs=1M count=100
seek=100000 starts receiving responses from the server in about .05
seconds, with subsequent responses following at roughly .002-.01 second
intervals.  This is well within my 10 second value.  Even adding the
latency of AT&T's 2g cell network brings it up to only 1s, still 10x
less than my 10 second value.

The new logic goes like this:
  if we've been expecting a response from the server (in_flight), and
     a message has been in flight for more than 10 seconds, and
     we haven't had any other contact from the server in that time:
      reconnect
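
As a rough C sketch of that check (illustration only, not the patch itself:
in_flight and lstrp are meant to follow the existing TCP_Server_Info fields
as I understand them, while oldest_in_flight is a hypothetical stand-in for
"when the oldest pending request was sent"):

    #include <linux/jiffies.h>
    /* struct TCP_Server_Info comes from fs/cifs/cifsglob.h */

    #define SMB_MAX_RTT (10 * HZ)

    static bool server_unresponsive(struct TCP_Server_Info *server)
    {
            /* requests are pending, the oldest has waited more than
             * SMB_MAX_RTT, and nothing (not even an echo reply) has
             * come back from the server in that window */
            if (server->in_flight &&
                time_after(jiffies, server->oldest_in_flight + SMB_MAX_RTT) &&
                time_after(jiffies, server->lstrp + SMB_MAX_RTT)) {
                    cifs_reconnect(server);
                    return true;
            }
            return false;
    }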

On a side note, I discovered a small race condition in the previous
logic while working on this, which my new patch also fixes:
1s       request
2s       response
61.995s  echo job pops
121.995s echo job pops and sends echo
122s     server_unresponsive called; finds no response and attempts to reconnect
122.95s  response to echo received

>>> This fixes the issue where user moves from wired to wireless or vice versa
>>> causing the mount to hang for 120 seconds, when it could reconnect considerably
>>> faster.  After this fix it will take SMB_MAX_RTT (10 seconds) from the last
>>> time the user attempted to access the volume or SMB_MAX_RTT after the last
>>> echo.  The worst case of the latter scenario being
>>> 2*SMB_ECHO_INTERVAL+SMB_MAX_RTT+small scheduling delay (about 130 seconds).
>>> Statistically speaking it would normally reconnect sooner.  However in the best
>>> case where the user changes nics, and immediately tries to access the cifs
>>> share it will take SMB_MAX_RTT=10 seconds.
>>
>> I think it would be better to detect the broken connection
>> by using an AF_NETLINK socket listening for RTM_DELADDR
>> messages?
>>
>> metze
>>
> 
> Ick -- that sounds horrid ;)
> 
> Dave, this problem sounds very similar to the one that your colleague
> Chris J Arges was trying to solve several months ago. You may want to
> go back and review that thread. Perhaps you can solve both problems at
> the same time here...
> 
This is the same problem as was discussed here.
https://patchwork.kernel.org/patch/1717841/

From that thread you made the suggestion of
"What would really be better is fixing the code to only echo when there
are outstanding calls to the server."

I thought about that, and liked keeping the echo functionality as a
heartbeat when nothing else is going on.  If we only echo when there
are outstanding calls, then the client will not attempt to reconnect
until the user attempts to use the mount.  I'd rather it reconnect when
nothing is happening.

As for the rest of the suggestion from that thread, we aren't trying to
solve a suspend/resume use case, but actually a dock/undock use case.
Basically reconnecting quickly when going from wired to wireless or vice
versa.

Dave.
Steve French Feb. 27, 2013, 10:40 p.m. UTC | #4
On Wed, Feb 27, 2013 at 4:24 PM, Dave Chiluk <dave.chiluk@canonical.com> wrote:
> On 02/27/2013 10:34 AM, Jeff Layton wrote:
>> On Wed, 27 Feb 2013 12:06:14 +0100
>> "Stefan (metze) Metzmacher" <metze@samba.org> wrote:
>>
>>> Hi Dave,
>>>
>>>> When messages are currently in queue awaiting a response, decrease amount of
>>>> time before attempting cifs_reconnect to SMB_MAX_RTT = 10 seconds. The current
>>>> wait time before attempting to reconnect is currently 2*SMB_ECHO_INTERVAL(120
>>>> seconds) since the last response was recieved.  This does not take into account
>>>> the fact that messages waiting for a response should be serviced within a
>>>> reasonable round trip time.
>>>
>>> Wouldn't that mean that the client will disconnect a good connection,
>>> if the server doesn't response within 10 seconds?
>>> Reads and Writes can take longer than 10 seconds...
>>>
>>
>> Where does this magic value of 10s come from? Note that a slow server
>> can take *minutes* to respond to writes that are long past the EOF.
> It comes from the desire to decrease the reconnection delay to something
> better than a random number between 60 and 120 seconds.  I am not
> committed to this number, and it is open for discussion.  Additionally
> if you look closely at the logic it's not 10 seconds per request, but
> actually when requests have been in flight for more than 10 seconds make
> sure we've heard from the server in the last 10 seconds.
>
> Can you explain more fully your use case of writes that are long past
> the EOF?  Perhaps with a test-case or script that I can test?  As far as
> I know writes long past EOF will just result in a sparse file, and
> return in a reasonable round trip time *(that's at least what I'm seeing
> with my testing).  dd if=/dev/zero of=/mnt/cifs/a bs=1M count=100
> seek=100000, starts receiving responses from the server in about .05
> seconds with subsequent responses following at roughly .002-.01 second
> intervals.  This is well within my 10 second value.

Note that not all Linux file systems support sparse files, and there are
certainly cifs servers running on operating systems other than Linux with
popular file systems that don't support sparse files (e.g. FAT32, but there
are many others).  In any case, writes after end of file can take a LONG
time if sparse files are not supported, and I don't know a good way for the
client to know that attribute of the server file system ahead of time
(although we could attempt to set the sparse flag, servers can and do lie).
Dave Chiluk Feb. 27, 2013, 10:44 p.m. UTC | #5
On 02/27/2013 04:40 PM, Steve French wrote:
> On Wed, Feb 27, 2013 at 4:24 PM, Dave Chiluk <dave.chiluk@canonical.com> wrote:
>> On 02/27/2013 10:34 AM, Jeff Layton wrote:
>>> On Wed, 27 Feb 2013 12:06:14 +0100
>>> "Stefan (metze) Metzmacher" <metze@samba.org> wrote:
>>>
>>>> Hi Dave,
>>>>
>>>>> When messages are currently in queue awaiting a response, decrease amount of
>>>>> time before attempting cifs_reconnect to SMB_MAX_RTT = 10 seconds. The current
>>>>> wait time before attempting to reconnect is currently 2*SMB_ECHO_INTERVAL(120
>>>>> seconds) since the last response was recieved.  This does not take into account
>>>>> the fact that messages waiting for a response should be serviced within a
>>>>> reasonable round trip time.
>>>>
>>>> Wouldn't that mean that the client will disconnect a good connection,
>>>> if the server doesn't response within 10 seconds?
>>>> Reads and Writes can take longer than 10 seconds...
>>>>
>>>
>>> Where does this magic value of 10s come from? Note that a slow server
>>> can take *minutes* to respond to writes that are long past the EOF.
>> It comes from the desire to decrease the reconnection delay to something
>> better than a random number between 60 and 120 seconds.  I am not
>> committed to this number, and it is open for discussion.  Additionally
>> if you look closely at the logic it's not 10 seconds per request, but
>> actually when requests have been in flight for more than 10 seconds make
>> sure we've heard from the server in the last 10 seconds.
>>
>> Can you explain more fully your use case of writes that are long past
>> the EOF?  Perhaps with a test-case or script that I can test?  As far as
>> I know writes long past EOF will just result in a sparse file, and
>> return in a reasonable round trip time *(that's at least what I'm seeing
>> with my testing).  dd if=/dev/zero of=/mnt/cifs/a bs=1M count=100
>> seek=100000, starts receiving responses from the server in about .05
>> seconds with subsequent responses following at roughly .002-.01 second
>> intervals.  This is well within my 10 second value.
> 
> Note that not all Linux file systems support sparse files and
> certainly there are cifs servers running on operating systems other
> than Linux which have popular file systems which don't support sparse
> files (e.g. FAT32 but there are many others) - in any case, writes
> after end of file can take a LONG time if sparse files are not
> supported and I don't know a good way for the client to know that
> attribute of the server file system ahead of time (although we could
> attempt to set the sparse flag, servers can and do lie)
> 

It doesn't matter how long it takes for the entire operation to
complete, just so long as the server acks something in less than 10
seconds.  Now the question becomes: is there an OS out there that
doesn't ack the request, or doesn't ack its progress regularly?
Stefan Metzmacher Feb. 28, 2013, 12:15 a.m. UTC | #6
Am 27.02.2013 17:34, schrieb Jeff Layton:
> On Wed, 27 Feb 2013 12:06:14 +0100
> "Stefan (metze) Metzmacher" <metze@samba.org> wrote:
> 
>> Hi Dave,
>>
>>> When messages are currently in queue awaiting a response, decrease amount of
>>> time before attempting cifs_reconnect to SMB_MAX_RTT = 10 seconds. The current
>>> wait time before attempting to reconnect is currently 2*SMB_ECHO_INTERVAL(120
>>> seconds) since the last response was recieved.  This does not take into account
>>> the fact that messages waiting for a response should be serviced within a
>>> reasonable round trip time.
>>
>> Wouldn't that mean that the client will disconnect a good connection,
>> if the server doesn't response within 10 seconds?
>> Reads and Writes can take longer than 10 seconds...
>>
> 
> Where does this magic value of 10s come from? Note that a slow server
> can take *minutes* to respond to writes that are long past the EOF.
> 
>>> This fixes the issue where user moves from wired to wireless or vice versa
>>> causing the mount to hang for 120 seconds, when it could reconnect considerably
>>> faster.  After this fix it will take SMB_MAX_RTT (10 seconds) from the last
>>> time the user attempted to access the volume or SMB_MAX_RTT after the last
>>> echo.  The worst case of the latter scenario being
>>> 2*SMB_ECHO_INTERVAL+SMB_MAX_RTT+small scheduling delay (about 130 seconds).
>>> Statistically speaking it would normally reconnect sooner.  However in the best
>>> case where the user changes nics, and immediately tries to access the cifs
>>> share it will take SMB_MAX_RTT=10 seconds.
>>
>> I think it would be better to detect the broken connection
>> by using an AF_NETLINK socket listening for RTM_DELADDR
>> messages?
>>
>> metze
>>
> 
> Ick -- that sounds horrid ;)

This is what winbindd uses to detect that a source ip of outgoing
connections is gone. I don't know much of the kernel; there might be a
better way from within the kernel to detect this. But this is exactly the
correct thing to do to fail over to another interface, as it happens just
when the ip is removed, without messing with a timeout value.
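
As a minimal userspace sketch of that RTM_DELADDR approach (an illustration
of what a winbindd-style listener does, not anything the cifs module
currently contains; error handling is trimmed):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <linux/netlink.h>
    #include <linux/rtnetlink.h>

    int main(void)
    {
            struct sockaddr_nl addr;
            char buf[4096];
            int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

            if (fd < 0)
                    return 1;

            memset(&addr, 0, sizeof(addr));
            addr.nl_family = AF_NETLINK;
            addr.nl_groups = RTMGRP_IPV4_IFADDR | RTMGRP_IPV6_IFADDR;
            if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
                    return 1;

            for (;;) {
                    int len = recv(fd, buf, sizeof(buf), 0);
                    struct nlmsghdr *nh;

                    if (len <= 0)
                            break;
                    /* walk the netlink messages; RTM_DELADDR means a local
                     * address (and thus a source ip) just went away */
                    for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
                         nh = NLMSG_NEXT(nh, len)) {
                            if (nh->nlmsg_type == RTM_DELADDR)
                                    printf("local address removed\n");
                    }
            }
            close(fd);
            return 0;
    }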

Another optimization would be to use tcp keepalives (I think 10 seconds
would be ok there); I think that's what Windows SMB3 clients are using.
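
For illustration, a userspace sketch of the kind of keepalive tuning being
suggested (the 10-second/3-probe values are just the numbers from this
discussion, not what any SMB client is known to configure):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    static int enable_keepalive(int sock)
    {
            int on = 1, idle = 10, intvl = 10, cnt = 3;

            if (setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
                    return -1;
            /* start probing after 10s of silence, probe every 10s,
             * declare the peer dead after 3 unanswered probes */
            if (setsockopt(sock, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0 ||
                setsockopt(sock, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) < 0 ||
                setsockopt(sock, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)) < 0)
                    return -1;
            return 0;
    }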

metze
Stefan Metzmacher Feb. 28, 2013, 12:17 a.m. UTC | #7
Am 27.02.2013 23:44, schrieb Dave Chiluk:
> On 02/27/2013 04:40 PM, Steve French wrote:
>> On Wed, Feb 27, 2013 at 4:24 PM, Dave Chiluk <dave.chiluk@canonical.com> wrote:
>>> On 02/27/2013 10:34 AM, Jeff Layton wrote:
>>>> On Wed, 27 Feb 2013 12:06:14 +0100
>>>> "Stefan (metze) Metzmacher" <metze@samba.org> wrote:
>>>>
>>>>> Hi Dave,
>>>>>
>>>>>> When messages are currently in queue awaiting a response, decrease amount of
>>>>>> time before attempting cifs_reconnect to SMB_MAX_RTT = 10 seconds. The current
>>>>>> wait time before attempting to reconnect is currently 2*SMB_ECHO_INTERVAL(120
>>>>>> seconds) since the last response was recieved.  This does not take into account
>>>>>> the fact that messages waiting for a response should be serviced within a
>>>>>> reasonable round trip time.
>>>>>
>>>>> Wouldn't that mean that the client will disconnect a good connection,
>>>>> if the server doesn't response within 10 seconds?
>>>>> Reads and Writes can take longer than 10 seconds...
>>>>>
>>>>
>>>> Where does this magic value of 10s come from? Note that a slow server
>>>> can take *minutes* to respond to writes that are long past the EOF.
>>> It comes from the desire to decrease the reconnection delay to something
>>> better than a random number between 60 and 120 seconds.  I am not
>>> committed to this number, and it is open for discussion.  Additionally
>>> if you look closely at the logic it's not 10 seconds per request, but
>>> actually when requests have been in flight for more than 10 seconds make
>>> sure we've heard from the server in the last 10 seconds.
>>>
>>> Can you explain more fully your use case of writes that are long past
>>> the EOF?  Perhaps with a test-case or script that I can test?  As far as
>>> I know writes long past EOF will just result in a sparse file, and
>>> return in a reasonable round trip time *(that's at least what I'm seeing
>>> with my testing).  dd if=/dev/zero of=/mnt/cifs/a bs=1M count=100
>>> seek=100000, starts receiving responses from the server in about .05
>>> seconds with subsequent responses following at roughly .002-.01 second
>>> intervals.  This is well within my 10 second value.
>>
>> Note that not all Linux file systems support sparse files and
>> certainly there are cifs servers running on operating systems other
>> than Linux which have popular file systems which don't support sparse
>> files (e.g. FAT32 but there are many others) - in any case, writes
>> after end of file can take a LONG time if sparse files are not
>> supported and I don't know a good way for the client to know that
>> attribute of the server file system ahead of time (although we could
>> attempt to set the sparse flag, servers can and do lie)
>>
> 
> It doesn't matter how long it takes for the entire operation to
> complete, just so long as the server acks something in less than 10
> seconds.  Now the question becomes, is there an OS out there that
> doesn't ack the request or doesn't ack the progress regularly.

This kind of ack can only happen at the tcp layer, not at the smb layer.

metze
simo Feb. 28, 2013, 1:25 a.m. UTC | #8
On Wed, 2013-02-27 at 16:44 -0600, Dave Chiluk wrote:
> On 02/27/2013 04:40 PM, Steve French wrote:
> > On Wed, Feb 27, 2013 at 4:24 PM, Dave Chiluk <dave.chiluk@canonical.com> wrote:
> >> On 02/27/2013 10:34 AM, Jeff Layton wrote:
> >>> On Wed, 27 Feb 2013 12:06:14 +0100
> >>> "Stefan (metze) Metzmacher" <metze@samba.org> wrote:
> >>>
> >>>> Hi Dave,
> >>>>
> >>>>> When messages are currently in queue awaiting a response, decrease amount of
> >>>>> time before attempting cifs_reconnect to SMB_MAX_RTT = 10 seconds. The current
> >>>>> wait time before attempting to reconnect is currently 2*SMB_ECHO_INTERVAL(120
> >>>>> seconds) since the last response was recieved.  This does not take into account
> >>>>> the fact that messages waiting for a response should be serviced within a
> >>>>> reasonable round trip time.
> >>>>
> >>>> Wouldn't that mean that the client will disconnect a good connection,
> >>>> if the server doesn't response within 10 seconds?
> >>>> Reads and Writes can take longer than 10 seconds...
> >>>>
> >>>
> >>> Where does this magic value of 10s come from? Note that a slow server
> >>> can take *minutes* to respond to writes that are long past the EOF.
> >> It comes from the desire to decrease the reconnection delay to something
> >> better than a random number between 60 and 120 seconds.  I am not
> >> committed to this number, and it is open for discussion.  Additionally
> >> if you look closely at the logic it's not 10 seconds per request, but
> >> actually when requests have been in flight for more than 10 seconds make
> >> sure we've heard from the server in the last 10 seconds.
> >>
> >> Can you explain more fully your use case of writes that are long past
> >> the EOF?  Perhaps with a test-case or script that I can test?  As far as
> >> I know writes long past EOF will just result in a sparse file, and
> >> return in a reasonable round trip time *(that's at least what I'm seeing
> >> with my testing).  dd if=/dev/zero of=/mnt/cifs/a bs=1M count=100
> >> seek=100000, starts receiving responses from the server in about .05
> >> seconds with subsequent responses following at roughly .002-.01 second
> >> intervals.  This is well within my 10 second value.
> > 
> > Note that not all Linux file systems support sparse files and
> > certainly there are cifs servers running on operating systems other
> > than Linux which have popular file systems which don't support sparse
> > files (e.g. FAT32 but there are many others) - in any case, writes
> > after end of file can take a LONG time if sparse files are not
> > supported and I don't know a good way for the client to know that
> > attribute of the server file system ahead of time (although we could
> > attempt to set the sparse flag, servers can and do lie)
> > 
> 
> It doesn't matter how long it takes for the entire operation to
> complete, just so long as the server acks something in less than 10
> seconds.  Now the question becomes, is there an OS out there that
> doesn't ack the request or doesn't ack the progress regularly.

IIRC older samba servers were fully synchronous and wouldn't reply to
anything while processing an operation. I am sure you can still find old
code bases in older (and slow) appliances out there.

Simo.
Tom Talpey Feb. 28, 2013, 1:26 a.m. UTC | #9
> -----Original Message-----
> From: linux-cifs-owner@vger.kernel.org [mailto:linux-cifs-
> owner@vger.kernel.org] On Behalf Of Dave Chiluk
> Sent: Wednesday, February 27, 2013 5:44 PM
> To: Steve French
> Cc: Jeff Layton; Stefan (metze) Metzmacher; Dave Chiluk; Steve French;
> linux-cifs@vger.kernel.org; samba-technical@lists.samba.org; linux-
> kernel@vger.kernel.org
> Subject: Re: [PATCH] CIFS: Decrease reconnection delay when switching nics
> 
> On 02/27/2013 04:40 PM, Steve French wrote:
> > On Wed, Feb 27, 2013 at 4:24 PM, Dave Chiluk
> <dave.chiluk@canonical.com> wrote:
> >> On 02/27/2013 10:34 AM, Jeff Layton wrote:
> >>> On Wed, 27 Feb 2013 12:06:14 +0100
> >>> "Stefan (metze) Metzmacher" <metze@samba.org> wrote:
> >>>
> >>>> Hi Dave,
> >>>>
> >>>>> When messages are currently in queue awaiting a response, decrease
> >>>>> amount of time before attempting cifs_reconnect to SMB_MAX_RTT
> =
> >>>>> 10 seconds. The current wait time before attempting to reconnect
> >>>>> is currently 2*SMB_ECHO_INTERVAL(120
> >>>>> seconds) since the last response was recieved.  This does not take
> >>>>> into account the fact that messages waiting for a response should
> >>>>> be serviced within a reasonable round trip time.
> >>>>
> >>>> Wouldn't that mean that the client will disconnect a good
> >>>> connection, if the server doesn't response within 10 seconds?
> >>>> Reads and Writes can take longer than 10 seconds...
> >>>>
> >>>
> >>> Where does this magic value of 10s come from? Note that a slow
> >>> server can take *minutes* to respond to writes that are long past the
> EOF.
> >> It comes from the desire to decrease the reconnection delay to
> >> something better than a random number between 60 and 120 seconds.  I
> >> am not committed to this number, and it is open for discussion.
> >> Additionally if you look closely at the logic it's not 10 seconds per
> >> request, but actually when requests have been in flight for more than
> >> 10 seconds make sure we've heard from the server in the last 10 seconds.
> >>
> >> Can you explain more fully your use case of writes that are long past
> >> the EOF?  Perhaps with a test-case or script that I can test?  As far
> >> as I know writes long past EOF will just result in a sparse file, and
> >> return in a reasonable round trip time *(that's at least what I'm
> >> seeing with my testing).  dd if=/dev/zero of=/mnt/cifs/a bs=1M
> >> count=100 seek=100000, starts receiving responses from the server in
> >> about .05 seconds with subsequent responses following at roughly
> >> .002-.01 second intervals.  This is well within my 10 second value.
> >
> > Note that not all Linux file systems support sparse files and
> > certainly there are cifs servers running on operating systems other
> > than Linux which have popular file systems which don't support sparse
> > files (e.g. FAT32 but there are many others) - in any case, writes
> > after end of file can take a LONG time if sparse files are not
> > supported and I don't know a good way for the client to know that
> > attribute of the server file system ahead of time (although we could
> > attempt to set the sparse flag, servers can and do lie)
> >
> 
> It doesn't matter how long it takes for the entire operation to complete, just
> so long as the server acks something in less than 10 seconds.  Now the
> question becomes, is there an OS out there that doesn't ack the request or
> doesn't ack the progress regularly.

SMB/CIFS servers will signal the operation "going async" by returning a
STATUS_PENDING response if the operation is not prompt, but this only
happens once. The client is still expected to run a timer, and recover from
possibly lost responses and/or unresponsive servers. Windows clients
extend their timeout when this occurs, typically quadrupling it.

Some clients will issue ECHO requests to probe the server in this
case, but it is neither a protocol requirement nor does it truly address
the issue of tracking each pending operation. Windows SMB2 clients
do not do this.

Tom Talpey Feb. 28, 2013, 1:01 p.m. UTC | #10
> -----Original Message-----
> From: samba-technical-bounces@lists.samba.org [mailto:samba-technical-
> bounces@lists.samba.org] On Behalf Of Stefan (metze) Metzmacher
> Sent: Wednesday, February 27, 2013 7:16 PM
> To: Jeff Layton
> Cc: Steve French; Dave Chiluk; samba-technical@lists.samba.org; linux-
> kernel@vger.kernel.org; linux-cifs@vger.kernel.org
> Subject: Re: [PATCH] CIFS: Decrease reconnection delay when switching nics
> 
> Am 27.02.2013 17:34, schrieb Jeff Layton:
> > On Wed, 27 Feb 2013 12:06:14 +0100
> > "Stefan (metze) Metzmacher" <metze@samba.org> wrote:
> >
> >> Hi Dave,
> >>
> >>> When messages are currently in queue awaiting a response, decrease
> >>> amount of time before attempting cifs_reconnect to SMB_MAX_RTT =
> 10
> >>> seconds. The current wait time before attempting to reconnect is
> >>> currently 2*SMB_ECHO_INTERVAL(120
> >>> seconds) since the last response was recieved.  This does not take
> >>> into account the fact that messages waiting for a response should be
> >>> serviced within a reasonable round trip time.
> >>
> >> Wouldn't that mean that the client will disconnect a good connection,
> >> if the server doesn't response within 10 seconds?
> >> Reads and Writes can take longer than 10 seconds...
> >>
> >
> > Where does this magic value of 10s come from? Note that a slow server
> > can take *minutes* to respond to writes that are long past the EOF.
> >
> >>> This fixes the issue where user moves from wired to wireless or vice
> >>> versa causing the mount to hang for 120 seconds, when it could
> >>> reconnect considerably faster.  After this fix it will take
> >>> SMB_MAX_RTT (10 seconds) from the last time the user attempted to
> >>> access the volume or SMB_MAX_RTT after the last echo.  The worst
> >>> case of the latter scenario being
> 2*SMB_ECHO_INTERVAL+SMB_MAX_RTT+small scheduling delay (about 130
> seconds).
> >>> Statistically speaking it would normally reconnect sooner.  However
> >>> in the best case where the user changes nics, and immediately tries
> >>> to access the cifs share it will take SMB_MAX_RTT=10 seconds.
> >>
> >> I think it would be better to detect the broken connection by using
> >> an AF_NETLINK socket listening for RTM_DELADDR messages?
> >>
> >> metze
> >>
> >
> > Ick -- that sounds horrid ;)
> 
> This is what winbindd uses to detect that a source ip of outgoing connections
> are gone. I don't know much of the kernel, there might be a better way from
> within the kernel to detect this. But this is exactly the correct thing to do to
> failover to another interface, as it just happens when the ip is removed
> without messing with a timeout value.
> 
> Another optimization would be to use tcp keepalives (I think there 10
> seconds would be ok), I think that's what Windows SMB3 clients are using.

Yes, they do. See MS-SMB2 behavior note 144 attached to section 3.2.5.14.9.

10 seconds seems a fairly rapid keepalive interval. The TCP stack probably
won't allow it to be less than the maximum retransmit, for instance.

Tom.
Jeff Layton Feb. 28, 2013, 3:26 p.m. UTC | #11
On Wed, 27 Feb 2013 16:24:07 -0600
Dave Chiluk <dave.chiluk@canonical.com> wrote:

> On 02/27/2013 10:34 AM, Jeff Layton wrote:
> > On Wed, 27 Feb 2013 12:06:14 +0100
> > "Stefan (metze) Metzmacher" <metze@samba.org> wrote:
> > 
> >> Hi Dave,
> >>
> >>> When messages are currently in queue awaiting a response, decrease amount of
> >>> time before attempting cifs_reconnect to SMB_MAX_RTT = 10 seconds. The current
> >>> wait time before attempting to reconnect is currently 2*SMB_ECHO_INTERVAL(120
> >>> seconds) since the last response was recieved.  This does not take into account
> >>> the fact that messages waiting for a response should be serviced within a
> >>> reasonable round trip time.
> >>
> >> Wouldn't that mean that the client will disconnect a good connection,
> >> if the server doesn't response within 10 seconds?
> >> Reads and Writes can take longer than 10 seconds...
> >>
> > 
> > Where does this magic value of 10s come from? Note that a slow server
> > can take *minutes* to respond to writes that are long past the EOF.
> It comes from the desire to decrease the reconnection delay to something
> better than a random number between 60 and 120 seconds.  I am not
> committed to this number, and it is open for discussion.  Additionally
> if you look closely at the logic it's not 10 seconds per request, but
> actually when requests have been in flight for more than 10 seconds make
> sure we've heard from the server in the last 10 seconds.
> 
> Can you explain more fully your use case of writes that are long past
> the EOF?  Perhaps with a test-case or script that I can test?  As far as
> I know writes long past EOF will just result in a sparse file, and
> return in a reasonable round trip time *(that's at least what I'm seeing
> with my testing).  dd if=/dev/zero of=/mnt/cifs/a bs=1M count=100
> seek=100000, starts receiving responses from the server in about .05
> seconds with subsequent responses following at roughly .002-.01 second
> intervals.  This is well within my 10 second value.  Even adding the
> latency of AT&T's 2g cell network brings it up to only 1s.  Still 10x
> less than my 10 second value.
> 
> The new logic goes like this
> if( we've been expecting a response from the server (in_flight), and
>  message has been in_flight for more than 10 seconds and
>  we haven't had any other contact from the server in that time
>   reconnect
> 

That will break writes long past the EOF. Note too that reconnects on
CIFS are horrifically expensive and problematic. Much of the state on a
CIFS mount is tied to the connection. When that drops, open files are
closed and things like locks are dropped. SMB1 has no real mechanism
for state recovery, so that can really be a problem.

> On a side note, I discovered a small race condition in the previous
> logic while working on this, that my new patch also fixes.
> 1s  request
> 2s  response
> 61.995 echo job pops
> 121.995 echo job pops and sends echo
> 122 server_unresponsive called.  Finds no response and attempts to
> 	 reconnect
> 122.95 response to echo received
> 

Sure, here's a reproducer. Do this against a windows server, preferably
one exporting NTFS on relatively slow storage. Make sure that
"testfile" doesn't exist first:

     $ dd if=/dev/zero of=/path/to/cifs/share/testfile bs=1M count=1 seek=3192

NTFS doesn't support sparse files, so the OS has to zero-fill up to the
point where you're writing. That can take a looooong time on slow
storage (minutes even). What we do now is periodically send a SMB echo
to make sure the server is alive rather than trying to time out a
particular call.

The logic that handles that today is somewhat sub-optimal though. We
send an echo every 60s whether there are any calls in flight or not and
wait for 60s until we decide that the server isn't there. What would be
better is to only send one when we've been waiting a long time for a
response.

That "long time" is debatable -- 10s would be fine with me but the
logic needs to be fixed not to send echoes unless there is an
outstanding request first.
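
A minimal sketch of that condition (assumptions only: in_flight and lstrp are
meant to follow the existing TCP_Server_Info fields, and the helper name is
made up):

    /* only worth probing the server if something is outstanding and it
     * has been silent for longer than the agreed threshold */
    static bool need_echo_probe(struct TCP_Server_Info *server,
                                unsigned long threshold)
    {
            return server->in_flight > 0 &&
                   time_after(jiffies, server->lstrp + threshold);
    }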

I think though that you're trying to use this mechanism to do something
that it wasn't really designed to do. A better method might be to try
and detect whether the TCP connection is really dead somehow. That
would be more immediate, but I'm unclear on how best to do that.
Probably it'll mean groveling around down in the TCP layer...

FWIW, there was a thread on the linux-cifs mailing list started on Dec
3, 2010 entitled "cifs client timeouts and hard/soft mounts" that lays
out the rationale for the current reconnection behavior. You may want
to look over that before you go making changes here...
Steve French Feb. 28, 2013, 4:04 p.m. UTC | #12
On Thu, Feb 28, 2013 at 9:26 AM, Jeff Layton <jlayton@samba.org> wrote:
> On Wed, 27 Feb 2013 16:24:07 -0600
> Dave Chiluk <dave.chiluk@canonical.com> wrote:
>
>> On 02/27/2013 10:34 AM, Jeff Layton wrote:
>> > On Wed, 27 Feb 2013 12:06:14 +0100
>> > "Stefan (metze) Metzmacher" <metze@samba.org> wrote:
>> >
>> >> Hi Dave,
>> >>
>> >>> When messages are currently in queue awaiting a response, decrease amount of
>> >>> time before attempting cifs_reconnect to SMB_MAX_RTT = 10 seconds. The current
>> >>> wait time before attempting to reconnect is currently 2*SMB_ECHO_INTERVAL(120
>> >>> seconds) since the last response was recieved.  This does not take into account
>> >>> the fact that messages waiting for a response should be serviced within a
>> >>> reasonable round trip time.
>> >>
>> >> Wouldn't that mean that the client will disconnect a good connection,
>> >> if the server doesn't response within 10 seconds?
>> >> Reads and Writes can take longer than 10 seconds...
>> >>
>> >
>> > Where does this magic value of 10s come from? Note that a slow server
>> > can take *minutes* to respond to writes that are long past the EOF.
>> It comes from the desire to decrease the reconnection delay to something
>> better than a random number between 60 and 120 seconds.  I am not
>> committed to this number, and it is open for discussion.  Additionally
>> if you look closely at the logic it's not 10 seconds per request, but
>> actually when requests have been in flight for more than 10 seconds make
>> sure we've heard from the server in the last 10 seconds.
>>
>> Can you explain more fully your use case of writes that are long past
>> the EOF?  Perhaps with a test-case or script that I can test?  As far as
>> I know writes long past EOF will just result in a sparse file, and
>> return in a reasonable round trip time *(that's at least what I'm seeing
>> with my testing).  dd if=/dev/zero of=/mnt/cifs/a bs=1M count=100
>> seek=100000, starts receiving responses from the server in about .05
>> seconds with subsequent responses following at roughly .002-.01 second
>> intervals.  This is well within my 10 second value.  Even adding the
>> latency of AT&T's 2g cell network brings it up to only 1s.  Still 10x
>> less than my 10 second value.
>>
>> The new logic goes like this
>> if( we've been expecting a response from the server (in_flight), and
>>  message has been in_flight for more than 10 seconds and
>>  we haven't had any other contact from the server in that time
>>   reconnect
>>
>
> That will break writes long past the EOF. Note too that reconnects on
> CIFS are horrifically expensive and problematic. Much of the state on a
> CIFS mount is tied to the connection. When that drops, open files are
> closed and things like locks are dropped. SMB1 has no real mechanism
> for state recovery, so that can really be a problem.
>
>> On a side note, I discovered a small race condition in the previous
>> logic while working on this, that my new patch also fixes.
>> 1s  request
>> 2s  response
>> 61.995 echo job pops
>> 121.995 echo job pops and sends echo
>> 122 server_unresponsive called.  Finds no response and attempts to
>>        reconnect
>> 122.95 response to echo received
>>
>
> Sure, here's a reproducer. Do this against a windows server, preferably
> one exporting NTFS on relatively slow storage. Make sure that
> "testfile" doesn't exist first:
>
>      $ dd if=/dev/zero of=/path/to/cifs/share/testfile bs=1M count=1 seek=3192
>
> NTFS doesn't support sparse files, so the OS has to zero-fill up to the
> point where you're writing. That can take a looooong time on slow
> storage (minutes even). What we do now is periodically send a SMB echo
> to make sure the server is alive rather than trying to time out a
> particular call.

Writing past end of file in Windows can be very slow, but note that it
is possible for a Windows application to mark a file as sparse on an NTFS
partition.  Quoting from
http://msdn.microsoft.com/en-us/library/windows/desktop/aa365566%28v=vs.85%29.aspx
Windows NTFS does support sparse files (and we could even send the flag over
cifs if we wanted), but it has to be explicitly set by the app on the
file:

"To determine whether a file system supports sparse files, call the
GetVolumeInformation function and examine the
FILE_SUPPORTS_SPARSE_FILES bit flag returned through the
lpFileSystemFlags parameter.

Most applications are not aware of sparse files and will not create
sparse files. The fact that an application is reading a sparse file is
transparent to the application. An application that is aware of
sparse-files should determine whether its data set is suitable to be
kept in a sparse file. After that determination is made, the
application must explicitly declare a file as sparse, using the
FSCTL_SET_SPARSE control code."
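
A small Win32 sketch of the two steps that MSDN text describes, for
illustration only (the file path is made up; error handling is trimmed):

    #include <windows.h>
    #include <winioctl.h>
    #include <stdio.h>

    int main(void)
    {
            DWORD flags = 0, ret = 0;
            HANDLE h;

            /* does the volume support sparse files at all? */
            if (!GetVolumeInformationA("C:\\", NULL, 0, NULL, NULL, &flags, NULL, 0))
                    return 1;
            if (!(flags & FILE_SUPPORTS_SPARSE_FILES)) {
                    printf("volume does not support sparse files\n");
                    return 0;
            }

            h = CreateFileA("C:\\temp\\sparse.dat", GENERIC_WRITE, 0, NULL,
                            CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
            if (h == INVALID_HANDLE_VALUE)
                    return 1;

            /* the app must opt in explicitly; the file system will not
             * make the file sparse on its own */
            if (!DeviceIoControl(h, FSCTL_SET_SPARSE, NULL, 0, NULL, 0, &ret, NULL))
                    printf("FSCTL_SET_SPARSE failed\n");

            CloseHandle(h);
            return 0;
    }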
Jeff Layton Feb. 28, 2013, 4:47 p.m. UTC | #13
On Thu, 28 Feb 2013 10:04:36 -0600
Steve French <smfrench@gmail.com> wrote:

> On Thu, Feb 28, 2013 at 9:26 AM, Jeff Layton <jlayton@samba.org> wrote:
> > On Wed, 27 Feb 2013 16:24:07 -0600
> > Dave Chiluk <dave.chiluk@canonical.com> wrote:
> >
> >> On 02/27/2013 10:34 AM, Jeff Layton wrote:
> >> > On Wed, 27 Feb 2013 12:06:14 +0100
> >> > "Stefan (metze) Metzmacher" <metze@samba.org> wrote:
> >> >
> >> >> Hi Dave,
> >> >>
> >> >>> When messages are currently in queue awaiting a response, decrease amount of
> >> >>> time before attempting cifs_reconnect to SMB_MAX_RTT = 10 seconds. The current
> >> >>> wait time before attempting to reconnect is currently 2*SMB_ECHO_INTERVAL(120
> >> >>> seconds) since the last response was recieved.  This does not take into account
> >> >>> the fact that messages waiting for a response should be serviced within a
> >> >>> reasonable round trip time.
> >> >>
> >> >> Wouldn't that mean that the client will disconnect a good connection,
> >> >> if the server doesn't response within 10 seconds?
> >> >> Reads and Writes can take longer than 10 seconds...
> >> >>
> >> >
> >> > Where does this magic value of 10s come from? Note that a slow server
> >> > can take *minutes* to respond to writes that are long past the EOF.
> >> It comes from the desire to decrease the reconnection delay to something
> >> better than a random number between 60 and 120 seconds.  I am not
> >> committed to this number, and it is open for discussion.  Additionally
> >> if you look closely at the logic it's not 10 seconds per request, but
> >> actually when requests have been in flight for more than 10 seconds make
> >> sure we've heard from the server in the last 10 seconds.
> >>
> >> Can you explain more fully your use case of writes that are long past
> >> the EOF?  Perhaps with a test-case or script that I can test?  As far as
> >> I know writes long past EOF will just result in a sparse file, and
> >> return in a reasonable round trip time *(that's at least what I'm seeing
> >> with my testing).  dd if=/dev/zero of=/mnt/cifs/a bs=1M count=100
> >> seek=100000, starts receiving responses from the server in about .05
> >> seconds with subsequent responses following at roughly .002-.01 second
> >> intervals.  This is well within my 10 second value.  Even adding the
> >> latency of AT&T's 2g cell network brings it up to only 1s.  Still 10x
> >> less than my 10 second value.
> >>
> >> The new logic goes like this
> >> if( we've been expecting a response from the server (in_flight), and
> >>  message has been in_flight for more than 10 seconds and
> >>  we haven't had any other contact from the server in that time
> >>   reconnect
> >>
> >
> > That will break writes long past the EOF. Note too that reconnects on
> > CIFS are horrifically expensive and problematic. Much of the state on a
> > CIFS mount is tied to the connection. When that drops, open files are
> > closed and things like locks are dropped. SMB1 has no real mechanism
> > for state recovery, so that can really be a problem.
> >
> >> On a side note, I discovered a small race condition in the previous
> >> logic while working on this, that my new patch also fixes.
> >> 1s  request
> >> 2s  response
> >> 61.995 echo job pops
> >> 121.995 echo job pops and sends echo
> >> 122 server_unresponsive called.  Finds no response and attempts to
> >>        reconnect
> >> 122.95 response to echo received
> >>
> >
> > Sure, here's a reproducer. Do this against a windows server, preferably
> > one exporting NTFS on relatively slow storage. Make sure that
> > "testfile" doesn't exist first:
> >
> >      $ dd if=/dev/zero of=/path/to/cifs/share/testfile bs=1M count=1 seek=3192
> >
> > NTFS doesn't support sparse files, so the OS has to zero-fill up to the
> > point where you're writing. That can take a looooong time on slow
> > storage (minutes even). What we do now is periodically send a SMB echo
> > to make sure the server is alive rather than trying to time out a
> > particular call.
> 
> Writing past end of file in Windows can be very slow, but note that it
> is possible for a windows to set as sparse a file on an NTFS
> partition.   Quoting from
> http://msdn.microsoft.com/en-us/library/windows/desktop/aa365566%28v=vs.85%29.aspx
> Windows NTFS does support sparse files (and we could even send it over
> cifs if we want) but it has to be explicitly set by the app on the
> file:
> 
> "To determine whether a file system supports sparse files, call the
> GetVolumeInformation function and examine the
> FILE_SUPPORTS_SPARSE_FILES bit flag returned through the
> lpFileSystemFlags parameter.
> 
> Most applications are not aware of sparse files and will not create
> sparse files. The fact that an application is reading a sparse file is
> transparent to the application. An application that is aware of
> sparse-files should determine whether its data set is suitable to be
> kept in a sparse file. After that determination is made, the
> application must explicitly declare a file as sparse, using the
> FSCTL_SET_SPARSE control code."
> 
> 

That's interesting. I didn't know about the fsctl.

It doesn't really help us though. Not all servers support passthrough
infolevels, and there are other filesystems (e.g. FAT) that don't
support sparse files at all.

In any case, the upshot of all of this is that we simply can't assume
that we'll get the response to a particular call in any given amount of
time, so we have to periodically check that the server is still
responding via echoes before giving up on it completely.
Dave Chiluk Feb. 28, 2013, 5:31 p.m. UTC | #14
On 02/28/2013 10:47 AM, Jeff Layton wrote:
> On Thu, 28 Feb 2013 10:04:36 -0600
> Steve French <smfrench@gmail.com> wrote:
> 
>> On Thu, Feb 28, 2013 at 9:26 AM, Jeff Layton <jlayton@samba.org> wrote:
>>> On Wed, 27 Feb 2013 16:24:07 -0600
>>> Dave Chiluk <dave.chiluk@canonical.com> wrote:
>>>
>>>> On 02/27/2013 10:34 AM, Jeff Layton wrote:
>>>>> On Wed, 27 Feb 2013 12:06:14 +0100
>>>>> "Stefan (metze) Metzmacher" <metze@samba.org> wrote:
>>>>>
>>>>>> Hi Dave,
>>>>>>
>>>>>>> When messages are currently in queue awaiting a response, decrease amount of
>>>>>>> time before attempting cifs_reconnect to SMB_MAX_RTT = 10 seconds. The current
>>>>>>> wait time before attempting to reconnect is currently 2*SMB_ECHO_INTERVAL(120
>>>>>>> seconds) since the last response was recieved.  This does not take into account
>>>>>>> the fact that messages waiting for a response should be serviced within a
>>>>>>> reasonable round trip time.
>>>>>>
>>>>>> Wouldn't that mean that the client will disconnect a good connection,
>>>>>> if the server doesn't response within 10 seconds?
>>>>>> Reads and Writes can take longer than 10 seconds...
>>>>>>
>>>>>
>>>>> Where does this magic value of 10s come from? Note that a slow server
>>>>> can take *minutes* to respond to writes that are long past the EOF.
>>>> It comes from the desire to decrease the reconnection delay to something
>>>> better than a random number between 60 and 120 seconds.  I am not
>>>> committed to this number, and it is open for discussion.  Additionally
>>>> if you look closely at the logic it's not 10 seconds per request, but
>>>> actually when requests have been in flight for more than 10 seconds make
>>>> sure we've heard from the server in the last 10 seconds.
>>>>
>>>> Can you explain more fully your use case of writes that are long past
>>>> the EOF?  Perhaps with a test-case or script that I can test?  As far as
>>>> I know writes long past EOF will just result in a sparse file, and
>>>> return in a reasonable round trip time *(that's at least what I'm seeing
>>>> with my testing).  dd if=/dev/zero of=/mnt/cifs/a bs=1M count=100
>>>> seek=100000, starts receiving responses from the server in about .05
>>>> seconds with subsequent responses following at roughly .002-.01 second
>>>> intervals.  This is well within my 10 second value.  Even adding the
>>>> latency of AT&T's 2g cell network brings it up to only 1s.  Still 10x
>>>> less than my 10 second value.
>>>>
>>>> The new logic goes like this
>>>> if( we've been expecting a response from the server (in_flight), and
>>>>  message has been in_flight for more than 10 seconds and
>>>>  we haven't had any other contact from the server in that time
>>>>   reconnect
>>>>
>>>
>>> That will break writes long past the EOF. Note too that reconnects on
>>> CIFS are horrifically expensive and problematic. Much of the state on a
>>> CIFS mount is tied to the connection. When that drops, open files are
>>> closed and things like locks are dropped. SMB1 has no real mechanism
>>> for state recovery, so that can really be a problem.
>>>
>>>> On a side note, I discovered a small race condition in the previous
>>>> logic while working on this, that my new patch also fixes.
>>>> 1s  request
>>>> 2s  response
>>>> 61.995 echo job pops
>>>> 121.995 echo job pops and sends echo
>>>> 122 server_unresponsive called.  Finds no response and attempts to
>>>>        reconnect
>>>> 122.95 response to echo received
>>>>
>>>
>>> Sure, here's a reproducer. Do this against a windows server, preferably
>>> one exporting NTFS on relatively slow storage. Make sure that
>>> "testfile" doesn't exist first:
>>>
>>>      $ dd if=/dev/zero of=/path/to/cifs/share/testfile bs=1M count=1 seek=3192
>>>
>>> NTFS doesn't support sparse files, so the OS has to zero-fill up to the
>>> point where you're writing. That can take a looooong time on slow
>>> storage (minutes even). What we do now is periodically send a SMB echo
>>> to make sure the server is alive rather than trying to time out a
>>> particular call.
>>
>> Writing past end of file in Windows can be very slow, but note that it
>> is possible for a windows to set as sparse a file on an NTFS
>> partition.   Quoting from
>> http://msdn.microsoft.com/en-us/library/windows/desktop/aa365566%28v=vs.85%29.aspx
>> Windows NTFS does support sparse files (and we could even send it over
>> cifs if we want) but it has to be explicitly set by the app on the
>> file:
>>
>> "To determine whether a file system supports sparse files, call the
>> GetVolumeInformation function and examine the
>> FILE_SUPPORTS_SPARSE_FILES bit flag returned through the
>> lpFileSystemFlags parameter.
>>
>> Most applications are not aware of sparse files and will not create
>> sparse files. The fact that an application is reading a sparse file is
>> transparent to the application. An application that is aware of
>> sparse-files should determine whether its data set is suitable to be
>> kept in a sparse file. After that determination is made, the
>> application must explicitly declare a file as sparse, using the
>> FSCTL_SET_SPARSE control code."
>>
>>
> 
> That's interesting. I didn't know about the fsctl.
> 
> It doesn't really help us though. Not all servers support passthrough
> infolevels, and there are other filesystems (e.g. FAT) that don't
> support sparse files at all.
> 
> In any case, the upshot of all of this is that we simply can't assume
> that we'll get the response to a particular call in any given amount of
> time, so we have to periodically check that the server is still
> responding via echoes before giving up on it completely.
> 

I just verified this by running the dd testcase against a windows 7
server.  I'm going to rewrite my patch to optimise the echo logic as
Jeff suggested earlier.  The only difference is that I think we should
still send regular echos when nothing else is happening, so that the
connection can be rebuilt when nothing urgent is going on.

It still makes more sense to me that we should be checking the status of
the tcp socket and its underlying nic, but I'm still not completely
clear on how that could be accomplished.  Any pointers in that regard
would be appreciated.

Thanks for the help guys.
Steve French Feb. 28, 2013, 5:45 p.m. UTC | #15
On Thu, Feb 28, 2013 at 11:31 AM, Dave Chiluk <dave.chiluk@canonical.com> wrote:
> On 02/28/2013 10:47 AM, Jeff Layton wrote:
>> On Thu, 28 Feb 2013 10:04:36 -0600
>> Steve French <smfrench@gmail.com> wrote:
>>
>>> On Thu, Feb 28, 2013 at 9:26 AM, Jeff Layton <jlayton@samba.org> wrote:
>>>> On Wed, 27 Feb 2013 16:24:07 -0600
>>>> Dave Chiluk <dave.chiluk@canonical.com> wrote:
>>>>
>>>>> On 02/27/2013 10:34 AM, Jeff Layton wrote:
>>>>>> On Wed, 27 Feb 2013 12:06:14 +0100
>>>>>> "Stefan (metze) Metzmacher" <metze@samba.org> wrote:
>>>>>>
>>>>>>> Hi Dave,
>>>>>>>
>>>>>>>> When messages are currently in queue awaiting a response, decrease amount of
>>>>>>>> time before attempting cifs_reconnect to SMB_MAX_RTT = 10 seconds. The current
>>>>>>>> wait time before attempting to reconnect is currently 2*SMB_ECHO_INTERVAL(120
>>>>>>>> seconds) since the last response was recieved.  This does not take into account
>>>>>>>> the fact that messages waiting for a response should be serviced within a
>>>>>>>> reasonable round trip time.
>>>>>>>
>>>>>>> Wouldn't that mean that the client will disconnect a good connection,
>>>>>>> if the server doesn't response within 10 seconds?
>>>>>>> Reads and Writes can take longer than 10 seconds...
>>>>>>>
>>>>>>
>>>>>> Where does this magic value of 10s come from? Note that a slow server
>>>>>> can take *minutes* to respond to writes that are long past the EOF.
>>>>> It comes from the desire to decrease the reconnection delay to something
>>>>> better than a random number between 60 and 120 seconds.  I am not
>>>>> committed to this number, and it is open for discussion.  Additionally
>>>>> if you look closely at the logic it's not 10 seconds per request, but
>>>>> actually when requests have been in flight for more than 10 seconds make
>>>>> sure we've heard from the server in the last 10 seconds.
>>>>>
>>>>> Can you explain more fully your use case of writes that are long past
>>>>> the EOF?  Perhaps with a test-case or script that I can test?  As far as
>>>>> I know writes long past EOF will just result in a sparse file, and
>>>>> return in a reasonable round trip time *(that's at least what I'm seeing
>>>>> with my testing).  dd if=/dev/zero of=/mnt/cifs/a bs=1M count=100
>>>>> seek=100000, starts receiving responses from the server in about .05
>>>>> seconds with subsequent responses following at roughly .002-.01 second
>>>>> intervals.  This is well within my 10 second value.  Even adding the
>>>>> latency of AT&T's 2g cell network brings it up to only 1s.  Still 10x
>>>>> less than my 10 second value.
>>>>>
>>>>> The new logic goes like this
>>>>> if( we've been expecting a response from the server (in_flight), and
>>>>>  message has been in_flight for more than 10 seconds and
>>>>>  we haven't had any other contact from the server in that time
>>>>>   reconnect
>>>>>
>>>>
>>>> That will break writes long past the EOF. Note too that reconnects on
>>>> CIFS are horrifically expensive and problematic. Much of the state on a
>>>> CIFS mount is tied to the connection. When that drops, open files are
>>>> closed and things like locks are dropped. SMB1 has no real mechanism
>>>> for state recovery, so that can really be a problem.
>>>>
>>>>> On a side note, I discovered a small race condition in the previous
>>>>> logic while working on this, that my new patch also fixes.
>>>>> 1s  request
>>>>> 2s  response
>>>>> 61.995 echo job pops
>>>>> 121.995 echo job pops and sends echo
>>>>> 122 server_unresponsive called.  Finds no response and attempts to
>>>>>        reconnect
>>>>> 122.95 response to echo received
>>>>>
>>>>
>>>> Sure, here's a reproducer. Do this against a windows server, preferably
>>>> one exporting NTFS on relatively slow storage. Make sure that
>>>> "testfile" doesn't exist first:
>>>>
>>>>      $ dd if=/dev/zero of=/path/to/cifs/share/testfile bs=1M count=1 seek=3192
>>>>
>>>> NTFS doesn't support sparse files, so the OS has to zero-fill up to the
>>>> point where you're writing. That can take a looooong time on slow
>>>> storage (minutes even). What we do now is periodically send a SMB echo
>>>> to make sure the server is alive rather than trying to time out a
>>>> particular call.
>>>
>>> Writing past end of file in Windows can be very slow, but note that it
>>> is possible for a windows to set as sparse a file on an NTFS
>>> partition.   Quoting from
>>> http://msdn.microsoft.com/en-us/library/windows/desktop/aa365566%28v=vs.85%29.aspx
>>> Windows NTFS does support sparse files (and we could even send it over
>>> cifs if we want) but it has to be explicitly set by the app on the
>>> file:
>>>
>>> "To determine whether a file system supports sparse files, call the
>>> GetVolumeInformation function and examine the
>>> FILE_SUPPORTS_SPARSE_FILES bit flag returned through the
>>> lpFileSystemFlags parameter.
>>>
>>> Most applications are not aware of sparse files and will not create
>>> sparse files. The fact that an application is reading a sparse file is
>>> transparent to the application. An application that is aware of
>>> sparse-files should determine whether its data set is suitable to be
>>> kept in a sparse file. After that determination is made, the
>>> application must explicitly declare a file as sparse, using the
>>> FSCTL_SET_SPARSE control code."
>>>
>>>
>>
>> That's interesting. I didn't know about the fsctl.
>>
>> It doesn't really help us though. Not all servers support passthrough
>> infolevels, and there are other filesystems (e.g. FAT) that don't
>> support sparse files at all.
>>
>> In any case, the upshot of all of this is that we simply can't assume
>> that we'll get the response to a particular call in any given amount of
>> time, so we have to periodically check that the server is still
>> responding via echoes before giving up on it completely.
>>
>
> I just verified this by running the dd testcase against a Windows 7
> server.  I'm going to rewrite my patch to optimise the echo logic as
> Jeff suggested earlier.  The only difference being that I think we
> should still have regular echoes when nothing else is happening, so that
> the connection can be rebuilt when nothing urgent is going on.
>
> It still makes more sense to me that we should be checking the status of
> the tcp socket, and its underlying nic, but I'm still not completely
> clear on how that could be accomplished.  Any pointers in that regard
> would be appreciated.

It is also worth checking whether the witness protocol would help us (even
in a non-clustered environment), because it was designed to allow a client
(at least for SMB3 mounts) to tell when a server is up or down.
Jeff Layton Feb. 28, 2013, 6:04 p.m. UTC | #16
On Thu, 28 Feb 2013 11:31:54 -0600
Dave Chiluk <dave.chiluk@canonical.com> wrote:

> On 02/28/2013 10:47 AM, Jeff Layton wrote:
> > [...]
> > 
> > That's interesting. I didn't know about the fsctl.
> > 
> > It doesn't really help us though. Not all servers support passthrough
> > infolevels, and there are other filesystems (e.g. FAT) that don't
> > support sparse files at all.
> > 
> > In any case, the upshot of all of this is that we simply can't assume
> > that we'll get the response to a particular call in any given amount of
> > time, so we have to periodically check that the server is still
> > responding via echoes before giving up on it completely.
> > 
> 
> I just verified this by running the dd testcase against a Windows 7
> server.  I'm going to rewrite my patch to optimise the echo logic as
> Jeff suggested earlier.  The only difference being that I think we
> should still have regular echoes when nothing else is happening, so that
> the connection can be rebuilt when nothing urgent is going on.
> 

OTOH, you don't want to hammer the server with echoes. They are fairly
lightweight, but they aren't completely free. That's why I think we
might get better mileage out of trying to look at the socket itself to
figure out the state.

> It still makes more sense to me that we should be checking the status of
> the tcp socket, and its underlying nic, but I'm still not completely
> clear on how that could be accomplished.  Any pointers in that regard
> would be appreciated.
> 
> Thanks for the help guys.

You can always look at the sk_state flags to figure out the state of
the TCP connection (there are some examples of that in sunrpc code, but
there may be simpler ones elsewhere).
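
Just to sketch the idea (this is only an illustration, not something from
the patch; the helper name is made up and it assumes server->ssocket is
valid and stable while we look at it):

#include <net/sock.h>
#include <net/tcp_states.h>

/* Rough sketch: does the TCP socket backing this connection still look
 * established?  Real code would need to worry about locking and about
 * the socket being torn down underneath us.
 */
static bool cifs_tcp_looks_dead(struct TCP_Server_Info *server)
{
	struct sock *sk;

	if (server->ssocket == NULL)
		return true;

	sk = server->ssocket->sk;
	return sk->sk_state != TCP_ESTABLISHED;
}

That only notices a socket that has actually transitioned out of
ESTABLISHED, though; a path that silently black-holes traffic will keep
sk_state where it is, which is what the echoes are for.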

As far as the underlying interface goes, I'm not sure what you can do.
There's not always a straightforward 1:1 correspondence between an
interface and a connection, is there? Also, we don't necessarily want to
reconnect just because NetworkManager got upgraded and took the interface
down for a second before bringing it right back up again. What if an
address migrates to a different interface altogether on the same subnet?
Do TCP connections normally keep chugging along in that situation?

I think you probably need to nail down the specific circumstances where
you want to reconnect and then try to figure out how best to detect them.
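
If "the interface went away" turns out to be one of those circumstances,
the usual kernel-side hook for hearing about it is a netdevice notifier.
A very rough sketch (not part of this patch, and it leaves out the hard
part of deciding which server connections actually care about a given
interface):

#include <linux/netdevice.h>
#include <linux/notifier.h>

/* Sketch only: get told when an interface goes (or is about to go) down
 * so affected connections can be rechecked sooner than the normal echo
 * interval.
 */
static int cifs_netdev_event(struct notifier_block *nb,
			     unsigned long event, void *ptr)
{
	if (event == NETDEV_GOING_DOWN || event == NETDEV_DOWN) {
		/* e.g. flag affected TCP_Server_Info structs so the next
		 * server_unresponsive() check runs sooner */
	}
	return NOTIFY_DONE;
}

static struct notifier_block cifs_netdev_notifier = {
	.notifier_call = cifs_netdev_event,
};

/* in module init: register_netdevice_notifier(&cifs_netdev_notifier); */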
simo Feb. 28, 2013, 10:23 p.m. UTC | #17
On Thu, 2013-02-28 at 11:31 -0600, Dave Chiluk wrote:
> On 02/28/2013 10:47 AM, Jeff Layton wrote:
> > [...]
> > 
> > That's interesting. I didn't know about the fsctl.
> > 
> > It doesn't really help us though. Not all servers support passthrough
> > infolevels, and there are other filesystems (e.g. FAT) that don't
> > support sparse files at all.
> > 
> > In any case, the upshot of all of this is that we simply can't assume
> > that we'll get the response to a particular call in any given amount of
> > time, so we have to periodically check that the server is still
> > responding via echoes before giving up on it completely.
> > 
> 
> I just verified this by running the dd testcase against a Windows 7
> server.  I'm going to rewrite my patch to optimise the echo logic as
> Jeff suggested earlier.  The only difference being that I think we
> should still have regular echoes when nothing else is happening, so that
> the connection can be rebuilt when nothing urgent is going on.

Constant echo requests keep the server busy and create unnecessary
traffic. They would also probably kill connections that would otherwise
survive a temporary disruption of communications (a laptop briefly going
out of range while moving between rooms, etc.), cases where nothing needs
to contact the server and the underlying TCP connection would not
otherwise be dropped.

Simo.

> It still makes more sense to me that we should be checking the status of
> the tcp socket, and its underlying nic, but I'm still not completely
> clear on how that could be accomplished.  Any pointers in that regard
> would be appreciated.
Björn Jacke Feb. 28, 2013, 10:54 p.m. UTC | #18
On 2013-02-28 at 07:26 -0800 Jeff Layton sent off:
> NTFS doesn't support sparse files, so the OS has to zero-fill up to the
> point where you're writing. That can take a looooong time on slow
> storage (minutes even).

but you are talking about FAT here, right? NTFS does support sparse files if
the sparse bit has been explicitly set on the file. But even if the sparse bit
is not set, filling a file with zeros by writing after a seek long beyond the
end of the file is very fast, because NTFS supports the feature that Unix
filesystems like xfs call extents.

If writing beyond the end of a file is really slow via the cifs vfs in the test
case against an NTFS volume, then I wonder whether that operation is really
being done optimally over the wire. NTFS really isn't that bad at handling this
kind of file.
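
For reference, the way an application explicitly sets that bit is the
FSCTL_SET_SPARSE control code; a minimal, purely illustrative sketch
(error handling omitted) looks like:

#include <windows.h>

/* Sketch: mark an already-open file as sparse so that runs of zeros
 * are not physically allocated by NTFS. */
static BOOL make_file_sparse(HANDLE hFile)
{
	DWORD bytes = 0;

	/* no input buffer: the sparse flag defaults to being set */
	return DeviceIoControl(hFile, FSCTL_SET_SPARSE, NULL, 0,
			       NULL, 0, &bytes, NULL);
}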

Cheers
Björn
Jeff Layton March 1, 2013, 12:11 a.m. UTC | #19
On Thu, 28 Feb 2013 23:54:13 +0100
Björn JACKE <bj@SerNet.DE> wrote:

> On 2013-02-28 at 07:26 -0800 Jeff Layton sent off:
> > NTFS doesn't support sparse files, so the OS has to zero-fill up to the
> > point where you're writing. That can take a looooong time on slow
> > storage (minutes even).
> 
> but you are talking about FAT here, right? NTFS does support sparse files if
> the sparse bit has been explicitly been set on it. Bit even if the sparse bit
> is not set filling a file with zeros by writing after a seek long beyond the
> end of the file is very fast because NTFS supports that feature what Unix
> filesystems like xfs call extents.
> 
> If writing beyond the end of a file is really slow via cifs vfs in the test
> case against a ntfs volume then I wonder if that operation is being really done
> optimally over the wire. ntfs really isn't that bad with handling this kind of
> files.
> 

I'm not sure, since I don't know the internals of NTFS. I had always
assumed that it didn't really handle sparse files well (hence the
"rabbit-pellet" thing that Windows clients do).

All I can say, however, is that writes long past the EOF can take a
*really* long time to run. Typically we just issue an SMB_COM_WRITEX at
the offset to which we want to put the data. Is there some other way we
ought to be doing this?

In any case, it doesn't really change the fact that there is no
guaranteed time of response from CIFS servers. They can easily take a
really long time to respond to certain requests. The best method we
have to deal with that is to periodically "ping" the server with an
echo to see if it's still there.
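
For anyone not following the mechanism: it's just a self-rearming delayed
work item. Roughly (a simplified sketch, not the actual connect.c code;
send_echo() stands in for the real echo call):

#include <linux/kernel.h>
#include <linux/workqueue.h>

/* Sketch: ping the server, then queue ourselves to run again after
 * SMB_ECHO_INTERVAL.  A failure here isn't fatal; the next
 * server_unresponsive() check will notice the missing reply.
 */
static void echo_worker(struct work_struct *work)
{
	struct TCP_Server_Info *server = container_of(work,
				struct TCP_Server_Info, echo.work);

	send_echo(server);

	queue_delayed_work(system_wq, &server->echo, SMB_ECHO_INTERVAL);
}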
Steve French March 1, 2013, 2:54 a.m. UTC | #20
On Thu, Feb 28, 2013 at 6:11 PM, Jeff Layton <jlayton@samba.org> wrote:
> On Thu, 28 Feb 2013 23:54:13 +0100
> Björn JACKE <bj@SerNet.DE> wrote:
>
>> On 2013-02-28 at 07:26 -0800 Jeff Layton sent off:
>> > NTFS doesn't support sparse files, so the OS has to zero-fill up to the
>> > point where you're writing. That can take a looooong time on slow
>> > storage (minutes even).
>>
>> but you are talking about FAT here, right? NTFS does support sparse files if
>> the sparse bit has been explicitly been set on it. Bit even if the sparse bit
>> is not set filling a file with zeros by writing after a seek long beyond the
>> end of the file is very fast because NTFS supports that feature what Unix
>> filesystems like xfs call extents.
>>
>> If writing beyond the end of a file is really slow via cifs vfs in the test
>> case against a ntfs volume then I wonder if that operation is being really done
>> optimally over the wire. ntfs really isn't that bad with handling this kind of
>> files.
>>
>
> I'm not sure since I don't know the internals of NTFS. I had always
> assumed that it didn't really handle sparse files well (hence the
> "rabbit-pellet" thing that windows clients do).
>
> All I can say however is that writes long past the EOF can take a
> *really* long time to run. Typically we just issue a SMB_COM_WRITEX at
> the offset to which we want to put the data. Is there some other way we
> ought to be doing this?
>
> In any case, it doesn't really change the fact that there is no
> guaranteed time of response from CIFS servers. They can easily take a
> really long time to respond to certain requests. The best method we
> have to deal with that is to periodically "ping" the server with an
> echo to see if it's still there.

SMB2/SMB3 with better async support may make this easier - but Jeff is right.
diff mbox

Patch

diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index e6899ce..138c8cf 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -80,6 +80,8 @@ 
 
 /* SMB echo "timeout" -- FIXME: tunable? */
 #define SMB_ECHO_INTERVAL (60 * HZ)
+/* Maximum acceptable round trip time to server */
+#define SMB_MAX_RTT (10 * HZ)
 
 #include "cifspdu.h"
 
@@ -1141,8 +1143,8 @@  struct mid_q_entry {
 	__u32 pid;		/* process id */
 	__u32 sequence_number;  /* for CIFS signing */
 	unsigned long when_alloc;  /* when mid was created */
-#ifdef CONFIG_CIFS_STATS2
 	unsigned long when_sent; /* time when smb send finished */
+#ifdef CONFIG_CIFS_STATS2
 	unsigned long when_received; /* when demux complete (taken off wire) */
 #endif
 	mid_receive_t *receive; /* call receive callback */
@@ -1179,11 +1181,6 @@  static inline void cifs_num_waiters_dec(struct TCP_Server_Info *server)
 {
 	atomic_dec(&server->num_waiters);
 }
-
-static inline void cifs_save_when_sent(struct mid_q_entry *mid)
-{
-	mid->when_sent = jiffies;
-}
 #else
 static inline void cifs_in_send_inc(struct TCP_Server_Info *server)
 {
@@ -1199,11 +1196,15 @@  static inline void cifs_num_waiters_inc(struct TCP_Server_Info *server)
 static inline void cifs_num_waiters_dec(struct TCP_Server_Info *server)
 {
 }
+#endif
 
+/* We always need to know when a mid was sent in order to determine if
+ * the server is not responding.
+ */
 static inline void cifs_save_when_sent(struct mid_q_entry *mid)
 {
+	mid->when_sent = jiffies;
 }
-#endif
 
 /* for pending dnotify requests */
 struct dir_notify_req {
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 12b3da3..57c78b3 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -456,25 +456,66 @@  allocate_buffers(struct TCP_Server_Info *server)
 	return true;
 }
 
+/* Takes a struct TCP_Server_Info * and returns whichever is newer: the
+ * when_sent jiffy of the oldest unanswered mid in the pending queue, or
+ * the time of the newest response.
+ */
+static unsigned long
+oldest_req_or_newest_resp(struct TCP_Server_Info *server)
+{
+	struct mid_q_entry *mid;
+	unsigned long oldest_jiffy = jiffies;
+
+	spin_lock(&GlobalMid_Lock);
+	list_for_each_entry(mid, &server->pending_mid_q, qhead) {
+		if (mid->mid_state == MID_REQUEST_SUBMITTED) {
+			if (time_before(mid->when_sent, oldest_jiffy))
+				oldest_jiffy = mid->when_sent;
+		}
+	}
+	spin_unlock(&GlobalMid_Lock);
+
+	/* Check to see if the last response is newer than the oldest request.
+	 * This could mean that the server is just responding very slowly,
+	 * possibly even taking longer than SMB_MAX_RTT, in which case we don't
+	 * want to cause a reconnect.
+	 */
+	if (time_after(server->lstrp, oldest_jiffy))
+		return server->lstrp;
+	else
+		return oldest_jiffy;
+}
+
 static bool
 server_unresponsive(struct TCP_Server_Info *server)
 {
+	unsigned long oldest;
+
 	/*
-	 * We need to wait 2 echo intervals to make sure we handle such
-	 * situations right:
+	 * When no messages are in flight max wait is
+	 * 2*SMB_ECHO_INTERVAL + SMB_MAX_RTT + scheduling delay
+	 *
+	 * 1s   client sends a normal SMB request
+	 * 2s   client gets a response
+	 * 61s  echo workqueue job pops, and decides we got a response < 60
+	 *      seconds ago and don't need to send another
+	 * 121s kernel_recvmsg times out, and we see that we haven't gotten
+	 *      a response in >60s. Send echo causing in_flight() to return
+	 *      true
+	 * 131s echo hasn't returned run cifs_reconnect
+	 *
+	 * Situation 2 where non-echo messages are in_flight
 	 * 1s  client sends a normal SMB request
 	 * 2s  client gets a response
-	 * 30s echo workqueue job pops, and decides we got a response recently
-	 *     and don't need to send another
-	 * ...
-	 * 65s kernel_recvmsg times out, and we see that we haven't gotten
-	 *     a response in >60s.
+	 * 3s  client sends a normal SMB request
+	 * 13s client still has not received SMB response run cifs_reconnect
 	 */
 	if (server->tcpStatus == CifsGood &&
-	    time_after(jiffies, server->lstrp + 2 * SMB_ECHO_INTERVAL)) {
-		cERROR(1, "Server %s has not responded in %d seconds. "
+	    (in_flight(server) > 0 && time_after(jiffies,
+		  oldest = oldest_req_or_newest_resp(server) + SMB_MAX_RTT))) {
+		cERROR(1, "Server %s has not responded in %lu seconds. "
 			  "Reconnecting...", server->hostname,
-			  (2 * SMB_ECHO_INTERVAL) / HZ);
+			  ((jiffies - oldest) / HZ));
 		cifs_reconnect(server);
 		wake_up(&server->response_q);
 		return true;