[linux-cifs-client,4/4] cifs: flush data on any setattr
diff mbox

Message ID 1238088937-10994-5-git-send-email-jlayton@redhat.com
State New, archived
Headers show

Commit Message

Jeff Layton March 26, 2009, 5:35 p.m. UTC
We already flush all the dirty pages for an inode before doing
ATTR_SIZE and ATTR_MTIME changes. There's another problem though -- if
we change the mode so that the file becomes read-only then we may not
be able to write data to it after a reconnect.

Fix this by just going back to flushing all the dirty data on any
setattr call. There are probably some cases that can be optimized out,
but I'm not sure they're worthwhile and we need to consider them more
carefully to make sure that we don't cause regressions if we have
to reconnect before writeback occurs.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/cifs/inode.c |   58 ++++++++++++++++++++++++++++--------------------------
 1 files changed, 30 insertions(+), 28 deletions(-)

Comments

Steve French March 26, 2009, 8:10 p.m. UTC | #1
On Thu, Mar 26, 2009 at 12:35 PM, Jeff Layton <jlayton@redhat.com> wrote:
> We already flush all the dirty pages for an inode before doing
> ATTR_SIZE and ATTR_MTIME changes. There's another problem though -- if
> we change the mode so that the file becomes read-only then we may not
> be able to write data to it after a reconnect.

Do we have any idea of how badly this would hurt performance?

Changing the mode or owner may be rare enough ... but I don't have any
data to prove that.

Changing the file's mode on the fly while writing could cause problems
for file systems other than cifs too, not sure we want to hurt
performance for a corner case of a bad application.  I don't this
posix specifies that chmod does an implied fsync
Jeff Layton March 26, 2009, 9:01 p.m. UTC | #2
On Thu, 26 Mar 2009 15:10:49 -0500
Steve French <smfrench@gmail.com> wrote:

> On Thu, Mar 26, 2009 at 12:35 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > We already flush all the dirty pages for an inode before doing
> > ATTR_SIZE and ATTR_MTIME changes. There's another problem though -- if
> > we change the mode so that the file becomes read-only then we may not
> > be able to write data to it after a reconnect.
> 
> Do we have any idea of how badly this would hurt performance?
> 
> Changing the mode or owner may be rare enough ... but I don't have any
> data to prove that.
> 
> Changing the file's mode on the fly while writing could cause problems
> for file systems other than cifs too, not sure we want to hurt
> performance for a corner case of a bad application.  I don't this
> posix specifies that chmod does an implied fsync
> 

(cc'ing Linus since he brought this to our attention and might have
input)

No, posix doesn't specify that chmod does an implied fsync, but if
that's what's required to ensure that we don't break writeback across a
reconnect then I think it's the right thing to do. We can try to
optimize it away on some setattr cases, but they're probably not worth
it.

The best analogue is probably NFS, which also flushes dirty data on
all setattr calls. Granted there are differences there (due to the
connectionless nature of older NFS protocols), but it's a pretty similar
situation to CIFS if you account for the possibility of a reconnect
between the setattr and writeback.

I don't think we can make a real generalization about performance
impact of this change. It will heavily depend on workload. My personal
feeling (just a guess, really) is that setattr calls are pretty rare in
general in comparison to reads and writes. There are obviously
workloads where this won't be the case but it's hard to imagine them as
a majority.
Linus Torvalds March 26, 2009, 10:32 p.m. UTC | #3
On Thu, 26 Mar 2009, Jeff Layton wrote:
> On Thu, 26 Mar 2009 15:10:49 -0500
> Steve French <smfrench@gmail.com> wrote:
> 
> > On Thu, Mar 26, 2009 at 12:35 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > > We already flush all the dirty pages for an inode before doing
> > > ATTR_SIZE and ATTR_MTIME changes. There's another problem though -- if
> > > we change the mode so that the file becomes read-only then we may not
> > > be able to write data to it after a reconnect.
> > 
> > Do we have any idea of how badly this would hurt performance?
> > 
> > Changing the mode or owner may be rare enough ... but I don't have any
> > data to prove that.
> > 
> > Changing the file's mode on the fly while writing could cause problems
> > for file systems other than cifs too, not sure we want to hurt
> > performance for a corner case of a bad application.  I don't this
> > posix specifies that chmod does an implied fsync
> > 
> 
> (cc'ing Linus since he brought this to our attention and might have
> input)
> 
> No, posix doesn't specify that chmod does an implied fsync, but if
> that's what's required to ensure that we don't break writeback across a
> reconnect then I think it's the right thing to do. We can try to
> optimize it away on some setattr cases, but they're probably not worth
> it.

Yes, I don't care about anything but obvious correctness.

Right now, CIFS is _broken_. Real programs have seen real breakage. I 
don't see why Steve argues against the patch on some specious grounds of 
it "not being necessary", when it clearly is, and there clearly is a CIFS 
bug there.

Performance is totally irrelevant. It doesn't matter at all, when the 
alternative is "lost data".

			Linus
Jeff Layton March 26, 2009, 11:56 p.m. UTC | #4
On Thu, 26 Mar 2009 15:32:44 -0700 (PDT)
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> 
> 
> On Thu, 26 Mar 2009, Jeff Layton wrote:
> > On Thu, 26 Mar 2009 15:10:49 -0500
> > Steve French <smfrench@gmail.com> wrote:
> > 
> > > On Thu, Mar 26, 2009 at 12:35 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > > > We already flush all the dirty pages for an inode before doing
> > > > ATTR_SIZE and ATTR_MTIME changes. There's another problem though -- if
> > > > we change the mode so that the file becomes read-only then we may not
> > > > be able to write data to it after a reconnect.
> > > 
> > > Do we have any idea of how badly this would hurt performance?
> > > 
> > > Changing the mode or owner may be rare enough ... but I don't have any
> > > data to prove that.
> > > 
> > > Changing the file's mode on the fly while writing could cause problems
> > > for file systems other than cifs too, not sure we want to hurt
> > > performance for a corner case of a bad application.  I don't this
> > > posix specifies that chmod does an implied fsync
> > > 
> > 
> > (cc'ing Linus since he brought this to our attention and might have
> > input)
> > 
> > No, posix doesn't specify that chmod does an implied fsync, but if
> > that's what's required to ensure that we don't break writeback across a
> > reconnect then I think it's the right thing to do. We can try to
> > optimize it away on some setattr cases, but they're probably not worth
> > it.
> 
> Yes, I don't care about anything but obvious correctness.
> 
> Right now, CIFS is _broken_. Real programs have seen real breakage. I 
> don't see why Steve argues against the patch on some specious grounds of 
> it "not being necessary", when it clearly is, and there clearly is a CIFS 
> bug there.
> 
> Performance is totally irrelevant. It doesn't matter at all, when the 
> alternative is "lost data".
> 

I suspect that there is another bug for the reporter that's consistently
causing a reconnect of the socket after the fchmod. That problem could
even be server-side (hard to know without a wire capture and/or some
debug logs).

The current code should work just fine as long as we don't incur a
reconnect and have to reclaim the open files between the fchmod and the
writes. The problem is that we can never 100% predict when we'll need
to reconnect, so we have to be very careful about how we handle changes
that might make us unable to reclaim the open.

On the performance side, we can always go back and optimize the flush
out of some of setattr calls when we *know* that they won't cause a
problem, but the list of those cases is starting to look pretty small.
Linus Torvalds March 27, 2009, 12:07 a.m. UTC | #5
On Thu, 26 Mar 2009, Jeff Layton wrote:
>
> I suspect that there is another bug for the reporter that's consistently
> causing a reconnect of the socket after the fchmod. That problem could
> even be server-side (hard to know without a wire capture and/or some
> debug logs).

Sure. There probably is. But even if we were to fix that unrelated issue, 
you might still have totally random events that cause re-connects for 
non-buggy reasons (say, a real network problem), so regardless it's not 
correct to assume that reconnects cannot happen.

> On the performance side, we can always go back and optimize the flush
> out of some of setattr calls when we *know* that they won't cause a
> problem, but the list of those cases is starting to look pretty small.

Agreed. Especially as you end up having to flush at the close() anyway, 
and anybody who does some attribute change after writing dirty data is 
likely to close the file afterwards anyway. IOW, I suspect

	open
	write+write+write
	fchmod/fchown
	close

is fairly normal, but _mixing_ things so that you see

	open
	write+write
	fchmod/fchown
	write+write
	close

is rather less likely. So flushing the data probably doesn't even add any 
overhead.

		Linus
Steve French March 27, 2009, 2:15 a.m. UTC | #6
On Thu, Mar 26, 2009 at 7:07 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, 26 Mar 2009, Jeff Layton wrote:

>> I suspect that there is another bug for the reporter that's consistently
>> causing a reconnect of the socket after the fchmod. That problem could
>> even be server-side (hard to know without a wire capture and/or some
>> debug logs).
>
> Sure. There probably is. But even if we were to fix that unrelated issue,
> you might still have totally random events that cause re-connects for
> non-buggy reasons (say, a real network problem), so regardless it's not
> correct to assume that reconnects cannot happen.
There is probably still more that could be done, e.g. to flush data to the
server more frequently in the background, that would help.

>> On the performance side, we can always go back and optimize the flush
>> out of some of setattr calls when we *know* that they won't cause a
>> problem, but the list of those cases is starting to look pretty small.
>
> Agreed. Especially as you end up having to flush at the close() anyway,
> and anybody who does some attribute change after writing dirty data is
> likely to close the file afterwards

OK.   I will merge the change to flush on all setattr into cifs-2.6.git
Steve French March 27, 2009, 2:52 a.m. UTC | #7
On Thu, Mar 26, 2009 at 12:35 PM, Jeff Layton <jlayton@redhat.com> wrote:
> We already flush all the dirty pages for an inode before doing
> ATTR_SIZE and ATTR_MTIME changes. There's another problem though -- if
> we change the mode so that the file becomes read-only then we may not
> be able to write data to it after a reconnect.
>
> Fix this by just going back to flushing all the dirty data on any
> setattr call.
> +        * BB: This should be smarter. Why bother flushing pages that
> +        * will be truncated anyway? Also, should we error out here if
> +        * the flush returns error?
> +        */

Merged the patch into cifs-2.6.git, but Jeff brings up an interesting
point about what really happens when we are shrinking the file size
(truncate) here ... is it possible that the vfs will flush pages that
are going to be thrown away
Jeff Layton March 27, 2009, 11:30 a.m. UTC | #8
On Thu, 26 Mar 2009 21:52:51 -0500
Steve French <smfrench@gmail.com> wrote:

> On Thu, Mar 26, 2009 at 12:35 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > We already flush all the dirty pages for an inode before doing
> > ATTR_SIZE and ATTR_MTIME changes. There's another problem though -- if
> > we change the mode so that the file becomes read-only then we may not
> > be able to write data to it after a reconnect.
> >
> > Fix this by just going back to flushing all the dirty data on any
> > setattr call.
> > +        * BB: This should be smarter. Why bother flushing pages that
> > +        * will be truncated anyway? Also, should we error out here if
> > +        * the flush returns error?
> > +        */
> 
> Merged the patch into cifs-2.6.git, but Jeff brings up an interesting
> point about what really happens when we are shrinking the file size
> (truncate) here ... is it possible that the vfs will flush pages that
> are going to be thrown away
> 

I think so. If I follow the code correctly, cifs_set_file_size does:

flush the data (filemap_write_and_wait)
set the file size on the server (CIFSSMBSetFileSize, CIFSSMBSetEOF, ...) 
truncate the inode in memory (cifs_vmtruncate)

Perhaps it makes sense to reorder those operations, but that might make
it tough to handle errors. Another (better?) option would be to only
flush the data that's within the new size of the file.

A trivial optimization might be to cheat here and only skip the flushing
if the file is going to be truncated to 0 length (that's probably the
most common case anyway).

Now that I look at this though, I have to wonder...does the existing
code have a race? Is it possible for some writes to occur after we set
the file size on the server but before we truncate the pages off of the
inode?

Patch
diff mbox

diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c
index 3525ea5..ed7f048 100644
--- a/fs/cifs/inode.c
+++ b/fs/cifs/inode.c
@@ -1793,20 +1793,21 @@  cifs_setattr_unix(struct dentry *direntry, struct iattr *attrs)
 		goto out;
 	}
 
-	if ((attrs->ia_valid & ATTR_MTIME) || (attrs->ia_valid & ATTR_SIZE)) {
-		/*
-		   Flush data before changing file size or changing the last
-		   write time of the file on the server. If the
-		   flush returns error, store it to report later and continue.
-		   BB: This should be smarter. Why bother flushing pages that
-		   will be truncated anyway? Also, should we error out here if
-		   the flush returns error?
-		 */
-		rc = filemap_write_and_wait(inode->i_mapping);
-		if (rc != 0) {
-			cifsInode->write_behind_rc = rc;
-			rc = 0;
-		}
+	/*
+	 * Attempt to flush data before changing attributes. We need to do
+	 * this for ATTR_SIZE and ATTR_MTIME for sure, and if we change the
+	 * ownership or mode then we may also need to do this. Here, we take
+	 * the safe way out and just do the flush on all setattr requests. If
+	 * the flush returns error, store it to report later and continue.
+	 *
+	 * BB: This should be smarter. Why bother flushing pages that
+	 * will be truncated anyway? Also, should we error out here if
+	 * the flush returns error?
+	 */
+	rc = filemap_write_and_wait(inode->i_mapping);
+	if (rc != 0) {
+		cifsInode->write_behind_rc = rc;
+		rc = 0;
 	}
 
 	if (attrs->ia_valid & ATTR_SIZE) {
@@ -1904,20 +1905,21 @@  cifs_setattr_nounix(struct dentry *direntry, struct iattr *attrs)
 		return -ENOMEM;
 	}
 
-	if ((attrs->ia_valid & ATTR_MTIME) || (attrs->ia_valid & ATTR_SIZE)) {
-		/*
-		   Flush data before changing file size or changing the last
-		   write time of the file on the server. If the
-		   flush returns error, store it to report later and continue.
-		   BB: This should be smarter. Why bother flushing pages that
-		   will be truncated anyway? Also, should we error out here if
-		   the flush returns error?
-		 */
-		rc = filemap_write_and_wait(inode->i_mapping);
-		if (rc != 0) {
-			cifsInode->write_behind_rc = rc;
-			rc = 0;
-		}
+	/*
+	 * Attempt to flush data before changing attributes. We need to do
+	 * this for ATTR_SIZE and ATTR_MTIME for sure, and if we change the
+	 * ownership or mode then we may also need to do this. Here, we take
+	 * the safe way out and just do the flush on all setattr requests. If
+	 * the flush returns error, store it to report later and continue.
+	 *
+	 * BB: This should be smarter. Why bother flushing pages that
+	 * will be truncated anyway? Also, should we error out here if
+	 * the flush returns error?
+	 */
+	rc = filemap_write_and_wait(inode->i_mapping);
+	if (rc != 0) {
+		cifsInode->write_behind_rc = rc;
+		rc = 0;
 	}
 
 	if (attrs->ia_valid & ATTR_SIZE) {