diff mbox

[RFC] what's going on with file->f_pos uses in orangefs_file_write_iter()?

Message ID 20171208163921.GA40992@stumphouse.sysblue.org (mailing list archive)
State New, archived
Headers show

Commit Message

Martin Brandenburg Dec. 8, 2017, 4:39 p.m. UTC
On Thu, Dec 07, 2017 at 10:26:10PM +0000, Al Viro wrote:
> I'd missed that back then, but...
> 
>         if (file->f_pos > i_size_read(file->f_mapping->host))
>                 orangefs_i_size_write(file->f_mapping->host, file->f_pos);
> 
>         rc = generic_write_checks(iocb, iter);
> 
>         if (rc <= 0) {
>                 gossip_err("%s: generic_write_checks failed, rc:%zd:.\n",
>                            __func__, rc);
>                 goto out;
>         }
> 
>         /*
>          * if we are appending, generic_write_checks would have updated
>          * pos to the end of the file, so we will wait till now to set
>          * pos...
>          */
>         pos = *(&iocb->ki_pos);
> 
> looks suspicious as hell.  What's going on there?  Not to mention anything
> else file->f_pos might be completely unrelated to any IO going on -
> consider e.g. pwrite(2), where the position (in iocb->ki_pos) has nothing
> to do with file->f_pos.  Then there's the question of WTF is write()
> (or pwrite()) past the current EOF doing bumping the file size, before
> it even gets a chance to decide whether it'll be trying to do any IO at
> all.

It seems to be a poor attempt at doing what the code immediately above
it does.

http://dev.orangefs.org/trac/orangefs/changeset/4828

"put the i_size update at the beginning instead of the end. The issue is
that generic_write_checks sets the offset to i_size in the case of
append, and if i_pos has changed (so that its > i_size) outside of
write, than we need to update the i_size to reflect that value before
the offset is updated."

Oh lovely, he's just moving it.

http://dev.orangefs.org/trac/orangefs/changeset/4827

"fix for pwrite04 test that does a pwrite with open(O_APPEND).

"fix for f_frsize (fragment size) to be equal to f_bsize. This should fix
newer statfs tools that use frsize as the block size instead of bsize.
Hopefully, it won't break anything that depended on frsize being
hardcoded to 1024."

Which is just a rework of

http://dev.orangefs.org/trac/orangefs/changeset/4723

"(from branch): fix for append bug"

I'm not sure what branch he's talking about, and SVN seems to be
designed so as to make it impossible to find out.

Note that nothing updates i_size after a successful write.  So now we do
the getattr which he claims is ridiculous (it may be cached today though
it probably won't be) anywhere i_size is needed.  In do_readv_writev we
invalidate the cached attribute so as to force an update from the
server.

Well none of this does particular wonders for performance.  None of it
really solves the problem, either.  OrangeFS does not properly support
append writes, so we send a offset equal to the size of the file.  It
completely fails when the file is being written to from another machine
(or even the same machine but not through the kernel).  Without some
work on the OrangeFS server, solving that problem is impossible.

Given that limitation, the best we can hope for is consistency on a
single machine.  We could try to avoid the getattr by updating i_size
then not doing a getattr if within a cache timeout.  That would be more
in line with the getattr cache I've implemented.

I avoided it here because there was already an explicit getattr.  I am
not sure under what circumstances it was added.  I've looked through
Mike's pre-merge code; it's in there as early (3.15) as I can see.  It's
not in the out-of-tree module.  Perhaps Mike remembers why it was added.

https://github.com/hubcapsc/linux/commit/60eb8a4f7fc5931ccec0372483ed405ef9ca9110

> 
> _Then_ there's the deadlock on 32bit SMP in that code.  Look: several
> lines prior we'd done
>         inode_lock(file->f_mapping->host);
> and hadn't unlocked the sucker since then.  And
> static inline void orangefs_i_size_write(struct inode *inode, loff_t i_size)
> {
> #if BITS_PER_LONG == 32 && defined(CONFIG_SMP)
>         inode_lock(inode);
> #endif
>         i_size_write(inode, i_size);
> #if BITS_PER_LONG == 32 && defined(CONFIG_SMP)
>         inode_unlock(inode);
> #endif
> }
> means that if we get around to calling it there in SMP/32bit case, we'll
> get as plain a deadlock as possible.  And AFAICS it had been that way
> since the initial merge.
> 
> What the hell is that code about and what is it trying to do?

Guess what our original commit message is: "Trac 16"

http://dev.orangefs.org/trac/orangefs/changeset/6709

Hardly seems to be any use trying to figure out what motivated it.  I
don't think it was a deadlock then except to the extent that all the
OrangeFS code was a snarl of deadlocks then.  But the obfuscation only
serves to hide it from us.

> 
> PS: While we are at it, what's the point of that *(&...) in there?

https://github.com/hubcapsc/linux/commit/cae140731030471ec4782d65efefe8e819d6a467

Not sure why it's obfuscated but there's another one in
orangefs_file_read_iter.

I intended to send a series that implements pagecache support in
OrangeFS at some point in the future.  It's not quite done, but I
suppose it would do just as well to send it for review sooner rather
than later.  All of this code has been significantly re-written there.

Martin

There is no xfstests regression with the following.

Comments

Mike Marshall Dec. 12, 2017, 4:31 p.m. UTC | #1
AL> What's going on there?

Around 3.18 (before we went upstream) I got rid of .aio_read and
.aio_write and "replaced" them with implementations of .read_iter
and .write_iter.

Then, in Linux 4.0, generic_write_checks got iterized too.

-inline int generic_write_checks(struct file *file, loff_t *pos, size_t *count)
+inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)

In my 4.1 rebase I refactored a bunch of un-iterized stuff in file.c and
tried to retain functionality... some code (the "suspicious as hell"
code) I iterized and moved from do_readv_writev to orangefs_file_write_iter.

AL> _Then_ there's the deadlock on 32bit SMP in that code.

Yeah, that doesn't need to be there... I'm a little afraid to say
who put it there in commit 5ecfcb2... <g>

Hopefully Martin's suggested fix will do for now. We'll try to send
it up - I guess next merge window is the right time? In one of the
upcoming merge windows, I hope we have some page-cache related code
to send up that will get rid of all of that...

As far as the cumbersome C expression is concerned... I'm reminded
of this cartoon I drew...

https://sites.google.com/site/hubcapsite3/misc/tinFoilHat.jpg

-Mike

On Fri, Dec 8, 2017 at 11:39 AM,  <martin@omnibond.com> wrote:
> On Thu, Dec 07, 2017 at 10:26:10PM +0000, Al Viro wrote:
>> I'd missed that back then, but...
>>
>>         if (file->f_pos > i_size_read(file->f_mapping->host))
>>                 orangefs_i_size_write(file->f_mapping->host, file->f_pos);
>>
>>         rc = generic_write_checks(iocb, iter);
>>
>>         if (rc <= 0) {
>>                 gossip_err("%s: generic_write_checks failed, rc:%zd:.\n",
>>                            __func__, rc);
>>                 goto out;
>>         }
>>
>>         /*
>>          * if we are appending, generic_write_checks would have updated
>>          * pos to the end of the file, so we will wait till now to set
>>          * pos...
>>          */
>>         pos = *(&iocb->ki_pos);
>>
>> looks suspicious as hell.  What's going on there?  Not to mention anything
>> else file->f_pos might be completely unrelated to any IO going on -
>> consider e.g. pwrite(2), where the position (in iocb->ki_pos) has nothing
>> to do with file->f_pos.  Then there's the question of WTF is write()
>> (or pwrite()) past the current EOF doing bumping the file size, before
>> it even gets a chance to decide whether it'll be trying to do any IO at
>> all.
>
> It seems to be a poor attempt at doing what the code immediately above
> it does.
>
> http://dev.orangefs.org/trac/orangefs/changeset/4828
>
> "put the i_size update at the beginning instead of the end. The issue is
> that generic_write_checks sets the offset to i_size in the case of
> append, and if i_pos has changed (so that its > i_size) outside of
> write, than we need to update the i_size to reflect that value before
> the offset is updated."
>
> Oh lovely, he's just moving it.
>
> http://dev.orangefs.org/trac/orangefs/changeset/4827
>
> "fix for pwrite04 test that does a pwrite with open(O_APPEND).
>
> "fix for f_frsize (fragment size) to be equal to f_bsize. This should fix
> newer statfs tools that use frsize as the block size instead of bsize.
> Hopefully, it won't break anything that depended on frsize being
> hardcoded to 1024."
>
> Which is just a rework of
>
> http://dev.orangefs.org/trac/orangefs/changeset/4723
>
> "(from branch): fix for append bug"
>
> I'm not sure what branch he's talking about, and SVN seems to be
> designed so as to make it impossible to find out.
>
> Note that nothing updates i_size after a successful write.  So now we do
> the getattr which he claims is ridiculous (it may be cached today though
> it probably won't be) anywhere i_size is needed.  In do_readv_writev we
> invalidate the cached attribute so as to force an update from the
> server.
>
> Well none of this does particular wonders for performance.  None of it
> really solves the problem, either.  OrangeFS does not properly support
> append writes, so we send a offset equal to the size of the file.  It
> completely fails when the file is being written to from another machine
> (or even the same machine but not through the kernel).  Without some
> work on the OrangeFS server, solving that problem is impossible.
>
> Given that limitation, the best we can hope for is consistency on a
> single machine.  We could try to avoid the getattr by updating i_size
> then not doing a getattr if within a cache timeout.  That would be more
> in line with the getattr cache I've implemented.
>
> I avoided it here because there was already an explicit getattr.  I am
> not sure under what circumstances it was added.  I've looked through
> Mike's pre-merge code; it's in there as early (3.15) as I can see.  It's
> not in the out-of-tree module.  Perhaps Mike remembers why it was added.
>
> https://github.com/hubcapsc/linux/commit/60eb8a4f7fc5931ccec0372483ed405ef9ca9110
>
>>
>> _Then_ there's the deadlock on 32bit SMP in that code.  Look: several
>> lines prior we'd done
>>         inode_lock(file->f_mapping->host);
>> and hadn't unlocked the sucker since then.  And
>> static inline void orangefs_i_size_write(struct inode *inode, loff_t i_size)
>> {
>> #if BITS_PER_LONG == 32 && defined(CONFIG_SMP)
>>         inode_lock(inode);
>> #endif
>>         i_size_write(inode, i_size);
>> #if BITS_PER_LONG == 32 && defined(CONFIG_SMP)
>>         inode_unlock(inode);
>> #endif
>> }
>> means that if we get around to calling it there in SMP/32bit case, we'll
>> get as plain a deadlock as possible.  And AFAICS it had been that way
>> since the initial merge.
>>
>> What the hell is that code about and what is it trying to do?
>
> Guess what our original commit message is: "Trac 16"
>
> http://dev.orangefs.org/trac/orangefs/changeset/6709
>
> Hardly seems to be any use trying to figure out what motivated it.  I
> don't think it was a deadlock then except to the extent that all the
> OrangeFS code was a snarl of deadlocks then.  But the obfuscation only
> serves to hide it from us.
>
>>
>> PS: While we are at it, what's the point of that *(&...) in there?
>
> https://github.com/hubcapsc/linux/commit/cae140731030471ec4782d65efefe8e819d6a467
>
> Not sure why it's obfuscated but there's another one in
> orangefs_file_read_iter.
>
> I intended to send a series that implements pagecache support in
> OrangeFS at some point in the future.  It's not quite done, but I
> suppose it would do just as well to send it for review sooner rather
> than later.  All of this code has been significantly re-written there.
>
> Martin
>
> There is no xfstests regression with the following.
>
> diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c
> index 1668fd645c45..0d228cd087e6 100644
> --- a/fs/orangefs/file.c
> +++ b/fs/orangefs/file.c
> @@ -452,7 +452,7 @@ ssize_t orangefs_inode_read(struct inode *inode,
>  static ssize_t orangefs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
>  {
>         struct file *file = iocb->ki_filp;
> -       loff_t pos = *(&iocb->ki_pos);
> +       loff_t pos = iocb->ki_pos;
>         ssize_t rc = 0;
>
>         BUG_ON(iocb->private);
> @@ -492,9 +492,6 @@ static ssize_t orangefs_file_write_iter(struct kiocb *iocb, struct iov_iter *ite
>                 }
>         }
>
> -       if (file->f_pos > i_size_read(file->f_mapping->host))
> -               orangefs_i_size_write(file->f_mapping->host, file->f_pos);
> -
>         rc = generic_write_checks(iocb, iter);
>
>         if (rc <= 0) {
> @@ -508,7 +505,7 @@ static ssize_t orangefs_file_write_iter(struct kiocb *iocb, struct iov_iter *ite
>          * pos to the end of the file, so we will wait till now to set
>          * pos...
>          */
> -       pos = *(&iocb->ki_pos);
> +       pos = iocb->ki_pos;
>
>         rc = do_readv_writev(ORANGEFS_IO_WRITE,
>                              file,
diff mbox

Patch

diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c
index 1668fd645c45..0d228cd087e6 100644
--- a/fs/orangefs/file.c
+++ b/fs/orangefs/file.c
@@ -452,7 +452,7 @@  ssize_t orangefs_inode_read(struct inode *inode,
 static ssize_t orangefs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 {
 	struct file *file = iocb->ki_filp;
-	loff_t pos = *(&iocb->ki_pos);
+	loff_t pos = iocb->ki_pos;
 	ssize_t rc = 0;
 
 	BUG_ON(iocb->private);
@@ -492,9 +492,6 @@  static ssize_t orangefs_file_write_iter(struct kiocb *iocb, struct iov_iter *ite
 		}
 	}
 
-	if (file->f_pos > i_size_read(file->f_mapping->host))
-		orangefs_i_size_write(file->f_mapping->host, file->f_pos);
-
 	rc = generic_write_checks(iocb, iter);
 
 	if (rc <= 0) {
@@ -508,7 +505,7 @@  static ssize_t orangefs_file_write_iter(struct kiocb *iocb, struct iov_iter *ite
 	 * pos to the end of the file, so we will wait till now to set
 	 * pos...
 	 */
-	pos = *(&iocb->ki_pos);
+	pos = iocb->ki_pos;
 
 	rc = do_readv_writev(ORANGEFS_IO_WRITE,
 			     file,