diff mbox series

[v8,06/14] iomap: Return -EAGAIN from iomap_write_iter()

Message ID 20220608171741.3875418-7-shr@fb.com (mailing list archive)
State New
Headers show
Series io-uring/xfs: support async buffered writes | expand

Commit Message

Stefan Roesch June 8, 2022, 5:17 p.m. UTC
If iomap_write_iter() encounters -EAGAIN, return -EAGAIN to the caller.

Signed-off-by: Stefan Roesch <shr@fb.com>
---
 fs/iomap/buffered-io.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

Matthew Wilcox June 8, 2022, 7:02 p.m. UTC | #1
On Wed, Jun 08, 2022 at 10:17:33AM -0700, Stefan Roesch wrote:
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index b06a5c24a4db..f701dcb7c26a 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -829,7 +829,13 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>  		length -= status;
>  	} while (iov_iter_count(i) && length);
>  
> -	return written ? written : status;
> +	if (status == -EAGAIN) {
> +		iov_iter_revert(i, written);
> +		return -EAGAIN;
> +	}
> +	if (written)
> +		return written;
> +	return status;
>  }

I still don't understand how this can possibly work.  Walk me through it.

Let's imagine we have a file laid out such that extent 1 is bytes
0-4095 of the file and extent 2 is extent 4096-16385 of the file.
We do a write of 5000 bytes starting at offset 4000 of the file.

iomap_iter() tells us about the first extent and we write the first
96 bytes of our data to the first extent, returning 96.  iomap_iter()
tells us about the second extent, and we write the next 4000 bytes to
the second extent.  Then we get a page fault and get to the -EAGAIN case.
We rewind the iter 4000 bytes.

How do we not end up writing garbage when the kworker does the retry?
I'd understand if we rewound the iter all the way to the start.  Or if
we didn't rewind the iter at all and were able to pick up partway through
the write.  But rewinding to the start of the extent feels like it can't
possibly work.
Stefan Roesch June 9, 2022, 6:49 p.m. UTC | #2
On 6/8/22 12:02 PM, Matthew Wilcox wrote:
> On Wed, Jun 08, 2022 at 10:17:33AM -0700, Stefan Roesch wrote:
>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>> index b06a5c24a4db..f701dcb7c26a 100644
>> --- a/fs/iomap/buffered-io.c
>> +++ b/fs/iomap/buffered-io.c
>> @@ -829,7 +829,13 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
>>  		length -= status;
>>  	} while (iov_iter_count(i) && length);
>>  
>> -	return written ? written : status;
>> +	if (status == -EAGAIN) {
>> +		iov_iter_revert(i, written);
>> +		return -EAGAIN;
>> +	}
>> +	if (written)
>> +		return written;
>> +	return status;
>>  }
> 
> I still don't understand how this can possibly work.  Walk me through it.
> 
> Let's imagine we have a file laid out such that extent 1 is bytes
> 0-4095 of the file and extent 2 is extent 4096-16385 of the file.
> We do a write of 5000 bytes starting at offset 4000 of the file.
> 
> iomap_iter() tells us about the first extent and we write the first
> 96 bytes of our data to the first extent, returning 96.  iomap_iter()
> tells us about the second extent, and we write the next 4000 bytes to
> the second extent.  Then we get a page fault and get to the -EAGAIN case.
> We rewind the iter 4000 bytes.
> 

We have two data structures, the iomap_iter and iov_iter. After the first
96 bytes, the iov_iter offset get updated in iomap_write_iter() and then the
iomap_iter pos gets updated in iomap_iter()->iomap_iter_advance().

We then get the second extend from iomap_iter(). In iomap_write_iter() the
first page is obtained and written successfully, then the second page is
faulted. At this point the iov offset of the iov_iter has advanced. To reset
it to the state when the function iomap_write_iter() was entered, the iov_iter
is reset to iov_offset - written bytes.

iomap_write_iter() is exited and returns -EAGAIN. As iomap_write_iter() returns
an error, the iomap_iter pos is not updated in iomap_iter(). Only the number
of bytes written in the write of the first extent from iomap_file_buffered_write()
is returned from iomap_file_buffered_write().

In xfs_file_buffered_write we updated the iocb->ki_pos with the number of
bytes written. In io-uring, the io_write() call receives the short write result.
It copies the iov_iter struct into the work context for the io worker.
The io_worker uses that information to complete the rest of the write.

The above reset is required to keep the pos in iomap_iter and the offset in
iov_iter in sync.



Side Note:

I had an earlier version of the patch that was changing the signature of the
function iomap_write_iter(). It was returning a return code and changing the
processed value of the iomap_iter (which then also changes the pos value of 
the iomap_iter). This version (version 7 of the patch) does not require to
reset the offset of the iov_iter. It can update the pos in iomap_iter even
when -EAGAIN is returned.



> How do we not end up writing garbage when the kworker does the retry?
> I'd understand if we rewound the iter all the way to the start.  Or if
> we didn't rewind the iter at all and were able to pick up partway through
> the write.  But rewinding to the start of the extent feels like it can't
> possibly work.
diff mbox series

Patch

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index b06a5c24a4db..f701dcb7c26a 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -829,7 +829,13 @@  static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 		length -= status;
 	} while (iov_iter_count(i) && length);
 
-	return written ? written : status;
+	if (status == -EAGAIN) {
+		iov_iter_revert(i, written);
+		return -EAGAIN;
+	}
+	if (written)
+		return written;
+	return status;
 }
 
 ssize_t