diff mbox

[v5,1/2] Return bytes transferred for partial direct I/O

Message ID 20180119005741.32058-1-rgoldwyn@suse.de (mailing list archive)
State New, archived
Headers show

Commit Message

Goldwyn Rodrigues Jan. 19, 2018, 12:57 a.m. UTC
From: Goldwyn Rodrigues <rgoldwyn@suse.com>

In case direct I/O encounters an error midway, it returns the error.
Instead it should be returning the number of bytes transferred so far.

Test case for filesystems (with ENOSPC):
1. Create an almost full filesystem
2. Create a file, say /mnt/lastfile, until the filesystem is full.
3. Direct write() with count > sizeof /mnt/lastfile.

Result: write() returns -ENOSPC. However, file content has data written
in step 3.

This fixes fstest generic/472.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/block_dev.c |  2 +-
 fs/direct-io.c |  4 +---
 fs/iomap.c     | 20 ++++++++++----------
 3 files changed, 12 insertions(+), 14 deletions(-)

Comments

Al Viro Jan. 19, 2018, 2:13 a.m. UTC | #1
On Thu, Jan 18, 2018 at 06:57:40PM -0600, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> In case direct I/O encounters an error midway, it returns the error.
> Instead it should be returning the number of bytes transferred so far.
> 
> Test case for filesystems (with ENOSPC):
> 1. Create an almost full filesystem
> 2. Create a file, say /mnt/lastfile, until the filesystem is full.
> 3. Direct write() with count > sizeof /mnt/lastfile.
> 
> Result: write() returns -ENOSPC. However, file content has data written
> in step 3.
> 
> This fixes fstest generic/472.

OK...  I can live with that.  What about the XFS side?  It should be
a prereq, to avoid bisection hazard; I can throw both into vfs.git,
if XFS folks are OK with that.  Objections?
Dave Chinner Jan. 19, 2018, 3:59 a.m. UTC | #2
On Fri, Jan 19, 2018 at 02:13:53AM +0000, Al Viro wrote:
> On Thu, Jan 18, 2018 at 06:57:40PM -0600, Goldwyn Rodrigues wrote:
> > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > 
> > In case direct I/O encounters an error midway, it returns the error.
> > Instead it should be returning the number of bytes transferred so far.
> > 
> > Test case for filesystems (with ENOSPC):
> > 1. Create an almost full filesystem
> > 2. Create a file, say /mnt/lastfile, until the filesystem is full.
> > 3. Direct write() with count > sizeof /mnt/lastfile.
> > 
> > Result: write() returns -ENOSPC. However, file content has data written
> > in step 3.
> > 
> > This fixes fstest generic/472.
> 
> OK...  I can live with that.  What about the XFS side?  It should be
> a prereq, to avoid bisection hazard; I can throw both into vfs.git,
> if XFS folks are OK with that.  Objections?

Going through the VFS tree seesm the best approach to me - it's a
trivial change. I'm sure Darrick will shout if it's going to be a
problem, though.

Cheers,

Dave.
Darrick J. Wong Jan. 19, 2018, 6:31 a.m. UTC | #3
On Fri, Jan 19, 2018 at 02:59:51PM +1100, Dave Chinner wrote:
> On Fri, Jan 19, 2018 at 02:13:53AM +0000, Al Viro wrote:
> > On Thu, Jan 18, 2018 at 06:57:40PM -0600, Goldwyn Rodrigues wrote:
> > > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > 
> > > In case direct I/O encounters an error midway, it returns the error.
> > > Instead it should be returning the number of bytes transferred so far.
> > > 
> > > Test case for filesystems (with ENOSPC):
> > > 1. Create an almost full filesystem
> > > 2. Create a file, say /mnt/lastfile, until the filesystem is full.
> > > 3. Direct write() with count > sizeof /mnt/lastfile.
> > > 
> > > Result: write() returns -ENOSPC. However, file content has data written
> > > in step 3.
> > > 
> > > This fixes fstest generic/472.
> > 
> > OK...  I can live with that.  What about the XFS side?  It should be
> > a prereq, to avoid bisection hazard; I can throw both into vfs.git,
> > if XFS folks are OK with that.  Objections?
> 
> Going through the VFS tree seesm the best approach to me - it's a
> trivial change. I'm sure Darrick will shout if it's going to be a
> problem, though.

vfs.git is fine, though the second patch to remove the xfs assert should
go first, as Al points out.

For both patches,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Al Viro Jan. 19, 2018, 6:33 a.m. UTC | #4
On Thu, Jan 18, 2018 at 10:31:18PM -0800, Darrick J. Wong wrote:
> On Fri, Jan 19, 2018 at 02:59:51PM +1100, Dave Chinner wrote:
> > On Fri, Jan 19, 2018 at 02:13:53AM +0000, Al Viro wrote:
> > > On Thu, Jan 18, 2018 at 06:57:40PM -0600, Goldwyn Rodrigues wrote:
> > > > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > > > 
> > > > In case direct I/O encounters an error midway, it returns the error.
> > > > Instead it should be returning the number of bytes transferred so far.
> > > > 
> > > > Test case for filesystems (with ENOSPC):
> > > > 1. Create an almost full filesystem
> > > > 2. Create a file, say /mnt/lastfile, until the filesystem is full.
> > > > 3. Direct write() with count > sizeof /mnt/lastfile.
> > > > 
> > > > Result: write() returns -ENOSPC. However, file content has data written
> > > > in step 3.
> > > > 
> > > > This fixes fstest generic/472.
> > > 
> > > OK...  I can live with that.  What about the XFS side?  It should be
> > > a prereq, to avoid bisection hazard; I can throw both into vfs.git,
> > > if XFS folks are OK with that.  Objections?
> > 
> > Going through the VFS tree seesm the best approach to me - it's a
> > trivial change. I'm sure Darrick will shout if it's going to be a
> > problem, though.
> 
> vfs.git is fine, though the second patch to remove the xfs assert should
> go first, as Al points out.
> 
> For both patches,
> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

Applied; will be in -next tomorrow morning after the tree I'm putting together
gets through local beating.
Al Viro Jan. 20, 2018, 7:47 p.m. UTC | #5
On Fri, Jan 19, 2018 at 06:33:56AM +0000, Al Viro wrote:
> > For both patches,
> > Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Applied; will be in -next tomorrow morning after the tree I'm putting together
> gets through local beating.

FWIW, it consistently blows generic/250 and generic/252.
Andi Kleen Jan. 21, 2018, 2:11 a.m. UTC | #6
Goldwyn Rodrigues <rgoldwyn@suse.de> writes:

> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>
> In case direct I/O encounters an error midway, it returns the error.
> Instead it should be returning the number of bytes transferred so far.

It's likely there's a lot of code in user space that does

     if (write(..., N) < 0) handle error

With your change it would need to be

     if (write(..., N) != N) handle error

How much code is actually doing that?

I can understand it fixes your artifical test suite, but it seems to me your
change has a high potential to break a lot of existing user code
in subtle ways. So it seems to be a bad idea.

-Andi
Goldwyn Rodrigues Jan. 21, 2018, 2:23 a.m. UTC | #7
On 01/20/2018 08:11 PM, Andi Kleen wrote:
> Goldwyn Rodrigues <rgoldwyn@suse.de> writes:
> 
>> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>>
>> In case direct I/O encounters an error midway, it returns the error.
>> Instead it should be returning the number of bytes transferred so far.
> 
> It's likely there's a lot of code in user space that does
> 
>      if (write(..., N) < 0) handle error
> 
> With your change it would need to be
> 
>      if (write(..., N) != N) handle error
> 
> How much code is actually doing that?
> 
> I can understand it fixes your artifical test suite, but it seems to me your
> change has a high potential to break a lot of existing user code
> in subtle ways. So it seems to be a bad idea.
> 
> -Andi
> 


Quoting 'man 2 write':

RETURN VALUE
 On  success,  the  number  of bytes written is returned (zero indicates
 nothing was written).  It is not an error if  this  number  is  smaller
 than the number of bytes requested; this may happen for example because
 the disk device was filled.  See also NOTES.
Goldwyn Rodrigues Jan. 21, 2018, 2:57 a.m. UTC | #8
On 01/20/2018 01:47 PM, Al Viro wrote:
> On Fri, Jan 19, 2018 at 06:33:56AM +0000, Al Viro wrote:
>>> For both patches,
>>> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
>>
>> Applied; will be in -next tomorrow morning after the tree I'm putting together
>> gets through local beating.
> 
> FWIW, it consistently blows generic/250 and generic/252.
> 

Thanks. I will look into it. It is performing direct writes with dm_error.
Hopefully it should be just updating the md5sum in the output files.
Jens Axboe Jan. 21, 2018, 3:07 a.m. UTC | #9
On 1/20/18 7:23 PM, Goldwyn Rodrigues wrote:
> 
> 
> On 01/20/2018 08:11 PM, Andi Kleen wrote:
>> Goldwyn Rodrigues <rgoldwyn@suse.de> writes:
>>
>>> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>>>
>>> In case direct I/O encounters an error midway, it returns the error.
>>> Instead it should be returning the number of bytes transferred so far.
>>
>> It's likely there's a lot of code in user space that does
>>
>>      if (write(..., N) < 0) handle error
>>
>> With your change it would need to be
>>
>>      if (write(..., N) != N) handle error
>>
>> How much code is actually doing that?
>>
>> I can understand it fixes your artifical test suite, but it seems to me your
>> change has a high potential to break a lot of existing user code
>> in subtle ways. So it seems to be a bad idea.
>>
>> -Andi
>>
> 
> 
> Quoting 'man 2 write':
> 
> RETURN VALUE
>  On  success,  the  number  of bytes written is returned (zero indicates
>  nothing was written).  It is not an error if  this  number  is  smaller
>  than the number of bytes requested; this may happen for example because
>  the disk device was filled.  See also NOTES.

You can quote as much man page as you want - Andi is well aware of how
read/write system call works, as I'm sure all of us are, that is not the
issue. The issue is that there are potentially LOTS of applications out
there that do not check for short writes, they do exactly what Andi
speculated above. If you break it with this change, it doesn't matter
what's in the man page. What matters is previous behavior, and that
you are breaking user space. At that point nobody cares what's in the
man page.
Goldwyn Rodrigues Jan. 21, 2018, 12:06 p.m. UTC | #10
On 01/20/2018 09:07 PM, Jens Axboe wrote:
> On 1/20/18 7:23 PM, Goldwyn Rodrigues wrote:
>>
>>
>> On 01/20/2018 08:11 PM, Andi Kleen wrote:
>>> Goldwyn Rodrigues <rgoldwyn@suse.de> writes:
>>>
>>>> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>>>>
>>>> In case direct I/O encounters an error midway, it returns the error.
>>>> Instead it should be returning the number of bytes transferred so far.
>>>
>>> It's likely there's a lot of code in user space that does
>>>
>>>      if (write(..., N) < 0) handle error
>>>
>>> With your change it would need to be
>>>
>>>      if (write(..., N) != N) handle error
>>>
>>> How much code is actually doing that?
>>>
>>> I can understand it fixes your artifical test suite, but it seems to me your
>>> change has a high potential to break a lot of existing user code
>>> in subtle ways. So it seems to be a bad idea.
>>>
>>> -Andi
>>>
>>
>>
>> Quoting 'man 2 write':
>>
>> RETURN VALUE
>>  On  success,  the  number  of bytes written is returned (zero indicates
>>  nothing was written).  It is not an error if  this  number  is  smaller
>>  than the number of bytes requested; this may happen for example because
>>  the disk device was filled.  See also NOTES.
> 
> You can quote as much man page as you want - Andi is well aware of how
> read/write system call works, as I'm sure all of us are, that is not the
> issue. The issue is that there are potentially LOTS of applications out
> there that do not check for short writes, they do exactly what Andi
> speculated above. If you break it with this change, it doesn't matter
> what's in the man page. What matters is previous behavior, and that
> you are breaking user space. At that point nobody cares what's in the
> man page.
> 

Agree. So how do you think we should fix this to accommodate userspace
application who did not cater to the fact that write can return short
write, and still be consistent?

The only way I can think is that a DIO write should check early enough
that the write(N) will complete with N bytes without an error. Is it
possible to completely guarantee that?

Leaving it as it is incorrect as quoted in the artificial test case. You
should not be changing the file and yet conveying to the user an error
for the same write() call. It should either be an error and the file
contents are unchanged, or it should be change in contents and the write
size returned.
Andi Kleen Jan. 22, 2018, 6:08 p.m. UTC | #11
> The only way I can think is that a DIO write should check early enough
> that the write(N) will complete with N bytes without an error. Is it
> possible to completely guarantee that?

Probably not.

> 
> Leaving it as it is incorrect as quoted in the artificial test case. You
> should not be changing the file and yet conveying to the user an error
> for the same write() call. It should either be an error and the file
> contents are unchanged, or it should be change in contents and the write
> size returned.

There are already lots of syscall cases that don't completely
undo changes on error handling.  Fixing that would likely require
a transaction system higher level in the kernel, or lots of 
complicated code everywhere, which is unlikely to happen. 

Also the complicated code would be difficult to test,
and likely bit rot over time, because it would be only an error handling
path that is infrequently exercised.

So yes it's not nice, but the alternative would be worse.

I think we're best of leaving it as it is now.

Adding some comments/documentation to explain this would be good though.
Perhaps you could submit a patch to the manpage?

-Andi
Andreas Dilger Jan. 22, 2018, 7:10 p.m. UTC | #12
> On Jan 20, 2018, at 8:07 PM, Jens Axboe <axboe@kernel.dk> wrote:
> 
> On 1/20/18 7:23 PM, Goldwyn Rodrigues wrote:
>> 
>> 
>> On 01/20/2018 08:11 PM, Andi Kleen wrote:
>>> 
>>> It's likely there's a lot of code in user space that does
>>> 
>>>     if (write(..., N) < 0) handle error
>>> 
>>> With your change it would need to be
>>> 
>>>     if (write(..., N) != N) handle error
>>> 
>>> How much code is actually doing that?
>>> 
>>> I can understand it fixes your artifical test suite, but it seems to me your
>>> change has a high potential to break a lot of existing user code
>>> in subtle ways. So it seems to be a bad idea.
>> 
>> 
>> Quoting 'man 2 write':
>> 
>> RETURN VALUE
>> On  success,  the  number  of bytes written is returned (zero indicates
>> nothing was written).  It is not an error if  this  number  is  smaller
>> than the number of bytes requested; this may happen for example because
>> the disk device was filled.  See also NOTES.
> 
> You can quote as much man page as you want - Andi is well aware of how
> read/write system call works, as I'm sure all of us are, that is not the
> issue. The issue is that there are potentially LOTS of applications out
> there that do not check for short writes, they do exactly what Andi
> speculated above. If you break it with this change, it doesn't matter
> what's in the man page. What matters is previous behavior, and that
> you are breaking user space. At that point nobody cares what's in the
> man page.

Applications that don't handle partial writes are already broken with
buffered I/O, this change doesn't really make them more broken.

Cheers, Andreas
Jens Axboe Jan. 22, 2018, 7:13 p.m. UTC | #13
On 1/22/18 12:10 PM, Andreas Dilger wrote:
> 
>> On Jan 20, 2018, at 8:07 PM, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 1/20/18 7:23 PM, Goldwyn Rodrigues wrote:
>>>
>>>
>>> On 01/20/2018 08:11 PM, Andi Kleen wrote:
>>>>
>>>> It's likely there's a lot of code in user space that does
>>>>
>>>>     if (write(..., N) < 0) handle error
>>>>
>>>> With your change it would need to be
>>>>
>>>>     if (write(..., N) != N) handle error
>>>>
>>>> How much code is actually doing that?
>>>>
>>>> I can understand it fixes your artifical test suite, but it seems to me your
>>>> change has a high potential to break a lot of existing user code
>>>> in subtle ways. So it seems to be a bad idea.
>>>
>>>
>>> Quoting 'man 2 write':
>>>
>>> RETURN VALUE
>>> On  success,  the  number  of bytes written is returned (zero indicates
>>> nothing was written).  It is not an error if  this  number  is  smaller
>>> than the number of bytes requested; this may happen for example because
>>> the disk device was filled.  See also NOTES.
>>
>> You can quote as much man page as you want - Andi is well aware of how
>> read/write system call works, as I'm sure all of us are, that is not the
>> issue. The issue is that there are potentially LOTS of applications out
>> there that do not check for short writes, they do exactly what Andi
>> speculated above. If you break it with this change, it doesn't matter
>> what's in the man page. What matters is previous behavior, and that
>> you are breaking user space. At that point nobody cares what's in the
>> man page.
> 
> Applications that don't handle partial writes are already broken with
> buffered I/O, this change doesn't really make them more broken.

Not disagreeing that they are broken, of course they are. But if you've
been doing direct IO and not seeing short writes, then this could break
your application. And if that's the case, it doesn't really help to say
that their application was "already broken". I'd hate for a kernel
upgrade to break them.

I do wish we could make the change, and maybe we can. But it probably
needs some safe guard proc entry to toggle the behavior, something we
can drop in a few years when we're confident it won't break real
applications.
Elliott, Robert (Servers) Jan. 22, 2018, 10:25 p.m. UTC | #14
> On 1/20/18 7:23 PM, Goldwyn Rodrigues wrote:

> > On 01/20/2018 08:11 PM, Andi Kleen wrote:

> >> Goldwyn Rodrigues <rgoldwyn@suse.de> writes:

> >>> From: Goldwyn Rodrigues <rgoldwyn@suse.com>

> >>> In case direct I/O encounters an error midway, it returns the error.

> >>> Instead it should be returning the number of bytes transferred so far.

> >>

> >> It's likely there's a lot of code in user space that does

> >>

> >>      if (write(..., N) < 0) handle error

> >>

> >> With your change it would need to be

> >>

> >>      if (write(..., N) != N) handle error

> >>

> >> How much code is actually doing that?

> >>

> >> I can understand it fixes your artifical test suite, but it seems to me your

> >> change has a high potential to break a lot of existing user code

> >> in subtle ways. So it seems to be a bad idea.

> >>

> >> -Andi

> >

> > Quoting 'man 2 write':

> >

> > RETURN VALUE

> >  On  success,  the  number  of bytes written is returned (zero indicates

> >  nothing was written).  It is not an error if  this  number  is smaller

> >  than the number of bytes requested; this may happen for example because

> >  the disk device was filled.  See also NOTES.

> 

> You can quote as much man page as you want - Andi is well aware of how

> read/write system call works, as I'm sure all of us are, that is not the

> issue. The issue is that there are potentially LOTS of applications out

> there that do not check for short writes, they do exactly what Andi

> speculated above. If you break it with this change, it doesn't matter

> what's in the man page. What matters is previous behavior, and that

> you are breaking user space. At that point nobody cares what's in the

> man page.

 

fio engines/sg.c fio_sgio_rw_doio() has that pattern:

	ret = write(f->fd, hdr, sizeof(*hdr));
	if (ret < 0)
		return ret;
	...
	return FIO_Q_QUEUED;   [which is 1]

although there might be special circumstances for the sg interface
making that safe.



---
Robert Elliott, HPE Persistent Memory
Jens Axboe Jan. 22, 2018, 11:14 p.m. UTC | #15
On 1/22/18 3:25 PM, Elliott, Robert (Persistent Memory) wrote:
> 
> 
>> On 1/20/18 7:23 PM, Goldwyn Rodrigues wrote:
>>> On 01/20/2018 08:11 PM, Andi Kleen wrote:
>>>> Goldwyn Rodrigues <rgoldwyn@suse.de> writes:
>>>>> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>>>>> In case direct I/O encounters an error midway, it returns the error.
>>>>> Instead it should be returning the number of bytes transferred so far.
>>>>
>>>> It's likely there's a lot of code in user space that does
>>>>
>>>>      if (write(..., N) < 0) handle error
>>>>
>>>> With your change it would need to be
>>>>
>>>>      if (write(..., N) != N) handle error
>>>>
>>>> How much code is actually doing that?
>>>>
>>>> I can understand it fixes your artifical test suite, but it seems to me your
>>>> change has a high potential to break a lot of existing user code
>>>> in subtle ways. So it seems to be a bad idea.
>>>>
>>>> -Andi
>>>
>>> Quoting 'man 2 write':
>>>
>>> RETURN VALUE
>>>  On  success,  the  number  of bytes written is returned (zero indicates
>>>  nothing was written).  It is not an error if  this  number  is smaller
>>>  than the number of bytes requested; this may happen for example because
>>>  the disk device was filled.  See also NOTES.
>>
>> You can quote as much man page as you want - Andi is well aware of how
>> read/write system call works, as I'm sure all of us are, that is not the
>> issue. The issue is that there are potentially LOTS of applications out
>> there that do not check for short writes, they do exactly what Andi
>> speculated above. If you break it with this change, it doesn't matter
>> what's in the man page. What matters is previous behavior, and that
>> you are breaking user space. At that point nobody cares what's in the
>> man page.
>  
> 
> fio engines/sg.c fio_sgio_rw_doio() has that pattern:
> 
> 	ret = write(f->fd, hdr, sizeof(*hdr));
> 	if (ret < 0)
> 		return ret;
> 	...
> 	return FIO_Q_QUEUED;   [which is 1]
> 
> although there might be special circumstances for the sg interface
> making that safe.

That's for SG scsi direct IO, I don't think that supports partial
IO since it's sending raw SCSI commands.

For the regular libaio or sync IO system calls, fio of course checks
and handles short IOs correctly. It even logs if it got any.
Bart Van Assche Jan. 22, 2018, 11:24 p.m. UTC | #16
On Mon, 2018-01-22 at 16:14 -0700, Jens Axboe wrote:
> On 1/22/18 3:25 PM, Elliott, Robert (Persistent Memory) wrote:

> > fio engines/sg.c fio_sgio_rw_doio() has that pattern:

> > 

> > 	ret = write(f->fd, hdr, sizeof(*hdr));

> > 	if (ret < 0)

> > 		return ret;

> > 	...

> > 	return FIO_Q_QUEUED;   [which is 1]

> > 

> > although there might be special circumstances for the sg interface

> > making that safe.

> 

> That's for SG scsi direct IO, I don't think that supports partial

> IO since it's sending raw SCSI commands.

> 

> For the regular libaio or sync IO system calls, fio of course checks

> and handles short IOs correctly. It even logs if it got any.


The entire fio_sgio_rw_doio() function is as follows:

static int fio_sgio_rw_doio(struct fio_file *f, struct io_u *io_u, int do_sync)
{
	struct sg_io_hdr *hdr = &io_u->hdr;
	int ret;

	ret = write(f->fd, hdr, sizeof(*hdr));
	if (ret < 0)
		return ret;

	if (do_sync) {
		ret = read(f->fd, hdr, sizeof(*hdr));
		if (ret < 0)
			return ret;

		/* record if an io error occurred */
		if (hdr->info & SG_INFO_CHECK)
			io_u->error = EIO;

		return FIO_Q_COMPLETED;
	}

	return FIO_Q_QUEUED;
}

I think the 'resid' member of the struct sg_io_hdr that is provided by the
sg_io kernel driver as a response represents the number of bytes that has not
been written. So it should be possible to recognize and handle short I/Os in
that function. From include/scsi/sg.h:

    int resid;                  /* [o] dxfer_len - actual_transferred */

Bart.
Jens Axboe Jan. 22, 2018, 11:27 p.m. UTC | #17
On 1/22/18 4:24 PM, Bart Van Assche wrote:
> On Mon, 2018-01-22 at 16:14 -0700, Jens Axboe wrote:
>> On 1/22/18 3:25 PM, Elliott, Robert (Persistent Memory) wrote:
>>> fio engines/sg.c fio_sgio_rw_doio() has that pattern:
>>>
>>> 	ret = write(f->fd, hdr, sizeof(*hdr));
>>> 	if (ret < 0)
>>> 		return ret;
>>> 	...
>>> 	return FIO_Q_QUEUED;   [which is 1]
>>>
>>> although there might be special circumstances for the sg interface
>>> making that safe.
>>
>> That's for SG scsi direct IO, I don't think that supports partial
>> IO since it's sending raw SCSI commands.
>>
>> For the regular libaio or sync IO system calls, fio of course checks
>> and handles short IOs correctly. It even logs if it got any.
> 
> The entire fio_sgio_rw_doio() function is as follows:
> 
> static int fio_sgio_rw_doio(struct fio_file *f, struct io_u *io_u, int do_sync)
> {
> 	struct sg_io_hdr *hdr = &io_u->hdr;
> 	int ret;
> 
> 	ret = write(f->fd, hdr, sizeof(*hdr));
> 	if (ret < 0)
> 		return ret;
> 
> 	if (do_sync) {
> 		ret = read(f->fd, hdr, sizeof(*hdr));
> 		if (ret < 0)
> 			return ret;
> 
> 		/* record if an io error occurred */
> 		if (hdr->info & SG_INFO_CHECK)
> 			io_u->error = EIO;
> 
> 		return FIO_Q_COMPLETED;
> 	}
> 
> 	return FIO_Q_QUEUED;
> }
> 
> I think the 'resid' member of the struct sg_io_hdr that is provided by the
> sg_io kernel driver as a response represents the number of bytes that has not
> been written. So it should be possible to recognize and handle short I/Os in
> that function. From include/scsi/sg.h:
> 
>     int resid;                  /* [o] dxfer_len - actual_transferred */

Yeah that's true.

But let's not side track the discussion here, fio+sg isn't relevant to the
topic at hand. Fio and other engines would be (like libaio, or sync and
friends), but those handle short/partial IOs just fine.

That said, if someone wants to submit a patch for the sg engine, I would
of course take it.
Goldwyn Rodrigues Jan. 23, 2018, 3:18 a.m. UTC | #18
On 01/22/2018 01:13 PM, Jens Axboe wrote:
> On 1/22/18 12:10 PM, Andreas Dilger wrote:
>>
>>> On Jan 20, 2018, at 8:07 PM, Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> On 1/20/18 7:23 PM, Goldwyn Rodrigues wrote:
>>>>
>>>>
>>>> On 01/20/2018 08:11 PM, Andi Kleen wrote:
>>>>>
>>>>> It's likely there's a lot of code in user space that does
>>>>>
>>>>>     if (write(..., N) < 0) handle error
>>>>>
>>>>> With your change it would need to be
>>>>>
>>>>>     if (write(..., N) != N) handle error
>>>>>
>>>>> How much code is actually doing that?
>>>>>
>>>>> I can understand it fixes your artifical test suite, but it seems to me your
>>>>> change has a high potential to break a lot of existing user code
>>>>> in subtle ways. So it seems to be a bad idea.
>>>>
>>>>
>>>> Quoting 'man 2 write':
>>>>
>>>> RETURN VALUE
>>>> On  success,  the  number  of bytes written is returned (zero indicates
>>>> nothing was written).  It is not an error if  this  number  is  smaller
>>>> than the number of bytes requested; this may happen for example because
>>>> the disk device was filled.  See also NOTES.
>>>
>>> You can quote as much man page as you want - Andi is well aware of how
>>> read/write system call works, as I'm sure all of us are, that is not the
>>> issue. The issue is that there are potentially LOTS of applications out
>>> there that do not check for short writes, they do exactly what Andi
>>> speculated above. If you break it with this change, it doesn't matter
>>> what's in the man page. What matters is previous behavior, and that
>>> you are breaking user space. At that point nobody cares what's in the
>>> man page.
>>
>> Applications that don't handle partial writes are already broken with
>> buffered I/O, this change doesn't really make them more broken.
> 
> Not disagreeing that they are broken, of course they are. But if you've
> been doing direct IO and not seeing short writes, then this could break
> your application. And if that's the case, it doesn't really help to say

I started exploring short writes and how direct I/O behaves on errors
after your suggestion to incorporate short writes in a previous
conversation [1]. I started looking into how midway errors with direct
I/O's are handled now and I stumble upon this issue.

> that their application was "already broken". I'd hate for a kernel
> upgrade to break them.
> 
> I do wish we could make the change, and maybe we can. But it probably
> needs some safe guard proc entry to toggle the behavior, something we
> can drop in a few years when we're confident it won't break real
> applications.

Assuming we call it /proc/sys/fs/dio_short_writes(better names/paths?),
should it be enabled or disabled by default?

[1] https://www.spinics.net/lists/linux-block/msg15910.html
Jens Axboe Jan. 23, 2018, 3:28 a.m. UTC | #19
On 1/22/18 8:18 PM, Goldwyn Rodrigues wrote:
>> that their application was "already broken". I'd hate for a kernel
>> upgrade to break them.
>>
>> I do wish we could make the change, and maybe we can. But it probably
>> needs some safe guard proc entry to toggle the behavior, something we
>> can drop in a few years when we're confident it won't break real
>> applications.
> 
> Assuming we call it /proc/sys/fs/dio_short_writes(better names/paths?),
> should it be enabled or disabled by default?

I'd enable it by default, if not, you are never going to be able to
remove it because you'll have no confidence that anyone actually flipped
the switch and ran with it enabled. The point of having it there and on
by default would be that if something does break, people have the option
of turning it off and restoring the previous behavior, without having to
change the kernel.
Matthew Wilcox (Oracle) Jan. 23, 2018, 6:35 a.m. UTC | #20
On Mon, Jan 22, 2018 at 08:28:54PM -0700, Jens Axboe wrote:
> On 1/22/18 8:18 PM, Goldwyn Rodrigues wrote:
> >> that their application was "already broken". I'd hate for a kernel
> >> upgrade to break them.
> >>
> >> I do wish we could make the change, and maybe we can. But it probably
> >> needs some safe guard proc entry to toggle the behavior, something we
> >> can drop in a few years when we're confident it won't break real
> >> applications.
> > 
> > Assuming we call it /proc/sys/fs/dio_short_writes(better names/paths?),
> > should it be enabled or disabled by default?
> 
> I'd enable it by default, if not, you are never going to be able to
> remove it because you'll have no confidence that anyone actually flipped
> the switch and ran with it enabled. The point of having it there and on
> by default would be that if something does break, people have the option
> of turning it off and restoring the previous behavior, without having to
> change the kernel.

I think it's an opt-in prctl that's something like PRCTL_SHORT_WRITES_ALLOWED.
Andreas Dilger Jan. 24, 2018, 12:19 a.m. UTC | #21
On Jan 22, 2018, at 8:28 PM, Jens Axboe <axboe@kernel.dk> wrote:
> 
> On 1/22/18 8:18 PM, Goldwyn Rodrigues wrote:
>>> that their application was "already broken". I'd hate for a kernel
>>> upgrade to break them.
>>> 
>>> I do wish we could make the change, and maybe we can. But it probably
>>> needs some safe guard proc entry to toggle the behavior, something we
>>> can drop in a few years when we're confident it won't break real
>>> applications.
>> 
>> Assuming we call it /proc/sys/fs/dio_short_writes(better names/paths?),
>> should it be enabled or disabled by default?
> 
> I'd enable it by default, if not, you are never going to be able to
> remove it because you'll have no confidence that anyone actually flipped
> the switch and ran with it enabled. The point of having it there and on
> by default would be that if something does break, people have the option
> of turning it off and restoring the previous behavior, without having to
> change the kernel.

... or fixing their application. :-)

But, yes, I agree that this should be on by default.

Cheers, Andreas
Goldwyn Rodrigues Jan. 25, 2018, 6:01 p.m. UTC | #22
On 01/23/2018 12:35 AM, Matthew Wilcox wrote:
> On Mon, Jan 22, 2018 at 08:28:54PM -0700, Jens Axboe wrote:
>> On 1/22/18 8:18 PM, Goldwyn Rodrigues wrote:
>>>> that their application was "already broken". I'd hate for a kernel
>>>> upgrade to break them.
>>>>
>>>> I do wish we could make the change, and maybe we can. But it probably
>>>> needs some safe guard proc entry to toggle the behavior, something we
>>>> can drop in a few years when we're confident it won't break real
>>>> applications.
>>>
>>> Assuming we call it /proc/sys/fs/dio_short_writes(better names/paths?),
>>> should it be enabled or disabled by default?
>>
>> I'd enable it by default, if not, you are never going to be able to
>> remove it because you'll have no confidence that anyone actually flipped
>> the switch and ran with it enabled. The point of having it there and on
>> by default would be that if something does break, people have the option
>> of turning it off and restoring the previous behavior, without having to
>> change the kernel.
> 
> I think it's an opt-in prctl that's something like PRCTL_SHORT_WRITES_ALLOWED.
> 

I cannot decide where to stick this bit in the task_struct.
current->flags/PF_* does not seem right. Don't want to start a new
fs_flags field. Suggestions?
diff mbox

Patch

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 4a181fcb5175..49d94360bb51 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -409,7 +409,7 @@  __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 
 	if (!ret)
 		ret = blk_status_to_errno(dio->bio.bi_status);
-	if (likely(!ret))
+	if (likely(dio->size))
 		ret = dio->size;
 
 	bio_put(&dio->bio);
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 3aafb3343a65..0c98b6e65d7c 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -262,8 +262,6 @@  static ssize_t dio_complete(struct dio *dio, ssize_t ret, unsigned int flags)
 		ret = dio->page_errors;
 	if (ret == 0)
 		ret = dio->io_error;
-	if (ret == 0)
-		ret = transferred;
 
 	if (dio->end_io) {
 		// XXX: ki_pos??
@@ -310,7 +308,7 @@  static ssize_t dio_complete(struct dio *dio, ssize_t ret, unsigned int flags)
 	}
 
 	kmem_cache_free(dio_cache, dio);
-	return ret;
+	return transferred ? transferred : ret;
 }
 
 static void dio_aio_complete_work(struct work_struct *work)
diff --git a/fs/iomap.c b/fs/iomap.c
index 47d29ccffaef..cab57d85404e 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -716,23 +716,23 @@  static ssize_t iomap_dio_complete(struct iomap_dio *dio)
 	struct kiocb *iocb = dio->iocb;
 	struct inode *inode = file_inode(iocb->ki_filp);
 	loff_t offset = iocb->ki_pos;
-	ssize_t ret;
+	ssize_t err;
+	ssize_t transferred = dio->size;
 
 	if (dio->end_io) {
-		ret = dio->end_io(iocb,
-				dio->error ? dio->error : dio->size,
+		err = dio->end_io(iocb,
+				transferred ? transferred : dio->error,
 				dio->flags);
 	} else {
-		ret = dio->error;
+		err = dio->error;
 	}
 
-	if (likely(!ret)) {
-		ret = dio->size;
+	if (likely(transferred)) {
 		/* check for short read */
-		if (offset + ret > dio->i_size &&
+		if (offset + transferred > dio->i_size &&
 		    !(dio->flags & IOMAP_DIO_WRITE))
-			ret = dio->i_size - offset;
-		iocb->ki_pos += ret;
+			transferred = dio->i_size - offset;
+		iocb->ki_pos += transferred;
 	}
 
 	/*
@@ -759,7 +759,7 @@  static ssize_t iomap_dio_complete(struct iomap_dio *dio)
 	inode_dio_end(file_inode(iocb->ki_filp));
 	kfree(dio);
 
-	return ret;
+	return transferred ? transferred : err;
 }
 
 static void iomap_dio_complete_work(struct work_struct *work)