[07/13] aio: enabled thread based async fsync

Message ID 80934665e0dd2360e2583522c7c7569e5a92be0e.1452549431.git.bcrl@kvack.org (mailing list archive)
State New, archived

Commit Message

Benjamin LaHaise Jan. 11, 2016, 10:07 p.m. UTC
Enable fully asynchronous fsync and fdatasync operations in aio using
the aio thread queuing mechanism.

Signed-off-by: Benjamin LaHaise <ben.lahaise@solacesystems.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
---
 fs/aio.c | 41 +++++++++++++++++++++++++++++++----------
 1 file changed, 31 insertions(+), 10 deletions(-)
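
For context, this is roughly how the interface being wired up here is driven
from userspace via libaio; the file path and event counts are arbitrary and
error handling is trimmed. On kernels where the filesystem has no ->aio_fsync
(i.e. without this series), io_submit() fails the iocb with -EINVAL:

#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	int fd, ret;

	fd = open("testfile", O_WRONLY | O_CREAT, 0644);
	if (fd < 0 || io_setup(8, &ctx) < 0)
		return 1;

	io_prep_fsync(&cb, fd);		/* IOCB_CMD_FSYNC; io_prep_fdsync() for FDSYNC */
	ret = io_submit(ctx, 1, cbs);	/* -EINVAL here if aio fsync is unsupported */
	if (ret == 1)
		io_getevents(ctx, 1, 1, &ev, NULL);	/* ev.res carries the fsync result */

	printf("submit=%d res=%ld\n", ret, ret == 1 ? (long)ev.res : 0L);
	io_destroy(ctx);
	close(fd);
	return 0;
}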

Comments

Dave Chinner Jan. 12, 2016, 1:11 a.m. UTC | #1
On Mon, Jan 11, 2016 at 05:07:23PM -0500, Benjamin LaHaise wrote:
> Enable a fully asynchronous fsync and fdatasync operations in aio using
> the aio thread queuing mechanism.
> 
> Signed-off-by: Benjamin LaHaise <ben.lahaise@solacesystems.com>
> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>

Insufficient. Needs the range to be passed through and call
vfs_fsync_range(), as I implemented here:

https://lkml.org/lkml/2015/10/28/878

Cheers,

Dave.
Linus Torvalds Jan. 12, 2016, 1:20 a.m. UTC | #2
On Mon, Jan 11, 2016 at 5:11 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Insufficient. Needs the range to be passed through and call
> vfs_fsync_range(), as I implemented here:

And I think that's insufficient *also*.

What you actually want is "sync_file_range()", with the full set of arguments.

Yes, really. Sometimes you want to start the writeback, sometimes you
want to wait for it. Sometimes you want both.

For example, if you are doing your own manual write-behind logic, it
is not sufficient for "wait for data". What you want is "start IO on
new data" followed by "wait for old data to have been written out".

I think this only strengthens my "stop with the idiotic
special-case-AIO magic already" argument.  If we want something more
generic than the usual aio, then we should go all in. Not "let's make
more limited special cases".

              Linus
Benjamin LaHaise Jan. 12, 2016, 1:30 a.m. UTC | #3
On Tue, Jan 12, 2016 at 12:11:28PM +1100, Dave Chinner wrote:
> On Mon, Jan 11, 2016 at 05:07:23PM -0500, Benjamin LaHaise wrote:
> > Enable a fully asynchronous fsync and fdatasync operations in aio using
> > the aio thread queuing mechanism.
> > 
> > Signed-off-by: Benjamin LaHaise <ben.lahaise@solacesystems.com>
> > Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
> 
> Insufficient. Needs the range to be passed through and call
> vfs_fsync_range(), as I implemented here:

Noted.

> https://lkml.org/lkml/2015/10/28/878

Please at least Cc the aio list in the future on aio patches, as I do
not have the time to read linux-kernel these days unless prodded to do
so...

		-ben

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Jan. 12, 2016, 2:25 a.m. UTC | #4
On Mon, Jan 11, 2016 at 05:20:42PM -0800, Linus Torvalds wrote:
> On Mon, Jan 11, 2016 at 5:11 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Insufficient. Needs the range to be passed through and call
> > vfs_fsync_range(), as I implemented here:
> 
> And I think that's insufficient *also*.
> 
> What you actually want is "sync_file_range()", with the full set of arguments.

That's a different interface. The aio fsync interface has been
exposed to userspace for years, we just haven't implemented it in
the kernel. That's a major difference from everything else being
proposed in this patch set, especially this one.

FYI sync_file_range() is definitely not a fsync/fdatasync
replacement as it does not guarantee data durability in any way.
i.e. you can call sync_file_range, have it wait for data to be
written, return to userspace, then lose power and lose the data that
sync_file_range said it wrote. That's because sync_file_range()
does not:

	a) write the metadata needed to reference the data to disk;
	   and
	b) flush volatile storage caches after data and metadata is
	   written.

Hence sync_file_range is useless to applications that need to
guarantee data durability. Not to mention that most AIO applications
use direct IO, and so have no use for fine grained control over page
cache writeback semantics. They only require a) and b) above, so
implementing the AIO fsync primitive is exactly what they want.

> Yes, really. Sometimes you want to start the writeback, sometimes you
> want to wait for it. Sometimes you want both.

Without durability guarantees such application level optimisations
are pretty much worthless.

> I think this only strengthens my "stop with the idiotic
> special-case-AIO magic already" argument.  If we want something more
> generic than the usual aio, then we should go all in. Not "let's make
> more limited special cases".

No, I don't think this specific case does, because the AIO fsync
interface already exists....

Cheers,

Dave.
Linus Torvalds Jan. 12, 2016, 2:38 a.m. UTC | #5
On Mon, Jan 11, 2016 at 6:25 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> That's a different interface.

So is openat. So is readahead.

My point is that this idiotic "let's expose special cases" must end.
It's broken. It inevitably only exposes a subset of what different
people would want.

Making "aio_read()" and friends a special interface had historical
reasons for it. But expanding willy-nilly on that model does not.

               Linus
Dave Chinner Jan. 12, 2016, 3:37 a.m. UTC | #6
On Mon, Jan 11, 2016 at 06:38:15PM -0800, Linus Torvalds wrote:
> On Mon, Jan 11, 2016 at 6:25 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > That's a different interface.
> 
> So is openat. So is readahead.
>
> My point is that this idiotic "let's expose special cases" must end.
> It's broken. It inevitably only exposes a subset of what different
> people would want.
> 
> Making "aio_read()" and friends a special interface had historical
> reasons for it. But expanding willy-nilly on that model does not.

Yes, I heard you the first time, but you haven't acknowledged that
the aio fsync interface is indeed different because it already
exists. What's the problem with implementing an AIO call that we've
advertised as supported for many years, now that people are asking us
to implement it?

As for a generic async syscall interface, why not just add
IOCB_CMD_SYSCALL that encodes the syscall number and parameters
into the iovec structure and let the existing aio subsystem handle
demultiplexing it and handing them off to threads/workqueues/etc?
That way we get contexts, events, signals, completions,
cancellations, etc from the existing infrastructure, and there's
really only a dispatch/collection layer that needs to be added?

If we then provide the userspace interface via the libaio library to
call the async syscalls with an AIO context handle, then there's
little more that needs to be done to support just about everything
as an async syscall...
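
Purely as a sketch of that idea -- no IOCB_CMD_SYSCALL opcode exists -- the
userspace side might pack the syscall number and an argument block into the
otherwise-unused iocb fields, something like:

#include <linux/aio_abi.h>
#include <stdint.h>
#include <string.h>

#define IOCB_CMD_SYSCALL	42	/* hypothetical opcode */

static void prep_async_syscall(struct iocb *cb, unsigned int nr,
			       const long args[6])
{
	memset(cb, 0, sizeof(*cb));
	cb->aio_lio_opcode = IOCB_CMD_SYSCALL;
	cb->aio_fildes = nr;				/* syscall number */
	cb->aio_buf = (uint64_t)(uintptr_t)args;	/* argument block, caller-owned */
	cb->aio_nbytes = 6;				/* number of arguments */
}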

Cheers,

Dave.
Linus Torvalds Jan. 12, 2016, 4:03 a.m. UTC | #7
On Mon, Jan 11, 2016 at 7:37 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Yes, I heard you the first time, but you haven't acknowledged that
> the aio fsync interface is indeed different because it already
> exists. What's the problem with implementing an AIO call that we've
> advertised as supported for many years now that people are asking us
> to implement it?

Oh, I don't disagree with that. I think it should be exposed, my point
was that that too was not enough.

I don't see why you argue. You said "that's not enough". And I just
said that your expansion wasn't sufficient either, and that I think we
should strive to expand things even more.

And preferably not in some ad-hoc manner. Expand it to *everything* we can do.

> As for a generic async syscall interface, why not just add
> IOCB_CMD_SYSCALL that encodes the syscall number and parameters
> into the iovec structure and let the existing aio subsystem handle
> demultiplexing it and handing them off to threads/workqueues/etc?

That would likely be the simplest approach, yes.

There's a few arguments against it, though:

 - doing the indirect system call thing does end up being
architecture-specific, so now you do need the AIO code to call into
some arch wrapper.

   Not a huge deal, since the arch wrapper will be pretty simple (and
we can have a default one that just returns ENOSYS, so that we don't
have to synchronize all architectures)

 - the aio interface really is horrible crap. Really really.

   For example, the whole "send signal as a completion model" is so
f*cking broken that I really don't want to extend the aio interface
too much. I think it's unfixable.

So I really think we'd be *much* better off with a new interface
entirely - preferably one that allows the old aio interfaces to fall
out fairly naturally.

Ben mentioned lio_listio() as a reason for why he wanted to extend the
AIO interface, but I think it works the other way around: yes, we
should look at lio_listio(), but we should look at it mainly as a way
to ask ourselves: "can we implement a new asynchronous system call
submission model that would also make it possible to implement
lio_listio() as a user space wrapper around it".

For example, if we had an actual _good_ way to queue up things, you
could probably make that "struct sigevent" completion for lio_listio()
just be another asynchronous system call at the end of the list - a
system call that sends the completion signal.  And the aiocb_list[]
itself? Maybe those could just be done as normal (individual) aio
calls (so that you end up having the aiocb that you can wait on with
aio_suspend() etc).
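
For reference, the POSIX interface in question (glibc currently implements it
entirely in userspace with threads); a minimal use, with error handling
omitted, looks roughly like:

#include <aio.h>
#include <signal.h>
#include <string.h>

static void listio_example(int fd, char *a, char *b, size_t len)
{
	struct aiocb cb1, cb2;
	struct aiocb *list[2] = { &cb1, &cb2 };
	struct sigevent sev;

	memset(&cb1, 0, sizeof(cb1));
	cb1.aio_fildes = fd;
	cb1.aio_buf = a;
	cb1.aio_nbytes = len;
	cb1.aio_offset = 0;
	cb1.aio_lio_opcode = LIO_READ;

	cb2 = cb1;
	cb2.aio_buf = b;
	cb2.aio_offset = len;
	cb2.aio_lio_opcode = LIO_READ;

	memset(&sev, 0, sizeof(sev));
	sev.sigev_notify = SIGEV_SIGNAL;	/* the signal-based completion model */
	sev.sigev_signo = SIGUSR1;

	lio_listio(LIO_NOWAIT, list, 2, &sev);	/* queue both, signal when all complete */
}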

But then people who do *not* want the crazy aiocb, and do *not* want
some SIGIO or whatever, could just fire off asynchronous system calls
without that cruddy interface.

So my argument is really that I think it would be better to at least
look into maybe creating something less crapulent, and striving to
make it easy to make the old legacy interfaces be just wrappers around
a more capable model.

And hey, it may be that in the end nobody cares enough, and the right
thing (or at least the prudent thing) to do is to just pile the crap
on deeper and higher, and just add a single IOCB_CMD_SYSCALL
indirection entry.

So I'm not dismissing that as a solution - I just don't think it's a
particularly clean one.

It does have the advantage of likely being a fairly simple hack. But
it smells like a hack.

                Linus
Linus Torvalds Jan. 12, 2016, 4:48 a.m. UTC | #8
On Mon, Jan 11, 2016 at 8:03 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So my argument is really that I think it would be better to at least
> look into maybe creating something less crapulent, and striving to
> make it easy to make the old legacy interfaces be just wrappers around
> a more capable model.

Hmm. Thinking more about this makes me worry about all the system call
versioning and extra work done by libc.

At least glibc has traditionally decided to munge and extend on kernel
system call interfaces, to the point where even fairly core data
structures (like "struct stat") may not always look the same to the
kernel as they do to user space.

So with that worry, I have to admit that maybe a limited interface -
rather than allowing arbitrary generic async system calls - might have
advantages. Less room for mismatches.

I'll have to think about this some more.

                  Linus
Andy Lutomirski Jan. 12, 2016, 10:59 p.m. UTC | #9
On Jan 11, 2016 8:04 PM, "Linus Torvalds" <torvalds@linux-foundation.org> wrote:
>
> On Mon, Jan 11, 2016 at 7:37 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Yes, I heard you the first time, but you haven't acknowledged that
> > the aio fsync interface is indeed different because it already
> > exists. What's the problem with implementing an AIO call that we've
> > advertised as supported for many years now that people are asking us
> > to implement it?
>
> Oh, I don't disagree with that. I think it should be exposed, my point
> was that that too was not enough.
>
> I don't see why you argue. You said "that's not enough". And I jjust
> said that your expansion wasn't sufficient either, and that I think we
> should strive to expand things even more.
>
> And preferably not in some ad-hoc manner. Expand it to *everything* we can do.
>
> > As for a generic async syscall interface, why not just add
> > IOCB_CMD_SYSCALL that encodes the syscall number and parameters
> > into the iovec structure and let the existing aio subsystem handle
> > demultiplexing it and handing them off to threads/workqueues/etc?
>
> That would likely be the simplest approach, yes.
>
> There's a few arguments against it, though:
>
>  - doing the indirect system call thing does end up being
> architecture-specific, so now you do need the AIO code to call into
> some arch wrapper.

How many arches *can* do it?  As of 4.4, x86_32 can, but x86_64 can't
yet.  We'd also need a whitelist of acceptable indirect syscalls (e.g.
exit is bad).  And we have to worry about things that depend on the mm
or creds.

It would be extra nice if we could avoid switch_mm for things that
don't need it (fsync) and only do it for things like read that do.

--Andy
Paolo Bonzini Jan. 14, 2016, 9:19 a.m. UTC | #10
On 12/01/2016 02:20, Linus Torvalds wrote:
> On Mon, Jan 11, 2016 at 5:11 PM, Dave Chinner <david@fromorbit.com> wrote:
>>
>> Insufficient. Needs the range to be passed through and call
>> vfs_fsync_range(), as I implemented here:
> 
> And I think that's insufficient *also*.
> 
> What you actually want is "sync_file_range()", with the full set of arguments.
> 
> Yes, really. Sometimes you want to start the writeback, sometimes you
> want to wait for it. Sometimes you want both.
> 
> For example, if you are doing your own manual write-behind logic, it
> is not sufficient for "wait for data". What you want is "start IO on
> new data" followed by "wait for old data to have been written out".
> 
> I think this only strengthens my "stop with the idiotic
> special-case-AIO magic already" argument.  If we want something more
> generic than the usual aio, then we should go all in. Not "let's make
> more limited special cases".

The question is, do we really want something more generic than the usual
AIO?

Virt is one of the 10 (that's a binary number) users of AIO, and we
don't even use it by default because in most cases it's really a wash.

Let's compare AIO with a simple userspace thread pool.

AIO has the ability to submit and retrieve the results of multiple
operations at once.  Thread pools do not have the ability to submit
multiple operations at a time (you could play games with FUTEX_WAKE, but
then all the threads in the pool would have cacheline bounces on the futex).

The syscall overhead on the critical path is comparable.  For AIO it's
io_submit+io_getevents, for a thread pool it's FUTEX_WAKE plus invoking
the actual syscall.  Again, the only difference for AIO is batching.

Unless userspace is submitting tens of thousands of operations per
second, which is pretty much the case only for read/write, there's no
real benefit in asynchronous system calls over a userspace thread pool.
 That applies to openat, unlinkat, fadvise (for readahead).  It also
applies to msync and fsync, etc. because if your workload is doing tons
of those you'd better buy yourself a disk with a battery-backed cache,
or a UPS, and remove the msync/fsync altogether.

So I'm really happy if we can move the thread creation overhead for such
a thread pool to the kernel.  It keeps the benefits of batching, it uses
the optimized kernel workqueues, it doesn't incur the cost of pthreads,
it makes it easy to remove the cases where AIO is blocking, it makes it
easy to add support for !O_DIRECT.  But everything else seems overkill.

Paolo
Benjamin LaHaise Jan. 15, 2016, 8:21 p.m. UTC | #11
On Mon, Jan 11, 2016 at 08:48:23PM -0800, Linus Torvalds wrote:
> On Mon, Jan 11, 2016 at 8:03 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So my argument is really that I think it would be better to at least
> > look into maybe creating something less crapulent, and striving to
> > make it easy to make the old legacy interfaces be just wrappers around
> > a more capable model.
> 
> Hmm. Thinking more about this makes me worry about all the system call
> versioning and extra work done by libc.
> 
> At least glibc has traditionally decided to munge and extend on kernel
> system call interfaces, to the point where even fairly core data
> structures (like "struct stat") may not always look the same to the
> kernel as they do to user space.
> 
> So with that worry, I have to admit that maybe a limited interface -
> rather than allowing arbitrary generic async system calls - might have
> advantages. Less room for mismatches.
> 
> I'll have to think about this some more.

Any further thoughts on this after a few days worth of pondering?

		-ben

>                   Linus
Linus Torvalds Jan. 20, 2016, 3:59 a.m. UTC | #12
On Fri, Jan 15, 2016 at 12:21 PM, Benjamin LaHaise <bcrl@kvack.org> wrote:
>>
>> I'll have to think about this some more.
>
> Any further thoughts on this after a few days worth of pondering?

Sorry about the delay, with the merge window and me being sick for a
couple of days I didn't get around to this.

After thinking it over some more, I guess I'm ok with your approach.
The table-driven patch makes me a bit happier, and I guess not very
many people end up ever wanting to do async system calls anyway.

Are there other users outside of Solace? It would be good to get comments..

            Linus
Theodore Ts'o Jan. 20, 2016, 5:02 a.m. UTC | #13
On Tue, Jan 19, 2016 at 07:59:35PM -0800, Linus Torvalds wrote:
> 
> After thinking it over some more, I guess I'm ok with your approach.
> The table-driven patch makes me a bit happier, and I guess not very
> many people end up ever wanting to do async system calls anyway.
> 
> Are there other users outside of Solace? It would be good to get comments..

For async I/O?  We're using it inside Google, for networking and for
storage I/O's.  We don't need async fsync/fdatasync, but we do need
very fast, low overhead I/O's.  To that end, we have some patches to
batch block layer completion handling, which Kent tried upstreaming a
few years back but which everyone thought was too ugly to live.

(It *was* ugly, but we had access to some very fast storage devices
where it really mattered.  With upcoming NVMe devices, that sort of
hardware should be available to more folks, so it's something that
I've been meaning to revisit from an upstreaming perspective,
especially if I can get my hands on some publicly available hardware
for benchmarking purposes to demonstrate why it's useful, even if it
is ugly.)

The other thing which we have which is a bit more experimental is that
we've plumbed through the aio priority bits to the block layer, as
well as aio_cancel.  The idea for the latter is if you are
interested in low latency access to a clustered file system, where
sometimes a read request can get stuck behind other I/O requests if a
server has a long queue of requests to service.  So the client for
which low latency is very important fires off the request to more than
one server, and as soon as it gets an answer it sends a "never mind"
message to the other server(s).
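
From userspace that pattern is roughly the sketch below using io_cancel(2)
(libaio wrappers shown, error handling omitted). Note that mainline kernels
almost always return -EINVAL here, since cancellation isn't wired up for
regular file I/O; it only does something useful where the lower layers
actually implement cancellation, as in the out-of-tree code discussed here:

#include <libaio.h>

static void hedge_read(io_context_t ctx, struct iocb *primary, struct iocb *mirror)
{
	struct iocb *cbs[2] = { primary, mirror };
	struct io_event first, cancelled;

	io_submit(ctx, 2, cbs);		/* fire the read at both servers */

	/* take whichever answer arrives first... */
	if (io_getevents(ctx, 1, 1, &first, NULL) == 1)
		/* ...and send a "never mind" for the other one */
		io_cancel(ctx, first.obj == primary ? mirror : primary,
			  &cancelled);
}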

The code to do aio_cancellation in the block layer is fairly well
tested, and was in Kent's git trees, but never got formally pushed
upstream.  The code to push the cancellation request all the way to
the HDD (for those hard disks / storage devices that support I/O
cancellation) is even more experimental, and needs a lot of cleanup
before it could be sent for review (it was done by someone who isn't
used to upstream coding standards).

The reason why we haven't tried to push more of these changes
upstream has been lack of resources, and the fact that the AIO code
*is* ugly, which means extensions tend to make the code at the very
least, more complex.  Especially since some of the folks working on
it, such as Kent, were really worried about performance at all costs,
and Kernighan's "it's twice as hard to debug code as to write it"
comment really applies here.  And since very few people outside of
Google seem to use AIO, and even fewer seem eager to review or work on
AIO, and our team is quite small for the work we need to do, it just
hasn't risen to the top of the priority list.

Still, it's fair to say that if you are using Google Hangouts, or
Google Mail, or Google Docs, AIO is most definitely getting used to
process your queries.

As far as comments, aside from the "we really care about performance",
and "the code is scary complex and barely on the edge of being
maintainable", the other comment I'd make is libaio is pretty awful,
and so as a result a number (most?) of our AIO users have elected to
use the raw system call interfaces and are *not* using the libaio
abstractions --- which, as near as I can tell, don't really buy you
much anyway.  (Do we really need to keep code that provides backwards
compatibility with kernels over 10+ years old at this point?)
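
The "raw system call interfaces" mentioned here are just the io_*(2) syscalls
invoked directly against the uapi iocb layout, roughly as in the sketch below
(a single synchronous-looking pread, error handling omitted):

#define _GNU_SOURCE
#include <linux/aio_abi.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>

static long raw_aio_pread(int fd, void *buf, size_t len, off_t off)
{
	aio_context_t ctx = 0;
	struct iocb cb;
	struct iocb *cbs[1] = { &cb };
	struct io_event ev;

	syscall(__NR_io_setup, 128, &ctx);

	memset(&cb, 0, sizeof(cb));
	cb.aio_lio_opcode = IOCB_CMD_PREAD;
	cb.aio_fildes = fd;
	cb.aio_buf = (uint64_t)(uintptr_t)buf;
	cb.aio_nbytes = len;
	cb.aio_offset = off;

	syscall(__NR_io_submit, ctx, 1, cbs);
	syscall(__NR_io_getevents, ctx, 1, 1, &ev, NULL);
	syscall(__NR_io_destroy, ctx);
	return (long)ev.res;
}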

Cheers,

					- Ted
Dave Chinner Jan. 20, 2016, 7:59 p.m. UTC | #14
On Tue, Jan 19, 2016 at 07:59:35PM -0800, Linus Torvalds wrote:
> On Fri, Jan 15, 2016 at 12:21 PM, Benjamin LaHaise <bcrl@kvack.org> wrote:
> >>
> >> I'll have to think about this some more.
> >
> > Any further thoughts on this after a few days worth of pondering?
> 
> Sorry about the delay, with the merge window and me being sick for a
> couple of days I didn't get around to this.
> 
> After thinking it over some more, I guess I'm ok with your approach.
> The table-driven patch makes me a bit happier, and I guess not very
> many people end up ever wanting to do async system calls anyway.
> 
> Are there other users outside of Solace? It would be good to get comments..

I know of quite a few storage/db products that use AIO. The most
recent high profile project that has been reporting issues with AIO
on XFS is http://www.scylladb.com/. That project is architected
around non-blocking AIO for scalability reasons...

Cheers,

Dave.
Linus Torvalds Jan. 20, 2016, 8:29 p.m. UTC | #15
On Wed, Jan 20, 2016 at 11:59 AM, Dave Chinner <david@fromorbit.com> wrote:
>>
>> Are there other users outside of Solace? It would be good to get comments..
>
> I know of quite a few storage/db products that use AIO. The most
> recent high profile project that have been reporting issues with AIO
> on XFS is http://www.scylladb.com/. That project is architected
> around non-blocking AIO for scalability reasons...

I was more wondering about the new interfaces, making sure that the
feature set actually matches what people want to do..

That said, I also agree that it would be interesting to hear what the
performance impact is for existing performance-sensitive users. Could
we make that "aio_may_use_threads()" case be unconditional, making
things simpler?

          Linus
Benjamin LaHaise Jan. 20, 2016, 8:44 p.m. UTC | #16
On Wed, Jan 20, 2016 at 12:29:32PM -0800, Linus Torvalds wrote:
> On Wed, Jan 20, 2016 at 11:59 AM, Dave Chinner <david@fromorbit.com> wrote:
> >>
> >> Are there other users outside of Solace? It would be good to get comments..
> >
> > I know of quite a few storage/db products that use AIO. The most
> > recent high profile project that have been reporting issues with AIO
> > on XFS is http://www.scylladb.com/. That project is architected
> > around non-blocking AIO for scalability reasons...
> 
> I was more wondering about the new interfaces, making sure that the
> feature set actually matches what people want to do..

I suspect this will be an ongoing learning exercise as people start to use 
the new functionality and find gaps in terms of what is needed.  Certainly 
there is a bunch of stuff we need to add to cover the cases where disk i/o 
is required.  getdents() is one example, but the ABI issues we have with it 
are somewhat more complicated given the history associated with that 
interface.

> That said, I also agree that it would be interesting to hear what the
> performance impact is for existing performance-sensitive users. Could
> we make that "aio_may_use_threads()" case be unconditional, making
> things simpler?

Making it unconditional is a goal, but some work is required before that 
can be the case.  The O_DIRECT issue is one such matter -- it requires some 
changes to the filesystems to ensure that they adhere to the non-blocking 
nature of the new interface (ie taking i_mutex is a Bad Thing that users 
really do not want to be exposed to; if taking it blocks, the code should 
punt to a helper thread).  Additional auditing of some of the read/write 
implementations is also required, which will likely need some minor changes 
in things like sysfs and other weird functionality we have.  Having the 
flag reflects that while the functionality is useful, not all of the bugs 
have been worked out yet.

What's the desired approach to merge these changes?  Does it make sense 
to merge what is ready now and prepare the next round of changes for 4.6?  
Or is it more important to grow things to a more complete state before 
merging?

Regards,

		-ben

>           Linus
Dave Chinner Jan. 20, 2016, 9:45 p.m. UTC | #17
On Wed, Jan 20, 2016 at 03:44:49PM -0500, Benjamin LaHaise wrote:
> On Wed, Jan 20, 2016 at 12:29:32PM -0800, Linus Torvalds wrote:
> > On Wed, Jan 20, 2016 at 11:59 AM, Dave Chinner <david@fromorbit.com> wrote:
> > >>
> > >> Are there other users outside of Solace? It would be good to get comments..
> > >
> > > I know of quite a few storage/db products that use AIO. The most
> > > recent high profile project that have been reporting issues with AIO
> > > on XFS is http://www.scylladb.com/. That project is architected
> > > around non-blocking AIO for scalability reasons...
> > 
> > I was more wondering about the new interfaces, making sure that the
> > feature set actually matches what people want to do..
> 
> I suspect this will be an ongoing learning exercise as people start to use 
> the new functionality and find gaps in terms of what is needed.  Certainly 
> there is a bunch of stuff we need to add to cover the cases where disk i/o 
> is required.  getdents() is one example, but the ABI issues we have with it 
> are somewhat more complicated given the history associated with that 
> interface.
> 
> > That said, I also agree that it would be interesting to hear what the
> > performance impact is for existing performance-sensitive users. Could
> > we make that "aio_may_use_threads()" case be unconditional, making
> > things simpler?
> 
> Making it unconditional is a goal, but some work is required before that 
> can be the case.  The O_DIRECT issue is one such matter -- it requires some 
> changes to the filesystems to ensure that they adhere to the non-blocking 
> nature of the new interface (ie taking i_mutex is a Bad Thing that users 
> really do not want to be exposed to; if taking it blocks, the code should 
> punt to a helper thread).

Filesystems *must take locks* in the IO path. We have to serialise
against truncate and other operations at some point in the IO path
(e.g. block mapping vs concurrent allocation and/or removal), and
that can only be done sanely with sleeping locks.  There is no way
of knowing in advance if we are going to block, and so either we
always use threads for IO submission or we accept that occasionally
the AIO submission will block.

Cheers,

Dave.
Benjamin LaHaise Jan. 20, 2016, 9:56 p.m. UTC | #18
On Thu, Jan 21, 2016 at 08:45:46AM +1100, Dave Chinner wrote:
> Filesystems *must take locks* in the IO path. We have to serialise
> against truncate and other operations at some point in the IO path
> (e.g. block mapping vs concurrent allocation and/or removal), and
> that can only be done sanely with sleeping locks.  There is no way
> of knowing in advance if we are going to block, and so either we
> always use threads for IO submission or we accept that occasionally
> the AIO submission will block.

I never said we don't take locks.  Still, we can be more intelligent 
about when and where we do so.  With the nonblocking pread() and pwrite() 
changes being proposed elsewhere, we can do the part of the I/O that 
doesn't block in the submitter, which is a huge win when possible.
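
The nonblocking read/write work referenced above eventually surfaced as
RWF_NOWAIT for preadv2(2)/pwritev2(2) in later kernels; a rough sketch of the
submit-then-punt pattern, where queue_to_helper_thread() is a stand-in for
whatever hands the request to a worker:

#define _GNU_SOURCE
#include <sys/uio.h>
#include <errno.h>

extern void queue_to_helper_thread(int fd, struct iovec *iov, off_t off);

static void submit_read(int fd, struct iovec *iov, off_t off)
{
	ssize_t ret = preadv2(fd, iov, 1, off, RWF_NOWAIT);

	if (ret >= 0)
		return;				/* completed in the submitter: fast path */
	if (errno == EAGAIN)
		queue_to_helper_thread(fd, iov, off);	/* would block: punt to a thread */
}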

As it stands today, *every* buffered write takes i_mutex immediately 
on entering ->write().  That one issue alone accounts for a nearly 10x 
performance difference between an O_SYNC write and an O_DIRECT write, 
and using O_SYNC writes is a legitimate use-case for users who want 
caching of data by the kernel (duplicating that functionality is a huge 
amount of work for an application, plus if you want the cache to be 
persistent between runs of an app, you have to get the kernel to do it).

		-ben

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Jan. 20, 2016, 9:57 p.m. UTC | #19
On Wed, Jan 20, 2016 at 12:29:32PM -0800, Linus Torvalds wrote:
> On Wed, Jan 20, 2016 at 11:59 AM, Dave Chinner <david@fromorbit.com> wrote:
> >>
> >> Are there other users outside of Solace? It would be good to get comments..
> >
> > I know of quite a few storage/db products that use AIO. The most
> > recent high profile project that have been reporting issues with AIO
> > on XFS is http://www.scylladb.com/. That project is architected
> > around non-blocking AIO for scalability reasons...
> 
> I was more wondering about the new interfaces, making sure that the
> feature set actually matches what people want to do..

Well, they have mentioned that openat() can block, as will the first
operation after open that requires reading the file extent map from
disk. There are some ways of hacking around this (e.g. running
FIEMAP with a zero extent count or ext4's special extent prefetch
ioctl in a separate thread to prefetch the extent list into memory
before IO is required) so I suspect we may actually need some
interfaces that don't currently exist at all....

> That said, I also agree that it would be interesting to hear what the
> performance impact is for existing performance-sensitive users. Could
> we make that "aio_may_use_threads()" case be unconditional, making
> things simpler?

That would make things a lot simpler in the kernel and AIO
submission a lot more predictable/deterministic for userspace. I'd
suggest that, at minimum, it should be the default behaviour...

Cheers,

Dave.
Andres Freund Jan. 22, 2016, 3:31 p.m. UTC | #20
On 2016-01-12 12:11:28 +1100, Dave Chinner wrote:
> On Mon, Jan 11, 2016 at 05:07:23PM -0500, Benjamin LaHaise wrote:
> > Enable a fully asynchronous fsync and fdatasync operations in aio using
> > the aio thread queuing mechanism.
> > 
> > Signed-off-by: Benjamin LaHaise <ben.lahaise@solacesystems.com>
> > Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
> 
> Insufficient. Needs the range to be passed through and call
> vfs_fsync_range(), as I implemented here:
> 
> https://lkml.org/lkml/2015/10/28/878

FWIW, I finally started to play around with this (or more precisely
https://lkml.org/lkml/2015/10/29/517). There were some prerequisite
changes required in postgres to actually be able to benefit, which
delayed things.  First results are good, increasing OLTP throughput
considerably.

It'd also be rather helpful to be able to do
sync_file_range(SYNC_FILE_RANGE_WRITE) asynchronously, i.e. flush
without an implied barrier. Currently this blocks very frequently, even
if there's actually IO bandwidth available.

Regards,

Andres
Andres Freund Jan. 22, 2016, 3:41 p.m. UTC | #21
On 2016-01-19 19:59:35 -0800, Linus Torvalds wrote:
> Are there other users outside of Solace? It would be good to get comments..

PostgreSQL is a potential user of async fdatasync, fsync,
sync_file_range and potentially readahead, write, read. First tests with
Dave's async fsync/fsync_range are positive, so are the results with a
self-hacked async sync_file_range (although I'm kinda thinking that it
shouldn't really need to be used asynchronously).

I rather doubt openat, unlink et al are going to be interesting for
*us*, the required structural changes would be too big. But obviously
that doesn't mean anything for others.

Andres
Dave Chinner Jan. 23, 2016, 4:24 a.m. UTC | #22
On Wed, Jan 20, 2016 at 04:56:30PM -0500, Benjamin LaHaise wrote:
> On Thu, Jan 21, 2016 at 08:45:46AM +1100, Dave Chinner wrote:
> > Filesystems *must take locks* in the IO path. We have to serialise
> > against truncate and other operations at some point in the IO path
> > (e.g. block mapping vs concurrent allocation and/or removal), and
> > that can only be done sanely with sleeping locks.  There is no way
> > of knowing in advance if we are going to block, and so either we
> > always use threads for IO submission or we accept that occasionally
> > the AIO submission will block.
> 
> I never said we don't take locks.  Still, we can be more intelligent 
> about when and where we do so.  With the nonblocking pread() and pwrite() 
> changes being proposed elsewhere, we can do the part of the I/O that 
> doesn't block in the submitter, which is a huge win when possible.
> 
> As it stands today, *every* buffered write takes i_mutex immediately 
> on entering ->write().  That one issue alone accounts for a nearly 10x 
> performance difference between an O_SYNC write and an O_DIRECT write, 

Yes, that locking is for correct behaviour, not for performance
reasons.  The i_mutex is providing the required semantics for POSIX
write(2) functionality - writes must serialise against other reads
and writes so that they are completed atomically w.r.t. other IO.
i.e. writes to the same offset must not interleave, nor should reads
be able to see partial data from a write in progress.

Direct IO does not conform to POSIX concurrency standards, so we
don't have to serialise concurrent IO against each other.

> and using O_SYNC writes is a legitimate use-case for users who want 
> caching of data by the kernel (duplicating that functionality is a huge 
> amount of work for an application, plus if you want the cache to be 
> persistent between runs of an app, you have to get the kernel to do it).

Yes, but you take what you get given. Buffered IO sucks in many ways;
this is just one of them.

Cheers,

Dave.
Dave Chinner Jan. 23, 2016, 4:39 a.m. UTC | #23
On Wed, Jan 20, 2016 at 03:07:26PM -0800, Linus Torvalds wrote:
> On Jan 20, 2016 1:46 PM, "Dave Chinner" <david@fromorbit.com> wrote:
> > >
> > > > That said, I also agree that it would be interesting to hear what the
> > > > performance impact is for existing performance-sensitive users. Could
> > > > we make that "aio_may_use_threads()" case be unconditional, making
> > > > things simpler?
> > >
> > > Making it unconditional is a goal, but some work is required before that
> > > can be the case.  The O_DIRECT issue is one such matter -- it requires some
> > > changes to the filesystems to ensure that they adhere to the non-blocking
> > > nature of the new interface (ie taking i_mutex is a Bad Thing that users
> > > really do not want to be exposed to; if taking it blocks, the code should
> > > punt to a helper thread).
> >
> > Filesystems *must take locks* in the IO path.
> 
> I agree.
> 
> I also would prefer to make the aio code have as little interaction and
> magic flags with the filesystem code as humanly possible.
> 
> I wonder if we could make the rough rule be that the only synchronous case
> the aio code ever has is more or less entirely in the generic vfs caches?
> IOW, could we possibly aim to make the rule be that if we call down to the
> filesystem layer, we do that within a thread?

We have to go through the filesystem layer locking even on page
cache hits, and even if we get into the page cache copy-in/copy-out
code we can still get stuck on things like page locks and page
faults. Even if the pages are cached, we can still get caught on
deeper filesystem locks for block mapping. e.g. read from a hole,
get zeros back, page cache is populated. Write data into range,
fetch page, realise it's unmapped, need to do block/delayed
allocation which requires filesystem locks and potentially
transactions and IO....

> We could do things like that for the name lookup for openat() too, where
> we could handle the successful RCU lookup synchronously, but then if we
> fall out of RCU mode we'd do the thread.

We'd have to do quite a bit of work to unwind back out to the AIO
layer before we can dispatch the open operation again in a thread,
wouldn't we?

So I'm not convinced that conditional thread dispatch makes sense. I
think the simplest thing to do is make all AIO use threads/
workqueues by default, and if the application is smart enough to
only do things that minimise blocking they can turn off the threaded
dispatch and get the same behaviour they get now.

Cheers,

Dave.
Benjamin LaHaise Jan. 23, 2016, 4:50 a.m. UTC | #24
On Sat, Jan 23, 2016 at 03:24:49PM +1100, Dave Chinner wrote:
> On Wed, Jan 20, 2016 at 04:56:30PM -0500, Benjamin LaHaise wrote:
> > On Thu, Jan 21, 2016 at 08:45:46AM +1100, Dave Chinner wrote:
> > > Filesystems *must take locks* in the IO path. We have to serialise
> > > against truncate and other operations at some point in the IO path
> > > (e.g. block mapping vs concurrent allocation and/or removal), and
> > > that can only be done sanely with sleeping locks.  There is no way
> > > of knowing in advance if we are going to block, and so either we
> > > always use threads for IO submission or we accept that occasionally
> > > the AIO submission will block.
> > 
> > I never said we don't take locks.  Still, we can be more intelligent 
> > about when and where we do so.  With the nonblocking pread() and pwrite() 
> > changes being proposed elsewhere, we can do the part of the I/O that 
> > doesn't block in the submitter, which is a huge win when possible.
> > 
> > As it stands today, *every* buffered write takes i_mutex immediately 
> > on entering ->write().  That one issue alone accounts for a nearly 10x 
> > performance difference between an O_SYNC write and an O_DIRECT write, 
> 
> Yes, that locking is for correct behaviour, not for performance
> reasons.  The i_mutex is providing the required semantics for POSIX
> write(2) functionality - writes must serialise against other reads
> and writes so that they are completed atomically w.r.t. other IO.
> i.e. writes to the same offset must not interleave, not should reads
> be able to see partial data from a write in progress.

No, the locks are not *required* for POSIX semantics, they are a legacy
of how Linux filesystem code has been implemented and how we ensure the
necessary internal consistency needed inside our filesystems is
provided.  There are other ways to achieve the required semantics that
do not involve a single giant lock for the entire file/inode.  And no, I
am not saying that doing this is simple or easy to do.

		-ben

> Direct IO does not conform to POSIX concurrency standards, so we
> don't have to serialise concurrent IO against each other.
> 
> > and using O_SYNC writes is a legitimate use-case for users who want 
> > caching of data by the kernel (duplicating that functionality is a huge 
> > amount of work for an application, plus if you want the cache to be 
> > persistent between runs of an app, you have to get the kernel to do it).
> 
> Yes, but you take what you get given. Buffered IO sucks in many ways;
> this is just one of them.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Jan. 23, 2016, 10:22 p.m. UTC | #25
On Fri, Jan 22, 2016 at 11:50:24PM -0500, Benjamin LaHaise wrote:
> On Sat, Jan 23, 2016 at 03:24:49PM +1100, Dave Chinner wrote:
> > On Wed, Jan 20, 2016 at 04:56:30PM -0500, Benjamin LaHaise wrote:
> > > On Thu, Jan 21, 2016 at 08:45:46AM +1100, Dave Chinner wrote:
> > > > Filesystems *must take locks* in the IO path. We have to serialise
> > > > against truncate and other operations at some point in the IO path
> > > > (e.g. block mapping vs concurrent allocation and/or removal), and
> > > > that can only be done sanely with sleeping locks.  There is no way
> > > > of knowing in advance if we are going to block, and so either we
> > > > always use threads for IO submission or we accept that occasionally
> > > > the AIO submission will block.
> > > 
> > > I never said we don't take locks.  Still, we can be more intelligent 
> > > about when and where we do so.  With the nonblocking pread() and pwrite() 
> > > changes being proposed elsewhere, we can do the part of the I/O that 
> > > doesn't block in the submitter, which is a huge win when possible.
> > > 
> > > As it stands today, *every* buffered write takes i_mutex immediately 
> > > on entering ->write().  That one issue alone accounts for a nearly 10x 
> > > performance difference between an O_SYNC write and an O_DIRECT write, 
> > 
> > Yes, that locking is for correct behaviour, not for performance
> > reasons.  The i_mutex is providing the required semantics for POSIX
> > write(2) functionality - writes must serialise against other reads
> > and writes so that they are completed atomically w.r.t. other IO.
> > i.e. writes to the same offset must not interleave, not should reads
> > be able to see partial data from a write in progress.
> 
> No, the locks are not *required* for POSIX semantics, they are a legacy
> of how Linux filesystem code has been implemented and how we ensure the
> necessary internal consistency needed inside our filesystems is
> provided.

That may be the case, but I really don't see how you can provide
such required functionality without some kind of exclusion barrier
in place. No matter how you implement that exclusion, it can be seen
effectively as a lock.

Even if the filesystem doesn't use the i_mutex for exclusion to the
page cache, it has to use some kind of lock as that IO still needs
to be serialised against any truncate, hole punch or other extent
manipulation that is currently in progress on the inode...

> There are other ways to achieve the required semantics that
> do not involve a single giant lock for the entire file/inode.

Most performant filesystems don't have a "single giant lock"
anymore. The problem is that the VFS expects the i_mutex to be held
for certain operations in the IO path and the VFS lock order
hierarchy makes it impossible to do anything but "get i_mutex
first".  That's the problem that needs to be solved - the VFS
enforces the "one giant lock" model, even when underlying
filesystems do not require it.

i.e. we could quite happily remove the i_mutex completely from the XFS
buffered IO path without breaking anything, but we can't because
that results in the VFS throwing warnings that we don't hold the
i_mutex (e.g. like when removing the SUID bits on write). So there's
lots of VFS functionality that needs to be turned on it's head
before the i_mutex can be removed from the IO path.

> And no, I
> am not saying that doing this is simple or easy to do.

Sure. That's always been the problem. Even when a split IO/metadata
locking strategy like what XFS uses (and other modern filesystems
are moving to internally) is suggested as a model for solving
these problems, the usual response is instant dismissal with
"no way, that's unworkable" and so nothing ever changes...

Cheers,

Dave.

Patch

diff --git a/fs/aio.c b/fs/aio.c
index 88af450..576b780 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -224,8 +224,9 @@  static const struct file_operations aio_ring_fops;
 static const struct address_space_operations aio_ctx_aops;
 
 static void aio_complete(struct kiocb *kiocb, long res, long res2);
+ssize_t aio_fsync(struct kiocb *iocb, int datasync);
 
-static bool aio_may_use_threads(void)
+static __always_inline bool aio_may_use_threads(void)
 {
 #if IS_ENABLED(CONFIG_AIO_THREAD)
 	return !!(aio_auto_threads & 1);
@@ -1654,6 +1655,26 @@  ssize_t generic_async_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 				     AIO_THREAD_NEED_TASK);
 }
 EXPORT_SYMBOL(generic_async_write_iter);
+
+static long aio_thread_op_fsync(struct aio_kiocb *iocb)
+{
+	return vfs_fsync(iocb->common.ki_filp, 0);
+}
+
+static long aio_thread_op_fdatasync(struct aio_kiocb *iocb)
+{
+	return vfs_fsync(iocb->common.ki_filp, 1);
+}
+
+ssize_t aio_fsync(struct kiocb *iocb, int datasync)
+{
+	struct aio_kiocb *req;
+
+	req = container_of(iocb, struct aio_kiocb, common);
+
+	return aio_thread_queue_iocb(req, datasync ? aio_thread_op_fdatasync
+						   : aio_thread_op_fsync, 0);
+}
 #endif /* IS_ENABLED(CONFIG_AIO_THREAD) */
 
 /*
@@ -1664,7 +1685,7 @@  static ssize_t aio_run_iocb(struct aio_kiocb *req, unsigned opcode,
 			    char __user *buf, size_t len, bool compat)
 {
 	struct file *file = req->common.ki_filp;
-	ssize_t ret;
+	ssize_t ret = -EINVAL;
 	int rw;
 	fmode_t mode;
 	rw_iter_op *iter_op;
@@ -1730,17 +1751,17 @@  rw_common:
 		break;
 
 	case IOCB_CMD_FDSYNC:
-		if (!file->f_op->aio_fsync)
-			return -EINVAL;
-
-		ret = file->f_op->aio_fsync(&req->common, 1);
+		if (file->f_op->aio_fsync)
+			ret = file->f_op->aio_fsync(&req->common, 1);
+		else if (file->f_op->fsync && (aio_may_use_threads()))
+			ret = aio_fsync(&req->common, 1);
 		break;
 
 	case IOCB_CMD_FSYNC:
-		if (!file->f_op->aio_fsync)
-			return -EINVAL;
-
-		ret = file->f_op->aio_fsync(&req->common, 0);
+		if (file->f_op->aio_fsync)
+			ret = file->f_op->aio_fsync(&req->common, 0);
+		else if (file->f_op->fsync && (aio_may_use_threads()))
+			ret = aio_fsync(&req->common, 0);
 		break;
 
 	default: