
[5/5] aio: Refactor aio_read_evt, use cmxchg(), fix bug

Message ID 1349764760-21093-5-git-send-email-koverstreet@google.com (mailing list archive)
State Deferred, archived

Commit Message

Kent Overstreet Oct. 9, 2012, 6:39 a.m. UTC
Bunch of cleanup, and make it lockless so that userspace can safely pull
events off the ringbuffer without racing with io_getevents().

Signed-off-by: Kent Overstreet <koverstreet@google.com>
---
 fs/aio.c |  220 +++++++++++++++++++++-----------------------------------------
 1 file changed, 73 insertions(+), 147 deletions(-)

Comments

Zach Brown Oct. 9, 2012, 6:37 p.m. UTC | #1
On Mon, Oct 08, 2012 at 11:39:20PM -0700, Kent Overstreet wrote:
> Bunch of cleanup

Ugh.  That's way too much noisy change for one patch with no
description.  Break it up into functional pieces and actually describe
them.

> events off the ringbuffer without racing with io_getevents().

Are you sure this is safe in the presence of wrapping indices?  It's
been a very long time since I've looked at this, but I could never
convince myself that it was safe.

What I'm worried about is a cmpxchg() caller sampling, say, an index of
0, having another task GO NUTS and wrap that index all the way around
back to 0, and then having that initial cmpxchg() caller see the wrapped
0 index and think that nothing's changed in the interim.

Is this a problem?

(I wish I could find the comment I wrote in a very old patch to tear out
the mapped ring entirely.. It was a great list of its design mistakes.)

- z

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Kent Overstreet Oct. 9, 2012, 9:27 p.m. UTC | #2
On Tue, Oct 09, 2012 at 11:37:53AM -0700, Zach Brown wrote:
> On Mon, Oct 08, 2012 at 11:39:20PM -0700, Kent Overstreet wrote:
> > Bunch of cleanup
> 
> Ugh.  That's way too much noisy change for one patch with no
> description.  Break it up into functional pieces and actually describe
> them.

Heh, I actually didn't mean to send these patches out to lkml just yet
(though now I'm glad I did). There definitely is too much in this patch,
it kinda snowballed.

> > events off the ringbuffer without racing with io_getevents().
> 
> Are you sure this is safe in the presence of wrapping indices?  It's
> been a very long time since I've looked at this, but I could never
> convince myself that it was safe.
> 
> What I'm worried about is a cmpxchg() caller sampling, say, an index of
> 0, having another task GO NUTS and wrap that index all the way around
> back to 0, and then having that initial cmpxchg() caller see the wrapped
> 0 index and think that nothing's changed in the interim.
> 
> Is this a problem?

Ohhhh, yeah it is - ABA. That's easily solved though: only do the head %
nr when we're actually using the head pointer to index the ring, and let
the stored value use the full 32 bits.

If libaio is the only thing in userspace looking at the ringbuffer, and
if I'm reading the latest libaio code right, this shouldn't break
anything...

> (I wish I could find the comment I wrote in a very old patch to tear out
> the mapped ring entirely.. It was a great list of its design mistakes.)

Sticking the head and tail pointers on the same cacheline is one of my
main complaints. If you could find that comment I'd be interested in
reading it, though.

Zach Brown Oct. 9, 2012, 10:47 p.m. UTC | #3
> If libaio is the only thing in userspace looking at the ringbuffer, and
> if I'm looking at the latest libaio code this shouldn't break
> anything...

We can't assume that libaio is the only thing in userspace using the
mapped buffer -- as scary a thought as that is :).

If we wanted to change the behaviour of the ring we'd give userspace a
way to indicate that it accepts the new semantics.  I'm not immediately
sure how best to do that -- haven't given it much thought.

> Sticking the head and tail pointers on the same cacheline is one of my
> main complaints. If you could find that comment I'd be interested in
> reading it, though.

Yeah, that was on my list.. Heh, found it!

/*
* This structure defines the head of a ring of events that is mapped into user
* space.  It isn't used anymore because it has a number of design flaws.  I'll
* mention them here in the hopes that people read this when considering a
* design for a ring shared between user and kernel space.
*
* - The head and tail pointers are in the same cacheline.  This introduces
* false sharing between producers and consumers which could otherwise operate
* independently of one another, presuming that the producers know ahead of time
* that room has been reserved for them.
*
* - The head and tail semantics are unconventional.  They're always less than
* the number of events in the ring and their meaning is reversed from the
* usual construct that one sees in, for example, ethernet hardware where the
* ring is a power of two and the indexes wrap but are masked when used to
* dereference elements.
*
* - Because of the weird head and tail semantics one can't distinguish between
* completely empty and full rings.  To work around this the ring is actually
* created with more events than the aio user asked for and that number is
* accounted for separately.  The number of free elements in the ring is
* greater than the number of operations that the kernel aio submission will
* allow.
*
* - Synchronizing head and tail updates between the kernel and user space
* requires a u32 cmpxchg.  futexes have infrastructure for this which requires
* working with user addresses.  Trying to nicely re-use the futex code leads
* to awkward code in aio which sometimes wants to work through kmap()
* addresses and other times wants to work through the mmaped user space
* pointers.
*
* - The head and tail pointers are restricted in value to being less than the
* number of elements in the ring.  This means that a cmpxchg can see an index
* which has wrapped and confuse it for the index that it read before trying
* cmpxchg.  This can end up sending a duplicated previous event instead of a
* currently queued event in the worst case.
*
* - User space doesn't explicitly indicate that it wants to work with the ring
* from user space.  Even if user space never touches the ring the kernel still
* has to pay the cost of the potential sharing with user space when inserting
* events.
*
* - The kernel magically maps it into the address space.  Users can't put it
* in whatever memory they like, like shared mem or hugetlb pages or tmpfs
* pages that they want to sendfile out of.
*
* - The ring is allocated out of unswappable kernel memory.  It would be
* better to have the ring allocated in userspace and given to the kernel.  In
* the common case the pages will be present and the overhead is minimal and
* complexity is kept out of the kernel.  In the worst case of memory pressure
* the kernel has fewer pinned pages tying its hands.
*/

The submission and completion interface that I threw together for the
acall experiments took all this into account:

 http://lwn.net/Articles/316806/

Man, this all brings back memories :).

- z

Kent Overstreet Oct. 9, 2012, 10:55 p.m. UTC | #4
On Tue, Oct 09, 2012 at 03:47:03PM -0700, Zach Brown wrote:
> > If libaio is the only thing in userspace looking at the ringbuffer, and
> > if I'm looking at the latest libaio code this shouldn't break
> > anything...
> 
> We can't assume that libaio is the only thing in userspace using the
> mapped buffer -- as scary a thought as that is :).
> 
> If we wanted to change the behaviour of the ring we'd give userspace a
> way to indicate that it accepts the new semantics.  I'm not immediately
> sure how best to do that -- haven't given it much thought.

Well, the ringbuffer does have those compat flags and incompat flags.
Which libaio conveniently doesn't check, but for what it does it
shouldn't really matter I guess.

I figure if anyone else is using the ringbuffer directly and not
checking the flag fields... well, they deserve to have their stuff
broken :P

OTOH, that is the same situation the tracing stuff was in... but the
stuff that used the tracing fields incorrectly was more widespread.

Anyways, if we can't change the ringbuffer at all we could always create
a new version of the io_setup() syscall that gives you a new ringbuffer
format.

That'd arguably be better anyways, since if we went the incompat flags
route old code would be stuck with not being able to use the ringbuffer
at all on newer kernels. But it would be more of a pain and more kernel
code.


> > Sticking the head and tail pointers on the same cacheline is one of my
> > main complaints. If you could find that comment I'd be interested in
> > reading it, though.
> 
> Yeah, that was on my list.. Heh, found it!

Thanks!

I'm wondering what interest there is in creating a new aio interface to
solve these and other issues.

I kind of feel like as long as we've got a list of complaints, we should
prototype something in one place that fixes all our complaints... think
of it as documenting all the known issues, if nothing else.

> [big ring-design comment and acall link snipped -- quoted in full earlier in the thread]

Zach Brown Oct. 9, 2012, 11:10 p.m. UTC | #5
> Well, the ringbuffer does have those compat flags and incompat flags.
> Which libaio conveniently doesn't check, but for what it does it
> shouldn't really matter I guess.

Well, the presumed point of the incompat flags would be to tell an app
that it isn't going to get what it expects!  Ideally it'd abort, not
blindly charge on ahead.

> I figure if anyone else is using the ringbuffer directly and not
> checking the flag fields... well, they deserve to have their stuff
> broken :P

Nope!  I subscribe to the unpopular notion that you don't change
interfaces just because you can.

> Anyways, if we can't change the ringbuffer at all we could always create
> a new version of the io_setup() syscall that gives you a new ringbuffer
> format.

That might be the easiest way to tweak the existing aio interface, yeah.
Jens wants to do that in his patches as well.  He used the hack of
setting nr_events to INT_MAX to indicate not using the ring, but adding
a flags parameter to a new syscall seems a lot less funky.

> I'm wondering what interest there is in creating a new aio interface to
> solve these and other issues.
> 
> I kind of feel like as long as we've got a list of complaints, we should
> prototype something in one place that fixes all our complaints... think
> of it as documenting all the known issues, if nothing else.

I'd help out with that, yes.

On my list of complaints would be how heavy the existing aio
setup/submission/completion/teardown interface is.   A better interface
should make it trivial to just bang out a call and synchronously wait
for it.  Get that right and you don't have to mess around with aio and
sync variants.

One of the harder things to get right would be specifying the DIF/DIX
checksums per sector.  But I think we should.  Poor Martin has been hung
out to dry for too long.

And perhaps obviously, I'd start with the acall stuff :).  It was a lot
lighter.  We could talk about how to make it extensible without going
all the way to the generic packed variable size duplicating or not and
returning or not or.. attributes :).

- z

Kent Overstreet Oct. 10, 2012, 12:06 a.m. UTC | #6
On Tue, Oct 09, 2012 at 04:10:59PM -0700, Zach Brown wrote:
> > Well, the ringbuffer does have those compat flags and incompat flags.
> > Which libaio conveniently doesn't check, but for what it does it
> > shouldn't really matter I guess.
> 
> Well, the presumed point of the incompat flags would be to tell an app
> that it isn't going to get what it expects!  Ideally it'd abort, not
> blindly charge on ahead.
> 
> > I figure if anyone else is using the ringbuffer directly and not
> > checking the flag fields... well, they deserve to have their stuff
> > broken :P
> 
> Nope!  I subscribe to the unpopular notion that you don't change
> interfaces just because you can.

Heh, I won't argue.

The AIO ringbuffer stuff just annoys me more than most (it wasn't until
the other day that I realized it was actually exported to userspace...
what led to figuring that out was noticing that aio_context_t is a ulong,
and gets truncated to 32 bits with a 32 bit program running on a 64 bit
kernel. I'd been horribly misled by the code comments and the lack of
documentation.)

> > Anyways, if we can't change the ringbuffer at all we could always create
> > a new version of the io_setup() syscall that gives you a new ringbuffer
> > format.
> 
> That might be the easiest way to tweak the existing aio interface, yeah.
> Jens wants to do that in his patches as well.  He used the hack of
> setting nr_events to INT_MAX to indicate not using the ring, but adding
> a flags parameter to a new syscall seems a lot less funky.

Alright. Maybe I'll start hacking on that...

> > I'm wondering what interest there is in creating a new aio interface to
> > solve these and other issues.
> > 
> > I kind of feel like as long as we've got a list of complaints, we should
> > prototype something in one place that fixes all our complaints... think
> > of it as documenting all the known issues, if nothing else.
> 
> I'd help out with that, yes.
> 
> On my list of complaints would be how heavy the existing aio
> setup/submission/completion/teardown interface is.   A better interface
> should make it trivial to just bang out a call and synchronously wait
> for it.  Get that right and you don't have to mess around with aio and
> sync variants.

Hmm yeah, setup and teardown is a good point.

I never liked aio_context_t too much - in some respects it would be
cleaner if it were just implicit and per thread. But we probably can't do
that since there are legitimate use cases for one thread submitting an
iocb and another thread reaping the events.

But if we do have an explicit handle, I don't see why it shouldn't be a
file descriptor.

But an implicit per thread context might be useful for the use case you
describe... or perhaps we can add a syscall to submit an iocb and wait
for it synchronously, without any aio_context_t involved.

> One of the harder things to get right would be specifying the DIF/DIX
> checksums per sector.  But I think we should.  Poor Martin has hung out
> to dry for too long.

Yes, that's one of the things I want to address with the aio attributes
stuff.

> 
> And perhaps obviously, I'd start with the acall stuff :).  It was a lot
> lighter.  We could talk about how to make it extensible without going
> all the way to the generic packed variable size duplicating or not and
> returning or not or.. attributes :).

Link? I haven't heard of acall before.

Zach Brown Oct. 10, 2012, 12:26 a.m. UTC | #7
> The AIO ringbuffer stuff just annoys me more than most

Not more than everyone, though, I can personally promise you that :).

> (it wasn't until
> the other day that I realized it was actually exported to userspace...
> what led to figuring that out was noticing that aio_context_t is a ulong,
> and gets truncated to 32 bits with a 32 bit program running on a 64 bit
> kernel. I'd been horribly misled by the code comments and the lack of
> documentation.)

Yeah.  It's the userspace address of the mmaped ring.  This has annoyed
the process migration people who can't recreate the context in a new
kernel because there's no userspace interface to specify creation of a
context at a specific address.

> But if we do have an explicit handle, I don't see why it shouldn't be a
> file descriptor.

Because they're expensive to create and destroy when compared to a
single system call.  Imagine that we're waiting for a single
completion to implement a cheap one-off sync call.  Imagine it's a
buffered op which happens to hit the cache and is really quick.

(And they're annoying to manage: libraries and O_CLOEXEC, running into
fd/file limit tunables, bleh.)

If the 'completion context' is no more than a structure in userspace
memory then a lot of stuff just works.  Tasks can share it amongst
themselves as they see fit.  A trivial one-off sync call can just dump
it on the stack and point to it.  It doesn't have to be specifically
torn down on task exit.

> > And perhaps obviously, I'd start with the acall stuff :).  It was a lot
> > lighter.  We could talk about how to make it extensible without going
> > all the way to the generic packed variable size duplicating or not and
> > returning or not or.. attributes :).
> 
> Link? I haven't heard of acall before.

I linked to it after that giant silly comment earlier in the thread,
here it is again:

  http://lwn.net/Articles/316806/

There's a mostly embarrassing video of a jetlagged me giving that talk at
LCA kicking around.. ah, here:

 http://mirror.linux.org.au/pub/linux.conf.au/2009/Thursday/131.ogg

- z

Kent Overstreet Oct. 10, 2012, 12:47 a.m. UTC | #8
On Tue, Oct 09, 2012 at 05:26:34PM -0700, Zach Brown wrote:
> > The AIO ringbuffer stuff just annoys me more than most
> 
> Not more than everyone, though, I can personally promise you that :).
> 
> > (it wasn't until
> > the other day that I realized it was actually exported to userspace...
> > what led to figuring that out was noticing aio_context_t was a ulong,
> > and got truncated to 32 bits with a 32 bit program running on a 64 bit
> > kernel. I'd been horribly misled by the code comments and the lack of
> > documentation.) 
> 
> Yeah.  It's the userspace address of the mmaped ring.  This has annoyed
> the process migration people who can't recreate the context in a new
> kernel because there's no userspace interface to specify creation of a
> context at a specific address.

Yeah I did finally figure that out - and a file descriptor that
userspace then mmap()ed would solve that problem...

> 
> > But if we do have an explicit handle, I don't see why it shouldn't be a
> > file descriptor.
> 
> Because they're expensive to create and destroy when compared to a
> single system call.  Imagine that we're waiting for a single
> completion to implement a cheap one-off sync call.  Imagine it's a
> buffered op which happens to hit the cache and is really quick.

True. But that could be solved with a separate interface that either
doesn't use a context to submit a call synchronously, or uses an
implicit per thread context.

> (And they're annoying to manage: libraries and O_CLOEXEC, running into
> fd/file limit tunables, bleh.)

I don't have a _strong_ opinion there, but my intuition is that we
shouldn't be creating new types of handles without a good reason. I
don't think the annoyances are for the most part particular to file
descriptors; I think they tend to be applicable to handles in general, and
at least with file descriptors they're known and solved.

Also, with a file descriptor it naturally works with an epoll event
loop. (eventfd for aio is a hack).

> If the 'completion context' is no more than a structure in userspace
> memory then a lot of stuff just works.  Tasks can share it amongst
> themselves as they see fit.  A trivial one-off sync call can just dump
> it on the stack and point to it.  It doesn't have to be specifically
> torn down on task exit.

That would be awesome, though for it to be worthwhile there couldn't be
any kernel notion of a context at all, and I'm not sure if that's
practical. But the idea hadn't occurred to me before and I'm sure you've
thought about it more than I have... hrm.

Oh hey, that's what acall does :P

For completions though you really want the ringbuffer pinned... what do
you do about that?

> > > And perhaps obviously, I'd start with the acall stuff :).  It was a lot
> > > lighter.  We could talk about how to make it extensible without going
> > > all the way to the generic packed variable size duplicating or not and
> > > returning or not or.. attributes :).
> > 
> > Link? I haven't heard of acall before.
> 
> I linked to it after that giant silly comment earlier in the thread,
> here it is again:
> 
>   http://lwn.net/Articles/316806/

Oh whoops, hadn't started reading yet - looking at it now :)

> There's a mostly embarrassing video of a jetlagged me giving that talk at
> LCA kicking around.. ah, here:
> 
>  http://mirror.linux.org.au/pub/linux.conf.au/2009/Thursday/131.ogg
> 
> - z

Zach Brown Oct. 10, 2012, 9:43 p.m. UTC | #9
> True. But that could be solved with a separate interface that either
> doesn't use a context to submit a call synchronously, or uses an
> implicit per thread context.

Sure, but why bother if we can make the one submission interface fast
enough to satisfy quick callers?  Less is more, and all that.

> I don't have a _strong_ opinion there, but my intuition is that we
> shouldn't be creating new types of handles without a good reason. I
> don't think the annoyances are for the most part particular to file
> descriptors; I think they tend to be applicable to handles in general, and
> at least with file descriptors they're known and solved.

I strongly disagree.  That descriptors are an expensive, limited
resource is a perfectly good reason to not make them required to access
the ring.

> That would be awesome, though for it to be worthwhile there couldn't be
> any kernel notion of a context at all, and I'm not sure if that's
> practical. But the idea hadn't occurred to me before and I'm sure you've
> thought about it more than I have... hrm.
> 
> Oh hey, that's what acall does :P

:)

> For completions though you really want the ringbuffer pinned... what do
> you do about that?

I don't think the kernel has to mandate that, no.  The code has to deal
with completions faulting, but they probably won't.  In acall it
happened that completions always came from threads that could block so
its coping mechanism was to just use put_user() :).

If userspace wants them rings locked, they can mlock() the memory.

Think about it from another angle: the current mechanism of creating an
aio ring is a way to allocate pinned memory outside of the usual mlock
accounting.  This could be abused, so aio grew an additional tunable to
limit the number of total entries in rings in the system.

By putting the ring in normal user memory we avoid that problem
entirely.

- z

Kent Overstreet Oct. 11, 2012, 2:51 a.m. UTC | #10
On Wed, Oct 10, 2012 at 02:43:15PM -0700, Zach Brown wrote:
> > True. But that could be solved with a separate interface that either
> > doesn't use a context to submit a call synchronously, or uses an
> > implicit per thread context.
> 
> Sure, but why bother if we can make the one submission interface fast
> enough to satisfy quick callers?  Less is more, and all that.

Very true, if it's possible. I'm just still skeptical.

> > I don't have a _strong_ opinion there, but my intuition is that we
> > shouldn't be creating new types of handles without a good reason. I
> > don't think the annoyances are for the most part particular to file
> > descriptors; I think they tend to be applicable to handles in general, and
> > at least with file descriptors they're known and solved.
> 
> I strongly disagree.  That descriptors are an expensive, limited
> resource is a perfectly good reason to not make them required to access
> the ring.

What's so special about aio vs. epoll, and now signalfd/eventfd/timerfd
etc.? 

> > That would be awesome, though for it to be worthwhile there couldn't be
> > any kernel notion of a context at all, and I'm not sure if that's
> > practical. But the idea hadn't occurred to me before and I'm sure you've
> > thought about it more than I have... hrm.
> > 
> > Oh hey, that's what acall does :P
> 
> :)
> 
> > For completions though you really want the ringbuffer pinned... what do
> > you do about that?
> 
> I don't think the kernel has to mandate that, no.  The code has to deal
> with completions faulting, but they probably won't.  In acall it
> happened that completions always came from threads that could block so
> its coping mechanism was to just use put_user() :).

Yeah, but that means the completion has to be delivered from process
context. That's not what aio does today, and it'd be a real performance
regression.

I don't know of a way around that myself.

> If userspace wants them rings locked, they can mlock() the memory.
> 
> Think about it from another angle: the current mechanism of creating an
> aio ring is a way to allocate pinned memory outside of the usual mlock
> accounting.  This could be abused, so aio grew an additional tunable to
> limit the number of total entries in rings in the system.
> 
> By putting the ring in normal user memory we avoid that problem
> entirely.

No different from any other place the kernel allocates memory on behalf
of userspace... it needs a general solution, not a bunch of special case
solutions (though since the general solution is memcg you might argue
the cure is worse than the disease... :P)

Zach Brown Oct. 11, 2012, 4:43 p.m. UTC | #11
> Yeah, but that means the completion has to be delivered from process
> context. That's not what aio does today, and it'd be a real performance
> regression.

It'd only have to complete from process context if it faults. The
cheapest possible delivery mechanism is simple cpu stores.  In the vast
majority of cases the ring will be resident and it'll be done.  In rare
cases it could fall back to a deferred completion.  If apps can't
stomach that latency and want to pay the overhead of pinning to remove
that risk, they're welcome to do so.

That's my hope, anyway.

- z


Patch

diff --git a/fs/aio.c b/fs/aio.c
index c3d97d1..1efc8a3 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -259,7 +259,6 @@  static struct kioctx *ioctx_alloc(unsigned nr_events)
 	atomic_set(&ctx->reqs_active, 1);
 
 	spin_lock_init(&ctx->ctx_lock);
-	spin_lock_init(&ctx->ring_info.ring_lock);
 	init_waitqueue_head(&ctx->wait);
 
 	INIT_LIST_HEAD(&ctx->active_reqs);
@@ -941,194 +940,121 @@  put_rq:
 }
 EXPORT_SYMBOL(aio_complete);
 
-/* aio_read_evt
- *	Pull an event off of the ioctx's event ring.  Returns the number of 
- *	events fetched (0 or 1 ;-)
- *	FIXME: make this use cmpxchg.
- *	TODO: make the ringbuffer user mmap()able (requires FIXME).
+/* aio_read_events
+ *	Pull an event off of the ioctx's event ring.  Returns the number of
+ *	events fetched
  */
-static int aio_read_evt(struct kioctx *ioctx, struct io_event *ent)
+static int aio_read_events(struct kioctx *ctx, struct io_event __user *event,
+			   long nr)
 {
-	struct aio_ring_info *info = &ioctx->ring_info;
+	struct aio_ring_info *info = &ctx->ring_info;
 	struct aio_ring *ring = info->ring;
-	unsigned long head;
-	int ret = 0;
-
-	dprintk("in aio_read_evt h%lu t%lu m%lu\n",
-		 (unsigned long)ring->head, (unsigned long)ring->tail,
-		 (unsigned long)ring->nr);
-
-	if (ring->head == ring->tail)
-		goto out;
-
-	spin_lock(&info->ring_lock);
+	unsigned old, new;
+	int ret;
 
-	head = ring->head % info->nr;
-	if (head != ring->tail) {
-		*ent = ring->io_events[head];
-		head = (head + 1) % info->nr;
-		smp_mb(); /* finish reading the event before updatng the head */
-		ring->head = head;
-		ret = 1;
-	}
-	spin_unlock(&info->ring_lock);
+	pr_debug("h%lu t%lu m%lu\n",
+		 (unsigned long) ring->head,
+		 (unsigned long) ring->tail,
+		 (unsigned long) ring->nr);
+retry:
+	ret = 0;
+	old = new = ring->head;
 
-out:
-	dprintk("leaving aio_read_evt: %d  h%lu t%lu\n", ret,
-		 (unsigned long)ring->head, (unsigned long)ring->tail);
-	return ret;
-}
+	while (ret < nr && new != ring->tail) {
+		struct io_event *ev = &ring->io_events[new];
+		unsigned i = (new < ring->tail ? ring->tail : info->nr) - new;
 
-struct aio_timeout {
-	struct timer_list	timer;
-	int			timed_out;
-	struct task_struct	*p;
-};
+		i = min_t(int, i, nr - ret);
 
-static void timeout_func(unsigned long data)
-{
-	struct aio_timeout *to = (struct aio_timeout *)data;
+		if (unlikely(copy_to_user(event + ret, ev, sizeof(*ev) * i)))
+			return -EFAULT;
 
-	to->timed_out = 1;
-	wake_up_process(to->p);
-}
+		ret += i;
+		new += i;
+		new %= info->nr;
+	}
 
-static inline void init_timeout(struct aio_timeout *to)
-{
-	setup_timer_on_stack(&to->timer, timeout_func, (unsigned long) to);
-	to->timed_out = 0;
-	to->p = current;
-}
+	if (new != old) {
+		smp_mb(); /* finish reading the event before updating the head */
 
-static inline void set_timeout(long start_jiffies, struct aio_timeout *to,
-			       const struct timespec *ts)
-{
-	to->timer.expires = start_jiffies + timespec_to_jiffies(ts);
-	if (time_after(to->timer.expires, jiffies))
-		add_timer(&to->timer);
-	else
-		to->timed_out = 1;
-}
+		if (cmpxchg(&ring->head, old, new) != old)
+			goto retry;
+	}
 
-static inline void clear_timeout(struct aio_timeout *to)
-{
-	del_singleshot_timer_sync(&to->timer);
+	pr_debug("ret %d  h%lu t%lu\n", ret,
+		 (unsigned long)ring->head,
+		 (unsigned long)ring->tail);
+	return ret;
 }
 
-static int read_events(struct kioctx *ctx,
-			long min_nr, long nr,
-			struct io_event __user *event,
-			struct timespec __user *timeout)
+static int read_events(struct kioctx *ctx, long min_nr, long nr,
+		       struct io_event __user *event,
+		       struct timespec __user *timeout)
 {
-	long			start_jiffies = jiffies;
-	struct task_struct	*tsk = current;
-	DECLARE_WAITQUEUE(wait, tsk);
-	int			ret;
-	int			i = 0;
-	struct io_event		ent;
-	struct aio_timeout	to;
-	int			retry = 0;
-
-	/* needed to zero any padding within an entry (there shouldn't be 
-	 * any, but C is fun!
-	 */
-	memset(&ent, 0, sizeof(ent));
+	DEFINE_WAIT(wait);
+	long until = MAX_SCHEDULE_TIMEOUT;
+	size_t i = 0;
+	int ret, retry = 0;
 retry:
-	ret = 0;
-	while (likely(i < nr)) {
-		ret = aio_read_evt(ctx, &ent);
-		if (unlikely(ret <= 0))
-			break;
-
-		dprintk("read event: %Lx %Lx %Lx %Lx\n",
-			ent.data, ent.obj, ent.res, ent.res2);
-
-		/* Could we split the check in two? */
-		ret = -EFAULT;
-		if (unlikely(copy_to_user(event, &ent, sizeof(ent)))) {
-			dprintk("aio: lost an event due to EFAULT.\n");
-			break;
-		}
-		ret = 0;
-
-		/* Good, event copied to userland, update counts. */
-		event ++;
-		i ++;
-	}
+	ret = aio_read_events(ctx, event + i, nr - i);
+	if (ret < 0)
+		return ret;
 
-	if (min_nr <= i)
+	i += ret;
+	if (i >= min_nr)
 		return i;
-	if (ret)
-		return ret;
 
 	/* End fast path */
 
 	/* racey check, but it gets redone */
+	/* XXX: wtf is this for? */
 	if (!retry && unlikely(!list_empty(&ctx->run_list))) {
 		retry = 1;
 		aio_run_all_iocbs(ctx);
 		goto retry;
 	}
 
-	init_timeout(&to);
 	if (timeout) {
 		struct timespec	ts;
-		ret = -EFAULT;
 		if (unlikely(copy_from_user(&ts, timeout, sizeof(ts))))
-			goto out;
+			return -EFAULT;
 
-		set_timeout(start_jiffies, &to, &ts);
+		until = timespec_to_jiffies(&ts);
 	}
 
-	while (likely(i < nr)) {
-		add_wait_queue_exclusive(&ctx->wait, &wait);
-		do {
-			set_task_state(tsk, TASK_INTERRUPTIBLE);
-			ret = aio_read_evt(ctx, &ent);
-			if (ret)
-				break;
-			if (min_nr <= i)
-				break;
-			if (unlikely(ctx->dead)) {
-				ret = -EINVAL;
-				break;
-			}
-			if (to.timed_out)	/* Only check after read evt */
-				break;
-			/* Try to only show up in io wait if there are ops
-			 *  in flight */
-			if (atomic_read(&ctx->reqs_active) > 1)
-				io_schedule();
-			else
-				schedule();
-			if (signal_pending(tsk)) {
-				ret = -EINTR;
-				break;
-			}
-			/*ret = aio_read_evt(ctx, &ent);*/
-		} while (1) ;
+	while (i < nr) {
+		prepare_to_wait_exclusive(&ctx->wait, &wait,
+					  TASK_INTERRUPTIBLE);
 
-		set_task_state(tsk, TASK_RUNNING);
-		remove_wait_queue(&ctx->wait, &wait);
+		ret = aio_read_events(ctx, event + i, nr - i);
+		if (ret < 0)
+			break;
 
-		if (unlikely(ret <= 0))
+		i += ret;
+		if (i >= min_nr)
 			break;
 
-		ret = -EFAULT;
-		if (unlikely(copy_to_user(event, &ent, sizeof(ent)))) {
-			dprintk("aio: lost an event due to EFAULT.\n");
+		if (unlikely(ctx->dead)) {
+			ret = -EINVAL;
 			break;
 		}
 
-		/* Good, event copied to userland, update counts. */
-		event ++;
-		i ++;
+		if (!until)	/* Only check after read evt */
+			break;
+
+		/* Try to only show up in io wait if there are ops in flight */
+		if (atomic_read(&ctx->reqs_active) > 1)
+			until = io_schedule_timeout(until);
+		else
+			until = schedule_timeout(until);
+
+		if (signal_pending(current)) {
+			ret = -EINTR;
+			break;
+		}
 	}
 
-	if (timeout)
-		clear_timeout(&to);
-out:
-	destroy_timer_on_stack(&to.timer);
+	finish_wait(&ctx->wait, &wait);
 	return i ? i : ret;
 }
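
The lockless consume loop introduced above can be modeled in userspace. The sketch below is a hypothetical, simplified stand-in: the `ring` struct models only the `head`/`tail`/`nr` fields and flat event array of `struct aio_ring`, GCC's `__sync_bool_compare_and_swap` builtin stands in for the kernel's `cmpxchg()`, and `memcpy` replaces `copy_to_user()`. It is a model of the technique, not the kernel code itself.

```c
/* Userspace model of the lockless aio_read_events() consume path.
 * 'struct ring' is a hypothetical, simplified stand-in for the kernel's
 * struct aio_ring; only head, tail, nr, and the event array are modeled. */
#include <string.h>

#define RING_NR 8

struct io_event { unsigned long data; };

struct ring {
	unsigned head;		/* consumer index, advanced via CAS */
	unsigned tail;		/* producer index, advanced by the kernel */
	unsigned nr;		/* number of slots in io_events[] */
	struct io_event io_events[RING_NR];
};

/* Pull up to 'nr' events into 'out'; returns how many were copied.
 * Mirrors the retry loop in aio_read_events(): sample head, copy the
 * contiguous run(s) up to tail (wrapping at nr), then publish the new
 * head with a single compare-and-swap.  If another consumer moved head
 * in the meantime, the CAS fails and the whole read is retried. */
static int ring_consume(struct ring *r, struct io_event *out, int nr)
{
	unsigned old, new;
	int ret;

retry:
	ret = 0;
	old = new = r->head;

	while (ret < nr && new != r->tail) {
		/* events are contiguous up to tail, or to the end of the ring */
		unsigned run = (new < r->tail ? r->tail : r->nr) - new;

		if ((int) run > nr - ret)
			run = nr - ret;

		memcpy(out + ret, &r->io_events[new], run * sizeof(*out));
		ret += run;
		new = (new + run) % r->nr;
	}

	if (new != old &&
	    !__sync_bool_compare_and_swap(&r->head, old, new))
		goto retry;	/* lost the race; re-read from the new head */

	return ret;
}
```

Note that this model inherits the hazard raised in review: if the index wraps all the way around between the initial sample of `head` and the CAS, the CAS still succeeds on the stale-but-equal value (the classic ABA problem), so the comparison alone does not prove nothing changed in between.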