Message ID | 1349764760-21093-4-git-send-email-koverstreet@google.com (mailing list archive) |
---|---|
State | Deferred, archived |
On Mon, Oct 08, 2012 at 11:39:19PM -0700, Kent Overstreet wrote:
> It simplifies a lot of stuff if the ringbuffer is contiguously mapped
> into kernel space, and we can delete a lot of code - in particular, this
> is useful for converting read_events() to cmpxchg.

1) I'm concerned that Our Favourite Database Benchmarkers will see an
effect of having more TLB pressure from that mapping. kmap_atomic()
should be cheaper. We'll need to measure the difference.

2) I'm not at all convinced that the current ring semantics are safe
with cmpxchg(), more in the next patch.

- z
On Tue, Oct 09, 2012 at 11:29:49AM -0700, Zach Brown wrote:
> On Mon, Oct 08, 2012 at 11:39:19PM -0700, Kent Overstreet wrote:
> > It simplifies a lot of stuff if the ringbuffer is contiguously mapped
> > into kernel space, and we can delete a lot of code - in particular, this
> > is useful for converting read_events() to cmpxchg.
>
> 1) I'm concerned that Our Favourite Database Benchmarkers will see an
> effect of having more TLB pressure from that mapping. kmap_atomic()
> should be cheaper. We'll need to measure the difference.

If it is measurable I'll take another stab at using memory from
__get_free_pages() for the ringbuffer. That really would be the ideal
solution.

The other reason I wanted to do this was for the aio attributes stuff -
for return values, I think the only sane way is for the return values to
go in the ringbuffer, which means records are no longer fixed size so
dealing with pages is even more of a pain.
> If it is measurable I'll take another stab at using memory from
> __get_free_pages() for the ringbuffer. That really would be the ideal
> solution.

No, then you'll run into high order allocation failures with rings that
don't fit in a single page.

> The other reason I wanted to do this was for the aio attributes stuff -
> for return values, I think the only sane way is for the return values to
> go in the ringbuffer, which means records are no longer fixed size so
> dealing with pages is even more of a pain.

Then let's see that, please.

And can we please stop calling them attributes? They're inputs and
outputs that change behaviour -- they're interfaces.

And no, just for the record, I don't think generic packed variable size
structs are worth the trouble.

If we're going to do a generic interface extension mechanism then we
should put it in its own well thought out system calls, not staple it on
to the side of aio because it's there. It's a really crummy base to
work from.

- z
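A quick way to see the high-order allocation concern in concrete terms: a single-allocation ring has to come from the buddy allocator in one contiguous chunk. The standalone sketch below is an illustration only, not code from the thread; the 4 KiB page, 32-byte io_event and 128-byte ring header sizes are typical x86-64 assumptions.

/* Standalone sketch: how many contiguous pages a single __get_free_pages()
 * ring would need for a given io_setup() size, using the same rounding as
 * the kernel's get_order(). Sizes are typical x86-64 values (assumed).
 */
#include <stdio.h>

#define PAGE_SIZE_BYTES	4096UL
#define RING_HEADER	128UL	/* roughly sizeof(struct aio_ring) */
#define EVENT_SIZE	32UL	/* sizeof(struct io_event) */

static unsigned order_for(unsigned long nr_events)
{
	unsigned long size = RING_HEADER + nr_events * EVENT_SIZE;
	unsigned long pages = (size + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES;
	unsigned order = 0;

	while ((1UL << order) < pages)	/* round up to a power-of-two page count */
		order++;
	return order;
}

int main(void)
{
	unsigned long nr;

	/* io_setup(1024) already wants an order-4 (16-page) allocation,
	 * which is the kind that starts failing once memory is fragmented.
	 */
	for (nr = 128; nr <= 8192; nr *= 2)
		printf("io_setup(%lu): order-%u allocation\n", nr, order_for(nr));
	return 0;
}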
On Tue, Oct 09, 2012 at 03:32:10PM -0700, Zach Brown wrote:
> > If it is measurable I'll take another stab at using memory from
> > __get_free_pages() for the ringbuffer. That really would be the ideal
> > solution.
>
> No, then you'll run into high order allocation failures with rings that
> don't fit in a single page.

Not if we decouple the ringbuffer size from max_requests.

This would be useful to do anyways because right now, allocating a kiocb
has to take a global refcount and check head and tail in the ringbuffer
just so it can avoid overflowing the ringbuffer.

If we change aio_complete() so that if the ringbuffer is full then the
kiocb just goes on a linked list - we can size the ringbuffer so this
doesn't happen normally and avoid the global synchronization in the fast
path.

> > The other reason I wanted to do this was for the aio attributes stuff -
> > for return values, I think the only sane way is for the return values to
> > go in the ringbuffer, which means records are no longer fixed size so
> > dealing with pages is even more of a pain.
>
> Then let's see that, please.

I was starting on that, but then I got sidetracked with refactoring... :P

> And can we please stop calling them attributes? They're inputs and
> outputs that change behaviour -- they're interfaces.

Attributes isn't a good name but neither is interfaces, because they
don't exist on their own; they're always attached to some other
interface. I dunno.

> And no, just for the record, I don't think generic packed variable size
> structs are worth the trouble.
>
> If we're going to do a generic interface extension mechanism then we
> should put it in its own well thought out system calls, not staple it on
> to the side of aio because it's there. It's a really crummy base to
> work from.

Not arguing with you about aio, but most of the use cases I have for it
want aio. So unless we're going to deprecate the existing aio interfaces
and make something better (I wouldn't complain about that!) I do need to
make it work with aio.

Not that I'm opposed to new syscalls passing attributes to sync versions
of read/write/etc, I just haven't started that yet or really thought
about it.
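The aio_complete() change being proposed - ring insert in the common case, falling back to a list only when the ring is full - would look roughly like the following. This is a hypothetical, simplified sketch, not code from this series; completion_lock and overflow_list are invented names used only for illustration.

/* Hypothetical sketch of the proposed completion path: a deliberately
 * small ring, with overflow handled on a per-ioctx list instead of a
 * global in-flight count at submit time. completion_lock and
 * overflow_list do not exist in the posted patches.
 */
static void complete_to_ring_or_list(struct kioctx *ctx, struct kiocb *iocb,
				     long res, long res2)
{
	struct aio_ring_info *info = &ctx->ring_info;
	struct aio_ring *ring = info->ring;
	unsigned long flags;
	unsigned tail;

	spin_lock_irqsave(&ctx->completion_lock, flags);

	tail = ring->tail;
	if ((tail + 1) % info->nr == ring->head) {
		/* Ring full: park the completion until there is room. */
		list_add_tail(&iocb->ki_list, &ctx->overflow_list);
	} else {
		struct io_event *ev = &ring->io_events[tail];

		ev->obj  = (u64)(unsigned long)iocb->ki_obj.user;
		ev->data = iocb->ki_user_data;
		ev->res  = res;
		ev->res2 = res2;

		smp_wmb();	/* event contents visible before the tail update */
		ring->tail = (tail + 1) % info->nr;
	}

	spin_unlock_irqrestore(&ctx->completion_lock, flags);
}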
> Not if we decouple the ringbuffer size from max_requests.

Hmm, interesting.

> This would be useful to do anyways because right now, allocating a kiocb
> has to take a global refcount and check head and tail in the ringbuffer
> just so it can avoid overflowing the ringbuffer.

I'm not sure what you mean by a 'global refcount'.. do you mean the
per-mm ctx_lock?

> If we change aio_complete() so that if the ringbuffer is full then the
> kiocb just goes on a linked list - we can size the ringbuffer so this
> doesn't happen normally and avoid the global synchronization in the fast
> path.

How would completion events make their way from the list to the ring if
an app is only checking the ring for completions from userspace?

- z
On Tue, Oct 09, 2012 at 03:58:36PM -0700, Zach Brown wrote:
> > Not if we decouple the ringbuffer size from max_requests.
>
> Hmm, interesting.
>
> > This would be useful to do anyways because right now, allocating a kiocb
> > has to take a global refcount and check head and tail in the ringbuffer
> > just so it can avoid overflowing the ringbuffer.
>
> I'm not sure what you mean by a 'global refcount'.. do you mean the
> per-mm ctx_lock?

kioctx->reqs_active. We just have to keep that count somehow if we're
going to avoid overflowing the ringbuffer.

> > If we change aio_complete() so that if the ringbuffer is full then the
> > kiocb just goes on a linked list - we can size the ringbuffer so this
> > doesn't happen normally and avoid the global synchronization in the fast
> > path.
>
> How would completion events make their way from the list to the ring if
> an app is only checking the ring for completions from userspace?

Either they'd have to make a syscall when the ringbuffer is empty - which
should be fine, because at least for most apps all they could do is sleep
or spin. Alternately, we could add a flag next to the tail pointer in the
ringbuffer that the kernel would maintain, if userspace wants to be able
to avoid that syscall when it's not necessary.

Although - since current libaio skips the syscall if the ringbuffer is
empty, this is yet another thing we can't do with the current ringbuffer.
Crap.

Well, we should be able to hack around it with the existing
ringbuffer... normally, if the ringbuffer is full and stuff goes on the
list, then there's going to be more completions coming later and stuff
would get pulled off the list then.

The only situation you have to worry about is when the ringbuffer fills
up and stuff goes on the list, and then completions completely stop -
this should be a rare enough situation that maybe we could just hack
around it with a timer that gets flipped on when the list isn't empty.

Also, for this to be an issue at all, _all_ the reaping would have to be
done from userspace - since existing libaio doesn't do that, there may
not be any code out there which triggers it.
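For reference, the userspace fast path being discussed looks roughly like the sketch below - approximately what libaio's io_getevents() wrapper does against the mmap'd ring. The struct layout mirrors the kernel's struct aio_ring rather than any installed header, and the "ring looks empty, fall back to the syscall" branch is where a kernel-maintained overflow flag (hypothetical, not part of the current ABI) would have to be checked.

/* Illustrative sketch of user-space completion reaping against the mmap'd
 * ring. Treating the aio_context_t from io_setup() as the ring's address
 * is how libaio finds it; the struct below mirrors the kernel's
 * struct aio_ring (assumed layout, not from a userspace header).
 */
#include <stdint.h>
#include <linux/aio_abi.h>	/* aio_context_t, struct io_event */

struct user_aio_ring {
	unsigned id, nr, head, tail;
	unsigned magic, compat_features, incompat_features, header_length;
	struct io_event io_events[];
};

/* Returns 1 and fills *ev if an event was reaped; returns 0 when the ring
 * looks empty, in which case the caller falls back to the io_getevents()
 * syscall.
 */
static int reap_event(aio_context_t ctx_id, struct io_event *ev)
{
	struct user_aio_ring *ring = (struct user_aio_ring *)(uintptr_t)ctx_id;
	unsigned head = ring->head;

	if (head == ring->tail)
		return 0;

	*ev = ring->io_events[head];
	__sync_synchronize();	/* read the event before publishing the new head */
	ring->head = (head + 1) % ring->nr;
	return 1;
}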
> The only situation you have to worry about is when the ringbuffer fills
> up and stuff goes on the list, and then completions completely stop -
> this should be a rare enough situation that maybe we could just hack
> around it with a timer that gets flipped on when the list isn't empty.

Right. And this is when we hopefully realize that we're adding overhead
and complexity (how long's the timer? always fire it? MAKE IT STOP)
and are actually making the system worse, not better.

> Also, for this to be an issue at all, _all_ the reaping would have to be
> done from userspace - since existing libaio doesn't do that, there may
> not be any code out there which triggers it.

And there may be. We default to not breaking interfaces. Seriously.

- z
On Tue, Oct 09, 2012 at 05:36:26PM -0700, Zach Brown wrote:
> > The only situation you have to worry about is when the ringbuffer fills
> > up and stuff goes on the list, and then completions completely stop -
> > this should be a rare enough situation that maybe we could just hack
> > around it with a timer that gets flipped on when the list isn't empty.
>
> Right. And this is when we hopefully realize that we're adding overhead
> and complexity (how long's the timer? always fire it? MAKE IT STOP)
> and are actually making the system worse, not better.

Can still prototype it, and if it's that ugly... I throw away code all
the time :P

> > Also, for this to be an issue at all, _all_ the reaping would have to be
> > done from userspace - since existing libaio doesn't do that, there may
> > not be any code out there which triggers it.
>
> And there may be. We default to not breaking interfaces. Seriously.

All the more reason to think about it now so we don't screw up the next
interfaces :)
diff --git a/fs/aio.c b/fs/aio.c
index 3ab12f6..c3d97d1 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -82,6 +82,9 @@ static void aio_free_ring(struct kioctx *ctx)
 	struct aio_ring_info *info = &ctx->ring_info;
 	long i;
 
+	if (info->ring)
+		vunmap(info->ring);
+
 	for (i=0; i<info->nr_pages; i++)
 		put_page(info->ring_pages[i]);
 
@@ -99,7 +102,6 @@
 
 static int aio_setup_ring(struct kioctx *ctx)
 {
-	struct aio_ring *ring;
 	struct aio_ring_info *info = &ctx->ring_info;
 	unsigned nr_events = ctx->max_reqs;
 	unsigned long size;
@@ -149,46 +151,27 @@ static int aio_setup_ring(struct kioctx *ctx)
 		return -EAGAIN;
 	}
 
+	info->ring = vmap(info->ring_pages, nr_pages, VM_MAP, PAGE_KERNEL);
+	if (!info->ring) {
+		aio_free_ring(ctx);
+		return -ENOMEM;
+	}
+
 	ctx->user_id = info->mmap_base;
 
 	info->nr = nr_events;		/* trusted copy */
 
-	ring = kmap_atomic(info->ring_pages[0]);
-	ring->nr = nr_events;	/* user copy */
-	ring->id = ctx->user_id;
-	ring->head = ring->tail = 0;
-	ring->magic = AIO_RING_MAGIC;
-	ring->compat_features = AIO_RING_COMPAT_FEATURES;
-	ring->incompat_features = AIO_RING_INCOMPAT_FEATURES;
-	ring->header_length = sizeof(struct aio_ring);
-	kunmap_atomic(ring);
+	info->ring->nr = nr_events;	/* user copy */
+	info->ring->id = ctx->user_id;
+	info->ring->head = info->ring->tail = 0;
+	info->ring->magic = AIO_RING_MAGIC;
+	info->ring->compat_features = AIO_RING_COMPAT_FEATURES;
+	info->ring->incompat_features = AIO_RING_INCOMPAT_FEATURES;
+	info->ring->header_length = sizeof(struct aio_ring);
 
 	return 0;
 }
-
-/* aio_ring_event: returns a pointer to the event at the given index from
- * kmap_atomic().  Release the pointer with put_aio_ring_event();
- */
-#define AIO_EVENTS_PER_PAGE	(PAGE_SIZE / sizeof(struct io_event))
-#define AIO_EVENTS_FIRST_PAGE	((PAGE_SIZE - sizeof(struct aio_ring)) / sizeof(struct io_event))
-#define AIO_EVENTS_OFFSET	(AIO_EVENTS_PER_PAGE - AIO_EVENTS_FIRST_PAGE)
-
-#define aio_ring_event(info, nr) ({					\
-	unsigned pos = (nr) + AIO_EVENTS_OFFSET;			\
-	struct io_event *__event;					\
-	__event = kmap_atomic(						\
-			(info)->ring_pages[pos / AIO_EVENTS_PER_PAGE]); \
-	__event += pos % AIO_EVENTS_PER_PAGE;				\
-	__event;							\
-})
-
-#define put_aio_ring_event(event) do {					\
-	struct io_event *__event = (event);				\
-	(void)__event;							\
-	kunmap_atomic((void *)((unsigned long)__event & PAGE_MASK));	\
-} while(0)
-
 
 static void free_ioctx(struct work_struct *work)
 {
 	struct kioctx *ctx = container_of(work, struct kioctx, free_work);
@@ -465,7 +448,6 @@ static int kiocb_batch_refill(struct kioctx *ctx, struct kiocb_batch *batch)
 	unsigned short allocated, to_alloc;
 	long avail;
 	struct kiocb *req, *n;
-	struct aio_ring *ring;
 
 	to_alloc = min(batch->count, KIOCB_BATCH_SIZE);
 	for (allocated = 0; allocated < to_alloc; allocated++) {
@@ -480,9 +462,8 @@ static int kiocb_batch_refill(struct kioctx *ctx, struct kiocb_batch *batch)
 		goto out;
 
 	spin_lock_irq(&ctx->ctx_lock);
-	ring = kmap_atomic(ctx->ring_info.ring_pages[0]);
-	avail = aio_ring_avail(&ctx->ring_info, ring) - atomic_read(&ctx->reqs_active);
+	avail = aio_ring_avail(&ctx->ring_info) - atomic_read(&ctx->reqs_active);
 	BUG_ON(avail < 0);
 	if (avail < allocated) {
 		/* Trim back the number of requests.
 		 */
@@ -500,7 +481,6 @@ static int kiocb_batch_refill(struct kioctx *ctx, struct kiocb_batch *batch)
 		atomic_inc(&ctx->reqs_active);
 	}
 
-	kunmap_atomic(ring);
 	spin_unlock_irq(&ctx->ctx_lock);
 
 out:
@@ -870,10 +850,9 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
 {
 	struct kioctx *ctx = iocb->ki_ctx;
 	struct aio_ring_info *info;
-	struct aio_ring *ring;
 	struct io_event *event;
 	unsigned long flags;
-	unsigned long tail;
+	unsigned tail;
 
 	/*
 	 * Special case handling for sync iocbs:
@@ -892,7 +871,8 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
 
 	info = &ctx->ring_info;
 
-	/* add a completion event to the ring buffer.
+	/*
+	 * add a completion event to the ring buffer.
 	 * must be done holding ctx->ctx_lock to prevent
 	 * other code from messing with the tail
 	 * pointer since we might be called from irq
@@ -910,10 +890,8 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
 	if (kiocbIsCancelled(iocb))
 		goto put_rq;
 
-	ring = kmap_atomic(info->ring_pages[0]);
-
 	tail = info->tail;
-	event = aio_ring_event(info, tail);
+	event = &info->ring->io_events[tail];
 	if (++tail >= info->nr)
 		tail = 0;
 
@@ -922,9 +900,9 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
 	event->res = res;
 	event->res2 = res2;
 
-	dprintk("aio_complete: %p[%lu]: %p: %p %Lx %lx %lx\n",
-		ctx, tail, iocb, iocb->ki_obj.user, iocb->ki_user_data,
-		res, res2);
+	pr_debug("%p[%u]: %p: %p %Lx %lx %lx\n",
+		 ctx, tail, iocb, iocb->ki_obj.user, iocb->ki_user_data,
+		 res, res2);
 
 	/* after flagging the request as done, we
 	 * must never even look at it again
@@ -932,12 +910,9 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
 	smp_wmb();	/* make event visible before updating tail */
 
 	info->tail = tail;
-	ring->tail = tail;
-
-	put_aio_ring_event(event);
-	kunmap_atomic(ring);
+	info->ring->tail = tail;
 
-	pr_debug("added to ring %p at [%lu]\n", iocb, tail);
+	pr_debug("added to ring %p at [%u]\n", iocb, tail);
 
 	/*
 	 * Check if the user asked us to deliver the result through an
@@ -975,11 +950,10 @@ EXPORT_SYMBOL(aio_complete);
 static int aio_read_evt(struct kioctx *ioctx, struct io_event *ent)
 {
 	struct aio_ring_info *info = &ioctx->ring_info;
-	struct aio_ring *ring;
+	struct aio_ring *ring = info->ring;
 	unsigned long head;
 	int ret = 0;
 
-	ring = kmap_atomic(info->ring_pages[0]);
 	dprintk("in aio_read_evt h%lu t%lu m%lu\n",
 		 (unsigned long)ring->head, (unsigned long)ring->tail,
 		 (unsigned long)ring->nr);
@@ -991,18 +965,15 @@ static int aio_read_evt(struct kioctx *ioctx, struct io_event *ent)
 
 	head = ring->head % info->nr;
 	if (head != ring->tail) {
-		struct io_event *evp = aio_ring_event(info, head);
-		*ent = *evp;
+		*ent = ring->io_events[head];
 		head = (head + 1) % info->nr;
 		smp_mb(); /* finish reading the event before updatng the head */
 		ring->head = head;
 		ret = 1;
-		put_aio_ring_event(evp);
 	}
 	spin_unlock(&info->ring_lock);
 
 out:
-	kunmap_atomic(ring);
 	dprintk("leaving aio_read_evt: %d h%lu t%lu\n", ret,
 		 (unsigned long)ring->head, (unsigned long)ring->tail);
 	return ret;
diff --git a/include/linux/aio.h b/include/linux/aio.h
index eb6e5e4..150a4b7 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -161,6 +161,7 @@ struct aio_ring {
 
 #define AIO_RING_PAGES	8
 struct aio_ring_info {
+	struct aio_ring		*ring;
 	unsigned long		mmap_base;
 	unsigned long		mmap_size;
 
@@ -173,10 +174,10 @@ struct aio_ring_info {
 	struct page		*internal_pages[AIO_RING_PAGES];
 };
 
-static inline unsigned aio_ring_avail(struct aio_ring_info *info,
-					struct aio_ring *ring)
+static inline unsigned aio_ring_avail(struct aio_ring_info *info)
 {
-	return (ring->head + info->nr - 1 - ring->tail) % info->nr;
+	return (info->ring->head + info->nr - 1 - info->ring->tail) %
+		info->nr;
 }
 
 struct kioctx {
It simplifies a lot of stuff if the ringbuffer is contiguously mapped
into kernel space, and we can delete a lot of code - in particular, this
is useful for converting read_events() to cmpxchg.

It'd make more sense if the ringbuffer was allocated with
__get_free_pages() and then mapped into userspace, but I couldn't figure
out how to sanely do that... so vmap works for now.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
---
 fs/aio.c            | 85 +++++++++++++++++----------------------------------
 include/linux/aio.h |  7 +++--
 2 files changed, 32 insertions(+), 60 deletions(-)