
[4/5] aio: vmap ringbuffer

Message ID 1349764760-21093-4-git-send-email-koverstreet@google.com (mailing list archive)
State Deferred, archived

Commit Message

Kent Overstreet Oct. 9, 2012, 6:39 a.m. UTC
It simplifies a lot of stuff if the ringbuffer is contiguously mapped
into kernel space, and we can delete a lot of code - in particular, this
is useful for converting read_events() to cmpxchg.

It'd make more sense if the ringbuffer was allocated with
__get_free_pages() and then mapped into userspace, but I couldn't figure
out how to sanely do that... so vmap works for now.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
---
 fs/aio.c            |   85 +++++++++++++++++----------------------------------
 include/linux/aio.h |    7 +++--
 2 files changed, 32 insertions(+), 60 deletions(-)

Comments

Zach Brown Oct. 9, 2012, 6:29 p.m. UTC | #1
On Mon, Oct 08, 2012 at 11:39:19PM -0700, Kent Overstreet wrote:
> It simplifies a lot of stuff if the ringbuffer is contiguously mapped
> into kernel space, and we can delete a lot of code - in particular, this
> is useful for converting read_events() to cmpxchg.

1) I'm concerned that Our Favourite Database Benchmarkers will see an
effect of having more TLB pressure from that mapping.  kmap_atomic()
should be cheaper.  We'll need to measure the difference.
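
To make the trade-off concrete, here is a rough sketch of the two access
patterns under discussion; read_event_kmap()/read_event_vmap() are
hypothetical helpers invented for illustration, and the header-page offset
handling is omitted - the real code is in the patch at the bottom of this
page.

#include <linux/aio.h>
#include <linux/highmem.h>

#define AIO_EVENTS_PER_PAGE	(PAGE_SIZE / sizeof(struct io_event))

/* Per-page access, as in current fs/aio.c: no long-lived kernel mapping,
 * but every access pays a kmap_atomic()/kunmap_atomic() pair. */
static void read_event_kmap(struct aio_ring_info *info, unsigned pos,
			    struct io_event *out)
{
	struct io_event *ev =
		kmap_atomic(info->ring_pages[pos / AIO_EVENTS_PER_PAGE]);

	*out = ev[pos % AIO_EVENTS_PER_PAGE];
	kunmap_atomic(ev);
}

/* Access through the contiguous vmap()ed mapping this patch adds: plain
 * loads and stores, at the cost of keeping the ring's vmalloc-space
 * mapping (and its TLB entries) around for the life of the context. */
static void read_event_vmap(struct aio_ring_info *info, unsigned pos,
			    struct io_event *out)
{
	*out = info->ring->io_events[pos];
}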

2) I'm not at all convinced that the current ring semantics are safe
with cmpxchg(); more on that in the next patch.

- z

Kent Overstreet Oct. 9, 2012, 9:31 p.m. UTC | #2
On Tue, Oct 09, 2012 at 11:29:49AM -0700, Zach Brown wrote:
> On Mon, Oct 08, 2012 at 11:39:19PM -0700, Kent Overstreet wrote:
> > It simplifies a lot of stuff if the ringbuffer is contiguously mapped
> > into kernel space, and we can delete a lot of code - in particular, this
> > is useful for converting read_events() to cmpxchg.
> 
> 1) I'm concerned that Our Favourite Database Benchmarkers will see an
> effect of having more TLB pressure from that mapping.  kmap_atomic()
> should be cheaper.  We'll need to measure the difference.

If it is measurable I'll take another stab at using memory from
__get_free_pages() for the ringbuffer. That really would be the ideal
solution.

The other reason I wanted to do this was for the aio attributes stuff -
for return values, I think the only sane way is for the return values to
go in the ringbuffer, which means records are no longer fixed size so
dealing with pages is even more of a pain.
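
Just to make the "no longer fixed size" point concrete, a purely
hypothetical record layout - io_event_v2 and all of its fields are made up
for illustration and aren't part of this series:

#include <linux/types.h>

struct io_event_v2 {
	__u64	data;		/* same roles as in struct io_event */
	__u64	obj;
	__s64	res;
	__s64	res2;
	__u32	attr_bytes;	/* length of the trailing attribute data */
	__u32	pad;
	__u8	attr_data[];	/* per-attribute return values follow */
};

Readers would have to walk the ring record by record, advancing by
sizeof(struct io_event_v2) + attr_bytes each time, which is what makes the
page-at-a-time kmap_atomic() scheme painful.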

Zach Brown Oct. 9, 2012, 10:32 p.m. UTC | #3
> If it is measurable I'll take another stab at using memory from
> __get_free_pages() for the ringbuffer. That really would be the ideal
> solution.

No, then you'll run into high order allocation failures with rings that
don't fit in a single page.

> The other reason I wanted to do this was for the aio attributes stuff -
> for return values, I think the only sane way is for the return values to
> go in the ringbuffer, which means records are no longer fixed size so
> dealing with pages is even more of a pain.

Then let's see that, please.

And can we please stop calling them attributes?  They're inputs and
outputs that change behaviour -- they're interfaces.

And no, just for the record, I don't think generic packed variable size
structs are worth the trouble.

If we're going to do a generic interface extension mechanism then we
should put it in its own well thought out system calls, not staple it on
to the side of aio because it's there.  It's a really crummy base to
work from.

- z

Kent Overstreet Oct. 9, 2012, 10:44 p.m. UTC | #4
On Tue, Oct 09, 2012 at 03:32:10PM -0700, Zach Brown wrote:
> > If it is measurable I'll take another stab at using memory from
> > __get_free_pages() for the ringbuffer. That really would be the ideal
> > solution.
> 
> No, then you'll run into high order allocation failures with rings that
> don't fit in a single page.

Not if we decouple the ringbuffer size from max_requests.

This would be useful to do anyways because right now, allocating a kiocb
has to take a global refcount and check head and tail in the ringbuffer
just so it can avoid overflowing the ringbuffer.

If we change aio_complete() so that if the ringbuffer is full then the
kiocb just goes on a linked list - we can size the ringbuffer so this
doesn't happen normally and avoid the global synchronization in the fast
path.
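
A minimal sketch of that idea - ctx->overflow_list is hypothetical (nothing
like it exists in this series), locking is simplified (the real completion
path runs under ctx->ctx_lock with interrupts disabled), and error handling
is omitted:

#include <linux/aio.h>
#include <linux/list.h>

static void aio_post_event(struct kioctx *ctx, struct kiocb *iocb,
			   long res, long res2)
{
	struct aio_ring_info *info = &ctx->ring_info;

	if (!aio_ring_avail(info)) {
		/* Ring full: park the kiocb and drain it on a later
		 * completion or the next time userspace calls in. */
		list_add_tail(&iocb->ki_list, &ctx->overflow_list);
		return;
	}

	info->ring->io_events[info->tail] = (struct io_event) {
		.data = iocb->ki_user_data,
		.obj  = (u64)(unsigned long)iocb->ki_obj.user,
		.res  = res,
		.res2 = res2,
	};

	smp_wmb();	/* publish the event before moving the tail */
	info->tail = (info->tail + 1) % info->nr;
	info->ring->tail = info->tail;
}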

> > The other reason I wanted to do this was for the aio attributes stuff -
> > for return values, I think the only sane way is for the return values to
> > go in the ringbuffer, which means records are no longer fixed size so
> > dealing with pages is even more of a pain.
> 
> Then let's see that, please.

I was starting on that, but then I got sidetracked with refactoring...
:P

> And can we please stop calling them attributes?  They're inputs and
> outputs that change behaviour -- they're interfaces.

Attributes isn't a good name but neither is interfaces, because they
don't exist on their own; they're always attached to some other
interface.

I dunno.

> And no, just for the record, I don't think generic packed variable size
> structs are worth the trouble.
> 
> If we're going to do a generic interface extension mechanism then we
> should put it in its own well thought out system calls, not staple it on
> to the side of aio because it's there.  It's a really crummy base to
> work from.

Not arguing with you about aio, but most of the use cases I have for it
want aio.

So unless we're going to deprecate the existing aio interfaces and make
something better (I wouldn't complain about that!), I do need to make it
work with aio.

Not that I'm opposed to new syscalls for passing attributes to sync
versions of read/write/etc.; I just haven't started on that yet or really
thought about it.

Zach Brown Oct. 9, 2012, 10:58 p.m. UTC | #5
> Not if we decouple the ringbuffer size from max_requests.

Hmm, interesting.

> This would be useful to do anyways because right now, allocating a kiocb
> has to take a global refcount and check head and tail in the ringbuffer
> just so it can avoid overflowing the ringbuffer.

I'm not sure what you mean by a 'global refcount'.. do you mean the
per-mm ctx_lock?

> If we change aio_complete() so that if the ringbuffer is full then the
> kiocb just goes on a linked list - we can size the ringbuffer so this
> doesn't happen normally and avoid the global synchronization in the fast
> path.

How would completion events make their way from the list to the ring if
an app is only checking the ring for completions from userspace?

- z

Kent Overstreet Oct. 10, 2012, 12:16 a.m. UTC | #6
On Tue, Oct 09, 2012 at 03:58:36PM -0700, Zach Brown wrote:
> > Not if we decouple the ringbuffer size from max_requests.
> 
> Hmm, interesting.
> 
> > This would be useful to do anyways because right now, allocating a kiocb
> > has to take a global refcount and check head and tail in the ringbuffer
> > just so it can avoid overflowing the ringbuffer.
> 
> I'm not sure what you mean by a 'global refcount'.. do you mean the
> per-mm ctx_lock?

kioctx->reqs_active. We just have to keep that count somehow if we're
going to avoid overflowing the ringbuffer.

> > If we change aio_complete() so that if the ringbuffer is full then the
> > kiocb just goes on a linked list - we can size the ringbuffer so this
> > doesn't happen normally and avoid the global synchronization in the fast
> > path.
> 
> How would completion events make their way from the list to the ring if
> an app is only checking the ring for completions from userspace?

Either they'd have to make a syscall when the ringbuffer is empty - which
should be fine, because at least for most apps all they could do then is
sleep or spin - or we could keep a flag next to the tail pointer in the
ringbuffer, maintained by the kernel, for userspace that wants to be able
to avoid that syscall when it's not necessary.
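
Userspace-side sketch of that flag, just to be explicit about what it would
buy - AIO_RING_KERNEL_HAS_MORE and the flags word next to the tail pointer
are hypothetical; struct aio_ring has no such field today:

#define AIO_RING_KERNEL_HAS_MORE	(1u << 0)	/* hypothetical */

/* Userspace may only skip io_getevents() when the ring looks empty *and*
 * the kernel isn't holding deferred completions off to the side. */
static inline int ring_truly_empty(const struct aio_ring *ring,
				   unsigned ring_flags)
{
	return ring->head == ring->tail &&
	       !(ring_flags & AIO_RING_KERNEL_HAS_MORE);
}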

Although - since current libaio skips the syscall if the ringbuffer is
empty, this is yet another thing we can't do with the current
ringbuffer.

Crap.

Well, we should be able to hack around it with the existing
ringbuffer... normally, if the ringbuffer is full and stuff goes on the
list, then there's going to be more completions coming later and stuff
would get pulled off the list then.

The only situation you have to worry about is when the ringbuffer fills
up and stuff goes on the list, and then completions completely stop -
this should be a rare enough situation that maybe we could just hack
around it with a timer that gets flipped on when the list isn't empty.

Also, for this to be an issue at all, _all_ the reaping would have to be
done from userspace - since existing libaio doesn't do that, there may
not be any code out there which triggers it.

Zach Brown Oct. 10, 2012, 12:36 a.m. UTC | #7
> The only situation you have to worry about is when the ringbuffer fills
> up and stuff goes on the list, and then completions completely stop -
> this should be a rare enough situation that maybe we could just hack
> around it with a timer that gets flipped on when the list isn't empty.

Right.  And this is when we hopefully realize that we're adding overhead
and complexity (how long's the timer?  always fire it?  MAKE IT STOP)
and are actually making the system worse, not better.

> Also, for this to be an issue at all, _all_ the reaping would have to be
> done from userspace - since existing libaio doesn't do that, there may
> not be any code out there which triggers it.

And there may be.  We default to not breaking interfaces.  Seriously.

- z

Kent Overstreet Oct. 10, 2012, 1:09 a.m. UTC | #8
On Tue, Oct 09, 2012 at 05:36:26PM -0700, Zach Brown wrote:
> > The only situation you have to worry about is when the ringbuffer fills
> > up and stuff goes on the list, and then completions completely stop -
> > this should be a rare enough situation that maybe we could just hack
> > around it with a timer that gets flipped on when the list isn't empty.
> 
> Right.  And this is when we hopefully realize that we're adding overhead
> and complexity (how long's the timer?  always fire it?  MAKE IT STOP)
> and are actually making the system worse, not better.

I can still prototype it, and if it's that ugly... I throw away code all
the time :P

> > Also, for this to be an issue at all, _all_ the reaping would have to be
> > done from userspace - since existing libaio doesn't do that, there may
> > not be any code out there which triggers it.
> 
> And there may be.  We default to not breaking interfaces.  Seriously.

All the more reason to think about it now so we don't screw up the next
interfaces :)


Patch

diff --git a/fs/aio.c b/fs/aio.c
index 3ab12f6..c3d97d1 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -82,6 +82,9 @@  static void aio_free_ring(struct kioctx *ctx)
 	struct aio_ring_info *info = &ctx->ring_info;
 	long i;
 
+	if (info->ring)
+		vunmap(info->ring);
+
 	for (i=0; i<info->nr_pages; i++)
 		put_page(info->ring_pages[i]);
 
@@ -99,7 +102,6 @@  static void aio_free_ring(struct kioctx *ctx)
 
 static int aio_setup_ring(struct kioctx *ctx)
 {
-	struct aio_ring *ring;
 	struct aio_ring_info *info = &ctx->ring_info;
 	unsigned nr_events = ctx->max_reqs;
 	unsigned long size;
@@ -149,46 +151,27 @@  static int aio_setup_ring(struct kioctx *ctx)
 		return -EAGAIN;
 	}
 
+	info->ring = vmap(info->ring_pages, nr_pages, VM_MAP, PAGE_KERNEL);
+	if (!info->ring) {
+		aio_free_ring(ctx);
+		return -ENOMEM;
+	}
+
 	ctx->user_id = info->mmap_base;
 
 	info->nr = nr_events;		/* trusted copy */
 
-	ring = kmap_atomic(info->ring_pages[0]);
-	ring->nr = nr_events;	/* user copy */
-	ring->id = ctx->user_id;
-	ring->head = ring->tail = 0;
-	ring->magic = AIO_RING_MAGIC;
-	ring->compat_features = AIO_RING_COMPAT_FEATURES;
-	ring->incompat_features = AIO_RING_INCOMPAT_FEATURES;
-	ring->header_length = sizeof(struct aio_ring);
-	kunmap_atomic(ring);
+	info->ring->nr = nr_events;	/* user copy */
+	info->ring->id = ctx->user_id;
+	info->ring->head = info->ring->tail = 0;
+	info->ring->magic = AIO_RING_MAGIC;
+	info->ring->compat_features = AIO_RING_COMPAT_FEATURES;
+	info->ring->incompat_features = AIO_RING_INCOMPAT_FEATURES;
+	info->ring->header_length = sizeof(struct aio_ring);
 
 	return 0;
 }
 
-
-/* aio_ring_event: returns a pointer to the event at the given index from
- * kmap_atomic().  Release the pointer with put_aio_ring_event();
- */
-#define AIO_EVENTS_PER_PAGE	(PAGE_SIZE / sizeof(struct io_event))
-#define AIO_EVENTS_FIRST_PAGE	((PAGE_SIZE - sizeof(struct aio_ring)) / sizeof(struct io_event))
-#define AIO_EVENTS_OFFSET	(AIO_EVENTS_PER_PAGE - AIO_EVENTS_FIRST_PAGE)
-
-#define aio_ring_event(info, nr) ({					\
-	unsigned pos = (nr) + AIO_EVENTS_OFFSET;			\
-	struct io_event *__event;					\
-	__event = kmap_atomic(						\
-			(info)->ring_pages[pos / AIO_EVENTS_PER_PAGE]); \
-	__event += pos % AIO_EVENTS_PER_PAGE;				\
-	__event;							\
-})
-
-#define put_aio_ring_event(event) do {		\
-	struct io_event *__event = (event);	\
-	(void)__event;				\
-	kunmap_atomic((void *)((unsigned long)__event & PAGE_MASK)); \
-} while(0)
-
 static void free_ioctx(struct work_struct *work)
 {
 	struct kioctx *ctx = container_of(work, struct kioctx, free_work);
@@ -465,7 +448,6 @@  static int kiocb_batch_refill(struct kioctx *ctx, struct kiocb_batch *batch)
 	unsigned short allocated, to_alloc;
 	long avail;
 	struct kiocb *req, *n;
-	struct aio_ring *ring;
 
 	to_alloc = min(batch->count, KIOCB_BATCH_SIZE);
 	for (allocated = 0; allocated < to_alloc; allocated++) {
@@ -480,9 +462,8 @@  static int kiocb_batch_refill(struct kioctx *ctx, struct kiocb_batch *batch)
 		goto out;
 
 	spin_lock_irq(&ctx->ctx_lock);
-	ring = kmap_atomic(ctx->ring_info.ring_pages[0]);
 
-	avail = aio_ring_avail(&ctx->ring_info, ring) - atomic_read(&ctx->reqs_active);
+	avail = aio_ring_avail(&ctx->ring_info) - atomic_read(&ctx->reqs_active);
 	BUG_ON(avail < 0);
 	if (avail < allocated) {
 		/* Trim back the number of requests. */
@@ -500,7 +481,6 @@  static int kiocb_batch_refill(struct kioctx *ctx, struct kiocb_batch *batch)
 		atomic_inc(&ctx->reqs_active);
 	}
 
-	kunmap_atomic(ring);
 	spin_unlock_irq(&ctx->ctx_lock);
 
 out:
@@ -870,10 +850,9 @@  void aio_complete(struct kiocb *iocb, long res, long res2)
 {
 	struct kioctx	*ctx = iocb->ki_ctx;
 	struct aio_ring_info	*info;
-	struct aio_ring	*ring;
 	struct io_event	*event;
 	unsigned long	flags;
-	unsigned long	tail;
+	unsigned tail;
 
 	/*
 	 * Special case handling for sync iocbs:
@@ -892,7 +871,8 @@  void aio_complete(struct kiocb *iocb, long res, long res2)
 
 	info = &ctx->ring_info;
 
-	/* add a completion event to the ring buffer.
+	/*
+	 * add a completion event to the ring buffer.
 	 * must be done holding ctx->ctx_lock to prevent
 	 * other code from messing with the tail
 	 * pointer since we might be called from irq
@@ -910,10 +890,8 @@  void aio_complete(struct kiocb *iocb, long res, long res2)
 	if (kiocbIsCancelled(iocb))
 		goto put_rq;
 
-	ring = kmap_atomic(info->ring_pages[0]);
-
 	tail = info->tail;
-	event = aio_ring_event(info, tail);
+	event = &info->ring->io_events[tail];
 	if (++tail >= info->nr)
 		tail = 0;
 
@@ -922,9 +900,9 @@  void aio_complete(struct kiocb *iocb, long res, long res2)
 	event->res = res;
 	event->res2 = res2;
 
-	dprintk("aio_complete: %p[%lu]: %p: %p %Lx %lx %lx\n",
-		ctx, tail, iocb, iocb->ki_obj.user, iocb->ki_user_data,
-		res, res2);
+	pr_debug("%p[%u]: %p: %p %Lx %lx %lx\n",
+		 ctx, tail, iocb, iocb->ki_obj.user, iocb->ki_user_data,
+		 res, res2);
 
 	/* after flagging the request as done, we
 	 * must never even look at it again
@@ -932,12 +910,9 @@  void aio_complete(struct kiocb *iocb, long res, long res2)
 	smp_wmb();	/* make event visible before updating tail */
 
 	info->tail = tail;
-	ring->tail = tail;
-
-	put_aio_ring_event(event);
-	kunmap_atomic(ring);
+	info->ring->tail = tail;
 
-	pr_debug("added to ring %p at [%lu]\n", iocb, tail);
+	pr_debug("added to ring %p at [%u]\n", iocb, tail);
 
 	/*
 	 * Check if the user asked us to deliver the result through an
@@ -975,11 +950,10 @@  EXPORT_SYMBOL(aio_complete);
 static int aio_read_evt(struct kioctx *ioctx, struct io_event *ent)
 {
 	struct aio_ring_info *info = &ioctx->ring_info;
-	struct aio_ring *ring;
+	struct aio_ring *ring = info->ring;
 	unsigned long head;
 	int ret = 0;
 
-	ring = kmap_atomic(info->ring_pages[0]);
 	dprintk("in aio_read_evt h%lu t%lu m%lu\n",
 		 (unsigned long)ring->head, (unsigned long)ring->tail,
 		 (unsigned long)ring->nr);
@@ -991,18 +965,15 @@  static int aio_read_evt(struct kioctx *ioctx, struct io_event *ent)
 
 	head = ring->head % info->nr;
 	if (head != ring->tail) {
-		struct io_event *evp = aio_ring_event(info, head);
-		*ent = *evp;
+		*ent = ring->io_events[head];
 		head = (head + 1) % info->nr;
 		smp_mb(); /* finish reading the event before updatng the head */
 		ring->head = head;
 		ret = 1;
-		put_aio_ring_event(evp);
 	}
 	spin_unlock(&info->ring_lock);
 
 out:
-	kunmap_atomic(ring);
 	dprintk("leaving aio_read_evt: %d  h%lu t%lu\n", ret,
 		 (unsigned long)ring->head, (unsigned long)ring->tail);
 	return ret;
diff --git a/include/linux/aio.h b/include/linux/aio.h
index eb6e5e4..150a4b7 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -161,6 +161,7 @@  struct aio_ring {
 
 #define AIO_RING_PAGES	8
 struct aio_ring_info {
+	struct aio_ring		*ring;
 	unsigned long		mmap_base;
 	unsigned long		mmap_size;
 
@@ -173,10 +174,10 @@  struct aio_ring_info {
 	struct page		*internal_pages[AIO_RING_PAGES];
 };
 
-static inline unsigned aio_ring_avail(struct aio_ring_info *info,
-					struct aio_ring *ring)
+static inline unsigned aio_ring_avail(struct aio_ring_info *info)
 {
-	return (ring->head + info->nr - 1 - ring->tail) % info->nr;
+	return (info->ring->head + info->nr - 1 - info->ring->tail) %
+		info->nr;
 }
 
 struct kioctx {