
[09/13] io_uring: add submission polling

Message ID 20190123153536.7081-12-axboe@kernel.dk (mailing list archive)
State New, archived

Commit Message

Jens Axboe Jan. 23, 2019, 3:35 p.m. UTC
This enables an application to do IO, without ever entering the kernel.
By using the SQ ring to fill in new sqes and watching for completions
on the CQ ring, we can submit and reap IOs without doing a single system
call. The kernel side thread will poll for new submissions, and in case
of HIPRI/polled IO, it'll also poll for completions.

Proof of concept. If the thread has been idle for 1 second, it will set
sq_ring->flags |= IORING_SQ_NEED_WAKEUP. The application will have to
call io_uring_enter() to start things back up again. If IO is kept busy,
that will never be needed. Basically an application that has this
feature enabled will guard its io_uring_enter(2) call with:

read_barrier();
if (*sq_ring->flags & IORING_SQ_NEED_WAKEUP)
	io_uring_enter(fd, to_submit, 0, 0);

instead of calling it unconditionally.
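
For illustration, a slightly fuller sketch of that guard from the
application side. The app_sq_ring layout and the io_uring_enter() wrapper
(a thin syscall(2) wrapper) are assumptions made for this example; the
only real contract is the IORING_SQ_NEED_WAKEUP bit in sq_ring->flags:

#include <stdatomic.h>

/* hypothetical application-side view of the mmap'ed SQ ring */
struct app_sq_ring {
	unsigned *head;
	unsigned *tail;
	unsigned *flags;
};

/*
 * sqes have already been written to the ring and the tail updated.
 * With IORING_SETUP_SQPOLL the kernel thread normally picks them up
 * by itself; only enter the kernel if it has gone idle and flagged
 * that it needs a wakeup.
 */
static int submit(int ring_fd, struct app_sq_ring *sq, unsigned to_submit)
{
	atomic_thread_fence(memory_order_acquire);	/* read_barrier() */

	if (*sq->flags & IORING_SQ_NEED_WAKEUP)
		return io_uring_enter(ring_fd, to_submit, 0, 0);

	/* thread is awake and will see the new tail on its own */
	return to_submit;
}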

Improvements:

1) Maybe have smarter backoff. Busy loop for X time, then go to
   monitor/mwait, finally the schedule we have now after an idle
   second. Might not be worth the complexity.

2) Probably want the application to pass in the appropriate grace
   period, not hard code it at 1 second.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/io_uring.c                 | 219 +++++++++++++++++++++++++++++++++-
 include/uapi/linux/io_uring.h |  10 +-
 2 files changed, 222 insertions(+), 7 deletions(-)

Comments

Christoph Hellwig Jan. 28, 2019, 3:09 p.m. UTC | #1
On Wed, Jan 23, 2019 at 08:35:22AM -0700, Jens Axboe wrote:
> Proof of concept.

Is that still true?

> 1) Maybe have smarter backoff. Busy loop for X time, then go to
>    monitor/mwait, finally the schedule we have now after an idle
>    second. Might not be worth the complexity.
> 
> 2) Probably want the application to pass in the appropriate grace
>    period, not hard code it at 1 second.

2) actually sounds really useful.  Should we look into it ASAP?

>  
>  	struct {
>  		/* CQ ring */
> @@ -264,6 +267,9 @@ static void __io_cqring_add_event(struct io_ring_ctx *ctx, u64 ki_user_data,
>  
>  	if (waitqueue_active(&ctx->wait))
>  		wake_up(&ctx->wait);
> +	if ((ctx->flags & IORING_SETUP_SQPOLL) &&
> +	    waitqueue_active(&ctx->sqo_wait))

waitqueue_active is really cheap and sqo_wait should not otherwise
be active.  Do we really need the flags check here?

> +			/*
> +			 * Normal IO, just pretend everything completed.
> +			 * We don't have to poll completions for that.
> +			 */
> +			if (ctx->flags & IORING_SETUP_IOPOLL) {
> +				/*
> +				 * App should not use IORING_ENTER_GETEVENTS
> +				 * with thread polling, but if it does, then
> +				 * ensure we are mutually exclusive.

Should we just return an error early on in this case instead?

>  	if (to_submit) {
> +		if (ctx->flags & IORING_SETUP_SQPOLL) {
> +			wake_up(&ctx->sqo_wait);
> +			ret = to_submit;

Do these semantics really make sense?  Maybe we should have an
IORING_ENTER_WAKE_SQ instead of overloading the to_submit argument?
Especially as we don't really care about returning the number passed in.

> +	if (ctx->sqo_thread) {
> +		kthread_park(ctx->sqo_thread);

Can you explain why we need the whole kthread_park game?  It is only
intended to deal with pausing a thread, and if we need it to shut down
a thread we have a bug somewhere.

>  static void io_sq_offload_stop(struct io_ring_ctx *ctx)
>  {
> +	if (ctx->sqo_thread) {
> +		kthread_park(ctx->sqo_thread);
> +		kthread_stop(ctx->sqo_thread);
> +		ctx->sqo_thread = NULL;

Also there isn't really much of a point in setting pointers to NULL
just before freeing the containing structure.  In the best case this
now papers over bugs that poisoning or kasan would otherwise find.
Jens Axboe Jan. 28, 2019, 5:05 p.m. UTC | #2
On 1/28/19 8:09 AM, Christoph Hellwig wrote:
> On Wed, Jan 23, 2019 at 08:35:22AM -0700, Jens Axboe wrote:
>> Proof of concept.
> 
> Is that still true?

I guess I can remove it now; it dates back to when this was initially
just a test. But it should be solid, so I'll kill that part.

>> 1) Maybe have smarter backoff. Busy loop for X time, then go to
>>    monitor/mwait, finally the schedule we have now after an idle
>>    second. Might not be worth the complexity.
>>
>> 2) Probably want the application to pass in the appropriate grace
>>    period, not hard code it at 1 second.
> 
> 2) actually sounds really useful.  Should we look into it ASAP?

I think so. Question is what kind of granularity we need for this. I
think we can go pretty coarse and keep it in msec, using a short to
pass this in like we do for the thread CPU. That gives us 0..65535 msec,
which should be plenty of range.

>>  	struct {
>>  		/* CQ ring */
>> @@ -264,6 +267,9 @@ static void __io_cqring_add_event(struct io_ring_ctx *ctx, u64 ki_user_data,
>>  
>>  	if (waitqueue_active(&ctx->wait))
>>  		wake_up(&ctx->wait);
>> +	if ((ctx->flags & IORING_SETUP_SQPOLL) &&
>> +	    waitqueue_active(&ctx->sqo_wait))
> 
> waitqueue_active is really cheap and sqo_wait should not otherwise
> be active.  Do we really need the flags check here?

Probably not, I'll kill it.
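
Roughly what the simplified wakeup would then look like (a sketch, not
the committed follow-up):

	if (waitqueue_active(&ctx->wait))
		wake_up(&ctx->wait);
	/*
	 * sqo_wait is only ever populated when an SQ poll thread exists,
	 * and waitqueue_active() is cheap, so no flags check needed.
	 */
	if (waitqueue_active(&ctx->sqo_wait))
		wake_up(&ctx->sqo_wait);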

>> +			/*
>> +			 * Normal IO, just pretend everything completed.
>> +			 * We don't have to poll completions for that.
>> +			 */
>> +			if (ctx->flags & IORING_SETUP_IOPOLL) {
>> +				/*
>> +				 * App should not use IORING_ENTER_GETEVENTS
>> +				 * with thread polling, but if it does, then
>> +				 * ensure we are mutually exclusive.
> 
> Should we just return an error early on in this case instead?

I think that'd make it awkward, since it's out-of-line. If the app is doing
things it shouldn't in this case, its own io_uring_enter() would most likely
fail occasionally with -EBUSY anyway.

>>  	if (to_submit) {
>> +		if (ctx->flags & IORING_SETUP_SQPOLL) {
>> +			wake_up(&ctx->sqo_wait);
>> +			ret = to_submit;
> 
> Do these semantics really make sense?  Maybe we should have an
> IORING_ENTER_WAKE_SQ instead of overloading the to_submit argument?
> Especially as we don't really care about returning the number passed in.

I like that change, I'll add IORING_ENTER_SQ_WAKEUP instead of using
'to_submit' for this. We can't validate the number anyway.

>> +	if (ctx->sqo_thread) {
>> +		kthread_park(ctx->sqo_thread);
> 
> Can you explain why we need the whole kthread_park game?  It is only
> intended to deal with pausing a thread, and if we need it to shut down
> a thread we have a bug somewhere.

It is working around a bug in shutting down a thread that is affinitized
to a single CPU, I just didn't want to deal with hunting that down right
now.

>>  static void io_sq_offload_stop(struct io_ring_ctx *ctx)
>>  {
>> +	if (ctx->sqo_thread) {
>> +		kthread_park(ctx->sqo_thread);
>> +		kthread_stop(ctx->sqo_thread);
>> +		ctx->sqo_thread = NULL;
> 
> Also there isn't really much of a point in setting pointers to NULL
> just before freeing the containing structure.  In the best case this
> now papers over bugs that poisoning or kasan would otherwise find.

Removed.
Jeff Moyer Jan. 28, 2019, 9:13 p.m. UTC | #3
Jens Axboe <axboe@kernel.dk> writes:

> @@ -1270,6 +1445,27 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx)
>  	if (!ctx->sqo_files)
>  		goto err;
>  
> +	if (ctx->flags & IORING_SETUP_SQPOLL) {
> +		if (p->flags & IORING_SETUP_SQ_AFF) {
> +			ctx->sqo_thread = kthread_create_on_cpu(io_sq_thread,
> +							ctx, p->sq_thread_cpu,
> +							"io_uring-sq");

sq_thread_cpu looks like another candidate for array_index_nospec.
Following the macros, kthread_create_on_cpu calls cpu_to_node, which
does:
        return per_cpu(x86_cpu_to_node_map, cpu);

#define per_cpu(var, cpu)       (*per_cpu_ptr(&(var), cpu))
#define per_cpu_ptr(ptr, cpu)                                           \
({                                                                      \
        __verify_pcpu_ptr(ptr);                                         \
        SHIFT_PERCPU_PTR((ptr), per_cpu_offset((cpu)));                 \
})
#define per_cpu_offset(x) (__per_cpu_offset[x])
                           ^^^^^^^^^^^^^^^^^^^

It also looks like there's no bounds checking there, so we probably want
to make sure sq_thread_cpu can't overflow the __per_cpu_offset array
(NR_CPUS).

-Jeff
Jens Axboe Jan. 28, 2019, 9:28 p.m. UTC | #4
On 1/28/19 2:13 PM, Jeff Moyer wrote:
> Jens Axboe <axboe@kernel.dk> writes:
> 
>> @@ -1270,6 +1445,27 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx)
>>  	if (!ctx->sqo_files)
>>  		goto err;
>>  
>> +	if (ctx->flags & IORING_SETUP_SQPOLL) {
>> +		if (p->flags & IORING_SETUP_SQ_AFF) {
>> +			ctx->sqo_thread = kthread_create_on_cpu(io_sq_thread,
>> +							ctx, p->sq_thread_cpu,
>> +							"io_uring-sq");
> 
> sq_thread_cpu looks like another candidate for array_index_nospec.
> Following the macros, kthread_create_on_cpu calls cpu_to_node, which
> does:
>         return per_cpu(x86_cpu_to_node_map, cpu);
> 
> #define per_cpu(var, cpu)       (*per_cpu_ptr(&(var), cpu))
> #define per_cpu_ptr(ptr, cpu)                                           \
> ({                                                                      \
>         __verify_pcpu_ptr(ptr);                                         \
>         SHIFT_PERCPU_PTR((ptr), per_cpu_offset((cpu)));                 \
> })
> #define per_cpu_offset(x) (__per_cpu_offset[x])
>                            ^^^^^^^^^^^^^^^^^^^
> 
> It also looks like there's no bounds checking there, so we probably want
> to make sure sq_thread_cpu can't overflow the __per_cpu_offset array
> (NR_CPUS).

Added, can't hurt in any case.
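
For reference, the check in question can be a plain bounds test plus
array_index_nospec() (from <linux/nospec.h>) before the CPU number is
handed to kthread_create_on_cpu(); a sketch of the idea, not the exact
hunk that was added:

		if (p->flags & IORING_SETUP_SQ_AFF) {
			int cpu = p->sq_thread_cpu;

			/* reject out-of-range CPUs, clamp under speculation */
			ret = -EINVAL;
			if (cpu >= nr_cpu_ids || !cpu_possible(cpu))
				goto err;
			cpu = array_index_nospec(cpu, nr_cpu_ids);

			ctx->sqo_thread = kthread_create_on_cpu(io_sq_thread,
							ctx, cpu,
							"io_uring-sq");
		}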
Christoph Hellwig Jan. 29, 2019, 6:29 a.m. UTC | #5
On Mon, Jan 28, 2019 at 10:05:37AM -0700, Jens Axboe wrote:
> >> 2) Probably want the application to pass in the appropriate grace
> >>    period, not hard code it at 1 second.
> > 
> > 2) actually sounds really useful.  Should we look into it ASAP?
> 
> I think so. Question is what kind of granularity we need for this. I
> think we can go pretty coarse and keep it in msec, using a short to
> pass this in like we do for the thread CPU. That gives us 0..65535 msec,
> which should be plenty of range.

msec granularity sounds fine.  But is it really worth using a short
instead of a 32-bit value these days?

Also 16385 seems very limiting for a cpu value, especially as our
normal cpu ids are ints.
Jens Axboe Jan. 29, 2019, 1:21 p.m. UTC | #6
On 1/28/19 11:29 PM, Christoph Hellwig wrote:
> On Mon, Jan 28, 2019 at 10:05:37AM -0700, Jens Axboe wrote:
>>>> 2) Probably want the application to pass in the appropriate grace
>>>>    period, not hard code it at 1 second.
>>>
>>> 2) actually sounds really useful.  Should we look into it ASAP?
>>
>> I think so. Question is what kind of granularity we need for this. I
>> think we can go pretty coarse and keep it in msec, using a short to
>> pass this in like we do for the thread CPU. That gives us 0..65535 msec,
>> which should be plenty of range.
> 
> msec granularity sounds fine.  But is it really worth using a short
> instead of a 32-bit value these days?
> 
> Also 16385 seems very limiting for a cpu value, especially as our
> normal cpu ids are ints.

I'll bump them both to 4 bytes.
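
Which would leave io_uring_params looking roughly like this (the
sq_thread_idle name for the grace period is an assumption; only the
field widths are what was discussed above):

struct io_uring_params {
	__u32 sq_entries;
	__u32 cq_entries;
	__u32 flags;
	__u32 sq_thread_cpu;
	__u32 sq_thread_idle;	/* idle grace period in msec */
	__u32 resv[3];
	struct io_sqring_offsets sq_off;
	struct io_cqring_offsets cq_off;
};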

Patch

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 86add82e1008..2deda7b1b3dd 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -24,6 +24,7 @@ 
 #include <linux/percpu.h>
 #include <linux/slab.h>
 #include <linux/workqueue.h>
+#include <linux/kthread.h>
 #include <linux/blkdev.h>
 #include <linux/bvec.h>
 #include <linux/anon_inodes.h>
@@ -87,8 +88,10 @@  struct io_ring_ctx {
 
 	/* IO offload */
 	struct workqueue_struct	*sqo_wq;
+	struct task_struct	*sqo_thread;	/* if using sq thread polling */
 	struct mm_struct	*sqo_mm;
 	struct files_struct	*sqo_files;
+	wait_queue_head_t	sqo_wait;
 
 	struct {
 		/* CQ ring */
@@ -264,6 +267,9 @@  static void __io_cqring_add_event(struct io_ring_ctx *ctx, u64 ki_user_data,
 
 	if (waitqueue_active(&ctx->wait))
 		wake_up(&ctx->wait);
+	if ((ctx->flags & IORING_SETUP_SQPOLL) &&
+	    waitqueue_active(&ctx->sqo_wait))
+		wake_up(&ctx->sqo_wait);
 }
 
 static void io_cqring_add_event(struct io_ring_ctx *ctx, u64 ki_user_data,
@@ -1106,6 +1112,168 @@  static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
 	return false;
 }
 
+static int io_submit_sqes(struct io_ring_ctx *ctx, struct sqe_submit *sqes,
+			  unsigned int nr, bool mm_fault)
+{
+	struct io_submit_state state, *statep = NULL;
+	int ret, i, submitted = 0;
+
+	if (nr > IO_PLUG_THRESHOLD) {
+		io_submit_state_start(&state, ctx, nr);
+		statep = &state;
+	}
+
+	for (i = 0; i < nr; i++) {
+		if (unlikely(mm_fault))
+			ret = -EFAULT;
+		else
+			ret = io_submit_sqe(ctx, &sqes[i], statep);
+		if (!ret) {
+			submitted++;
+			continue;
+		}
+
+		io_cqring_add_event(ctx, sqes[i].sqe->user_data, ret, 0);
+	}
+
+	if (statep)
+		io_submit_state_end(&state);
+
+	return submitted;
+}
+
+static int io_sq_thread(void *data)
+{
+	struct sqe_submit sqes[IO_IOPOLL_BATCH];
+	struct io_ring_ctx *ctx = data;
+	struct mm_struct *cur_mm = NULL;
+	struct files_struct *old_files;
+	mm_segment_t old_fs;
+	DEFINE_WAIT(wait);
+	unsigned inflight;
+	unsigned long timeout;
+
+	old_files = current->files;
+	current->files = ctx->sqo_files;
+
+	old_fs = get_fs();
+	set_fs(USER_DS);
+
+	timeout = inflight = 0;
+	while (!kthread_should_stop()) {
+		bool all_fixed, mm_fault = false;
+		int i;
+
+		if (inflight) {
+			unsigned int nr_events = 0;
+
+			/*
+			 * Normal IO, just pretend everything completed.
+			 * We don't have to poll completions for that.
+			 */
+			if (ctx->flags & IORING_SETUP_IOPOLL) {
+				/*
+				 * App should not use IORING_ENTER_GETEVENTS
+				 * with thread polling, but if it does, then
+				 * ensure we are mutually exclusive.
+				 */
+				if (mutex_trylock(&ctx->uring_lock)) {
+					io_iopoll_check(ctx, &nr_events, 0);
+					mutex_unlock(&ctx->uring_lock);
+				}
+			} else {
+				nr_events = inflight;
+			}
+
+			inflight -= nr_events;
+			if (!inflight)
+				timeout = jiffies + HZ;
+		}
+
+		if (!io_get_sqring(ctx, &sqes[0])) {
+			/*
+			 * We're polling, let us spin for a second without
+			 * work before going to sleep.
+			 */
+			if (inflight || !time_after(jiffies, timeout)) {
+				cpu_relax();
+				continue;
+			}
+
+			/*
+			 * Drop cur_mm before scheduling, we can't hold it for
+			 * long periods (or over schedule()). Do this before
+			 * adding ourselves to the waitqueue, as the unuse/drop
+			 * may sleep.
+			 */
+			if (cur_mm) {
+				unuse_mm(cur_mm);
+				mmput(cur_mm);
+				cur_mm = NULL;
+			}
+
+			prepare_to_wait(&ctx->sqo_wait, &wait,
+						TASK_INTERRUPTIBLE);
+
+			/* Tell userspace we may need a wakeup call */
+			ctx->sq_ring->flags |= IORING_SQ_NEED_WAKEUP;
+			smp_wmb();
+
+			if (!io_get_sqring(ctx, &sqes[0])) {
+				if (kthread_should_park())
+					kthread_parkme();
+				if (kthread_should_stop()) {
+					finish_wait(&ctx->sqo_wait, &wait);
+					break;
+				}
+				if (signal_pending(current))
+					flush_signals(current);
+				schedule();
+				finish_wait(&ctx->sqo_wait, &wait);
+
+				ctx->sq_ring->flags &= ~IORING_SQ_NEED_WAKEUP;
+				smp_wmb();
+				continue;
+			}
+			finish_wait(&ctx->sqo_wait, &wait);
+
+			ctx->sq_ring->flags &= ~IORING_SQ_NEED_WAKEUP;
+			smp_wmb();
+		}
+
+		i = 0;
+		all_fixed = true;
+		do {
+			if (all_fixed && io_sqe_needs_user(sqes[i].sqe))
+				all_fixed = false;
+
+			i++;
+			if (i == ARRAY_SIZE(sqes))
+				break;
+		} while (io_get_sqring(ctx, &sqes[i]));
+
+		io_commit_sqring(ctx);
+
+		/* Unless all new commands are FIXED regions, grab mm */
+		if (!all_fixed && !cur_mm) {
+			mm_fault = !mmget_not_zero(ctx->sqo_mm);
+			if (!mm_fault) {
+				use_mm(ctx->sqo_mm);
+				cur_mm = ctx->sqo_mm;
+			}
+		}
+
+		inflight += io_submit_sqes(ctx, sqes, i, mm_fault);
+	}
+	current->files = old_files;
+	set_fs(old_fs);
+	if (cur_mm) {
+		unuse_mm(cur_mm);
+		mmput(cur_mm);
+	}
+	return 0;
+}
+
 static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
 {
 	struct io_submit_state state, *statep = NULL;
@@ -1179,9 +1347,14 @@  static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit,
 	int ret = 0;
 
 	if (to_submit) {
-		ret = io_ring_submit(ctx, to_submit);
-		if (ret < 0)
-			return ret;
+		if (ctx->flags & IORING_SETUP_SQPOLL) {
+			wake_up(&ctx->sqo_wait);
+			ret = to_submit;
+		} else {
+			ret = io_ring_submit(ctx, to_submit);
+			if (ret < 0)
+				return ret;
+		}
 	}
 	if (flags & IORING_ENTER_GETEVENTS) {
 		unsigned nr_events = 0;
@@ -1254,10 +1427,12 @@  static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 	return ret;
 }
 
-static int io_sq_offload_start(struct io_ring_ctx *ctx)
+static int io_sq_offload_start(struct io_ring_ctx *ctx,
+			       struct io_uring_params *p)
 {
 	int ret;
 
+	init_waitqueue_head(&ctx->sqo_wait);
 	ctx->sqo_mm = current->mm;
 
 	/*
@@ -1270,6 +1445,27 @@  static int io_sq_offload_start(struct io_ring_ctx *ctx)
 	if (!ctx->sqo_files)
 		goto err;
 
+	if (ctx->flags & IORING_SETUP_SQPOLL) {
+		if (p->flags & IORING_SETUP_SQ_AFF) {
+			ctx->sqo_thread = kthread_create_on_cpu(io_sq_thread,
+							ctx, p->sq_thread_cpu,
+							"io_uring-sq");
+		} else {
+			ctx->sqo_thread = kthread_create(io_sq_thread, ctx,
+							"io_uring-sq");
+		}
+		if (IS_ERR(ctx->sqo_thread)) {
+			ret = PTR_ERR(ctx->sqo_thread);
+			ctx->sqo_thread = NULL;
+			goto err;
+		}
+		wake_up_process(ctx->sqo_thread);
+	} else if (p->flags & IORING_SETUP_SQ_AFF) {
+		/* Can't have SQ_AFF without SQPOLL */
+		ret = -EINVAL;
+		goto err;
+	}
+
 	/* Do QD, or 2 * CPUS, whatever is smallest */
 	ctx->sqo_wq = alloc_workqueue("io_ring-wq", WQ_UNBOUND | WQ_FREEZABLE,
 			min(ctx->sq_entries - 1, 2 * num_online_cpus()));
@@ -1280,6 +1476,11 @@  static int io_sq_offload_start(struct io_ring_ctx *ctx)
 
 	return 0;
 err:
+	if (ctx->sqo_thread) {
+		kthread_park(ctx->sqo_thread);
+		kthread_stop(ctx->sqo_thread);
+		ctx->sqo_thread = NULL;
+	}
 	if (ctx->sqo_files)
 		ctx->sqo_files = NULL;
 	ctx->sqo_mm = NULL;
@@ -1288,6 +1489,11 @@  static int io_sq_offload_start(struct io_ring_ctx *ctx)
 
 static void io_sq_offload_stop(struct io_ring_ctx *ctx)
 {
+	if (ctx->sqo_thread) {
+		kthread_park(ctx->sqo_thread);
+		kthread_stop(ctx->sqo_thread);
+		ctx->sqo_thread = NULL;
+	}
 	if (ctx->sqo_wq) {
 		destroy_workqueue(ctx->sqo_wq);
 		ctx->sqo_wq = NULL;
@@ -1780,7 +1986,7 @@  static int io_uring_create(unsigned entries, struct io_uring_params *p,
 	if (ret)
 		goto err;
 
-	ret = io_sq_offload_start(ctx);
+	ret = io_sq_offload_start(ctx, p);
 	if (ret)
 		goto err;
 
@@ -1815,7 +2021,8 @@  static long io_uring_setup(u32 entries, struct io_uring_params __user *params,
 			return -EINVAL;
 	}
 
-	if (p.flags & ~IORING_SETUP_IOPOLL)
+	if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_SQPOLL |
+			IORING_SETUP_SQ_AFF))
 		return -EINVAL;
 
 	ret = io_uring_create(entries, &p, compat);
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 8323320077ec..37c7402be9ca 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -44,6 +44,8 @@  struct io_uring_sqe {
  * io_uring_setup() flags
  */
 #define IORING_SETUP_IOPOLL	(1 << 0)	/* io_context is polled */
+#define IORING_SETUP_SQPOLL	(1 << 1)	/* SQ poll thread */
+#define IORING_SETUP_SQ_AFF	(1 << 2)	/* sq_thread_cpu is valid */
 
 #define IORING_OP_NOP		0
 #define IORING_OP_READV		1
@@ -87,6 +89,11 @@  struct io_sqring_offsets {
 	__u32 resv[3];
 };
 
+/*
+ * sq_ring->flags
+ */
+#define IORING_SQ_NEED_WAKEUP	(1 << 0) /* needs io_uring_enter wakeup */
+
 struct io_cqring_offsets {
 	__u32 head;
 	__u32 tail;
@@ -109,7 +116,8 @@  struct io_uring_params {
 	__u32 sq_entries;
 	__u32 cq_entries;
 	__u32 flags;
-	__u16 resv[10];
+	__u16 sq_thread_cpu;
+	__u16 resv[9];
 	struct io_sqring_offsets sq_off;
 	struct io_cqring_offsets cq_off;
 };