[09/13] aio: add support for async openat()

Message ID 150a0b4905f1d7274b4c2c7f5e3f4d8df5dda1d7.1452549431.git.bcrl@kvack.org
State New, archived

Commit Message

Benjamin LaHaise Jan. 11, 2016, 10:07 p.m. UTC
Another blocking operation used by applications that want aio
functionality is that of opening files that are not resident in memory.
Using the thread based aio helper, add support for IOCB_CMD_OPENAT.

Signed-off-by: Benjamin LaHaise <ben.lahaise@solacesystems.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
---
 fs/aio.c                     | 120 +++++++++++++++++++++++++++++++++++++------
 include/uapi/linux/aio_abi.h |   2 +
 2 files changed, 107 insertions(+), 15 deletions(-)

Comments

Linus Torvalds Jan. 12, 2016, 12:22 a.m. UTC | #1
On Mon, Jan 11, 2016 at 2:07 PM, Benjamin LaHaise <bcrl@kvack.org> wrote:
> Another blocking operation used by applications that want aio
> functionality is that of opening files that are not resident in memory.
> Using the thread based aio helper, add support for IOCB_CMD_OPENAT.

So I think this is ridiculously ugly.

AIO is a horrible ad-hoc design, with the main excuse being "other,
less gifted people, made that design, and we are implementing it for
compatibility because database people - who seldom have any shred of
taste - actually use it".

But AIO was always really really ugly.

Now you introduce the notion of doing almost arbitrary system calls
asynchronously in threads, but then you use that ass-backwards nasty
interface to do so.

Why?

If you want to do arbitrary asynchronous system calls, just *do* it.
But do _that_, not "let's extend this horrible interface in arbitrary
random ways one special system call at a time".

In other words, why is the interface not simply: "do arbitrary system
call X with arguments A, B, C, D asynchronously using a kernel
thread".

That's something that a lot of people might use. In fact, if they can
avoid the nasty AIO interface, maybe they'll even use it for things
like read() and write().

So I really think it would be a nice thing to allow some kind of
arbitrary "queue up asynchronous system call" model.

But I do not think the AIO model should be the model used for that,
even if I think there might be some shared infrastructure.

So I would seriously suggest:

 - how about we add a true "asynchronous system call" interface

 - make it be a list of system calls with a futex completion for each
list entry, so that you can easily wait for the end result that way
(a sketch of what such an entry might look like follows below).

 - maybe (and this is where it gets really iffy) you could even pass
in the result of one system call to the next, so that you can do
things like

       fd = openat(..)
       ret = read(fd, ..)

   asynchronously and then just wait for the read() to complete.

and let us *not* tie this to the aio interface.
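
To sketch what such a list entry might look like (struct and field
names here are invented, not a worked-out ABI):

        struct async_entry {
                int  nr;           /* syscall number */
                long args[6];      /* same argument layout as the sync call */
                int  from_prev;    /* if >= 0, replace args[from_prev] with
                                    * the previous entry's result, as in the
                                    * openat()/read() chain above */
                long result;       /* written when the call completes */
                unsigned int done; /* futex word: woken once result is valid */
        };

        /* Submit "nents" entries; callers wait on list[i].done. */
        long async_submit(struct async_entry *list, int nents);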

In fact, if we do it well, we can go the other way, and try to
implement the nasty AIO interface on top of the generic "just do
things asynchronously".

And I actually think many of your kernel thread parts are good for a
generic implementation. That whole "AIO_THREAD_NEED_CRED" etc logic
all makes sense, although I do suspect you could just make it
unconditional. The cost of a few atomics shouldn't be excessive when
we're talking "use a thread to do op X".

What do you think? Do you think it might be possible to aim for a
generic "do system call asynchronously" model instead?

I'm adding Ingo to the cc, because I think Ingo had a "run this list
of system calls" patch at one point - in order to avoid system call
overhead. I don't think that was very interesting (because system call
overhead is seldom all that noticeable for any interesting system
calls), but with the "let's do the list asynchronously" addition it
might be much more intriguing. Ingo, do I remember correctly that it
was you? I might be confused about who wrote that patch, and I can't
find it now.

               Linus
Benjamin LaHaise Jan. 12, 2016, 1:17 a.m. UTC | #2
On Mon, Jan 11, 2016 at 04:22:28PM -0800, Linus Torvalds wrote:
> On Mon, Jan 11, 2016 at 2:07 PM, Benjamin LaHaise <bcrl@kvack.org> wrote:
> > Another blocking operation used by applications that want aio
> > functionality is that of opening files that are not resident in memory.
> > Using the thread based aio helper, add support for IOCB_CMD_OPENAT.
> 
> So I think this is ridiculously ugly.
> 
> AIO is a horrible ad-hoc design, with the main excuse being "other,
> less gifted people, made that design, and we are implementing it for
> compatibility because database people - who seldom have any shred of
> taste - actually use it".
> 
> But AIO was always really really ugly.
> 
> Now you introduce the notion of doing almost arbitrary system calls
> asynchronously in threads, but then you use that ass-backwards nasty
> interface to do so.
> 
> Why?

Understood, but there are some reasons behind this.  The core aio submit
mechanism is modeled after the lio_listio() call in POSIX.  While the
cost of performing syscalls has decreased substantially over the last 10
years, the cost of context switches has not.  Some AIO operations really
want to do part of the work in the context of the original submitter.
That was/is a critical piece of the async readahead functionality in
this series -- without being able to do a quick return to the caller
when all the cached data is already resident in the kernel, there is a
significant performance degradation in my tests.  For other operations
which are going to do blocking i/o anyway, the cost of the context
switch often becomes noise.

The async readahead also fills a hole in the proposed extensions to
preadv()/pwritev() -- they need some way to trigger a readahead
operation and to know when it has completed.  One needs a completion
queue of some sort to figure out which operation has completed in a
reasonably efficient manner.  The futex doesn't really have the ability
to do this.

Thread dispatching is another problem the applications I work on
encounter, and AIO helps in this particular area because a thread that
is running hot can simply check the AIO event ring buffer for new events
in its main event loop.  Userspace fundamentally *cannot* do a good job
of dispatching work to threads.  The code I've seen other developers
come up with ends up doing things like epoll() in one thread followed
by dispatching the received events to different threads.  This ends up
making multiple expensive syscalls (since locking and cross-CPU
bouncing are required) when the kernel could just direct things to the
right thread in the first place.

A lot of requirements that bring additional complexity start to surface
once you look at how some of these applications are actually written.

> If you want to do arbitrary asynchronous system calls, just *do* it.
> But do _that_, not "let's extend this horrible interface in arbitrary
> random ways one special system call at a time".
> 
> In other words, why is the interface not simply: "do arbitrary system
> call X with arguments A, B, C, D asynchronously using a kernel
> thread".

We've had a few proposals to do this, none of which have really managed 
to tackle all the problems that arose.  If we go down this path, we will 
end up needing a table of what syscalls can actually be performed 
asynchronously, and flags indicating what bits of context those syscalls
require.  This does end up looking a bit like how AIO does things
depending on how hard you squint.
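
To make that concrete: a hypothetical version of such a table, reusing
the AIO_THREAD_NEED_* flag values this series defines in fs/aio.c,
might look something like:

        #define AIO_THREAD_NEED_TASK    0x0001
        #define AIO_THREAD_NEED_FS      0x0002
        #define AIO_THREAD_NEED_FILES   0x0004
        #define AIO_THREAD_NEED_CRED    0x0008

        static const struct {
                int nr;               /* syscall number */
                unsigned int flags;   /* context the helper thread needs */
        } async_syscall_table[] = {
                /* flags match what aio_openat() requests in this series */
                { __NR_openat, AIO_THREAD_NEED_TASK | AIO_THREAD_NEED_FILES |
                               AIO_THREAD_NEED_CRED },
                /* ... further entries as syscalls are vetted ... */
        };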

I'm not opposed to reworking how AIO dispatches things.  If we're willing 
to relax some constraints (like the hard enforced limits on the number
of AIOs in flight), things can be substantially simplified.  Again,
worries about things like memory usage are vastly different today than
they were back in the early '00s, so the decisions that make sense now
will certainly lead to a different design.

Cancellation is also a concern.  Cancellation is not something that can
be sacrificed.  Without some mechanism to cancel operations that are in
flight, there is no way for a process to cleanly exit.  This patch
series nicely proves that signals work very well for cancellation, and
fit in with a lot of the code we already have.  This implies we would
need to treat threads doing async operations differently from normal
threads.  What happens with the pid namespace?

> That's something that a lot of people might use. In fact, if they can
> avoid the nasty AIO interface, maybe they'll even use it for things
> like read() and write().
> 
> So I really think it would be a nice thing to allow some kind of
> arbitrary "queue up asynchronous system call" model.
> 
> But I do not think the AIO model should be the model used for that,
> even if I think there might be some shared infrastructure.
> 
> So I would seriously suggest:
> 
>  - how about we add a true "asynchronous system call" interface
> 
>  - make it be a list of system calls with a futex completion for each
> list entry, so that you can easily wait for the end result that way.
> 
>  - maybe (and this is where it gets really iffy) you could even pass
> in the result of one system call to the next, so that you can do
> things like
> 
>        fd = openat(..)
>        ret = read(fd, ..)
> 
>    asynchronously and then just wait for the read() to complete.
> 
> and let us *not* tie this to the aio interface.
> 
> In fact, if we do it well, we can go the other way, and try to
> implement the nasty AIO interface on top of the generic "just do
> things asynchronously".
> 
> And I actually think many of your kernel thread parts are good for a
> generic implementation. That whole "AIO_THREAD_NEED_CRED" etc logic
> all makes sense, although I do suspect you could just make it
> unconditional. The cost of a few atomics shouldn't be excessive when
> we're talking "use a thread to do op X".
> 
> What do you think? Do you think it might be possible to aim for a
> generic "do system call asynchronously" model instead?

Maybe it's not too bad to do -- the syscall() primitive is reasonably 
well defined and is supported across architectures, but we're going to 
need new wrappers for *every* syscall supported.  Odds are the work will 
have to be done incrementally to weed out which syscalls are safe and 
which are not, but there is certainly no reason we can't reuse syscall 
numbers and the same argument layout.
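
For illustration -- reusing the hypothetical async_entry/async_submit()
sketched earlier in this thread -- one such wrapper might be no more
than:

        static long async_openat(struct async_entry *e, int dfd,
                                 const char *path, int flags, mode_t mode)
        {
                e->nr        = __NR_openat;  /* same number, same layout */
                e->args[0]   = dfd;
                e->args[1]   = (long)path;
                e->args[2]   = flags;
                e->args[3]   = mode;
                e->from_prev = -1;           /* no chaining */
                return async_submit(e, 1);
        }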

Chaining things becomes messy.  There are some cases where that works,
but at least in the applications I've worked on, there tends to be a
fair amount of logic that needs to run before you can figure out what
and where the next operation is.  The canonical example I can think of
is retrieving data from disk: the first operation is a read into some
table to find out where the data is located, the next is a search
(binary search, in the case I'm thinking of) in the data that was just
read to figure out which record actually contains the data the app
cares about, followed by a read to fetch the data the user requires.

And it gets more complicated: different disk i/os need to be issued with
different priorities (something that was not included in what I just
posted today, but is work I plan to propose for merging in the future).
In some cases the priority is known beforehand, but in other cases it
needs to be adjusted dynamically depending on information fetched (users
don't like it if huge i/os completely starve their smaller i/os for
significant amounts of time).

> I'm adding Ingo to the cc, because I think Ingo had a "run this list
> of system calls" patch at one point - in order to avoid system call
> overhead. I don't think that was very interesting (because system call
> overhead is seldom all that noticeable for any interesting system
> calls), but with the "let's do the list asynchronously" addition it
> might be much more intriguing. Ingo, do I remember correctly that it
> was you? I might be confused about who wrote that patch, and I can't
> find it now.

I'd certainly be interested in hearing more ideas concerning
requirements.

Sorry for the giant wall of text...  Nothing is simple! =-)

		-ben

>                Linus
Chris Mason Jan. 12, 2016, 1:45 a.m. UTC | #3
On Mon, Jan 11, 2016 at 04:22:28PM -0800, Linus Torvalds wrote:
> On Mon, Jan 11, 2016 at 2:07 PM, Benjamin LaHaise <bcrl@kvack.org> wrote:
> > Another blocking operation used by applications that want aio
> > functionality is that of opening files that are not resident in memory.
> > Using the thread based aio helper, add support for IOCB_CMD_OPENAT.
> 
> So I think this is ridiculously ugly.
> 
> AIO is a horrible ad-hoc design, with the main excuse being "other,
> less gifted people, made that design, and we are implementing it for
> compatibility because database people - who seldom have any shred of
> taste - actually use it".
> 
> But AIO was always really really ugly.
> 
> Now you introduce the notion of doing almost arbitrary system calls
> asynchronously in threads, but then you use that ass-backwards nasty
> interface to do so.

[ ... ]

> I'm adding Ingo to the cc, because I think Ingo had a "run this list
> of system calls" patch at one point - in order to avoid system call
> overhead. I don't think that was very interesting (because system call
> overhead is seldom all that noticeable for any interesting system
> calls), but with the "let's do the list asynchronously" addition it
> might be much more intriguing. Ingo, do I remember correctly that it
> was you? I might be confused about who wrote that patch, and I can't
> find it now.

Zach Brown and Ingo traded a bunch of ideas.  There were threadlets and
syslets?  After a little searching, it looks like acall was a slightly
different iteration, but the patches didn't make it off oss.oracle.com:

https://lwn.net/Articles/316806/

-chris

Ingo Molnar Jan. 12, 2016, 9:53 a.m. UTC | #4
* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> What do you think? Do you think it might be possible to aim for a generic "do 
> system call asynchronously" model instead?
> 
> I'm adding Ingo to the cc, because I think Ingo had a "run this list of system 
> calls" patch at one point - in order to avoid system call overhead. I don't 
> think that was very interesting (because system call overhead is seldom all that 
> noticeable for any interesting system calls), but with the "let's do the list 
> asynchronously" addition it might be much more intriguing. Ingo, do I remember 
> correctly that it was you? I might be confused about who wrote that patch, and I 
> can't find it now.

Yeah, it was the whole 'syslets' and 'threadlets' stuff - I had both implemented 
and prototyped into a 'list directory entries asynchronously' testcase.

Threadlets was pretty close to what you are suggesting now. Here's a very good (as 
usual!) writeup from LWN:

  https://lwn.net/Articles/223899/

Thanks,

	Ingo

Patch

diff --git a/fs/aio.c b/fs/aio.c
index 4384df4..346786b 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -40,6 +40,8 @@ 
 #include <linux/ramfs.h>
 #include <linux/percpu-refcount.h>
 #include <linux/mount.h>
+#include <linux/fdtable.h>
+#include <linux/fs_struct.h>
 
 #include <asm/kmap_types.h>
 #include <asm/uaccess.h>
@@ -204,6 +206,9 @@  struct aio_kiocb {
 	unsigned long		ki_rlimit_fsize;
 	aio_thread_work_fn_t	ki_work_fn;
 	struct work_struct	ki_work;
+	struct fs_struct	*ki_fs;
+	struct files_struct	*ki_files;
+	const struct cred	*ki_cred;
 #endif
 };
 
@@ -227,6 +232,7 @@  static const struct address_space_operations aio_ctx_aops;
 static void aio_complete(struct kiocb *kiocb, long res, long res2);
 ssize_t aio_fsync(struct kiocb *iocb, int datasync);
 long aio_poll(struct aio_kiocb *iocb);
+long aio_openat(struct aio_kiocb *req);
 
 static __always_inline bool aio_may_use_threads(void)
 {
@@ -1496,6 +1502,9 @@  static int aio_thread_queue_iocb_cancel(struct kiocb *kiocb)
 static void aio_thread_fn(struct work_struct *work)
 {
 	struct aio_kiocb *iocb = container_of(work, struct aio_kiocb, ki_work);
+	struct files_struct *old_files = current->files;
+	const struct cred *old_cred = current_cred();
+	struct fs_struct *old_fs = current->fs;
 	kiocb_cancel_fn *old_cancel;
 	long ret;
 
@@ -1503,6 +1512,13 @@  static void aio_thread_fn(struct work_struct *work)
 	current->kiocb = &iocb->common;		/* For io_send_sig(). */
 	WARN_ON(atomic_read(&current->signal->sigcnt) != 1);
 
+	if (iocb->ki_fs)
+		current->fs = iocb->ki_fs;
+	if (iocb->ki_files)
+		current->files = iocb->ki_files;
+	if (iocb->ki_cred)
+		current->cred = iocb->ki_cred;
+
 	/* Check for early stage cancellation and switch to late stage
 	 * cancellation if it has not already occurred.
 	 */
@@ -1519,6 +1535,19 @@  static void aio_thread_fn(struct work_struct *work)
 		     ret == -ERESTARTNOHAND || ret == -ERESTART_RESTARTBLOCK))
 		ret = -EINTR;
 
+	if (iocb->ki_cred) {
+		current->cred = old_cred;
+		put_cred(iocb->ki_cred);
+	}
+	if (iocb->ki_files) {
+		current->files = old_files;
+		put_files_struct(iocb->ki_files);
+	}
+	if (iocb->ki_fs) {
+		exit_fs(current);
+		current->fs = old_fs;
+	}
+
 	/* Completion serializes cancellation by taking ctx_lock, so
 	 * aio_complete() will not return until after force_sig() in
 	 * aio_thread_queue_iocb_cancel().  This should ensure that
@@ -1530,6 +1559,9 @@  static void aio_thread_fn(struct work_struct *work)
 }
 
 #define AIO_THREAD_NEED_TASK	0x0001	/* Need aio_kiocb->ki_submit_task */
+#define AIO_THREAD_NEED_FS	0x0002	/* Need aio_kiocb->ki_fs */
+#define AIO_THREAD_NEED_FILES	0x0004	/* Need aio_kiocb->ki_files */
+#define AIO_THREAD_NEED_CRED	0x0008	/* Need aio_kiocb->ki_cred */
 
 /* aio_thread_queue_iocb
  *	Queues an aio_kiocb for dispatch to a worker thread.  Prepares the
@@ -1547,6 +1579,20 @@  static ssize_t aio_thread_queue_iocb(struct aio_kiocb *iocb,
 		iocb->ki_submit_task = current;
 		get_task_struct(iocb->ki_submit_task);
 	}
+	if (flags & AIO_THREAD_NEED_FS) {
+		struct fs_struct *fs = current->fs;
+
+		iocb->ki_fs = fs;
+		spin_lock(&fs->lock);
+		fs->users++;
+		spin_unlock(&fs->lock);
+	}
+	if (flags & AIO_THREAD_NEED_FILES) {
+		iocb->ki_files = current->files;
+		atomic_inc(&iocb->ki_files->count);
+	}
+	if (flags & AIO_THREAD_NEED_CRED)
+		iocb->ki_cred = get_current_cred();
 
 	/* Cancellation needs to be always available for operations performed
 	 * using helper threads.  Prior to the iocb being assigned to a worker
@@ -1716,22 +1762,54 @@  long aio_poll(struct aio_kiocb *req)
 {
 	return aio_thread_queue_iocb(req, aio_thread_op_poll, 0);
 }
+
+static long aio_thread_op_openat(struct aio_kiocb *req)
+{
+	u64 buf, offset;
+	long ret;
+	u32 fd;
+
+	use_mm(req->ki_ctx->mm);
+	if (unlikely(__get_user(fd, &req->ki_user_iocb->aio_fildes)))
+		ret = -EFAULT;
+	else if (unlikely(__get_user(buf, &req->ki_user_iocb->aio_buf)))
+		ret = -EFAULT;
+	else if (unlikely(__get_user(offset, &req->ki_user_iocb->aio_offset)))
+		ret = -EFAULT;
+	else {
+		ret = do_sys_open((s32)fd,
+				  (const char __user *)(long)buf,
+				  (int)offset,
+				  (unsigned short)(offset >> 32));
+	}
+	unuse_mm(req->ki_ctx->mm);
+	return ret;
+}
+
+long aio_openat(struct aio_kiocb *req)
+{
+	return aio_thread_queue_iocb(req, aio_thread_op_openat,
+				     AIO_THREAD_NEED_TASK |
+				     AIO_THREAD_NEED_FILES |
+				     AIO_THREAD_NEED_CRED);
+}
 #endif /* IS_ENABLED(CONFIG_AIO_THREAD) */
 
 /*
  * aio_run_iocb:
  *	Performs the initial checks and io submission.
  */
-static ssize_t aio_run_iocb(struct aio_kiocb *req, unsigned opcode,
-			    char __user *buf, size_t len, bool compat)
+static ssize_t aio_run_iocb(struct aio_kiocb *req, struct iocb *user_iocb,
+			    bool compat)
 {
 	struct file *file = req->common.ki_filp;
 	ssize_t ret = -EINVAL;
+	char __user *buf;
 	int rw;
 	fmode_t mode;
 	rw_iter_op *iter_op;
 
-	switch (opcode) {
+	switch (user_iocb->aio_lio_opcode) {
 	case IOCB_CMD_PREAD:
 	case IOCB_CMD_PREADV:
 		mode	= FMODE_READ;
@@ -1768,12 +1846,17 @@  rw_common:
 		if (!iter_op)
 			return -EINVAL;
 
-		if (opcode == IOCB_CMD_PREADV || opcode == IOCB_CMD_PWRITEV)
-			ret = aio_setup_vectored_rw(rw, buf, len,
+		buf = (char __user *)(unsigned long)user_iocb->aio_buf;
+		if (user_iocb->aio_lio_opcode == IOCB_CMD_PREADV ||
+		    user_iocb->aio_lio_opcode == IOCB_CMD_PWRITEV)
+			ret = aio_setup_vectored_rw(rw, buf,
+						    user_iocb->aio_nbytes,
 						    &req->ki_iovec, compat,
 						    &req->ki_iter);
 		else {
-			ret = import_single_range(rw, buf, len, req->ki_iovec,
+			ret = import_single_range(rw, buf,
+						  user_iocb->aio_nbytes,
+						  req->ki_iovec,
 						  &req->ki_iter);
 		}
 		if (!ret)
@@ -1810,6 +1893,11 @@  rw_common:
 			ret = aio_poll(req);
 		break;
 
+	case IOCB_CMD_OPENAT:
+		if (aio_may_use_threads())
+			ret = aio_openat(req);
+		break;
+
 	default:
 		pr_debug("EINVAL: no operation provided\n");
 		return -EINVAL;
@@ -1856,14 +1944,19 @@  static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 	if (unlikely(!req))
 		return -EAGAIN;
 
-	req->common.ki_filp = fget(iocb->aio_fildes);
-	if (unlikely(!req->common.ki_filp)) {
-		ret = -EBADF;
-		goto out_put_req;
+	if (iocb->aio_lio_opcode == IOCB_CMD_OPENAT)
+		req->common.ki_filp = NULL;
+	else {
+		req->common.ki_filp = fget(iocb->aio_fildes);
+		if (unlikely(!req->common.ki_filp)) {
+			ret = -EBADF;
+			goto out_put_req;
+		}
 	}
 	req->common.ki_pos = iocb->aio_offset;
 	req->common.ki_complete = aio_complete;
-	req->common.ki_flags = iocb_flags(req->common.ki_filp);
+	if (req->common.ki_filp)
+		req->common.ki_flags = iocb_flags(req->common.ki_filp);
 
 	if (iocb->aio_flags & IOCB_FLAG_RESFD) {
 		/*
@@ -1891,10 +1984,7 @@  static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 	req->ki_user_iocb = user_iocb;
 	req->ki_user_data = iocb->aio_data;
 
-	ret = aio_run_iocb(req, iocb->aio_lio_opcode,
-			   (char __user *)(unsigned long)iocb->aio_buf,
-			   iocb->aio_nbytes,
-			   compat);
+	ret = aio_run_iocb(req, iocb, compat);
 	if (ret)
 		goto out_put_req;
 
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index 7639fb1..0e16988 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -44,6 +44,8 @@  enum {
 	IOCB_CMD_NOOP = 6,
 	IOCB_CMD_PREADV = 7,
 	IOCB_CMD_PWRITEV = 8,
+
+	IOCB_CMD_OPENAT = 9,
 };
 
 /*
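
For reference, a minimal userspace sketch of driving the new opcode.  It
follows the argument packing that aio_thread_op_openat() above unpacks
-- open flags in bits 0-31 of aio_offset, mode in bits 32-47 -- and
assumes the patched <linux/aio_abi.h>; error handling is omitted:

        #include <linux/aio_abi.h>      /* IOCB_CMD_OPENAT, with this patch */
        #include <sys/syscall.h>
        #include <unistd.h>
        #include <fcntl.h>
        #include <string.h>
        #include <stdio.h>

        int main(void)
        {
                aio_context_t ctx = 0;
                struct iocb cb, *cbs[1] = { &cb };
                struct io_event ev;

                syscall(SYS_io_setup, 32, &ctx);

                memset(&cb, 0, sizeof(cb));
                cb.aio_lio_opcode = IOCB_CMD_OPENAT;
                cb.aio_fildes     = (__u32)AT_FDCWD;   /* openat() dirfd */
                cb.aio_buf        = (unsigned long)"some/file";
                cb.aio_offset     = O_RDONLY;          /* flags; mode unused */

                syscall(SYS_io_submit, ctx, 1, cbs);
                syscall(SYS_io_getevents, ctx, 1, 1, &ev, NULL);
                printf("async openat() returned %lld\n", (long long)ev.res);

                syscall(SYS_io_destroy, ctx);
                return 0;
        }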