From patchwork Tue Jan 8 16:56:30 2019
X-Patchwork-Submitter: Jens Axboe
X-Patchwork-Id: 10752495
From: Jens Axboe
To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe
Subject: [PATCH 01/16] fs: add an iopoll method to struct file_operations
Date: Tue, 8 Jan 2019 09:56:30 -0700
Message-Id: <20190108165645.19311-2-axboe@kernel.dk>
In-Reply-To: <20190108165645.19311-1-axboe@kernel.dk>
References: <20190108165645.19311-1-axboe@kernel.dk>

From: Christoph Hellwig

This new method is used to explicitly poll for I/O completion for an
iocb. It must be called for any iocb submitted asynchronously (that is
with a non-null ki_complete) which has the IOCB_HIPRI flag set.

The method is assisted by a new ki_cookie field in struct kiocb to store
the polling cookie.

TODO: we can probably union ki_cookie with the existing hint and I/O
priority fields to avoid struct kiocb growth.

Reviewed-by: Johannes Thumshirn
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
---
 Documentation/filesystems/vfs.txt | 3 +++
 include/linux/fs.h                | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 8dc8e9c2913f..761c6fd24a53 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -857,6 +857,7 @@ struct file_operations {
 	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
 	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
 	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
+	int (*iopoll)(struct kiocb *kiocb, bool spin);
 	int (*iterate) (struct file *, struct dir_context *);
 	int (*iterate_shared) (struct file *, struct dir_context *);
 	__poll_t (*poll) (struct file *, struct poll_table_struct *);
@@ -902,6 +903,8 @@ otherwise noted.

   write_iter: possibly asynchronous write with iov_iter as source

+  iopoll: called when aio wants to poll for completions on HIPRI iocbs
+
   iterate: called when the VFS needs to read the directory contents

   iterate_shared: called when the VFS needs to read the directory contents
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 811c77743dad..ccb0b7a63aa5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -310,6 +310,7 @@ struct kiocb {
 	int			ki_flags;
 	u16			ki_hint;
 	u16			ki_ioprio; /* See linux/ioprio.h */
+	unsigned int		ki_cookie; /* for ->iopoll */
 } __randomize_layout;

 static inline bool is_sync_kiocb(struct kiocb *kiocb)
@@ -1786,6 +1787,7 @@ struct file_operations {
 	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
 	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
 	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
+	int (*iopoll)(struct kiocb *kiocb, bool spin);
 	int (*iterate) (struct file *, struct dir_context *);
 	int (*iterate_shared) (struct file *, struct dir_context *);
 	__poll_t (*poll) (struct file *, struct poll_table_struct *);
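As an illustration (not part of the patch): seen from the submitter's side, an iocb
submitted with IOCB_HIPRI and a non-NULL ki_complete will not be completed from an
interrupt, so the submitter has to keep calling the new ->iopoll() method until its
completion handler has run. A minimal sketch, assuming a hypothetical 'done' flag
set by the submitter's ki_complete callback:

	/*
	 * Hypothetical polling loop for an async HIPRI iocb; 'done' would be
	 * set by the submitter's ->ki_complete(). 'spin' tells the lower
	 * layers whether they may busy-wait for the completion.
	 */
	static void example_reap_polled_iocb(struct kiocb *kiocb, bool *done)
	{
		/*
		 * A real caller must verify that f_op->iopoll is non-NULL
		 * before allowing IOCB_HIPRI submissions on this file.
		 */
		while (!READ_ONCE(*done)) {
			int found = kiocb->ki_filp->f_op->iopoll(kiocb, true);

			if (found < 0)
				break;		/* polling error, give up */
			if (!found)
				cond_resched();	/* nothing reaped yet */
		}
	}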
From patchwork Tue Jan 8 16:56:31 2019
X-Patchwork-Submitter: Jens Axboe
X-Patchwork-Id: 10752509
From: Jens Axboe
To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe
Subject: [PATCH 02/16] block: wire up block device iopoll method
Date: Tue, 8 Jan 2019 09:56:31 -0700
Message-Id: <20190108165645.19311-3-axboe@kernel.dk>
In-Reply-To: <20190108165645.19311-1-axboe@kernel.dk>
References: <20190108165645.19311-1-axboe@kernel.dk>

From: Christoph Hellwig

Just call blk_poll on the iocb cookie, we can derive the block device
from the inode trivially.
Reviewed-by: Johannes Thumshirn
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
---
 fs/block_dev.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index c546cdce77e6..5415579f3e14 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -279,6 +279,14 @@ struct blkdev_dio {

 static struct bio_set blkdev_dio_pool;

+static int blkdev_iopoll(struct kiocb *kiocb, bool wait)
+{
+	struct block_device *bdev = I_BDEV(kiocb->ki_filp->f_mapping->host);
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	return blk_poll(q, READ_ONCE(kiocb->ki_cookie), wait);
+}
+
 static void blkdev_bio_end_io(struct bio *bio)
 {
 	struct blkdev_dio *dio = bio->bi_private;
@@ -396,6 +404,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 				bio->bi_opf |= REQ_HIPRI;

 			qc = submit_bio(bio);
+			WRITE_ONCE(iocb->ki_cookie, qc);
 			break;
 		}
@@ -2068,6 +2077,7 @@ const struct file_operations def_blk_fops = {
 	.llseek		= block_llseek,
 	.read_iter	= blkdev_read_iter,
 	.write_iter	= blkdev_write_iter,
+	.iopoll		= blkdev_iopoll,
 	.mmap		= generic_file_mmap,
 	.fsync		= blkdev_fsync,
 	.unlocked_ioctl	= block_ioctl,
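For context (illustration only, not part of the patch): polled completion via
blk_poll() can already be driven synchronously from userspace with an O_DIRECT
read carrying RWF_HIPRI; the new ->iopoll hook exposes the same mechanism to
asynchronous submitters through the stored cookie. A rough sketch, assuming a
hypothetical /dev/nvme0n1 and a libc that exposes preadv2() and RWF_HIPRI:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdlib.h>
	#include <sys/uio.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
		struct iovec iov;
		void *buf;

		if (fd < 0 || posix_memalign(&buf, 4096, 4096))
			return 1;

		iov.iov_base = buf;	/* O_DIRECT wants aligned buffers */
		iov.iov_len = 4096;

		/* RWF_HIPRI: poll for the completion instead of sleeping on an IRQ */
		return preadv2(fd, &iov, 1, 0, RWF_HIPRI) == 4096 ? 0 : 1;
	}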
From patchwork Tue Jan 8 16:56:32 2019
X-Patchwork-Submitter: Jens Axboe
X-Patchwork-Id: 10752499
From: Jens Axboe
To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe
Subject: [PATCH 03/16] block: add bio_set_polled() helper
Date: Tue, 8 Jan 2019 09:56:32 -0700
Message-Id: <20190108165645.19311-4-axboe@kernel.dk>
In-Reply-To: <20190108165645.19311-1-axboe@kernel.dk>
References: <20190108165645.19311-1-axboe@kernel.dk>

For the upcoming async polled IO, we can't sleep allocating requests.
If we do, then we introduce a deadlock where the submitter already has
async polled IO in-flight, but can't wait for them to complete since
polled requests must be actively found and reaped.

Utilize the helper in the blockdev DIRECT_IO code.

Signed-off-by: Jens Axboe
---
 fs/block_dev.c      |  4 ++--
 include/linux/bio.h | 14 ++++++++++++++
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 5415579f3e14..2ebd2a0d7789 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -233,7 +233,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 		task_io_account_write(ret);
 	}
 	if (iocb->ki_flags & IOCB_HIPRI)
-		bio.bi_opf |= REQ_HIPRI;
+		bio_set_polled(&bio, iocb);

 	qc = submit_bio(&bio);
 	for (;;) {
@@ -401,7 +401,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 		nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
 		if (!nr_pages) {
 			if (iocb->ki_flags & IOCB_HIPRI)
-				bio->bi_opf |= REQ_HIPRI;
+				bio_set_polled(bio, iocb);

 			qc = submit_bio(bio);
 			WRITE_ONCE(iocb->ki_cookie, qc);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7380b094dcca..f6f0a2b3cbc8 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -823,5 +823,19 @@ static inline int bio_integrity_add_page(struct bio *bio, struct page *page,

 #endif /* CONFIG_BLK_DEV_INTEGRITY */

+/*
+ * Mark a bio as polled. Note that for async polled IO, the caller must
+ * expect -EWOULDBLOCK if we cannot allocate a request (or other resources).
+ * We cannot block waiting for requests on polled IO, as those completions
+ * must be found by the caller. This is different than IRQ driven IO, where
+ * it's safe to wait for IO to complete.
+ */
+static inline void bio_set_polled(struct bio *bio, struct kiocb *kiocb)
+{
+	bio->bi_opf |= REQ_HIPRI;
+	if (!is_sync_kiocb(kiocb))
+		bio->bi_opf |= REQ_NOWAIT;
+}
+
 #endif /* CONFIG_BLOCK */
 #endif /* __LINUX_BIO_H */

From patchwork Tue Jan 8 16:56:33 2019
X-Patchwork-Submitter: Jens Axboe
X-Patchwork-Id: 10752505
From: Jens Axboe
To:
linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe Subject: [PATCH 04/16] iomap: wire up the iopoll method Date: Tue, 8 Jan 2019 09:56:33 -0700 Message-Id: <20190108165645.19311-5-axboe@kernel.dk> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190108165645.19311-1-axboe@kernel.dk> References: <20190108165645.19311-1-axboe@kernel.dk> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Christoph Hellwig Store the request queue the last bio was submitted to in the iocb private data in addition to the cookie so that we find the right block device. Also refactor the common direct I/O bio submission code into a nice little helper. Signed-off-by: Christoph Hellwig Modified to use bio_set_polled(). Signed-off-by: Jens Axboe --- fs/gfs2/file.c | 2 ++ fs/iomap.c | 43 ++++++++++++++++++++++++++++--------------- fs/xfs/xfs_file.c | 1 + include/linux/iomap.h | 1 + 4 files changed, 32 insertions(+), 15 deletions(-) diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c index a2dea5bc0427..58a768e59712 100644 --- a/fs/gfs2/file.c +++ b/fs/gfs2/file.c @@ -1280,6 +1280,7 @@ const struct file_operations gfs2_file_fops = { .llseek = gfs2_llseek, .read_iter = gfs2_file_read_iter, .write_iter = gfs2_file_write_iter, + .iopoll = iomap_dio_iopoll, .unlocked_ioctl = gfs2_ioctl, .mmap = gfs2_mmap, .open = gfs2_open, @@ -1310,6 +1311,7 @@ const struct file_operations gfs2_file_fops_nolock = { .llseek = gfs2_llseek, .read_iter = gfs2_file_read_iter, .write_iter = gfs2_file_write_iter, + .iopoll = iomap_dio_iopoll, .unlocked_ioctl = gfs2_ioctl, .mmap = gfs2_mmap, .open = gfs2_open, diff --git a/fs/iomap.c b/fs/iomap.c index a3088fae567b..4ee50b76b4a1 100644 --- a/fs/iomap.c +++ b/fs/iomap.c @@ -1454,6 +1454,28 @@ struct iomap_dio { }; }; +int iomap_dio_iopoll(struct kiocb *kiocb, bool spin) +{ + struct request_queue *q = READ_ONCE(kiocb->private); + + if (!q) + return 0; + return blk_poll(q, READ_ONCE(kiocb->ki_cookie), spin); +} +EXPORT_SYMBOL_GPL(iomap_dio_iopoll); + +static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap, + struct bio *bio) +{ + atomic_inc(&dio->ref); + + if (dio->iocb->ki_flags & IOCB_HIPRI) + bio_set_polled(bio, dio->iocb); + + dio->submit.last_queue = bdev_get_queue(iomap->bdev); + dio->submit.cookie = submit_bio(bio); +} + static ssize_t iomap_dio_complete(struct iomap_dio *dio) { struct kiocb *iocb = dio->iocb; @@ -1566,7 +1588,7 @@ static void iomap_dio_bio_end_io(struct bio *bio) } } -static blk_qc_t +static void iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos, unsigned len) { @@ -1580,15 +1602,10 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos, bio->bi_private = dio; bio->bi_end_io = iomap_dio_bio_end_io; - if (dio->iocb->ki_flags & IOCB_HIPRI) - flags |= REQ_HIPRI; - get_page(page); __bio_add_page(bio, page, len, 0); bio_set_op_attrs(bio, REQ_OP_WRITE, flags); - - atomic_inc(&dio->ref); - return submit_bio(bio); + iomap_dio_submit_bio(dio, iomap, bio); } static loff_t @@ -1691,9 +1708,6 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length, bio_set_pages_dirty(bio); } - if (dio->iocb->ki_flags & IOCB_HIPRI) - bio->bi_opf |= REQ_HIPRI; - iov_iter_advance(dio->submit.iter, n); dio->size += n; @@ -1701,11 +1715,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length, copied += 
n;

 		nr_pages = iov_iter_npages(&iter, BIO_MAX_PAGES);
-
-		atomic_inc(&dio->ref);
-
-		dio->submit.last_queue = bdev_get_queue(iomap->bdev);
-		dio->submit.cookie = submit_bio(bio);
+		iomap_dio_submit_bio(dio, iomap, bio);
 	} while (nr_pages);

 	/*
@@ -1916,6 +1926,9 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (dio->flags & IOMAP_DIO_WRITE_FUA)
 		dio->flags &= ~IOMAP_DIO_NEED_SYNC;

+	WRITE_ONCE(iocb->ki_cookie, dio->submit.cookie);
+	WRITE_ONCE(iocb->private, dio->submit.last_queue);
+
 	if (!atomic_dec_and_test(&dio->ref)) {
 		if (!dio->wait_for_completion)
 			return -EIOCBQUEUED;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e47425071e65..60c2da41f0fc 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1203,6 +1203,7 @@ const struct file_operations xfs_file_operations = {
 	.write_iter	= xfs_file_write_iter,
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
+	.iopoll		= iomap_dio_iopoll,
 	.unlocked_ioctl	= xfs_file_ioctl,
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= xfs_file_compat_ioctl,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 9a4258154b25..0fefb5455bda 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -162,6 +162,7 @@ typedef int (iomap_dio_end_io_t)(struct kiocb *iocb, ssize_t ret,
 		unsigned flags);
 ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, iomap_dio_end_io_t end_io);
+int iomap_dio_iopoll(struct kiocb *kiocb, bool spin);

 #ifdef CONFIG_SWAP
 struct file;

From patchwork Tue Jan 8 16:56:34 2019
X-Patchwork-Submitter: Jens Axboe
X-Patchwork-Id: 10752515
From: Jens Axboe
To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe
Subject: [PATCH 05/16] Add io_uring IO interface
Date: Tue, 8 Jan 2019 09:56:34 -0700
Message-Id: <20190108165645.19311-6-axboe@kernel.dk>
In-Reply-To: <20190108165645.19311-1-axboe@kernel.dk>
References: <20190108165645.19311-1-axboe@kernel.dk>

The submission queue (SQ) and completion queue (CQ) rings are shared
between the application and the kernel. This eliminates the need to
copy data back and forth to submit and complete IO.

IO submissions use the io_uring_iocb data structure, and completions
are generated in the form of io_uring_event data structures. The SQ
ring is an index into the io_uring_iocb array, which makes it possible
to submit a batch of IOs without them being contiguous in the ring.
The CQ ring is always contiguous, as completion events are inherently
unordered and can point to any io_uring_iocb.

Two new system calls are added for this:

io_uring_setup(entries, iovecs, params)
	Sets up a context for doing async IO. On success, returns a file
	descriptor that the application can mmap to gain access to the
	SQ ring, CQ ring, and io_uring_iocbs.

io_uring_enter(fd, to_submit, min_complete, flags)
	Initiates IO against the rings mapped to this fd, or waits for
	them to complete, or both. The behavior is controlled by the
	parameters passed in. If 'to_submit' is non-zero, then we'll
	try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
	kernel will wait for 'min_complete' events, if they aren't
	already available.

With this setup, it's possible to do async IO with a single system
call. Future developments will enable polled IO with this interface,
and polled submission as well. The latter will enable an application
to do IO without doing ANY system calls at all.

For IRQ driven IO, an application only needs to enter the kernel for
completions if it wants to wait for them to occur.
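As an illustration (not part of the patch): a rough userspace sketch of the flow
described above, based on the UAPI added later in this patch (struct
io_uring_params, the IORING_OFF_* mmap offsets and struct io_uring_iocb). Error
handling and the memory barriers a real application needs when publishing the SQ
tail are omitted; the raw syscall numbers are the x86-64 ones added below, and
<linux/io_uring.h> is assumed to be installed from a patched kernel.

	#include <linux/io_uring.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	#ifndef __NR_io_uring_setup
	#define __NR_io_uring_setup	335	/* x86-64, from this patch */
	#endif
	#ifndef __NR_io_uring_enter
	#define __NR_io_uring_enter	336
	#endif

	int example_one_read(int data_fd, void *buf, unsigned int len)
	{
		struct io_uring_params p;

		memset(&p, 0, sizeof(p));
		int ring_fd = syscall(__NR_io_uring_setup, 4, NULL, &p);

		/* Map the SQ ring and the iocb array the fd exposes. */
		char *sq = mmap(NULL, p.sq_off.array + p.sq_entries * sizeof(__u32),
				PROT_READ | PROT_WRITE, MAP_SHARED,
				ring_fd, IORING_OFF_SQ_RING);
		struct io_uring_iocb *iocbs = mmap(NULL,
				p.sq_entries * sizeof(struct io_uring_iocb),
				PROT_READ | PROT_WRITE, MAP_SHARED,
				ring_fd, IORING_OFF_IOCB);

		__u32 *sq_tail = (__u32 *)(sq + p.sq_off.tail);
		__u32 *sq_mask = (__u32 *)(sq + p.sq_off.ring_mask);
		__u32 *sq_array = (__u32 *)(sq + p.sq_off.array);

		/* Describe one read in iocb slot 0 ... */
		memset(&iocbs[0], 0, sizeof(iocbs[0]));
		iocbs[0].opcode = IORING_OP_READ;
		iocbs[0].fd = data_fd;
		iocbs[0].off = 0;
		iocbs[0].addr = buf;
		iocbs[0].len = len;

		/* ... publish it in the SQ ring and tell the kernel about it. */
		sq_array[*sq_tail & *sq_mask] = 0;
		(*sq_tail)++;

		/* Submit it and wait for the completion in one system call;
		 * the result can then be read from the CQ ring mapping. */
		return syscall(__NR_io_uring_enter, ring_fd, 1, 1,
			       IORING_ENTER_GETEVENTS);
	}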
Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c Signed-off-by: Jens Axboe --- arch/x86/entry/syscalls/syscall_64.tbl | 2 + fs/Makefile | 2 +- fs/io_uring.c | 849 +++++++++++++++++++++++++ include/linux/syscalls.h | 5 + include/uapi/linux/io_uring.h | 101 +++ kernel/sys_ni.c | 2 + 6 files changed, 960 insertions(+), 1 deletion(-) create mode 100644 fs/io_uring.c create mode 100644 include/uapi/linux/io_uring.h diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index f0b1709a5ffb..453ff7a79002 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -343,6 +343,8 @@ 332 common statx __x64_sys_statx 333 common io_pgetevents __x64_sys_io_pgetevents 334 common rseq __x64_sys_rseq +335 common io_uring_setup __x64_sys_io_uring_setup +336 common io_uring_enter __x64_sys_io_uring_enter # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/fs/Makefile b/fs/Makefile index 293733f61594..9ef9987b4192 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -29,7 +29,7 @@ obj-$(CONFIG_SIGNALFD) += signalfd.o obj-$(CONFIG_TIMERFD) += timerfd.o obj-$(CONFIG_EVENTFD) += eventfd.o obj-$(CONFIG_USERFAULTFD) += userfaultfd.o -obj-$(CONFIG_AIO) += aio.o +obj-$(CONFIG_AIO) += aio.o io_uring.o obj-$(CONFIG_FS_DAX) += dax.o obj-$(CONFIG_FS_ENCRYPTION) += crypto/ obj-$(CONFIG_FILE_LOCKING) += locks.o diff --git a/fs/io_uring.c b/fs/io_uring.c new file mode 100644 index 000000000000..ae2b886282bb --- /dev/null +++ b/fs/io_uring.c @@ -0,0 +1,849 @@ +/* + * Shared application/kernel submission and completion ring pairs, for + * supporting fast/efficient IO. + * + * Copyright (C) 2019 Jens Axboe + */ +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +#include + +#include "internal.h" + +struct io_uring { + u32 head ____cacheline_aligned_in_smp; + u32 tail ____cacheline_aligned_in_smp; +}; + +struct io_sq_ring { + struct io_uring r; + u32 ring_mask; + u32 ring_entries; + u32 dropped; + u32 flags; + u32 array[0]; +}; + +struct io_cq_ring { + struct io_uring r; + u32 ring_mask; + u32 ring_entries; + u32 overflow; + struct io_uring_event events[0]; +}; + +struct io_iocb_ring { + struct io_sq_ring *ring; + unsigned entries; + unsigned ring_mask; + struct io_uring_iocb *iocbs; +}; + +struct io_event_ring { + struct io_cq_ring *ring; + unsigned entries; + unsigned ring_mask; +}; + +struct io_ring_ctx { + struct percpu_ref refs; + + unsigned int flags; + unsigned int max_reqs; + + struct io_iocb_ring sq_ring; + struct io_event_ring cq_ring; + + struct work_struct work; + + struct { + struct mutex uring_lock; + } ____cacheline_aligned_in_smp; + + struct { + struct mutex ring_lock; + wait_queue_head_t wait; + } ____cacheline_aligned_in_smp; + + struct { + spinlock_t completion_lock; + } ____cacheline_aligned_in_smp; +}; + +struct fsync_iocb { + struct work_struct work; + struct file *file; + bool datasync; +}; + +struct io_kiocb { + union { + struct kiocb rw; + struct fsync_iocb fsync; + }; + + struct io_ring_ctx *ki_ctx; + unsigned long ki_index; + struct list_head ki_list; + unsigned long ki_flags; +}; + +#define IO_PLUG_THRESHOLD 2 + +static struct kmem_cache *kiocb_cachep, *ioctx_cachep; + +static const struct file_operations io_scqring_fops; + +static void io_ring_ctx_free(struct work_struct *work); +static void io_ring_ctx_ref_free(struct percpu_ref 
*ref); + +static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) +{ + struct io_ring_ctx *ctx; + + ctx = kmem_cache_zalloc(ioctx_cachep, GFP_KERNEL); + if (!ctx) + return NULL; + + if (percpu_ref_init(&ctx->refs, io_ring_ctx_ref_free, 0, GFP_KERNEL)) { + kmem_cache_free(ioctx_cachep, ctx); + return NULL; + } + + ctx->flags = p->flags; + ctx->max_reqs = p->sq_entries; + + INIT_WORK(&ctx->work, io_ring_ctx_free); + + spin_lock_init(&ctx->completion_lock); + mutex_init(&ctx->ring_lock); + init_waitqueue_head(&ctx->wait); + mutex_init(&ctx->uring_lock); + + return ctx; +} + +static void io_inc_cqring(struct io_ring_ctx *ctx) +{ + struct io_cq_ring *ring = ctx->cq_ring.ring; + + ring->r.tail++; + smp_wmb(); +} + +static struct io_uring_event *io_peek_cqring(struct io_ring_ctx *ctx) +{ + struct io_cq_ring *ring = ctx->cq_ring.ring; + unsigned tail; + + smp_rmb(); + tail = READ_ONCE(ring->r.tail); + if (tail + 1 == READ_ONCE(ring->r.head)) + return NULL; + + return &ring->events[tail & ctx->cq_ring.ring_mask]; +} + +static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx) +{ + struct io_kiocb *req; + + req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL); + if (!req) + return NULL; + + percpu_ref_get(&ctx->refs); + req->ki_ctx = ctx; + INIT_LIST_HEAD(&req->ki_list); + req->ki_flags = 0; + return req; +} + +static inline void iocb_put(struct io_kiocb *iocb) +{ + percpu_ref_put(&iocb->ki_ctx->refs); + kmem_cache_free(kiocb_cachep, iocb); +} + +static void io_complete_iocb(struct io_ring_ctx *ctx, struct io_kiocb *iocb) +{ + if (waitqueue_active(&ctx->wait)) + wake_up(&ctx->wait); + iocb_put(iocb); +} + +static void kiocb_end_write(struct kiocb *kiocb) +{ + if (kiocb->ki_flags & IOCB_WRITE) { + struct inode *inode = file_inode(kiocb->ki_filp); + + /* + * Tell lockdep we inherited freeze protection from submission + * thread. + */ + if (S_ISREG(inode->i_mode)) + __sb_writers_acquired(inode->i_sb, SB_FREEZE_WRITE); + file_end_write(kiocb->ki_filp); + } +} + +static void io_fill_event(struct io_uring_event *ev, struct io_kiocb *kiocb, + long res, unsigned flags) +{ + ev->index = kiocb->ki_index; + ev->res = res; + ev->flags = flags; +} + +static void io_cqring_fill_event(struct io_kiocb *iocb, long res, + unsigned ev_flags) +{ + struct io_ring_ctx *ctx = iocb->ki_ctx; + struct io_uring_event *ev; + unsigned long flags; + + /* + * If we can't get a cq entry, userspace overflowed the + * submission (by quite a lot). Increment the overflow count in + * the ring. 
+ */ + spin_lock_irqsave(&ctx->completion_lock, flags); + ev = io_peek_cqring(ctx); + if (ev) { + io_fill_event(ev, iocb, res, ev_flags); + io_inc_cqring(ctx); + } else + ctx->cq_ring.ring->overflow++; + spin_unlock_irqrestore(&ctx->completion_lock, flags); +} + +static void io_complete_scqring(struct io_kiocb *iocb, long res, unsigned flags) +{ + io_cqring_fill_event(iocb, res, flags); + io_complete_iocb(iocb->ki_ctx, iocb); +} + +static void io_complete_scqring_rw(struct kiocb *kiocb, long res, long res2) +{ + struct io_kiocb *iocb = container_of(kiocb, struct io_kiocb, rw); + + kiocb_end_write(kiocb); + + fput(kiocb->ki_filp); + io_complete_scqring(iocb, res, 0); +} + +static int io_prep_rw(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb) +{ + struct kiocb *req = &kiocb->rw; + int ret; + + req->ki_filp = fget(iocb->fd); + if (unlikely(!req->ki_filp)) + return -EBADF; + req->ki_pos = iocb->off; + req->ki_flags = iocb_flags(req->ki_filp); + req->ki_hint = ki_hint_validate(file_write_hint(req->ki_filp)); + if (iocb->ioprio) { + ret = ioprio_check_cap(iocb->ioprio); + if (ret) + goto out_fput; + + req->ki_ioprio = iocb->ioprio; + } else + req->ki_ioprio = get_current_ioprio(); + + ret = kiocb_set_rw_flags(req, iocb->rw_flags); + if (unlikely(ret)) + goto out_fput; + + /* no one is going to poll for this I/O */ + req->ki_flags &= ~IOCB_HIPRI; + req->ki_complete = io_complete_scqring_rw; + return 0; +out_fput: + fput(req->ki_filp); + return ret; +} + +static int io_setup_rw(int rw, const struct io_uring_iocb *iocb, + struct iovec **iovec, struct iov_iter *iter) +{ + void __user *buf = (void __user *)(uintptr_t)iocb->addr; + size_t ret; + + ret = import_single_range(rw, buf, iocb->len, *iovec, iter); + *iovec = NULL; + return ret; +} + +static inline void io_rw_done(struct kiocb *req, ssize_t ret) +{ + switch (ret) { + case -EIOCBQUEUED: + break; + case -ERESTARTSYS: + case -ERESTARTNOINTR: + case -ERESTARTNOHAND: + case -ERESTART_RESTARTBLOCK: + /* + * There's no easy way to restart the syscall since other AIO's + * may be already running. Just fail this IO with EINTR. 
+ */ + ret = -EINTR; + /*FALLTHRU*/ + default: + req->ki_complete(req, ret, 0); + } +} + +static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb) +{ + struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; + struct kiocb *req = &kiocb->rw; + struct iov_iter iter; + struct file *file; + ssize_t ret; + + ret = io_prep_rw(kiocb, iocb); + if (ret) + return ret; + file = req->ki_filp; + + ret = -EBADF; + if (unlikely(!(file->f_mode & FMODE_READ))) + goto out_fput; + ret = -EINVAL; + if (unlikely(!file->f_op->read_iter)) + goto out_fput; + + ret = io_setup_rw(READ, iocb, &iovec, &iter); + if (ret) + goto out_fput; + + ret = rw_verify_area(READ, file, &req->ki_pos, iov_iter_count(&iter)); + if (!ret) + io_rw_done(req, call_read_iter(file, req, &iter)); + kfree(iovec); +out_fput: + if (unlikely(ret)) + fput(file); + return ret; +} + +static ssize_t io_write(struct io_kiocb *kiocb, + const struct io_uring_iocb *iocb) +{ + struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; + struct kiocb *req = &kiocb->rw; + struct iov_iter iter; + struct file *file; + ssize_t ret; + + ret = io_prep_rw(kiocb, iocb); + if (ret) + return ret; + file = req->ki_filp; + + ret = -EBADF; + if (unlikely(!(file->f_mode & FMODE_WRITE))) + goto out_fput; + ret = -EINVAL; + if (unlikely(!file->f_op->write_iter)) + goto out_fput; + + ret = io_setup_rw(WRITE, iocb, &iovec, &iter); + if (ret) + goto out_fput; + ret = rw_verify_area(WRITE, file, &req->ki_pos, iov_iter_count(&iter)); + if (!ret) { + /* + * Open-code file_start_write here to grab freeze protection, + * which will be released by another thread in + * io_complete_rw(). Fool lockdep by telling it the lock got + * released so that it doesn't complain about the held lock when + * we return to userspace. 
+ */ + if (S_ISREG(file_inode(file)->i_mode)) { + __sb_start_write(file_inode(file)->i_sb, SB_FREEZE_WRITE, true); + __sb_writers_release(file_inode(file)->i_sb, SB_FREEZE_WRITE); + } + req->ki_flags |= IOCB_WRITE; + io_rw_done(req, call_write_iter(file, req, &iter)); + } + kfree(iovec); +out_fput: + if (unlikely(ret)) + fput(file); + return ret; +} + +static void io_fsync_work(struct work_struct *work) +{ + struct fsync_iocb *req = container_of(work, struct fsync_iocb, work); + struct io_kiocb *iocb = container_of(req, struct io_kiocb, fsync); + int ret; + + ret = vfs_fsync(req->file, req->datasync); + fput(req->file); + + io_complete_scqring(iocb, ret, 0); +} + +static int io_fsync(struct fsync_iocb *req, const struct io_uring_iocb *iocb, + bool datasync) +{ + if (unlikely(iocb->addr || iocb->off || iocb->len || iocb->__resv)) + return -EINVAL; + + req->file = fget(iocb->fd); + if (unlikely(!req->file)) + return -EBADF; + if (unlikely(!req->file->f_op->fsync)) { + fput(req->file); + return -EINVAL; + } + + req->datasync = datasync; + INIT_WORK(&req->work, io_fsync_work); + schedule_work(&req->work); + return 0; +} + +static int __io_submit_one(struct io_ring_ctx *ctx, + const struct io_uring_iocb *iocb, + unsigned long ki_index) +{ + struct io_kiocb *req; + ssize_t ret; + + /* enforce forwards compatibility on users */ + if (unlikely(iocb->flags)) + return -EINVAL; + + req = io_get_req(ctx); + if (unlikely(!req)) + return -EAGAIN; + + ret = -EINVAL; + if (ki_index >= ctx->max_reqs) + goto out_put_req; + req->ki_index = ki_index; + + ret = -EINVAL; + switch (iocb->opcode) { + case IORING_OP_READ: + ret = io_read(req, iocb); + break; + case IORING_OP_WRITE: + ret = io_write(req, iocb); + break; + case IORING_OP_FSYNC: + ret = io_fsync(&req->fsync, iocb, false); + break; + case IORING_OP_FDSYNC: + ret = io_fsync(&req->fsync, iocb, true); + break; + default: + ret = -EINVAL; + break; + } + + /* + * If ret is 0, ->ki_complete() has either been called, or will get + * called later on. Anything else, we need to free the req. + */ + if (ret) + goto out_put_req; + return 0; +out_put_req: + iocb_put(req); + return ret; +} + +static void io_inc_sqring(struct io_ring_ctx *ctx) +{ + struct io_sq_ring *ring = ctx->sq_ring.ring; + + ring->r.head++; + smp_wmb(); +} + +static const struct io_uring_iocb *io_peek_sqring(struct io_ring_ctx *ctx, + unsigned *iocb_index) +{ + struct io_sq_ring *ring = ctx->sq_ring.ring; + unsigned head; + + smp_rmb(); + head = READ_ONCE(ring->r.head); + if (head == READ_ONCE(ring->r.tail)) + return NULL; + + head = ring->array[head & ctx->sq_ring.ring_mask]; + if (head < ctx->sq_ring.entries) { + *iocb_index = head; + return &ctx->sq_ring.iocbs[head]; + } + + /* drop invalid entries */ + ring->r.head++; + ring->dropped++; + smp_wmb(); + return NULL; +} + +static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) +{ + int i, ret = 0, submit = 0; + struct blk_plug plug; + + if (to_submit > IO_PLUG_THRESHOLD) + blk_start_plug(&plug); + + for (i = 0; i < to_submit; i++) { + const struct io_uring_iocb *iocb; + unsigned iocb_index; + + iocb = io_peek_sqring(ctx, &iocb_index); + if (!iocb) + break; + + ret = __io_submit_one(ctx, iocb, iocb_index); + if (ret) + break; + + submit++; + io_inc_sqring(ctx); + } + + if (to_submit > IO_PLUG_THRESHOLD) + blk_finish_plug(&plug); + + return submit ? submit : ret; +} + +/* + * Wait until events become available, if we don't already have some. The + * application must reap them itself, as they reside on the shared cq ring. 
+ */ +static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events) +{ + struct io_cq_ring *ring = ctx->cq_ring.ring; + DEFINE_WAIT(wait); + int ret; + + smp_rmb(); + if (ring->r.head != ring->r.tail) + return 0; + if (!min_events) + return 0; + + do { + prepare_to_wait(&ctx->wait, &wait, TASK_INTERRUPTIBLE); + + ret = 0; + smp_rmb(); + if (ring->r.head != ring->r.tail) + break; + + schedule(); + + ret = -EINTR; + if (signal_pending(current)) + break; + } while (1); + + finish_wait(&ctx->wait, &wait); + return ring->r.head == ring->r.tail ? ret : 0; +} + +static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit, + unsigned min_complete, unsigned flags) +{ + int ret = 0; + + if (to_submit) { + ret = io_ring_submit(ctx, to_submit); + if (ret < 0) + return ret; + } + if (flags & IORING_ENTER_GETEVENTS) { + int get_ret; + + if (!ret && to_submit) + min_complete = 0; + + get_ret = io_cqring_wait(ctx, min_complete); + if (get_ret < 0 && !ret) + ret = get_ret; + } + + return ret; +} + +static void io_free_scq_urings(struct io_ring_ctx *ctx) +{ + if (ctx->sq_ring.ring) { + page_frag_free(ctx->sq_ring.ring); + ctx->sq_ring.ring = NULL; + } + if (ctx->sq_ring.iocbs) { + page_frag_free(ctx->sq_ring.iocbs); + ctx->sq_ring.iocbs = NULL; + } + if (ctx->cq_ring.ring) { + page_frag_free(ctx->cq_ring.ring); + ctx->cq_ring.ring = NULL; + } +} + +static void io_ring_ctx_free(struct work_struct *work) +{ + struct io_ring_ctx *ctx = container_of(work, struct io_ring_ctx, work); + + io_free_scq_urings(ctx); + percpu_ref_exit(&ctx->refs); + kmem_cache_free(ioctx_cachep, ctx); +} + +static void io_ring_ctx_ref_free(struct percpu_ref *ref) +{ + struct io_ring_ctx *ctx = container_of(ref, struct io_ring_ctx, refs); + + schedule_work(&ctx->work); +} + +static int io_scqring_release(struct inode *inode, struct file *file) +{ + struct io_ring_ctx *ctx = file->private_data; + + file->private_data = NULL; + percpu_ref_kill(&ctx->refs); + return 0; +} + +static int io_scqring_mmap(struct file *file, struct vm_area_struct *vma) +{ + loff_t offset = (loff_t) vma->vm_pgoff << PAGE_SHIFT; + unsigned long sz = vma->vm_end - vma->vm_start; + struct io_ring_ctx *ctx = file->private_data; + unsigned long pfn; + struct page *page; + void *ptr; + + switch (offset) { + case IORING_OFF_SQ_RING: + ptr = ctx->sq_ring.ring; + break; + case IORING_OFF_IOCB: + ptr = ctx->sq_ring.iocbs; + break; + case IORING_OFF_CQ_RING: + ptr = ctx->cq_ring.ring; + break; + default: + return -EINVAL; + } + + page = virt_to_head_page(ptr); + if (sz > (PAGE_SIZE << compound_order(page))) + return -EINVAL; + + pfn = virt_to_phys(ptr) >> PAGE_SHIFT; + return remap_pfn_range(vma, vma->vm_start, pfn, sz, vma->vm_page_prot); +} + +SYSCALL_DEFINE4(io_uring_enter, unsigned int, fd, u32, to_submit, + u32, min_complete, u32, flags) +{ + long ret = -EBADF; + struct fd f; + + f = fdget(fd); + if (f.file) { + struct io_ring_ctx *ctx; + + ret = -EOPNOTSUPP; + if (f.file->f_op != &io_scqring_fops) + goto err; + + ctx = f.file->private_data; + ret = -EBUSY; + if (!mutex_trylock(&ctx->uring_lock)) + goto err; + + ret = __io_uring_enter(ctx, to_submit, min_complete, flags); + mutex_unlock(&ctx->uring_lock); +err: + fdput(f); + } + + return ret; +} + +static const struct file_operations io_scqring_fops = { + .release = io_scqring_release, + .mmap = io_scqring_mmap, +}; + +static void *io_mem_alloc(size_t size) +{ + gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP | + __GFP_NORETRY; + + return (void *) __get_free_pages(gfp_flags, 
get_order(size)); +} + +static int io_allocate_scq_urings(struct io_ring_ctx *ctx, + struct io_uring_params *p) +{ + struct io_sq_ring *sq_ring; + struct io_cq_ring *cq_ring; + + sq_ring = io_mem_alloc(struct_size(sq_ring, array, p->sq_entries)); + if (!sq_ring) + return -ENOMEM; + + ctx->sq_ring.ring = sq_ring; + sq_ring->ring_mask = p->sq_entries - 1; + sq_ring->ring_entries = p->sq_entries; + ctx->sq_ring.ring_mask = sq_ring->ring_mask; + ctx->sq_ring.entries = sq_ring->ring_entries; + + ctx->sq_ring.iocbs = io_mem_alloc(sizeof(struct io_uring_iocb) * + p->sq_entries); + if (!ctx->sq_ring.iocbs) + goto err; + + cq_ring = io_mem_alloc(struct_size(cq_ring, events, p->cq_entries)); + if (!cq_ring) + goto err; + + ctx->cq_ring.ring = cq_ring; + cq_ring->ring_mask = p->cq_entries - 1; + cq_ring->ring_entries = p->cq_entries; + ctx->cq_ring.ring_mask = cq_ring->ring_mask; + ctx->cq_ring.entries = cq_ring->ring_entries; + return 0; +err: + io_free_scq_urings(ctx); + return -ENOMEM; +} + +static void io_fill_offsets(struct io_uring_params *p) +{ + memset(&p->sq_off, 0, sizeof(p->sq_off)); + p->sq_off.head = offsetof(struct io_sq_ring, r.head); + p->sq_off.tail = offsetof(struct io_sq_ring, r.tail); + p->sq_off.ring_mask = offsetof(struct io_sq_ring, ring_mask); + p->sq_off.ring_entries = offsetof(struct io_sq_ring, ring_entries); + p->sq_off.flags = offsetof(struct io_sq_ring, flags); + p->sq_off.dropped = offsetof(struct io_sq_ring, dropped); + p->sq_off.array = offsetof(struct io_sq_ring, array); + + memset(&p->cq_off, 0, sizeof(p->cq_off)); + p->cq_off.head = offsetof(struct io_cq_ring, r.head); + p->cq_off.tail = offsetof(struct io_cq_ring, r.tail); + p->cq_off.ring_mask = offsetof(struct io_cq_ring, ring_mask); + p->cq_off.ring_entries = offsetof(struct io_cq_ring, ring_entries); + p->cq_off.overflow = offsetof(struct io_cq_ring, overflow); + p->cq_off.events = offsetof(struct io_cq_ring, events); +} + +static int io_uring_create(unsigned entries, struct io_uring_params *p) +{ + struct io_ring_ctx *ctx; + int ret; + + /* + * Use twice as many entries for the CQ ring. It's possible for the + * application to drive a higher depth than the size of the SQ ring, + * since the iocbs are only used at submission time. This allows for + * some flexibility in overcommitting a bit. + */ + p->sq_entries = roundup_pow_of_two(entries); + p->cq_entries = 2 * p->sq_entries; + + ctx = io_ring_ctx_alloc(p); + if (!ctx) + return -ENOMEM; + + ret = io_allocate_scq_urings(ctx, p); + if (ret) + goto err; + + ret = anon_inode_getfd("[io_uring]", &io_scqring_fops, ctx, + O_RDWR | O_CLOEXEC); + if (ret < 0) + goto err; + + io_fill_offsets(p); + return ret; +err: + percpu_ref_kill(&ctx->refs); + return ret; +} + +/* + * sys_io_uring_setup: + * Sets up an aio uring context, and returns the fd. Applications asks + * for a ring size, we return the actual sq/cq ring sizes (among other + * things) in the params structure passed in. 
+ */ +SYSCALL_DEFINE3(io_uring_setup, u32, entries, struct iovec __user *, iovecs, + struct io_uring_params __user *, params) +{ + struct io_uring_params p; + long ret; + int i; + + if (copy_from_user(&p, params, sizeof(p))) + return -EFAULT; + for (i = 0; i < ARRAY_SIZE(p.resv); i++) { + if (p.resv[i]) + return -EINVAL; + } + + if (p.flags) + return -EINVAL; + if (iovecs) + return -EINVAL; + + ret = io_uring_create(entries, &p); + if (ret < 0) + return ret; + + if (copy_to_user(params, &p, sizeof(p))) + return -EFAULT; + + return ret; +} + +static int __init io_uring_setup(void) +{ + kiocb_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC); + ioctx_cachep = KMEM_CACHE(io_ring_ctx, SLAB_HWCACHE_ALIGN | SLAB_PANIC); + return 0; +}; +__initcall(io_uring_setup); diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 257cccba3062..6d40939f65cd 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -69,6 +69,7 @@ struct file_handle; struct sigaltstack; struct rseq; union bpf_attr; +struct io_uring_params; #include #include @@ -309,6 +310,10 @@ asmlinkage long sys_io_pgetevents_time32(aio_context_t ctx_id, struct io_event __user *events, struct old_timespec32 __user *timeout, const struct __aio_sigset *sig); +asmlinkage long sys_io_uring_setup(u32 entries, struct iovec __user *iov, + struct io_uring_params __user *p); +asmlinkage long sys_io_uring_enter(unsigned int fd, u32 to_submit, + u32 min_complete, u32 flags); /* fs/xattr.c */ asmlinkage long sys_setxattr(const char __user *path, const char __user *name, diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h new file mode 100644 index 000000000000..c31ac84d9f53 --- /dev/null +++ b/include/uapi/linux/io_uring.h @@ -0,0 +1,101 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +/* + * Header file for the io_uring interface. + * + * Copyright (C) 2019 Jens Axboe + * Copyright (C) 2019 Christoph Hellwig + */ +#ifndef LINUX_IO_URING_H +#define LINUX_IO_URING_H + +#include +#include + +/* + * IO submission data structure + */ +struct io_uring_iocb { + __u8 opcode; + __u8 flags; + __u16 ioprio; + __s32 fd; + __u64 off; + union { + void *addr; + __u64 __pad; + }; + __u32 len; + union { + __kernel_rwf_t rw_flags; + __u32 __resv; + }; +}; + +#define IORING_OP_READ 1 +#define IORING_OP_WRITE 2 +#define IORING_OP_FSYNC 3 +#define IORING_OP_FDSYNC 4 + +/* + * IO completion data structure + */ +struct io_uring_event { + __u64 index; /* what iocb this event came from */ + __s32 res; /* result code for this event */ + __u32 flags; +}; + +/* + * io_uring_event->flags + */ +#define IOEV_FLAG_CACHEHIT (1 << 0) /* IO did not hit media */ + +/* + * Magic offsets for the application to mmap the data it needs + */ +#define IORING_OFF_SQ_RING 0ULL +#define IORING_OFF_CQ_RING 0x8000000ULL +#define IORING_OFF_IOCB 0x10000000ULL + +/* + * Filled with the offset for mmap(2) + */ +struct io_sqring_offsets { + __u32 head; + __u32 tail; + __u32 ring_mask; + __u32 ring_entries; + __u32 flags; + __u32 dropped; + __u32 array; + __u32 resv[3]; +}; + +struct io_cqring_offsets { + __u32 head; + __u32 tail; + __u32 ring_mask; + __u32 ring_entries; + __u32 overflow; + __u32 events; + __u32 resv[4]; +}; + +/* + * io_uring_enter(2) flags + */ +#define IORING_ENTER_GETEVENTS (1 << 0) + +/* + * Passed in for io_uring_setup(2). 
Copied back with updated info on success
+ */
+struct io_uring_params {
+	__u32 sq_entries;
+	__u32 cq_entries;
+	__u32 flags;
+	__u16 resv[10];
+	struct io_sqring_offsets sq_off;
+	struct io_cqring_offsets cq_off;
+};
+
+#endif
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index ab9d0e3c6d50..ee5e523564bb 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -46,6 +46,8 @@ COND_SYSCALL(io_getevents);
 COND_SYSCALL(io_pgetevents);
 COND_SYSCALL_COMPAT(io_getevents);
 COND_SYSCALL_COMPAT(io_pgetevents);
+COND_SYSCALL(io_uring_setup);
+COND_SYSCALL(io_uring_enter);

 /* fs/xattr.c */
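As an illustration (not part of the patch), continuing the earlier userspace
sketch: completions are consumed straight from the shared CQ ring. The kernel
advances the tail as it posts io_uring_event entries and the application
advances the head as it reaps them. A minimal sketch, assuming 'cq' is a
mapping of IORING_OFF_CQ_RING sized as cq_off.events plus cq_entries *
sizeof(struct io_uring_event), and again ignoring the memory barriers a real
consumer needs:

	static int example_reap_one(char *cq, const struct io_uring_params *p)
	{
		__u32 *head = (__u32 *)(cq + p->cq_off.head);
		__u32 *tail = (__u32 *)(cq + p->cq_off.tail);
		__u32 mask = *(__u32 *)(cq + p->cq_off.ring_mask);
		struct io_uring_event *events =
			(struct io_uring_event *)(cq + p->cq_off.events);

		if (*head == *tail)
			return -1;		/* ring empty, nothing posted yet */

		struct io_uring_event *ev = &events[*head & mask];
		int res = ev->res;		/* result for the iocb at ev->index */

		(*head)++;			/* hand the slot back to the kernel */
		return res;
	}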
2002:a6b:3b47:: with SMTP id i68mr1584019ioa.133.1546966623916; Tue, 08 Jan 2019 08:57:03 -0800 (PST) Received: from localhost.localdomain ([216.160.245.98]) by smtp.gmail.com with ESMTPSA id m10sm17563442ioq.25.2019.01.08.08.57.02 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 08 Jan 2019 08:57:02 -0800 (PST) From: Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe Subject: [PATCH 06/16] io_uring: support for IO polling Date: Tue, 8 Jan 2019 09:56:35 -0700 Message-Id: <20190108165645.19311-7-axboe@kernel.dk> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190108165645.19311-1-axboe@kernel.dk> References: <20190108165645.19311-1-axboe@kernel.dk> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Add polled variants of the read and write commands. These act like their non-polled counterparts, except we expect to poll for completion of them. To use polling, io_uring_setup() must be used with the IORING_SETUP_IOPOLL flag being set. It is illegal to mix and match polled and non-polled IO on an io_uring. Signed-off-by: Jens Axboe --- fs/io_uring.c | 227 +++++++++++++++++++++++++++++++++- include/uapi/linux/io_uring.h | 10 +- 2 files changed, 227 insertions(+), 10 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index ae2b886282bb..02eab2f42c63 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -76,7 +76,14 @@ struct io_ring_ctx { struct work_struct work; + /* iopoll submission state */ struct { + spinlock_t poll_lock; + struct list_head poll_submitted; + } ____cacheline_aligned_in_smp; + + struct { + struct list_head poll_completing; struct mutex uring_lock; } ____cacheline_aligned_in_smp; @@ -106,10 +113,14 @@ struct io_kiocb { unsigned long ki_index; struct list_head ki_list; unsigned long ki_flags; +#define KIOCB_F_IOPOLL_COMPLETED 0 /* polled IO has completed */ +#define KIOCB_F_IOPOLL_EAGAIN 1 /* submission got EAGAIN */ }; #define IO_PLUG_THRESHOLD 2 +#define IO_IOPOLL_BATCH 8 + static struct kmem_cache *kiocb_cachep, *ioctx_cachep; static const struct file_operations io_scqring_fops; @@ -138,6 +149,9 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) spin_lock_init(&ctx->completion_lock); mutex_init(&ctx->ring_lock); init_waitqueue_head(&ctx->wait); + spin_lock_init(&ctx->poll_lock); + INIT_LIST_HEAD(&ctx->poll_submitted); + INIT_LIST_HEAD(&ctx->poll_completing); mutex_init(&ctx->uring_lock); return ctx; @@ -185,6 +199,15 @@ static inline void iocb_put(struct io_kiocb *iocb) kmem_cache_free(kiocb_cachep, iocb); } +static void iocb_put_many(struct io_ring_ctx *ctx, void **iocbs, int *nr) +{ + if (*nr) { + percpu_ref_put_many(&ctx->refs, *nr); + kmem_cache_free_bulk(kiocb_cachep, *nr, iocbs); + *nr = 0; + } +} + static void io_complete_iocb(struct io_ring_ctx *ctx, struct io_kiocb *iocb) { if (waitqueue_active(&ctx->wait)) @@ -192,6 +215,134 @@ static void io_complete_iocb(struct io_ring_ctx *ctx, struct io_kiocb *iocb) iocb_put(iocb); } +/* + * Find and free completed poll iocbs + */ +static void io_iopoll_reap(struct io_ring_ctx *ctx, unsigned int *nr_events) +{ + void *iocbs[IO_IOPOLL_BATCH]; + struct io_kiocb *iocb, *n; + int to_free = 0; + + list_for_each_entry_safe(iocb, n, &ctx->poll_completing, ki_list) { + if (!test_bit(KIOCB_F_IOPOLL_COMPLETED, &iocb->ki_flags)) + continue; + if (to_free == 
ARRAY_SIZE(iocbs)) + iocb_put_many(ctx, iocbs, &to_free); + + list_del(&iocb->ki_list); + iocbs[to_free++] = iocb; + + fput(iocb->rw.ki_filp); + (*nr_events)++; + } + + if (to_free) + iocb_put_many(ctx, iocbs, &to_free); +} + +/* + * Poll for a mininum of 'min' events, and a maximum of 'max'. Note that if + * min == 0 we consider that a non-spinning poll check - we'll still enter + * the driver poll loop, but only as a non-spinning completion check. + */ +static int io_iopoll_getevents(struct io_ring_ctx *ctx, unsigned int *nr_events, + long min) +{ + struct io_kiocb *iocb; + int found, polled, ret; + + /* + * Check if we already have done events that satisfy what we need + */ + if (!list_empty(&ctx->poll_completing)) { + io_iopoll_reap(ctx, nr_events); + if (min && *nr_events >= min) + return 0; + } + + /* + * Take in a new working set from the submitted list, if possible. + */ + if (!list_empty_careful(&ctx->poll_submitted)) { + spin_lock(&ctx->poll_lock); + list_splice_init(&ctx->poll_submitted, &ctx->poll_completing); + spin_unlock(&ctx->poll_lock); + } + + if (list_empty(&ctx->poll_completing)) + return 0; + + /* + * Check again now that we have a new batch. + */ + io_iopoll_reap(ctx, nr_events); + if (min && *nr_events >= min) + return 0; + + polled = found = 0; + list_for_each_entry(iocb, &ctx->poll_completing, ki_list) { + /* + * Poll for needed events with spin == true, anything after + * that we just check if we have more, up to max. + */ + bool spin = !polled || *nr_events < min; + struct kiocb *kiocb = &iocb->rw; + + if (test_bit(KIOCB_F_IOPOLL_COMPLETED, &iocb->ki_flags)) + break; + + found++; + ret = kiocb->ki_filp->f_op->iopoll(kiocb, spin); + if (ret < 0) + return ret; + + polled += ret; + } + + io_iopoll_reap(ctx, nr_events); + if (*nr_events >= min) + return 0; + return found; +} + +/* + * We can't just wait for polled events to come to us, we have to actively + * find and complete them. 
+ */ +static void io_iopoll_reap_events(struct io_ring_ctx *ctx) +{ + if (!(ctx->flags & IORING_SETUP_IOPOLL)) + return; + + while (!list_empty_careful(&ctx->poll_submitted) || + !list_empty(&ctx->poll_completing)) { + unsigned int nr_events = 0; + + io_iopoll_getevents(ctx, &nr_events, 1); + } +} + +static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned *nr_events, + long min) +{ + int ret = 0; + + while (!*nr_events || !need_resched()) { + int tmin = 0; + + if (*nr_events < min) + tmin = min - *nr_events; + + ret = io_iopoll_getevents(ctx, nr_events, tmin); + if (ret <= 0) + break; + ret = 0; + } + + return ret; +} + static void kiocb_end_write(struct kiocb *kiocb) { if (kiocb->ki_flags & IOCB_WRITE) { @@ -253,8 +404,23 @@ static void io_complete_scqring_rw(struct kiocb *kiocb, long res, long res2) io_complete_scqring(iocb, res, 0); } +static void io_complete_scqring_iopoll(struct kiocb *kiocb, long res, long res2) +{ + struct io_kiocb *iocb = container_of(kiocb, struct io_kiocb, rw); + + kiocb_end_write(kiocb); + + if (unlikely(res == -EAGAIN)) { + set_bit(KIOCB_F_IOPOLL_EAGAIN, &iocb->ki_flags); + } else { + io_cqring_fill_event(iocb, res, 0); + set_bit(KIOCB_F_IOPOLL_COMPLETED, &iocb->ki_flags); + } +} + static int io_prep_rw(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb) { + struct io_ring_ctx *ctx = kiocb->ki_ctx; struct kiocb *req = &kiocb->rw; int ret; @@ -277,9 +443,19 @@ static int io_prep_rw(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb) if (unlikely(ret)) goto out_fput; - /* no one is going to poll for this I/O */ - req->ki_flags &= ~IOCB_HIPRI; - req->ki_complete = io_complete_scqring_rw; + if (ctx->flags & IORING_SETUP_IOPOLL) { + ret = -EOPNOTSUPP; + if (!(req->ki_flags & IOCB_DIRECT) || + !req->ki_filp->f_op->iopoll) + goto out_fput; + + req->ki_flags |= IOCB_HIPRI; + req->ki_complete = io_complete_scqring_iopoll; + } else { + /* no one is going to poll for this I/O */ + req->ki_flags &= ~IOCB_HIPRI; + req->ki_complete = io_complete_scqring_rw; + } return 0; out_fput: fput(req->ki_filp); @@ -317,6 +493,30 @@ static inline void io_rw_done(struct kiocb *req, ssize_t ret) } } +/* + * After the iocb has been issued, it's safe to be found on the poll list. + * Adding the kiocb to the list AFTER submission ensures that we don't + * find it from a io_getevents() thread before the issuer is done accessing + * the kiocb cookie. + */ +static void io_iopoll_iocb_issued(struct io_kiocb *kiocb) +{ + /* + * For fast devices, IO may have already completed. If it has, add + * it to the front so we find it first. We can't add to the poll_done + * list as that's unlocked from the completion side. 
+ */ + const int front = test_bit(KIOCB_F_IOPOLL_COMPLETED, &kiocb->ki_flags); + struct io_ring_ctx *ctx = kiocb->ki_ctx; + + spin_lock(&ctx->poll_lock); + if (front) + list_add(&kiocb->ki_list, &ctx->poll_submitted); + else + list_add_tail(&kiocb->ki_list, &ctx->poll_submitted); + spin_unlock(&ctx->poll_lock); +} + static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; @@ -459,9 +659,13 @@ static int __io_submit_one(struct io_ring_ctx *ctx, ret = io_write(req, iocb); break; case IORING_OP_FSYNC: + if (ctx->flags & IORING_SETUP_IOPOLL) + break; ret = io_fsync(&req->fsync, iocb, false); break; case IORING_OP_FDSYNC: + if (ctx->flags & IORING_SETUP_IOPOLL) + break; ret = io_fsync(&req->fsync, iocb, true); break; default: @@ -475,6 +679,13 @@ static int __io_submit_one(struct io_ring_ctx *ctx, */ if (ret) goto out_put_req; + if (ctx->flags & IORING_SETUP_IOPOLL) { + if (test_bit(KIOCB_F_IOPOLL_EAGAIN, &req->ki_flags)) { + ret = -EAGAIN; + goto out_put_req; + } + io_iopoll_iocb_issued(req); + } return 0; out_put_req: iocb_put(req); @@ -589,12 +800,17 @@ static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit, return ret; } if (flags & IORING_ENTER_GETEVENTS) { + unsigned nr_events = 0; int get_ret; if (!ret && to_submit) min_complete = 0; - get_ret = io_cqring_wait(ctx, min_complete); + if (ctx->flags & IORING_SETUP_IOPOLL) + get_ret = io_iopoll_check(ctx, &nr_events, + min_complete); + else + get_ret = io_cqring_wait(ctx, min_complete); if (get_ret < 0 && !ret) ret = get_ret; } @@ -622,6 +838,7 @@ static void io_ring_ctx_free(struct work_struct *work) { struct io_ring_ctx *ctx = container_of(work, struct io_ring_ctx, work); + io_iopoll_reap_events(ctx); io_free_scq_urings(ctx); percpu_ref_exit(&ctx->refs); kmem_cache_free(ioctx_cachep, ctx); @@ -825,7 +1042,7 @@ SYSCALL_DEFINE3(io_uring_setup, u32, entries, struct iovec __user *, iovecs, return -EINVAL; } - if (p.flags) + if (p.flags & ~IORING_SETUP_IOPOLL) return -EINVAL; if (iovecs) return -EINVAL; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index c31ac84d9f53..f7ba30747816 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -31,6 +31,11 @@ struct io_uring_iocb { }; }; +/* + * io_uring_setup() flags + */ +#define IORING_SETUP_IOPOLL (1 << 0) /* io_context is polled */ + #define IORING_OP_READ 1 #define IORING_OP_WRITE 2 #define IORING_OP_FSYNC 3 @@ -45,11 +50,6 @@ struct io_uring_event { __u32 flags; }; -/* - * io_uring_event->flags - */ -#define IOEV_FLAG_CACHEHIT (1 << 0) /* IO did not hit media */ - /* * Magic offsets for the application to mmap the data it needs */ From patchwork Tue Jan 8 16:56:36 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jens Axboe X-Patchwork-Id: 10752519 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id C9B2117E1 for ; Tue, 8 Jan 2019 16:57:11 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id BA32328F9A for ; Tue, 8 Jan 2019 16:57:11 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id B832728FA5; Tue, 8 Jan 2019 16:57:11 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: 
X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 328AA28FB1 for ; Tue, 8 Jan 2019 16:57:11 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729167AbfAHQ5K (ORCPT ); Tue, 8 Jan 2019 11:57:10 -0500 Received: from mail-it1-f194.google.com ([209.85.166.194]:54035 "EHLO mail-it1-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729136AbfAHQ5H (ORCPT ); Tue, 8 Jan 2019 11:57:07 -0500 Received: by mail-it1-f194.google.com with SMTP id g85so7228810ita.3 for ; Tue, 08 Jan 2019 08:57:06 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=u26EGdJD+JR7lvJxPBNP6VWIt303xqioYXa90ZvBUdA=; b=PCmJo9YpzYTov01ZDphUrTNdlRewGcZiZF+8Bqrk+HsqiuKD/rXBQCNbu81KBYQTLH TzZLP+ZXd6BO8yBUvKQqk5IeqnB/9An2wv3WfykSSg5ZwaQSZqSeBCF3x58d/A0bmguP ucDBXe4DhYmxpaSMI556aj/n5Y3uZpF1zW52iUDRk3ZZhY95maICZxfwqd9asEmirmBo ClgpJzlPZ3cjuVbpImxPoMEJbGP7Kt1I0ZPCn0du6QMmhNOlx4sX6RY+OHMkmohsmZep BK5p2a5ePXRYB39JPs/esRioJMpuGhSLb7RZmrKx1WXR3J3ne4T4WiryTdk+sMtzuc2R lbew== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=u26EGdJD+JR7lvJxPBNP6VWIt303xqioYXa90ZvBUdA=; b=FrbZ/oA/S4o6BA5d96f9GrW0DmMnl9ZDg5elrEYlgRuOKsPDpy02FjoUX0C0sgh3w9 jGNx2EQqT41AH06/Q8at8YryLOxoihySNq4GS7ekI7ey6k7ewEcv2CfoOC9oZFkZyVIO ZzWUdYlPu6Ti+R6PXiJ2YUz/W+vWjnlxHNQkXuztttKNf7bpFcCDG8ROzzm/9iyjcuLk eiUa/GBVsPcan4dXCczcLlfrv8Jf+WnwRB1aXy7+NzI0qf5sGY35OE5iJGi87IRQaaaa sjx8XUx6V1df5qCudiIEpwcektSCTGpuMypFBvIFUQjJpAyrMA7WRuF2NDT6K56xqNGj m0xg== X-Gm-Message-State: AJcUukfjzFecFQoip+jnD2xrnJ6tsjZ9+98H0N9dj7/PGy+vbqPK93cC F22TZLAMu8k1zBIWxqGxFKNsArxgld9Cgw== X-Google-Smtp-Source: ALg8bN6FEx9kjlndnlitAQLs/BnfTl2aafA6DQQ1pmrcl5ZeJwdSjhLcwRm92gk4YOU0GoVRc4+z3Q== X-Received: by 2002:a02:4958:: with SMTP id z85mr1685910jaa.6.1546966625610; Tue, 08 Jan 2019 08:57:05 -0800 (PST) Received: from localhost.localdomain ([216.160.245.98]) by smtp.gmail.com with ESMTPSA id m10sm17563442ioq.25.2019.01.08.08.57.03 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 08 Jan 2019 08:57:04 -0800 (PST) From: Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe Subject: [PATCH 07/16] io_uring: add submission side request cache Date: Tue, 8 Jan 2019 09:56:36 -0700 Message-Id: <20190108165645.19311-8-axboe@kernel.dk> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190108165645.19311-1-axboe@kernel.dk> References: <20190108165645.19311-1-axboe@kernel.dk> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP We have to add each submitted polled request to the io_context poll_submitted list, which means we have to grab the poll_lock. We already use the block plug to batch submissions if we're doing a batch of IO submissions, extend that to cover the poll requests internally as well. 
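The core of this change is plain lock amortization: polled requests first collect on a per-submission list that needs no locking, and the shared poll_submitted list is only taken once per batch. A stripped-down sketch of that pattern follows (field names mirror the patch below; the helper names and struct are illustrative, not the final code):

/* Illustrative sketch of the batching pattern used in this patch. */
struct sketch_state {
	struct list_head req_list;	/* private to this submission, unlocked */
	unsigned int req_count;
};

static void sketch_flush(struct io_ring_ctx *ctx, struct sketch_state *state)
{
	spin_lock(&ctx->poll_lock);
	/* one lock acquisition moves the whole local batch across */
	list_splice_tail_init(&state->req_list, &ctx->poll_submitted);
	spin_unlock(&ctx->poll_lock);
	state->req_count = 0;
}

static void sketch_add(struct io_ring_ctx *ctx, struct sketch_state *state,
		       struct io_kiocb *kiocb)
{
	list_add_tail(&kiocb->ki_list, &state->req_list);
	if (++state->req_count >= IO_IOPOLL_BATCH)
		sketch_flush(ctx, state);
}

The plug callback gives a natural flush point if the submitter schedules mid-batch, which is why the state hangs off the block plug in the patch proper.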
Signed-off-by: Jens Axboe --- fs/io_uring.c | 122 +++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 106 insertions(+), 16 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 02eab2f42c63..9f36eb728208 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -121,6 +121,21 @@ struct io_kiocb { #define IO_IOPOLL_BATCH 8 +struct io_submit_state { + struct io_ring_ctx *ctx; + + struct blk_plug plug; +#ifdef CONFIG_BLOCK + struct blk_plug_cb plug_cb; +#endif + + /* + * Polled iocbs that have been submitted, but not added to the ctx yet + */ + struct list_head req_list; + unsigned int req_count; +}; + static struct kmem_cache *kiocb_cachep, *ioctx_cachep; static const struct file_operations io_scqring_fops; @@ -494,21 +509,29 @@ static inline void io_rw_done(struct kiocb *req, ssize_t ret) } /* - * After the iocb has been issued, it's safe to be found on the poll list. - * Adding the kiocb to the list AFTER submission ensures that we don't - * find it from a io_getevents() thread before the issuer is done accessing - * the kiocb cookie. + * Called either at the end of IO submission, or through a plug callback + * because we're going to schedule. Moves out local batch of requests to + * the ctx poll list, so they can be found for polling + reaping. */ -static void io_iopoll_iocb_issued(struct io_kiocb *kiocb) +static void io_flush_state_reqs(struct io_ring_ctx *ctx, + struct io_submit_state *state) { + spin_lock(&ctx->poll_lock); + list_splice_tail_init(&state->req_list, &ctx->poll_submitted); + spin_unlock(&ctx->poll_lock); + state->req_count = 0; +} + +static void io_iopoll_iocb_add_list(struct io_kiocb *kiocb) +{ + const int front = test_bit(KIOCB_F_IOPOLL_COMPLETED, &kiocb->ki_flags); + struct io_ring_ctx *ctx = kiocb->ki_ctx; + /* * For fast devices, IO may have already completed. If it has, add * it to the front so we find it first. We can't add to the poll_done * list as that's unlocked from the completion side. */ - const int front = test_bit(KIOCB_F_IOPOLL_COMPLETED, &kiocb->ki_flags); - struct io_ring_ctx *ctx = kiocb->ki_ctx; - spin_lock(&ctx->poll_lock); if (front) list_add(&kiocb->ki_list, &ctx->poll_submitted); @@ -517,6 +540,33 @@ static void io_iopoll_iocb_issued(struct io_kiocb *kiocb) spin_unlock(&ctx->poll_lock); } +static void io_iopoll_iocb_add_state(struct io_submit_state *state, + struct io_kiocb *kiocb) +{ + if (test_bit(KIOCB_F_IOPOLL_COMPLETED, &kiocb->ki_flags)) + list_add(&kiocb->ki_list, &state->req_list); + else + list_add_tail(&kiocb->ki_list, &state->req_list); + + if (++state->req_count >= IO_IOPOLL_BATCH) + io_flush_state_reqs(state->ctx, state); +} + +/* + * After the iocb has been issued, it's safe to be found on the poll list. + * Adding the kiocb to the list AFTER submission ensures that we don't + * find it from a io_getevents() thread before the issuer is done accessing + * the kiocb cookie. 
+ */ +static void io_iopoll_iocb_issued(struct io_submit_state *state, + struct io_kiocb *kiocb) +{ + if (!state || !IS_ENABLED(CONFIG_BLOCK)) + io_iopoll_iocb_add_list(kiocb); + else + io_iopoll_iocb_add_state(state, kiocb); +} + static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; @@ -632,7 +682,8 @@ static int io_fsync(struct fsync_iocb *req, const struct io_uring_iocb *iocb, static int __io_submit_one(struct io_ring_ctx *ctx, const struct io_uring_iocb *iocb, - unsigned long ki_index) + unsigned long ki_index, + struct io_submit_state *state) { struct io_kiocb *req; ssize_t ret; @@ -684,7 +735,7 @@ static int __io_submit_one(struct io_ring_ctx *ctx, ret = -EAGAIN; goto out_put_req; } - io_iopoll_iocb_issued(req); + io_iopoll_iocb_issued(state, req); } return 0; out_put_req: @@ -692,6 +743,43 @@ static int __io_submit_one(struct io_ring_ctx *ctx, return ret; } +#ifdef CONFIG_BLOCK +static void io_state_unplug(struct blk_plug_cb *cb, bool from_schedule) +{ + struct io_submit_state *state; + + state = container_of(cb, struct io_submit_state, plug_cb); + if (!list_empty(&state->req_list)) + io_flush_state_reqs(state->ctx, state); +} +#endif + +/* + * Batched submission is done, ensure local IO is flushed out. + */ +static void io_submit_state_end(struct io_submit_state *state) +{ + blk_finish_plug(&state->plug); + if (!list_empty(&state->req_list)) + io_flush_state_reqs(state->ctx, state); +} + +/* + * Start submission side cache. + */ +static void io_submit_state_start(struct io_submit_state *state, + struct io_ring_ctx *ctx) +{ + state->ctx = ctx; + INIT_LIST_HEAD(&state->req_list); + state->req_count = 0; +#ifdef CONFIG_BLOCK + state->plug_cb.callback = io_state_unplug; + blk_start_plug(&state->plug); + list_add(&state->plug_cb.list, &state->plug.cb_list); +#endif +} + static void io_inc_sqring(struct io_ring_ctx *ctx) { struct io_sq_ring *ring = ctx->sq_ring.ring; @@ -726,11 +814,13 @@ static const struct io_uring_iocb *io_peek_sqring(struct io_ring_ctx *ctx, static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) { + struct io_submit_state state, *statep = NULL; int i, ret = 0, submit = 0; - struct blk_plug plug; - if (to_submit > IO_PLUG_THRESHOLD) - blk_start_plug(&plug); + if (to_submit > IO_PLUG_THRESHOLD) { + io_submit_state_start(&state, ctx); + statep = &state; + } for (i = 0; i < to_submit; i++) { const struct io_uring_iocb *iocb; @@ -740,7 +830,7 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) if (!iocb) break; - ret = __io_submit_one(ctx, iocb, iocb_index); + ret = __io_submit_one(ctx, iocb, iocb_index, statep); if (ret) break; @@ -748,8 +838,8 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) io_inc_sqring(ctx); } - if (to_submit > IO_PLUG_THRESHOLD) - blk_finish_plug(&plug); + if (statep) + io_submit_state_end(statep); return submit ? 
submit : ret; } From patchwork Tue Jan 8 16:56:37 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jens Axboe X-Patchwork-Id: 10752529 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 6C297746 for ; Tue, 8 Jan 2019 16:57:15 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 5CF4028F9D for ; Tue, 8 Jan 2019 16:57:15 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 5104328FB7; Tue, 8 Jan 2019 16:57:15 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id CA35928FA9 for ; Tue, 8 Jan 2019 16:57:14 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729210AbfAHQ5M (ORCPT ); Tue, 8 Jan 2019 11:57:12 -0500 Received: from mail-io1-f66.google.com ([209.85.166.66]:45007 "EHLO mail-io1-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729115AbfAHQ5J (ORCPT ); Tue, 8 Jan 2019 11:57:09 -0500 Received: by mail-io1-f66.google.com with SMTP id r200so3649726iod.11 for ; Tue, 08 Jan 2019 08:57:08 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=scNBCeOhysecepgCupm97WAYtsMfIFMGuolJ4I0NyhI=; b=fxrvtfy3/ra3mu7B9E7ba0E9viU/L1iUEG0/Bv1Yiw/43yE3M5gXlzcBNGDR0yC8cu esviBVjgM6lSfqlG9xh8L9J6H6s13oeXM6P9QVo6nySvr1+eOZjIbF3H4ZhGlbrbFu02 Bx2Wnr0fWTI5Hix4AnIr6u6fqtZNRsHLKHs1FR0Nvhz7O1u+hlWRcWjDS2ks9tam8wKB +1rssb90CoMqHqHAwFILpyni0Pl1TJCYiX6ZEpoL7KhnCBrzpfOE4cEBmO3Zro4t6HeD X/RQQgF66qC665AkRTJXfO8/8FuhaK8Cc1MZMPGCnXAdegVdl5K3aHuKfUSdrxOpU2gc BJKw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=scNBCeOhysecepgCupm97WAYtsMfIFMGuolJ4I0NyhI=; b=oPnCYuexxWDMj4Xk/3ykePTZOiciCZ4MDiyMhMzv84TQIyk3jakGk4Qjo19tgh4srE 6YEfel6pIcTVV1byOQfmmKVmWxQgnZnhbuiSjdop0cDUOEWbPsRS+eMba8KFOYqv6Jx3 u6X3hBGPpjgnnA0hJZYPXbE8MefNvhEmIssAhm1Frmc7nZ4/bZiIOdELdMhtTLn2gvO5 lOyUMVFlcSTggjGiEozlBi/WstP4OjAv7vhwaERkuWBei9hBojuwH1hVmRSwiZ/StHnG 4xANPPcQgI2VEItv00VqZcIt5SzLQ/vB4NeAB4JW1IWpq6vCiZFeGMLAiqOwe2MEv7VS TTaA== X-Gm-Message-State: AJcUukcuOtHzh0bUjab4zY0fAyZ9H1mYerl7G1h7BR8mkMsUDq6Ci70Q rW2w0Jf8UPAzuI2npJPSmZKRTh7s/jZ+OA== X-Google-Smtp-Source: ALg8bN6wc++5t440FHDI18o4YwXVkEqhiKYp23/1pEmsxwiH8alx9j7BstlwTqA/e7o7x/zO0s+zBg== X-Received: by 2002:a6b:7b49:: with SMTP id m9mr1464987iop.237.1546966627439; Tue, 08 Jan 2019 08:57:07 -0800 (PST) Received: from localhost.localdomain ([216.160.245.98]) by smtp.gmail.com with ESMTPSA id m10sm17563442ioq.25.2019.01.08.08.57.05 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 08 Jan 2019 08:57:06 -0800 (PST) From: Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe Subject: [PATCH 08/16] fs: add fget_many() 
and fput_many() Date: Tue, 8 Jan 2019 09:56:37 -0700 Message-Id: <20190108165645.19311-9-axboe@kernel.dk> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190108165645.19311-1-axboe@kernel.dk> References: <20190108165645.19311-1-axboe@kernel.dk> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Some uses cases repeatedly get and put references to the same file, but the only exposed interface is doing these one at the time. As each of these entail an atomic inc or dec on a shared structure, that cost can add up. Add fget_many(), which works just like fget(), except it takes an argument for how many references to get on the file. Ditto fput_many(), which can drop an arbitrary number of references to a file. Signed-off-by: Jens Axboe --- fs/file.c | 15 ++++++++++----- fs/file_table.c | 9 +++++++-- include/linux/file.h | 2 ++ include/linux/fs.h | 4 +++- 4 files changed, 22 insertions(+), 8 deletions(-) diff --git a/fs/file.c b/fs/file.c index 3209ee271c41..e0d7ce70e860 100644 --- a/fs/file.c +++ b/fs/file.c @@ -705,7 +705,7 @@ void do_close_on_exec(struct files_struct *files) spin_unlock(&files->file_lock); } -static struct file *__fget(unsigned int fd, fmode_t mask) +static struct file *__fget(unsigned int fd, fmode_t mask, unsigned int refs) { struct files_struct *files = current->files; struct file *file; @@ -720,7 +720,7 @@ static struct file *__fget(unsigned int fd, fmode_t mask) */ if (file->f_mode & mask) file = NULL; - else if (!get_file_rcu(file)) + else if (!get_file_rcu_many(file, refs)) goto loop; } rcu_read_unlock(); @@ -728,15 +728,20 @@ static struct file *__fget(unsigned int fd, fmode_t mask) return file; } +struct file *fget_many(unsigned int fd, unsigned int refs) +{ + return __fget(fd, FMODE_PATH, refs); +} + struct file *fget(unsigned int fd) { - return __fget(fd, FMODE_PATH); + return fget_many(fd, 1); } EXPORT_SYMBOL(fget); struct file *fget_raw(unsigned int fd) { - return __fget(fd, 0); + return __fget(fd, 0, 1); } EXPORT_SYMBOL(fget_raw); @@ -767,7 +772,7 @@ static unsigned long __fget_light(unsigned int fd, fmode_t mask) return 0; return (unsigned long)file; } else { - file = __fget(fd, mask); + file = __fget(fd, mask, 1); if (!file) return 0; return FDPUT_FPUT | (unsigned long)file; diff --git a/fs/file_table.c b/fs/file_table.c index 5679e7fcb6b0..155d7514a094 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -326,9 +326,9 @@ void flush_delayed_fput(void) static DECLARE_DELAYED_WORK(delayed_fput_work, delayed_fput); -void fput(struct file *file) +void fput_many(struct file *file, unsigned int refs) { - if (atomic_long_dec_and_test(&file->f_count)) { + if (atomic_long_sub_and_test(refs, &file->f_count)) { struct task_struct *task = current; if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) { @@ -347,6 +347,11 @@ void fput(struct file *file) } } +void fput(struct file *file) +{ + fput_many(file, 1); +} + /* * synchronous analog of fput(); for kernel threads that might be needed * in some umount() (and thus can't use flush_delayed_fput() without diff --git a/include/linux/file.h b/include/linux/file.h index 6b2fb032416c..3fcddff56bc4 100644 --- a/include/linux/file.h +++ b/include/linux/file.h @@ -13,6 +13,7 @@ struct file; extern void fput(struct file *); +extern void fput_many(struct file *, unsigned int); struct file_operations; struct vfsmount; @@ -44,6 +45,7 @@ static inline void fdput(struct fd fd) } extern struct file *fget(unsigned int fd); 
+extern struct file *fget_many(unsigned int fd, unsigned int refs); extern struct file *fget_raw(unsigned int fd); extern unsigned long __fdget(unsigned int fd); extern unsigned long __fdget_raw(unsigned int fd); diff --git a/include/linux/fs.h b/include/linux/fs.h index ccb0b7a63aa5..acaad78b6781 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -952,7 +952,9 @@ static inline struct file *get_file(struct file *f) atomic_long_inc(&f->f_count); return f; } -#define get_file_rcu(x) atomic_long_inc_not_zero(&(x)->f_count) +#define get_file_rcu_many(x, cnt) \ + atomic_long_add_unless(&(x)->f_count, (cnt), 0) +#define get_file_rcu(x) get_file_rcu_many((x), 1) #define fput_atomic(x) atomic_long_add_unless(&(x)->f_count, -1, 1) #define file_count(x) atomic_long_read(&(x)->f_count) From patchwork Tue Jan 8 16:56:38 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jens Axboe X-Patchwork-Id: 10752521 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id BFB1613B4 for ; Tue, 8 Jan 2019 16:57:13 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id AF81828F8A for ; Tue, 8 Jan 2019 16:57:13 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id A3B9628FB2; Tue, 8 Jan 2019 16:57:13 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 157AC28F8A for ; Tue, 8 Jan 2019 16:57:12 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729184AbfAHQ5M (ORCPT ); Tue, 8 Jan 2019 11:57:12 -0500 Received: from mail-io1-f65.google.com ([209.85.166.65]:43421 "EHLO mail-io1-f65.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729132AbfAHQ5L (ORCPT ); Tue, 8 Jan 2019 11:57:11 -0500 Received: by mail-io1-f65.google.com with SMTP id b23so3654478ios.10 for ; Tue, 08 Jan 2019 08:57:10 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=IVD1aACbOf9E3T2cMeMXe0vYKp9FPjypBFliz/GxAEQ=; b=FVWoicZFSmyI3NpZH8o7XG3pwewvO05qPKnQyE4SSkEgPKTVgS6REhszGXnS1zU1vY 1RgcgkjiIODHuZ1ARgjxXCmeEtNzM6vNFRkJHNgAPPSpG4mXMVZZyLmCuATa/DR0QkVU zySx9H3csuKKFqKCTTSrs8EBKiV3iLB/omnglR09IaKc7OBLr16I4fjG5V0LPFtLCc58 +9jgD1f2UpYt3SFagG5LS/wcKwlr89eA0IkrbaFQ8qAgHiHqhI23660KBjI4HrWfl0l4 M/M+fZFkfmsIzaHn1SAACSw2wk+cCYldxU84aN5ESgUtQdwtgcnQ8ZhF94wKxODSPYYi ehTw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=IVD1aACbOf9E3T2cMeMXe0vYKp9FPjypBFliz/GxAEQ=; b=BbftUf47QgFofvk1441yrDuRANYbMGkzzBJGlUZKBOSzA/QqkLrjhti5sCXQinc1oE TUxQGpiSOeApo1I8UHv4cVowQiKAub0stzOFyOh1SZJdjs73kQeObOkLcJpBIuUboZ9T ZO+GZWzGyFn8HVqO28K5UnNPitqxyzcJtcLlwI1NHLPNfzHfx2yUuNsbhYwJQOkNRQlo LUxF0a8cdTD6V4/frxgotTHnfHglyo2PYj4tW2ZfPtG6kCCmRWP6PKjB+Wp8kExCGHp4 7LXhMwupc6jSo4DP+IiQFxOHmvu6DymrfKFsF58Vf+fNvQG/iIY8yDQ14vB9C46t3YCa n6pA== X-Gm-Message-State: 
AJcUukdGHEd7La1E6RcCndWuPHdiJekM0//XppuVDwKREL0APtxh1tNm kTALtp5G0jQBCvDQ3K53E20WllyqUr+HSw== X-Google-Smtp-Source: ALg8bN7pqQLuBunNQFEh9uG7z7iOvCRuESJ1wBwEchD7J87Wa/ZaQt/zkkZFpbMm1J15n1qFFXnx9Q== X-Received: by 2002:a6b:c402:: with SMTP id y2mr1690122ioa.77.1546966629440; Tue, 08 Jan 2019 08:57:09 -0800 (PST) Received: from localhost.localdomain ([216.160.245.98]) by smtp.gmail.com with ESMTPSA id m10sm17563442ioq.25.2019.01.08.08.57.07 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 08 Jan 2019 08:57:08 -0800 (PST) From: Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe Subject: [PATCH 09/16] io_uring: use fget/fput_many() for file references Date: Tue, 8 Jan 2019 09:56:38 -0700 Message-Id: <20190108165645.19311-10-axboe@kernel.dk> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190108165645.19311-1-axboe@kernel.dk> References: <20190108165645.19311-1-axboe@kernel.dk> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP On the submission side, add file reference batching to the io_submit_state. We get as many references as the number of iocbs we are submitting, and drop unused ones if we end up switching files. The assumption here is that we're usually only dealing with one fd, and if there are multiple, hopefuly they are at least somewhat ordered. Could trivially be extended to cover multiple fds, if needed. On the completion side we do the same thing, except this is trivially done just locally in io_iopoll_reap(). Signed-off-by: Jens Axboe --- fs/io_uring.c | 105 +++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 92 insertions(+), 13 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 9f36eb728208..afbaebb63012 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -134,6 +134,15 @@ struct io_submit_state { */ struct list_head req_list; unsigned int req_count; + + /* + * File reference cache + */ + struct file *file; + unsigned int fd; + unsigned int has_refs; + unsigned int used_refs; + unsigned int ios_left; }; static struct kmem_cache *kiocb_cachep, *ioctx_cachep; @@ -237,7 +246,8 @@ static void io_iopoll_reap(struct io_ring_ctx *ctx, unsigned int *nr_events) { void *iocbs[IO_IOPOLL_BATCH]; struct io_kiocb *iocb, *n; - int to_free = 0; + int file_count, to_free = 0; + struct file *file = NULL; list_for_each_entry_safe(iocb, n, &ctx->poll_completing, ki_list) { if (!test_bit(KIOCB_F_IOPOLL_COMPLETED, &iocb->ki_flags)) @@ -248,10 +258,27 @@ static void io_iopoll_reap(struct io_ring_ctx *ctx, unsigned int *nr_events) list_del(&iocb->ki_list); iocbs[to_free++] = iocb; - fput(iocb->rw.ki_filp); + /* + * Batched puts of the same file, to avoid dirtying the + * file usage count multiple times, if avoidable. 
+ */ + if (!file) { + file = iocb->rw.ki_filp; + file_count = 1; + } else if (file == iocb->rw.ki_filp) { + file_count++; + } else { + fput_many(file, file_count); + file = iocb->rw.ki_filp; + file_count = 1; + } + (*nr_events)++; } + if (file) + fput_many(file, file_count); + if (to_free) iocb_put_many(ctx, iocbs, &to_free); } @@ -433,13 +460,60 @@ static void io_complete_scqring_iopoll(struct kiocb *kiocb, long res, long res2) } } -static int io_prep_rw(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb) +static void io_file_put(struct io_submit_state *state, struct file *file) +{ + if (!state) { + fput(file); + } else if (state->file) { + int diff = state->has_refs - state->used_refs; + + if (diff) + fput_many(state->file, diff); + state->file = NULL; + } +} + +/* + * Get as many references to a file as we have IOs left in this submission, + * assuming most submissions are for one file, or at least that each file + * has more than one submission. + */ +static struct file *io_file_get(struct io_submit_state *state, int fd) +{ + if (!state) + return fget(fd); + + if (!state->file) { +get_file: + state->file = fget_many(fd, state->ios_left); + if (!state->file) + return NULL; + + state->fd = fd; + state->has_refs = state->ios_left; + state->used_refs = 1; + state->ios_left--; + return state->file; + } + + if (state->fd == fd) { + state->used_refs++; + state->ios_left--; + return state->file; + } + + io_file_put(state, NULL); + goto get_file; +} + +static int io_prep_rw(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb, + struct io_submit_state *state) { struct io_ring_ctx *ctx = kiocb->ki_ctx; struct kiocb *req = &kiocb->rw; int ret; - req->ki_filp = fget(iocb->fd); + req->ki_filp = io_file_get(state, iocb->fd); if (unlikely(!req->ki_filp)) return -EBADF; req->ki_pos = iocb->off; @@ -473,7 +547,7 @@ static int io_prep_rw(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb) } return 0; out_fput: - fput(req->ki_filp); + io_file_put(state, req->ki_filp); return ret; } @@ -567,7 +641,8 @@ static void io_iopoll_iocb_issued(struct io_submit_state *state, io_iopoll_iocb_add_state(state, kiocb); } -static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb) +static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb, + struct io_submit_state *state) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct kiocb *req = &kiocb->rw; @@ -575,7 +650,7 @@ static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb) struct file *file; ssize_t ret; - ret = io_prep_rw(kiocb, iocb); + ret = io_prep_rw(kiocb, iocb, state); if (ret) return ret; file = req->ki_filp; @@ -602,7 +677,8 @@ static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb) } static ssize_t io_write(struct io_kiocb *kiocb, - const struct io_uring_iocb *iocb) + const struct io_uring_iocb *iocb, + struct io_submit_state *state) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct kiocb *req = &kiocb->rw; @@ -610,7 +686,7 @@ static ssize_t io_write(struct io_kiocb *kiocb, struct file *file; ssize_t ret; - ret = io_prep_rw(kiocb, iocb); + ret = io_prep_rw(kiocb, iocb, state); if (ret) return ret; file = req->ki_filp; @@ -704,10 +780,10 @@ static int __io_submit_one(struct io_ring_ctx *ctx, ret = -EINVAL; switch (iocb->opcode) { case IORING_OP_READ: - ret = io_read(req, iocb); + ret = io_read(req, iocb, state); break; case IORING_OP_WRITE: - ret = io_write(req, iocb); + ret = io_write(req, iocb, state); break; case 
IORING_OP_FSYNC: if (ctx->flags & IORING_SETUP_IOPOLL) @@ -762,17 +838,20 @@ static void io_submit_state_end(struct io_submit_state *state) blk_finish_plug(&state->plug); if (!list_empty(&state->req_list)) io_flush_state_reqs(state->ctx, state); + io_file_put(state, NULL); } /* * Start submission side cache. */ static void io_submit_state_start(struct io_submit_state *state, - struct io_ring_ctx *ctx) + struct io_ring_ctx *ctx, unsigned max_ios) { state->ctx = ctx; INIT_LIST_HEAD(&state->req_list); state->req_count = 0; + state->file = NULL; + state->ios_left = max_ios; #ifdef CONFIG_BLOCK state->plug_cb.callback = io_state_unplug; blk_start_plug(&state->plug); @@ -818,7 +897,7 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) int i, ret = 0, submit = 0; if (to_submit > IO_PLUG_THRESHOLD) { - io_submit_state_start(&state, ctx); + io_submit_state_start(&state, ctx, to_submit); statep = &state; } From patchwork Tue Jan 8 16:56:39 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jens Axboe X-Patchwork-Id: 10752531 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id E331117E1 for ; Tue, 8 Jan 2019 16:57:15 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id D553528F7F for ; Tue, 8 Jan 2019 16:57:15 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id D3D7C28FAC; Tue, 8 Jan 2019 16:57:15 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 8DB5428F7F for ; Tue, 8 Jan 2019 16:57:15 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728798AbfAHQ5O (ORCPT ); Tue, 8 Jan 2019 11:57:14 -0500 Received: from mail-it1-f194.google.com ([209.85.166.194]:54048 "EHLO mail-it1-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729181AbfAHQ5M (ORCPT ); Tue, 8 Jan 2019 11:57:12 -0500 Received: by mail-it1-f194.google.com with SMTP id g85so7229315ita.3 for ; Tue, 08 Jan 2019 08:57:12 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=19AVyOAHFZ2ncTN9zzAcs34D6P7c0ZFcpuRI5Xj4k84=; b=Zl+G/MTzD699j6ZJuiRixezLast5YMfXBmzS4ax6gR/iAm6dEUQIDFy3VQw48XSJSP 28k2PTy11Vsl2L1WsuV7CckBHVcqBCbadWdaHOQu56SWCNbDMTtFwA3JbFh5t3XhBT6D zWRAlf0gIY1qDy8Ddf0EAyhBcRPbDcGjpTVfPlLE9GoslgOjF3LG+PQD9Js4T7vXF6/X 7iqdnUeaYD7cOA54Xz6X8p6oCQGMe9b9RpfY+wNNh5flvn7ilg1RmDC8AUJ4sTMSEamd fZlDhAK8JgEZMJw3fbvSL3LGDcL1t03blo8pDmnS5rwswvKDlRGUj1NlRqsYe9ctlxj7 lKgA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=19AVyOAHFZ2ncTN9zzAcs34D6P7c0ZFcpuRI5Xj4k84=; b=IW5Im8bOhQtoeamzPD3v/f4Q5sBw93EUmXOu84HszeQKyKP+Qglg1ekaeKr8uDz3BX CkIknX0z2I44adY9uI34fRJVMMd9flmMErLWOD0UIzBYA9P6edyRGKPRLXYub+Z6AOQs B4LhFgGbOHoD3nPD7rTGshrzhyeTK7nOxmDf3MiW8vzjwLz3SvG+2JNLgF1iRV/mtrVQ 
w488aq8KjVmlavLi7aaaOD1jSKcKYk4d4GHGGKAvAJ9OSIV9aO9+lg9GrGCfR0a85pro 1WLsj9+0QFufN5CQR2v4/K/8G0SNYGuRHQrpiTaMH0uH556DajbUO2rQQheUQF6OTJxo a8Jg== X-Gm-Message-State: AJcUukcIEeBFux1UonutB88PUZlZyRHP6LbcBSlQGEw12zvLJyp1hqej +DwbaVc0qKFkiqO+jw6X3t+aTtasRQvlDA== X-Google-Smtp-Source: ALg8bN5pKFwQZeZUKxcRgpnwI886d90jk2HOKrvCtq1xJrWHiXjOpgTtmzYuxIbKmpRqN8aI2dJVoQ== X-Received: by 2002:a24:6fc4:: with SMTP id x187mr2037882itb.93.1546966631268; Tue, 08 Jan 2019 08:57:11 -0800 (PST) Received: from localhost.localdomain ([216.160.245.98]) by smtp.gmail.com with ESMTPSA id m10sm17563442ioq.25.2019.01.08.08.57.09 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 08 Jan 2019 08:57:10 -0800 (PST) From: Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe Subject: [PATCH 10/16] io_uring: split kiocb init from allocation Date: Tue, 8 Jan 2019 09:56:39 -0700 Message-Id: <20190108165645.19311-11-axboe@kernel.dk> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190108165645.19311-1-axboe@kernel.dk> References: <20190108165645.19311-1-axboe@kernel.dk> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP In preparation from having pre-allocated requests, that we then just need to initialize before use. Signed-off-by: Jens Axboe --- fs/io_uring.c | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index afbaebb63012..11d045f0f799 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -202,6 +202,14 @@ static struct io_uring_event *io_peek_cqring(struct io_ring_ctx *ctx) return &ring->events[tail & ctx->cq_ring.ring_mask]; } +static void io_req_init(struct io_ring_ctx *ctx, struct io_kiocb *req) +{ + percpu_ref_get(&ctx->refs); + req->ki_ctx = ctx; + INIT_LIST_HEAD(&req->ki_list); + req->ki_flags = 0; +} + static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx) { struct io_kiocb *req; @@ -210,10 +218,7 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx) if (!req) return NULL; - percpu_ref_get(&ctx->refs); - req->ki_ctx = ctx; - INIT_LIST_HEAD(&req->ki_list); - req->ki_flags = 0; + io_req_init(ctx, req); return req; } From patchwork Tue Jan 8 16:56:40 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jens Axboe X-Patchwork-Id: 10752535 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 2821A13B4 for ; Tue, 8 Jan 2019 16:57:17 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 1909C28FA4 for ; Tue, 8 Jan 2019 16:57:17 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 1719228F9A; Tue, 8 Jan 2019 16:57:17 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id B470128F9A for ; Tue, 8 Jan 2019 16:57:16 +0000 (UTC) Received: (majordomo@vger.kernel.org) by 
vger.kernel.org via listexpand id S1729223AbfAHQ5P (ORCPT ); Tue, 8 Jan 2019 11:57:15 -0500 Received: from mail-io1-f66.google.com ([209.85.166.66]:38818 "EHLO mail-io1-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729179AbfAHQ5O (ORCPT ); Tue, 8 Jan 2019 11:57:14 -0500 Received: by mail-io1-f66.google.com with SMTP id l14so3664675ioj.5 for ; Tue, 08 Jan 2019 08:57:13 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=Q1+rzmFAhsJFMNkbvjMPPUCScsMQNRxeiVOaxepJ4Zg=; b=FDGktQLBlh0jObXjQtHOQG7kOJVUuE+Bo4KU91owZt6HiDcOHKgXMPcZ0uFVeO4ZYs V1MAFrngFddIcB6aRxjvVYwpzheWWWKMF9XLBSbWab5UN9XxX1nF6tCgroCXMX002B83 V7bYd5vLXWJIjgqliLcjOZYI35xRYZ4EYEy0bJ2ivVk1pgKuvCb/GTrtplaChXTuUKfx Vs64/JGc81svxSM1nJPjkXefwqIHulL+ntATAE2QaZjedXOaLKmN1DjBbN6QHKy4BTkC AThF8zZr3nTjzemBdxacxWywOQzZ8Td4qQg+ESNNzAkW7BQQlrUXhQjmzC8lZO9qeRlG ATYA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=Q1+rzmFAhsJFMNkbvjMPPUCScsMQNRxeiVOaxepJ4Zg=; b=RD7gCp1Wse0Epay2+VkOJKOHMiYDz4zVkRjWSVaU2y7CZWY7XYNN5WB9ce1/RSb9eu WFfapfgvi0DiSkxIcGfXZw1WsolG4ab8pbN0bl3SHXNWsmm/HlkqpJ65xi6KGmyO2Wex iC8oFf80Fkl84hkX8P2Ce/AUIuB2hZJVoSUZ5GRCRXZuoeT7nxeEeq08POPUGNZa3NPJ BSd4ScptWtY5pGos9UZ4dIqbMJQAL52qp2cfXcWYXTso6fXMr45jyNLsWwafXjG1zqNR sxcrBQ9USgoMuZ5iBRSSG8joQAqr2ElqDD18BmcyQQ0iFCI+Cui+paeeZuhheO4V5IYJ fSQw== X-Gm-Message-State: AJcUukctrPolkvdNhy5d/16bU+FsKpKfDzCdyZlvomMo4bhzE5cegqZ7 pkhJ5i+gpeokFFTN6Kyr0c5F+FYFivLJ/Q== X-Google-Smtp-Source: ALg8bN7ivUBLk9qzoowG1ma2oq1rWMqjlNmy/ynh5BD+ifEWlbP7VIytgwNJ39sBZT/bvaZnUY9k1A== X-Received: by 2002:a6b:1604:: with SMTP id 4mr1483427iow.29.1546966632893; Tue, 08 Jan 2019 08:57:12 -0800 (PST) Received: from localhost.localdomain ([216.160.245.98]) by smtp.gmail.com with ESMTPSA id m10sm17563442ioq.25.2019.01.08.08.57.11 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 08 Jan 2019 08:57:12 -0800 (PST) From: Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe Subject: [PATCH 11/16] io_uring: batch io_kiocb allocation Date: Tue, 8 Jan 2019 09:56:40 -0700 Message-Id: <20190108165645.19311-12-axboe@kernel.dk> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190108165645.19311-1-axboe@kernel.dk> References: <20190108165645.19311-1-axboe@kernel.dk> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Similarly to how we use the state->ios_left to know how many references to get to a file, we can use it to allocate the io_kiocb's we need in bulk. 
Signed-off-by: Jens Axboe --- fs/io_uring.c | 41 +++++++++++++++++++++++++++++++++++------ 1 file changed, 35 insertions(+), 6 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 11d045f0f799..62778d7ffb8d 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -135,6 +135,13 @@ struct io_submit_state { struct list_head req_list; unsigned int req_count; + /* + * io_kiocb alloc cache + */ + void *iocbs[IO_IOPOLL_BATCH]; + unsigned int free_iocbs; + unsigned int cur_iocb; + /* * File reference cache */ @@ -210,15 +217,33 @@ static void io_req_init(struct io_ring_ctx *ctx, struct io_kiocb *req) req->ki_flags = 0; } -static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx) +static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx, + struct io_submit_state *state) { struct io_kiocb *req; - req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL); - if (!req) - return NULL; + if (!state) + req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL); + else if (!state->free_iocbs) { + size_t size; + int ret; + + size = min_t(size_t, state->ios_left, ARRAY_SIZE(state->iocbs)); + ret = kmem_cache_alloc_bulk(kiocb_cachep, GFP_KERNEL, size, + state->iocbs); + if (ret <= 0) + return ERR_PTR(-ENOMEM); + state->free_iocbs = ret - 1; + state->cur_iocb = 1; + req = state->iocbs[0]; + } else { + req = state->iocbs[state->cur_iocb]; + state->free_iocbs--; + state->cur_iocb++; + } - io_req_init(ctx, req); + if (req) + io_req_init(ctx, req); return req; } @@ -773,7 +798,7 @@ static int __io_submit_one(struct io_ring_ctx *ctx, if (unlikely(iocb->flags)) return -EINVAL; - req = io_get_req(ctx); + req = io_get_req(ctx, state); if (unlikely(!req)) return -EAGAIN; @@ -844,6 +869,9 @@ static void io_submit_state_end(struct io_submit_state *state) if (!list_empty(&state->req_list)) io_flush_state_reqs(state->ctx, state); io_file_put(state, NULL); + if (state->free_iocbs) + kmem_cache_free_bulk(kiocb_cachep, state->free_iocbs, + &state->iocbs[state->cur_iocb]); } /* @@ -855,6 +883,7 @@ static void io_submit_state_start(struct io_submit_state *state, state->ctx = ctx; INIT_LIST_HEAD(&state->req_list); state->req_count = 0; + state->free_iocbs = 0; state->file = NULL; state->ios_left = max_ios; #ifdef CONFIG_BLOCK From patchwork Tue Jan 8 16:56:41 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jens Axboe X-Patchwork-Id: 10752539 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 992BC17E1 for ; Tue, 8 Jan 2019 16:57:19 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 836E128F9F for ; Tue, 8 Jan 2019 16:57:19 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 818E728FBA; Tue, 8 Jan 2019 16:57:19 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id EDD5028F9F for ; Tue, 8 Jan 2019 16:57:18 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729235AbfAHQ5S (ORCPT ); Tue, 8 Jan 2019 11:57:18 -0500 Received: from mail-io1-f66.google.com ([209.85.166.66]:35066 "EHLO 
mail-io1-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729219AbfAHQ5Q (ORCPT ); Tue, 8 Jan 2019 11:57:16 -0500 Received: by mail-io1-f66.google.com with SMTP id f4so3688931ion.2 for ; Tue, 08 Jan 2019 08:57:15 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=Lz/3jHz1x36txQZ/D99mDx/1sUz8ZTGETsX8P4ZemoU=; b=GTw9t09Sg1AxtJ5+L7hLtOSEycum9LC9wgqFCHQ9uF75Dqqv442Kk5IWnZrDu00Yi0 7lyXBSzesZXeo5NpHfoRuNr1ndCPc+pedh3RljiZS/2VPEfC7xGYvJJWHKbCRu51NCsZ irRZq/pGVj79I22DkV6AU4wWKr9QgeRWF7cJkDOKUUNhwx5OiqXVNn8TX8pONZkmCF76 ds/P7H+Q6wgWz6bwxN7eSIIWnRvKFbL2vhewVKweNH9wkZxPeumRbsn8KyQxbdKTd7Fa +7iSPYIFhij2fOb8hNj8O8wNSHsRdXw/GtaFQY4UUDK3SFr5ob4IAeC+JLOy6GiNnk/V 7WLw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=Lz/3jHz1x36txQZ/D99mDx/1sUz8ZTGETsX8P4ZemoU=; b=SbQceyNuaqlgcYVXc/Ue5bXQ6fuZxsygKNCxmt3n+atxHSkCqurGGn4/HKCx5VBHYo Wmt5nUvSUktJGay+zgAz1Oe++uqj8PNG/BZ0ismlrqxcQAYRE7GeF8dAc5Yxhlmt+Fl1 XYTxeGG9hxt29cJ8laCTPTEfgcZuSi8iWl+ewZidRMx/HNkGj5EqHGxkV55gxYoyz2ZE vdNm9ZCCfYMiiFVSOnjLGIlVtZCYUA6WFfL+63MF2griQg4Vd9tajw+h+A9P4sl5Wa2/ lwtD+0Tfpy0QT9muhTA2bR0toXnKgUS3fC71xDUO/vr/7tgz5kG6Zm1LlDNsXth0Ghmy OyTw== X-Gm-Message-State: AJcUukeY98wKySjHnxjY8WeAEJT0jOJIHc228ut9+DBN5d9DxWfjBq4m Ri5J3H3gQzXANpyFN5cALlahuGu70Q9zog== X-Google-Smtp-Source: ALg8bN4GaweI0iaMptEgJJ71xKVQKMRUtVLv45mHwX1U8C/cBBILXlgWDeHxCpbxHZlUhxCxIg00pA== X-Received: by 2002:a6b:dd0b:: with SMTP id f11mr1588813ioc.45.1546966634597; Tue, 08 Jan 2019 08:57:14 -0800 (PST) Received: from localhost.localdomain ([216.160.245.98]) by smtp.gmail.com with ESMTPSA id m10sm17563442ioq.25.2019.01.08.08.57.12 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 08 Jan 2019 08:57:13 -0800 (PST) From: Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe Subject: [PATCH 12/16] block: implement bio helper to add iter bvec pages to bio Date: Tue, 8 Jan 2019 09:56:41 -0700 Message-Id: <20190108165645.19311-13-axboe@kernel.dk> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190108165645.19311-1-axboe@kernel.dk> References: <20190108165645.19311-1-axboe@kernel.dk> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP For an ITER_BVEC, we can just iterate the iov and add the pages to the bio directly. This requires that the caller doesn't releases the pages on IO completion, we add a BIO_HOLD_PAGES flag for that. The current two callers of bio_iov_iter_get_pages() are updated to check if they need to release pages on completion. This makes them work with bvecs that contain kernel mapped pages already. 
Signed-off-by: Jens Axboe --- block/bio.c | 59 ++++++++++++++++++++++++++++++++------- fs/block_dev.c | 5 ++-- fs/iomap.c | 5 ++-- include/linux/blk_types.h | 1 + 4 files changed, 56 insertions(+), 14 deletions(-) diff --git a/block/bio.c b/block/bio.c index 4db1008309ed..7af4f45d2ed6 100644 --- a/block/bio.c +++ b/block/bio.c @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page, } EXPORT_SYMBOL(bio_add_page); +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter) +{ + const struct bio_vec *bv = iter->bvec; + unsigned int len; + size_t size; + + len = min_t(size_t, bv->bv_len, iter->count); + size = bio_add_page(bio, bv->bv_page, len, + bv->bv_offset + iter->iov_offset); + if (size == len) { + iov_iter_advance(iter, size); + return 0; + } + + return -EINVAL; +} + #define PAGE_PTRS_PER_BVEC (sizeof(struct bio_vec) / sizeof(struct page *)) /** @@ -876,23 +893,43 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter) } /** - * bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio + * bio_iov_iter_get_pages - add user or kernel pages to a bio * @bio: bio to add pages to - * @iter: iov iterator describing the region to be mapped + * @iter: iov iterator describing the region to be added + * + * This takes either an iterator pointing to user memory, or one pointing to + * kernel pages (BVEC iterator). If we're adding user pages, we pin them and + * map them into the kernel. On IO completion, the caller should put those + * pages. If we're adding kernel pages, we just have to add the pages to the + * bio directly. We don't grab an extra reference to those pages (the user + * should already have that), and we don't put the page on IO completion. + * The caller needs to check if the bio is flagged BIO_HOLD_PAGES on IO + * completion. If it isn't, then pages should be released. * - * Pins pages from *iter and appends them to @bio's bvec array. The - * pages will have to be released using put_page() when done. * The function tries, but does not guarantee, to pin as many pages as - * fit into the bio, or are requested in *iter, whatever is smaller. - * If MM encounters an error pinning the requested pages, it stops. - * Error is returned only if 0 pages could be pinned. + * fit into the bio, or are requested in *iter, whatever is smaller. If + * MM encounters an error pinning the requested pages, it stops. Error + * is returned only if 0 pages could be pinned. */ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter) { + const bool is_bvec = iov_iter_is_bvec(iter); unsigned short orig_vcnt = bio->bi_vcnt; + /* + * If this is a BVEC iter, then the pages are kernel pages. Don't + * release them on IO completion. + */ + if (is_bvec) + bio_set_flag(bio, BIO_HOLD_PAGES); + do { - int ret = __bio_iov_iter_get_pages(bio, iter); + int ret; + + if (is_bvec) + ret = __bio_iov_bvec_add_pages(bio, iter); + else + ret = __bio_iov_iter_get_pages(bio, iter); if (unlikely(ret)) return bio->bi_vcnt > orig_vcnt ? 
0 : ret; @@ -1634,7 +1671,8 @@ static void bio_dirty_fn(struct work_struct *work) next = bio->bi_private; bio_set_pages_dirty(bio); - bio_release_pages(bio); + if (!bio_flagged(bio, BIO_HOLD_PAGES)) + bio_release_pages(bio); bio_put(bio); } } @@ -1650,7 +1688,8 @@ void bio_check_pages_dirty(struct bio *bio) goto defer; } - bio_release_pages(bio); + if (!bio_flagged(bio, BIO_HOLD_PAGES)) + bio_release_pages(bio); bio_put(bio); return; defer: diff --git a/fs/block_dev.c b/fs/block_dev.c index 2ebd2a0d7789..b7742014c9de 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -324,8 +324,9 @@ static void blkdev_bio_end_io(struct bio *bio) struct bio_vec *bvec; int i; - bio_for_each_segment_all(bvec, bio, i) - put_page(bvec->bv_page); + if (!bio_flagged(bio, BIO_HOLD_PAGES)) + bio_for_each_segment_all(bvec, bio, i) + put_page(bvec->bv_page); bio_put(bio); } } diff --git a/fs/iomap.c b/fs/iomap.c index 4ee50b76b4a1..0a64c9c51203 100644 --- a/fs/iomap.c +++ b/fs/iomap.c @@ -1582,8 +1582,9 @@ static void iomap_dio_bio_end_io(struct bio *bio) struct bio_vec *bvec; int i; - bio_for_each_segment_all(bvec, bio, i) - put_page(bvec->bv_page); + if (!bio_flagged(bio, BIO_HOLD_PAGES)) + bio_for_each_segment_all(bvec, bio, i) + put_page(bvec->bv_page); bio_put(bio); } } diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index 5c7e7f859a24..97e206855cd3 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -215,6 +215,7 @@ struct bio { /* * bio flags */ +#define BIO_HOLD_PAGES 0 /* don't put O_DIRECT pages */ #define BIO_SEG_VALID 1 /* bi_phys_segments valid */ #define BIO_CLONED 2 /* doesn't own data */ #define BIO_BOUNCED 3 /* bio is a bounce bio */ From patchwork Tue Jan 8 16:56:42 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jens Axboe X-Patchwork-Id: 10752543 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 7CE2F13B4 for ; Tue, 8 Jan 2019 16:57:21 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 6E11F28FA5 for ; Tue, 8 Jan 2019 16:57:21 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 624E028F85; Tue, 8 Jan 2019 16:57:21 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 8730E28FA5 for ; Tue, 8 Jan 2019 16:57:20 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729253AbfAHQ5T (ORCPT ); Tue, 8 Jan 2019 11:57:19 -0500 Received: from mail-io1-f65.google.com ([209.85.166.65]:36897 "EHLO mail-io1-f65.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729224AbfAHQ5S (ORCPT ); Tue, 8 Jan 2019 11:57:18 -0500 Received: by mail-io1-f65.google.com with SMTP id g8so3680954iok.4 for ; Tue, 08 Jan 2019 08:57:17 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=FT9RER6+VMvxBXmLoLoheD2emU6/BVeQLCNJ27f3fko=; 
From: Jens Axboe
To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe
Subject: [PATCH 13/16] io_uring: add support for pre-mapped user IO buffers
Date: Tue, 8 Jan 2019 09:56:42 -0700
Message-Id: <20190108165645.19311-14-axboe@kernel.dk>
In-Reply-To: <20190108165645.19311-1-axboe@kernel.dk>
References: <20190108165645.19311-1-axboe@kernel.dk>

If we have fixed user buffers, we can map them into the kernel when we set up the io_context. That avoids the need to do get_user_pages() for each and every IO.

To utilize this feature, the application must set IORING_SETUP_FIXEDBUFS and pass in an array of iovecs that contain the desired buffer addresses and lengths. These buffers can then be mapped into the kernel for the lifetime of the io_uring, as opposed to just the duration of each single IO. The application can then use IORING_OP_{READ,WRITE}_FIXED to perform IO to these fixed locations.

It's perfectly valid to set up a larger buffer and then sometimes only use parts of it for an IO. As long as the range is within the originally mapped region, it will work just fine.

A limit of 4M is imposed as the largest buffer we currently support. There's nothing preventing us from going larger, but we need some cap, and 4M seemed like it would definitely be big enough. RLIMIT_MEMLOCK is used to cap the total amount of memory pinned.
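As a rough usage sketch only: this RFC's io_uring_setup() takes the iovec array directly, so registering fixed buffers could look roughly like the snippet below. The syscall number macro, the 4k alignment choice and the helper name are assumptions for illustration, not something this series defines; only IORING_SETUP_FIXEDBUFS, struct io_uring_params and the one-buffer-per-SQ-entry layout come from the patch.

#include <sys/syscall.h>
#include <sys/uio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include "io_uring.h"	/* the uapi header added by this series */

#define FIXED_BUF_SIZE	(1U << 20)	/* comfortably under the 4M per-buffer cap */

/* one registered buffer per SQ entry, pinned for the lifetime of the ring */
static int setup_ring_with_fixed_bufs(unsigned entries, struct iovec *iovs,
				      struct io_uring_params *p)
{
	unsigned i;

	for (i = 0; i < entries; i++) {
		if (posix_memalign(&iovs[i].iov_base, 4096, FIXED_BUF_SIZE))
			return -1;
		iovs[i].iov_len = FIXED_BUF_SIZE;	/* counts against RLIMIT_MEMLOCK */
	}

	memset(p, 0, sizeof(*p));
	p->flags = IORING_SETUP_FIXEDBUFS;
	return syscall(__NR_io_uring_setup, entries, iovs, p);
}

A later IORING_OP_READ_FIXED or IORING_OP_WRITE_FIXED iocb can then point anywhere inside the corresponding registered iovec; only ranges falling outside the originally mapped region are rejected with -EFAULT.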
Signed-off-by: Jens Axboe --- fs/io_uring.c | 212 +++++++++++++++++++++++++++++++--- include/uapi/linux/io_uring.h | 3 + 2 files changed, 201 insertions(+), 14 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 62778d7ffb8d..92129f867824 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -22,6 +22,8 @@ #include #include #include +#include +#include #include #include @@ -65,6 +67,13 @@ struct io_event_ring { unsigned ring_mask; }; +struct io_mapped_ubuf { + u64 ubuf; + size_t len; + struct bio_vec *bvec; + unsigned int nr_bvecs; +}; + struct io_ring_ctx { struct percpu_ref refs; @@ -74,6 +83,9 @@ struct io_ring_ctx { struct io_iocb_ring sq_ring; struct io_event_ring cq_ring; + /* if used, fixed mapped user buffers */ + struct io_mapped_ubuf *user_bufs; + struct work_struct work; /* iopoll submission state */ @@ -581,13 +593,45 @@ static int io_prep_rw(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb, return ret; } -static int io_setup_rw(int rw, const struct io_uring_iocb *iocb, - struct iovec **iovec, struct iov_iter *iter) +static int io_setup_rw(int rw, struct io_kiocb *kiocb, + const struct io_uring_iocb *iocb, struct iovec **iovec, + struct iov_iter *iter, bool kaddr) { void __user *buf = (void __user *)(uintptr_t)iocb->addr; size_t ret; - ret = import_single_range(rw, buf, iocb->len, *iovec, iter); + if (!kaddr) { + ret = import_single_range(rw, buf, iocb->len, *iovec, iter); + } else { + struct io_ring_ctx *ctx = kiocb->ki_ctx; + struct io_mapped_ubuf *imu; + size_t len = iocb->len; + size_t offset; + int index; + + /* __io_submit_one() already validated the index */ + index = array_index_nospec(kiocb->ki_index, + ctx->max_reqs); + imu = &ctx->user_bufs[index]; + if ((unsigned long) iocb->addr < imu->ubuf || + (unsigned long) iocb->addr + len > imu->ubuf + imu->len) { + ret = -EFAULT; + goto err; + } + + /* + * May not be a start of buffer, set size appropriately + * and advance us to the beginning. 
+ */ + offset = (unsigned long) iocb->addr - imu->ubuf; + iov_iter_bvec(iter, rw, imu->bvec, imu->nr_bvecs, + offset + len); + if (offset) + iov_iter_advance(iter, offset); + ret = 0; + + } +err: *iovec = NULL; return ret; } @@ -672,7 +716,7 @@ static void io_iopoll_iocb_issued(struct io_submit_state *state, } static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb, - struct io_submit_state *state) + struct io_submit_state *state, bool kaddr) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct kiocb *req = &kiocb->rw; @@ -692,7 +736,7 @@ static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb, if (unlikely(!file->f_op->read_iter)) goto out_fput; - ret = io_setup_rw(READ, iocb, &iovec, &iter); + ret = io_setup_rw(READ, kiocb, iocb, &iovec, &iter, kaddr); if (ret) goto out_fput; @@ -708,7 +752,7 @@ static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb, static ssize_t io_write(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb, - struct io_submit_state *state) + struct io_submit_state *state, bool kaddr) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct kiocb *req = &kiocb->rw; @@ -728,7 +772,7 @@ static ssize_t io_write(struct io_kiocb *kiocb, if (unlikely(!file->f_op->write_iter)) goto out_fput; - ret = io_setup_rw(WRITE, iocb, &iovec, &iter); + ret = io_setup_rw(WRITE, kiocb, iocb, &iovec, &iter, kaddr); if (ret) goto out_fput; ret = rw_verify_area(WRITE, file, &req->ki_pos, iov_iter_count(&iter)); @@ -810,10 +854,16 @@ static int __io_submit_one(struct io_ring_ctx *ctx, ret = -EINVAL; switch (iocb->opcode) { case IORING_OP_READ: - ret = io_read(req, iocb, state); + ret = io_read(req, iocb, state, false); + break; + case IORING_OP_READ_FIXED: + ret = io_read(req, iocb, state, true); break; case IORING_OP_WRITE: - ret = io_write(req, iocb, state); + ret = io_write(req, iocb, state, false); + break; + case IORING_OP_WRITE_FIXED: + ret = io_write(req, iocb, state, true); break; case IORING_OP_FSYNC: if (ctx->flags & IORING_SETUP_IOPOLL) @@ -1021,6 +1071,127 @@ static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit, return ret; } +static void io_iocb_buffer_unmap(struct io_ring_ctx *ctx) +{ + int i, j; + + if (!ctx->user_bufs) + return; + + for (i = 0; i < ctx->max_reqs; i++) { + struct io_mapped_ubuf *imu = &ctx->user_bufs[i]; + + for (j = 0; j < imu->nr_bvecs; j++) + put_page(imu->bvec[j].bv_page); + + kfree(imu->bvec); + imu->nr_bvecs = 0; + } + + kfree(ctx->user_bufs); + ctx->user_bufs = NULL; +} + +static int io_iocb_buffer_map(struct io_ring_ctx *ctx, + struct iovec __user *iovecs) +{ + unsigned long total_pages, page_limit; + struct page **pages = NULL; + int i, j, got_pages = 0; + int ret = -EINVAL; + + ctx->user_bufs = kcalloc(ctx->max_reqs, sizeof(struct io_mapped_ubuf), + GFP_KERNEL); + if (!ctx->user_bufs) + return -ENOMEM; + + /* Don't allow more pages than we can safely lock */ + total_pages = 0; + page_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; + + for (i = 0; i < ctx->max_reqs; i++) { + struct io_mapped_ubuf *imu = &ctx->user_bufs[i]; + unsigned long off, start, end, ubuf; + int pret, nr_pages; + struct iovec iov; + size_t size; + + ret = -EFAULT; + if (copy_from_user(&iov, &iovecs[i], sizeof(iov))) + goto err; + + /* + * Don't impose further limits on the size and buffer + * constraints here, we'll -EINVAL later when IO is + * submitted if they are wrong. 
+ */ + ret = -EFAULT; + if (!iov.iov_base) + goto err; + + /* arbitrary limit, but we need something */ + if (iov.iov_len > SZ_4M) + goto err; + + ubuf = (unsigned long) iov.iov_base; + end = (ubuf + iov.iov_len + PAGE_SIZE - 1) >> PAGE_SHIFT; + start = ubuf >> PAGE_SHIFT; + nr_pages = end - start; + + ret = -ENOMEM; + if (total_pages + nr_pages > page_limit) + goto err; + + if (!pages || nr_pages > got_pages) { + kfree(pages); + pages = kmalloc(nr_pages * sizeof(struct page *), + GFP_KERNEL); + if (!pages) + goto err; + got_pages = nr_pages; + } + + imu->bvec = kmalloc(nr_pages * sizeof(struct bio_vec), + GFP_KERNEL); + if (!imu->bvec) + goto err; + + down_write(¤t->mm->mmap_sem); + pret = get_user_pages(ubuf, nr_pages, 1, pages, NULL); + up_write(¤t->mm->mmap_sem); + + if (pret < nr_pages) { + if (pret < 0) + ret = pret; + goto err; + } + + off = ubuf & ~PAGE_MASK; + size = iov.iov_len; + for (j = 0; j < nr_pages; j++) { + size_t vec_len; + + vec_len = min_t(size_t, size, PAGE_SIZE - off); + imu->bvec[j].bv_page = pages[j]; + imu->bvec[j].bv_len = vec_len; + imu->bvec[j].bv_offset = off; + off = 0; + size -= vec_len; + } + /* store original address for later verification */ + imu->ubuf = ubuf; + imu->len = iov.iov_len; + imu->nr_bvecs = nr_pages; + total_pages += nr_pages; + } + kfree(pages); + return 0; +err: + kfree(pages); + io_iocb_buffer_unmap(ctx); + return ret; +} + static void io_free_scq_urings(struct io_ring_ctx *ctx) { if (ctx->sq_ring.ring) { @@ -1043,6 +1214,7 @@ static void io_ring_ctx_free(struct work_struct *work) io_iopoll_reap_events(ctx); io_free_scq_urings(ctx); + io_iocb_buffer_unmap(ctx); percpu_ref_exit(&ctx->refs); kmem_cache_free(ioctx_cachep, ctx); } @@ -1191,11 +1363,19 @@ static void io_fill_offsets(struct io_uring_params *p) p->cq_off.events = offsetof(struct io_cq_ring, events); } -static int io_uring_create(unsigned entries, struct io_uring_params *p) +static int io_uring_create(unsigned entries, struct io_uring_params *p, + struct iovec __user *iovecs) { struct io_ring_ctx *ctx; int ret; + /* + * We don't use the iovecs without fixed buffers being asked for. + * Error out if they don't match. + */ + if (!(p->flags & IORING_SETUP_FIXEDBUFS) && iovecs) + return -EINVAL; + /* * Use twice as many entries for the CQ ring. 
It's possible for the * application to drive a higher depth than the size of the SQ ring, @@ -1213,6 +1393,12 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p) if (ret) goto err; + if (p->flags & IORING_SETUP_FIXEDBUFS) { + ret = io_iocb_buffer_map(ctx, iovecs); + if (ret) + goto err; + } + ret = anon_inode_getfd("[io_uring]", &io_scqring_fops, ctx, O_RDWR | O_CLOEXEC); if (ret < 0) @@ -1245,12 +1431,10 @@ SYSCALL_DEFINE3(io_uring_setup, u32, entries, struct iovec __user *, iovecs, return -EINVAL; } - if (p.flags & ~IORING_SETUP_IOPOLL) - return -EINVAL; - if (iovecs) + if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_FIXEDBUFS)) return -EINVAL; - ret = io_uring_create(entries, &p); + ret = io_uring_create(entries, &p, iovecs); if (ret < 0) return ret; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index f7ba30747816..925fd6ca3f38 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -35,11 +35,14 @@ struct io_uring_iocb { * io_uring_setup() flags */ #define IORING_SETUP_IOPOLL (1 << 0) /* io_context is polled */ +#define IORING_SETUP_FIXEDBUFS (1 << 1) /* IO buffers are fixed */ #define IORING_OP_READ 1 #define IORING_OP_WRITE 2 #define IORING_OP_FSYNC 3 #define IORING_OP_FDSYNC 4 +#define IORING_OP_READ_FIXED 5 +#define IORING_OP_WRITE_FIXED 6 /* * IO completion data structure From patchwork Tue Jan 8 16:56:43 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jens Axboe X-Patchwork-Id: 10752555 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id C80FF746 for ; Tue, 8 Jan 2019 16:57:29 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id B663A28FB6 for ; Tue, 8 Jan 2019 16:57:29 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id AA3D628FA2; Tue, 8 Jan 2019 16:57:29 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 5E05228F9D for ; Tue, 8 Jan 2019 16:57:28 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729263AbfAHQ5U (ORCPT ); Tue, 8 Jan 2019 11:57:20 -0500 Received: from mail-it1-f194.google.com ([209.85.166.194]:36610 "EHLO mail-it1-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729237AbfAHQ5T (ORCPT ); Tue, 8 Jan 2019 11:57:19 -0500 Received: by mail-it1-f194.google.com with SMTP id c9so6949722itj.1 for ; Tue, 08 Jan 2019 08:57:19 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=E2mGG9suMfo/Hq4h2cKt3Xi1WwCcA6HNLu1UqmDgo9E=; b=0NBi+zSZrr4AtC02HIMvETmd3B9eT50AkMf2/tgfrBY2d3Osm1Wxh/ovFbCW6ps1K/ fQSOP8HxmXiA/NXQuHEOJbUiqtsS5pWxyFc2U79TPZcLyDjQvw1IvMZp/4EosY5iJxVc rDI8oUzQFtjFz3EhlgCtw2hF7oXqY9M+84gLof3Kk6FXjTW0j2KI3w6bHWjnVFlUx2mj lmSFO3ZPdGjsQ6XUiKtVFOvcCmEyPk+5oheSQBMWS2/p/osGXRFbgEYeTIvGhG+0Z7Wx zR2+G2RmePOXDcd23NAUHY4VTxDtm5Hz4JJw4Q0X0acPHMMS9OPvYQVw++t4g64gdh+h H0mQ== 
From: Jens Axboe
To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe
Subject: [PATCH 14/16] io_uring: support kernel side submission
Date: Tue, 8 Jan 2019 09:56:43 -0700
Message-Id: <20190108165645.19311-15-axboe@kernel.dk>
In-Reply-To: <20190108165645.19311-1-axboe@kernel.dk>
References: <20190108165645.19311-1-axboe@kernel.dk>

Add support for backing the io_uring fd with either a thread or a workqueue, and letting those handle the submission for us. This can be used to reduce overhead for submission, or to always make submission async. The latter is particularly useful for buffered aio, which is now fully async with this feature.

For polled IO, we could have the kernel side thread hammer on the SQ ring and submit when it finds IO. This would mean that an application would NEVER have to enter the kernel to do IO! Didn't add this yet, but it would be trivial to add.

If an application sets IORING_SETUP_SQTHREAD, the io_uring gets a single thread backing. If used with buffered IO, this will limit the device queue depth to 1, but it will be async; IOs will simply be serialized.

Alternatively, an application can set IORING_SETUP_SQWQ, in which case the urings get a workqueue backing. The concurrency level is the minimum of twice the available CPUs, or the queue depth specified for the context. For this mode, we attempt to do buffered reads inline, in case they are cached; we only punt to a workqueue if we would have to block to get our data.

Tested with polling, no polling, fixedbufs, no fixedbufs, buffered, O_DIRECT.
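A hedged sketch of what asking for offloaded submission might look like from userspace (the helper name and the raw syscall form are illustrative assumptions; only the two flags and the new sq_thread_cpu field come from this patch, and the fio sample linked below is the authoritative example):

#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include "io_uring.h"	/* the uapi header from this series */

static int setup_offloaded_ring(unsigned entries, int use_wq, int cpu)
{
	struct io_uring_params p;

	memset(&p, 0, sizeof(p));
	if (use_wq)
		p.flags = IORING_SETUP_SQWQ;		/* workqueue backing, usable for buffered IO */
	else
		p.flags = IORING_SETUP_SQTHREAD;	/* single kernel submission thread */
	p.sq_thread_cpu = cpu;				/* where to run the SQ thread, if one is used */

	/* no fixed buffers in this example, so no iovec array is passed */
	return syscall(__NR_io_uring_setup, entries, NULL, &p);
}

With either flag set, io_uring_enter(2) only needs to_submit to kick the offload: the SQ thread path is simply woken up, while the workqueue path tries cached buffered reads inline and queues one work item per remaining iocb, matching __io_uring_enter() and io_sq_wq_submit() below.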
See this sample application for how to use it: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c Signed-off-by: Jens Axboe --- fs/io_uring.c | 405 ++++++++++++++++++++++++++++++++-- include/uapi/linux/io_uring.h | 5 +- 2 files changed, 387 insertions(+), 23 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 92129f867824..e6a808a89b78 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -14,6 +14,7 @@ #include #include #include +#include #include #include #include @@ -24,6 +25,8 @@ #include #include #include +#include +#include #include #include @@ -58,6 +61,7 @@ struct io_iocb_ring { struct io_sq_ring *ring; unsigned entries; unsigned ring_mask; + unsigned sq_thread_cpu; struct io_uring_iocb *iocbs; }; @@ -74,6 +78,14 @@ struct io_mapped_ubuf { unsigned int nr_bvecs; }; +struct io_sq_offload { + struct task_struct *thread; /* if using a thread */ + struct workqueue_struct *wq; /* wq offload */ + struct mm_struct *mm; + struct files_struct *files; + wait_queue_head_t wait; +}; + struct io_ring_ctx { struct percpu_ref refs; @@ -88,6 +100,9 @@ struct io_ring_ctx { struct work_struct work; + /* sq ring submitter thread, if used */ + struct io_sq_offload sq_offload; + /* iopoll submission state */ struct { spinlock_t poll_lock; @@ -127,6 +142,7 @@ struct io_kiocb { unsigned long ki_flags; #define KIOCB_F_IOPOLL_COMPLETED 0 /* polled IO has completed */ #define KIOCB_F_IOPOLL_EAGAIN 1 /* submission got EAGAIN */ +#define KIOCB_F_FORCE_NONBLOCK 2 /* inline submission attempt */ }; #define IO_PLUG_THRESHOLD 2 @@ -164,6 +180,18 @@ struct io_submit_state { unsigned int ios_left; }; +struct iocb_submit { + const struct io_uring_iocb *iocb; + unsigned int index; +}; + +struct io_work { + struct work_struct work; + struct io_ring_ctx *ctx; + struct io_uring_iocb iocb; + unsigned iocb_index; +}; + static struct kmem_cache *kiocb_cachep, *ioctx_cachep; static const struct file_operations io_scqring_fops; @@ -442,18 +470,17 @@ static void kiocb_end_write(struct kiocb *kiocb) } } -static void io_fill_event(struct io_uring_event *ev, struct io_kiocb *kiocb, +static void io_fill_event(struct io_uring_event *ev, unsigned ki_index, long res, unsigned flags) { - ev->index = kiocb->ki_index; + ev->index = ki_index; ev->res = res; ev->flags = flags; } -static void io_cqring_fill_event(struct io_kiocb *iocb, long res, - unsigned ev_flags) +static void io_cqring_fill_event(struct io_ring_ctx *ctx, unsigned ki_index, + long res, unsigned ev_flags) { - struct io_ring_ctx *ctx = iocb->ki_ctx; struct io_uring_event *ev; unsigned long flags; @@ -465,7 +492,7 @@ static void io_cqring_fill_event(struct io_kiocb *iocb, long res, spin_lock_irqsave(&ctx->completion_lock, flags); ev = io_peek_cqring(ctx); if (ev) { - io_fill_event(ev, iocb, res, ev_flags); + io_fill_event(ev, ki_index, res, ev_flags); io_inc_cqring(ctx); } else ctx->cq_ring.ring->overflow++; @@ -474,10 +501,24 @@ static void io_cqring_fill_event(struct io_kiocb *iocb, long res, static void io_complete_scqring(struct io_kiocb *iocb, long res, unsigned flags) { - io_cqring_fill_event(iocb, res, flags); + io_cqring_fill_event(iocb->ki_ctx, iocb->ki_index, res, flags); io_complete_iocb(iocb->ki_ctx, iocb); } +static void io_fill_cq_error(struct io_ring_ctx *ctx, unsigned ki_index, + long error) +{ + io_cqring_fill_event(ctx, ki_index, error, 0); + + /* + * for thread offload, app could already be sleeping in io_ring_enter() + * before we get to flag the error. wake them up, if needed. 
+ */ + if (ctx->flags & (IORING_SETUP_SQTHREAD | IORING_SETUP_SQWQ)) + if (waitqueue_active(&ctx->wait)) + wake_up(&ctx->wait); +} + static void io_complete_scqring_rw(struct kiocb *kiocb, long res, long res2) { struct io_kiocb *iocb = container_of(kiocb, struct io_kiocb, rw); @@ -485,6 +526,7 @@ static void io_complete_scqring_rw(struct kiocb *kiocb, long res, long res2) kiocb_end_write(kiocb); fput(kiocb->ki_filp); + io_complete_scqring(iocb, res, 0); } @@ -497,7 +539,7 @@ static void io_complete_scqring_iopoll(struct kiocb *kiocb, long res, long res2) if (unlikely(res == -EAGAIN)) { set_bit(KIOCB_F_IOPOLL_EAGAIN, &iocb->ki_flags); } else { - io_cqring_fill_event(iocb, res, 0); + io_cqring_fill_event(iocb->ki_ctx, iocb->ki_index, res, 0); set_bit(KIOCB_F_IOPOLL_COMPLETED, &iocb->ki_flags); } } @@ -549,7 +591,7 @@ static struct file *io_file_get(struct io_submit_state *state, int fd) } static int io_prep_rw(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb, - struct io_submit_state *state) + struct io_submit_state *state, bool force_nonblock) { struct io_ring_ctx *ctx = kiocb->ki_ctx; struct kiocb *req = &kiocb->rw; @@ -573,6 +615,10 @@ static int io_prep_rw(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb, ret = kiocb_set_rw_flags(req, iocb->rw_flags); if (unlikely(ret)) goto out_fput; + if (force_nonblock) { + req->ki_flags |= IOCB_NOWAIT; + set_bit(KIOCB_F_FORCE_NONBLOCK, &kiocb->ki_flags); + } if (ctx->flags & IORING_SETUP_IOPOLL) { ret = -EOPNOTSUPP; @@ -716,7 +762,7 @@ static void io_iopoll_iocb_issued(struct io_submit_state *state, } static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb, - struct io_submit_state *state, bool kaddr) + struct io_submit_state *state, bool kaddr, bool nonblock) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct kiocb *req = &kiocb->rw; @@ -724,7 +770,7 @@ static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb, struct file *file; ssize_t ret; - ret = io_prep_rw(kiocb, iocb, state); + ret = io_prep_rw(kiocb, iocb, state, nonblock); if (ret) return ret; file = req->ki_filp; @@ -741,8 +787,18 @@ static ssize_t io_read(struct io_kiocb *kiocb, const struct io_uring_iocb *iocb, goto out_fput; ret = rw_verify_area(READ, file, &req->ki_pos, iov_iter_count(&iter)); - if (!ret) - io_rw_done(req, call_read_iter(file, req, &iter)); + if (!ret) { + ssize_t ret2; + + /* + * Catch -EAGAIN return for forced non-blocking submission + */ + ret2 = call_read_iter(file, req, &iter); + if (!nonblock || ret2 != -EAGAIN) + io_rw_done(req, ret2); + else + ret = -EAGAIN; + } kfree(iovec); out_fput: if (unlikely(ret)) @@ -760,7 +816,7 @@ static ssize_t io_write(struct io_kiocb *kiocb, struct file *file; ssize_t ret; - ret = io_prep_rw(kiocb, iocb, state); + ret = io_prep_rw(kiocb, iocb, state, false); if (ret) return ret; file = req->ki_filp; @@ -833,7 +889,7 @@ static int io_fsync(struct fsync_iocb *req, const struct io_uring_iocb *iocb, static int __io_submit_one(struct io_ring_ctx *ctx, const struct io_uring_iocb *iocb, unsigned long ki_index, - struct io_submit_state *state) + struct io_submit_state *state, bool force_nonblock) { struct io_kiocb *req; ssize_t ret; @@ -854,10 +910,10 @@ static int __io_submit_one(struct io_ring_ctx *ctx, ret = -EINVAL; switch (iocb->opcode) { case IORING_OP_READ: - ret = io_read(req, iocb, state, false); + ret = io_read(req, iocb, state, false, force_nonblock); break; case IORING_OP_READ_FIXED: - ret = io_read(req, iocb, state, true); + ret = io_read(req, 
iocb, state, true, force_nonblock); break; case IORING_OP_WRITE: ret = io_write(req, iocb, state, false); @@ -993,7 +1049,7 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) if (!iocb) break; - ret = __io_submit_one(ctx, iocb, iocb_index, statep); + ret = __io_submit_one(ctx, iocb, iocb_index, statep, false); if (ret) break; @@ -1042,15 +1098,239 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events) return ring->r.head == ring->r.tail ? ret : 0; } +static int io_submit_iocbs(struct io_ring_ctx *ctx, struct iocb_submit *iocbs, + unsigned int nr, struct mm_struct *cur_mm, + bool mm_fault) +{ + struct io_submit_state state, *statep = NULL; + int ret, i, submitted = 0; + + if (nr > IO_PLUG_THRESHOLD) { + io_submit_state_start(&state, ctx, nr); + statep = &state; + } + + for (i = 0; i < nr; i++) { + if (unlikely(mm_fault)) + ret = -EFAULT; + else + ret = __io_submit_one(ctx, iocbs[i].iocb, + iocbs[i].index, statep, false); + if (!ret) { + submitted++; + continue; + } + + io_fill_cq_error(ctx, iocbs[i].index, ret); + } + + if (statep) + io_submit_state_end(&state); + + return submitted; +} + +/* + * sq thread only supports O_DIRECT or FIXEDBUFS IO + */ +static int io_sq_thread(void *data) +{ + struct iocb_submit iocbs[IO_IOPOLL_BATCH]; + struct io_ring_ctx *ctx = data; + struct io_sq_offload *sqo = &ctx->sq_offload; + struct mm_struct *cur_mm = NULL; + struct files_struct *old_files; + mm_segment_t old_fs; + DEFINE_WAIT(wait); + + old_files = current->files; + current->files = sqo->files; + + old_fs = get_fs(); + set_fs(USER_DS); + + while (!kthread_should_stop()) { + const struct io_uring_iocb *iocb; + bool mm_fault = false; + unsigned iocb_index; + int i; + + iocb = io_peek_sqring(ctx, &iocb_index); + if (!iocb) { + /* + * Drop cur_mm before scheduling, we can't hold it for + * long periods (or over schedule()). Do this before + * adding ourselves to the waitqueue, as the unuse/drop + * may sleep. 
+ */ + if (cur_mm) { + unuse_mm(cur_mm); + mmput(cur_mm); + cur_mm = NULL; + } + + prepare_to_wait(&sqo->wait, &wait, TASK_INTERRUPTIBLE); + iocb = io_peek_sqring(ctx, &iocb_index); + if (!iocb) { + if (kthread_should_park()) + kthread_parkme(); + if (kthread_should_stop()) { + finish_wait(&sqo->wait, &wait); + break; + } + if (signal_pending(current)) + flush_signals(current); + schedule(); + } + finish_wait(&sqo->wait, &wait); + if (!iocb) + continue; + } + + /* If ->mm is set, we're not doing FIXEDBUFS */ + if (sqo->mm && !cur_mm) { + mm_fault = !mmget_not_zero(sqo->mm); + if (!mm_fault) { + use_mm(sqo->mm); + cur_mm = sqo->mm; + } + } + + i = 0; + do { + if (i == ARRAY_SIZE(iocbs)) + break; + iocbs[i].iocb = iocb; + iocbs[i].index = iocb_index; + ++i; + io_inc_sqring(ctx); + } while ((iocb = io_peek_sqring(ctx, &iocb_index)) != NULL); + + io_submit_iocbs(ctx, iocbs, i, cur_mm, mm_fault); + } + current->files = old_files; + set_fs(old_fs); + if (cur_mm) { + unuse_mm(cur_mm); + mmput(cur_mm); + } + return 0; +} + +static void io_sq_wq_submit_work(struct work_struct *work) +{ + struct io_work *iw = container_of(work, struct io_work, work); + struct io_ring_ctx *ctx = iw->ctx; + struct io_sq_offload *sqo = &ctx->sq_offload; + mm_segment_t old_fs = get_fs(); + struct files_struct *old_files; + int ret; + + old_files = current->files; + current->files = sqo->files; + + if (sqo->mm) { + if (!mmget_not_zero(sqo->mm)) { + ret = -EFAULT; + goto err; + } + use_mm(sqo->mm); + } + + set_fs(USER_DS); + + ret = __io_submit_one(ctx, &iw->iocb, iw->iocb_index, NULL, false); + + set_fs(old_fs); + if (sqo->mm) { + unuse_mm(sqo->mm); + mmput(sqo->mm); + } + +err: + if (ret) + io_fill_cq_error(ctx, iw->iocb_index, ret); + current->files = old_files; + kfree(iw); +} + +/* + * If this is a read, try a cached inline read first. If the IO is in the + * page cache, we can satisfy it without blocking and without having to + * punt to a threaded execution. This is much faster, particularly for + * lower queue depth IO, and it's always a lot more efficient. + */ +static bool io_sq_try_inline(struct io_ring_ctx *ctx, + const struct io_uring_iocb *iocb, unsigned index) +{ + int ret; + + if (iocb->opcode != IORING_OP_READ && + iocb->opcode != IORING_OP_READ_FIXED) + return false; + + ret = __io_submit_one(ctx, iocb, index, NULL, true); + + /* + * If we get -EAGAIN, return false to submit out-of-line. Any other + * result and we're done, call will fill in CQ ring event. + */ + return ret != -EAGAIN; +} + +static int io_sq_wq_submit(struct io_ring_ctx *ctx, unsigned int to_submit) +{ + const struct io_uring_iocb *iocb; + struct io_work *work; + unsigned iocb_index; + int ret, queued; + + ret = queued = 0; + while ((iocb = io_peek_sqring(ctx, &iocb_index)) != NULL) { + ret = io_sq_try_inline(ctx, iocb, iocb_index); + if (!ret) { + work = kmalloc(sizeof(*work), GFP_KERNEL); + if (!work) { + ret = -ENOMEM; + break; + } + memcpy(&work->iocb, iocb, sizeof(*iocb)); + io_inc_sqring(ctx); + work->iocb_index = iocb_index; + INIT_WORK(&work->work, io_sq_wq_submit_work); + work->ctx = ctx; + queue_work(ctx->sq_offload.wq, &work->work); + } + queued++; + if (queued == to_submit) + break; + } + + return queued ? 
queued : ret; +} + static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit, unsigned min_complete, unsigned flags) { int ret = 0; if (to_submit) { - ret = io_ring_submit(ctx, to_submit); - if (ret < 0) - return ret; + /* + * Three options here: + * 1) We have an sq thread, just wake it up to do submissions + * 2) We have an sq wq, queue a work item for each iocb + * 3) Submit directly + */ + if (ctx->flags & IORING_SETUP_SQTHREAD) { + wake_up(&ctx->sq_offload.wait); + ret = to_submit; + } else if (ctx->flags & IORING_SETUP_SQWQ) { + ret = io_sq_wq_submit(ctx, to_submit); + } else { + ret = io_ring_submit(ctx, to_submit); + if (ret < 0) + return ret; + } } if (flags & IORING_ENTER_GETEVENTS) { unsigned nr_events = 0; @@ -1192,6 +1472,78 @@ static int io_iocb_buffer_map(struct io_ring_ctx *ctx, return ret; } +static int io_sq_thread(void *); + +static int io_sq_thread_start(struct io_ring_ctx *ctx) +{ + struct io_sq_offload *sqo = &ctx->sq_offload; + struct io_iocb_ring *ring = &ctx->sq_ring; + int ret; + + memset(sqo, 0, sizeof(*sqo)); + init_waitqueue_head(&sqo->wait); + + if (!(ctx->flags & IORING_SETUP_FIXEDBUFS)) + sqo->mm = current->mm; + + ret = -EBADF; + sqo->files = get_files_struct(current); + if (!sqo->files) + goto err; + + if (ctx->flags & IORING_SETUP_SQTHREAD) { + sqo->thread = kthread_create_on_cpu(io_sq_thread, ctx, + ring->sq_thread_cpu, + "io_uring-sq"); + if (IS_ERR(sqo->thread)) { + ret = PTR_ERR(sqo->thread); + sqo->thread = NULL; + goto err; + } + wake_up_process(sqo->thread); + } else if (ctx->flags & IORING_SETUP_SQWQ) { + int concurrency; + + /* Do QD, or 2 * CPUS, whatever is smallest */ + concurrency = min(ring->entries - 1, 2 * num_online_cpus()); + sqo->wq = alloc_workqueue("io_ring-wq", + WQ_UNBOUND | WQ_FREEZABLE, + concurrency); + if (!sqo->wq) { + ret = -ENOMEM; + goto err; + } + } + + return 0; +err: + if (sqo->files) { + put_files_struct(sqo->files); + sqo->files = NULL; + } + if (sqo->mm) + sqo->mm = NULL; + return ret; +} + +static void io_sq_thread_stop(struct io_ring_ctx *ctx) +{ + struct io_sq_offload *sqo = &ctx->sq_offload; + + if (sqo->thread) { + kthread_park(sqo->thread); + kthread_stop(sqo->thread); + sqo->thread = NULL; + } else if (sqo->wq) { + destroy_workqueue(sqo->wq); + sqo->wq = NULL; + } + if (sqo->files) { + put_files_struct(sqo->files); + sqo->files = NULL; + } +} + static void io_free_scq_urings(struct io_ring_ctx *ctx) { if (ctx->sq_ring.ring) { @@ -1212,6 +1564,7 @@ static void io_ring_ctx_free(struct work_struct *work) { struct io_ring_ctx *ctx = container_of(work, struct io_ring_ctx, work); + io_sq_thread_stop(ctx); io_iopoll_reap_events(ctx); io_free_scq_urings(ctx); io_iocb_buffer_unmap(ctx); @@ -1398,6 +1751,13 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p, if (ret) goto err; } + if (p->flags & (IORING_SETUP_SQTHREAD | IORING_SETUP_SQWQ)) { + ctx->sq_ring.sq_thread_cpu = p->sq_thread_cpu; + + ret = io_sq_thread_start(ctx); + if (ret) + goto err; + } ret = anon_inode_getfd("[io_uring]", &io_scqring_fops, ctx, O_RDWR | O_CLOEXEC); @@ -1431,7 +1791,8 @@ SYSCALL_DEFINE3(io_uring_setup, u32, entries, struct iovec __user *, iovecs, return -EINVAL; } - if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_FIXEDBUFS)) + if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_FIXEDBUFS | + IORING_SETUP_SQTHREAD | IORING_SETUP_SQWQ)) return -EINVAL; ret = io_uring_create(entries, &p, iovecs); diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 925fd6ca3f38..4f0a8ce49f9a 
100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -36,6 +36,8 @@ struct io_uring_iocb { */ #define IORING_SETUP_IOPOLL (1 << 0) /* io_context is polled */ #define IORING_SETUP_FIXEDBUFS (1 << 1) /* IO buffers are fixed */ +#define IORING_SETUP_SQTHREAD (1 << 2) /* Use SQ thread */ +#define IORING_SETUP_SQWQ (1 << 3) /* Use SQ workqueue */ #define IORING_OP_READ 1 #define IORING_OP_WRITE 2 @@ -96,7 +98,8 @@ struct io_uring_params { __u32 sq_entries; __u32 cq_entries; __u32 flags; - __u16 resv[10]; + __u16 sq_thread_cpu; + __u16 resv[9]; struct io_sqring_offsets sq_off; struct io_cqring_offsets cq_off; }; From patchwork Tue Jan 8 16:56:44 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jens Axboe X-Patchwork-Id: 10752553 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id A2D6D17E1 for ; Tue, 8 Jan 2019 16:57:28 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 9403B28F9A for ; Tue, 8 Jan 2019 16:57:28 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 8899628FB2; Tue, 8 Jan 2019 16:57:28 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id C3CA728F9A for ; Tue, 8 Jan 2019 16:57:27 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729284AbfAHQ50 (ORCPT ); Tue, 8 Jan 2019 11:57:26 -0500 Received: from mail-io1-f66.google.com ([209.85.166.66]:38825 "EHLO mail-io1-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729266AbfAHQ5V (ORCPT ); Tue, 8 Jan 2019 11:57:21 -0500 Received: by mail-io1-f66.google.com with SMTP id l14so3664993ioj.5 for ; Tue, 08 Jan 2019 08:57:20 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=3RCWV6AgIU8uOx1gA8S68fITvQ9DAWXUgA5iZexF/As=; b=YDm3zG15dDqoiyqTpSzI3QGTguyQNo/jM2PyvqzQbTXBD2XhvGxF3pU+RJs+lEVvXi 8XEb1ocIr8LCgiltwUv3cPaIDerfWD2t8D+4MWIv+w3MN5CF6QKq74WiHP/Y5UykZDfL U3SySAPuDCP4kagpdC3YYOx8RWrrhd5H0IxzxISEn8Pge2c8nd5X7ENCbgZwtOso9aYS u+D9IDxRrPVkYF+A7zGY2oxZeRkRb7p6aAqeuznLSUklwY7VCvP6SAeDzw0pD/BLEfgt vGY3PxttoCYKnFd++N/DxLZ8f26+tyEKxMT7Xv6QtadcR9sMEpcy3k0eTHdqfKbK6loa A3Aw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=3RCWV6AgIU8uOx1gA8S68fITvQ9DAWXUgA5iZexF/As=; b=q8+yxjnfWiJmSxe0Jcay9HVkpx9bH3TmDp0oQNl/iJ9qJMJP58+Yl4iP666bnvljyQ O0S+Z0VTXtey/jVTbNECVngEEX7zk4gebCBTGX4pfqmZJ6M3pTfy6np/ohG0ZGwVjK39 Fa9BDzVKfFBWaJ9PH8qyxuJAPidPG08WRUit7FXqXecsj2b3Zb8U3vt429sr0bx9i9Au U6a34ia+LpBqp5q+oCXw21OvlFeeJkZ+62b3xbqjo2eHmBZhUaypxKF6qBo+veep2wVk X7ho5kE1NoF6yPS4envFCCh/AxXCH9RxWO+/Rw3+JCyOpGQIE/iiBysgfcl9sjgyi/pn o2ew== X-Gm-Message-State: AJcUukflYWkwUkYUZdehHfUvF3P1mP2BeN7UFDp+uuYkq0S6JuTSgh2C 5Ue/nYLlhjWBJzXsUXgy+NBobSOgnP50/w== X-Google-Smtp-Source: 
ALg8bN4MVNEPw0SoJ6+AbUCji6I4UCu7uAmFJcQjcYgPD4q9hVYJLPJ54oa+AWEMfYf7yQC0d+2Wmg== X-Received: by 2002:a6b:d803:: with SMTP id y3mr1633035iob.247.1546966639650; Tue, 08 Jan 2019 08:57:19 -0800 (PST) Received: from localhost.localdomain ([216.160.245.98]) by smtp.gmail.com with ESMTPSA id m10sm17563442ioq.25.2019.01.08.08.57.18 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 08 Jan 2019 08:57:18 -0800 (PST) From: Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe Subject: [PATCH 15/16] io_uring: add submission polling Date: Tue, 8 Jan 2019 09:56:44 -0700 Message-Id: <20190108165645.19311-16-axboe@kernel.dk> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190108165645.19311-1-axboe@kernel.dk> References: <20190108165645.19311-1-axboe@kernel.dk> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP This enables an application to do IO, without ever entering the kernel. By using the SQ ring to fill in new events and watching for completions on the CQ ring, we can submit and reap IOs without doing a single system call. The kernel side thread will poll for new submissions, and in case of HIPRI/polled IO, it'll also poll for completions. For O_DIRECT, we can do this with just SQTHREAD being enabled. For buffered aio, we need the workqueue as well. If we can satisfy the buffered inline from the SQTHREAD, we do that. If not, we punt to the workqueue. This is just like buffered aio off the io_uring_enter(2) system call. Proof of concept. If the thread has been idle for 1 second, it will set sq_ring->flags |= IORING_SQ_NEED_WAKEUP. The application will have to call io_uring_enter() to start things back up again. If IO is kept busy, that will never be needed. Basically an application that has this feature enabled will guard it's io_uring_enter(2) call with: barrier(); if (*sq_ring->flags & IORING_SQ_NEED_WAKEUP) io_uring_enter(fd, to_submit, 0, 0); instead of calling it unconditionally. Improvements: 1) Maybe have smarter backoff. Busy loop for X time, then go to monitor/mwait, finally the schedule we have now after an idle second. Might not be worth the complexity. 2) Probably want the application to pass in the appropriate grace period, not hard code it at 1 second. Signed-off-by: Jens Axboe --- fs/io_uring.c | 135 ++++++++++++++++++++++++++++------ include/uapi/linux/io_uring.h | 3 + 2 files changed, 115 insertions(+), 23 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index e6a808a89b78..6c10841e4342 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -80,7 +80,8 @@ struct io_mapped_ubuf { struct io_sq_offload { struct task_struct *thread; /* if using a thread */ - struct workqueue_struct *wq; /* wq offload */ + bool thread_poll; + struct workqueue_struct *wq; /* wq offload */ struct mm_struct *mm; struct files_struct *files; wait_queue_head_t wait; @@ -198,6 +199,7 @@ static const struct file_operations io_scqring_fops; static void io_ring_ctx_free(struct work_struct *work); static void io_ring_ctx_ref_free(struct percpu_ref *ref); +static void io_sq_wq_submit_work(struct work_struct *work); static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) { @@ -1098,27 +1100,59 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events) return ring->r.head == ring->r.tail ? 
ret : 0; } +static int io_queue_async_work(struct io_ring_ctx *ctx, struct iocb_submit *is) +{ + struct io_work *work; + + work = kmalloc(sizeof(*work), GFP_KERNEL); + if (work) { + memcpy(&work->iocb, is->iocb, sizeof(*is->iocb)); + work->iocb_index = is->index; + INIT_WORK(&work->work, io_sq_wq_submit_work); + work->ctx = ctx; + queue_work(ctx->sq_offload.wq, &work->work); + return 0; + } + + return -ENOMEM; +} + static int io_submit_iocbs(struct io_ring_ctx *ctx, struct iocb_submit *iocbs, unsigned int nr, struct mm_struct *cur_mm, bool mm_fault) { struct io_submit_state state, *statep = NULL; int ret, i, submitted = 0; + bool force_nonblock; if (nr > IO_PLUG_THRESHOLD) { io_submit_state_start(&state, ctx, nr); statep = &state; } + /* + * Having both a thread and a workqueue only makes sense for buffered + * IO, where we can't submit in an async fashion. Use the NOWAIT + * trick from the SQ thread, and punt to the workqueue if we can't + * satisfy this iocb without blocking. This is only necessary + * for buffered IO with sqthread polled submission. + */ + force_nonblock = (ctx->flags & IORING_SETUP_SQWQ) != 0; + for (i = 0; i < nr; i++) { - if (unlikely(mm_fault)) + if (unlikely(mm_fault)) { ret = -EFAULT; - else + } else { ret = __io_submit_one(ctx, iocbs[i].iocb, - iocbs[i].index, statep, false); - if (!ret) { - submitted++; - continue; + iocbs[i].index, statep, + force_nonblock); + /* nogo, submit to workqueue */ + if (force_nonblock && ret == -EAGAIN) + ret = io_queue_async_work(ctx, &iocbs[i]); + if (!ret) { + submitted++; + continue; + } } io_fill_cq_error(ctx, iocbs[i].index, ret); @@ -1131,7 +1165,10 @@ static int io_submit_iocbs(struct io_ring_ctx *ctx, struct iocb_submit *iocbs, } /* - * sq thread only supports O_DIRECT or FIXEDBUFS IO + * SQ thread is woken if the app asked for offloaded submission. This can + * be either O_DIRECT, in which case we do submissions directly, or it can + * be buffered IO, in which case we do them inline if we can do so without + * blocking. If we can't, then we punt to a workqueue. */ static int io_sq_thread(void *data) { @@ -1142,6 +1179,8 @@ static int io_sq_thread(void *data) struct files_struct *old_files; mm_segment_t old_fs; DEFINE_WAIT(wait); + unsigned inflight; + unsigned long timeout; old_files = current->files; current->files = sqo->files; @@ -1149,14 +1188,43 @@ static int io_sq_thread(void *data) old_fs = get_fs(); set_fs(USER_DS); + timeout = inflight = 0; while (!kthread_should_stop()) { const struct io_uring_iocb *iocb; bool mm_fault = false; unsigned iocb_index; int i; + if (sqo->thread_poll && inflight) { + unsigned int nr_events = 0; + + /* + * Normal IO, just pretend everything completed. + * We don't have to poll completions for that. + */ + if (ctx->flags & IORING_SETUP_IOPOLL) + io_iopoll_check(ctx, &nr_events, 0); + else + nr_events = inflight; + + inflight -= nr_events; + if (!inflight) + timeout = jiffies + HZ; + } + iocb = io_peek_sqring(ctx, &iocb_index); if (!iocb) { + /* + * If we're polling, let us spin for a second without + * work before going to sleep. + */ + if (sqo->thread_poll) { + if (inflight || !time_after(jiffies, timeout)) { + cpu_relax(); + continue; + } + } + /* * Drop cur_mm before scheduling, we can't hold it for * long periods (or over schedule()). 
Do this before @@ -1170,6 +1238,16 @@ static int io_sq_thread(void *data) } prepare_to_wait(&sqo->wait, &wait, TASK_INTERRUPTIBLE); + + /* Tell userspace we may need a wakeup call */ + if (sqo->thread_poll) { + struct io_sq_ring *ring; + + ring = ctx->sq_ring.ring; + ring->flags |= IORING_SQ_NEED_WAKEUP; + smp_wmb(); + } + iocb = io_peek_sqring(ctx, &iocb_index); if (!iocb) { if (kthread_should_park()) @@ -1181,6 +1259,13 @@ static int io_sq_thread(void *data) if (signal_pending(current)) flush_signals(current); schedule(); + + if (sqo->thread_poll) { + struct io_sq_ring *ring; + + ring = ctx->sq_ring.ring; + ring->flags &= ~IORING_SQ_NEED_WAKEUP; + } } finish_wait(&sqo->wait, &wait); if (!iocb) @@ -1206,7 +1291,7 @@ static int io_sq_thread(void *data) io_inc_sqring(ctx); } while ((iocb = io_peek_sqring(ctx, &iocb_index)) != NULL); - io_submit_iocbs(ctx, iocbs, i, cur_mm, mm_fault); + inflight += io_submit_iocbs(ctx, iocbs, i, cur_mm, mm_fault); } current->files = old_files; set_fs(old_fs); @@ -1281,7 +1366,6 @@ static bool io_sq_try_inline(struct io_ring_ctx *ctx, static int io_sq_wq_submit(struct io_ring_ctx *ctx, unsigned int to_submit) { const struct io_uring_iocb *iocb; - struct io_work *work; unsigned iocb_index; int ret, queued; @@ -1289,18 +1373,17 @@ static int io_sq_wq_submit(struct io_ring_ctx *ctx, unsigned int to_submit) while ((iocb = io_peek_sqring(ctx, &iocb_index)) != NULL) { ret = io_sq_try_inline(ctx, iocb, iocb_index); if (!ret) { - work = kmalloc(sizeof(*work), GFP_KERNEL); - if (!work) { - ret = -ENOMEM; + struct iocb_submit is = { + .iocb = iocb, + .index = iocb_index + }; + + ret = io_queue_async_work(ctx, &is); + if (ret) break; - } - memcpy(&work->iocb, iocb, sizeof(*iocb)); - io_inc_sqring(ctx); - work->iocb_index = iocb_index; - INIT_WORK(&work->work, io_sq_wq_submit_work); - work->ctx = ctx; - queue_work(ctx->sq_offload.wq, &work->work); } + + io_inc_sqring(ctx); queued++; if (queued == to_submit) break; @@ -1491,6 +1574,9 @@ static int io_sq_thread_start(struct io_ring_ctx *ctx) if (!sqo->files) goto err; + if (ctx->flags & IORING_SETUP_SQPOLL) + sqo->thread_poll = true; + if (ctx->flags & IORING_SETUP_SQTHREAD) { sqo->thread = kthread_create_on_cpu(io_sq_thread, ctx, ring->sq_thread_cpu, @@ -1501,7 +1587,8 @@ static int io_sq_thread_start(struct io_ring_ctx *ctx) goto err; } wake_up_process(sqo->thread); - } else if (ctx->flags & IORING_SETUP_SQWQ) { + } + if (ctx->flags & IORING_SETUP_SQWQ) { int concurrency; /* Do QD, or 2 * CPUS, whatever is smallest */ @@ -1534,7 +1621,8 @@ static void io_sq_thread_stop(struct io_ring_ctx *ctx) kthread_park(sqo->thread); kthread_stop(sqo->thread); sqo->thread = NULL; - } else if (sqo->wq) { + } + if (sqo->wq) { destroy_workqueue(sqo->wq); sqo->wq = NULL; } @@ -1792,7 +1880,8 @@ SYSCALL_DEFINE3(io_uring_setup, u32, entries, struct iovec __user *, iovecs, } if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_FIXEDBUFS | - IORING_SETUP_SQTHREAD | IORING_SETUP_SQWQ)) + IORING_SETUP_SQTHREAD | IORING_SETUP_SQWQ | + IORING_SETUP_SQPOLL)) return -EINVAL; ret = io_uring_create(entries, &p, iovecs); diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 4f0a8ce49f9a..bd665d38dd97 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -38,6 +38,7 @@ struct io_uring_iocb { #define IORING_SETUP_FIXEDBUFS (1 << 1) /* IO buffers are fixed */ #define IORING_SETUP_SQTHREAD (1 << 2) /* Use SQ thread */ #define IORING_SETUP_SQWQ (1 << 3) /* Use SQ workqueue */ +#define IORING_SETUP_SQPOLL 
(1 << 4) /* SQ thread polls */ #define IORING_OP_READ 1 #define IORING_OP_WRITE 2 @@ -76,6 +77,8 @@ struct io_sqring_offsets { __u32 resv[3]; }; +#define IORING_SQ_NEED_WAKEUP (1 << 0) /* needs io_uring_enter wakeup */ + struct io_cqring_offsets { __u32 head; __u32 tail; From patchwork Tue Jan 8 16:56:45 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jens Axboe X-Patchwork-Id: 10752547 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 4268D13B4 for ; Tue, 8 Jan 2019 16:57:26 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 2FC0828FAE for ; Tue, 8 Jan 2019 16:57:26 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 2441628F79; Tue, 8 Jan 2019 16:57:26 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id C857628F8A for ; Tue, 8 Jan 2019 16:57:25 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729229AbfAHQ5Y (ORCPT ); Tue, 8 Jan 2019 11:57:24 -0500 Received: from mail-it1-f193.google.com ([209.85.166.193]:38034 "EHLO mail-it1-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729280AbfAHQ5X (ORCPT ); Tue, 8 Jan 2019 11:57:23 -0500 Received: by mail-it1-f193.google.com with SMTP id h65so6939112ith.3 for ; Tue, 08 Jan 2019 08:57:22 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=JNGzHMnuawXZ2jXlVzeC+GxiAaklWr1vP40pdXT+hKc=; b=bpehgh9DDK5WDCcCbulj1IgtC8kIw5R84av9H6wjlTzTFX8FN6a8UR7lIEIc3y1r4+ 1KgSl5dTK7cJArcwvlOPO34wzP0/1myHfRURDPqgDreCRdSTHdn1XgHIEpw2Wz/bDnQ1 SxsIu9yf9QiiaQYahYWNnEcC/L5vdTN8tDokuElcoFhfc0yrXEU7HfE4sz0QbiYTBNf6 WG2k5sFj1ejGrpameQizvszKIS/Y3jFeLN9K+b3982XXVqXouSL3Yql9bRBdFGoejjBm 3PNbSXl62S2nS8tez9iyVXCx5OB75RJs4Lxkth12VCoatA3LEnw22YYDMVlLdL8694gS OwtA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=JNGzHMnuawXZ2jXlVzeC+GxiAaklWr1vP40pdXT+hKc=; b=UF8lkyd+i/YxNT08/gtpiOn0WVqTdHaE73twNwYxcokwSOMe7JTucUeKoZTEXOscMR btH+haHfKRbKoDb1wW91+2EF0Pis7xlp64YKykpDbNiSSHbNccwu6SXWCTmzoecVhdwa N8zU7CWShpsbVgEOwjoR9i6Nk1+3VylUtHkil4bNnLSoi1IOa3hpV8wpN4RgHvb8Aiyx UAnUzCRAaIP5LGQcDyPgDx7nvNEMomugNzrOb6qn0ibN9e+JABLMhHeSQzGtTJ6I33xR CT9Hi51bJm6q/MwPCJfRvy2fOvsMx1oEKyQr07iTkzXg2eDtLpTl4HMf4r/NE7SmcExj cUHg== X-Gm-Message-State: AJcUuke+dv3ILsZrVFL6PjmZw5cPI+YMTMPwTX+1bU/XWi7NtespRnjJ VG1Xu6CQOGXx8QqkcbmpCY5MSSrvK365HQ== X-Google-Smtp-Source: ALg8bN41HEg0pjekki/EOMUuCmJZVFsL8lHd3AR+Kh06EZzPQAQhJkKttRGANOf9fkwD17ny83bhFw== X-Received: by 2002:a24:52cc:: with SMTP id d195mr1565683itb.29.1546966641543; Tue, 08 Jan 2019 08:57:21 -0800 (PST) Received: from localhost.localdomain ([216.160.245.98]) by smtp.gmail.com with ESMTPSA id m10sm17563442ioq.25.2019.01.08.08.57.19 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 08 Jan 
2019 08:57:20 -0800 (PST) From: Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe Subject: [PATCH 16/16] io_uring: add io_uring_event cache hit information Date: Tue, 8 Jan 2019 09:56:45 -0700 Message-Id: <20190108165645.19311-17-axboe@kernel.dk> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190108165645.19311-1-axboe@kernel.dk> References: <20190108165645.19311-1-axboe@kernel.dk> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Add hint on whether a read was served out of the page cache, or if it hit media. This is useful for buffered async IO, O_DIRECT reads would never have this set (for obvious reasons). Signed-off-by: Jens Axboe --- fs/io_uring.c | 6 +++++- include/uapi/linux/io_uring.h | 5 +++++ 2 files changed, 10 insertions(+), 1 deletion(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 6c10841e4342..50b9cfa8c861 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -524,12 +524,16 @@ static void io_fill_cq_error(struct io_ring_ctx *ctx, unsigned ki_index, static void io_complete_scqring_rw(struct kiocb *kiocb, long res, long res2) { struct io_kiocb *iocb = container_of(kiocb, struct io_kiocb, rw); + unsigned ev_flags = 0; kiocb_end_write(kiocb); fput(kiocb->ki_filp); - io_complete_scqring(iocb, res, 0); + if (res > 0 && test_bit(KIOCB_F_FORCE_NONBLOCK, &iocb->ki_flags)) + ev_flags = IOEV_FLAG_CACHEHIT; + + io_complete_scqring(iocb, res, ev_flags); } static void io_complete_scqring_iopoll(struct kiocb *kiocb, long res, long res2) diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index bd665d38dd97..7dd21126f142 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -56,6 +56,11 @@ struct io_uring_event { __u32 flags; }; +/* + * io_uring_event->flags + */ +#define IOEV_FLAG_CACHEHIT (1 << 0) /* IO did not hit media */ + /* * Magic offsets for the application to mmap the data it needs */
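To close out the series, a hedged sketch of the completion side an application might pair with the polling and cache-hit changes above. The ring accessors below are stand-ins for whatever mmap()-based mapping the application set up from io_cqring_offsets (and memory barriers are elided); only struct io_uring_event, IOEV_FLAG_CACHEHIT and IORING_SQ_NEED_WAKEUP are defined by this series.

/* assumes the uapi io_uring.h from this series is included */
static unsigned reap_events(volatile unsigned *cq_head, volatile unsigned *cq_tail,
			    struct io_uring_event *events, unsigned ring_mask,
			    unsigned *cache_hits)
{
	unsigned head = *cq_head, seen = 0;

	while (head != *cq_tail) {
		struct io_uring_event *ev = &events[head & ring_mask];

		if (ev->flags & IOEV_FLAG_CACHEHIT)
			(*cache_hits)++;	/* buffered read completed without touching media */
		head++;
		seen++;
	}
	*cq_head = head;
	return seen;
}

When IORING_SETUP_SQPOLL is also in use, a loop like this sits alongside the wakeup guard shown in the previous patch: io_uring_enter(2) is only called when sq_ring->flags has IORING_SQ_NEED_WAKEUP set.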