From patchwork Thu Feb  7 19:55:35 2019
X-Patchwork-Submitter: Jens Axboe
X-Patchwork-Id: 10802073
From: Jens Axboe
To: linux-aio@kvack.org, linux-block@vger.kernel.org, linux-api@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, jannh@google.com, viro@ZenIV.linux.org.uk, Jens Axboe
Subject: [PATCH 01/18] fs: add an iopoll method to struct file_operations
Date: Thu,  7 Feb 2019 12:55:35 -0700
Message-Id: <20190207195552.22770-2-axboe@kernel.dk>
In-Reply-To: <20190207195552.22770-1-axboe@kernel.dk>
References: <20190207195552.22770-1-axboe@kernel.dk>

From: Christoph Hellwig

This new method is used to explicitly poll for I/O completion for an
iocb. It must be called for any iocb submitted asynchronously (that
is with a non-null ki_complete) which has the IOCB_HIPRI flag set.

The method is assisted by a new ki_cookie field in struct iocb to store
the polling cookie.

Reviewed-by: Johannes Thumshirn
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
---
 Documentation/filesystems/vfs.txt | 3 +++
 include/linux/fs.h                | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 8dc8e9c2913f..761c6fd24a53 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -857,6 +857,7 @@ struct file_operations {
 	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
 	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
 	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
+	int (*iopoll)(struct kiocb *kiocb, bool spin);
 	int (*iterate) (struct file *, struct dir_context *);
 	int (*iterate_shared) (struct file *, struct dir_context *);
 	__poll_t (*poll) (struct file *, struct poll_table_struct *);
@@ -902,6 +903,8 @@ otherwise noted.

  write_iter: possibly asynchronous write with iov_iter as source

+  iopoll: called when aio wants to poll for completions on HIPRI iocbs
+
  iterate: called when the VFS needs to read the directory contents

  iterate_shared: called when the VFS needs to read the directory contents
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 29d8e2cfed0e..dedcc2e9265c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -310,6 +310,7 @@ struct kiocb {
 	int			ki_flags;
 	u16			ki_hint;
 	u16			ki_ioprio; /* See linux/ioprio.h */
+	unsigned int		ki_cookie; /* for ->iopoll */
 } __randomize_layout;

 static inline bool is_sync_kiocb(struct kiocb *kiocb)
 {
@@ -1787,6 +1788,7 @@ struct file_operations {
 	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
 	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
 	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
+	int (*iopoll)(struct kiocb *kiocb, bool spin);
 	int (*iterate) (struct file *, struct dir_context *);
 	int (*iterate_shared) (struct file *, struct dir_context *);
 	__poll_t (*poll) (struct file *, struct poll_table_struct *);
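[Editor's note: nothing in this patch calls the new method yet; the pollers
arrive later in the series. For orientation, a submitter that issued a HIPRI
iocb would be expected to drive ->iopoll roughly as sketched below. This is
illustrative only, with hypothetical names -- it is not code from the series.]

	/*
	 * Sketch: reap a HIPRI iocb that was submitted asynchronously.
	 * "completed" stands in for whatever state the caller's own
	 * ki_complete handler updates on completion.
	 */
	static int example_reap_hipri(struct kiocb *kiocb, bool *completed)
	{
		int ret;

		if (!kiocb->ki_filp->f_op->iopoll)
			return -EOPNOTSUPP;

		while (!READ_ONCE(*completed)) {
			/* spin == true: actively poll the device for completions */
			ret = kiocb->ki_filp->f_op->iopoll(kiocb, true);
			if (ret < 0)
				return ret;
		}
		return 0;
	}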
From patchwork Thu Feb  7 19:55:36 2019
X-Patchwork-Submitter: Jens Axboe
X-Patchwork-Id: 10802071
From: Jens Axboe
To: linux-aio@kvack.org, linux-block@vger.kernel.org, linux-api@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, jannh@google.com, viro@ZenIV.linux.org.uk, Jens Axboe
Subject: [PATCH 02/18] block: wire up block device iopoll method
Date: Thu,  7 Feb 2019 12:55:36 -0700
Message-Id: <20190207195552.22770-3-axboe@kernel.dk>
In-Reply-To: <20190207195552.22770-1-axboe@kernel.dk>
References: <20190207195552.22770-1-axboe@kernel.dk>

From: Christoph Hellwig

Just call blk_poll on the iocb cookie, we can derive the block device
from the inode trivially.

Reviewed-by: Johannes Thumshirn
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
---
 fs/block_dev.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 58a4c1217fa8..f18d076a2596 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -293,6 +293,14 @@ struct blkdev_dio {

 static struct bio_set blkdev_dio_pool;

+static int blkdev_iopoll(struct kiocb *kiocb, bool wait)
+{
+	struct block_device *bdev = I_BDEV(kiocb->ki_filp->f_mapping->host);
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	return blk_poll(q, READ_ONCE(kiocb->ki_cookie), wait);
+}
+
 static void blkdev_bio_end_io(struct bio *bio)
 {
 	struct blkdev_dio *dio = bio->bi_private;
@@ -410,6 +418,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 			bio->bi_opf |= REQ_HIPRI;

 		qc = submit_bio(bio);
+		WRITE_ONCE(iocb->ki_cookie, qc);
 		break;
 	}

@@ -2076,6 +2085,7 @@ const struct file_operations def_blk_fops = {
 	.llseek		= block_llseek,
 	.read_iter	= blkdev_read_iter,
 	.write_iter	= blkdev_write_iter,
+	.iopoll		= blkdev_iopoll,
 	.mmap		= generic_file_mmap,
 	.fsync		= blkdev_fsync,
 	.unlocked_ioctl	= block_ioctl,
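[Editor's note: the round trip this creates, sketched for clarity -- not part
of the patch. The blk_qc_t cookie returned by submit_bio() identifies where
to poll, and ki_cookie carries it from submission to blkdev_iopoll():]

	/* Sketch only: the submission half of the pairing used above. */
	static void example_submit_polled(struct kiocb *iocb, struct bio *bio)
	{
		blk_qc_t qc;

		bio->bi_opf |= REQ_HIPRI;		/* complete by polling, not IRQ */
		qc = submit_bio(bio);			/* cookie names the hctx/tag to poll */
		WRITE_ONCE(iocb->ki_cookie, qc);	/* pairs with READ_ONCE in blkdev_iopoll() */
	}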
From patchwork Thu Feb  7 19:55:37 2019
X-Patchwork-Submitter: Jens Axboe
X-Patchwork-Id: 10802075
From: Jens Axboe
To: linux-aio@kvack.org, linux-block@vger.kernel.org, linux-api@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, jannh@google.com, viro@ZenIV.linux.org.uk, Jens Axboe
Subject: [PATCH 03/18] block: add bio_set_polled() helper
Date: Thu,  7 Feb 2019 12:55:37 -0700
Message-Id: <20190207195552.22770-4-axboe@kernel.dk>
In-Reply-To: <20190207195552.22770-1-axboe@kernel.dk>
References: <20190207195552.22770-1-axboe@kernel.dk>

For the upcoming async polled IO, we can't sleep allocating requests.
If we do, then we introduce a deadlock where the submitter already has
async polled IO in-flight, but can't wait for them to complete since
polled requests must be actively found and reaped.

Utilize the helper in the blockdev DIRECT_IO code.

Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe
---
 fs/block_dev.c      |  4 ++--
 include/linux/bio.h | 14 ++++++++++++++
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index f18d076a2596..392e2bfb636f 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -247,7 +247,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 		task_io_account_write(ret);
 	}
 	if (iocb->ki_flags & IOCB_HIPRI)
-		bio.bi_opf |= REQ_HIPRI;
+		bio_set_polled(&bio, iocb);

 	qc = submit_bio(&bio);
 	for (;;) {
@@ -415,7 +415,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 		nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
 		if (!nr_pages) {
 			if (iocb->ki_flags & IOCB_HIPRI)
-				bio->bi_opf |= REQ_HIPRI;
+				bio_set_polled(bio, iocb);

 			qc = submit_bio(bio);
 			WRITE_ONCE(iocb->ki_cookie, qc);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7380b094dcca..f6f0a2b3cbc8 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -823,5 +823,19 @@ static inline int bio_integrity_add_page(struct bio *bio, struct page *page,

 #endif /* CONFIG_BLK_DEV_INTEGRITY */

+/*
+ * Mark a bio as polled. Note that for async polled IO, the caller must
+ * expect -EWOULDBLOCK if we cannot allocate a request (or other resources).
+ * We cannot block waiting for requests on polled IO, as those completions
+ * must be found by the caller. This is different than IRQ driven IO, where
+ * it's safe to wait for IO to complete.
+ */
+static inline void bio_set_polled(struct bio *bio, struct kiocb *kiocb)
+{
+	bio->bi_opf |= REQ_HIPRI;
+	if (!is_sync_kiocb(kiocb))
+		bio->bi_opf |= REQ_NOWAIT;
+}
+
 #endif /* CONFIG_BLOCK */
 #endif /* __LINUX_BIO_H */
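[Editor's note: concretely, the deadlock being avoided is this: if the
submitter sleeps in request allocation while it already has polled IO in
flight, nobody is left to poll, so the tags it is waiting on are never
freed. The sketch below, under assumed names and not from the series, shows
how an async polled submitter is expected to react once REQ_NOWAIT makes
the allocation fail instead of sleeping:]

	/* Sketch only: async polled submission with a non-sleeping allocation. */
	static void example_polled_submit(struct kiocb *iocb, struct bio *bio)
	{
		bio_set_polled(bio, iocb);	/* REQ_HIPRI, plus REQ_NOWAIT if async */

		/*
		 * If the queue has no free requests, this bio now completes
		 * with BLK_STS_AGAIN instead of blocking. The completion
		 * handler can report -EAGAIN so the caller first reaps its
		 * in-flight polled IO and then retries the submission.
		 */
		iocb->ki_cookie = submit_bio(bio);
	}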
From patchwork Thu Feb  7 19:55:38 2019
X-Patchwork-Submitter: Jens Axboe
X-Patchwork-Id: 10802077
From: Jens Axboe
To: linux-aio@kvack.org, linux-block@vger.kernel.org, linux-api@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, jannh@google.com, viro@ZenIV.linux.org.uk, Jens Axboe
Subject: [PATCH 04/18] iomap: wire up the iopoll method
Date: Thu,  7 Feb 2019 12:55:38 -0700
Message-Id: <20190207195552.22770-5-axboe@kernel.dk>
In-Reply-To: <20190207195552.22770-1-axboe@kernel.dk>
References: <20190207195552.22770-1-axboe@kernel.dk>

From: Christoph Hellwig

Store the request queue the last bio was submitted to in the iocb
private data in addition to the cookie so that we find the right block
device. Also refactor the common direct I/O bio submission code into a
nice little helper.

Signed-off-by: Christoph Hellwig
Modified to use bio_set_polled().
Signed-off-by: Jens Axboe
---
 fs/gfs2/file.c        |  2 ++
 fs/iomap.c            | 43 ++++++++++++++++++++++++++++---------------
 fs/xfs/xfs_file.c     |  1 +
 include/linux/iomap.h |  1 +
 4 files changed, 32 insertions(+), 15 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index a2dea5bc0427..58a768e59712 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -1280,6 +1280,7 @@ const struct file_operations gfs2_file_fops = {
 	.llseek		= gfs2_llseek,
 	.read_iter	= gfs2_file_read_iter,
 	.write_iter	= gfs2_file_write_iter,
+	.iopoll		= iomap_dio_iopoll,
 	.unlocked_ioctl	= gfs2_ioctl,
 	.mmap		= gfs2_mmap,
 	.open		= gfs2_open,
@@ -1310,6 +1311,7 @@ const struct file_operations gfs2_file_fops_nolock = {
 	.llseek		= gfs2_llseek,
 	.read_iter	= gfs2_file_read_iter,
 	.write_iter	= gfs2_file_write_iter,
+	.iopoll		= iomap_dio_iopoll,
 	.unlocked_ioctl	= gfs2_ioctl,
 	.mmap		= gfs2_mmap,
 	.open		= gfs2_open,
diff --git a/fs/iomap.c b/fs/iomap.c
index 897c60215dd1..2ac9eb746d44 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1463,6 +1463,28 @@ struct iomap_dio {
 	};
 };

+int iomap_dio_iopoll(struct kiocb *kiocb, bool spin)
+{
+	struct request_queue *q = READ_ONCE(kiocb->private);
+
+	if (!q)
+		return 0;
+	return blk_poll(q, READ_ONCE(kiocb->ki_cookie), spin);
+}
+EXPORT_SYMBOL_GPL(iomap_dio_iopoll);
+
+static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap,
+		struct bio *bio)
+{
+	atomic_inc(&dio->ref);
+
+	if (dio->iocb->ki_flags & IOCB_HIPRI)
+		bio_set_polled(bio, dio->iocb);
+
+	dio->submit.last_queue = bdev_get_queue(iomap->bdev);
+	dio->submit.cookie = submit_bio(bio);
+}
+
 static ssize_t iomap_dio_complete(struct iomap_dio *dio)
 {
 	struct kiocb *iocb = dio->iocb;
@@ -1575,7 +1597,7 @@ static void iomap_dio_bio_end_io(struct bio *bio)
 	}
 }

-static blk_qc_t
+static void
 iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
 		unsigned len)
 {
@@ -1589,15 +1611,10 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;

-	if (dio->iocb->ki_flags & IOCB_HIPRI)
-		flags |= REQ_HIPRI;
-
 	get_page(page);
 	__bio_add_page(bio, page, len, 0);
 	bio_set_op_attrs(bio, REQ_OP_WRITE, flags);
-
-	atomic_inc(&dio->ref);
-	return submit_bio(bio);
+	iomap_dio_submit_bio(dio, iomap, bio);
 }

 static loff_t
@@ -1700,9 +1717,6 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 			bio_set_pages_dirty(bio);
 		}

-		if (dio->iocb->ki_flags & IOCB_HIPRI)
-			bio->bi_opf |= REQ_HIPRI;
-
 		iov_iter_advance(dio->submit.iter, n);

 		dio->size += n;
@@ -1710,11 +1724,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 		copied += n;

 		nr_pages = iov_iter_npages(&iter, BIO_MAX_PAGES);
-
-		atomic_inc(&dio->ref);
-
-		dio->submit.last_queue = bdev_get_queue(iomap->bdev);
-		dio->submit.cookie = submit_bio(bio);
+		iomap_dio_submit_bio(dio, iomap, bio);
 	} while (nr_pages);

 	/*
@@ -1925,6 +1935,9 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (dio->flags & IOMAP_DIO_WRITE_FUA)
 		dio->flags &= ~IOMAP_DIO_NEED_SYNC;

+	WRITE_ONCE(iocb->ki_cookie, dio->submit.cookie);
+	WRITE_ONCE(iocb->private, dio->submit.last_queue);
+
 	/*
 	 * We are about to drop our additional submission reference, which
 	 * might be the last reference to the dio. There are three three
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e47425071e65..60c2da41f0fc 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1203,6 +1203,7 @@ const struct file_operations xfs_file_operations = {
 	.write_iter	= xfs_file_write_iter,
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
+	.iopoll		= iomap_dio_iopoll,
 	.unlocked_ioctl	= xfs_file_ioctl,
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= xfs_file_compat_ioctl,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 9a4258154b25..0fefb5455bda 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -162,6 +162,7 @@ typedef int (iomap_dio_end_io_t)(struct kiocb *iocb, ssize_t ret,
 		unsigned flags);
 ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, iomap_dio_end_io_t end_io);
+int iomap_dio_iopoll(struct kiocb *kiocb, bool spin);

 #ifdef CONFIG_SWAP
 struct file;
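[Editor's note: with iomap_dio_iopoll() exported, any filesystem that does
its direct I/O through iomap can opt in by pointing ->iopoll at it, as gfs2
and xfs do above. Purely for illustration -- a hypothetical filesystem, not
part of the patch:]

	const struct file_operations example_fs_file_operations = {
		.llseek		= generic_file_llseek,
		.read_iter	= example_fs_read_iter,	/* must go through iomap_dio_rw() */
		.write_iter	= example_fs_write_iter,
		.iopoll		= iomap_dio_iopoll,	/* polls kiocb->private + ki_cookie */
	};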
From patchwork Thu Feb  7 19:55:39 2019
X-Patchwork-Submitter: Jens Axboe
X-Patchwork-Id: 10802079
From: Jens Axboe
To: linux-aio@kvack.org, linux-block@vger.kernel.org, linux-api@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, jannh@google.com, viro@ZenIV.linux.org.uk, Jens Axboe
Subject: [PATCH 05/18] Add io_uring IO interface
Date: Thu,  7 Feb 2019 12:55:39 -0700
Message-Id: <20190207195552.22770-6-axboe@kernel.dk>
In-Reply-To: <20190207195552.22770-1-axboe@kernel.dk>
References: <20190207195552.22770-1-axboe@kernel.dk>

The submission queue (SQ) and completion queue (CQ) rings are shared
between the application and the kernel. This eliminates the need to
copy data back and forth to submit and complete IO.

IO submissions use the io_uring_sqe data structure, and completions
are generated in the form of io_uring_cqe data structures. The SQ
ring is an index into the io_uring_sqe array, which makes it possible
to submit a batch of IOs without them being contiguous in the ring.
The CQ ring is always contiguous, as completion events are inherently
unordered, and hence any io_uring_cqe entry can point back to an
arbitrary submission.

Two new system calls are added for this:

io_uring_setup(entries, params)
	Sets up a context for doing async IO. On success, returns a file
	descriptor that the application can mmap to gain access to the
	SQ ring, CQ ring, and io_uring_sqes.

io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize)
	Initiates IO against the rings mapped to this fd, or waits for
	them to complete, or both. The behavior is controlled by the
	parameters passed in. If 'to_submit' is non-zero, then we'll
	try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
	kernel will wait for 'min_complete' events, if they aren't
	already available. It's valid to set IORING_ENTER_GETEVENTS
	and 'min_complete' == 0 at the same time; this allows the
	kernel to return already completed events without waiting for
	them. This is useful only for polling, as for IRQ driven IO,
	the application can just check the CQ ring without entering
	the kernel.
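[Editor's note: putting the two calls together, a bare-bones userspace
sequence looks roughly like the sketch below. It is not code from the patch:
raw syscall(2) is used because no libc wrappers exist yet, the x86-64 numbers
425/426 come from the syscall tables added below, and error handling plus the
sqe-array and CQ-ring mappings are elided for brevity.]

	#include <string.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <linux/io_uring.h>	/* uapi header added by this patch */

	int main(void)
	{
		struct io_uring_params p;
		size_t sq_sz;
		void *sq_ring;
		int fd;

		memset(&p, 0, sizeof(p));
		fd = syscall(425 /* io_uring_setup */, 4, &p);
		if (fd < 0)
			return 1;

		/* SQ ring: offsets into the mapping come back in p.sq_off */
		sq_sz = p.sq_off.array + p.sq_entries * sizeof(unsigned);
		sq_ring = mmap(NULL, sq_sz, PROT_READ | PROT_WRITE,
			       MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_SQ_RING);
		if (sq_ring == MAP_FAILED)
			return 1;

		/*
		 * ...map IORING_OFF_SQES and IORING_OFF_CQ_RING the same way,
		 * fill an sqe, advance the SQ tail, then submit and wait:
		 */
		if (syscall(426 /* io_uring_enter */, fd, 1, 1,
			    IORING_ENTER_GETEVENTS, NULL, 0) < 0)
			return 1;
		return 0;
	}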
With this setup, it's possible to do async IO with a single system
call. Future developments will enable polled IO with this interface,
and polled submission as well. The latter will enable an application
to do IO without doing ANY system calls at all.

For IRQ driven IO, an application only needs to enter the kernel for
completions if it wants to wait for them to occur.

Each io_uring is backed by a workqueue, to support buffered async IO
as well. We will only punt to an async context if the command would
need to wait for IO on the device side. Any data that can be accessed
directly in the page cache is done inline. This avoids the slowness
issue of usual threadpools, since cached data is accessed as quickly
as a sync interface.

Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c

Signed-off-by: Jens Axboe
---
 arch/x86/entry/syscalls/syscall_32.tbl |    2 +
 arch/x86/entry/syscalls/syscall_64.tbl |    2 +
 fs/Makefile                            |    1 +
 fs/io_uring.c                          | 1174 ++++++++++++++++++++++++
 include/linux/fs.h                     |    9 +
 include/linux/syscalls.h               |    6 +
 include/uapi/asm-generic/unistd.h      |    6 +-
 include/uapi/linux/io_uring.h          |   95 ++
 init/Kconfig                           |    9 +
 kernel/sys_ni.c                        |    2 +
 net/unix/garbage.c                     |    3 +
 11 files changed, 1308 insertions(+), 1 deletion(-)
 create mode 100644 fs/io_uring.c
 create mode 100644 include/uapi/linux/io_uring.h

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 3cf7b533b3d1..481c126259e9 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -398,3 +398,5 @@
 384	i386	arch_prctl	sys_arch_prctl	__ia32_compat_sys_arch_prctl
 385	i386	io_pgetevents	sys_io_pgetevents	__ia32_compat_sys_io_pgetevents
 386	i386	rseq	sys_rseq	__ia32_sys_rseq
+425	i386	io_uring_setup	sys_io_uring_setup	__ia32_sys_io_uring_setup
+426	i386	io_uring_enter	sys_io_uring_enter	__ia32_sys_io_uring_enter
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index f0b1709a5ffb..6a32a430c8e0 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -343,6 +343,8 @@
 332	common	statx		__x64_sys_statx
 333	common	io_pgetevents	__x64_sys_io_pgetevents
 334	common	rseq		__x64_sys_rseq
+425	common	io_uring_setup	__x64_sys_io_uring_setup
+426	common	io_uring_enter	__x64_sys_io_uring_enter

 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/Makefile b/fs/Makefile
index 293733f61594..8e15d6fc4340 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -30,6 +30,7 @@ obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
 obj-$(CONFIG_USERFAULTFD)	+= userfaultfd.o
 obj-$(CONFIG_AIO)		+= aio.o
+obj-$(CONFIG_IO_URING)		+= io_uring.o
 obj-$(CONFIG_FS_DAX)		+= dax.o
 obj-$(CONFIG_FS_ENCRYPTION)	+= crypto/
 obj-$(CONFIG_FILE_LOCKING)	+= locks.o
diff --git a/fs/io_uring.c b/fs/io_uring.c
new file mode 100644
index 000000000000..39a6cd14f0ec
--- /dev/null
+++ b/fs/io_uring.c
@@ -0,0 +1,1174 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Shared application/kernel submission and completion ring pairs, for
+ * supporting fast/efficient IO.
+ *
+ * Copyright (C) 2018-2019 Jens Axboe
+ */
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/errno.h>
+#include <linux/syscalls.h>
+#include <linux/compat.h>
+#include <linux/refcount.h>
+#include <linux/uio.h>
+
+#include <linux/sched/signal.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/mmu_context.h>
+#include <linux/percpu.h>
+#include <linux/slab.h>
+#include <linux/workqueue.h>
+#include <linux/blkdev.h>
+#include <linux/net.h>
+#include <net/sock.h>
+#include <net/af_unix.h>
+#include <linux/anon_inodes.h>
+#include <linux/sched/mm.h>
+#include <linux/uaccess.h>
+#include <linux/nospec.h>
+
+#include <uapi/linux/io_uring.h>
+
+#include "internal.h"
+
+#define IORING_MAX_ENTRIES	4096
+
+struct io_uring {
+	u32 head ____cacheline_aligned_in_smp;
+	u32 tail ____cacheline_aligned_in_smp;
+};
+
+struct io_sq_ring {
+	struct io_uring		r;
+	u32			ring_mask;
+	u32			ring_entries;
+	u32			dropped;
+	u32			flags;
+	u32			array[];
+};
+
+struct io_cq_ring {
+	struct io_uring		r;
+	u32			ring_mask;
+	u32			ring_entries;
+	u32			overflow;
+	struct io_uring_cqe	cqes[];
+};
+
+struct io_ring_ctx {
+	struct {
+		struct percpu_ref	refs;
+	} ____cacheline_aligned_in_smp;
+
+	struct {
+		unsigned int		flags;
+		bool			compat;
+
+		/* SQ ring */
+		struct io_sq_ring	*sq_ring;
+		unsigned		cached_sq_head;
+		unsigned		sq_entries;
+		unsigned		sq_mask;
+		struct io_uring_sqe	*sq_sqes;
+	} ____cacheline_aligned_in_smp;
+
+	/* IO offload */
+	struct workqueue_struct	*sqo_wq;
+	struct mm_struct	*sqo_mm;
+
+	struct {
+		/* CQ ring */
+		struct io_cq_ring	*cq_ring;
+		unsigned		cached_cq_tail;
+		unsigned		cq_entries;
+		unsigned		cq_mask;
+		struct wait_queue_head	cq_wait;
+		struct fasync_struct	*cq_fasync;
+	} ____cacheline_aligned_in_smp;
+
+	struct user_struct	*user;
+
+	struct completion	ctx_done;
+
+	struct {
+		struct mutex		uring_lock;
+		wait_queue_head_t	wait;
+	} ____cacheline_aligned_in_smp;
+
+	struct {
+		spinlock_t		completion_lock;
+	} ____cacheline_aligned_in_smp;
+
+#if defined(CONFIG_NET)
+	struct socket		*ring_sock;
+#endif
+};
+
+struct sqe_submit {
+	const struct io_uring_sqe	*sqe;
+	unsigned short			index;
+	bool				has_user;
+};
+
+struct io_kiocb {
+	struct kiocb		rw;
+
+	struct sqe_submit	submit;
+
+	struct io_ring_ctx	*ctx;
+	struct list_head	list;
+	unsigned int		flags;
+#define REQ_F_FORCE_NONBLOCK	1	/* inline submission attempt */
+	u64			user_data;
+
+	struct work_struct	work;
+};
+
+#define IO_PLUG_THRESHOLD	2
+
+static struct kmem_cache *req_cachep;
+
+static const struct file_operations io_uring_fops;
+
+struct sock *io_uring_get_socket(struct file *file)
+{
+#if defined(CONFIG_NET)
+	if (file->f_op == &io_uring_fops) {
+		struct io_ring_ctx *ctx = file->private_data;
+
+		return ctx->ring_sock->sk;
+	}
+#endif
+	return NULL;
+}
+
+static void io_ring_ctx_ref_free(struct percpu_ref *ref)
+{
+	struct io_ring_ctx *ctx = container_of(ref, struct io_ring_ctx, refs);
+
+	complete(&ctx->ctx_done);
+}
+
+static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
+{
+	struct io_ring_ctx *ctx;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return NULL;
+
+	if (percpu_ref_init(&ctx->refs, io_ring_ctx_ref_free, 0, GFP_KERNEL)) {
+		kfree(ctx);
+		return NULL;
+	}
+
+	ctx->flags = p->flags;
+	init_waitqueue_head(&ctx->cq_wait);
+	init_completion(&ctx->ctx_done);
+	mutex_init(&ctx->uring_lock);
+	init_waitqueue_head(&ctx->wait);
+	spin_lock_init(&ctx->completion_lock);
+	return ctx;
+}
+
+static void io_commit_cqring(struct io_ring_ctx *ctx)
+{
+	struct io_cq_ring *ring = ctx->cq_ring;
+
+	if (ctx->cached_cq_tail != ring->r.tail) {
+		/* order cqe stores with ring update */
+		smp_wmb();
+		WRITE_ONCE(ring->r.tail, ctx->cached_cq_tail);
+		/* write side barrier of tail update, app has read side */
+		smp_wmb();
+
+		if (wq_has_sleeper(&ctx->cq_wait)) {
+			wake_up_interruptible(&ctx->cq_wait);
+			kill_fasync(&ctx->cq_fasync, SIGIO, POLL_IN);
+		}
+	}
+}
+
+static struct io_uring_cqe *io_get_cqring(struct io_ring_ctx *ctx)
+{
+	struct io_cq_ring *ring = ctx->cq_ring;
+	unsigned tail;
+
+	tail = ctx->cached_cq_tail;
+	smp_rmb();
+	if (tail + 1 == READ_ONCE(ring->r.head))
+		return NULL;
+
+	ctx->cached_cq_tail++;
+	return &ring->cqes[tail & ctx->cq_mask];
+}
+
+static void io_cqring_fill_event(struct io_ring_ctx *ctx, u64 ki_user_data,
+				 long res, unsigned ev_flags)
+{
+	struct io_uring_cqe *cqe;
+
+	/*
+	 * If we can't get a cq entry, userspace overflowed the
+	 * submission (by quite a lot). Increment the overflow count in
+	 * the ring.
+	 */
+	cqe = io_get_cqring(ctx);
+	if (cqe) {
+		cqe->user_data = ki_user_data;
+		cqe->res = res;
+		cqe->flags = ev_flags;
+	} else
+		ctx->cq_ring->overflow++;
+}
+
+static void io_cqring_add_event(struct io_ring_ctx *ctx, u64 ki_user_data,
+				long res, unsigned ev_flags)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&ctx->completion_lock, flags);
+	io_cqring_fill_event(ctx, ki_user_data, res, ev_flags);
+	io_commit_cqring(ctx);
+	spin_unlock_irqrestore(&ctx->completion_lock, flags);
+
+	if (waitqueue_active(&ctx->wait))
+		wake_up(&ctx->wait);
+}
+
+static void io_ring_drop_ctx_refs(struct io_ring_ctx *ctx, unsigned refs)
+{
+	percpu_ref_put_many(&ctx->refs, refs);
+
+	if (waitqueue_active(&ctx->wait))
+		wake_up(&ctx->wait);
+}
+
+static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx)
+{
+	struct io_kiocb *req;
+
+	if (!percpu_ref_tryget(&ctx->refs))
+		return NULL;
+
+	req = kmem_cache_alloc(req_cachep, __GFP_NOWARN);
+	if (req) {
+		req->ctx = ctx;
+		req->flags = 0;
+		return req;
+	}
+
+	io_ring_drop_ctx_refs(ctx, 1);
+	return NULL;
+}
+
+static void io_free_req(struct io_kiocb *req)
+{
+	io_ring_drop_ctx_refs(req->ctx, 1);
+	kmem_cache_free(req_cachep, req);
+}
+
+static void kiocb_end_write(struct kiocb *kiocb)
+{
+	if (kiocb->ki_flags & IOCB_WRITE) {
+		struct inode *inode = file_inode(kiocb->ki_filp);
+
+		/*
+		 * Tell lockdep we inherited freeze protection from submission
+		 * thread.
+		 */
+		if (S_ISREG(inode->i_mode))
+			__sb_writers_acquired(inode->i_sb, SB_FREEZE_WRITE);
+		file_end_write(kiocb->ki_filp);
+	}
+}
+
+static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
+{
+	struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw);
+
+	kiocb_end_write(kiocb);
+
+	fput(kiocb->ki_filp);
+	io_cqring_add_event(req->ctx, req->user_data, res, 0);
+	io_free_req(req);
+}
+
+/*
+ * If we tracked the file through the SCM inflight mechanism, we could support
+ * any file. For now, just ensure that anything potentially problematic is done
+ * inline.
+ */
+static bool io_file_supports_async(struct file *file)
+{
+	umode_t mode = file_inode(file)->i_mode;
+
+	if (S_ISBLK(mode) || S_ISCHR(mode))
+		return true;
+	if (S_ISREG(mode) && file->f_op != &io_uring_fops)
+		return true;
+
+	return false;
+}
+
+static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
+		      bool force_nonblock)
+{
+	struct kiocb *kiocb = &req->rw;
+	unsigned ioprio;
+	int fd, ret;
+
+	/* For -EAGAIN retry, everything is already prepped */
+	if (kiocb->ki_filp)
+		return 0;
+
+	fd = READ_ONCE(sqe->fd);
+	kiocb->ki_filp = fget(fd);
+	if (unlikely(!kiocb->ki_filp))
+		return -EBADF;
+	if (force_nonblock && !io_file_supports_async(kiocb->ki_filp))
+		force_nonblock = false;
+	kiocb->ki_pos = READ_ONCE(sqe->off);
+	kiocb->ki_flags = iocb_flags(kiocb->ki_filp);
+	kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp));
+
+	ioprio = READ_ONCE(sqe->ioprio);
+	if (ioprio) {
+		ret = ioprio_check_cap(ioprio);
+		if (ret)
+			goto out_fput;
+
+		kiocb->ki_ioprio = ioprio;
+	} else
+		kiocb->ki_ioprio = get_current_ioprio();
+
+	ret = kiocb_set_rw_flags(kiocb, READ_ONCE(sqe->rw_flags));
+	if (unlikely(ret))
+		goto out_fput;
+	if (force_nonblock) {
+		kiocb->ki_flags |= IOCB_NOWAIT;
+		req->flags |= REQ_F_FORCE_NONBLOCK;
+	}
+	if (kiocb->ki_flags & IOCB_HIPRI) {
+		ret = -EINVAL;
+		goto out_fput;
+	}
+
+	kiocb->ki_complete = io_complete_rw;
+	return 0;
+out_fput:
+	fput(kiocb->ki_filp);
+	return ret;
+}
+
+static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret)
+{
+	switch (ret) {
+	case -EIOCBQUEUED:
+		break;
+	case -ERESTARTSYS:
+	case -ERESTARTNOINTR:
+	case -ERESTARTNOHAND:
+	case -ERESTART_RESTARTBLOCK:
+		/*
+		 * We can't just restart the syscall, since previously
+		 * submitted sqes may already be in progress. Just fail this
+		 * IO with EINTR.
+		 */
+		ret = -EINTR;
+		/* fall through */
+	default:
+		kiocb->ki_complete(kiocb, ret, 0);
+	}
+}
+
+static int io_import_iovec(struct io_ring_ctx *ctx, int rw,
+			   const struct sqe_submit *s, struct iovec **iovec,
+			   struct iov_iter *iter)
+{
+	const struct io_uring_sqe *sqe = s->sqe;
+	void __user *buf = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	size_t sqe_len = READ_ONCE(sqe->len);
+
+	if (!s->has_user)
+		return -EFAULT;
+
+#ifdef CONFIG_COMPAT
+	if (ctx->compat)
+		return compat_import_iovec(rw, buf, sqe_len, UIO_FASTIOV,
+						iovec, iter);
+#endif
+
+	return import_iovec(rw, buf, sqe_len, UIO_FASTIOV, iovec, iter);
+}
+
+static ssize_t io_read(struct io_kiocb *req, const struct sqe_submit *s,
+		       bool force_nonblock)
+{
+	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
+	struct kiocb *kiocb = &req->rw;
+	struct iov_iter iter;
+	struct file *file;
+	ssize_t ret;
+
+	ret = io_prep_rw(req, s->sqe, force_nonblock);
+	if (ret)
+		return ret;
+	file = kiocb->ki_filp;
+
+	ret = -EBADF;
+	if (unlikely(!(file->f_mode & FMODE_READ)))
+		goto out_fput;
+	ret = -EINVAL;
+	if (unlikely(!file->f_op->read_iter))
+		goto out_fput;
+
+	ret = io_import_iovec(req->ctx, READ, s, &iovec, &iter);
+	if (ret)
+		goto out_fput;
+
+	ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_iter_count(&iter));
+	if (!ret) {
+		ssize_t ret2;
+
+		/* Catch -EAGAIN return for forced non-blocking submission */
+		ret2 = call_read_iter(file, kiocb, &iter);
+		if (!force_nonblock || ret2 != -EAGAIN)
+			io_rw_done(kiocb, ret2);
+		else
+			ret = -EAGAIN;
+	}
+	kfree(iovec);
+out_fput:
+	/* Hold on to the file for -EAGAIN */
+	if (unlikely(ret && ret != -EAGAIN))
+		fput(file);
+	return ret;
+}
+
+static ssize_t io_write(struct io_kiocb *req, const struct sqe_submit *s,
+			bool force_nonblock)
+{
+	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
+	struct kiocb *kiocb = &req->rw;
+	struct iov_iter iter;
+	struct file *file;
+	ssize_t ret;
+
+	ret = io_prep_rw(req, s->sqe, force_nonblock);
+	if (ret)
+		return ret;
+	/* Hold on to the file for -EAGAIN */
+	if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT))
+		return -EAGAIN;
+
+	ret = -EBADF;
+	file = kiocb->ki_filp;
+	if (unlikely(!(file->f_mode & FMODE_WRITE)))
+		goto out_fput;
+	ret = -EINVAL;
+	if (unlikely(!file->f_op->write_iter))
+		goto out_fput;
+
+	ret = io_import_iovec(req->ctx, WRITE, s, &iovec, &iter);
+	if (ret)
+		goto out_fput;
+
+	ret = rw_verify_area(WRITE, file, &kiocb->ki_pos,
+			     iov_iter_count(&iter));
+	if (!ret) {
+		/*
+		 * Open-code file_start_write here to grab freeze protection,
+		 * which will be released by another thread in
+		 * io_complete_rw(). Fool lockdep by telling it the lock got
+		 * released so that it doesn't complain about the held lock when
+		 * we return to userspace.
+		 */
+		if (S_ISREG(file_inode(file)->i_mode)) {
+			__sb_start_write(file_inode(file)->i_sb,
+						SB_FREEZE_WRITE, true);
+			__sb_writers_release(file_inode(file)->i_sb,
+						SB_FREEZE_WRITE);
+		}
+		kiocb->ki_flags |= IOCB_WRITE;
+		io_rw_done(kiocb, call_write_iter(file, kiocb, &iter));
+	}
+	kfree(iovec);
+out_fput:
+	if (unlikely(ret))
+		fput(file);
+	return ret;
+}
+
+/*
+ * IORING_OP_NOP just posts a completion event, nothing else.
+ */
+static int io_nop(struct io_kiocb *req, u64 user_data)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+
+	io_cqring_add_event(ctx, user_data, 0, 0);
+	io_free_req(req);
+	return 0;
+}
+
+static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
+			   const struct sqe_submit *s, bool force_nonblock)
+{
+	ssize_t ret;
+	int opcode;
+
+	if (unlikely(s->index >= ctx->sq_entries))
+		return -EINVAL;
+	req->user_data = READ_ONCE(s->sqe->user_data);
+
+	opcode = READ_ONCE(s->sqe->opcode);
+	switch (opcode) {
+	case IORING_OP_NOP:
+		ret = io_nop(req, req->user_data);
+		break;
+	case IORING_OP_READV:
+		ret = io_read(req, s, force_nonblock);
+		break;
+	case IORING_OP_WRITEV:
+		ret = io_write(req, s, force_nonblock);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
+static void io_sq_wq_submit_work(struct work_struct *work)
+{
+	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
+	struct sqe_submit *s = &req->submit;
+	u64 user_data = READ_ONCE(s->sqe->user_data);
+	struct io_ring_ctx *ctx = req->ctx;
+	mm_segment_t old_fs = get_fs();
+	int ret;
+
+	/* Ensure we clear previously set forced non-block flag */
+	req->flags &= ~REQ_F_FORCE_NONBLOCK;
+	req->rw.ki_flags &= ~IOCB_NOWAIT;
+
+	if (!mmget_not_zero(ctx->sqo_mm)) {
+		ret = -EFAULT;
+		goto err;
+	}
+
+	use_mm(ctx->sqo_mm);
+	set_fs(USER_DS);
+	s->has_user = true;
+
+	ret = __io_submit_sqe(ctx, req, s, false);
+
+	set_fs(old_fs);
+	unuse_mm(ctx->sqo_mm);
+	mmput(ctx->sqo_mm);
+err:
+	if (ret) {
+		io_cqring_add_event(ctx, user_data, ret, 0);
+		io_free_req(req);
+	}
+}
+
+static int io_submit_sqe(struct io_ring_ctx *ctx, const struct sqe_submit *s)
+{
+	struct io_kiocb *req;
+	ssize_t ret;
+
+	/* enforce forwards compatibility on users */
+	if (unlikely(s->sqe->flags))
+		return -EINVAL;
+
+	req = io_get_req(ctx);
+	if (unlikely(!req))
+		return -EAGAIN;
+
+	req->rw.ki_filp = NULL;
+
+	ret = __io_submit_sqe(ctx, req, s, true);
+	if (ret == -EAGAIN) {
+		memcpy(&req->submit, s, sizeof(*s));
+		INIT_WORK(&req->work, io_sq_wq_submit_work);
+		queue_work(ctx->sqo_wq, &req->work);
+		ret = 0;
+	}
+	if (ret)
+		io_free_req(req);
+
+	return ret;
+}
+
+static void io_commit_sqring(struct io_ring_ctx *ctx)
+{
+	struct io_sq_ring *ring = ctx->sq_ring;
+
+	if (ctx->cached_sq_head != ring->r.head) {
+		WRITE_ONCE(ring->r.head, ctx->cached_sq_head);
+		/* write side barrier of head update, app has read side */
+		smp_wmb();
+	}
+}
+
+/*
+ * Undo last io_get_sqring()
+ */
+static void io_drop_sqring(struct io_ring_ctx *ctx)
+{
+	ctx->cached_sq_head--;
+}
+
+/*
+ * Fetch an sqe, if one is available. Note that s->sqe will point to memory
+ * that is mapped by userspace. This means that care needs to be taken to
+ * ensure that reads are stable, as we cannot rely on userspace always
+ * being a good citizen. If members of the sqe are validated and then later
+ * used, it's important that those reads are done through READ_ONCE() to
+ * prevent a re-load down the line.
+ */
+static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
+{
+	struct io_sq_ring *ring = ctx->sq_ring;
+	unsigned head;
+
+	/*
+	 * The cached sq head (or cq tail) serves two purposes:
+	 *
+	 * 1) allows us to batch the cost of updating the user visible
+	 *    head updates.
+	 * 2) allows the kernel side to track the head on its own, even
+	 *    though the application is the one updating it.
+	 */
+	head = ctx->cached_sq_head;
+	smp_rmb();
+	if (head == READ_ONCE(ring->r.tail))
+		return false;
+
+	head = READ_ONCE(ring->array[head & ctx->sq_mask]);
+	if (head < ctx->sq_entries) {
+		s->index = head;
+		s->sqe = &ctx->sq_sqes[head];
+		ctx->cached_sq_head++;
+		return true;
+	}
+
+	/* drop invalid entries */
+	ctx->cached_sq_head++;
+	ring->dropped++;
+	smp_wmb();
+	return false;
+}
+
+static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
+{
+	int i, ret = 0, submit = 0;
+	struct blk_plug plug;
+
+	if (to_submit > IO_PLUG_THRESHOLD)
+		blk_start_plug(&plug);
+
+	for (i = 0; i < to_submit; i++) {
+		struct sqe_submit s;
+
+		if (!io_get_sqring(ctx, &s))
+			break;
+
+		s.has_user = true;
+		ret = io_submit_sqe(ctx, &s);
+		if (ret) {
+			io_drop_sqring(ctx);
+			break;
+		}
+
+		submit++;
+	}
+	io_commit_sqring(ctx);
+
+	if (to_submit > IO_PLUG_THRESHOLD)
+		blk_finish_plug(&plug);
+
+	return submit ? submit : ret;
+}
+
+/*
+ * Wait until events become available, if we don't already have some. The
+ * application must reap them itself, as they reside on the shared cq ring.
+ */
+static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
+			  const sigset_t __user *sig, size_t sigsz)
+{
+	struct io_cq_ring *ring = ctx->cq_ring;
+	sigset_t ksigmask, sigsaved;
+	DEFINE_WAIT(wait);
+	int ret = 0;
+
+	smp_rmb();
+	if (ring->r.head != ring->r.tail)
+		return 0;
+	if (!min_events)
+		return 0;
+
+	if (sig) {
+		ret = set_user_sigmask(sig, &ksigmask, &sigsaved, sigsz);
+		if (ret)
+			return ret;
+	}
+
+	do {
+		prepare_to_wait(&ctx->wait, &wait, TASK_INTERRUPTIBLE);
+
+		ret = 0;
+		smp_rmb();
+		if (ring->r.head != ring->r.tail)
+			break;
+
+		schedule();
+
+		ret = -EINTR;
+		if (signal_pending(current))
+			break;
+	} while (1);
+
+	finish_wait(&ctx->wait, &wait);
+
+	if (sig)
+		restore_user_sigmask(sig, &sigsaved);
+
+	return ring->r.head == ring->r.tail ? ret : 0;
+}
+
+static int io_sq_offload_start(struct io_ring_ctx *ctx)
+{
+	int ret;
+
+	mmgrab(current->mm);
+	ctx->sqo_mm = current->mm;
+
+	/* Do QD, or 2 * CPUS, whatever is smallest */
+	ctx->sqo_wq = alloc_workqueue("io_ring-wq", WQ_UNBOUND | WQ_FREEZABLE,
+			min(ctx->sq_entries - 1, 2 * num_online_cpus()));
+	if (!ctx->sqo_wq) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	return 0;
+err:
+	mmdrop(ctx->sqo_mm);
+	ctx->sqo_mm = NULL;
+	return ret;
+}
+
+static void io_unaccount_mem(struct user_struct *user, unsigned long nr_pages)
+{
+	if (capable(CAP_IPC_LOCK))
+		return;
+	atomic_long_sub(nr_pages, &user->locked_vm);
+}
+
+static int io_account_mem(struct user_struct *user, unsigned long nr_pages)
+{
+	unsigned long page_limit, cur_pages, new_pages;
+
+	if (capable(CAP_IPC_LOCK))
+		return 0;
+
+	/* Don't allow more pages than we can safely lock */
+	page_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+
+	do {
+		cur_pages = atomic_long_read(&user->locked_vm);
+		new_pages = cur_pages + nr_pages;
+		if (new_pages > page_limit)
+			return -ENOMEM;
+	} while (atomic_long_cmpxchg(&user->locked_vm, cur_pages,
+					new_pages) != cur_pages);
+
+	return 0;
+}
+
+static void io_mem_free(void *ptr)
+{
+	struct page *page = virt_to_head_page(ptr);
+
+	if (put_page_testzero(page))
+		free_compound_page(page);
+}
+
+static void *io_mem_alloc(size_t size)
+{
+	gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP |
+				__GFP_NORETRY;
+
+	return (void *) __get_free_pages(gfp_flags, get_order(size));
+}
+
+static unsigned long ring_pages(unsigned sq_entries, unsigned cq_entries)
+{
+	struct io_sq_ring *sq_ring;
+	struct io_cq_ring *cq_ring;
+	size_t bytes;
+
+	bytes = struct_size(sq_ring, array, sq_entries);
+	bytes += array_size(sizeof(struct io_uring_sqe), sq_entries);
+	bytes += struct_size(cq_ring, cqes, cq_entries);
+
+	return (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
+}
+
+static void io_ring_ctx_free(struct io_ring_ctx *ctx)
+{
+	if (ctx->sqo_wq)
+		destroy_workqueue(ctx->sqo_wq);
+	if (ctx->sqo_mm)
+		mmdrop(ctx->sqo_mm);
+#if defined(CONFIG_NET)
+	if (ctx->ring_sock)
+		sock_release(ctx->ring_sock);
+#endif
+
+	io_mem_free(ctx->sq_ring);
+	io_mem_free(ctx->sq_sqes);
+	io_mem_free(ctx->cq_ring);
+
+	percpu_ref_exit(&ctx->refs);
+	io_unaccount_mem(ctx->user,
+			ring_pages(ctx->sq_entries, ctx->cq_entries));
+	free_uid(ctx->user);
+	kfree(ctx);
+}
+
+static __poll_t io_uring_poll(struct file *file, poll_table *wait)
+{
+	struct io_ring_ctx *ctx = file->private_data;
+	__poll_t mask = 0;
+
+	poll_wait(file, &ctx->cq_wait, wait);
+	smp_rmb();
+	if (ctx->sq_ring->r.tail + 1 != ctx->cached_sq_head)
+		mask |= EPOLLOUT | EPOLLWRNORM;
+	if (ctx->cq_ring->r.head != ctx->cached_cq_tail)
+		mask |= EPOLLIN | EPOLLRDNORM;
+
+	return mask;
+}
+
+static int io_uring_fasync(int fd, struct file *file, int on)
+{
+	struct io_ring_ctx *ctx = file->private_data;
+
+	return fasync_helper(fd, file, on, &ctx->cq_fasync);
+}
+
+static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
+{
+	mutex_lock(&ctx->uring_lock);
+	percpu_ref_kill(&ctx->refs);
+	mutex_unlock(&ctx->uring_lock);
+
+	wait_for_completion(&ctx->ctx_done);
+	io_ring_ctx_free(ctx);
+}
+
+static int io_uring_release(struct inode *inode, struct file *file)
+{
+	struct io_ring_ctx *ctx = file->private_data;
+
+	file->private_data = NULL;
+	io_ring_ctx_wait_and_kill(ctx);
+	return 0;
+}
+
+static int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	loff_t offset = (loff_t) vma->vm_pgoff << PAGE_SHIFT;
+	unsigned long sz = vma->vm_end - vma->vm_start;
+	struct io_ring_ctx *ctx = file->private_data;
+	unsigned long pfn;
+	struct page *page;
+	void *ptr;
+
+	switch (offset) {
+	case IORING_OFF_SQ_RING:
+		ptr = ctx->sq_ring;
+		break;
+	case IORING_OFF_SQES:
+		ptr = ctx->sq_sqes;
+		break;
+	case IORING_OFF_CQ_RING:
+		ptr = ctx->cq_ring;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	page = virt_to_head_page(ptr);
+	if (sz > (PAGE_SIZE << compound_order(page)))
+		return -EINVAL;
+
+	pfn = virt_to_phys(ptr) >> PAGE_SHIFT;
+	return remap_pfn_range(vma, vma->vm_start, pfn, sz, vma->vm_page_prot);
+}
+
+SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
+		u32, min_complete, u32, flags, const sigset_t __user *, sig,
+		size_t, sigsz)
+{
+	struct io_ring_ctx *ctx;
+	long ret = -EBADF;
+	int submitted = 0;
+	struct fd f;
+
+	if (flags & ~IORING_ENTER_GETEVENTS)
+		return -EINVAL;
+
+	f = fdget(fd);
+	if (!f.file)
+		return -EBADF;
+
+	ret = -EOPNOTSUPP;
+	if (f.file->f_op != &io_uring_fops)
+		goto out_fput;
+
+	ret = -ENXIO;
+	ctx = f.file->private_data;
+	if (!percpu_ref_tryget(&ctx->refs))
+		goto out_fput;
+
+	if (to_submit) {
+		to_submit = min(to_submit, ctx->sq_entries);
+
+		mutex_lock(&ctx->uring_lock);
+		submitted = io_ring_submit(ctx, to_submit);
+		mutex_unlock(&ctx->uring_lock);
+
+		if (submitted < 0)
+			goto out_ctx;
+	}
+	if (flags & IORING_ENTER_GETEVENTS) {
+		/*
+		 * The application could have included the 'to_submit' count
+		 * in how many events it wanted to wait for. If we failed to
+		 * submit the desired count, we may need to adjust the number
+		 * of events to poll/wait for.
+		 */
+		if (submitted < to_submit)
+			min_complete = min_t(unsigned, submitted, min_complete);
+
+		ret = io_cqring_wait(ctx, min_complete, sig, sigsz);
+	}
+
+out_ctx:
+	io_ring_drop_ctx_refs(ctx, 1);
+out_fput:
+	fdput(f);
+	return submitted ? submitted : ret;
+}
+
+static const struct file_operations io_uring_fops = {
+	.release	= io_uring_release,
+	.mmap		= io_uring_mmap,
+	.poll		= io_uring_poll,
+	.fasync		= io_uring_fasync,
+};
+
+static int io_allocate_scq_urings(struct io_ring_ctx *ctx,
+				  struct io_uring_params *p)
+{
+	struct io_sq_ring *sq_ring;
+	struct io_cq_ring *cq_ring;
+	size_t size;
+
+	sq_ring = io_mem_alloc(struct_size(sq_ring, array, p->sq_entries));
+	if (!sq_ring)
+		return -ENOMEM;
+
+	ctx->sq_ring = sq_ring;
+	sq_ring->ring_mask = p->sq_entries - 1;
+	sq_ring->ring_entries = p->sq_entries;
+	ctx->sq_mask = sq_ring->ring_mask;
+	ctx->sq_entries = sq_ring->ring_entries;
+
+	size = array_size(sizeof(struct io_uring_sqe), p->sq_entries);
+	if (size == SIZE_MAX)
+		return -EOVERFLOW;
+
+	ctx->sq_sqes = io_mem_alloc(size);
+	if (!ctx->sq_sqes) {
+		io_mem_free(ctx->sq_ring);
+		return -ENOMEM;
+	}
+
+	cq_ring = io_mem_alloc(struct_size(cq_ring, cqes, p->cq_entries));
+	if (!cq_ring) {
+		io_mem_free(ctx->sq_ring);
+		io_mem_free(ctx->sq_sqes);
+		return -ENOMEM;
+	}
+
+	ctx->cq_ring = cq_ring;
+	cq_ring->ring_mask = p->cq_entries - 1;
+	cq_ring->ring_entries = p->cq_entries;
+	ctx->cq_mask = cq_ring->ring_mask;
+	ctx->cq_entries = cq_ring->ring_entries;
+	return 0;
+}
+
+static int io_uring_get_fd(struct io_ring_ctx *ctx)
+{
+	struct file *file;
+	int ret;
+
+#if defined(CONFIG_NET)
+	ret = sock_create_kern(&init_net, PF_UNIX, SOCK_RAW, IPPROTO_IP,
+				&ctx->ring_sock);
+	if (ret)
+		return ret;
+#endif
+
+	ret = get_unused_fd_flags(O_RDWR | O_CLOEXEC);
+	if (ret < 0)
+		goto err;
+
+	file = anon_inode_getfile("[io_uring]", &io_uring_fops, ctx,
+					O_RDWR | O_CLOEXEC);
+	if (IS_ERR(file)) {
+		put_unused_fd(ret);
+		ret = PTR_ERR(file);
+		goto err;
+	}
+
+#if defined(CONFIG_NET)
+	ctx->ring_sock->file = file;
+#endif
+	fd_install(ret, file);
+	return ret;
+err:
+#if defined(CONFIG_NET)
+	sock_release(ctx->ring_sock);
+	ctx->ring_sock = NULL;
+#endif
+	return ret;
+}
+
+static int io_uring_create(unsigned entries, struct io_uring_params *p)
+{
+	struct user_struct *user = NULL;
+	struct io_ring_ctx *ctx;
+	int ret;
+
+	if (!entries || entries > IORING_MAX_ENTRIES)
+		return -EINVAL;
+
+	/*
+	 * Use twice as many entries for the CQ ring. It's possible for the
+	 * application to drive a higher depth than the size of the SQ ring,
+	 * since the sqes are only used at submission time. This allows for
+	 * some flexibility in overcommitting a bit.
+ */ + p->sq_entries = roundup_pow_of_two(entries); + p->cq_entries = 2 * p->sq_entries; + + user = get_uid(current_user()); + ret = io_account_mem(user, ring_pages(p->sq_entries, p->cq_entries)); + if (ret) { + free_uid(user); + return ret; + } + + ctx = io_ring_ctx_alloc(p); + if (!ctx) { + io_unaccount_mem(user, ring_pages(p->sq_entries, + p->cq_entries)); + free_uid(user); + return -ENOMEM; + } + ctx->compat = in_compat_syscall(); + ctx->user = user; + + ret = io_allocate_scq_urings(ctx, p); + if (ret) + goto err; + + ret = io_sq_offload_start(ctx); + if (ret) + goto err; + + ret = io_uring_get_fd(ctx); + if (ret < 0) + goto err; + + memset(&p->sq_off, 0, sizeof(p->sq_off)); + p->sq_off.head = offsetof(struct io_sq_ring, r.head); + p->sq_off.tail = offsetof(struct io_sq_ring, r.tail); + p->sq_off.ring_mask = offsetof(struct io_sq_ring, ring_mask); + p->sq_off.ring_entries = offsetof(struct io_sq_ring, ring_entries); + p->sq_off.flags = offsetof(struct io_sq_ring, flags); + p->sq_off.dropped = offsetof(struct io_sq_ring, dropped); + p->sq_off.array = offsetof(struct io_sq_ring, array); + + memset(&p->cq_off, 0, sizeof(p->cq_off)); + p->cq_off.head = offsetof(struct io_cq_ring, r.head); + p->cq_off.tail = offsetof(struct io_cq_ring, r.tail); + p->cq_off.ring_mask = offsetof(struct io_cq_ring, ring_mask); + p->cq_off.ring_entries = offsetof(struct io_cq_ring, ring_entries); + p->cq_off.overflow = offsetof(struct io_cq_ring, overflow); + p->cq_off.cqes = offsetof(struct io_cq_ring, cqes); + return ret; +err: + io_ring_ctx_wait_and_kill(ctx); + return ret; +} + +/* + * Sets up an aio uring context, and returns the fd. Applications ask for a + * ring size, we return the actual sq/cq ring sizes (among other things) in the + * params structure passed in.
+ */ +static long io_uring_setup(u32 entries, struct io_uring_params __user *params) +{ + struct io_uring_params p; + long ret; + int i; + + if (copy_from_user(&p, params, sizeof(p))) + return -EFAULT; + for (i = 0; i < ARRAY_SIZE(p.resv); i++) { + if (p.resv[i]) + return -EINVAL; + } + + if (p.flags) + return -EINVAL; + + ret = io_uring_create(entries, &p); + if (ret < 0) + return ret; + + if (copy_to_user(params, &p, sizeof(p))) + return -EFAULT; + + return ret; +} + +SYSCALL_DEFINE2(io_uring_setup, u32, entries, + struct io_uring_params __user *, params) +{ + return io_uring_setup(entries, params); +} + +static int __init io_uring_init(void) +{ + req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC); + return 0; +}; +__initcall(io_uring_init); diff --git a/include/linux/fs.h b/include/linux/fs.h index dedcc2e9265c..61aa210f0c2b 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -3517,4 +3517,13 @@ extern void inode_nohighmem(struct inode *inode); extern int vfs_fadvise(struct file *file, loff_t offset, loff_t len, int advice); +#if defined(CONFIG_IO_URING) +extern struct sock *io_uring_get_socket(struct file *file); +#else +static inline struct sock *io_uring_get_socket(struct file *file) +{ + return NULL; +} +#endif + #endif /* _LINUX_FS_H */ diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 257cccba3062..3072dbaa7869 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -69,6 +69,7 @@ struct file_handle; struct sigaltstack; struct rseq; union bpf_attr; +struct io_uring_params; #include #include @@ -309,6 +310,11 @@ asmlinkage long sys_io_pgetevents_time32(aio_context_t ctx_id, struct io_event __user *events, struct old_timespec32 __user *timeout, const struct __aio_sigset *sig); +asmlinkage long sys_io_uring_setup(u32 entries, + struct io_uring_params __user *p); +asmlinkage long sys_io_uring_enter(unsigned int fd, u32 to_submit, + u32 min_complete, u32 flags, + const sigset_t __user *sig, size_t sigsz); /* fs/xattr.c */ asmlinkage long sys_setxattr(const char __user *path, const char __user *name, diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index d90127298f12..87871e7b7ea7 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -740,9 +740,13 @@ __SC_COMP(__NR_io_pgetevents, sys_io_pgetevents, compat_sys_io_pgetevents) __SYSCALL(__NR_rseq, sys_rseq) #define __NR_kexec_file_load 294 __SYSCALL(__NR_kexec_file_load, sys_kexec_file_load) +#define __NR_io_uring_setup 425 +__SYSCALL(__NR_io_uring_setup, sys_io_uring_setup) +#define __NR_io_uring_enter 426 +__SYSCALL(__NR_io_uring_enter, sys_io_uring_enter) #undef __NR_syscalls -#define __NR_syscalls 295 +#define __NR_syscalls 427 /* * 32 bit systems traditionally used different diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h new file mode 100644 index 000000000000..ac692823d6f4 --- /dev/null +++ b/include/uapi/linux/io_uring.h @@ -0,0 +1,95 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +/* + * Header file for the io_uring interface. 
+ * + * Copyright (C) 2019 Jens Axboe + * Copyright (C) 2019 Christoph Hellwig + */ +#ifndef LINUX_IO_URING_H +#define LINUX_IO_URING_H + +#include <linux/fs.h> +#include <linux/types.h> + +/* + * IO submission data structure (Submission Queue Entry) + */ +struct io_uring_sqe { + __u8 opcode; /* type of operation for this sqe */ + __u8 flags; /* as of now unused */ + __u16 ioprio; /* ioprio for the request */ + __s32 fd; /* file descriptor to do IO on */ + __u64 off; /* offset into file */ + __u64 addr; /* pointer to buffer or iovecs */ + __u32 len; /* buffer size or number of iovecs */ + union { + __kernel_rwf_t rw_flags; + __u32 __resv; + }; + __u64 user_data; /* data to be passed back at completion time */ + __u64 __pad2[3]; +}; + +#define IORING_OP_NOP 0 +#define IORING_OP_READV 1 +#define IORING_OP_WRITEV 2 + +/* + * IO completion data structure (Completion Queue Entry) + */ +struct io_uring_cqe { + __u64 user_data; /* sqe->data submission passed back */ + __s32 res; /* result code for this event */ + __u32 flags; +}; + +/* + * Magic offsets for the application to mmap the data it needs + */ +#define IORING_OFF_SQ_RING 0ULL +#define IORING_OFF_CQ_RING 0x8000000ULL +#define IORING_OFF_SQES 0x10000000ULL + +/* + * Filled with the offset for mmap(2) + */ +struct io_sqring_offsets { + __u32 head; + __u32 tail; + __u32 ring_mask; + __u32 ring_entries; + __u32 flags; + __u32 dropped; + __u32 array; + __u32 resv1; + __u64 resv2; +}; + +struct io_cqring_offsets { + __u32 head; + __u32 tail; + __u32 ring_mask; + __u32 ring_entries; + __u32 overflow; + __u32 cqes; + __u64 resv[2]; +}; + +/* + * io_uring_enter(2) flags + */ +#define IORING_ENTER_GETEVENTS (1U << 0) + +/* + * Passed in for io_uring_setup(2). Copied back with updated info on success + */ +struct io_uring_params { + __u32 sq_entries; + __u32 cq_entries; + __u32 flags; + __u32 resv[7]; + struct io_sqring_offsets sq_off; + struct io_cqring_offsets cq_off; +}; + +#endif diff --git a/init/Kconfig b/init/Kconfig index c9386a365eea..8b5e0da04384 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1414,6 +1414,15 @@ config AIO by some high performance threaded applications. Disabling this option saves about 7k. +config IO_URING + bool "Enable IO uring support" if EXPERT + select ANON_INODES + default y + help + This option enables support for the io_uring interface, enabling + applications to submit and complete IO through submission and + completion rings that are shared between the kernel and application. + config ADVISE_SYSCALLS bool "Enable madvise/fadvise syscalls" if EXPERT default y diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index ab9d0e3c6d50..ee5e523564bb 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -46,6 +46,8 @@ COND_SYSCALL(io_getevents); COND_SYSCALL(io_pgetevents); COND_SYSCALL_COMPAT(io_getevents); COND_SYSCALL_COMPAT(io_pgetevents); +COND_SYSCALL(io_uring_setup); +COND_SYSCALL(io_uring_enter); /* fs/xattr.c */ diff --git a/net/unix/garbage.c b/net/unix/garbage.c index c36757e72844..f81854d74c7d 100644 --- a/net/unix/garbage.c +++ b/net/unix/garbage.c @@ -108,6 +108,9 @@ struct sock *unix_get_socket(struct file *filp) /* PF_UNIX ?
*/ if (s && sock->ops && sock->ops->family == PF_UNIX) u_sock = s; + } else { + /* Could be an io_uring instance */ + u_sock = io_uring_get_socket(filp); } return u_sock; }
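To make the flow concrete, here is a minimal userspace sketch of driving the interface defined above; it is illustrative only, not part of the patch. It uses the raw syscall numbers wired up in the unistd.h hunk earlier (425/426 on asm-generic), assumes the new uapi header is available for the structure and offset definitions, and elides error handling and the full memory-barrier discipline a real consumer needs.

	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <linux/io_uring.h>

	int main(void)
	{
		struct io_uring_params p;
		struct io_uring_sqe *sqes;
		struct io_uring_cqe *cqes;
		unsigned char *sq, *cq;
		unsigned *sq_tail, *sq_mask, *sq_array, *cq_head, *cq_mask;
		int ring_fd;

		memset(&p, 0, sizeof(p));
		ring_fd = syscall(425 /* __NR_io_uring_setup */, 4, &p);

		/* Three regions, each behind its own magic mmap offset on the ring fd */
		sq = mmap(NULL, p.sq_off.array + p.sq_entries * sizeof(unsigned),
			  PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
			  ring_fd, IORING_OFF_SQ_RING);
		sqes = mmap(NULL, p.sq_entries * sizeof(struct io_uring_sqe),
			    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
			    ring_fd, IORING_OFF_SQES);
		cq = mmap(NULL, p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe),
			  PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
			  ring_fd, IORING_OFF_CQ_RING);

		sq_tail = (unsigned *)(sq + p.sq_off.tail);
		sq_mask = (unsigned *)(sq + p.sq_off.ring_mask);
		sq_array = (unsigned *)(sq + p.sq_off.array);
		cq_head = (unsigned *)(cq + p.cq_off.head);
		cq_mask = (unsigned *)(cq + p.cq_off.ring_mask);
		cqes = (struct io_uring_cqe *)(cq + p.cq_off.cqes);

		/* Fill one sqe, publish its index in the array, then bump the tail */
		memset(&sqes[0], 0, sizeof(sqes[0]));
		sqes[0].opcode = IORING_OP_NOP;
		sqes[0].user_data = 0xcafe;
		sq_array[*sq_tail & *sq_mask] = 0;
		__atomic_store_n(sq_tail, *sq_tail + 1, __ATOMIC_RELEASE);

		/* Submit one sqe and wait for its completion in the same call */
		syscall(426 /* __NR_io_uring_enter */, ring_fd, 1, 1,
			IORING_ENTER_GETEVENTS, NULL, 0);

		printf("res=%d user_data=%llu\n",
		       cqes[*cq_head & *cq_mask].res,
		       (unsigned long long)cqes[*cq_head & *cq_mask].user_data);
		return 0;
	}

Note that min_complete is only consulted because IORING_ENTER_GETEVENTS is set; without the flag, io_uring_enter(2) returns as soon as the sqes have been submitted.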
From patchwork Thu Feb 7 19:55:40 2019
From: Jens Axboe
To: linux-aio@kvack.org, linux-block@vger.kernel.org, linux-api@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, jannh@google.com, viro@ZenIV.linux.org.uk, Jens Axboe
Subject: [PATCH 06/18] io_uring: add fsync support
Date: Thu, 7 Feb 2019 12:55:40 -0700
Message-Id: <20190207195552.22770-7-axboe@kernel.dk>
In-Reply-To: <20190207195552.22770-1-axboe@kernel.dk>
References: <20190207195552.22770-1-axboe@kernel.dk>

From: Christoph Hellwig

Add a new fsync opcode, which either syncs a range if one is passed, or the whole file if the offset and length fields are both cleared to zero. A flag is provided to use fdatasync semantics, that is, only force out the metadata that is required to retrieve the file data, but not other metadata.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
---
fs/io_uring.c | 40 +++++++++++++++++++++++++++++++++++ include/uapi/linux/io_uring.h | 8 ++++++- 2 files changed, 47 insertions(+), 1 deletion(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 39a6cd14f0ec..e92ae7abffce 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4,6 +4,7 @@ * supporting fast/efficient IO. * * Copyright (C) 2018-2019 Jens Axboe + * Copyright (c) 2018-2019 Christoph Hellwig */ #include #include @@ -517,6 +518,42 @@ static int io_nop(struct io_kiocb *req, u64 user_data) return 0; } +static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe, + bool force_nonblock) +{ + struct io_ring_ctx *ctx = req->ctx; + loff_t sqe_off = READ_ONCE(sqe->off); + loff_t sqe_len = READ_ONCE(sqe->len); + loff_t end = sqe_off + sqe_len; + unsigned fsync_flags; + struct file *file; + int ret, fd; + + /* fsync always requires a blocking context */ + if (force_nonblock) + return -EAGAIN; + + if (unlikely(sqe->addr || sqe->ioprio)) + return -EINVAL; + + fsync_flags = READ_ONCE(sqe->fsync_flags); + if (unlikely(fsync_flags & ~IORING_FSYNC_DATASYNC)) + return -EINVAL; + + fd = READ_ONCE(sqe->fd); + file = fget(fd); + if (unlikely(!file)) + return -EBADF; + + ret = vfs_fsync_range(file, sqe_off, end > 0 ?
end : LLONG_MAX, + fsync_flags & IORING_FSYNC_DATASYNC); + + fput(file); + io_cqring_add_event(ctx, sqe->user_data, ret, 0); + io_free_req(req); + return 0; +} + static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, const struct sqe_submit *s, bool force_nonblock) { @@ -538,6 +575,9 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, case IORING_OP_WRITEV: ret = io_write(req, s, force_nonblock); break; + case IORING_OP_FSYNC: + ret = io_fsync(req, s->sqe, force_nonblock); + break; default: ret = -EINVAL; break; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index ac692823d6f4..4589d56d0b68 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -24,7 +24,7 @@ struct io_uring_sqe { __u32 len; /* buffer size or number of iovecs */ union { __kernel_rwf_t rw_flags; - __u32 __resv; + __u32 fsync_flags; }; __u64 user_data; /* data to be passed back at completion time */ __u64 __pad2[3]; @@ -33,6 +33,12 @@ struct io_uring_sqe { #define IORING_OP_NOP 0 #define IORING_OP_READV 1 #define IORING_OP_WRITEV 2 +#define IORING_OP_FSYNC 3 + +/* + * sqe->fsync_flags + */ +#define IORING_FSYNC_DATASYNC (1U << 0) /* * IO completion data structure (Completion Queue Entry)
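For illustration, preparing an fsync sqe under this scheme only requires filling in the new fields. A small sketch (the helper name is hypothetical; the field names are from the uapi header above):

	/* Queue an fdatasync of [off, off+len); off == len == 0 syncs the whole file */
	static void prep_fsync_sqe(struct io_uring_sqe *sqe, int fd,
				   __u64 off, __u32 len, __u64 tag)
	{
		memset(sqe, 0, sizeof(*sqe));	/* addr, ioprio, flags must be zero */
		sqe->opcode = IORING_OP_FSYNC;
		sqe->fd = fd;
		sqe->off = off;
		sqe->len = len;
		sqe->fsync_flags = IORING_FSYNC_DATASYNC;
		sqe->user_data = tag;
	}

Drop IORING_FSYNC_DATASYNC from fsync_flags to get full fsync semantics instead.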
From patchwork Thu Feb 7 19:55:41 2019
From: Jens Axboe
To: linux-aio@kvack.org, linux-block@vger.kernel.org, linux-api@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, jannh@google.com, viro@ZenIV.linux.org.uk, Jens Axboe
Subject: [PATCH 07/18] io_uring: support for IO polling
Date: Thu, 7 Feb 2019 12:55:41 -0700
Message-Id: <20190207195552.22770-8-axboe@kernel.dk>
In-Reply-To: <20190207195552.22770-1-axboe@kernel.dk>
References: <20190207195552.22770-1-axboe@kernel.dk>

Add support for a polled io_uring context. When a read or write is submitted to a polled context, the application must poll for completions on the CQ ring through io_uring_enter(2). Polled IO may not generate IRQ completions, so completions must be actively found by the application itself.

To use polling, io_uring_setup() must be used with the IORING_SETUP_IOPOLL flag set. It is illegal to mix and match polled and non-polled IO on an io_uring.

Signed-off-by: Jens Axboe
---
fs/io_uring.c | 274 ++++++++++++++++++++++++++++++++-- include/uapi/linux/io_uring.h | 5 + 2 files changed, 270 insertions(+), 9 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index e92ae7abffce..5e83391dc538 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -103,6 +103,14 @@ struct io_ring_ctx { struct { spinlock_t completion_lock; + bool poll_multi_file; + /* + * ->poll_list is protected by the ctx->uring_lock for + * io_uring instances that don't use IORING_SETUP_SQPOLL. + * For SQPOLL, only the single threaded io_sq_thread() will + * manipulate the list, hence no extra locking is needed there.
+ */ + struct list_head poll_list; } ____cacheline_aligned_in_smp; #if defined(CONFIG_NET) @@ -114,6 +122,7 @@ struct sqe_submit { const struct io_uring_sqe *sqe; unsigned short index; bool has_user; + bool needs_lock; }; struct io_kiocb { @@ -125,12 +134,15 @@ struct io_kiocb { struct list_head list; unsigned int flags; #define REQ_F_FORCE_NONBLOCK 1 /* inline submission attempt */ +#define REQ_F_IOPOLL_COMPLETED 2 /* polled IO has completed */ u64 user_data; + u64 error; struct work_struct work; }; #define IO_PLUG_THRESHOLD 2 +#define IO_IOPOLL_BATCH 8 static struct kmem_cache *req_cachep; @@ -174,6 +186,7 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) mutex_init(&ctx->uring_lock); init_waitqueue_head(&ctx->wait); spin_lock_init(&ctx->completion_lock); + INIT_LIST_HEAD(&ctx->poll_list); return ctx; } @@ -268,12 +281,153 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx) return NULL; } +static void io_free_req_many(struct io_ring_ctx *ctx, void **reqs, int *nr) +{ + if (*nr) { + kmem_cache_free_bulk(req_cachep, *nr, reqs); + io_ring_drop_ctx_refs(ctx, *nr); + *nr = 0; + } +} + static void io_free_req(struct io_kiocb *req) { io_ring_drop_ctx_refs(req->ctx, 1); kmem_cache_free(req_cachep, req); } +/* + * Find and free completed poll iocbs + */ +static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events, + struct list_head *done) +{ + void *reqs[IO_IOPOLL_BATCH]; + struct io_kiocb *req; + int to_free = 0; + + while (!list_empty(done)) { + req = list_first_entry(done, struct io_kiocb, list); + list_del(&req->list); + + io_cqring_fill_event(ctx, req->user_data, req->error, 0); + + reqs[to_free++] = req; + (*nr_events)++; + + fput(req->rw.ki_filp); + if (to_free == ARRAY_SIZE(reqs)) + io_free_req_many(ctx, reqs, &to_free); + } + io_commit_cqring(ctx); + + io_free_req_many(ctx, reqs, &to_free); +} + +static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events, + long min) +{ + struct io_kiocb *req, *tmp; + LIST_HEAD(done); + bool spin; + int ret; + + /* + * Only spin for completions if we don't have multiple devices hanging + * off our complete list, and we're under the requested amount. + */ + spin = !ctx->poll_multi_file && *nr_events < min; + + ret = 0; + list_for_each_entry_safe(req, tmp, &ctx->poll_list, list) { + struct kiocb *kiocb = &req->rw; + + /* + * Move completed entries to our local list. If we find a + * request that requires polling, break out and complete + * the done list first, if we have entries there. + */ + if (req->flags & REQ_F_IOPOLL_COMPLETED) { + list_move_tail(&req->list, &done); + continue; + } + if (!list_empty(&done)) + break; + + ret = kiocb->ki_filp->f_op->iopoll(kiocb, spin); + if (ret < 0) + break; + + if (ret && spin) + spin = false; + ret = 0; + } + + if (!list_empty(&done)) + io_iopoll_complete(ctx, nr_events, &done); + + return ret; +} + +/* + * Poll for a minimum of 'min' events. Note that if min == 0 we consider that a + * non-spinning poll check - we'll still enter the driver poll loop, but only + * as a non-spinning completion check. + */ +static int io_iopoll_getevents(struct io_ring_ctx *ctx, unsigned int *nr_events, + long min) +{ + while (!list_empty(&ctx->poll_list)) { + int ret; + + ret = io_do_iopoll(ctx, nr_events, min); + if (ret < 0) + return ret; + if (!min || *nr_events >= min) + return 0; + } + + return 1; +} + +/* + * We can't just wait for polled events to come to us, we have to actively + * find and complete them.
+ */ +static void io_iopoll_reap_events(struct io_ring_ctx *ctx) +{ + if (!(ctx->flags & IORING_SETUP_IOPOLL)) + return; + + mutex_lock(&ctx->uring_lock); + while (!list_empty(&ctx->poll_list)) { + unsigned int nr_events = 0; + + io_iopoll_getevents(ctx, &nr_events, 1); + } + mutex_unlock(&ctx->uring_lock); +} + +static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned *nr_events, + long min) +{ + int ret = 0; + + do { + int tmin = 0; + + if (*nr_events < min) + tmin = min - *nr_events; + + ret = io_iopoll_getevents(ctx, nr_events, tmin); + if (ret <= 0) + break; + ret = 0; + } while (min && !*nr_events && !need_resched()); + + return ret; +} + static void kiocb_end_write(struct kiocb *kiocb) { if (kiocb->ki_flags & IOCB_WRITE) { @@ -300,6 +454,53 @@ static void io_complete_rw(struct kiocb *kiocb, long res, long res2) io_free_req(req); } +static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2) +{ + struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw); + + kiocb_end_write(kiocb); + + req->error = res; + if (res != -EAGAIN) + req->flags |= REQ_F_IOPOLL_COMPLETED; +} + +/* + * After the iocb has been issued, it's safe to be found on the poll list. + * Adding the kiocb to the list AFTER submission ensures that we don't + * find it from a io_iopoll_getevents() thread before the issuer is done + * accessing the kiocb cookie. + */ +static void io_iopoll_req_issued(struct io_kiocb *req) +{ + struct io_ring_ctx *ctx = req->ctx; + + /* + * Track whether we have multiple files in our lists. This will impact + * how we do polling eventually, not spinning if we're on potentially + * different devices. + */ + if (list_empty(&ctx->poll_list)) { + ctx->poll_multi_file = false; + } else if (!ctx->poll_multi_file) { + struct io_kiocb *list_req; + + list_req = list_first_entry(&ctx->poll_list, struct io_kiocb, + list); + if (list_req->rw.ki_filp != req->rw.ki_filp) + ctx->poll_multi_file = true; + } + + /* + * For fast devices, IO may have already completed. If it has, add + * it to the front so we find it first. + */ + if (req->flags & REQ_F_IOPOLL_COMPLETED) + list_add(&req->list, &ctx->poll_list); + else + list_add_tail(&req->list, &ctx->poll_list); +} + /* * If we tracked the file through the SCM inflight mechanism, we could support * any file. 
For now, just ensure that anything potentially problematic is done @@ -320,6 +521,7 @@ static bool io_file_supports_async(struct file *file) static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, bool force_nonblock) { + struct io_ring_ctx *ctx = req->ctx; struct kiocb *kiocb = &req->rw; unsigned ioprio; int fd, ret; @@ -355,12 +557,22 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, kiocb->ki_flags |= IOCB_NOWAIT; req->flags |= REQ_F_FORCE_NONBLOCK; } - if (kiocb->ki_flags & IOCB_HIPRI) { - ret = -EINVAL; - goto out_fput; - } + if (ctx->flags & IORING_SETUP_IOPOLL) { + ret = -EOPNOTSUPP; + if (!(kiocb->ki_flags & IOCB_DIRECT) || + !kiocb->ki_filp->f_op->iopoll) + goto out_fput; - kiocb->ki_complete = io_complete_rw; + req->error = 0; + kiocb->ki_flags |= IOCB_HIPRI; + kiocb->ki_complete = io_complete_rw_iopoll; + } else { + if (kiocb->ki_flags & IOCB_HIPRI) { + ret = -EINVAL; + goto out_fput; + } + kiocb->ki_complete = io_complete_rw; + } return 0; out_fput: fput(kiocb->ki_filp); @@ -513,6 +725,9 @@ static int io_nop(struct io_kiocb *req, u64 user_data) { struct io_ring_ctx *ctx = req->ctx; + if (unlikely(ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL; + io_cqring_add_event(ctx, user_data, 0, 0); io_free_req(req); return 0; @@ -533,6 +748,8 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (force_nonblock) return -EAGAIN; + if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL; if (unlikely(sqe->addr || sqe->ioprio)) return -EINVAL; @@ -583,7 +800,22 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, break; } - return ret; + if (ret) + return ret; + + if (ctx->flags & IORING_SETUP_IOPOLL) { + if (req->error == -EAGAIN) + return -EAGAIN; + + /* workqueue context doesn't hold uring_lock, grab it now */ + if (s->needs_lock) + mutex_lock(&ctx->uring_lock); + io_iopoll_req_issued(req); + if (s->needs_lock) + mutex_unlock(&ctx->uring_lock); + } + + return 0; } static void io_sq_wq_submit_work(struct work_struct *work) @@ -607,8 +839,19 @@ static void io_sq_wq_submit_work(struct work_struct *work) use_mm(ctx->sqo_mm); set_fs(USER_DS); s->has_user = true; + s->needs_lock = true; - ret = __io_submit_sqe(ctx, req, s, false); + do { + ret = __io_submit_sqe(ctx, req, s, false); + /* + * We can get EAGAIN for polled IO even though we're forcing + * a sync submission from here, since we can't wait for + * request slots on the block side. + */ + if (ret != -EAGAIN) + break; + cond_resched(); + } while (1); set_fs(old_fs); unuse_mm(ctx->sqo_mm); @@ -723,6 +966,8 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) break; s.has_user = true; + s.needs_lock = false; + ret = io_submit_sqe(ctx, &s); if (ret) { io_drop_sqring(ctx); @@ -876,6 +1121,8 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx) sock_release(ctx->ring_sock); #endif + io_iopoll_reap_events(ctx); + io_mem_free(ctx->sq_ring); io_mem_free(ctx->sq_sqes); io_mem_free(ctx->cq_ring); @@ -915,6 +1162,7 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) percpu_ref_kill(&ctx->refs); mutex_unlock(&ctx->uring_lock); + io_iopoll_reap_events(ctx); wait_for_completion(&ctx->ctx_done); io_ring_ctx_free(ctx); } @@ -995,6 +1243,8 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit, goto out_ctx; } if (flags & IORING_ENTER_GETEVENTS) { + unsigned nr_events = 0; + /* * The application could have included the 'to_submit' count * in how many events it wanted to wait for. 
If we failed to @@ -1004,7 +1254,13 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit, if (submitted < to_submit) min_complete = min_t(unsigned, submitted, min_complete); - ret = io_cqring_wait(ctx, min_complete, sig, sigsz); + if (ctx->flags & IORING_SETUP_IOPOLL) { + mutex_lock(&ctx->uring_lock); + ret = io_iopoll_check(ctx, &nr_events, min_complete); + mutex_unlock(&ctx->uring_lock); + } else { + ret = io_cqring_wait(ctx, min_complete, sig, sigsz); + } } out_ctx: @@ -1187,7 +1443,7 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params) return -EINVAL; } - if (p.flags) + if (p.flags & ~IORING_SETUP_IOPOLL) return -EINVAL; ret = io_uring_create(entries, &p); diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 4589d56d0b68..5c457ea396e6 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -30,6 +30,11 @@ struct io_uring_sqe { __u64 __pad2[3]; }; +/* + * io_uring_setup() flags + */ +#define IORING_SETUP_IOPOLL (1U << 0) /* io_context is polled */ + #define IORING_OP_NOP 0 #define IORING_OP_READV 1 #define IORING_OP_WRITEV 2
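A sketch of the resulting uapi contract, not part of the patch (raw syscall numbers and includes as in the earlier userspace sketch): the application asks for a polled ring at setup time, and must then reap completions through io_uring_enter(2), since no interrupt will post CQ entries on its behalf.

	/* Illustrative only: set up a polled ring and reap one completion */
	static int setup_polled_ring(unsigned entries)
	{
		struct io_uring_params p;
		int ring_fd;

		memset(&p, 0, sizeof(p));
		p.flags = IORING_SETUP_IOPOLL;
		ring_fd = syscall(425 /* __NR_io_uring_setup */, entries, &p);
		if (ring_fd < 0)
			return -1;

		/* ... mmap the rings, submit O_DIRECT IO to files implementing ->iopoll ... */

		/*
		 * Reap: to_submit == 0, min_complete == 1. The kernel loops in
		 * io_iopoll_check() until at least one completion is found.
		 */
		syscall(426 /* __NR_io_uring_enter */, ring_fd, 0, 1,
			IORING_ENTER_GETEVENTS, NULL, 0);
		return ring_fd;
	}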
From patchwork Thu Feb 7 19:55:42 2019
From: Jens Axboe
To: linux-aio@kvack.org, linux-block@vger.kernel.org, linux-api@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, jannh@google.com, viro@ZenIV.linux.org.uk, Jens Axboe
Subject: [PATCH 08/18] fs: add fget_many() and fput_many()
Date: Thu, 7 Feb 2019 12:55:42 -0700
Message-Id: <20190207195552.22770-9-axboe@kernel.dk>
In-Reply-To: <20190207195552.22770-1-axboe@kernel.dk>
References: <20190207195552.22770-1-axboe@kernel.dk>

Some use cases repeatedly get and put references to the same file, but the only exposed interface is doing these one at a time. As each of these entails an atomic inc or dec on a shared structure, that cost can add up.

Add fget_many(), which works just like fget(), except it takes an argument for how many references to get on the file. Ditto fput_many(), which can drop an arbitrary number of references to a file.
Reviewed-by: Christoph Hellwig Signed-off-by: Jens Axboe --- fs/file.c | 15 ++++++++++----- fs/file_table.c | 9 +++++++-- include/linux/file.h | 2 ++ include/linux/fs.h | 4 +++- 4 files changed, 22 insertions(+), 8 deletions(-) diff --git a/fs/file.c b/fs/file.c index 3209ee271c41..97df385d6ab0 100644 --- a/fs/file.c +++ b/fs/file.c @@ -705,7 +705,7 @@ void do_close_on_exec(struct files_struct *files) spin_unlock(&files->file_lock); } -static struct file *__fget(unsigned int fd, fmode_t mask) +static struct file *__fget(unsigned int fd, fmode_t mask, unsigned int refs) { struct files_struct *files = current->files; struct file *file; @@ -720,7 +720,7 @@ static struct file *__fget(unsigned int fd, fmode_t mask) */ if (file->f_mode & mask) file = NULL; - else if (!get_file_rcu(file)) + else if (!get_file_rcu_many(file, refs)) goto loop; } rcu_read_unlock(); @@ -728,15 +728,20 @@ static struct file *__fget(unsigned int fd, fmode_t mask) return file; } +struct file *fget_many(unsigned int fd, unsigned int refs) +{ + return __fget(fd, FMODE_PATH, refs); +} + struct file *fget(unsigned int fd) { - return __fget(fd, FMODE_PATH); + return __fget(fd, FMODE_PATH, 1); } EXPORT_SYMBOL(fget); struct file *fget_raw(unsigned int fd) { - return __fget(fd, 0); + return __fget(fd, 0, 1); } EXPORT_SYMBOL(fget_raw); @@ -767,7 +772,7 @@ static unsigned long __fget_light(unsigned int fd, fmode_t mask) return 0; return (unsigned long)file; } else { - file = __fget(fd, mask); + file = __fget(fd, mask, 1); if (!file) return 0; return FDPUT_FPUT | (unsigned long)file; diff --git a/fs/file_table.c b/fs/file_table.c index 5679e7fcb6b0..155d7514a094 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -326,9 +326,9 @@ void flush_delayed_fput(void) static DECLARE_DELAYED_WORK(delayed_fput_work, delayed_fput); -void fput(struct file *file) +void fput_many(struct file *file, unsigned int refs) { - if (atomic_long_dec_and_test(&file->f_count)) { + if (atomic_long_sub_and_test(refs, &file->f_count)) { struct task_struct *task = current; if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) { @@ -347,6 +347,11 @@ void fput(struct file *file) } } +void fput(struct file *file) +{ + fput_many(file, 1); +} + /* * synchronous analog of fput(); for kernel threads that might be needed * in some umount() (and thus can't use flush_delayed_fput() without diff --git a/include/linux/file.h b/include/linux/file.h index 6b2fb032416c..3fcddff56bc4 100644 --- a/include/linux/file.h +++ b/include/linux/file.h @@ -13,6 +13,7 @@ struct file; extern void fput(struct file *); +extern void fput_many(struct file *, unsigned int); struct file_operations; struct vfsmount; @@ -44,6 +45,7 @@ static inline void fdput(struct fd fd) } extern struct file *fget(unsigned int fd); +extern struct file *fget_many(unsigned int fd, unsigned int refs); extern struct file *fget_raw(unsigned int fd); extern unsigned long __fdget(unsigned int fd); extern unsigned long __fdget_raw(unsigned int fd); diff --git a/include/linux/fs.h b/include/linux/fs.h index 61aa210f0c2b..80e1b199a4b1 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -952,7 +952,9 @@ static inline struct file *get_file(struct file *f) atomic_long_inc(&f->f_count); return f; } -#define get_file_rcu(x) atomic_long_inc_not_zero(&(x)->f_count) +#define get_file_rcu_many(x, cnt) \ + atomic_long_add_unless(&(x)->f_count, (cnt), 0) +#define get_file_rcu(x) get_file_rcu_many((x), 1) #define fput_atomic(x) atomic_long_add_unless(&(x)->f_count, -1, 1) #define file_count(x) 
atomic_long_read(&(x)->f_count)
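The intended calling pattern, sketched (illustrative only; the function and counters are hypothetical): a submitter that knows it is about to issue several operations against one descriptor can amortize the refcount traffic into a single atomic operation on each side.

	static int issue_batch(unsigned int fd, unsigned int nr_ios)
	{
		struct file *file;
		unsigned int used = 0;

		file = fget_many(fd, nr_ios);	/* one atomic add for the whole batch */
		if (!file)
			return -EBADF;

		/* ... each issued IO consumes one reference; count them in 'used' ... */

		if (used < nr_ios)
			fput_many(file, nr_ios - used);	/* one atomic sub for the rest */
		return 0;
	}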
From patchwork Thu Feb 7 19:55:43 2019
From: Jens Axboe
To: linux-aio@kvack.org, linux-block@vger.kernel.org, linux-api@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, jannh@google.com, viro@ZenIV.linux.org.uk, Jens Axboe
Subject: [PATCH 09/18] io_uring: use fget/fput_many() for file references
Date: Thu, 7 Feb 2019 12:55:43 -0700
Message-Id: <20190207195552.22770-10-axboe@kernel.dk>
In-Reply-To: <20190207195552.22770-1-axboe@kernel.dk>
References: <20190207195552.22770-1-axboe@kernel.dk>

Add a separate io_submit_state structure to cache some of the things we need for IO submission. One such example is file reference batching: we get as many references as the number of sqes we are submitting, and drop unused ones if we end up switching files. The assumption here is that we're usually only dealing with one fd, and if there are multiple, hopefully they are at least somewhat ordered. This could trivially be extended to cover multiple fds, if needed.

On the completion side we do the same thing, except this is trivially done just locally in io_iopoll_reap().

Signed-off-by: Jens Axboe
---
fs/io_uring.c | 142 ++++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 121 insertions(+), 21 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 5e83391dc538..e23e639a4ff0 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -144,6 +144,19 @@ struct io_kiocb { #define IO_PLUG_THRESHOLD 2 #define IO_IOPOLL_BATCH 8 +struct io_submit_state { + struct blk_plug plug; + + /* + * File reference cache + */ + struct file *file; + unsigned int fd; + unsigned int has_refs; + unsigned int used_refs; + unsigned int ios_left; +}; + static struct kmem_cache *req_cachep; static const struct file_operations io_uring_fops; @@ -303,9 +316,11 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events, struct list_head *done) { void *reqs[IO_IOPOLL_BATCH]; + int file_count, to_free; + struct file *file = NULL; struct io_kiocb *req; - int to_free = 0; + file_count = to_free = 0; while (!list_empty(done)) { req = list_first_entry(done, struct io_kiocb, list); list_del(&req->list); @@ -315,12 +330,28 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events, reqs[to_free++] = req; (*nr_events)++; - fput(req->rw.ki_filp); + /* + * Batched puts of the same file, to avoid dirtying the + * file usage count multiple times, if avoidable. + */ + if (!file) { + file = req->rw.ki_filp; + file_count = 1; + } else if (file == req->rw.ki_filp) { + file_count++; + } else { + fput_many(file, file_count); + file = req->rw.ki_filp; + file_count = 1; + } + if (to_free == ARRAY_SIZE(reqs)) io_free_req_many(ctx, reqs, &to_free); } io_commit_cqring(ctx); + if (file) + fput_many(file, file_count); io_free_req_many(ctx, reqs, &to_free); } @@ -501,6 +532,48 @@ static void io_iopoll_req_issued(struct io_kiocb *req) list_add_tail(&req->list, &ctx->poll_list); } +static void io_file_put(struct io_submit_state *state, struct file *file) +{ + if (!state) { + fput(file); + } else if (state->file) { + int diff = state->has_refs - state->used_refs; + + if (diff) + fput_many(state->file, diff); + state->file = NULL; + } +} + +/* + * Get as many references to a file as we have IOs left in this submission, + * assuming most submissions are for one file, or at least that each file + * has more than one submission.
+ */ +static struct file *io_file_get(struct io_submit_state *state, int fd) +{ + if (!state) + return fget(fd); + + if (state->file) { + if (state->fd == fd) { + state->used_refs++; + state->ios_left--; + return state->file; + } + io_file_put(state, NULL); + } + state->file = fget_many(fd, state->ios_left); + if (!state->file) + return NULL; + + state->fd = fd; + state->has_refs = state->ios_left; + state->used_refs = 1; + state->ios_left--; + return state->file; +} + /* * If we tracked the file through the SCM inflight mechanism, we could support * any file. For now, just ensure that anything potentially problematic is done @@ -519,7 +592,7 @@ static bool io_file_supports_async(struct file *file) } static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, - bool force_nonblock) + bool force_nonblock, struct io_submit_state *state) { struct io_ring_ctx *ctx = req->ctx; struct kiocb *kiocb = &req->rw; @@ -531,7 +604,7 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, return 0; fd = READ_ONCE(sqe->fd); - kiocb->ki_filp = fget(fd); + kiocb->ki_filp = io_file_get(state, fd); if (unlikely(!kiocb->ki_filp)) return -EBADF; if (force_nonblock && !io_file_supports_async(kiocb->ki_filp)) @@ -575,7 +648,10 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, } return 0; out_fput: - fput(kiocb->ki_filp); + /* in case of error, we didn't use this file reference. drop it. */ + if (state) + state->used_refs--; + io_file_put(state, kiocb->ki_filp); return ret; } @@ -621,7 +697,7 @@ static int io_import_iovec(struct io_ring_ctx *ctx, int rw, } static ssize_t io_read(struct io_kiocb *req, const struct sqe_submit *s, - bool force_nonblock) + bool force_nonblock, struct io_submit_state *state) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct kiocb *kiocb = &req->rw; @@ -629,7 +705,7 @@ static ssize_t io_read(struct io_kiocb *req, const struct sqe_submit *s, struct file *file; ssize_t ret; - ret = io_prep_rw(req, s->sqe, force_nonblock); + ret = io_prep_rw(req, s->sqe, force_nonblock, state); if (ret) return ret; file = kiocb->ki_filp; @@ -665,7 +741,7 @@ static ssize_t io_read(struct io_kiocb *req, const struct sqe_submit *s, } static ssize_t io_write(struct io_kiocb *req, const struct sqe_submit *s, - bool force_nonblock) + bool force_nonblock, struct io_submit_state *state) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct kiocb *kiocb = &req->rw; @@ -673,7 +749,7 @@ static ssize_t io_write(struct io_kiocb *req, const struct sqe_submit *s, struct file *file; ssize_t ret; - ret = io_prep_rw(req, s->sqe, force_nonblock); + ret = io_prep_rw(req, s->sqe, force_nonblock, state); if (ret) return ret; /* Hold on to the file for -EAGAIN */ @@ -772,7 +848,8 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe, } static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, - const struct sqe_submit *s, bool force_nonblock) + const struct sqe_submit *s, bool force_nonblock, + struct io_submit_state *state) { ssize_t ret; int opcode; @@ -787,10 +864,10 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, ret = io_nop(req, req->user_data); break; case IORING_OP_READV: - ret = io_read(req, s, force_nonblock); + ret = io_read(req, s, force_nonblock, state); break; case IORING_OP_WRITEV: - ret = io_write(req, s, force_nonblock); + ret = io_write(req, s, force_nonblock, state); break; case IORING_OP_FSYNC: ret = io_fsync(req, s->sqe, force_nonblock); 
@@ -842,7 +919,7 @@ static void io_sq_wq_submit_work(struct work_struct *work) s->needs_lock = true; do { - ret = __io_submit_sqe(ctx, req, s, false); + ret = __io_submit_sqe(ctx, req, s, false, NULL); /* * We can get EAGAIN for polled IO even though we're forcing * a sync submission from here, since we can't wait for @@ -863,7 +940,8 @@ static void io_sq_wq_submit_work(struct work_struct *work) } } -static int io_submit_sqe(struct io_ring_ctx *ctx, const struct sqe_submit *s) +static int io_submit_sqe(struct io_ring_ctx *ctx, const struct sqe_submit *s, + struct io_submit_state *state) { struct io_kiocb *req; ssize_t ret; @@ -878,7 +956,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, const struct sqe_submit *s) req->rw.ki_filp = NULL; - ret = __io_submit_sqe(ctx, req, s, true); + ret = __io_submit_sqe(ctx, req, s, true, state); if (ret == -EAGAIN) { memcpy(&req->submit, s, sizeof(*s)); INIT_WORK(&req->work, io_sq_wq_submit_work); @@ -891,6 +969,26 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, const struct sqe_submit *s) return ret; } +/* + * Batched submission is done, ensure local IO is flushed out. + */ +static void io_submit_state_end(struct io_submit_state *state) +{ + blk_finish_plug(&state->plug); + io_file_put(state, NULL); +} + +/* + * Start submission side cache. + */ +static void io_submit_state_start(struct io_submit_state *state, + struct io_ring_ctx *ctx, unsigned max_ios) +{ + blk_start_plug(&state->plug); + state->file = NULL; + state->ios_left = max_ios; +} + static void io_commit_sqring(struct io_ring_ctx *ctx) { struct io_sq_ring *ring = ctx->sq_ring; @@ -953,11 +1051,13 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s) static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) { + struct io_submit_state state, *statep = NULL; int i, ret = 0, submit = 0; - struct blk_plug plug; - if (to_submit > IO_PLUG_THRESHOLD) - blk_start_plug(&plug); + if (to_submit > IO_PLUG_THRESHOLD) { + io_submit_state_start(&state, ctx, to_submit); + statep = &state; + } for (i = 0; i < to_submit; i++) { struct sqe_submit s; @@ -968,7 +1068,7 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) s.has_user = true; s.needs_lock = false; - ret = io_submit_sqe(ctx, &s); + ret = io_submit_sqe(ctx, &s, statep); if (ret) { io_drop_sqring(ctx); break; @@ -978,8 +1078,8 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) } io_commit_sqring(ctx); - if (to_submit > IO_PLUG_THRESHOLD) - blk_finish_plug(&plug); + if (statep) + io_submit_state_end(statep); return submit ? 
submit : ret; }
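Condensed from io_ring_submit() above, the submission path now has this shape (a schematic recap, not new code):

	struct io_submit_state state, *statep = NULL;

	if (to_submit > IO_PLUG_THRESHOLD) {
		io_submit_state_start(&state, ctx, to_submit);	/* plug, prime caches */
		statep = &state;
	}
	for (i = 0; i < to_submit; i++) {
		/* io_file_get() reuses state->file while the fd stays the same */
		if (io_submit_sqe(ctx, &s, statep))
			break;
	}
	if (statep)
		io_submit_state_end(statep);	/* unplug, return unused file refs */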
From patchwork Thu Feb 7 19:55:44 2019
From: Jens Axboe
To: linux-aio@kvack.org, linux-block@vger.kernel.org, linux-api@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, jannh@google.com, viro@ZenIV.linux.org.uk, Jens Axboe
Subject: [PATCH 10/18] io_uring: batch io_kiocb allocation
Date: Thu, 7 Feb 2019 12:55:44 -0700
Message-Id: <20190207195552.22770-11-axboe@kernel.dk>
In-Reply-To: <20190207195552.22770-1-axboe@kernel.dk>
References: <20190207195552.22770-1-axboe@kernel.dk>

Similarly to how we use state->ios_left to know how many references to get to a file, we can use it to allocate the io_kiocb's we need in bulk.

Signed-off-by: Jens Axboe
---
fs/io_uring.c | 45 ++++++++++++++++++++++++++++++++++++++------- 1 file changed, 38 insertions(+), 7 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index e23e639a4ff0..1369cb95e1b5 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -147,6 +147,13 @@ struct io_kiocb { struct io_submit_state { struct blk_plug plug; + /* + * io_kiocb alloc cache + */ + void *reqs[IO_IOPOLL_BATCH]; + unsigned int free_reqs; + unsigned int cur_req; + /* * File reference cache */ @@ -276,20 +283,40 @@ static void io_ring_drop_ctx_refs(struct io_ring_ctx *ctx, unsigned refs) wake_up(&ctx->wait); } -static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx) +static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx, + struct io_submit_state *state) { struct io_kiocb *req; if (!percpu_ref_tryget(&ctx->refs)) return NULL; - req = kmem_cache_alloc(req_cachep, __GFP_NOWARN); - if (req) { - req->ctx = ctx; - req->flags = 0; - return req; + if (!state) { + req = kmem_cache_alloc(req_cachep, __GFP_NOWARN); + if (unlikely(!req)) + goto out; + } else if (!state->free_reqs) { + size_t sz; + int ret; + + sz = min_t(size_t, state->ios_left, ARRAY_SIZE(state->reqs)); + ret = kmem_cache_alloc_bulk(req_cachep, __GFP_NOWARN, sz, + state->reqs); + if (unlikely(ret <= 0)) + goto out; + state->free_reqs = ret - 1; + state->cur_req = 1; + req = state->reqs[0]; + } else { + req = state->reqs[state->cur_req]; + state->free_reqs--; + state->cur_req++; } + req->ctx = ctx; + req->flags = 0; + return req; +out: io_ring_drop_ctx_refs(ctx, 1); return NULL; } @@ -950,7 +977,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, const struct sqe_submit *s, if (unlikely(s->sqe->flags)) return -EINVAL; - req = io_get_req(ctx); + req = io_get_req(ctx, state); if (unlikely(!req)) return -EAGAIN; @@ -976,6 +1003,9 @@ static void io_submit_state_end(struct io_submit_state *state) { blk_finish_plug(&state->plug); io_file_put(state, NULL); + if (state->free_reqs) + kmem_cache_free_bulk(req_cachep, state->free_reqs, + &state->reqs[state->cur_req]); } /* @@ -985,6 +1015,7 @@ static void io_submit_state_start(struct io_submit_state *state, struct io_ring_ctx *ctx, unsigned max_ios) { blk_start_plug(&state->plug); + state->free_reqs = 0; state->file = NULL; state->ios_left = max_ios; }
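For reference, the bulk slab interface used here turns per-object alloc/free locking into one round-trip per batch. A minimal standalone sketch of the pairing (cachep, the batch size, and the bookkeeping are placeholders):

	static int alloc_batch(struct kmem_cache *cachep)
	{
		void *objs[8];
		int got, used = 0;

		/* One slab call for up to 8 objects; it may legitimately return fewer */
		got = kmem_cache_alloc_bulk(cachep, GFP_KERNEL, ARRAY_SIZE(objs), objs);
		if (got <= 0)
			return -ENOMEM;

		/* ... hand out objs[used++] per request, as io_get_req() does ... */

		/* Return the unused tail in one call, as io_submit_state_end() does */
		if (used < got)
			kmem_cache_free_bulk(cachep, got - used, &objs[used]);
		return 0;
	}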
From patchwork Thu Feb 7 19:55:45 2019
From: Jens Axboe
To: linux-aio@kvack.org, linux-block@vger.kernel.org, linux-api@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, jannh@google.com, viro@ZenIV.linux.org.uk, Jens Axboe
Subject: [PATCH 11/18] block: implement bio helper to add iter bvec pages to bio
Date: Thu, 7 Feb 2019 12:55:45 -0700
Message-Id: <20190207195552.22770-12-axboe@kernel.dk>
In-Reply-To: <20190207195552.22770-1-axboe@kernel.dk>
References: <20190207195552.22770-1-axboe@kernel.dk>

For an ITER_BVEC, we can just iterate the iov and add the pages to the bio directly. This requires that the caller doesn't release the pages on IO completion; we add a BIO_NO_PAGE_REF flag for that. The completion-side rule this establishes is sketched below.
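Condensed, the rule for any end_io path is: only drop page references if the bio actually took them, i.e. the pages were pinned user pages rather than an ITER_BVEC of kernel pages. A sketch of what a caller's completion handler looks like under this scheme; my_bio_end_io is an illustrative name, the real callers are updated in the diff below:

#include <linux/bio.h>

static void my_bio_end_io(struct bio *bio)
{
	struct bio_vec *bvec;
	int i;

	/* BVEC-backed bios never took page refs; don't drop any */
	if (!bio_flagged(bio, BIO_NO_PAGE_REF))
		bio_for_each_segment_all(bvec, bio, i)
			put_page(bvec->bv_page);
	bio_put(bio);
}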
The current two callers of bio_iov_iter_get_pages() are updated to check if they need to release pages on completion. This makes them work with bvecs that contain kernel mapped pages already.

Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe
---
 block/bio.c               | 59 ++++++++++++++++++++++++++++++++-------
 fs/block_dev.c            |  5 ++--
 fs/iomap.c                |  5 ++--
 include/linux/blk_types.h |  1 +
 4 files changed, 56 insertions(+), 14 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 4db1008309ed..330df572cfb8 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page,
 }
 EXPORT_SYMBOL(bio_add_page);

+static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
+{
+	const struct bio_vec *bv = iter->bvec;
+	unsigned int len;
+	size_t size;
+
+	len = min_t(size_t, bv->bv_len, iter->count);
+	size = bio_add_page(bio, bv->bv_page, len,
+				bv->bv_offset + iter->iov_offset);
+	if (size == len) {
+		iov_iter_advance(iter, size);
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
 #define PAGE_PTRS_PER_BVEC	(sizeof(struct bio_vec) / sizeof(struct page *))

 /**
@@ -876,23 +893,43 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 }

 /**
- * bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio
+ * bio_iov_iter_get_pages - add user or kernel pages to a bio
  * @bio: bio to add pages to
- * @iter: iov iterator describing the region to be mapped
+ * @iter: iov iterator describing the region to be added
+ *
+ * This takes either an iterator pointing to user memory, or one pointing to
+ * kernel pages (BVEC iterator). If we're adding user pages, we pin them and
+ * map them into the kernel. On IO completion, the caller should put those
+ * pages. If we're adding kernel pages, we just have to add the pages to the
+ * bio directly. We don't grab an extra reference to those pages (the user
+ * should already have that), and we don't put the page on IO completion.
+ * The caller needs to check if the bio is flagged BIO_NO_PAGE_REF on IO
+ * completion. If it isn't, then pages should be released.
  *
- * Pins pages from *iter and appends them to @bio's bvec array. The
- * pages will have to be released using put_page() when done.
  * The function tries, but does not guarantee, to pin as many pages as
- * fit into the bio, or are requested in *iter, whatever is smaller.
- * If MM encounters an error pinning the requested pages, it stops.
- * Error is returned only if 0 pages could be pinned.
+ * fit into the bio, or are requested in *iter, whatever is smaller. If
+ * MM encounters an error pinning the requested pages, it stops. Error
+ * is returned only if 0 pages could be pinned.
  */
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
+	const bool is_bvec = iov_iter_is_bvec(iter);
 	unsigned short orig_vcnt = bio->bi_vcnt;

+	/*
+	 * If this is a BVEC iter, then the pages are kernel pages. Don't
+	 * release them on IO completion.
+	 */
+	if (is_bvec)
+		bio_set_flag(bio, BIO_NO_PAGE_REF);
+
 	do {
-		int ret = __bio_iov_iter_get_pages(bio, iter);
+		int ret;
+
+		if (is_bvec)
+			ret = __bio_iov_bvec_add_pages(bio, iter);
+		else
+			ret = __bio_iov_iter_get_pages(bio, iter);

 		if (unlikely(ret))
 			return bio->bi_vcnt > orig_vcnt ? 0 : ret;
@@ -1634,7 +1671,8 @@ static void bio_dirty_fn(struct work_struct *work)
 		next = bio->bi_private;

 		bio_set_pages_dirty(bio);
-		bio_release_pages(bio);
+		if (!bio_flagged(bio, BIO_NO_PAGE_REF))
+			bio_release_pages(bio);
 		bio_put(bio);
 	}
 }
@@ -1650,7 +1688,8 @@ void bio_check_pages_dirty(struct bio *bio)
 		goto defer;
 	}

-	bio_release_pages(bio);
+	if (!bio_flagged(bio, BIO_NO_PAGE_REF))
+		bio_release_pages(bio);
 	bio_put(bio);
 	return;
 defer:

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 392e2bfb636f..051ab41d1c61 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -338,8 +338,9 @@ static void blkdev_bio_end_io(struct bio *bio)
 		struct bio_vec *bvec;
 		int i;

-		bio_for_each_segment_all(bvec, bio, i)
-			put_page(bvec->bv_page);
+		if (!bio_flagged(bio, BIO_NO_PAGE_REF))
+			bio_for_each_segment_all(bvec, bio, i)
+				put_page(bvec->bv_page);
 		bio_put(bio);
 	}
 }

diff --git a/fs/iomap.c b/fs/iomap.c
index 2ac9eb746d44..9389cf0a1c6f 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1591,8 +1591,9 @@ static void iomap_dio_bio_end_io(struct bio *bio)
 		struct bio_vec *bvec;
 		int i;

-		bio_for_each_segment_all(bvec, bio, i)
-			put_page(bvec->bv_page);
+		if (!bio_flagged(bio, BIO_NO_PAGE_REF))
+			bio_for_each_segment_all(bvec, bio, i)
+				put_page(bvec->bv_page);
 		bio_put(bio);
 	}
 }

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index d66bf5f32610..791fee35df88 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -215,6 +215,7 @@ struct bio {
 /*
  * bio flags
  */
+#define BIO_NO_PAGE_REF	0	/* don't put release vec pages */
 #define BIO_SEG_VALID	1	/* bi_phys_segments valid */
 #define BIO_CLONED	2	/* doesn't own data */
 #define BIO_BOUNCED	3	/* bio is a bounce bio */

From patchwork Thu Feb 7 19:55:46 2019
From: Jens Axboe
To: linux-aio@kvack.org, linux-block@vger.kernel.org, linux-api@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, jannh@google.com, viro@ZenIV.linux.org.uk, Jens Axboe
Subject: [PATCH 12/18] io_uring: add support for pre-mapped user IO buffers
Date: Thu, 7 Feb 2019 12:55:46 -0700
Message-Id: <20190207195552.22770-13-axboe@kernel.dk>
In-Reply-To: <20190207195552.22770-1-axboe@kernel.dk>
References: <20190207195552.22770-1-axboe@kernel.dk>

If we have fixed user buffers, we can map them into the kernel when we set up the io_uring context. That avoids the need to do get_user_pages() for each and every IO.

To utilize this feature, the application must call io_uring_register() after having set up an io_uring context, passing in IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer to an iovec array, and nr_args should contain how many iovecs the application wishes to map.

If successful, these buffers are now mapped into the kernel, eligible for IO. To use these fixed buffers, the application must use the IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then set sqe->buf_index to the desired buffer index. sqe->addr..sqe->addr+sqe->len must point to somewhere inside the indexed buffer.

The application may register buffers throughout the lifetime of the io_uring context. It can call io_uring_register() with IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of buffers, and then register a new set. The application need not unregister buffers explicitly before shutting down the io_uring context.

It's perfectly valid to set up a larger buffer, and then sometimes only use parts of it for an IO. As long as the range is within the originally mapped region, it will work just fine; a userspace sketch of the whole flow follows this message.

For now, buffers must not be file backed.
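A minimal userspace sketch of registration plus a fixed-buffer read, using the raw syscall interface this series adds (syscall number 427 and opcode value IORING_REGISTER_BUFFERS == 0 are taken from the diff below; the wrapper name is illustrative and error handling is trimmed):

#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef __NR_io_uring_register
#define __NR_io_uring_register	427	/* as added by this patch */
#endif

static int io_uring_register_bufs(int ring_fd, struct iovec *iovs,
				  unsigned int nr)
{
	/* IORING_REGISTER_BUFFERS == 0 in this series */
	return syscall(__NR_io_uring_register, ring_fd, 0, iovs, nr);
}

/*
 * After registration, a fixed read is prepared by pointing addr/len
 * somewhere inside registered buffer 'idx':
 *
 *	sqe->opcode	= IORING_OP_READ_FIXED;
 *	sqe->buf_index	= idx;
 *	sqe->addr	= (unsigned long) iovs[idx].iov_base;
 *	sqe->len	= iovs[idx].iov_len;
 */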
If file backed buffers are passed in, the registration will fail with -1/EOPNOTSUPP. This restriction may be relaxed in the future. RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat arbitrary 1G per buffer size is also imposed. Signed-off-by: Jens Axboe --- arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + fs/io_uring.c | 356 ++++++++++++++++++++++++- include/linux/sched/user.h | 2 +- include/linux/syscalls.h | 2 + include/uapi/asm-generic/unistd.h | 4 +- include/uapi/linux/io_uring.h | 13 +- kernel/sys_ni.c | 1 + 8 files changed, 363 insertions(+), 17 deletions(-) diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 481c126259e9..2eefd2a7c1ce 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -400,3 +400,4 @@ 386 i386 rseq sys_rseq __ia32_sys_rseq 425 i386 io_uring_setup sys_io_uring_setup __ia32_sys_io_uring_setup 426 i386 io_uring_enter sys_io_uring_enter __ia32_sys_io_uring_enter +427 i386 io_uring_register sys_io_uring_register __ia32_sys_io_uring_register diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index 6a32a430c8e0..65c026185e61 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -345,6 +345,7 @@ 334 common rseq __x64_sys_rseq 425 common io_uring_setup __x64_sys_io_uring_setup 426 common io_uring_enter __x64_sys_io_uring_enter +427 common io_uring_register __x64_sys_io_uring_register # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/fs/io_uring.c b/fs/io_uring.c index 1369cb95e1b5..9d6233dc35ca 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -25,6 +25,7 @@ #include #include #include +#include #include #include #include @@ -32,6 +33,7 @@ #include #include #include +#include #include @@ -61,6 +63,13 @@ struct io_cq_ring { struct io_uring_cqe cqes[]; }; +struct io_mapped_ubuf { + u64 ubuf; + size_t len; + struct bio_vec *bvec; + unsigned int nr_bvecs; +}; + struct io_ring_ctx { struct { struct percpu_ref refs; @@ -92,6 +101,10 @@ struct io_ring_ctx { struct fasync_struct *cq_fasync; } ____cacheline_aligned_in_smp; + /* if used, fixed mapped user buffers */ + unsigned nr_user_bufs; + struct io_mapped_ubuf *user_bufs; + struct user_struct *user; struct completion ctx_done; @@ -703,6 +716,44 @@ static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret) } } +static int io_import_fixed(struct io_ring_ctx *ctx, int rw, + const struct io_uring_sqe *sqe, + struct iov_iter *iter) +{ + size_t len = READ_ONCE(sqe->len); + struct io_mapped_ubuf *imu; + unsigned index, buf_index; + size_t offset; + u64 buf_addr; + + /* attempt to use fixed buffers without having provided iovecs */ + if (unlikely(!ctx->user_bufs)) + return -EFAULT; + + buf_index = READ_ONCE(sqe->buf_index); + if (unlikely(buf_index >= ctx->nr_user_bufs)) + return -EFAULT; + + index = array_index_nospec(buf_index, ctx->nr_user_bufs); + imu = &ctx->user_bufs[index]; + buf_addr = READ_ONCE(sqe->addr); + + if (buf_addr + len < buf_addr) + return -EFAULT; + if (buf_addr < imu->ubuf || buf_addr + len > imu->ubuf + imu->len) + return -EFAULT; + + /* + * May not be a start of buffer, set size appropriately + * and advance us to the beginning. 
+ */ + offset = buf_addr - imu->ubuf; + iov_iter_bvec(iter, rw, imu->bvec, imu->nr_bvecs, offset + len); + if (offset) + iov_iter_advance(iter, offset); + return 0; +} + static int io_import_iovec(struct io_ring_ctx *ctx, int rw, const struct sqe_submit *s, struct iovec **iovec, struct iov_iter *iter) @@ -710,6 +761,15 @@ static int io_import_iovec(struct io_ring_ctx *ctx, int rw, const struct io_uring_sqe *sqe = s->sqe; void __user *buf = u64_to_user_ptr(READ_ONCE(sqe->addr)); size_t sqe_len = READ_ONCE(sqe->len); + u8 opcode; + + opcode = READ_ONCE(sqe->opcode); + if (opcode == IORING_OP_READ_FIXED || + opcode == IORING_OP_WRITE_FIXED) { + ssize_t ret = io_import_fixed(ctx, rw, sqe, iter); + *iovec = NULL; + return ret; + } if (!s->has_user) return EFAULT; @@ -853,7 +913,7 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; - if (unlikely(sqe->addr || sqe->ioprio)) + if (unlikely(sqe->addr || sqe->ioprio || sqe->buf_index)) return -EINVAL; fsync_flags = READ_ONCE(sqe->fsync_flags); @@ -891,9 +951,19 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, ret = io_nop(req, req->user_data); break; case IORING_OP_READV: + if (unlikely(s->sqe->buf_index)) + return -EINVAL; ret = io_read(req, s, force_nonblock, state); break; case IORING_OP_WRITEV: + if (unlikely(s->sqe->buf_index)) + return -EINVAL; + ret = io_write(req, s, force_nonblock, state); + break; + case IORING_OP_READ_FIXED: + ret = io_read(req, s, force_nonblock, state); + break; + case IORING_OP_WRITE_FIXED: ret = io_write(req, s, force_nonblock, state); break; case IORING_OP_FSYNC: @@ -922,28 +992,47 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, return 0; } +static inline bool io_sqe_needs_user(const struct io_uring_sqe *sqe) +{ + u8 opcode = READ_ONCE(sqe->opcode); + + return !(opcode == IORING_OP_READ_FIXED || + opcode == IORING_OP_WRITE_FIXED); +} + static void io_sq_wq_submit_work(struct work_struct *work) { struct io_kiocb *req = container_of(work, struct io_kiocb, work); struct sqe_submit *s = &req->submit; - u64 user_data = READ_ONCE(s->sqe->user_data); struct io_ring_ctx *ctx = req->ctx; - mm_segment_t old_fs = get_fs(); + mm_segment_t old_fs; + bool needs_user; + u64 user_data; int ret; /* Ensure we clear previously set forced non-block flag */ req->flags &= ~REQ_F_FORCE_NONBLOCK; req->rw.ki_flags &= ~IOCB_NOWAIT; - if (!mmget_not_zero(ctx->sqo_mm)) { - ret = -EFAULT; - goto err; - } - - use_mm(ctx->sqo_mm); - set_fs(USER_DS); - s->has_user = true; + user_data = READ_ONCE(s->sqe->user_data); s->needs_lock = true; + s->has_user = false; + + /* + * If we're doing IO to fixed buffers, we don't need to get/set + * user context + */ + needs_user = io_sqe_needs_user(s->sqe); + if (needs_user) { + if (!mmget_not_zero(ctx->sqo_mm)) { + ret = -EFAULT; + goto err; + } + use_mm(ctx->sqo_mm); + old_fs = get_fs(); + set_fs(USER_DS); + s->has_user = true; + } do { ret = __io_submit_sqe(ctx, req, s, false, NULL); @@ -957,9 +1046,11 @@ static void io_sq_wq_submit_work(struct work_struct *work) cond_resched(); } while (1); - set_fs(old_fs); - unuse_mm(ctx->sqo_mm); - mmput(ctx->sqo_mm); + if (needs_user) { + set_fs(old_fs); + unuse_mm(ctx->sqo_mm); + mmput(ctx->sqo_mm); + } err: if (ret) { io_cqring_add_event(ctx, user_data, ret, 0); @@ -1241,6 +1332,188 @@ static unsigned long ring_pages(unsigned sq_entries, unsigned cq_entries) return (bytes + PAGE_SIZE - 1) / PAGE_SIZE; } +static int 
io_sqe_buffer_unregister(struct io_ring_ctx *ctx) +{ + int i, j; + + if (!ctx->user_bufs) + return -ENXIO; + + for (i = 0; i < ctx->sq_entries; i++) { + struct io_mapped_ubuf *imu = &ctx->user_bufs[i]; + + for (j = 0; j < imu->nr_bvecs; j++) + put_page(imu->bvec[j].bv_page); + + io_unaccount_mem(ctx->user, imu->nr_bvecs); + kfree(imu->bvec); + imu->nr_bvecs = 0; + } + + kfree(ctx->user_bufs); + ctx->user_bufs = NULL; + free_uid(ctx->user); + ctx->user = NULL; + return 0; +} + +static int io_copy_iov(struct io_ring_ctx *ctx, struct iovec *dst, + void __user *arg, unsigned index) +{ + struct iovec __user *src; + +#ifdef CONFIG_COMPAT + if (ctx->compat) { + struct compat_iovec __user *ciovs; + struct compat_iovec ciov; + + ciovs = (struct compat_iovec __user *) arg; + if (copy_from_user(&ciov, &ciovs[index], sizeof(ciov))) + return -EFAULT; + + dst->iov_base = (void __user *) (unsigned long) ciov.iov_base; + dst->iov_len = ciov.iov_len; + return 0; + } +#endif + src = (struct iovec __user *) arg; + if (copy_from_user(dst, &src[index], sizeof(*dst))) + return -EFAULT; + return 0; +} + +static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg, + unsigned nr_args) +{ + struct vm_area_struct **vmas = NULL; + struct page **pages = NULL; + int i, j, got_pages = 0; + int ret = -EINVAL; + + if (ctx->user_bufs) + return -EBUSY; + if (!nr_args || nr_args > UIO_MAXIOV) + return -EINVAL; + + ctx->user_bufs = kcalloc(nr_args, sizeof(struct io_mapped_ubuf), + GFP_KERNEL); + if (!ctx->user_bufs) + return -ENOMEM; + + for (i = 0; i < nr_args; i++) { + struct io_mapped_ubuf *imu = &ctx->user_bufs[i]; + unsigned long off, start, end, ubuf; + int pret, nr_pages; + struct iovec iov; + size_t size; + + ret = io_copy_iov(ctx, &iov, arg, i); + if (ret) + break; + + /* + * Don't impose further limits on the size and buffer + * constraints here, we'll -EINVAL later when IO is + * submitted if they are wrong. + */ + ret = -EFAULT; + if (!iov.iov_base || !iov.iov_len) + goto err; + + /* arbitrary limit, but we need something */ + if (iov.iov_len > SZ_1G) + goto err; + + ubuf = (unsigned long) iov.iov_base; + end = (ubuf + iov.iov_len + PAGE_SIZE - 1) >> PAGE_SHIFT; + start = ubuf >> PAGE_SHIFT; + nr_pages = end - start; + + ret = io_account_mem(ctx->user, nr_pages); + if (ret) + goto err; + + if (!pages || nr_pages > got_pages) { + kfree(vmas); + kfree(pages); + pages = kmalloc_array(nr_pages, sizeof(struct page *), + GFP_KERNEL); + vmas = kmalloc_array(nr_pages, + sizeof(struct vma_area_struct *), + GFP_KERNEL); + if (!pages || !vmas) { + ret = -ENOMEM; + io_unaccount_mem(ctx->user, nr_pages); + goto err; + } + got_pages = nr_pages; + } + + imu->bvec = kmalloc_array(nr_pages, sizeof(struct bio_vec), + GFP_KERNEL); + if (!imu->bvec) { + io_unaccount_mem(ctx->user, nr_pages); + goto err; + } + + down_write(¤t->mm->mmap_sem); + pret = get_user_pages_longterm(ubuf, nr_pages, FOLL_WRITE, + pages, vmas); + if (pret == nr_pages) { + /* don't support file backed memory */ + for (j = 0; j < nr_pages; j++) { + struct vm_area_struct *vma = vmas[j]; + + if (vma->vm_file) { + ret = -EOPNOTSUPP; + break; + } + } + } else { + ret = pret < 0 ? 
pret : -EFAULT; + } + up_write(¤t->mm->mmap_sem); + if (ret) { + /* + * if we did partial map, or found file backed vmas, + * release any pages we did get + */ + if (pret > 0) { + for (j = 0; j < pret; j++) + put_page(pages[j]); + } + io_unaccount_mem(ctx->user, nr_pages); + goto err; + } + + off = ubuf & ~PAGE_MASK; + size = iov.iov_len; + for (j = 0; j < nr_pages; j++) { + size_t vec_len; + + vec_len = min_t(size_t, size, PAGE_SIZE - off); + imu->bvec[j].bv_page = pages[j]; + imu->bvec[j].bv_len = vec_len; + imu->bvec[j].bv_offset = off; + off = 0; + size -= vec_len; + } + /* store original address for later verification */ + imu->ubuf = ubuf; + imu->len = iov.iov_len; + imu->nr_bvecs = nr_pages; + } + kfree(pages); + kfree(vmas); + ctx->nr_user_bufs = nr_args; + return 0; +err: + kfree(pages); + kfree(vmas); + io_sqe_buffer_unregister(ctx); + return ret; +} + static void io_ring_ctx_free(struct io_ring_ctx *ctx) { if (ctx->sqo_wq) @@ -1253,6 +1526,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx) #endif io_iopoll_reap_events(ctx); + io_sqe_buffer_unregister(ctx); io_mem_free(ctx->sq_ring); io_mem_free(ctx->sq_sqes); @@ -1593,6 +1867,60 @@ SYSCALL_DEFINE2(io_uring_setup, u32, entries, return io_uring_setup(entries, params); } +static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, + void __user *arg, unsigned nr_args) +{ + int ret; + + percpu_ref_kill(&ctx->refs); + wait_for_completion(&ctx->ctx_done); + + switch (opcode) { + case IORING_REGISTER_BUFFERS: + ret = io_sqe_buffer_register(ctx, arg, nr_args); + break; + case IORING_UNREGISTER_BUFFERS: + ret = -EINVAL; + if (arg || nr_args) + break; + ret = io_sqe_buffer_unregister(ctx); + break; + default: + ret = -EINVAL; + break; + } + + /* bring the ctx back to life */ + reinit_completion(&ctx->ctx_done); + percpu_ref_reinit(&ctx->refs); + return ret; +} + +SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode, + void __user *, arg, unsigned int, nr_args) +{ + struct io_ring_ctx *ctx; + long ret = -EBADF; + struct fd f; + + f = fdget(fd); + if (!f.file) + return -EBADF; + + ret = -EOPNOTSUPP; + if (f.file->f_op != &io_uring_fops) + goto out_fput; + + ctx = f.file->private_data; + + mutex_lock(&ctx->uring_lock); + ret = __io_uring_register(ctx, opcode, arg, nr_args); + mutex_unlock(&ctx->uring_lock); +out_fput: + fdput(f); + return ret; +} + static int __init io_uring_init(void) { req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC); diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h index 39ad98c09c58..c7b5f86b91a1 100644 --- a/include/linux/sched/user.h +++ b/include/linux/sched/user.h @@ -40,7 +40,7 @@ struct user_struct { kuid_t uid; #if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL) || \ - defined(CONFIG_NET) + defined(CONFIG_NET) || defined(CONFIG_IO_URING) atomic_long_t locked_vm; #endif diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 3072dbaa7869..3681c05ac538 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -315,6 +315,8 @@ asmlinkage long sys_io_uring_setup(u32 entries, asmlinkage long sys_io_uring_enter(unsigned int fd, u32 to_submit, u32 min_complete, u32 flags, const sigset_t __user *sig, size_t sigsz); +asmlinkage long sys_io_uring_register(unsigned int fd, unsigned int op, + void __user *arg, unsigned int nr_args); /* fs/xattr.c */ asmlinkage long sys_setxattr(const char __user *path, const char __user *name, diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h 
index 87871e7b7ea7..d346229a1eb0 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -744,9 +744,11 @@ __SYSCALL(__NR_kexec_file_load, sys_kexec_file_load) __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup) #define __NR_io_uring_enter 426 __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter) +#define __NR_io_uring_register 427 +__SYSCALL(__NR_io_uring_register, sys_io_uring_register) #undef __NR_syscalls -#define __NR_syscalls 427 +#define __NR_syscalls 428 /* * 32 bit systems traditionally used different diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 5c457ea396e6..cf28f7a11f12 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -27,7 +27,10 @@ struct io_uring_sqe { __u32 fsync_flags; }; __u64 user_data; /* data to be passed back at completion time */ - __u64 __pad2[3]; + union { + __u16 buf_index; /* index into fixed buffers, if used */ + __u64 __pad2[3]; + }; }; /* @@ -39,6 +42,8 @@ struct io_uring_sqe { #define IORING_OP_READV 1 #define IORING_OP_WRITEV 2 #define IORING_OP_FSYNC 3 +#define IORING_OP_READ_FIXED 4 +#define IORING_OP_WRITE_FIXED 5 /* * sqe->fsync_flags @@ -103,4 +108,10 @@ struct io_uring_params { struct io_cqring_offsets cq_off; }; +/* + * io_uring_register(2) opcodes and arguments + */ +#define IORING_REGISTER_BUFFERS 0 +#define IORING_UNREGISTER_BUFFERS 1 + #endif diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index ee5e523564bb..1bb6604dc19f 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -48,6 +48,7 @@ COND_SYSCALL_COMPAT(io_getevents); COND_SYSCALL_COMPAT(io_pgetevents); COND_SYSCALL(io_uring_setup); COND_SYSCALL(io_uring_enter); +COND_SYSCALL(io_uring_register); /* fs/xattr.c */ From patchwork Thu Feb 7 19:55:47 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jens Axboe X-Patchwork-Id: 10802095 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id A89661575 for ; Thu, 7 Feb 2019 19:56:25 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 94AB02E259 for ; Thu, 7 Feb 2019 19:56:25 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 838602E0B8; Thu, 7 Feb 2019 19:56:25 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id A07FA2E0B8 for ; Thu, 7 Feb 2019 19:56:24 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727325AbfBGT4Y (ORCPT ); Thu, 7 Feb 2019 14:56:24 -0500 Received: from mail-it1-f193.google.com ([209.85.166.193]:51677 "EHLO mail-it1-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727328AbfBGT4X (ORCPT ); Thu, 7 Feb 2019 14:56:23 -0500 Received: by mail-it1-f193.google.com with SMTP id w18so2987968ite.1 for ; Thu, 07 Feb 2019 11:56:22 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=vSCBinz6SeeuDDio7UEauuXC28ORiVzuzj84d/jJvbg=; 
From: Jens Axboe
To: linux-aio@kvack.org, linux-block@vger.kernel.org, linux-api@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, jannh@google.com, viro@ZenIV.linux.org.uk, Jens Axboe
Subject: [PATCH 13/18] io_uring: add file set registration
Date: Thu, 7 Feb 2019 12:55:47 -0700
Message-Id: <20190207195552.22770-14-axboe@kernel.dk>
In-Reply-To: <20190207195552.22770-1-axboe@kernel.dk>
References: <20190207195552.22770-1-axboe@kernel.dk>

We normally have to fget/fput for each IO we do on a file. Even with the batching we do, the cost of the atomic inc/dec of the file usage count adds up.

This adds IORING_REGISTER_FILES and IORING_UNREGISTER_FILES opcodes for the io_uring_register(2) system call. The arguments passed in must be an array of __s32 holding file descriptors, and nr_args should hold the number of file descriptors the application wishes to pin for the duration of the io_uring context (or until IORING_UNREGISTER_FILES is called).

When used, the application must set IOSQE_FIXED_FILE in the sqe->flags member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd to the index in the array passed in to IORING_REGISTER_FILES (see the sketch below).

Files are automatically unregistered when the io_uring context is torn down. An application need only unregister if it wishes to register a new set of fds.
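A short sketch of the intended use, with opcode value IORING_REGISTER_FILES == 2 and flag IOSQE_FIXED_FILE == (1U << 0) taken from the diff below; the wrapper name is illustrative:

#include <sys/syscall.h>
#include <unistd.h>

static int io_uring_register_files(int ring_fd, const int *fds,
				   unsigned int nr)
{
	/* IORING_REGISTER_FILES == 2 in this series */
	return syscall(__NR_io_uring_register, ring_fd, 2, fds, nr);
}

/*
 * Once registered, sqes reference files by array index rather than by
 * real descriptor:
 *
 *	sqe->opcode  = IORING_OP_READV;
 *	sqe->flags  |= IOSQE_FIXED_FILE;
 *	sqe->fd      = 1;	-- means fds[1], not file descriptor 1
 */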
Signed-off-by: Jens Axboe --- fs/io_uring.c | 207 +++++++++++++++++++++++++++++----- include/net/af_unix.h | 1 + include/uapi/linux/io_uring.h | 9 +- net/unix/af_unix.c | 2 +- 4 files changed, 188 insertions(+), 31 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 9d6233dc35ca..f2550efec60d 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -29,6 +29,7 @@ #include #include #include +#include #include #include #include @@ -101,6 +102,13 @@ struct io_ring_ctx { struct fasync_struct *cq_fasync; } ____cacheline_aligned_in_smp; + /* + * If used, fixed file set. Writers must ensure that ->refs is dead, + * readers must ensure that ->refs is alive as long as the file* is + * used. Only updated through io_uring_register(2). + */ + struct scm_fp_list *user_files; + /* if used, fixed mapped user buffers */ unsigned nr_user_bufs; struct io_mapped_ubuf *user_bufs; @@ -148,6 +156,7 @@ struct io_kiocb { unsigned int flags; #define REQ_F_FORCE_NONBLOCK 1 /* inline submission attempt */ #define REQ_F_IOPOLL_COMPLETED 2 /* polled IO has completed */ +#define REQ_F_FIXED_FILE 4 /* ctx owns file */ u64 user_data; u64 error; @@ -374,15 +383,17 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events, * Batched puts of the same file, to avoid dirtying the * file usage count multiple times, if avoidable. */ - if (!file) { - file = req->rw.ki_filp; - file_count = 1; - } else if (file == req->rw.ki_filp) { - file_count++; - } else { - fput_many(file, file_count); - file = req->rw.ki_filp; - file_count = 1; + if (!(req->flags & REQ_F_FIXED_FILE)) { + if (!file) { + file = req->rw.ki_filp; + file_count = 1; + } else if (file == req->rw.ki_filp) { + file_count++; + } else { + fput_many(file, file_count); + file = req->rw.ki_filp; + file_count = 1; + } } if (to_free == ARRAY_SIZE(reqs)) @@ -514,13 +525,19 @@ static void kiocb_end_write(struct kiocb *kiocb) } } +static void io_fput(struct io_kiocb *req) +{ + if (!(req->flags & REQ_F_FIXED_FILE)) + fput(req->rw.ki_filp); +} + static void io_complete_rw(struct kiocb *kiocb, long res, long res2) { struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw); kiocb_end_write(kiocb); - fput(kiocb->ki_filp); + io_fput(req); io_cqring_add_event(req->ctx, req->user_data, res, 0); io_free_req(req); } @@ -636,19 +653,29 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, { struct io_ring_ctx *ctx = req->ctx; struct kiocb *kiocb = &req->rw; - unsigned ioprio; + unsigned ioprio, flags; int fd, ret; /* For -EAGAIN retry, everything is already prepped */ if (kiocb->ki_filp) return 0; + flags = READ_ONCE(sqe->flags); fd = READ_ONCE(sqe->fd); - kiocb->ki_filp = io_file_get(state, fd); - if (unlikely(!kiocb->ki_filp)) - return -EBADF; - if (force_nonblock && !io_file_supports_async(kiocb->ki_filp)) - force_nonblock = false; + + if (flags & IOSQE_FIXED_FILE) { + if (unlikely(!ctx->user_files || + (unsigned) fd >= ctx->user_files->count)) + return -EBADF; + kiocb->ki_filp = ctx->user_files->fp[fd]; + req->flags |= REQ_F_FIXED_FILE; + } else { + kiocb->ki_filp = io_file_get(state, fd); + if (unlikely(!kiocb->ki_filp)) + return -EBADF; + if (force_nonblock && !io_file_supports_async(kiocb->ki_filp)) + force_nonblock = false; + } kiocb->ki_pos = READ_ONCE(sqe->off); kiocb->ki_flags = iocb_flags(kiocb->ki_filp); kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp)); @@ -688,10 +715,14 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, } return 0; out_fput: - /* in case of error, we 
didn't use this file reference. drop it. */ - if (state) - state->used_refs--; - io_file_put(state, kiocb->ki_filp); + if (!(flags & IOSQE_FIXED_FILE)) { + /* + * in case of error, we didn't use this file reference. drop it. + */ + if (state) + state->used_refs--; + io_file_put(state, kiocb->ki_filp); + } return ret; } @@ -823,7 +854,7 @@ static ssize_t io_read(struct io_kiocb *req, const struct sqe_submit *s, out_fput: /* Hold on to the file for -EAGAIN */ if (unlikely(ret && ret != -EAGAIN)) - fput(file); + io_fput(req); return ret; } @@ -877,7 +908,7 @@ static ssize_t io_write(struct io_kiocb *req, const struct sqe_submit *s, kfree(iovec); out_fput: if (unlikely(ret)) - fput(file); + io_fput(req); return ret; } @@ -903,7 +934,7 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe, loff_t sqe_off = READ_ONCE(sqe->off); loff_t sqe_len = READ_ONCE(sqe->len); loff_t end = sqe_off + sqe_len; - unsigned fsync_flags; + unsigned fsync_flags, flags; struct file *file; int ret, fd; @@ -921,14 +952,23 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe, return -EINVAL; fd = READ_ONCE(sqe->fd); - file = fget(fd); + flags = READ_ONCE(sqe->flags); + + if (flags & IOSQE_FIXED_FILE) { + if (unlikely(!ctx->user_files || fd >= ctx->user_files->count)) + return -EBADF; + file = ctx->user_files->fp[fd]; + } else { + file = fget(fd); + } if (unlikely(!file)) return -EBADF; ret = vfs_fsync_range(file, sqe_off, end > 0 ? end : LLONG_MAX, fsync_flags & IORING_FSYNC_DATASYNC); - fput(file); + if (!(flags & IOSQE_FIXED_FILE)) + fput(file); io_cqring_add_event(ctx, sqe->user_data, ret, 0); io_free_req(req); return 0; @@ -1065,7 +1105,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, const struct sqe_submit *s, ssize_t ret; /* enforce forwards compatibility on users */ - if (unlikely(s->sqe->flags)) + if (unlikely(s->sqe->flags & ~IOSQE_FIXED_FILE)) return -EINVAL; req = io_get_req(ctx, state); @@ -1253,6 +1293,104 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, return ring->r.head == ring->r.tail ? 
ret : 0; } +static void __io_sqe_files_unregister(struct io_ring_ctx *ctx) +{ +#if defined(CONFIG_NET) + if (ctx->ring_sock) { + struct sock *sock = ctx->ring_sock->sk; + struct sk_buff *skb; + + while ((skb = skb_dequeue(&sock->sk_receive_queue)) != NULL) + kfree_skb(skb); + } +#else + int i; + + for (i = 0; i < ctx->user_files->count; i++) + fput(ctx->user_files->fp[i]); + + kfree(ctx->user_files); +#endif +} + +static int io_sqe_files_unregister(struct io_ring_ctx *ctx) +{ + if (!ctx->user_files) + return -ENXIO; + + __io_sqe_files_unregister(ctx); + ctx->user_files = NULL; + return 0; +} + +static int io_sqe_files_scm(struct io_ring_ctx *ctx) +{ +#if defined(CONFIG_NET) + struct scm_fp_list *fpl = ctx->user_files; + struct sk_buff *skb; + int i; + + skb = __alloc_skb(0, GFP_KERNEL, 0, NUMA_NO_NODE); + if (!skb) + return -ENOMEM; + + skb->sk = ctx->ring_sock->sk; + skb->destructor = unix_destruct_scm; + + fpl->user = get_uid(ctx->user); + for (i = 0; i < fpl->count; i++) { + get_file(fpl->fp[i]); + unix_inflight(fpl->user, fpl->fp[i]); + fput(fpl->fp[i]); + } + + UNIXCB(skb).fp = fpl; + skb_queue_head(&ctx->ring_sock->sk->sk_receive_queue, skb); +#endif + return 0; +} + +static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg, + unsigned nr_args) +{ + __s32 __user *fds = (__s32 __user *) arg; + struct scm_fp_list *fpl; + int fd, ret = 0; + unsigned i; + + if (ctx->user_files) + return -EBUSY; + if (!nr_args || nr_args > SCM_MAX_FD) + return -EINVAL; + + fpl = kzalloc(sizeof(*ctx->user_files), GFP_KERNEL); + if (!fpl) + return -ENOMEM; + fpl->max = nr_args; + + for (i = 0; i < nr_args; i++) { + ret = -EFAULT; + if (copy_from_user(&fd, &fds[i], sizeof(fd))) + break; + + fpl->fp[i] = fget(fd); + + ret = -EBADF; + if (!fpl->fp[i]) + break; + fpl->count++; + ret = 0; + } + + ctx->user_files = fpl; + if (!ret) + ret = io_sqe_files_scm(ctx); + if (ret) + io_sqe_files_unregister(ctx); + + return ret; +} + static int io_sq_offload_start(struct io_ring_ctx *ctx) { int ret; @@ -1520,14 +1658,16 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx) destroy_workqueue(ctx->sqo_wq); if (ctx->sqo_mm) mmdrop(ctx->sqo_mm); + + io_iopoll_reap_events(ctx); + io_sqe_buffer_unregister(ctx); + io_sqe_files_unregister(ctx); + #if defined(CONFIG_NET) if (ctx->ring_sock) sock_release(ctx->ring_sock); #endif - io_iopoll_reap_events(ctx); - io_sqe_buffer_unregister(ctx); - io_mem_free(ctx->sq_ring); io_mem_free(ctx->sq_sqes); io_mem_free(ctx->cq_ring); @@ -1885,6 +2025,15 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, break; ret = io_sqe_buffer_unregister(ctx); break; + case IORING_REGISTER_FILES: + ret = io_sqe_files_register(ctx, arg, nr_args); + break; + case IORING_UNREGISTER_FILES: + ret = -EINVAL; + if (arg || nr_args) + break; + ret = io_sqe_files_unregister(ctx); + break; default: ret = -EINVAL; break; diff --git a/include/net/af_unix.h b/include/net/af_unix.h index ddbba838d048..3426d6dacc45 100644 --- a/include/net/af_unix.h +++ b/include/net/af_unix.h @@ -10,6 +10,7 @@ void unix_inflight(struct user_struct *user, struct file *fp); void unix_notinflight(struct user_struct *user, struct file *fp); +void unix_destruct_scm(struct sk_buff *skb); void unix_gc(void); void wait_for_unix_gc(void); struct sock *unix_get_socket(struct file *filp); diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index cf28f7a11f12..6257478d55e9 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -16,7 +16,7 @@ */ struct 
io_uring_sqe { __u8 opcode; /* type of operation for this sqe */ - __u8 flags; /* as of now unused */ + __u8 flags; /* IOSQE_ flags */ __u16 ioprio; /* ioprio for the request */ __s32 fd; /* file descriptor to do IO on */ __u64 off; /* offset into file */ @@ -33,6 +33,11 @@ struct io_uring_sqe { }; }; +/* + * sqe->flags + */ +#define IOSQE_FIXED_FILE (1U << 0) /* use fixed fileset */ + /* * io_uring_setup() flags */ @@ -113,5 +118,7 @@ struct io_uring_params { */ #define IORING_REGISTER_BUFFERS 0 #define IORING_UNREGISTER_BUFFERS 1 +#define IORING_REGISTER_FILES 2 +#define IORING_UNREGISTER_FILES 3 #endif diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index 74d1eed7cbd4..9b1bbf74c4ea 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -1497,7 +1497,7 @@ static void unix_detach_fds(struct scm_cookie *scm, struct sk_buff *skb) unix_notinflight(scm->fp->user, scm->fp->fp[i]); } -static void unix_destruct_scm(struct sk_buff *skb) +void unix_destruct_scm(struct sk_buff *skb) { struct scm_cookie scm; memset(&scm, 0, sizeof(scm)); From patchwork Thu Feb 7 19:55:48 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jens Axboe X-Patchwork-Id: 10802097 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 340F11575 for ; Thu, 7 Feb 2019 19:56:27 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 242AD2E075 for ; Thu, 7 Feb 2019 19:56:27 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 185682E259; Thu, 7 Feb 2019 19:56:27 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 28A392E0B8 for ; Thu, 7 Feb 2019 19:56:26 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727333AbfBGT4Z (ORCPT ); Thu, 7 Feb 2019 14:56:25 -0500 Received: from mail-it1-f195.google.com ([209.85.166.195]:53343 "EHLO mail-it1-f195.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727317AbfBGT4Z (ORCPT ); Thu, 7 Feb 2019 14:56:25 -0500 Received: by mail-it1-f195.google.com with SMTP id g85so2949048ita.3 for ; Thu, 07 Feb 2019 11:56:24 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=ndDyYIcVQbpFadHebfRlQ1moHoOyQrE949czO0q0J4I=; b=yglprH+5o/YWq/yNIVzjRfd+CK3vwMEGOl1lQEiDXHg34Zx9QFGmxNvHk/HS/g1q3Z DSH3cNGc2RxaX61ZGYCUp/lb1syYsWVQ+tgIRfwehnz/OdKa7thdrwGSns7/ErdUmBWW Ph5frJzEoeRJUaZOVON50lNb8P0mVyAHPKAHfQeAuPoupx/t5xsg9r3GkYI8aCH1SGd8 zrECa1zJBgQu53g5Z24ayZFDxFoNv4098kzvXks80K0+joqJbyUhKKDcXbkh0tnGGb8Q mB3AKxFBqkRKRQvYDcNnWy+RjkIP+m4fKUCDP3QXUHrHmkc+4TI4e98RcdFIWXw8MyBr 3G4w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=ndDyYIcVQbpFadHebfRlQ1moHoOyQrE949czO0q0J4I=; b=QElXFAZUXrJiLy0RU+4OEAZku0m9q03o9NZyMzG28gnRSXNQnqbvBRQRRUOFtwxJVx 
From: Jens Axboe
To: linux-aio@kvack.org, linux-block@vger.kernel.org, linux-api@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, jannh@google.com, viro@ZenIV.linux.org.uk, Jens Axboe
Subject: [PATCH 14/18] io_uring: add submission polling
Date: Thu, 7 Feb 2019 12:55:48 -0700
Message-Id: <20190207195552.22770-15-axboe@kernel.dk>
In-Reply-To: <20190207195552.22770-1-axboe@kernel.dk>
References: <20190207195552.22770-1-axboe@kernel.dk>

This enables an application to do IO without ever entering the kernel. By using the SQ ring to fill in new sqes and watching for completions on the CQ ring, we can submit and reap IOs without doing a single system call. The kernel side thread will poll for new submissions, and in case of HIPRI/polled IO, it'll also poll for completions.

By default, we allow 1 second of active spinning. This can be changed by passing in a different grace period (sq_thread_idle, in the io_uring_params) at setup time. If the thread exceeds this idle time without having any work to do, it will set:

	sq_ring->flags |= IORING_SQ_NEED_WAKEUP;

The application will have to call io_uring_enter() to start things back up again. If IO is kept busy, that will never be needed. Basically an application that has this feature enabled will guard its io_uring_enter(2) call with:

	read_barrier();
	if (*sq_ring->flags & IORING_SQ_NEED_WAKEUP)
		io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP);

instead of calling it unconditionally. A fuller submit-loop sketch follows this message.

It's mandatory to use fixed files with this feature. Failure to do so will result in the application getting an -EBADF CQ entry when submitting IO.
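For context, this is how the wakeup guard slots into an application's submit loop; a sketch only, where queue_sqe(), read_barrier() and io_uring_enter() are placeholder helpers in the style of the snippet above, not APIs provided by this patch:

for (;;) {
	queue_sqe(sq_ring);		/* fill next sqe, publish new tail */

	read_barrier();			/* observe the thread's flag update */
	if (*sq_ring->flags & IORING_SQ_NEED_WAKEUP) {
		/* the poll thread went idle; kick it back to work */
		io_uring_enter(ring_fd, 0, 0, IORING_ENTER_SQ_WAKEUP);
	}

	/* completions are reaped straight off the CQ ring, no syscall */
	reap_cqes(cq_ring);
}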
Signed-off-by: Jens Axboe --- fs/io_uring.c | 249 +++++++++++++++++++++++++++++++++- include/uapi/linux/io_uring.h | 12 +- 2 files changed, 253 insertions(+), 8 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index f2550efec60d..200631cf3955 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -24,6 +24,7 @@ #include #include #include +#include #include #include #include @@ -85,12 +86,16 @@ struct io_ring_ctx { unsigned cached_sq_head; unsigned sq_entries; unsigned sq_mask; + unsigned sq_thread_idle; struct io_uring_sqe *sq_sqes; } ____cacheline_aligned_in_smp; /* IO offload */ struct workqueue_struct *sqo_wq; + struct task_struct *sqo_thread; /* if using sq thread polling */ struct mm_struct *sqo_mm; + wait_queue_head_t sqo_wait; + unsigned sqo_stop; struct { /* CQ ring */ @@ -144,6 +149,7 @@ struct sqe_submit { unsigned short index; bool has_user; bool needs_lock; + bool needs_fixed_file; }; struct io_kiocb { @@ -295,6 +301,8 @@ static void io_cqring_add_event(struct io_ring_ctx *ctx, u64 ki_user_data, if (waitqueue_active(&ctx->wait)) wake_up(&ctx->wait); + if (waitqueue_active(&ctx->sqo_wait)) + wake_up(&ctx->sqo_wait); } static void io_ring_drop_ctx_refs(struct io_ring_ctx *ctx, unsigned refs) @@ -648,9 +656,10 @@ static bool io_file_supports_async(struct file *file) return false; } -static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, +static int io_prep_rw(struct io_kiocb *req, const struct sqe_submit *s, bool force_nonblock, struct io_submit_state *state) { + const struct io_uring_sqe *sqe = s->sqe; struct io_ring_ctx *ctx = req->ctx; struct kiocb *kiocb = &req->rw; unsigned ioprio, flags; @@ -670,6 +679,8 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, kiocb->ki_filp = ctx->user_files->fp[fd]; req->flags |= REQ_F_FIXED_FILE; } else { + if (s->needs_fixed_file) + return -EBADF; kiocb->ki_filp = io_file_get(state, fd); if (unlikely(!kiocb->ki_filp)) return -EBADF; @@ -823,7 +834,7 @@ static ssize_t io_read(struct io_kiocb *req, const struct sqe_submit *s, struct file *file; ssize_t ret; - ret = io_prep_rw(req, s->sqe, force_nonblock, state); + ret = io_prep_rw(req, s, force_nonblock, state); if (ret) return ret; file = kiocb->ki_filp; @@ -867,7 +878,7 @@ static ssize_t io_write(struct io_kiocb *req, const struct sqe_submit *s, struct file *file; ssize_t ret; - ret = io_prep_rw(req, s->sqe, force_nonblock, state); + ret = io_prep_rw(req, s, force_nonblock, state); if (ret) return ret; /* Hold on to the file for -EAGAIN */ @@ -1057,6 +1068,7 @@ static void io_sq_wq_submit_work(struct work_struct *work) user_data = READ_ONCE(s->sqe->user_data); s->needs_lock = true; s->has_user = false; + s->needs_fixed_file = false; /* * If we're doing IO to fixed buffers, we don't need to get/set @@ -1211,6 +1223,170 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s) return false; } +static int io_submit_sqes(struct io_ring_ctx *ctx, struct sqe_submit *sqes, + unsigned int nr, bool has_user, bool mm_fault) +{ + struct io_submit_state state, *statep = NULL; + int ret, i, submitted = 0; + + if (nr > IO_PLUG_THRESHOLD) { + io_submit_state_start(&state, ctx, nr); + statep = &state; + } + + for (i = 0; i < nr; i++) { + if (unlikely(mm_fault)) { + ret = -EFAULT; + } else { + sqes[i].has_user = has_user; + sqes[i].needs_lock = true; + sqes[i].needs_fixed_file = true; + ret = io_submit_sqe(ctx, &sqes[i], statep); + } + if (!ret) { + submitted++; + continue; + } + + io_cqring_add_event(ctx, sqes[i].sqe->user_data, ret, 0); + } 
+
+	if (statep)
+		io_submit_state_end(&state);
+
+	return submitted;
+}
+
+static int io_sq_thread(void *data)
+{
+	struct sqe_submit sqes[IO_IOPOLL_BATCH];
+	struct io_ring_ctx *ctx = data;
+	struct mm_struct *cur_mm = NULL;
+	mm_segment_t old_fs;
+	DEFINE_WAIT(wait);
+	unsigned inflight;
+	unsigned long timeout;
+
+	old_fs = get_fs();
+	set_fs(USER_DS);
+
+	timeout = inflight = 0;
+	while (!kthread_should_stop() && !ctx->sqo_stop) {
+		bool all_fixed, mm_fault = false;
+		int i;
+
+		if (inflight) {
+			unsigned nr_events = 0;
+
+			if (ctx->flags & IORING_SETUP_IOPOLL) {
+				/*
+				 * We disallow the app entering submit/complete
+				 * with polling, but we still need to lock the
+				 * ring to prevent racing with polled issue
+				 * that got punted to a workqueue.
+				 */
+				mutex_lock(&ctx->uring_lock);
+				io_iopoll_check(ctx, &nr_events, 0);
+				mutex_unlock(&ctx->uring_lock);
+			} else {
+				/*
+				 * Normal IO, just pretend everything completed.
+				 * We don't have to poll completions for that.
+				 */
+				nr_events = inflight;
+			}
+
+			inflight -= nr_events;
+			if (!inflight)
+				timeout = jiffies + ctx->sq_thread_idle;
+		}
+
+		if (!io_get_sqring(ctx, &sqes[0])) {
+			/*
+			 * We're polling. If we're within the defined idle
+			 * period, then let us spin without work before going
+			 * to sleep.
+			 */
+			if (inflight || !time_after(jiffies, timeout)) {
+				cpu_relax();
+				continue;
+			}
+
+			/*
+			 * Drop cur_mm before scheduling, we can't hold it for
+			 * long periods (or over schedule()). Do this before
+			 * adding ourselves to the waitqueue, as the unuse/drop
+			 * may sleep.
+			 */
+			if (cur_mm) {
+				unuse_mm(cur_mm);
+				mmput(cur_mm);
+				cur_mm = NULL;
+			}
+
+			prepare_to_wait(&ctx->sqo_wait, &wait,
+						TASK_INTERRUPTIBLE);
+
+			/* Tell userspace we may need a wakeup call */
+			ctx->sq_ring->flags |= IORING_SQ_NEED_WAKEUP;
+			smp_wmb();
+
+			if (!io_get_sqring(ctx, &sqes[0])) {
+				if (kthread_should_stop()) {
+					finish_wait(&ctx->sqo_wait, &wait);
+					break;
+				}
+				if (signal_pending(current))
+					flush_signals(current);
+				schedule();
+				finish_wait(&ctx->sqo_wait, &wait);
+
+				ctx->sq_ring->flags &= ~IORING_SQ_NEED_WAKEUP;
+				smp_wmb();
+				continue;
+			}
+			finish_wait(&ctx->sqo_wait, &wait);
+
+			ctx->sq_ring->flags &= ~IORING_SQ_NEED_WAKEUP;
+			smp_wmb();
+		}
+
+		i = 0;
+		all_fixed = true;
+		do {
+			if (all_fixed && io_sqe_needs_user(sqes[i].sqe))
+				all_fixed = false;
+
+			i++;
+			if (i == ARRAY_SIZE(sqes))
+				break;
+		} while (io_get_sqring(ctx, &sqes[i]));
+
+		io_commit_sqring(ctx);
+
+		/* Unless all new commands are FIXED regions, grab mm */
+		if (!all_fixed && !cur_mm) {
+			mm_fault = !mmget_not_zero(ctx->sqo_mm);
+			if (!mm_fault) {
+				use_mm(ctx->sqo_mm);
+				cur_mm = ctx->sqo_mm;
+			}
+		}
+
+		inflight += io_submit_sqes(ctx, sqes, i, cur_mm != NULL,
+						mm_fault);
+	}
+
+	io_iopoll_reap_events(ctx);
+
+	set_fs(old_fs);
+	if (cur_mm) {
+		unuse_mm(cur_mm);
+		mmput(cur_mm);
+	}
+	return 0;
+}
+
 static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
 {
 	struct io_submit_state state, *statep = NULL;
@@ -1229,6 +1405,7 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)

 		s.has_user = true;
 		s.needs_lock = false;
+		s.needs_fixed_file = false;

 		ret = io_submit_sqe(ctx, &s, statep);
 		if (ret) {
@@ -1391,13 +1568,47 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
 	return ret;
 }

-static int io_sq_offload_start(struct io_ring_ctx *ctx)
+static int io_sq_offload_start(struct io_ring_ctx *ctx,
+			       struct io_uring_params *p)
 {
 	int ret;

+	init_waitqueue_head(&ctx->sqo_wait);
 	mmgrab(current->mm);
 	ctx->sqo_mm = current->mm;
+	ctx->sq_thread_idle = msecs_to_jiffies(p->sq_thread_idle);
+	if (!ctx->sq_thread_idle)
+		ctx->sq_thread_idle = HZ;
+
+	ret = -EINVAL;
+	if (!cpu_possible(p->sq_thread_cpu))
+		goto err;
+
+	if (ctx->flags & IORING_SETUP_SQPOLL) {
+		if (p->flags & IORING_SETUP_SQ_AFF) {
+			int cpu;
+
+			cpu = array_index_nospec(p->sq_thread_cpu, NR_CPUS);
+			ctx->sqo_thread = kthread_create_on_cpu(io_sq_thread,
+							ctx, cpu,
+							"io_uring-sq");
+		} else {
+			ctx->sqo_thread = kthread_create(io_sq_thread, ctx,
+							"io_uring-sq");
+		}
+		if (IS_ERR(ctx->sqo_thread)) {
+			ret = PTR_ERR(ctx->sqo_thread);
+			ctx->sqo_thread = NULL;
+			goto err;
+		}
+		wake_up_process(ctx->sqo_thread);
+	} else if (p->flags & IORING_SETUP_SQ_AFF) {
+		/* Can't have SQ_AFF without SQPOLL */
+		ret = -EINVAL;
+		goto err;
+	}
+
 	/* Do QD, or 2 * CPUS, whatever is smallest */
 	ctx->sqo_wq = alloc_workqueue("io_ring-wq", WQ_UNBOUND | WQ_FREEZABLE,
 			min(ctx->sq_entries - 1, 2 * num_online_cpus()));
@@ -1408,6 +1619,12 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx)
 	return 0;
 err:
+	if (ctx->sqo_thread) {
+		ctx->sqo_stop = 1;
+		mb();
+		kthread_stop(ctx->sqo_thread);
+		ctx->sqo_thread = NULL;
+	}
 	mmdrop(ctx->sqo_mm);
 	ctx->sqo_mm = NULL;
 	return ret;
@@ -1654,6 +1871,11 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,

 static void io_ring_ctx_free(struct io_ring_ctx *ctx)
 {
+	if (ctx->sqo_thread) {
+		ctx->sqo_stop = 1;
+		mb();
+		kthread_stop(ctx->sqo_thread);
+	}
 	if (ctx->sqo_wq)
 		destroy_workqueue(ctx->sqo_wq);
 	if (ctx->sqo_mm)
@@ -1761,7 +1983,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 	int submitted = 0;
 	struct fd f;

-	if (flags & ~IORING_ENTER_GETEVENTS)
+	if (flags & ~(IORING_ENTER_GETEVENTS | IORING_ENTER_SQ_WAKEUP))
 		return -EINVAL;

 	f = fdget(fd);
@@ -1777,6 +1999,18 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 	if (!percpu_ref_tryget(&ctx->refs))
 		goto out_fput;

+	/*
+	 * For SQ polling, the thread will do all submissions and completions.
+	 * Just return the requested submit count, and wake the thread if
+	 * we were asked to.
+	 */
+	if (ctx->flags & IORING_SETUP_SQPOLL) {
+		if (flags & IORING_ENTER_SQ_WAKEUP)
+			wake_up(&ctx->sqo_wait);
+		submitted = to_submit;
+		goto out_ctx;
+	}
+
 	if (to_submit) {
 		to_submit = min(to_submit, ctx->sq_entries);

@@ -1940,7 +2174,7 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p)
 	if (ret)
 		goto err;

-	ret = io_sq_offload_start(ctx);
+	ret = io_sq_offload_start(ctx, p);
 	if (ret)
 		goto err;

@@ -1988,7 +2222,8 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
 		return -EINVAL;
 	}

-	if (p.flags & ~IORING_SETUP_IOPOLL)
+	if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_SQPOLL |
+			IORING_SETUP_SQ_AFF))
 		return -EINVAL;

 	ret = io_uring_create(entries, &p);

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 6257478d55e9..0ec74bab8dbe 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -42,6 +42,8 @@ struct io_uring_sqe {
  * io_uring_setup() flags
  */
 #define IORING_SETUP_IOPOLL	(1U << 0)	/* io_context is polled */
+#define IORING_SETUP_SQPOLL	(1U << 1)	/* SQ poll thread */
+#define IORING_SETUP_SQ_AFF	(1U << 2)	/* sq_thread_cpu is valid */

 #define IORING_OP_NOP		0
 #define IORING_OP_READV		1
@@ -86,6 +88,11 @@ struct io_sqring_offsets {
 	__u64 resv2;
 };

+/*
+ * sq_ring->flags
+ */
+#define IORING_SQ_NEED_WAKEUP	(1U << 0) /* needs io_uring_enter wakeup */
+
 struct io_cqring_offsets {
 	__u32 head;
 	__u32 tail;
@@ -100,6 +107,7 @@ struct io_cqring_offsets {
  * io_uring_enter(2) flags
  */
 #define IORING_ENTER_GETEVENTS	(1U << 0)
+#define IORING_ENTER_SQ_WAKEUP	(1U << 1)

 /*
  * Passed in for io_uring_setup(2). Copied back with updated info on success
@@ -108,7 +116,9 @@ struct io_uring_params {
 	__u32 sq_entries;
 	__u32 cq_entries;
 	__u32 flags;
-	__u32 resv[7];
+	__u32 sq_thread_cpu;
+	__u32 sq_thread_idle;
+	__u32 resv[5];
 	struct io_sqring_offsets sq_off;
 	struct io_cqring_offsets cq_off;
 };
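For illustration only (not part of the patch): the userspace side of the new SQPOLL contract. This sketch assumes sq_flags points at the mmap'd SQ ring flags word, that the IORING_SQ_NEED_WAKEUP and IORING_ENTER_SQ_WAKEUP values match the uapi additions above, and that the final kernel exports the syscall number as __NR_io_uring_enter with the six-argument signature used by this series.

#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

#define IORING_SQ_NEED_WAKEUP	(1U << 0)	/* mirrors the uapi header */
#define IORING_ENTER_SQ_WAKEUP	(1U << 1)

static void sqpoll_submit(int ring_fd, _Atomic unsigned *sq_flags,
			  unsigned to_submit)
{
	/* SQEs have already been written to the shared SQ ring; the kernel
	 * poll thread consumes them without any syscall in the common case. */
	if (atomic_load_explicit(sq_flags, memory_order_acquire) &
	    IORING_SQ_NEED_WAKEUP) {
		/* Thread went idle and asked for a kick; __NR_io_uring_enter
		 * is an assumption here (number not assigned in this excerpt). */
		syscall(__NR_io_uring_enter, ring_fd, to_submit, 0,
			IORING_ENTER_SQ_WAKEUP, NULL, 0);
	}
}

The point of the flag dance is that a busy ring never pays for a syscall at all: io_uring_enter() is reduced to a wakeup mechanism for an idle poll thread.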
From patchwork Thu Feb 7 19:55:49 2019
X-Patchwork-Submitter: Jens Axboe
X-Patchwork-Id: 10802099
From: Jens Axboe
To: linux-aio@kvack.org, linux-block@vger.kernel.org, linux-api@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, jannh@google.com, viro@ZenIV.linux.org.uk, Jens Axboe
Subject: [PATCH 15/18] io_uring: add io_kiocb ref count
Date: Thu, 7 Feb 2019 12:55:49 -0700
Message-Id: <20190207195552.22770-16-axboe@kernel.dk>
In-Reply-To: <20190207195552.22770-1-axboe@kernel.dk>
References: <20190207195552.22770-1-axboe@kernel.dk>

We'll use this for the POLL implementation. Regular requests will
NOT be using references, so initialize it to 0. Any real use of
the io_kiocb ref will initialize it to at least 2.
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe
---
 fs/io_uring.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 200631cf3955..b54c70f8f8af 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -160,6 +160,7 @@ struct io_kiocb {
 	struct io_ring_ctx	*ctx;
 	struct list_head	list;
 	unsigned int		flags;
+	refcount_t		refs;
 #define REQ_F_FORCE_NONBLOCK	1	/* inline submission attempt */
 #define REQ_F_IOPOLL_COMPLETED	2	/* polled IO has completed */
 #define REQ_F_FIXED_FILE	4	/* ctx owns file */
@@ -345,6 +346,7 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,

 	req->ctx = ctx;
 	req->flags = 0;
+	refcount_set(&req->refs, 0);
 	return req;
 out:
 	io_ring_drop_ctx_refs(ctx, 1);
@@ -362,8 +364,10 @@ static void io_free_req_many(struct io_ring_ctx *ctx, void **reqs, int *nr)

 static void io_free_req(struct io_kiocb *req)
 {
-	io_ring_drop_ctx_refs(req->ctx, 1);
-	kmem_cache_free(req_cachep, req);
+	if (!refcount_read(&req->refs) || refcount_dec_and_test(&req->refs)) {
+		io_ring_drop_ctx_refs(req->ctx, 1);
+		kmem_cache_free(req_cachep, req);
+	}
 }

 /*
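The conditional free above keeps unshared requests on a refcount-free fast path: refs stays 0 unless a feature such as the upcoming poll support explicitly takes references, in which case it starts at 2 or more and the last put frees. A minimal userspace analogue of the idiom, with hypothetical names, purely to illustrate the semantics:

#include <stdatomic.h>
#include <stdlib.h>

struct req {
	atomic_int refs;	/* 0 == never shared, single owner */
};

static void req_free(struct req *r)
{
	/* Mirrors io_free_req(): an unshared request (refs == 0) is freed
	 * immediately; a shared one is freed by whichever of its users
	 * drops the last reference. */
	if (atomic_load(&r->refs) == 0 ||
	    atomic_fetch_sub(&r->refs, 1) == 1)
		free(r);
}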
From patchwork Thu Feb 7 19:55:50 2019
X-Patchwork-Submitter: Jens Axboe
X-Patchwork-Id: 10802101
From: Jens Axboe
To: linux-aio@kvack.org, linux-block@vger.kernel.org, linux-api@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, jannh@google.com, viro@ZenIV.linux.org.uk, Jens Axboe
Subject: [PATCH 16/18] io_uring: add support for IORING_OP_POLL
Date: Thu, 7 Feb 2019 12:55:50 -0700
Message-Id: <20190207195552.22770-17-axboe@kernel.dk>
In-Reply-To: <20190207195552.22770-1-axboe@kernel.dk>
References: <20190207195552.22770-1-axboe@kernel.dk>

This is basically a direct port of bfe4037e722e, which implements
a one-shot poll command through aio. Description below is based on
that commit as well. However, instead of adding a POLL command and
relying on io_cancel(2) to remove it, we mimic the epoll(2) interface
of having a command to add a poll notification, IORING_OP_POLL_ADD,
and one to remove it again, IORING_OP_POLL_REMOVE.

To poll for a file descriptor the application should submit an sqe of
type IORING_OP_POLL_ADD. It will poll the fd for the events specified
in the poll_events field. Unlike poll or epoll without EPOLLONESHOT,
this interface always works in one-shot mode: once the sqe is
completed, it will have to be resubmitted.

Based-on-code-from: Christoph Hellwig
Signed-off-by: Jens Axboe
---
 fs/io_uring.c                 | 256 +++++++++++++++++++++++++++++++++-
 include/uapi/linux/io_uring.h |   3 +
 2 files changed, 258 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index b54c70f8f8af..cd456fd65af7 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -137,6 +137,7 @@ struct io_ring_ctx {
 	 * manipulate the list, hence no extra locking is needed there.
 	 */
 	struct list_head	poll_list;
+	struct list_head	cancel_list;
 } ____cacheline_aligned_in_smp;

 #if defined(CONFIG_NET)
@@ -152,8 +153,20 @@ struct sqe_submit {
 	bool				needs_fixed_file;
 };

+struct io_poll_iocb {
+	struct file			*file;
+	struct wait_queue_head		*head;
+	__poll_t			events;
+	bool				woken;
+	bool				canceled;
+	struct wait_queue_entry		wait;
+};
+
 struct io_kiocb {
-	struct kiocb		rw;
+	union {
+		struct kiocb		rw;
+		struct io_poll_iocb	poll;
+	};

 	struct sqe_submit	submit;

@@ -236,6 +249,7 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	init_waitqueue_head(&ctx->wait);
 	spin_lock_init(&ctx->completion_lock);
 	INIT_LIST_HEAD(&ctx->poll_list);
+	INIT_LIST_HEAD(&ctx->cancel_list);
 	return ctx;
 }

@@ -989,6 +1003,239 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	return 0;
 }

+static void io_poll_remove_one(struct io_kiocb *req)
+{
+	struct io_poll_iocb *poll = &req->poll;
+
+	spin_lock(&poll->head->lock);
+	WRITE_ONCE(poll->canceled, true);
+	if (!list_empty(&poll->wait.entry)) {
+		list_del_init(&poll->wait.entry);
+		queue_work(req->ctx->sqo_wq, &req->work);
+	}
+	spin_unlock(&poll->head->lock);
+
+	list_del_init(&req->list);
+}
+
+static void io_poll_remove_all(struct io_ring_ctx *ctx)
+{
+	struct io_kiocb *req;
+
+	spin_lock_irq(&ctx->completion_lock);
+	while (!list_empty(&ctx->cancel_list)) {
+		req = list_first_entry(&ctx->cancel_list, struct io_kiocb, list);
+		io_poll_remove_one(req);
+	}
+	spin_unlock_irq(&ctx->completion_lock);
+}
+
+/*
+ * Find a running poll command that matches one specified in sqe->addr,
+ * and remove it if found.
+ */
+static int io_poll_remove(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+	struct io_kiocb *poll_req, *next;
+	int ret = -ENOENT;
+
+	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
+		return -EINVAL;
+	if (sqe->ioprio || sqe->off || sqe->len || sqe->buf_index ||
+	    sqe->poll_events)
+		return -EINVAL;
+
+	spin_lock_irq(&ctx->completion_lock);
+	list_for_each_entry_safe(poll_req, next, &ctx->cancel_list, list) {
+		if (READ_ONCE(sqe->addr) == poll_req->user_data) {
+			io_poll_remove_one(poll_req);
+			ret = 0;
+			break;
+		}
+	}
+	spin_unlock_irq(&ctx->completion_lock);
+
+	io_cqring_add_event(req->ctx, sqe->user_data, ret, 0);
+	io_free_req(req);
+	return 0;
+}
+
+static void io_poll_complete(struct io_kiocb *req, __poll_t mask)
+{
+	io_cqring_add_event(req->ctx, req->user_data, mangle_poll(mask), 0);
+	io_fput(req);
+	io_free_req(req);
+}
+
+static void io_poll_complete_work(struct work_struct *work)
+{
+	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
+	struct io_poll_iocb *poll = &req->poll;
+	struct poll_table_struct pt = { ._key = poll->events };
+	struct io_ring_ctx *ctx = req->ctx;
+	__poll_t mask = 0;
+
+	if (!READ_ONCE(poll->canceled))
+		mask = vfs_poll(poll->file, &pt) & poll->events;
+
+	/*
+	 * Note that ->ki_cancel callers also delete iocb from active_reqs after
+	 * calling ->ki_cancel. We need the ctx_lock roundtrip here to
+	 * synchronize with them. In the cancellation case the list_del_init
+	 * itself is not actually needed, but harmless so we keep it in to
+	 * avoid further branches in the fast path.
+	 */
+	spin_lock_irq(&ctx->completion_lock);
+	if (!mask && !READ_ONCE(poll->canceled)) {
+		add_wait_queue(poll->head, &poll->wait);
+		spin_unlock_irq(&ctx->completion_lock);
+		return;
+	}
+	list_del_init(&req->list);
+	spin_unlock_irq(&ctx->completion_lock);
+
+	io_poll_complete(req, mask);
+}
+
+static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
+			void *key)
+{
+	struct io_poll_iocb *poll = container_of(wait, struct io_poll_iocb,
+							wait);
+	struct io_kiocb *req = container_of(poll, struct io_kiocb, poll);
+	struct io_ring_ctx *ctx = req->ctx;
+	__poll_t mask = key_to_poll(key);
+
+	poll->woken = true;
+
+	/* for instances that support it check for an event match first: */
+	if (mask) {
+		if (!(mask & poll->events))
+			return 0;
+
+		/* try to complete the iocb inline if we can: */
+		if (spin_trylock(&ctx->completion_lock)) {
+			list_del(&req->list);
+			spin_unlock(&ctx->completion_lock);
+
+			list_del_init(&poll->wait.entry);
+			io_poll_complete(req, mask);
+			return 1;
+		}
+	}
+
+	list_del_init(&poll->wait.entry);
+	queue_work(ctx->sqo_wq, &req->work);
+	return 1;
+}
+
+struct io_poll_table {
+	struct poll_table_struct pt;
+	struct io_kiocb *req;
+	int error;
+};
+
+static void io_poll_queue_proc(struct file *file, struct wait_queue_head *head,
+			       struct poll_table_struct *p)
+{
+	struct io_poll_table *pt = container_of(p, struct io_poll_table, pt);
+
+	if (unlikely(pt->req->poll.head)) {
+		pt->error = -EINVAL;
+		return;
+	}
+
+	pt->error = 0;
+	pt->req->poll.head = head;
+	add_wait_queue(head, &pt->req->poll.wait);
+}
+
+static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_poll_iocb *poll = &req->poll;
+	struct io_ring_ctx *ctx = req->ctx;
+	struct io_poll_table ipt;
+	unsigned flags;
+	__poll_t mask;
+	u16 events;
+	int fd;
+
+	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
+		return -EINVAL;
+	if (sqe->addr || sqe->ioprio || sqe->off || sqe->len || sqe->buf_index)
+		return -EINVAL;
+
+	INIT_WORK(&req->work, io_poll_complete_work);
+	events = READ_ONCE(sqe->poll_events);
+	poll->events = demangle_poll(events) | EPOLLERR | EPOLLHUP;
+
+	flags = READ_ONCE(sqe->flags);
+	fd = READ_ONCE(sqe->fd);
+
+	if (flags & IOSQE_FIXED_FILE) {
+		if (unlikely(!ctx->user_files || fd >= ctx->user_files->count))
+			return -EBADF;
+		poll->file = ctx->user_files->fp[fd];
+		req->flags |= REQ_F_FIXED_FILE;
+	} else {
+		poll->file = fget(fd);
+	}
+	if (unlikely(!poll->file))
+		return -EBADF;
+
+	poll->head = NULL;
+	poll->woken = false;
+	poll->canceled = false;
+
+	ipt.pt._qproc = io_poll_queue_proc;
+	ipt.pt._key = poll->events;
+	ipt.req = req;
+	ipt.error = -EINVAL; /* same as no support for IOCB_CMD_POLL */
+
+	/* initialized the list so that we can do list_empty checks */
+	INIT_LIST_HEAD(&poll->wait.entry);
+	init_waitqueue_func_entry(&poll->wait, io_poll_wake);
+
+	/* one for removal from waitqueue, one for this function */
+	refcount_set(&req->refs, 2);
+
+	mask = vfs_poll(poll->file, &ipt.pt) & poll->events;
+	if (unlikely(!poll->head)) {
+		/* we did not manage to set up a waitqueue, done */
+		goto out;
+	}
+
+	spin_lock_irq(&ctx->completion_lock);
+	spin_lock(&poll->head->lock);
+	if (poll->woken) {
+		/* wake_up context handles the rest */
+		mask = 0;
+		ipt.error = 0;
+	} else if (mask || ipt.error) {
+		/* if we get an error or a mask we are done */
+		WARN_ON_ONCE(list_empty(&poll->wait.entry));
+		list_del_init(&poll->wait.entry);
+	} else {
+		/* actually waiting for an event */
+		list_add_tail(&req->list, &ctx->cancel_list);
+	}
+	spin_unlock(&poll->head->lock);
+	spin_unlock_irq(&ctx->completion_lock);
+
+out:
+	if (unlikely(ipt.error)) {
+		if (!(flags & IOSQE_FIXED_FILE))
+			fput(poll->file);
+		return ipt.error;
+	}
+
+	if (mask)
+		io_poll_complete(req, mask);
+	io_free_req(req);
+	return 0;
+}
+
 static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 			   const struct sqe_submit *s, bool force_nonblock,
 			   struct io_submit_state *state)
@@ -1024,6 +1271,12 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	case IORING_OP_FSYNC:
 		ret = io_fsync(req, s->sqe, force_nonblock);
 		break;
+	case IORING_OP_POLL_ADD:
+		ret = io_poll_add(req, s->sqe);
+		break;
+	case IORING_OP_POLL_REMOVE:
+		ret = io_poll_remove(req, s->sqe);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -1933,6 +2186,7 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
 	percpu_ref_kill(&ctx->refs);
 	mutex_unlock(&ctx->uring_lock);

+	io_poll_remove_all(ctx);
 	io_iopoll_reap_events(ctx);
 	wait_for_completion(&ctx->ctx_done);
 	io_ring_ctx_free(ctx);

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 0ec74bab8dbe..e23408692118 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -25,6 +25,7 @@ struct io_uring_sqe {
 	union {
 		__kernel_rwf_t	rw_flags;
 		__u32		fsync_flags;
+		__u16		poll_events;
 	};
 	__u64	user_data;	/* data to be passed back at completion time */
 	union {
@@ -51,6 +52,8 @@ struct io_uring_sqe {
 #define IORING_OP_FSYNC		3
 #define IORING_OP_READ_FIXED	4
 #define IORING_OP_WRITE_FIXED	5
+#define IORING_OP_POLL_ADD	6
+#define IORING_OP_POLL_REMOVE	7

 /*
  * sqe->fsync_flags
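For illustration only (not part of the patch): how an application might prepare the two new opcodes. The trimmed-down sqe view below is an assumption for readability; real code would fill the full struct io_uring_sqe from the uapi header and push it through the SQ ring, which is elided here.

#include <poll.h>
#include <string.h>
#include <linux/types.h>

#define IORING_OP_POLL_ADD	6
#define IORING_OP_POLL_REMOVE	7

/* Hypothetical trimmed view: just the fields the poll opcodes consume. */
struct poll_sqe {
	__u8  opcode;
	__s32 fd;
	__u16 poll_events;
	__u64 addr;
	__u64 user_data;
};

static void prep_poll_add(struct poll_sqe *sqe, int fd, __u64 tag)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_POLL_ADD;
	sqe->fd = fd;
	sqe->poll_events = POLLIN;	/* classic poll(2) event bits */
	sqe->user_data = tag;
}

static void prep_poll_remove(struct poll_sqe *sqe, __u64 tag_to_cancel)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_POLL_REMOVE;
	sqe->addr = tag_to_cancel;	/* matches a prior POLL_ADD's user_data */
	sqe->user_data = tag_to_cancel;
}

Completion arrives as a normal CQE whose res carries the mangled poll mask; because the command is one-shot, level-triggered style usage means re-arming with a fresh POLL_ADD after every completion.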
From patchwork Thu Feb 7 19:55:51 2019
X-Patchwork-Submitter: Jens Axboe
X-Patchwork-Id: 10802103
From: Jens Axboe
To: linux-aio@kvack.org, linux-block@vger.kernel.org, linux-api@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, jannh@google.com, viro@ZenIV.linux.org.uk, Jens Axboe
Subject: [PATCH 17/18] io_uring: allow workqueue item to handle multiple buffered requests
Date: Thu, 7 Feb 2019 12:55:51 -0700
Message-Id: <20190207195552.22770-18-axboe@kernel.dk>
In-Reply-To: <20190207195552.22770-1-axboe@kernel.dk>
References: <20190207195552.22770-1-axboe@kernel.dk>

Right now we punt any buffered request that ends up triggering an
-EAGAIN to an async workqueue. This works fine in terms of providing
async execution of them, but it also can create quite a lot of work
queue items. For sequential buffered IO, it's advantageous to
serialize the issue of them. For reads, the first one will trigger a
read-ahead, and subsequent requests merely end up waiting on later
pages to complete. For writes, devices usually respond better to
streamed sequential writes.

Add state to track the last buffered request we punted to a work
queue, and if the next one is sequential to the previous, attempt to
get the previous work item to handle it. We limit the number of
sequential add-ons to a multiple (8) of the max read-ahead size of
the file. This should be a good number for both reads and writes, as
it defines the max IO size the device can do directly.

This drastically cuts down on the number of context switches we need
to handle buffered sequential IO, and a basic test case of copying a
big file with io_uring sees a 5x speedup.
Signed-off-by: Jens Axboe
---
 fs/io_uring.c | 267 ++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 216 insertions(+), 51 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index cd456fd65af7..af063145f901 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -72,6 +72,16 @@ struct io_mapped_ubuf {
 	unsigned int	nr_bvecs;
 };

+struct async_list {
+	spinlock_t		lock;
+	atomic_t		cnt;
+	struct list_head	list;
+
+	struct file		*file;
+	off_t			io_end;
+	size_t			io_pages;
+};
+
 struct io_ring_ctx {
 	struct {
 		struct percpu_ref	refs;
@@ -140,6 +150,8 @@ struct io_ring_ctx {
 		struct list_head	cancel_list;
 	} ____cacheline_aligned_in_smp;

+	struct async_list	pending_async[2];
+
 #if defined(CONFIG_NET)
 	struct socket		*ring_sock;
 #endif
@@ -177,6 +189,7 @@ struct io_kiocb {
 #define REQ_F_FORCE_NONBLOCK	1	/* inline submission attempt */
 #define REQ_F_IOPOLL_COMPLETED	2	/* polled IO has completed */
 #define REQ_F_FIXED_FILE	4	/* ctx owns file */
+#define REQ_F_SEQ_PREV		8	/* sequential with previous */
 	u64			user_data;
 	u64			error;
@@ -232,6 +245,7 @@ static void io_ring_ctx_ref_free(struct percpu_ref *ref)
 static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 {
 	struct io_ring_ctx *ctx;
+	int i;

 	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
 	if (!ctx)
@@ -247,6 +261,11 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	init_completion(&ctx->ctx_done);
 	mutex_init(&ctx->uring_lock);
 	init_waitqueue_head(&ctx->wait);
+	for (i = 0; i < ARRAY_SIZE(ctx->pending_async); i++) {
+		spin_lock_init(&ctx->pending_async[i].lock);
+		INIT_LIST_HEAD(&ctx->pending_async[i].list);
+		atomic_set(&ctx->pending_async[i].cnt, 0);
+	}
 	spin_lock_init(&ctx->completion_lock);
 	INIT_LIST_HEAD(&ctx->poll_list);
 	INIT_LIST_HEAD(&ctx->cancel_list);
@@ -843,6 +862,39 @@ static int io_import_iovec(struct io_ring_ctx *ctx, int rw,
 	return import_iovec(rw, buf, sqe_len, UIO_FASTIOV, iovec, iter);
 }

+static void io_async_list_note(int rw, struct io_kiocb *req, size_t len)
+{
+	struct async_list *async_list = &req->ctx->pending_async[rw];
+	struct kiocb *kiocb = &req->rw;
+	struct file *filp = kiocb->ki_filp;
+	off_t io_end = kiocb->ki_pos + len;
+
+	if (filp == async_list->file && kiocb->ki_pos == async_list->io_end) {
+		unsigned long max_pages;
+
+		/* Use 8x RA size as a decent limiter for both reads/writes */
+		max_pages = filp->f_ra.ra_pages;
+		if (!max_pages)
+			max_pages = VM_MAX_READAHEAD >> (PAGE_SHIFT - 10);
+		max_pages *= 8;
+
+		len >>= PAGE_SHIFT;
+		if (async_list->io_pages + len <= max_pages) {
+			req->flags |= REQ_F_SEQ_PREV;
+			async_list->io_pages += len;
+		} else {
+			io_end = 0;
+			async_list->io_pages = 0;
+		}
+	}
+
+	if (async_list->file != filp) {
+		async_list->io_pages = 0;
+		async_list->file = filp;
+	}
+	async_list->io_end = io_end;
+}
+
 static ssize_t io_read(struct io_kiocb *req, const struct sqe_submit *s,
 		       bool force_nonblock, struct io_submit_state *state)
 {
@@ -850,6 +902,7 @@ static ssize_t io_read(struct io_kiocb *req, const struct sqe_submit *s,
 	struct kiocb *kiocb = &req->rw;
 	struct iov_iter iter;
 	struct file *file;
+	size_t iov_count;
 	ssize_t ret;

 	ret = io_prep_rw(req, s, force_nonblock, state);
@@ -868,16 +921,24 @@ static ssize_t io_read(struct io_kiocb *req, const struct sqe_submit *s,
 	if (ret)
 		goto out_fput;

-	ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_iter_count(&iter));
+	iov_count = iov_iter_count(&iter);
+	ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_count);
 	if (!ret) {
 		ssize_t ret2;

 		/* Catch -EAGAIN return for forced non-blocking submission */
 		ret2 = call_read_iter(file, kiocb, &iter);
-		if (!force_nonblock || ret2 != -EAGAIN)
+		if (!force_nonblock || ret2 != -EAGAIN) {
 			io_rw_done(kiocb, ret2);
-		else
+		} else {
+			/*
+			 * If ->needs_lock is true, we're already in async
+			 * context.
+			 */
+			if (!s->needs_lock)
+				io_async_list_note(READ, req, iov_count);
 			ret = -EAGAIN;
+		}
 	}
 	kfree(iovec);
 out_fput:
@@ -894,14 +955,12 @@ static ssize_t io_write(struct io_kiocb *req, const struct sqe_submit *s,
 	struct kiocb *kiocb = &req->rw;
 	struct iov_iter iter;
 	struct file *file;
+	size_t iov_count;
 	ssize_t ret;

 	ret = io_prep_rw(req, s, force_nonblock, state);
 	if (ret)
 		return ret;
-	/* Hold on to the file for -EAGAIN */
-	if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT))
-		return -EAGAIN;

 	ret = -EBADF;
 	file = kiocb->ki_filp;
@@ -915,8 +974,17 @@ static ssize_t io_write(struct io_kiocb *req, const struct sqe_submit *s,
 	if (ret)
 		goto out_fput;

-	ret = rw_verify_area(WRITE, file, &kiocb->ki_pos,
-			     iov_iter_count(&iter));
+	iov_count = iov_iter_count(&iter);
+
+	ret = -EAGAIN;
+	if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT)) {
+		/* If ->needs_lock is true, we're already in async context. */
+		if (!s->needs_lock)
+			io_async_list_note(WRITE, req, iov_count);
+		goto out_free;
+	}
+
+	ret = rw_verify_area(WRITE, file, &kiocb->ki_pos, iov_count);
 	if (!ret) {
 		/*
 		 * Open-code file_start_write here to grab freeze protection,
@@ -934,9 +1002,11 @@ static ssize_t io_write(struct io_kiocb *req, const struct sqe_submit *s,
 		kiocb->ki_flags |= IOCB_WRITE;
 		io_rw_done(kiocb, call_write_iter(file, kiocb, &iter));
 	}
+out_free:
 	kfree(iovec);
 out_fput:
-	if (unlikely(ret))
+	/* Hold on to the file for -EAGAIN */
+	if (unlikely(ret && ret != -EAGAIN))
 		io_fput(req);
 	return ret;
 }
@@ -1300,6 +1370,21 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	return 0;
 }

+static struct async_list *io_async_list_from_sqe(struct io_ring_ctx *ctx,
+						 const struct io_uring_sqe *sqe)
+{
+	switch (sqe->opcode) {
+	case IORING_OP_READV:
+	case IORING_OP_READ_FIXED:
+		return &ctx->pending_async[READ];
+	case IORING_OP_WRITEV:
+	case IORING_OP_WRITE_FIXED:
+		return &ctx->pending_async[WRITE];
+	default:
+		return NULL;
+	}
+}
+
 static inline bool io_sqe_needs_user(const struct io_uring_sqe *sqe)
 {
 	u8 opcode = READ_ONCE(sqe->opcode);
@@ -1311,60 +1396,133 @@ static inline bool io_sqe_needs_user(const struct io_uring_sqe *sqe)
 static void io_sq_wq_submit_work(struct work_struct *work)
 {
 	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
-	struct sqe_submit *s = &req->submit;
 	struct io_ring_ctx *ctx = req->ctx;
+	struct mm_struct *cur_mm = NULL;
+	struct async_list *async_list;
+	LIST_HEAD(req_list);
 	mm_segment_t old_fs;
-	bool needs_user;
-	u64 user_data;
 	int ret;

-	/* Ensure we clear previously set forced non-block flag */
-	req->flags &= ~REQ_F_FORCE_NONBLOCK;
-	req->rw.ki_flags &= ~IOCB_NOWAIT;
+	async_list = io_async_list_from_sqe(ctx, req->submit.sqe);
+restart:
+	do {
+		struct sqe_submit *s = &req->submit;
+		u64 user_data = READ_ONCE(s->sqe->user_data);
+
+		/* Ensure we clear previously set forced non-block flag */
+		req->flags &= ~REQ_F_FORCE_NONBLOCK;
+		req->rw.ki_flags &= ~IOCB_NOWAIT;

-	user_data = READ_ONCE(s->sqe->user_data);
-	s->needs_lock = true;
-	s->has_user = false;
-	s->needs_fixed_file = false;
+		ret = 0;
+		if (io_sqe_needs_user(s->sqe) && !cur_mm) {
+			if (!mmget_not_zero(ctx->sqo_mm)) {
+				ret = -EFAULT;
+			} else {
+				cur_mm = ctx->sqo_mm;
+				use_mm(ctx->sqo_mm);
+				old_fs = get_fs();
+				set_fs(USER_DS);
+			}
+		}
+
+		if (!ret) {
+			s->has_user = cur_mm != NULL;
+			s->needs_lock = true;
+			s->needs_fixed_file = false;
+			do {
+				ret = __io_submit_sqe(ctx, req, s, false, NULL);
+				/*
+				 * We can get EAGAIN for polled IO even though
+				 * we're forcing a sync submission from here,
+				 * since we can't wait for request slots on the
+				 * block side.
+				 */
+				if (ret != -EAGAIN)
+					break;
+				cond_resched();
+			} while (1);
+		}
+		if (ret) {
+			io_cqring_add_event(ctx, user_data, ret, 0);
+			io_free_req(req);
+		}
+		if (!async_list)
+			break;
+		if (!list_empty(&req_list)) {
+			req = list_first_entry(&req_list, struct io_kiocb,
+						list);
+			list_del(&req->list);
+			continue;
+		}
+		if (list_empty(&async_list->list))
+			break;
+
+		req = NULL;
+		spin_lock(&async_list->lock);
+		if (list_empty(&async_list->list)) {
+			spin_unlock(&async_list->lock);
+			break;
+		}
+		list_splice_init(&async_list->list, &req_list);
+		spin_unlock(&async_list->lock);
+
+		req = list_first_entry(&req_list, struct io_kiocb, list);
+		list_del(&req->list);
+	} while (req);

 	/*
-	 * If we're doing IO to fixed buffers, we don't need to get/set
-	 * user context
+	 * Rare case of racing with a submitter. If we find the count has
+	 * dropped to zero AND we have pending work items, then restart
+	 * the processing. This is a tiny race window.
 	 */
-	needs_user = io_sqe_needs_user(s->sqe);
-	if (needs_user) {
-		if (!mmget_not_zero(ctx->sqo_mm)) {
-			ret = -EFAULT;
-			goto err;
+	ret = atomic_dec_return(&async_list->cnt);
+	while (!ret && !list_empty(&async_list->list)) {
+		spin_lock(&async_list->lock);
+		atomic_inc(&async_list->cnt);
+		list_splice_init(&async_list->list, &req_list);
+		spin_unlock(&async_list->lock);
+
+		if (!list_empty(&req_list)) {
+			req = list_first_entry(&req_list, struct io_kiocb,
+						list);
+			list_del(&req->list);
+			goto restart;
 		}
-		use_mm(ctx->sqo_mm);
-		old_fs = get_fs();
-		set_fs(USER_DS);
-		s->has_user = true;
+		ret = atomic_dec_return(&async_list->cnt);
 	}

-	do {
-		ret = __io_submit_sqe(ctx, req, s, false, NULL);
-		/*
-		 * We can get EAGAIN for polled IO even though we're forcing
-		 * a sync submission from here, since we can't wait for
-		 * request slots on the block side.
-		 */
-		if (ret != -EAGAIN)
-			break;
-		cond_resched();
-	} while (1);
-
-	if (needs_user) {
+	if (cur_mm) {
 		set_fs(old_fs);
-		unuse_mm(ctx->sqo_mm);
-		mmput(ctx->sqo_mm);
+		unuse_mm(cur_mm);
+		mmput(cur_mm);
 	}
-err:
-	if (ret) {
-		io_cqring_add_event(ctx, user_data, ret, 0);
-		io_free_req(req);
+}
+
+/*
+ * See if we can piggy back onto previously submitted work, that is still
+ * running. We currently only allow this if the new request is sequential
+ * to the previous one we punted.
+ */
+static bool io_add_to_prev_work(struct async_list *list, struct io_kiocb *req)
+{
+	bool ret = false;
+
+	if (!list)
+		return false;
+	if (!(req->flags & REQ_F_SEQ_PREV))
+		return false;
+	if (!atomic_read(&list->cnt))
+		return false;
+
+	ret = true;
+	spin_lock(&list->lock);
+	list_add_tail(&req->list, &list->list);
+	if (!atomic_read(&list->cnt)) {
+		list_del_init(&req->list);
+		ret = false;
 	}
+	spin_unlock(&list->lock);
+	return ret;
 }

 static int io_submit_sqe(struct io_ring_ctx *ctx, const struct sqe_submit *s,
@@ -1385,9 +1543,16 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, const struct sqe_submit *s,

 	ret = __io_submit_sqe(ctx, req, s, true, state);
 	if (ret == -EAGAIN) {
+		struct async_list *list;
+
+		list = io_async_list_from_sqe(ctx, s->sqe);
 		memcpy(&req->submit, s, sizeof(*s));
-		INIT_WORK(&req->work, io_sq_wq_submit_work);
-		queue_work(ctx->sqo_wq, &req->work);
+		if (!io_add_to_prev_work(list, req)) {
+			if (list)
+				atomic_inc(&list->cnt);
+			INIT_WORK(&req->work, io_sq_wq_submit_work);
+			queue_work(ctx->sqo_wq, &req->work);
+		}
 		ret = 0;
 	}
 	if (ret)
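The io_add_to_prev_work()/io_sq_wq_submit_work() pair above closes a narrow race: a worker may drop its count and exit just as a submitter queues onto it, so the submitter re-checks the count after enqueueing, and the worker re-checks its list after decrementing. A standalone sketch of the submitter half of that idiom, with hypothetical names and a fixed-size array standing in for the kernel list (illustrative only, not the patch's code):

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

struct work_list {
	pthread_mutex_t lock;	/* init with PTHREAD_MUTEX_INITIALIZER */
	atomic_int cnt;		/* number of live workers draining 'queue' */
	int queue[64];
	int head;
};

static bool try_piggyback(struct work_list *wl, int item)
{
	bool queued = true;

	if (atomic_load(&wl->cnt) == 0)
		return false;		/* no running worker to hand off to */

	pthread_mutex_lock(&wl->lock);
	wl->queue[wl->head++] = item;
	if (atomic_load(&wl->cnt) == 0) {
		/* Worker exited between the two checks: take the item back
		 * so the caller spawns fresh work for it instead. */
		wl->head--;
		queued = false;
	}
	pthread_mutex_unlock(&wl->lock);
	return queued;
}

The worker side mirrors this: it decrements cnt only after draining its list, then re-checks for late arrivals, which is what lets the submitter trust a nonzero cnt observed after the add.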
From patchwork Thu Feb 7 19:55:52 2019
X-Patchwork-Submitter: Jens Axboe
X-Patchwork-Id: 10802105
From: Jens Axboe
To: linux-aio@kvack.org, linux-block@vger.kernel.org, linux-api@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, jannh@google.com, viro@ZenIV.linux.org.uk, Jens Axboe
Subject: [PATCH 18/18] io_uring: add io_uring_event cache hit information
Date: Thu, 7 Feb 2019 12:55:52 -0700
Message-Id: <20190207195552.22770-19-axboe@kernel.dk>
In-Reply-To: <20190207195552.22770-1-axboe@kernel.dk>
References: <20190207195552.22770-1-axboe@kernel.dk>

Add a hint on whether a read was served out of the page cache, or if
it hit media. This is useful for buffered async IO; O_DIRECT reads
would never have this set (for obvious reasons).

If the read hit page cache, cqe->flags will have IOCQE_FLAG_CACHEHIT
set.

Signed-off-by: Jens Axboe
---
 fs/io_uring.c                 | 7 ++++++-
 include/uapi/linux/io_uring.h | 5 +++++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index af063145f901..69ab3f134b6d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -579,11 +579,16 @@ static void io_fput(struct io_kiocb *req)
 static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
 {
 	struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw);
+	unsigned ev_flags = 0;

 	kiocb_end_write(kiocb);

 	io_fput(req);
-	io_cqring_add_event(req->ctx, req->user_data, res, 0);
+
+	if (res > 0 && (req->flags & REQ_F_FORCE_NONBLOCK))
+		ev_flags = IOCQE_FLAG_CACHEHIT;
+
+	io_cqring_add_event(req->ctx, req->user_data, res, ev_flags);
 	io_free_req(req);
 }

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index e23408692118..24906e99fdc7 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -69,6 +69,11 @@ struct io_uring_cqe {
 	__u32	flags;
 };

+/*
+ * io_uring_event->flags
+ */
+#define IOCQE_FLAG_CACHEHIT	(1U << 0)	/* IO did not hit media */
+
 /*
  * Magic offsets for the application to mmap the data it needs
  */
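A consumer of this hint might classify completions as it reaps CQEs. A minimal sketch, illustrative only: ring walking is elided, and 'res' and 'flags' are the fields of one struct io_uring_cqe as laid out in the uapi header above.

#include <linux/types.h>

#define IOCQE_FLAG_CACHEHIT	(1U << 0)	/* mirrors the uapi header */

/* Bump per-ring counters based on the completion flags this patch adds. */
static void account_read_cqe(__s32 res, __u32 flags,
			     unsigned long *cache_hits,
			     unsigned long *media_reads)
{
	if (res <= 0)
		return;		/* failed or zero-byte read: no hint */
	if (flags & IOCQE_FLAG_CACHEHIT)
		(*cache_hits)++;
	else
		(*media_reads)++;
}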