From patchwork Sat Jan 12 21:29:56 2019
X-Patchwork-Submitter: Jens Axboe
X-Patchwork-Id: 10761097
From: Jens Axboe
To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe
Subject: [PATCH 01/16] fs: add an iopoll method to struct file_operations
Date: Sat, 12 Jan 2019 14:29:56 -0700
Message-Id: <20190112213011.1439-2-axboe@kernel.dk>
In-Reply-To: <20190112213011.1439-1-axboe@kernel.dk>
References: <20190112213011.1439-1-axboe@kernel.dk>

From: Christoph Hellwig

This new method is used to explicitly poll for I/O completion for an
iocb. It must be called for any iocb submitted asynchronously (that is,
with a non-NULL ki_complete) which has the IOCB_HIPRI flag set.

The method is assisted by a new ki_cookie field in struct kiocb to store
the polling cookie.

TODO: we can probably union ki_cookie with the existing hint and I/O
priority fields to avoid struct kiocb growth.

Reviewed-by: Johannes Thumshirn
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
---
 Documentation/filesystems/vfs.txt | 3 +++
 include/linux/fs.h                | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 8dc8e9c2913f..761c6fd24a53 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -857,6 +857,7 @@ struct file_operations {
 	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
 	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
 	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
+	int (*iopoll)(struct kiocb *kiocb, bool spin);
 	int (*iterate) (struct file *, struct dir_context *);
 	int (*iterate_shared) (struct file *, struct dir_context *);
 	__poll_t (*poll) (struct file *, struct poll_table_struct *);
@@ -902,6 +903,8 @@ otherwise noted.
   write_iter: possibly asynchronous write with iov_iter as source
 
+  iopoll: called when aio wants to poll for completions on HIPRI iocbs
+
   iterate: called when the VFS needs to read the directory contents
 
   iterate_shared: called when the VFS needs to read the directory contents
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 811c77743dad..ccb0b7a63aa5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -310,6 +310,7 @@ struct kiocb {
 	int			ki_flags;
 	u16			ki_hint;
 	u16			ki_ioprio; /* See linux/ioprio.h */
+	unsigned int		ki_cookie; /* for ->iopoll */
 } __randomize_layout;
 
 static inline bool is_sync_kiocb(struct kiocb *kiocb)
@@ -1786,6 +1787,7 @@ struct file_operations {
 	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
 	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
 	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
+	int (*iopoll)(struct kiocb *kiocb, bool spin);
 	int (*iterate) (struct file *, struct dir_context *);
 	int (*iterate_shared) (struct file *, struct dir_context *);
 	__poll_t (*poll) (struct file *, struct poll_table_struct *);
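To illustrate the contract this method creates, here is a hedged sketch of the
consumer side; it is not part of the patch. The reap_one_hipri_iocb() helper
and its 'done' flag are hypothetical (the real callers arrive later in this
series); only f_op->iopoll() and ki_cookie come from the patch itself.

static int reap_one_hipri_iocb(struct kiocb *kiocb, bool *done)
{
	struct file *file = kiocb->ki_filp;
	int ret;

	/*
	 * The submitter of an async (ki_complete != NULL) IOCB_HIPRI iocb
	 * never gets an interrupt for it; it must keep polling the file
	 * until its own completion callback has fired and set *done.
	 */
	while (!*done) {
		/* 'spin' == true: busy poll for at least one completion */
		ret = file->f_op->iopoll(kiocb, true);
		if (ret < 0)
			return ret;
	}
	return 0;
}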
From patchwork Sat Jan 12 21:29:57 2019
X-Patchwork-Submitter: Jens Axboe
X-Patchwork-Id: 10761101

From: Jens Axboe
To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe
Subject: [PATCH 02/16] block: wire up block device iopoll method
Date: Sat, 12 Jan 2019 14:29:57 -0700
Message-Id: <20190112213011.1439-3-axboe@kernel.dk>
In-Reply-To: <20190112213011.1439-1-axboe@kernel.dk>
References: <20190112213011.1439-1-axboe@kernel.dk>

From: Christoph Hellwig

Just call blk_poll on the iocb cookie; we can derive the block device
from the inode trivially.

Reviewed-by: Johannes Thumshirn
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
---
 fs/block_dev.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index c546cdce77e6..5415579f3e14 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -279,6 +279,14 @@ struct blkdev_dio {
 
 static struct bio_set blkdev_dio_pool;
 
+static int blkdev_iopoll(struct kiocb *kiocb, bool wait)
+{
+	struct block_device *bdev = I_BDEV(kiocb->ki_filp->f_mapping->host);
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	return blk_poll(q, READ_ONCE(kiocb->ki_cookie), wait);
+}
+
 static void blkdev_bio_end_io(struct bio *bio)
 {
 	struct blkdev_dio *dio = bio->bi_private;
@@ -396,6 +404,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 				bio->bi_opf |= REQ_HIPRI;
 
 			qc = submit_bio(bio);
+			WRITE_ONCE(iocb->ki_cookie, qc);
 			break;
 		}
@@ -2068,6 +2077,7 @@ const struct file_operations def_blk_fops = {
 	.llseek		= block_llseek,
 	.read_iter	= blkdev_read_iter,
 	.write_iter	= blkdev_write_iter,
+	.iopoll		= blkdev_iopoll,
 	.mmap		= generic_file_mmap,
 	.fsync		= blkdev_fsync,
 	.unlocked_ioctl	= block_ioctl,
From patchwork Sat Jan 12 21:29:58 2019
X-Patchwork-Submitter: Jens Axboe
X-Patchwork-Id: 10761107

From: Jens Axboe
To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe
Subject: [PATCH 03/16] block: add bio_set_polled() helper
Date: Sat, 12 Jan 2019 14:29:58 -0700
Message-Id: <20190112213011.1439-4-axboe@kernel.dk>
In-Reply-To: <20190112213011.1439-1-axboe@kernel.dk>
References: <20190112213011.1439-1-axboe@kernel.dk>

For the upcoming async polled IO, we can't sleep allocating requests.
If we do, then we introduce a deadlock where the submitter already has
async polled IO in-flight, but can't wait for them to complete since
polled requests must be actively found and reaped.
Utilize the helper in the blockdev DIRECT_IO code.

Signed-off-by: Jens Axboe
---
 fs/block_dev.c      |  4 ++--
 include/linux/bio.h | 14 ++++++++++++++
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 5415579f3e14..2ebd2a0d7789 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -233,7 +233,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 		task_io_account_write(ret);
 	}
 	if (iocb->ki_flags & IOCB_HIPRI)
-		bio.bi_opf |= REQ_HIPRI;
+		bio_set_polled(&bio, iocb);
 
 	qc = submit_bio(&bio);
 	for (;;) {
@@ -401,7 +401,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 		nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
 		if (!nr_pages) {
 			if (iocb->ki_flags & IOCB_HIPRI)
-				bio->bi_opf |= REQ_HIPRI;
+				bio_set_polled(bio, iocb);
 
 			qc = submit_bio(bio);
 			WRITE_ONCE(iocb->ki_cookie, qc);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7380b094dcca..f6f0a2b3cbc8 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -823,5 +823,19 @@ static inline int bio_integrity_add_page(struct bio *bio, struct page *page,
 
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
 
+/*
+ * Mark a bio as polled. Note that for async polled IO, the caller must
+ * expect -EWOULDBLOCK if we cannot allocate a request (or other resources).
+ * We cannot block waiting for requests on polled IO, as those completions
+ * must be found by the caller. This is different than IRQ driven IO, where
+ * it's safe to wait for IO to complete.
+ */
+static inline void bio_set_polled(struct bio *bio, struct kiocb *kiocb)
+{
+	bio->bi_opf |= REQ_HIPRI;
+	if (!is_sync_kiocb(kiocb))
+		bio->bi_opf |= REQ_NOWAIT;
+}
+
 #endif /* CONFIG_BLOCK */
 #endif /* __LINUX_BIO_H */
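To make the constraint above concrete, here is a hedged sketch of the
submission-side pattern this enables; example_submit_polled() is a
hypothetical wrapper, not code from the patch, and only bio_set_polled(),
REQ_NOWAIT and IOCB_HIPRI are taken from it.

/*
 * Hedged sketch, not part of the patch: a direct-IO submitter marking a
 * polled bio. For an async kiocb, bio_set_polled() also sets REQ_NOWAIT,
 * so a request allocation that would have to block instead fails the bio
 * with BLK_STS_AGAIN; the submitter sees -EAGAIN/-EWOULDBLOCK and must
 * retry from a context that may block, rather than sleeping here, because
 * sleeping would depend on completions that only polling can reap.
 */
static blk_qc_t example_submit_polled(struct kiocb *iocb, struct bio *bio)
{
	if (iocb->ki_flags & IOCB_HIPRI)
		bio_set_polled(bio, iocb);

	return submit_bio(bio);
}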
From patchwork Sat Jan 12 21:29:59 2019
X-Patchwork-Submitter: Jens Axboe
X-Patchwork-Id: 10761105

From: Jens Axboe
To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe
Subject: [PATCH 04/16] iomap: wire up the iopoll method
Date: Sat, 12 Jan 2019 14:29:59 -0700
Message-Id: <20190112213011.1439-5-axboe@kernel.dk>
In-Reply-To: <20190112213011.1439-1-axboe@kernel.dk>
References: <20190112213011.1439-1-axboe@kernel.dk>

From: Christoph Hellwig

Store the request queue the last bio was submitted to in the iocb
private data, in addition to the cookie, so that we can find the right
block device. Also refactor the common direct I/O bio submission code
into a nice little helper.

Signed-off-by: Christoph Hellwig

Modified to use bio_set_polled().
Signed-off-by: Jens Axboe --- fs/gfs2/file.c | 2 ++ fs/iomap.c | 43 ++++++++++++++++++++++++++++--------------- fs/xfs/xfs_file.c | 1 + include/linux/iomap.h | 1 + 4 files changed, 32 insertions(+), 15 deletions(-) diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c index a2dea5bc0427..58a768e59712 100644 --- a/fs/gfs2/file.c +++ b/fs/gfs2/file.c @@ -1280,6 +1280,7 @@ const struct file_operations gfs2_file_fops = { .llseek = gfs2_llseek, .read_iter = gfs2_file_read_iter, .write_iter = gfs2_file_write_iter, + .iopoll = iomap_dio_iopoll, .unlocked_ioctl = gfs2_ioctl, .mmap = gfs2_mmap, .open = gfs2_open, @@ -1310,6 +1311,7 @@ const struct file_operations gfs2_file_fops_nolock = { .llseek = gfs2_llseek, .read_iter = gfs2_file_read_iter, .write_iter = gfs2_file_write_iter, + .iopoll = iomap_dio_iopoll, .unlocked_ioctl = gfs2_ioctl, .mmap = gfs2_mmap, .open = gfs2_open, diff --git a/fs/iomap.c b/fs/iomap.c index a3088fae567b..4ee50b76b4a1 100644 --- a/fs/iomap.c +++ b/fs/iomap.c @@ -1454,6 +1454,28 @@ struct iomap_dio { }; }; +int iomap_dio_iopoll(struct kiocb *kiocb, bool spin) +{ + struct request_queue *q = READ_ONCE(kiocb->private); + + if (!q) + return 0; + return blk_poll(q, READ_ONCE(kiocb->ki_cookie), spin); +} +EXPORT_SYMBOL_GPL(iomap_dio_iopoll); + +static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap, + struct bio *bio) +{ + atomic_inc(&dio->ref); + + if (dio->iocb->ki_flags & IOCB_HIPRI) + bio_set_polled(bio, dio->iocb); + + dio->submit.last_queue = bdev_get_queue(iomap->bdev); + dio->submit.cookie = submit_bio(bio); +} + static ssize_t iomap_dio_complete(struct iomap_dio *dio) { struct kiocb *iocb = dio->iocb; @@ -1566,7 +1588,7 @@ static void iomap_dio_bio_end_io(struct bio *bio) } } -static blk_qc_t +static void iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos, unsigned len) { @@ -1580,15 +1602,10 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos, bio->bi_private = dio; bio->bi_end_io = iomap_dio_bio_end_io; - if (dio->iocb->ki_flags & IOCB_HIPRI) - flags |= REQ_HIPRI; - get_page(page); __bio_add_page(bio, page, len, 0); bio_set_op_attrs(bio, REQ_OP_WRITE, flags); - - atomic_inc(&dio->ref); - return submit_bio(bio); + iomap_dio_submit_bio(dio, iomap, bio); } static loff_t @@ -1691,9 +1708,6 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length, bio_set_pages_dirty(bio); } - if (dio->iocb->ki_flags & IOCB_HIPRI) - bio->bi_opf |= REQ_HIPRI; - iov_iter_advance(dio->submit.iter, n); dio->size += n; @@ -1701,11 +1715,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length, copied += n; nr_pages = iov_iter_npages(&iter, BIO_MAX_PAGES); - - atomic_inc(&dio->ref); - - dio->submit.last_queue = bdev_get_queue(iomap->bdev); - dio->submit.cookie = submit_bio(bio); + iomap_dio_submit_bio(dio, iomap, bio); } while (nr_pages); /* @@ -1916,6 +1926,9 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, if (dio->flags & IOMAP_DIO_WRITE_FUA) dio->flags &= ~IOMAP_DIO_NEED_SYNC; + WRITE_ONCE(iocb->ki_cookie, dio->submit.cookie); + WRITE_ONCE(iocb->private, dio->submit.last_queue); + if (!atomic_dec_and_test(&dio->ref)) { if (!dio->wait_for_completion) return -EIOCBQUEUED; diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index e47425071e65..60c2da41f0fc 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -1203,6 +1203,7 @@ const struct file_operations xfs_file_operations = { .write_iter = xfs_file_write_iter, .splice_read = generic_file_splice_read, .splice_write = iter_file_splice_write, + .iopoll 
= iomap_dio_iopoll, .unlocked_ioctl = xfs_file_ioctl, #ifdef CONFIG_COMPAT .compat_ioctl = xfs_file_compat_ioctl, diff --git a/include/linux/iomap.h b/include/linux/iomap.h index 9a4258154b25..0fefb5455bda 100644 --- a/include/linux/iomap.h +++ b/include/linux/iomap.h @@ -162,6 +162,7 @@ typedef int (iomap_dio_end_io_t)(struct kiocb *iocb, ssize_t ret, unsigned flags); ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, const struct iomap_ops *ops, iomap_dio_end_io_t end_io); +int iomap_dio_iopoll(struct kiocb *kiocb, bool spin); #ifdef CONFIG_SWAP struct file;
From patchwork Sat Jan 12 21:30:00 2019
X-Patchwork-Submitter: Jens Axboe
X-Patchwork-Id: 10761117
From: Jens Axboe
To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe
Subject: [PATCH 05/16] Add io_uring IO interface
Date: Sat, 12 Jan 2019 14:30:00 -0700
Message-Id: <20190112213011.1439-6-axboe@kernel.dk>
In-Reply-To: <20190112213011.1439-1-axboe@kernel.dk>
References: <20190112213011.1439-1-axboe@kernel.dk>

The submission queue (SQ) and completion queue (CQ) rings are shared
between the application and the kernel. This eliminates the need to
copy data back and forth to submit and complete IO.

IO submissions use the io_uring_sqe data structure, and completions
are generated in the form of io_uring_cqe data structures. The SQ
ring is an array of indices into the io_uring_sqe array, which makes
it possible to submit a batch of IOs without them being contiguous in
the ring. The CQ ring is always contiguous, as completion events are
inherently unordered, and hence any io_uring_cqe entry can point back
to an arbitrary submission.

Two new system calls are added for this:

io_uring_setup(entries, params)
	Sets up a context for doing async IO. On success, returns a file
	descriptor that the application can mmap to gain access to the
	SQ ring, CQ ring, and io_uring_sqes.

io_uring_enter(fd, to_submit, min_complete, flags)
	Initiates IO against the rings mapped to this fd, or waits for
	them to complete, or both. The behavior is controlled by the
	parameters passed in. If 'to_submit' is non-zero, then we'll
	try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
	kernel will wait for 'min_complete' events, if they aren't
	already available.

With this setup, it's possible to do async IO with a single system
call. Future developments will enable polled IO with this interface,
and polled submission as well. The latter will enable an application
to do IO without doing ANY system calls at all. For IRQ driven IO, an
application only needs to enter the kernel for completions if it wants
to wait for them to occur.

Each io_uring is backed by a workqueue, to support buffered async IO
as well. We will only punt to an async context if the command would
need to wait for IO on the device side. Any data that can be accessed
directly in the page cache is handled inline. This avoids the slowness
issue of the usual threadpools, since cached data is accessed as
quickly as with a sync interface.
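To make the flow concrete, here is a hedged userspace sketch of a single
READV submitted through the raw interface described above: setup, mmap of
the three regions, one sqe, one enter call. It is illustrative only; error
handling and the load/store memory barriers a real application needs around
the ring indices are omitted, the syscall numbers are the x86-64 ones wired
up by this patch, and it assumes the new uapi header from this patch is
installed. The sample application linked just below is the authoritative
reference.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <linux/io_uring.h>	/* the uapi header added by this patch */

#define __NR_io_uring_setup	335	/* x86-64 numbers from this patch */
#define __NR_io_uring_enter	336

int main(void)
{
	struct io_uring_params p = { 0 };
	int ring_fd = syscall(__NR_io_uring_setup, 4, &p);

	/* map the SQ ring, the sqe array and the CQ ring separately */
	unsigned char *sq = mmap(NULL, p.sq_off.array + p.sq_entries * sizeof(__u32),
				 PROT_READ | PROT_WRITE, MAP_SHARED, ring_fd,
				 IORING_OFF_SQ_RING);
	struct io_uring_sqe *sqes = mmap(NULL, p.sq_entries * sizeof(*sqes),
				 PROT_READ | PROT_WRITE, MAP_SHARED, ring_fd,
				 IORING_OFF_SQES);
	unsigned char *cq = mmap(NULL, p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe),
				 PROT_READ | PROT_WRITE, MAP_SHARED, ring_fd,
				 IORING_OFF_CQ_RING);

	__u32 *sq_tail = (__u32 *)(sq + p.sq_off.tail);
	__u32 *sq_mask = (__u32 *)(sq + p.sq_off.ring_mask);
	__u32 *sq_array = (__u32 *)(sq + p.sq_off.array);

	/* queue a single READV against an already opened file */
	char buf[4096];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	int fd = open("/etc/hostname", O_RDONLY);

	__u32 index = *sq_tail & *sq_mask;
	struct io_uring_sqe *sqe = &sqes[index];
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_READV;
	sqe->fd = fd;
	sqe->addr = &iov;	/* iovec array */
	sqe->len = 1;		/* number of iovecs */
	sqe->user_data = 0x42;
	sq_array[index] = index;
	(*sq_tail)++;		/* needs a store barrier in real code */

	/* submit the sqe and wait for its completion in one call */
	syscall(__NR_io_uring_enter, ring_fd, 1, 1, IORING_ENTER_GETEVENTS);

	struct io_uring_cqe *cqes = (void *)(cq + p.cq_off.cqes);
	__u32 *cq_head = (__u32 *)(cq + p.cq_off.head);
	__u32 *cq_mask = (__u32 *)(cq + p.cq_off.ring_mask);
	struct io_uring_cqe *cqe = &cqes[*cq_head & *cq_mask];

	printf("user_data %llu, res %d\n",
	       (unsigned long long) cqe->user_data, cqe->res);
	(*cq_head)++;		/* tell the kernel we consumed this cqe */
	return 0;
}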
Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c Signed-off-by: Jens Axboe --- arch/x86/entry/syscalls/syscall_64.tbl | 2 + fs/Makefile | 1 + fs/io_uring.c | 925 +++++++++++++++++++++++++ include/linux/syscalls.h | 5 + include/uapi/linux/io_uring.h | 96 +++ init/Kconfig | 9 + kernel/sys_ni.c | 2 + 7 files changed, 1040 insertions(+) create mode 100644 fs/io_uring.c create mode 100644 include/uapi/linux/io_uring.h diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index f0b1709a5ffb..453ff7a79002 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -343,6 +343,8 @@ 332 common statx __x64_sys_statx 333 common io_pgetevents __x64_sys_io_pgetevents 334 common rseq __x64_sys_rseq +335 common io_uring_setup __x64_sys_io_uring_setup +336 common io_uring_enter __x64_sys_io_uring_enter # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/fs/Makefile b/fs/Makefile index 293733f61594..8e15d6fc4340 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -30,6 +30,7 @@ obj-$(CONFIG_TIMERFD) += timerfd.o obj-$(CONFIG_EVENTFD) += eventfd.o obj-$(CONFIG_USERFAULTFD) += userfaultfd.o obj-$(CONFIG_AIO) += aio.o +obj-$(CONFIG_IO_URING) += io_uring.o obj-$(CONFIG_FS_DAX) += dax.o obj-$(CONFIG_FS_ENCRYPTION) += crypto/ obj-$(CONFIG_FILE_LOCKING) += locks.o diff --git a/fs/io_uring.c b/fs/io_uring.c new file mode 100644 index 000000000000..df8fe19cdb74 --- /dev/null +++ b/fs/io_uring.c @@ -0,0 +1,925 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Shared application/kernel submission and completion ring pairs, for + * supporting fast/efficient IO. + * + * Copyright (C) 2019 Jens Axboe + */ +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +#include + +#include "internal.h" + +struct io_uring { + u32 head ____cacheline_aligned_in_smp; + u32 tail ____cacheline_aligned_in_smp; +}; + +struct io_sq_ring { + struct io_uring r; + u32 ring_mask; + u32 ring_entries; + u32 dropped; + u32 flags; + u32 array[]; +}; + +struct io_cq_ring { + struct io_uring r; + u32 ring_mask; + u32 ring_entries; + u32 overflow; + struct io_uring_cqe cqes[]; +}; + +struct io_ring_ctx { + struct percpu_ref refs; + + unsigned int flags; + + /* SQ ring */ + struct io_sq_ring *sq_ring; + unsigned sq_entries; + unsigned sq_mask; + unsigned sq_thread_cpu; + struct io_uring_sqe *sq_sqes; + + /* CQ ring */ + struct io_cq_ring *cq_ring; + unsigned cq_entries; + unsigned cq_mask; + + /* IO offload */ + struct workqueue_struct *sqo_wq; + struct mm_struct *sqo_mm; + struct files_struct *sqo_files; + + struct completion ctx_done; + + struct { + struct mutex uring_lock; + wait_queue_head_t wait; + } ____cacheline_aligned_in_smp; + + struct { + spinlock_t completion_lock; + } ____cacheline_aligned_in_smp; +}; + +struct sqe_submit { + const struct io_uring_sqe *sqe; + unsigned index; +}; + +struct io_work { + struct work_struct work; + struct sqe_submit submit; +}; + +struct io_kiocb { + union { + struct kiocb rw; + struct io_work work; + }; + + struct io_ring_ctx *ki_ctx; + struct list_head ki_list; + unsigned long ki_flags; +#define REQ_F_FORCE_NONBLOCK 1 /* inline submission attempt */ + u64 ki_user_data; +}; + +#define IO_PLUG_THRESHOLD 2 + +static struct kmem_cache *req_cachep; + +static const struct file_operations io_scqring_fops; + +static void 
io_ring_ctx_ref_free(struct percpu_ref *ref) +{ + struct io_ring_ctx *ctx = container_of(ref, struct io_ring_ctx, refs); + + complete(&ctx->ctx_done); +} + +static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) +{ + struct io_ring_ctx *ctx; + + ctx = kzalloc(sizeof(*ctx), GFP_KERNEL); + if (!ctx) + return NULL; + + if (percpu_ref_init(&ctx->refs, io_ring_ctx_ref_free, 0, GFP_KERNEL)) { + kfree(ctx); + return NULL; + } + + ctx->flags = p->flags; + + init_completion(&ctx->ctx_done); + + spin_lock_init(&ctx->completion_lock); + init_waitqueue_head(&ctx->wait); + mutex_init(&ctx->uring_lock); + + return ctx; +} + +static void io_inc_cqring(struct io_ring_ctx *ctx) +{ + struct io_cq_ring *ring = ctx->cq_ring; + + ring->r.tail++; + smp_wmb(); +} + +static struct io_uring_cqe *io_peek_cqring(struct io_ring_ctx *ctx) +{ + struct io_cq_ring *ring = ctx->cq_ring; + unsigned tail; + + smp_rmb(); + tail = READ_ONCE(ring->r.tail); + if (tail + 1 == READ_ONCE(ring->r.head)) + return NULL; + + return &ring->cqes[tail & ctx->cq_mask]; +} + +static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx) +{ + struct io_kiocb *req; + + if (!percpu_ref_tryget(&ctx->refs)) + return NULL; + + req = kmem_cache_alloc(req_cachep, GFP_ATOMIC | __GFP_NOWARN); + if (!req) + return NULL; + + req->ki_ctx = ctx; + INIT_LIST_HEAD(&req->ki_list); + req->ki_flags = 0; + return req; +} + +static void io_ring_drop_ctx_refs(struct io_ring_ctx *ctx, unsigned refs) +{ + percpu_ref_put_many(&ctx->refs, refs); + + if (waitqueue_active(&ctx->wait)) + wake_up(&ctx->wait); +} + +static void io_free_req(struct io_kiocb *req) +{ + kmem_cache_free(req_cachep, req); + io_ring_drop_ctx_refs(req->ki_ctx, 1); +} + +static void kiocb_end_write(struct kiocb *kiocb) +{ + if (kiocb->ki_flags & IOCB_WRITE) { + struct inode *inode = file_inode(kiocb->ki_filp); + + /* + * Tell lockdep we inherited freeze protection from submission + * thread. + */ + if (S_ISREG(inode->i_mode)) + __sb_writers_acquired(inode->i_sb, SB_FREEZE_WRITE); + file_end_write(kiocb->ki_filp); + } +} + +static void io_cqring_fill_event(struct io_ring_ctx *ctx, u64 ki_user_data, + long res, unsigned ev_flags) +{ + struct io_uring_cqe *cqe; + unsigned long flags; + + /* + * If we can't get a cq entry, userspace overflowed the + * submission (by quite a lot). Increment the overflow count in + * the ring. 
+ */ + spin_lock_irqsave(&ctx->completion_lock, flags); + cqe = io_peek_cqring(ctx); + if (cqe) { + cqe->user_data = ki_user_data; + cqe->res = res; + cqe->flags = ev_flags; + smp_wmb(); + io_inc_cqring(ctx); + } else + ctx->cq_ring->overflow++; + spin_unlock_irqrestore(&ctx->completion_lock, flags); +} + +static void io_fill_cq_error(struct io_ring_ctx *ctx, struct sqe_submit *s, + long error) +{ + io_cqring_fill_event(ctx, s->index, error, 0); + + if (waitqueue_active(&ctx->wait)) + wake_up(&ctx->wait); +} + +static void io_complete_scqring_rw(struct kiocb *kiocb, long res, long res2) +{ + struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw); + + kiocb_end_write(kiocb); + + fput(kiocb->ki_filp); + io_cqring_fill_event(req->ki_ctx, req->ki_user_data, res, 0); + io_free_req(req); +} + +static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, + bool force_nonblock) +{ + struct kiocb *kiocb = &req->rw; + int ret; + + kiocb->ki_filp = fget(sqe->fd); + if (unlikely(!kiocb->ki_filp)) + return -EBADF; + kiocb->ki_pos = sqe->off; + kiocb->ki_flags = iocb_flags(kiocb->ki_filp); + kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp)); + if (sqe->ioprio) { + ret = ioprio_check_cap(sqe->ioprio); + if (ret) + goto out_fput; + + kiocb->ki_ioprio = sqe->ioprio; + } else + kiocb->ki_ioprio = get_current_ioprio(); + + ret = kiocb_set_rw_flags(kiocb, sqe->rw_flags); + if (unlikely(ret)) + goto out_fput; + if (force_nonblock) { + kiocb->ki_flags |= IOCB_NOWAIT; + req->ki_flags |= REQ_F_FORCE_NONBLOCK; + } + if (kiocb->ki_flags & IOCB_HIPRI) { + ret = -EINVAL; + goto out_fput; + } + + kiocb->ki_complete = io_complete_scqring_rw; + return 0; +out_fput: + fput(kiocb->ki_filp); + return ret; +} + +static inline void io_rw_done(struct kiocb *req, ssize_t ret) +{ + switch (ret) { + case -EIOCBQUEUED: + break; + case -ERESTARTSYS: + case -ERESTARTNOINTR: + case -ERESTARTNOHAND: + case -ERESTART_RESTARTBLOCK: + /* + * There's no easy way to restart the syscall since other AIO's + * may be already running. Just fail this IO with EINTR. 
+ */ + ret = -EINTR; + /*FALLTHRU*/ + default: + req->ki_complete(req, ret, 0); + } +} + +static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe, + bool force_nonblock) +{ + struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; + void __user *buf = (void __user *) (uintptr_t) sqe->addr; + struct kiocb *kiocb = &req->rw; + struct iov_iter iter; + struct file *file; + ssize_t ret; + + ret = io_prep_rw(req, sqe, force_nonblock); + if (ret) + return ret; + file = kiocb->ki_filp; + + ret = -EBADF; + if (unlikely(!(file->f_mode & FMODE_READ))) + goto out_fput; + ret = -EINVAL; + if (unlikely(!file->f_op->read_iter)) + goto out_fput; + + ret = import_iovec(READ, buf, sqe->len, UIO_FASTIOV, &iovec, &iter); + if (ret) + goto out_fput; + + ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_iter_count(&iter)); + if (!ret) { + ssize_t ret2; + + /* Catch -EAGAIN return for forced non-blocking submission */ + ret2 = call_read_iter(file, kiocb, &iter); + if (!force_nonblock || ret2 != -EAGAIN) + io_rw_done(kiocb, ret2); + else + ret = -EAGAIN; + } + kfree(iovec); +out_fput: + if (unlikely(ret)) + fput(file); + return ret; +} + +static ssize_t io_write(struct io_kiocb *req, const struct io_uring_sqe *sqe, + bool force_nonblock) +{ + struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; + void __user *buf = (void __user *) (uintptr_t) sqe->addr; + struct kiocb *kiocb = &req->rw; + struct iov_iter iter; + struct file *file; + ssize_t ret; + + ret = io_prep_rw(req, sqe, force_nonblock); + if (ret) + return ret; + file = kiocb->ki_filp; + + ret = -EAGAIN; + if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT)) + goto out_fput; + + ret = -EBADF; + if (unlikely(!(file->f_mode & FMODE_WRITE))) + goto out_fput; + ret = -EINVAL; + if (unlikely(!file->f_op->write_iter)) + goto out_fput; + + ret = import_iovec(WRITE, buf, sqe->len, UIO_FASTIOV, &iovec, &iter); + if (ret) + goto out_fput; + + ret = rw_verify_area(WRITE, file, &kiocb->ki_pos, + iov_iter_count(&iter)); + if (!ret) { + /* + * Open-code file_start_write here to grab freeze protection, + * which will be released by another thread in + * io_complete_rw(). Fool lockdep by telling it the lock got + * released so that it doesn't complain about the held lock when + * we return to userspace. 
+ */ + if (S_ISREG(file_inode(file)->i_mode)) { + __sb_start_write(file_inode(file)->i_sb, + SB_FREEZE_WRITE, true); + __sb_writers_release(file_inode(file)->i_sb, + SB_FREEZE_WRITE); + } + kiocb->ki_flags |= IOCB_WRITE; + io_rw_done(kiocb, call_write_iter(file, kiocb, &iter)); + } +out_fput: + if (unlikely(ret)) + fput(file); + return ret; +} + +static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, + struct sqe_submit *s, bool force_nonblock) +{ + const struct io_uring_sqe *sqe = s->sqe; + ssize_t ret; + + /* enforce forwards compatibility on users */ + if (unlikely(sqe->flags || sqe->__pad2)) + return -EINVAL; + + if (unlikely(s->index >= ctx->sq_entries)) + return -EINVAL; + req->ki_user_data = sqe->user_data; + + ret = -EINVAL; + switch (sqe->opcode) { + case IORING_OP_READV: + ret = io_read(req, sqe, force_nonblock); + break; + case IORING_OP_WRITEV: + ret = io_write(req, sqe, force_nonblock); + break; + default: + ret = -EINVAL; + break; + } + + return ret; +} + +static void io_sq_wq_submit_work(struct work_struct *work) +{ + struct io_kiocb *req = container_of(work, struct io_kiocb, work.work); + struct io_ring_ctx *ctx = req->ki_ctx; + mm_segment_t old_fs = get_fs(); + struct files_struct *old_files; + int ret; + + /* + * Ensure we clear previously set flags. even it NOWAIT was originally + * set, it's pointless now that we're in an async context. + */ + req->rw.ki_flags &= ~IOCB_NOWAIT; + req->ki_flags &= ~REQ_F_FORCE_NONBLOCK; + + old_files = current->files; + current->files = ctx->sqo_files; + + if (!mmget_not_zero(ctx->sqo_mm)) { + ret = -EFAULT; + goto err; + } + + use_mm(ctx->sqo_mm); + set_fs(USER_DS); + + ret = __io_submit_sqe(ctx, req, &req->work.submit, false); + + set_fs(old_fs); + unuse_mm(ctx->sqo_mm); + mmput(ctx->sqo_mm); +err: + if (ret) { + io_fill_cq_error(ctx, &req->work.submit, ret); + io_free_req(req); + } + current->files = old_files; +} + +static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s) +{ + struct io_kiocb *req; + ssize_t ret; + + req = io_get_req(ctx); + if (unlikely(!req)) + return -EAGAIN; + + ret = __io_submit_sqe(ctx, req, s, true); + if (ret == -EAGAIN) { + memcpy(&req->work.submit, s, sizeof(*s)); + INIT_WORK(&req->work.work, io_sq_wq_submit_work); + queue_work(ctx->sqo_wq, &req->work.work); + ret = 0; + } + if (ret) + io_free_req(req); + + return ret; +} + +static void io_inc_sqring(struct io_ring_ctx *ctx) +{ + struct io_sq_ring *ring = ctx->sq_ring; + + ring->r.head++; + smp_wmb(); +} + +static bool io_peek_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s) +{ + struct io_sq_ring *ring = ctx->sq_ring; + unsigned head; + + smp_rmb(); + head = READ_ONCE(ring->r.head); + if (head == READ_ONCE(ring->r.tail)) + return false; + + head = ring->array[head & ctx->sq_mask]; + if (head < ctx->sq_entries) { + s->index = head; + s->sqe = &ctx->sq_sqes[head]; + return true; + } + + /* drop invalid entries */ + ring->r.head++; + ring->dropped++; + smp_wmb(); + return false; +} + +static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) +{ + int i, ret = 0, submit = 0; + struct blk_plug plug; + + if (to_submit > IO_PLUG_THRESHOLD) + blk_start_plug(&plug); + + for (i = 0; i < to_submit; i++) { + struct sqe_submit s; + + if (!io_peek_sqring(ctx, &s)) + break; + + ret = io_submit_sqe(ctx, &s); + if (ret) + break; + + submit++; + io_inc_sqring(ctx); + } + + if (to_submit > IO_PLUG_THRESHOLD) + blk_finish_plug(&plug); + + return submit ? 
submit : ret; +} + +/* + * Wait until events become available, if we don't already have some. The + * application must reap them itself, as they reside on the shared cq ring. + */ +static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events) +{ + struct io_cq_ring *ring = ctx->cq_ring; + DEFINE_WAIT(wait); + int ret = 0; + + smp_rmb(); + if (ring->r.head != ring->r.tail) + return 0; + if (!min_events) + return 0; + + do { + prepare_to_wait(&ctx->wait, &wait, TASK_INTERRUPTIBLE); + + ret = 0; + smp_rmb(); + if (ring->r.head != ring->r.tail) + break; + + schedule(); + + ret = -EINTR; + if (signal_pending(current)) + break; + } while (1); + + finish_wait(&ctx->wait, &wait); + return ring->r.head == ring->r.tail ? ret : 0; +} + +static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit, + unsigned min_complete, unsigned flags) +{ + int ret = 0; + + if (to_submit) { + ret = io_ring_submit(ctx, to_submit); + if (ret < 0) + return ret; + } + if (flags & IORING_ENTER_GETEVENTS) { + int get_ret; + + if (!ret && to_submit) + min_complete = 0; + + get_ret = io_cqring_wait(ctx, min_complete); + if (get_ret < 0 && !ret) + ret = get_ret; + } + + return ret; +} + +static int io_sq_offload_start(struct io_ring_ctx *ctx) +{ + int ret; + + ctx->sqo_mm = current->mm; + + /* + * This is safe since 'current' has the fd installed, and if that + * gets closed on exit, then fops->release() is invoked which + * waits for the sq thread and sq workqueue to flush and exit + * before exiting. + */ + ret = -EBADF; + ctx->sqo_files = current->files; + if (!ctx->sqo_files) + goto err; + + /* Do QD, or 2 * CPUS, whatever is smallest */ + ctx->sqo_wq = alloc_workqueue("io_ring-wq", WQ_UNBOUND | WQ_FREEZABLE, + min(ctx->sq_entries - 1, 2 * num_online_cpus())); + if (!ctx->sqo_wq) { + ret = -ENOMEM; + goto err; + } + + return 0; +err: + if (ctx->sqo_files) + ctx->sqo_files = NULL; + ctx->sqo_mm = NULL; + return ret; +} + +static void io_sq_offload_stop(struct io_ring_ctx *ctx) +{ + if (ctx->sqo_wq) { + destroy_workqueue(ctx->sqo_wq); + ctx->sqo_wq = NULL; + } +} + +static void io_free_scq_urings(struct io_ring_ctx *ctx) +{ + if (ctx->sq_ring) { + page_frag_free(ctx->sq_ring); + ctx->sq_ring = NULL; + } + if (ctx->sq_sqes) { + page_frag_free(ctx->sq_sqes); + ctx->sq_sqes = NULL; + } + if (ctx->cq_ring) { + page_frag_free(ctx->cq_ring); + ctx->cq_ring = NULL; + } +} + +static void io_ring_ctx_free(struct io_ring_ctx *ctx) +{ + io_sq_offload_stop(ctx); + io_free_scq_urings(ctx); + percpu_ref_exit(&ctx->refs); + kfree(ctx); +} + +static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) +{ + percpu_ref_kill(&ctx->refs); + wait_for_completion(&ctx->ctx_done); + io_ring_ctx_free(ctx); +} + +static int io_scqring_release(struct inode *inode, struct file *file) +{ + struct io_ring_ctx *ctx = file->private_data; + + file->private_data = NULL; + io_ring_ctx_wait_and_kill(ctx); + return 0; +} + +static int io_scqring_mmap(struct file *file, struct vm_area_struct *vma) +{ + loff_t offset = (loff_t) vma->vm_pgoff << PAGE_SHIFT; + unsigned long sz = vma->vm_end - vma->vm_start; + struct io_ring_ctx *ctx = file->private_data; + unsigned long pfn; + struct page *page; + void *ptr; + + switch (offset) { + case IORING_OFF_SQ_RING: + ptr = ctx->sq_ring; + break; + case IORING_OFF_SQES: + ptr = ctx->sq_sqes; + break; + case IORING_OFF_CQ_RING: + ptr = ctx->cq_ring; + break; + default: + return -EINVAL; + } + + page = virt_to_head_page(ptr); + if (sz > (PAGE_SIZE << compound_order(page))) + return -EINVAL; + + pfn = 
virt_to_phys(ptr) >> PAGE_SHIFT; + return remap_pfn_range(vma, vma->vm_start, pfn, sz, vma->vm_page_prot); +} + +SYSCALL_DEFINE4(io_uring_enter, unsigned int, fd, u32, to_submit, + u32, min_complete, u32, flags) +{ + struct io_ring_ctx *ctx; + long ret = -EBADF; + struct fd f; + + f = fdget(fd); + if (!f.file) + return -EBADF; + + ret = -EOPNOTSUPP; + if (f.file->f_op != &io_scqring_fops) + goto out_fput; + + ret = -EINVAL; + ctx = f.file->private_data; + if (!percpu_ref_tryget(&ctx->refs)) + goto out_fput; + + ret = -EBUSY; + if (mutex_trylock(&ctx->uring_lock)) { + ret = __io_uring_enter(ctx, to_submit, min_complete, flags); + mutex_unlock(&ctx->uring_lock); + } + io_ring_drop_ctx_refs(ctx, 1); +out_fput: + fdput(f); + return ret; +} + +static const struct file_operations io_scqring_fops = { + .release = io_scqring_release, + .mmap = io_scqring_mmap, +}; + +static void *io_mem_alloc(size_t size) +{ + gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP | + __GFP_NORETRY; + + return (void *) __get_free_pages(gfp_flags, get_order(size)); +} + +static int io_allocate_scq_urings(struct io_ring_ctx *ctx, + struct io_uring_params *p) +{ + struct io_sq_ring *sq_ring; + struct io_cq_ring *cq_ring; + size_t size; + int ret; + + sq_ring = io_mem_alloc(struct_size(sq_ring, array, p->sq_entries)); + if (!sq_ring) + return -ENOMEM; + + ctx->sq_ring = sq_ring; + sq_ring->ring_mask = p->sq_entries - 1; + sq_ring->ring_entries = p->sq_entries; + ctx->sq_mask = sq_ring->ring_mask; + ctx->sq_entries = sq_ring->ring_entries; + + ret = -EOVERFLOW; + size = array_size(sizeof(struct io_uring_sqe), p->sq_entries); + if (size == SIZE_MAX) + goto err; + ret = -ENOMEM; + ctx->sq_sqes = io_mem_alloc(size); + if (!ctx->sq_sqes) + goto err; + + cq_ring = io_mem_alloc(struct_size(cq_ring, cqes, p->cq_entries)); + if (!cq_ring) + goto err; + + ctx->cq_ring = cq_ring; + cq_ring->ring_mask = p->cq_entries - 1; + cq_ring->ring_entries = p->cq_entries; + ctx->cq_mask = cq_ring->ring_mask; + ctx->cq_entries = cq_ring->ring_entries; + return 0; +err: + io_free_scq_urings(ctx); + return ret; +} + +static void io_fill_offsets(struct io_uring_params *p) +{ + memset(&p->sq_off, 0, sizeof(p->sq_off)); + p->sq_off.head = offsetof(struct io_sq_ring, r.head); + p->sq_off.tail = offsetof(struct io_sq_ring, r.tail); + p->sq_off.ring_mask = offsetof(struct io_sq_ring, ring_mask); + p->sq_off.ring_entries = offsetof(struct io_sq_ring, ring_entries); + p->sq_off.flags = offsetof(struct io_sq_ring, flags); + p->sq_off.dropped = offsetof(struct io_sq_ring, dropped); + p->sq_off.array = offsetof(struct io_sq_ring, array); + + memset(&p->cq_off, 0, sizeof(p->cq_off)); + p->cq_off.head = offsetof(struct io_cq_ring, r.head); + p->cq_off.tail = offsetof(struct io_cq_ring, r.tail); + p->cq_off.ring_mask = offsetof(struct io_cq_ring, ring_mask); + p->cq_off.ring_entries = offsetof(struct io_cq_ring, ring_entries); + p->cq_off.overflow = offsetof(struct io_cq_ring, overflow); + p->cq_off.cqes = offsetof(struct io_cq_ring, cqes); +} + +static int io_uring_create(unsigned entries, struct io_uring_params *p) +{ + struct io_ring_ctx *ctx; + int ret; + + /* + * Use twice as many entries for the CQ ring. It's possible for the + * application to drive a higher depth than the size of the SQ ring, + * since the sqes are only used at submission time. This allows for + * some flexibility in overcommitting a bit. 
+ */ + p->sq_entries = roundup_pow_of_two(entries); + p->cq_entries = 2 * p->sq_entries; + + ctx = io_ring_ctx_alloc(p); + if (!ctx) + return -ENOMEM; + + ret = io_allocate_scq_urings(ctx, p); + if (ret) + goto err; + + ret = io_sq_offload_start(ctx); + if (ret) + goto err; + + ret = anon_inode_getfd("[io_uring]", &io_scqring_fops, ctx, + O_RDWR | O_CLOEXEC); + if (ret < 0) + goto err; + + io_fill_offsets(p); + return ret; +err: + io_ring_ctx_wait_and_kill(ctx); + return ret; +} + +/* + * Sets up an aio uring context, and returns the fd. Applications asks for a + * ring size, we return the actual sq/cq ring sizes (among other things) in the + * params structure passed in. + */ +SYSCALL_DEFINE2(io_uring_setup, u32, entries, + struct io_uring_params __user *, params) +{ + struct io_uring_params p; + long ret; + int i; + + if (copy_from_user(&p, params, sizeof(p))) + return -EFAULT; + for (i = 0; i < ARRAY_SIZE(p.resv); i++) { + if (p.resv[i]) + return -EINVAL; + } + + if (p.flags) + return -EINVAL; + + ret = io_uring_create(entries, &p); + if (ret < 0) + return ret; + + if (copy_to_user(params, &p, sizeof(p))) + return -EFAULT; + + return ret; +} + +static int __init io_uring_setup(void) +{ + req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC); + return 0; +}; +__initcall(io_uring_setup); diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 257cccba3062..542757a4c898 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -69,6 +69,7 @@ struct file_handle; struct sigaltstack; struct rseq; union bpf_attr; +struct io_uring_params; #include #include @@ -309,6 +310,10 @@ asmlinkage long sys_io_pgetevents_time32(aio_context_t ctx_id, struct io_event __user *events, struct old_timespec32 __user *timeout, const struct __aio_sigset *sig); +asmlinkage long sys_io_uring_setup(u32 entries, + struct io_uring_params __user *p); +asmlinkage long sys_io_uring_enter(unsigned int fd, u32 to_submit, + u32 min_complete, u32 flags); /* fs/xattr.c */ asmlinkage long sys_setxattr(const char __user *path, const char __user *name, diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h new file mode 100644 index 000000000000..dbbfc02bc0a8 --- /dev/null +++ b/include/uapi/linux/io_uring.h @@ -0,0 +1,96 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +/* + * Header file for the io_uring interface. 
+ * + * Copyright (C) 2019 Jens Axboe + * Copyright (C) 2019 Christoph Hellwig + */ +#ifndef LINUX_IO_URING_H +#define LINUX_IO_URING_H + +#include +#include + +/* + * IO submission data structure (Submission Queue Entry) + */ +struct io_uring_sqe { + __u8 opcode; /* type of operation for this sqe */ + __u8 flags; /* as of now unused */ + __u16 ioprio; /* ioprio for the request */ + __s32 fd; /* file descriptor to do IO on */ + __u64 off; /* offset into file */ + union { + void *addr; /* buffer or iovecs */ + __u64 __pad; + }; + __u32 len; /* buffer size or number of iovecs */ + union { + __kernel_rwf_t rw_flags; + __u32 __resv; + }; + __u64 __pad2; + __u64 user_data; /* data to be passed back at completion time */ +}; + +#define IORING_OP_READV 1 +#define IORING_OP_WRITEV 2 + +/* + * IO completion data structure (Completion Queue Entry) + */ +struct io_uring_cqe { + __u64 user_data; /* sqe->data submission passed back */ + __s32 res; /* result code for this event */ + __u32 flags; +}; + +/* + * Magic offsets for the application to mmap the data it needs + */ +#define IORING_OFF_SQ_RING 0ULL +#define IORING_OFF_CQ_RING 0x8000000ULL +#define IORING_OFF_SQES 0x10000000ULL + +/* + * Filled with the offset for mmap(2) + */ +struct io_sqring_offsets { + __u32 head; + __u32 tail; + __u32 ring_mask; + __u32 ring_entries; + __u32 flags; + __u32 dropped; + __u32 array; + __u32 resv[3]; +}; + +struct io_cqring_offsets { + __u32 head; + __u32 tail; + __u32 ring_mask; + __u32 ring_entries; + __u32 overflow; + __u32 cqes; + __u32 resv[4]; +}; + +/* + * io_uring_enter(2) flags + */ +#define IORING_ENTER_GETEVENTS (1 << 0) + +/* + * Passed in for io_uring_setup(2). Copied back with updated info on success + */ +struct io_uring_params { + __u32 sq_entries; + __u32 cq_entries; + __u32 flags; + __u16 resv[10]; + struct io_sqring_offsets sq_off; + struct io_cqring_offsets cq_off; +}; + +#endif diff --git a/init/Kconfig b/init/Kconfig index d47cb77a220e..ce7bd7af9312 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1402,6 +1402,15 @@ config AIO by some high performance threaded applications. Disabling this option saves about 7k. +config IO_URING + bool "Enable IO uring support" if EXPERT + select ANON_INODES + default y + help + This option enables support for the io_uring interface, enabling + applications to submit and completion IO through submission and + completion rings that are shared between the kernel and application. 
+ config ADVISE_SYSCALLS bool "Enable madvise/fadvise syscalls" if EXPERT default y diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index ab9d0e3c6d50..ee5e523564bb 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -46,6 +46,8 @@ COND_SYSCALL(io_getevents); COND_SYSCALL(io_pgetevents); COND_SYSCALL_COMPAT(io_getevents); COND_SYSCALL_COMPAT(io_pgetevents); +COND_SYSCALL(io_uring_setup); +COND_SYSCALL(io_uring_enter); /* fs/xattr.c */ From patchwork Sat Jan 12 21:30:01 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jens Axboe X-Patchwork-Id: 10761115 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id BC6CD1390 for ; Sat, 12 Jan 2019 21:30:34 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id AF4D12902C for ; Sat, 12 Jan 2019 21:30:34 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id A29D329053; Sat, 12 Jan 2019 21:30:34 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 46D622902C for ; Sat, 12 Jan 2019 21:30:34 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726640AbfALVac (ORCPT ); Sat, 12 Jan 2019 16:30:32 -0500 Received: from mail-pg1-f193.google.com ([209.85.215.193]:41818 "EHLO mail-pg1-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726613AbfALVac (ORCPT ); Sat, 12 Jan 2019 16:30:32 -0500 Received: by mail-pg1-f193.google.com with SMTP id m1so7795678pgq.8 for ; Sat, 12 Jan 2019 13:30:31 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=w5uR07zetzxo27QPcX+8gbKKzHAS9hE9QvWCuwzf6nE=; b=YU+0DL/6mX1sJP+4YIy+KvA0AxUV/tj8DPdL9gKSyiBjbukDVykbLVyi3MQBPJ+qGA 7xATeZ30R21zmEnW6Fv0jyjxNIY9LEsfLJnEWRYvlOyTxtHz8ays/qEfcVvmgQvO8ObJ dy0COMHRsgcOTCX9L2K32E8XZjg47hizxdFsRl/wkpIiu/X9AVgWhirX0IDoQ41LvRYH kKeVxlL/ejdxrzv1bq4jUUaHz5vgrydJy7GI3avAfty40WYHxWJRt+zev3XtruN12x5q BnyKW4I5rt0RH9kBQh7EYPB4uTl0Y01F5F/ovHnqgpJNiK+7bOz2eJ4XAvkH1goD681q wk6w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=w5uR07zetzxo27QPcX+8gbKKzHAS9hE9QvWCuwzf6nE=; b=K2h7iRSmr35Fqzl2Y0x9H3CizOqQ7ULeXT9Yzg1Xf2VKA+EWTN5CBo5+V9//+XQQW9 mFz0m5Coj833bDrxxJaEBlZyLQXFCooGv0uyiqeL8pIDq7Ln6NNeK/yo8lREFb+acby7 mAvDeNn5N7v+B2SGNRo3nr+ztfitKzTlSntHIiXMKJ8Y7DdPrsnxjVD6K5dCpKYXfHjY hjodDgxIiQnynZOBAfUBEEUuQvqEh+P2JBtSDcnPULOOrdkm+rLJ7jHwfjlqOAkBfN34 HzoSViQRVx3DYrVB20bde8w9Ily+smb5B06mFTcJm6wfrjoESuIGNww2iDMyqw7bVmRq SBNA== X-Gm-Message-State: AJcUukdxcEAn9cQxQcuvD1iLjE5Dqcdc8TuN0wNgDZ2v/AoadNXFL0hM XlcpaKalRZR2PCE7Qbcdb0Zy2OGTETL9fg== X-Google-Smtp-Source: ALg8bN5ltFW7RAcMKhNgpHu10PFFLBjDEALl13mOHMxkwn8xDlhCrsveqCXLg8y//DVQvseAKgZRQQ== X-Received: by 2002:a63:5207:: with SMTP id g7mr18161211pgb.253.1547328631145; Sat, 12 Jan 2019 13:30:31 -0800 (PST) Received: from x1.localdomain 
(66.29.188.166.static.utbb.net. [66.29.188.166]) by smtp.gmail.com with ESMTPSA id y6sm151629818pfd.104.2019.01.12.13.30.29 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sat, 12 Jan 2019 13:30:30 -0800 (PST) From: Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe Subject: [PATCH 06/16] io_uring: add fsync support Date: Sat, 12 Jan 2019 14:30:01 -0700 Message-Id: <20190112213011.1439-7-axboe@kernel.dk> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190112213011.1439-1-axboe@kernel.dk> References: <20190112213011.1439-1-axboe@kernel.dk> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Christoph Hellwig Add a new fsync opcode, which either syncs a range if one is passed, or the whole file if the offset and length fields are both cleared to zero. A flag is provided to use fdatasync semantics, that is only force out metadata which is required to retrieve the file data, but not others like metadata. Signed-off-by: Christoph Hellwig Signed-off-by: Jens Axboe --- fs/io_uring.c | 33 +++++++++++++++++++++++++++++++++ include/uapi/linux/io_uring.h | 8 +++++++- 2 files changed, 40 insertions(+), 1 deletion(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index df8fe19cdb74..c51d429faef1 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -419,6 +419,36 @@ static ssize_t io_write(struct io_kiocb *req, const struct io_uring_sqe *sqe, return ret; } +static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe, + bool force_nonblock) +{ + struct io_ring_ctx *ctx = req->ki_ctx; + loff_t end = sqe->off + sqe->len; + struct file *file; + int ret; + + /* fsync always requires a blocking context */ + if (force_nonblock) + return -EAGAIN; + + if (unlikely(sqe->addr)) + return -EINVAL; + if (unlikely(sqe->fsync_flags & ~IORING_FSYNC_DATASYNC)) + return -EINVAL; + + file = fget(sqe->fd); + if (unlikely(!file)) + return -EBADF; + + ret = vfs_fsync_range(file, sqe->off, end > 0 ? 
end : LLONG_MAX, + sqe->fsync_flags & IORING_FSYNC_DATASYNC); + + fput(file); + io_cqring_fill_event(ctx, sqe->user_data, ret, 0); + io_free_req(req); + return 0; +} + static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, struct sqe_submit *s, bool force_nonblock) { @@ -441,6 +471,9 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, case IORING_OP_WRITEV: ret = io_write(req, sqe, force_nonblock); break; + case IORING_OP_FSYNC: + ret = io_fsync(req, sqe, force_nonblock); + break; default: ret = -EINVAL; break; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index dbbfc02bc0a8..72e8f5f13f3b 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -27,7 +27,7 @@ struct io_uring_sqe { __u32 len; /* buffer size or number of iovecs */ union { __kernel_rwf_t rw_flags; - __u32 __resv; + __u32 fsync_flags; }; __u64 __pad2; __u64 user_data; /* data to be passed back at completion time */ @@ -35,6 +35,12 @@ struct io_uring_sqe { #define IORING_OP_READV 1 #define IORING_OP_WRITEV 2 +#define IORING_OP_FSYNC 3 + +/* + * sqe->fsync_flags + */ +#define IORING_FSYNC_DATASYNC (1 << 0) /* * IO completion data structure (Completion Queue Entry) From patchwork Sat Jan 12 21:30:02 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jens Axboe X-Patchwork-Id: 10761119 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 91B4E1580 for ; Sat, 12 Jan 2019 21:30:37 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 82DB92901C for ; Sat, 12 Jan 2019 21:30:37 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 771112904F; Sat, 12 Jan 2019 21:30:37 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 885BC2902C for ; Sat, 12 Jan 2019 21:30:36 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726641AbfALVaf (ORCPT ); Sat, 12 Jan 2019 16:30:35 -0500 Received: from mail-pl1-f194.google.com ([209.85.214.194]:38564 "EHLO mail-pl1-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726622AbfALVae (ORCPT ); Sat, 12 Jan 2019 16:30:34 -0500 Received: by mail-pl1-f194.google.com with SMTP id e5so8353543plb.5 for ; Sat, 12 Jan 2019 13:30:33 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=suY6akPkbncRRmcEMg8/yONRomoiitvfJhvarjmZOzU=; b=o0IumRvRYcZupfNJA6cU4IAP1rFzhqp3rbEmHyvXute8R4XPmVjHXB95cPiBxxe+BL i9yGVW5qnGKUvRbP+7Co6DCLioLUllEfV51aA27YyqGKsw59h3WYpSftWFi3YBI2hv8d pC3ZTJ3og+AnASQBm6JWv5qA7SvC/YZd+f1paKe93MamKfrqfQikHzTt8unVUz9ceqZk 4xENOwQ2p53Vg5R89yf0uR5evh2KkMpSv/Yp4KEdZ8FBvv9iLLvyQ1JmUIwp9Xb4LMTu 26rF2XXBfngWsyU5cEfBicB5yishWDTCHwgGJcr9TbXYR5yewEE2aZw9BqJHCMKnmDKT f13A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; 
h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=suY6akPkbncRRmcEMg8/yONRomoiitvfJhvarjmZOzU=; b=msfHR+V8ipXjZjDcUFJH8OfLsMNlkc5nUGAzhxxpQCf8HQocLU03LEdN+OYlGXMP/D Tc0UYRuuK+9sb2ZqSjQX0QFjgLkD4mFLZ93AieO6jbO4ebFQOkUMpkl5jBkdwr1PJwJh bl/XgQ9bvP9oTfcEHlDW9bTm+ZmEDUtslXf3FlxxHYsCix7sfEvTloxLL2FI0UZJAIYw Ob3d/GfpWDtvWKY6zaddMTGeT7OntIM9NFDpOWT90DlHRv0gwyLvd5eMdeAhhq/jZzM6 HvObCpAMNQUtIXiuLm6kEs1wGlFdbG/ag7urK3ZYUkbHfbpl3JVR1dWbefrsWCxdmBCO Glzg== X-Gm-Message-State: AJcUukdM4ABpuQV3V24VMNAAnxFyUqr88fsPMiYUIEmsvVipx5FtbWrg fVv+hSsAFUXPstyXEzTv/d4J7f7DZMKJaA== X-Google-Smtp-Source: ALg8bN73huroCOgB1SFZYcJ3fobrBnrNWa22LTnqIww41zRRpjzjUzg7SiY2/t+KoDC7aZVsvii/LQ== X-Received: by 2002:a17:902:5a5:: with SMTP id f34mr19996393plf.161.1547328632960; Sat, 12 Jan 2019 13:30:32 -0800 (PST) Received: from x1.localdomain (66.29.188.166.static.utbb.net. [66.29.188.166]) by smtp.gmail.com with ESMTPSA id y6sm151629818pfd.104.2019.01.12.13.30.31 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sat, 12 Jan 2019 13:30:32 -0800 (PST) From: Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe Subject: [PATCH 07/16] io_uring: support for IO polling Date: Sat, 12 Jan 2019 14:30:02 -0700 Message-Id: <20190112213011.1439-8-axboe@kernel.dk> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190112213011.1439-1-axboe@kernel.dk> References: <20190112213011.1439-1-axboe@kernel.dk> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Add support for a polled io_uring context. When a read or write is submitted to a polled context, the application must poll for completions on the CQ ring through io_uring_enter(2). Polled IO may not generate IRQ completions, hence they need to be actively found by the application itself. To use polling, io_uring_setup() must be used with the IORING_SETUP_IOPOLL flag being set. It is illegal to mix and match polled and non-polled IO on an io_uring. 
Signed-off-by: Jens Axboe --- fs/io_uring.c | 306 ++++++++++++++++++++++++++++++++-- include/uapi/linux/io_uring.h | 5 + 2 files changed, 297 insertions(+), 14 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index c51d429faef1..eb6550f135e3 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -55,6 +55,11 @@ struct io_cq_ring { struct io_uring_cqe cqes[]; }; +struct list_multi { + struct list_head list; + unsigned multi; +}; + struct io_ring_ctx { struct percpu_ref refs; @@ -79,13 +84,17 @@ struct io_ring_ctx { struct completion ctx_done; + /* iopoll submission state */ struct { - struct mutex uring_lock; - wait_queue_head_t wait; + spinlock_t poll_lock; + struct list_multi poll_submitted; } ____cacheline_aligned_in_smp; struct { + struct list_multi poll_completing; spinlock_t completion_lock; + struct mutex uring_lock; + wait_queue_head_t wait; } ____cacheline_aligned_in_smp; }; @@ -109,10 +118,13 @@ struct io_kiocb { struct list_head ki_list; unsigned long ki_flags; #define REQ_F_FORCE_NONBLOCK 1 /* inline submission attempt */ +#define REQ_F_IOPOLL_COMPLETED 2 /* polled IO has completed */ +#define REQ_F_IOPOLL_EAGAIN 4 /* submission got EAGAIN */ u64 ki_user_data; }; #define IO_PLUG_THRESHOLD 2 +#define IO_IOPOLL_BATCH 8 static struct kmem_cache *req_cachep; @@ -144,6 +156,9 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) spin_lock_init(&ctx->completion_lock); init_waitqueue_head(&ctx->wait); + spin_lock_init(&ctx->poll_lock); + INIT_LIST_HEAD(&ctx->poll_submitted.list); + INIT_LIST_HEAD(&ctx->poll_completing.list); mutex_init(&ctx->uring_lock); return ctx; @@ -195,12 +210,209 @@ static void io_ring_drop_ctx_refs(struct io_ring_ctx *ctx, unsigned refs) wake_up(&ctx->wait); } +static void io_free_req_many(struct io_ring_ctx *ctx, void **reqs, int *nr) +{ + if (*nr) { + kmem_cache_free_bulk(req_cachep, *nr, reqs); + io_ring_drop_ctx_refs(ctx, *nr); + *nr = 0; + } +} + static void io_free_req(struct io_kiocb *req) { kmem_cache_free(req_cachep, req); io_ring_drop_ctx_refs(req->ki_ctx, 1); } +static void io_multi_list_splice(struct list_multi *src, struct list_multi *dst) +{ + list_splice_tail_init(&src->list, &dst->list); + dst->multi |= src->multi; +} + +/* + * Track whether we have multiple files in our lists. This will impact how + * we do polling eventually, not spinning if we're on potentially on different + * devices. + */ +static void io_multi_list_add(struct io_kiocb *req, struct list_multi *list) +{ + if (list_empty(&list->list)) { + list->multi = 0; + } else if (!list->multi) { + struct io_kiocb *list_req; + + list_req = list_first_entry(&list->list, struct io_kiocb, + ki_list); + if (list_req->rw.ki_filp != req->rw.ki_filp) + list->multi = 1; + } + + /* + * For fast devices, IO may have already completed. If it has, add + * it to the front so we find it first. We can't add to the poll_done + * list as that's unlocked from the completion side. 
+ */ + if (req->ki_flags & REQ_F_IOPOLL_COMPLETED) + list_add(&req->ki_list, &list->list); + else + list_add_tail(&req->ki_list, &list->list); +} + +/* + * Find and free completed poll iocbs + */ +static void io_iopoll_reap(struct io_ring_ctx *ctx, unsigned int *nr_events) +{ + void *reqs[IO_IOPOLL_BATCH]; + struct io_kiocb *req, *n; + int to_free = 0; + + list_for_each_entry_safe(req, n, &ctx->poll_completing.list, ki_list) { + if (!(req->ki_flags & REQ_F_IOPOLL_COMPLETED)) + continue; + if (to_free == ARRAY_SIZE(reqs)) + io_free_req_many(ctx, reqs, &to_free); + + list_del(&req->ki_list); + reqs[to_free++] = req; + + fput(req->rw.ki_filp); + (*nr_events)++; + } + + if (to_free) + io_free_req_many(ctx, reqs, &to_free); +} + +static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events, + long min) +{ + int polled, found, ret; + struct io_kiocb *req; + bool spin; + + /* + * Only spin for completions if we don't have multiple devices hanging + * off our complete list, and we're under the requested amount. + */ + spin = !ctx->poll_completing.multi && (*nr_events < min); + + polled = found = 0; + list_for_each_entry(req, &ctx->poll_completing.list, ki_list) { + struct kiocb *kiocb = &req->rw; + + if (req->ki_flags & REQ_F_IOPOLL_COMPLETED) { + /* force caller reap */ + found = 0; + break; + } + + found++; + ret = kiocb->ki_filp->f_op->iopoll(kiocb, spin); + if (ret < 0) + return ret; + + polled += ret; + if (polled && spin) + spin = false; + } + + return found; +} + +/* + * Poll for a mininum of 'min' events, and a maximum of 'max'. Note that if + * min == 0 we consider that a non-spinning poll check - we'll still enter + * the driver poll loop, but only as a non-spinning completion check. + */ +static int io_iopoll_getevents(struct io_ring_ctx *ctx, unsigned int *nr_events, + long min) +{ + int found; + + /* + * Check if we already have done events that satisfy what we need + */ + if (!list_empty(&ctx->poll_completing.list)) { + io_iopoll_reap(ctx, nr_events); + if (min && *nr_events >= min) + return 0; + } + + /* + * Take in a new working set from the submitted list, if possible. + */ + if (!list_empty_careful(&ctx->poll_submitted.list)) { + spin_lock(&ctx->poll_lock); + io_multi_list_splice(&ctx->poll_submitted, + &ctx->poll_completing); + spin_unlock(&ctx->poll_lock); + } + + if (list_empty(&ctx->poll_completing.list)) + return 0; + + /* + * Check again now that we have a new batch. + */ + io_iopoll_reap(ctx, nr_events); + if (min && *nr_events >= min) + return 0; + + do { + found = io_do_iopoll(ctx, nr_events, min); + if (found <= 0) + break; + } while (ctx->poll_completing.multi && (min && *nr_events >= min)); + + io_iopoll_reap(ctx, nr_events); + if (*nr_events >= min) + return 0; + + return found; +} + +/* + * We can't just wait for polled events to come to us, we have to actively + * find and complete them. 
+ */ +static void io_iopoll_reap_events(struct io_ring_ctx *ctx) +{ + if (!(ctx->flags & IORING_SETUP_IOPOLL)) + return; + + mutex_lock(&ctx->uring_lock); + while (!list_empty_careful(&ctx->poll_submitted.list) || + !list_empty(&ctx->poll_completing.list)) { + unsigned int nr_events = 0; + + io_iopoll_getevents(ctx, &nr_events, 1); + } + mutex_unlock(&ctx->uring_lock); +} + +static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned *nr_events, + long min) +{ + int ret = 0; + + while (!*nr_events || !need_resched()) { + int tmin = 0; + + if (*nr_events < min) + tmin = min - *nr_events; + + ret = io_iopoll_getevents(ctx, nr_events, tmin); + if (ret <= 0) + break; + ret = 0; + } + + return ret; +} + static void kiocb_end_write(struct kiocb *kiocb) { if (kiocb->ki_flags & IOCB_WRITE) { @@ -216,18 +428,16 @@ static void kiocb_end_write(struct kiocb *kiocb) } } -static void io_cqring_fill_event(struct io_ring_ctx *ctx, u64 ki_user_data, - long res, unsigned ev_flags) +static void __io_cqring_fill_event(struct io_ring_ctx *ctx, u64 ki_user_data, + long res, unsigned ev_flags) { struct io_uring_cqe *cqe; - unsigned long flags; /* * If we can't get a cq entry, userspace overflowed the * submission (by quite a lot). Increment the overflow count in * the ring. */ - spin_lock_irqsave(&ctx->completion_lock, flags); cqe = io_peek_cqring(ctx); if (cqe) { cqe->user_data = ki_user_data; @@ -237,6 +447,15 @@ static void io_cqring_fill_event(struct io_ring_ctx *ctx, u64 ki_user_data, io_inc_cqring(ctx); } else ctx->cq_ring->overflow++; +} + +static void io_cqring_fill_event(struct io_ring_ctx *ctx, u64 ki_user_data, + long res, unsigned ev_flags) +{ + unsigned long flags; + + spin_lock_irqsave(&ctx->completion_lock, flags); + __io_cqring_fill_event(ctx, ki_user_data, res, ev_flags); spin_unlock_irqrestore(&ctx->completion_lock, flags); } @@ -260,9 +479,41 @@ static void io_complete_scqring_rw(struct kiocb *kiocb, long res, long res2) io_free_req(req); } +static void io_complete_scqring_iopoll(struct kiocb *kiocb, long res, long res2) +{ + struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw); + + kiocb_end_write(kiocb); + + if (unlikely(res == -EAGAIN)) { + req->ki_flags |= REQ_F_IOPOLL_EAGAIN; + } else { + struct io_ring_ctx *ctx = req->ki_ctx; + + __io_cqring_fill_event(ctx, req->ki_user_data, res, 0); + req->ki_flags |= REQ_F_IOPOLL_COMPLETED; + } +} + +/* + * After the iocb has been issued, it's safe to be found on the poll list. + * Adding the kiocb to the list AFTER submission ensures that we don't + * find it from a io_getevents() thread before the issuer is done accessing + * the kiocb cookie. 
+ */ +static void io_iopoll_req_issued(struct io_kiocb *req) +{ + struct io_ring_ctx *ctx = req->ki_ctx; + + spin_lock(&ctx->poll_lock); + io_multi_list_add(req, &ctx->poll_submitted); + spin_unlock(&ctx->poll_lock); +} + static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, bool force_nonblock) { + struct io_ring_ctx *ctx = req->ki_ctx; struct kiocb *kiocb = &req->rw; int ret; @@ -288,12 +539,21 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, kiocb->ki_flags |= IOCB_NOWAIT; req->ki_flags |= REQ_F_FORCE_NONBLOCK; } - if (kiocb->ki_flags & IOCB_HIPRI) { - ret = -EINVAL; - goto out_fput; - } + if (ctx->flags & IORING_SETUP_IOPOLL) { + ret = -EOPNOTSUPP; + if (!(kiocb->ki_flags & IOCB_DIRECT) || + !kiocb->ki_filp->f_op->iopoll) + goto out_fput; - kiocb->ki_complete = io_complete_scqring_rw; + kiocb->ki_flags |= IOCB_HIPRI; + kiocb->ki_complete = io_complete_scqring_iopoll; + } else { + if (kiocb->ki_flags & IOCB_HIPRI) { + ret = -EINVAL; + goto out_fput; + } + kiocb->ki_complete = io_complete_scqring_rw; + } return 0; out_fput: fput(kiocb->ki_filp); @@ -431,6 +691,8 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (force_nonblock) return -EAGAIN; + if (unlikely(req->ki_ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL; if (unlikely(sqe->addr)) return -EINVAL; if (unlikely(sqe->fsync_flags & ~IORING_FSYNC_DATASYNC)) @@ -479,7 +741,16 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, break; } - return ret; + if (ret) + return ret; + + if (ctx->flags & IORING_SETUP_IOPOLL) { + if (req->ki_flags & REQ_F_IOPOLL_EAGAIN) + return -EAGAIN; + io_iopoll_req_issued(req); + } + + return 0; } static void io_sq_wq_submit_work(struct work_struct *work) @@ -649,12 +920,17 @@ static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit, return ret; } if (flags & IORING_ENTER_GETEVENTS) { + unsigned nr_events = 0; int get_ret; if (!ret && to_submit) min_complete = 0; - get_ret = io_cqring_wait(ctx, min_complete); + if (ctx->flags & IORING_SETUP_IOPOLL) + get_ret = io_iopoll_check(ctx, &nr_events, + min_complete); + else + get_ret = io_cqring_wait(ctx, min_complete); if (get_ret < 0 && !ret) ret = get_ret; } @@ -722,6 +998,7 @@ static void io_free_scq_urings(struct io_ring_ctx *ctx) static void io_ring_ctx_free(struct io_ring_ctx *ctx) { io_sq_offload_stop(ctx); + io_iopoll_reap_events(ctx); io_free_scq_urings(ctx); percpu_ref_exit(&ctx->refs); kfree(ctx); @@ -730,6 +1007,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx) static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) { percpu_ref_kill(&ctx->refs); + io_iopoll_reap_events(ctx); wait_for_completion(&ctx->ctx_done); io_ring_ctx_free(ctx); } @@ -937,7 +1215,7 @@ SYSCALL_DEFINE2(io_uring_setup, u32, entries, return -EINVAL; } - if (p.flags) + if (p.flags & ~IORING_SETUP_IOPOLL) return -EINVAL; ret = io_uring_create(entries, &p); diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 72e8f5f13f3b..506498cc913e 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -33,6 +33,11 @@ struct io_uring_sqe { __u64 user_data; /* data to be passed back at completion time */ }; +/* + * io_uring_setup() flags + */ +#define IORING_SETUP_IOPOLL (1 << 0) /* io_context is polled */ + #define IORING_OP_READV 1 #define IORING_OP_WRITEV 2 #define IORING_OP_FSYNC 3 From patchwork Sat Jan 12 21:30:03 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 
Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jens Axboe X-Patchwork-Id: 10761125 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 7B64B1390 for ; Sat, 12 Jan 2019 21:30:39 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 6F18D2901C for ; Sat, 12 Jan 2019 21:30:39 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 6365C2904F; Sat, 12 Jan 2019 21:30:39 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id CE7E62901C for ; Sat, 12 Jan 2019 21:30:38 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726644AbfALVah (ORCPT ); Sat, 12 Jan 2019 16:30:37 -0500 Received: from mail-pf1-f196.google.com ([209.85.210.196]:34345 "EHLO mail-pf1-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726645AbfALVag (ORCPT ); Sat, 12 Jan 2019 16:30:36 -0500 Received: by mail-pf1-f196.google.com with SMTP id h3so8574309pfg.1 for ; Sat, 12 Jan 2019 13:30:35 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=MVJAbM+vSTtres5bBRghBNm7ZqHvbUQv8b3wzymGHNw=; b=IsPH4iVmlUo/OS/q7KM54nOy9XO9/f/OEyvLU+0ZHrX64zJWpgGDZxYforyLWMv+e9 87P+X+BJvUSrN9C9dvbPB2kBlT77BTqgMfBJ5Ppq+Zr0l46CKWxKiOQzO2aIS2l1e7Vk sytIRBOodbuJ4qeXqbGOQkDYkVlL+NBAxf6YKRCYAAEvVGULBA66QV+IzCWX5na7EfY1 S71XoYiL9cgwf8FaoVo0MZY6Dj5QkSqaFkwonPXHV0T562GBc2xJqTlYmGXrPVoK544j 81oOxGkcf+VY67fHdLRbpL2ctYY4Ed73jbkA9P8vsxxtLhC+xruWapyP/Tz60z68heQj /4vQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=MVJAbM+vSTtres5bBRghBNm7ZqHvbUQv8b3wzymGHNw=; b=sUDuJcVcdjQC7pssKmRe/lhfh4mjhhZelbbuWla/JgPeLMRA02jJ+9UnOLLL70LiuO yvru5S1r0C+pdeKXO/hgbeeG6mz8Hg3CCaPXr5f7moTuE3X6MvFQDWF4/117EoUeHaQm G9Mlj1R7flNi7O/XDlYwOqUnJFhHz1Znvvk5Cc9wtoj7OVOUrxMGB3CLTK0utmzWYW5W aVRtzqD/1yNL3uGW7tpELNVBsRMBwZmXnVwI24HNsA+O6pR7QacrcTEdYPcypof5Of2a NlrgigniK1mzC/fT+DnUFS/Uz+m89FxcYt0n/nnwVKO/gzJzHdEQe98wwIL587M0OegN 2kyQ== X-Gm-Message-State: AJcUukdVa6iBsuuAW2RaH7CbT4lgxTZ2jwq5DROGFQoOwgjf1BqFK99h j7DK9EPGc1q3iPrQg49N6AI2feNjtdRonQ== X-Google-Smtp-Source: ALg8bN5UL3TwRnPOCK30P0y38os1x8V59hNxWQBJPqCkrqssbPKSeuVQbBtZCYXEFRnLL9n3VMIyKA== X-Received: by 2002:a63:8742:: with SMTP id i63mr17828540pge.298.1547328634748; Sat, 12 Jan 2019 13:30:34 -0800 (PST) Received: from x1.localdomain (66.29.188.166.static.utbb.net. 
[66.29.188.166]) by smtp.gmail.com with ESMTPSA id y6sm151629818pfd.104.2019.01.12.13.30.33 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sat, 12 Jan 2019 13:30:33 -0800 (PST) From: Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe Subject: [PATCH 08/16] io_uring: add submission side request cache Date: Sat, 12 Jan 2019 14:30:03 -0700 Message-Id: <20190112213011.1439-9-axboe@kernel.dk> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190112213011.1439-1-axboe@kernel.dk> References: <20190112213011.1439-1-axboe@kernel.dk> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP We have to add each submitted polled request to the io_ring_ctx poll_submitted list, which means we have to grab the poll_lock. We already use the block plug to batch submissions if we're doing a batch of IO submissions, extend that to cover the poll requests internally as well. Signed-off-by: Jens Axboe --- fs/io_uring.c | 122 +++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 105 insertions(+), 17 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index eb6550f135e3..93905ce360bb 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -126,6 +126,21 @@ struct io_kiocb { #define IO_PLUG_THRESHOLD 2 #define IO_IOPOLL_BATCH 8 +struct io_submit_state { + struct io_ring_ctx *ctx; + + struct blk_plug plug; +#ifdef CONFIG_BLOCK + struct blk_plug_cb plug_cb; +#endif + + /* + * Polled iocbs that have been submitted, but not added to the ctx yet + */ + struct list_multi req_list; + unsigned int req_count; +}; + static struct kmem_cache *req_cachep; static const struct file_operations io_scqring_fops; @@ -495,19 +510,51 @@ static void io_complete_scqring_iopoll(struct kiocb *kiocb, long res, long res2) } } + /* + * Called either at the end of IO submission, or through a plug callback + * because we're going to schedule. Moves out local batch of requests to + * the ctx poll list, so they can be found for polling + reaping. + */ +static void io_flush_state_reqs(struct io_ring_ctx *ctx, + struct io_submit_state *state) +{ + spin_lock(&ctx->poll_lock); + io_multi_list_splice(&state->req_list, &ctx->poll_submitted); + spin_unlock(&ctx->poll_lock); + + state->req_count = 0; +} + +static void io_iopoll_req_add_list(struct io_kiocb *req) +{ + struct io_ring_ctx *ctx = req->ki_ctx; + + spin_lock(&ctx->poll_lock); + io_multi_list_add(req, &ctx->poll_submitted); + spin_unlock(&ctx->poll_lock); +} + +static void io_iopoll_req_add_state(struct io_submit_state *state, + struct io_kiocb *req) +{ + io_multi_list_add(req, &state->req_list); + if (++state->req_count >= IO_IOPOLL_BATCH) + io_flush_state_reqs(state->ctx, state); +} + /* * After the iocb has been issued, it's safe to be found on the poll list. * Adding the kiocb to the list AFTER submission ensures that we don't * find it from a io_getevents() thread before the issuer is done accessing * the kiocb cookie. 
*/ -static void io_iopoll_req_issued(struct io_kiocb *req) +static void io_iopoll_req_issued(struct io_submit_state *state, + struct io_kiocb *req) { - struct io_ring_ctx *ctx = req->ki_ctx; - - spin_lock(&ctx->poll_lock); - io_multi_list_add(req, &ctx->poll_submitted); - spin_unlock(&ctx->poll_lock); + if (!state || !IS_ENABLED(CONFIG_BLOCK)) + io_iopoll_req_add_list(req); + else + io_iopoll_req_add_state(state, req); } static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, @@ -712,7 +759,8 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe, } static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, - struct sqe_submit *s, bool force_nonblock) + struct sqe_submit *s, bool force_nonblock, + struct io_submit_state *state) { const struct io_uring_sqe *sqe = s->sqe; ssize_t ret; @@ -747,7 +795,7 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, if (ctx->flags & IORING_SETUP_IOPOLL) { if (req->ki_flags & REQ_F_IOPOLL_EAGAIN) return -EAGAIN; - io_iopoll_req_issued(req); + io_iopoll_req_issued(state, req); } return 0; @@ -779,7 +827,7 @@ static void io_sq_wq_submit_work(struct work_struct *work) use_mm(ctx->sqo_mm); set_fs(USER_DS); - ret = __io_submit_sqe(ctx, req, &req->work.submit, false); + ret = __io_submit_sqe(ctx, req, &req->work.submit, false, NULL); set_fs(old_fs); unuse_mm(ctx->sqo_mm); @@ -792,7 +840,8 @@ static void io_sq_wq_submit_work(struct work_struct *work) current->files = old_files; } -static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s) +static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s, + struct io_submit_state *state) { struct io_kiocb *req; ssize_t ret; @@ -801,7 +850,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s) if (unlikely(!req)) return -EAGAIN; - ret = __io_submit_sqe(ctx, req, s, true); + ret = __io_submit_sqe(ctx, req, s, true, state); if (ret == -EAGAIN) { memcpy(&req->work.submit, s, sizeof(*s)); INIT_WORK(&req->work.work, io_sq_wq_submit_work); @@ -814,6 +863,43 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s) return ret; } +#ifdef CONFIG_BLOCK +static void io_state_unplug(struct blk_plug_cb *cb, bool from_schedule) +{ + struct io_submit_state *state; + + state = container_of(cb, struct io_submit_state, plug_cb); + if (!list_empty(&state->req_list.list)) + io_flush_state_reqs(state->ctx, state); +} +#endif + +/* + * Batched submission is done, ensure local IO is flushed out. + */ +static void io_submit_state_end(struct io_submit_state *state) +{ + blk_finish_plug(&state->plug); + if (!list_empty(&state->req_list.list)) + io_flush_state_reqs(state->ctx, state); +} + +/* + * Start submission side cache. 
+ */ +static void io_submit_state_start(struct io_submit_state *state, + struct io_ring_ctx *ctx) +{ + state->ctx = ctx; + INIT_LIST_HEAD(&state->req_list.list); + state->req_count = 0; +#ifdef CONFIG_BLOCK + state->plug_cb.callback = io_state_unplug; + blk_start_plug(&state->plug); + list_add(&state->plug_cb.list, &state->plug.cb_list); +#endif +} + static void io_inc_sqring(struct io_ring_ctx *ctx) { struct io_sq_ring *ring = ctx->sq_ring; @@ -848,11 +934,13 @@ static bool io_peek_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s) static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) { + struct io_submit_state state, *statep = NULL; int i, ret = 0, submit = 0; - struct blk_plug plug; - if (to_submit > IO_PLUG_THRESHOLD) - blk_start_plug(&plug); + if (to_submit > IO_PLUG_THRESHOLD) { + io_submit_state_start(&state, ctx); + statep = &state; + } for (i = 0; i < to_submit; i++) { struct sqe_submit s; @@ -860,7 +948,7 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) if (!io_peek_sqring(ctx, &s)) break; - ret = io_submit_sqe(ctx, &s); + ret = io_submit_sqe(ctx, &s, statep); if (ret) break; @@ -868,8 +956,8 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) io_inc_sqring(ctx); } - if (to_submit > IO_PLUG_THRESHOLD) - blk_finish_plug(&plug); + if (statep) + io_submit_state_end(statep); return submit ? submit : ret; } From patchwork Sat Jan 12 21:30:04 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jens Axboe X-Patchwork-Id: 10761127 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 1AAC41580 for ; Sat, 12 Jan 2019 21:30:40 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 0D33D2901C for ; Sat, 12 Jan 2019 21:30:40 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 016582904F; Sat, 12 Jan 2019 21:30:39 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 823C52901C for ; Sat, 12 Jan 2019 21:30:39 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726647AbfALVai (ORCPT ); Sat, 12 Jan 2019 16:30:38 -0500 Received: from mail-pl1-f193.google.com ([209.85.214.193]:47102 "EHLO mail-pl1-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726660AbfALVah (ORCPT ); Sat, 12 Jan 2019 16:30:37 -0500 Received: by mail-pl1-f193.google.com with SMTP id t13so8346704ply.13 for ; Sat, 12 Jan 2019 13:30:37 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=scNBCeOhysecepgCupm97WAYtsMfIFMGuolJ4I0NyhI=; b=TMRr4+hDYOacDnkEaeA+/ucAU/kCy7rfVWmTZP9KlLHt29O+oYZJ0XtNnvLg/CUqEr IEOQSTgwy3TxyYwFKKfsXXdmGy+LFuzUY1pjyEp7n96RfN6s4rVJ3oyR6HU4gQCTA8zf YkcK7z5dsXFiiXxHDXj7QlRAsic8NAn8hPIr3J3iJDrYxB3SKwrvgt142MjLhWmdMIOT Hab1UuANTAnA+TkqxYNcbWNXs4/PQKlgRN5uIoH1Mcf40XQT5dHzU2GrGkPuU3v5St43 
VWc+lcayablnb7/j3nVPCrr6plLpLBXEAyzIDUf0PcQSH3JvKa5yw+dczhzSefRCy0QI 8Z7g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=scNBCeOhysecepgCupm97WAYtsMfIFMGuolJ4I0NyhI=; b=i3SkyUmRZ4ffTh4f3r7rOqiYGOAlwC0Dvboet0gPVyCYSSdhWA6H8C6SXMwtpZZrXL xMPAy1vu00bWqvHxE0o6Z2E2LUSvVtDDyvh/zUi7hVfr1JK8751E52y1+3M8Z/XKcyVK pKq4NkRipCvMAXTUS2aaodN31ZK8Bn77MtyqJm7USsv9jzZjztH7OoDqXFcQZ3vFiZoD f1+CV7/u29C/0MPCk13Rz0furQIRmQ0Vzmmq8zrrxOYBlRzRGbfDcL0GyIc9yAzvvVEa eakb7lLp9oQR86x2yjjKDMaL4dkdWJjZrqrJP0hgM0fKXCEKoCqqnwXb6amG0wZxc3XV 7YXw== X-Gm-Message-State: AJcUukdetIwwIjWZNjXId0mFIDY13ajT8TDs5c74v/xsX0ELfS/+ugAp TtJLmoENzFEiTSdBKi0jQrVfGf6G/8gAGg== X-Google-Smtp-Source: ALg8bN5UiO3mzJCLfvTWnHiKpY3iioSGQIM+v15yaqpDJ3WVSdowIN0BZ94jGPz0xgk4dr4VDXAR1g== X-Received: by 2002:a17:902:ac1:: with SMTP id 59mr19624309plp.36.1547328636633; Sat, 12 Jan 2019 13:30:36 -0800 (PST) Received: from x1.localdomain (66.29.188.166.static.utbb.net. [66.29.188.166]) by smtp.gmail.com with ESMTPSA id y6sm151629818pfd.104.2019.01.12.13.30.34 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sat, 12 Jan 2019 13:30:35 -0800 (PST) From: Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe Subject: [PATCH 09/16] fs: add fget_many() and fput_many() Date: Sat, 12 Jan 2019 14:30:04 -0700 Message-Id: <20190112213011.1439-10-axboe@kernel.dk> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190112213011.1439-1-axboe@kernel.dk> References: <20190112213011.1439-1-axboe@kernel.dk> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Some uses cases repeatedly get and put references to the same file, but the only exposed interface is doing these one at the time. As each of these entail an atomic inc or dec on a shared structure, that cost can add up. Add fget_many(), which works just like fget(), except it takes an argument for how many references to get on the file. Ditto fput_many(), which can drop an arbitrary number of references to a file. 
Signed-off-by: Jens Axboe --- fs/file.c | 15 ++++++++++----- fs/file_table.c | 9 +++++++-- include/linux/file.h | 2 ++ include/linux/fs.h | 4 +++- 4 files changed, 22 insertions(+), 8 deletions(-) diff --git a/fs/file.c b/fs/file.c index 3209ee271c41..e0d7ce70e860 100644 --- a/fs/file.c +++ b/fs/file.c @@ -705,7 +705,7 @@ void do_close_on_exec(struct files_struct *files) spin_unlock(&files->file_lock); } -static struct file *__fget(unsigned int fd, fmode_t mask) +static struct file *__fget(unsigned int fd, fmode_t mask, unsigned int refs) { struct files_struct *files = current->files; struct file *file; @@ -720,7 +720,7 @@ static struct file *__fget(unsigned int fd, fmode_t mask) */ if (file->f_mode & mask) file = NULL; - else if (!get_file_rcu(file)) + else if (!get_file_rcu_many(file, refs)) goto loop; } rcu_read_unlock(); @@ -728,15 +728,20 @@ static struct file *__fget(unsigned int fd, fmode_t mask) return file; } +struct file *fget_many(unsigned int fd, unsigned int refs) +{ + return __fget(fd, FMODE_PATH, refs); +} + struct file *fget(unsigned int fd) { - return __fget(fd, FMODE_PATH); + return fget_many(fd, 1); } EXPORT_SYMBOL(fget); struct file *fget_raw(unsigned int fd) { - return __fget(fd, 0); + return __fget(fd, 0, 1); } EXPORT_SYMBOL(fget_raw); @@ -767,7 +772,7 @@ static unsigned long __fget_light(unsigned int fd, fmode_t mask) return 0; return (unsigned long)file; } else { - file = __fget(fd, mask); + file = __fget(fd, mask, 1); if (!file) return 0; return FDPUT_FPUT | (unsigned long)file; diff --git a/fs/file_table.c b/fs/file_table.c index 5679e7fcb6b0..155d7514a094 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -326,9 +326,9 @@ void flush_delayed_fput(void) static DECLARE_DELAYED_WORK(delayed_fput_work, delayed_fput); -void fput(struct file *file) +void fput_many(struct file *file, unsigned int refs) { - if (atomic_long_dec_and_test(&file->f_count)) { + if (atomic_long_sub_and_test(refs, &file->f_count)) { struct task_struct *task = current; if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) { @@ -347,6 +347,11 @@ void fput(struct file *file) } } +void fput(struct file *file) +{ + fput_many(file, 1); +} + /* * synchronous analog of fput(); for kernel threads that might be needed * in some umount() (and thus can't use flush_delayed_fput() without diff --git a/include/linux/file.h b/include/linux/file.h index 6b2fb032416c..3fcddff56bc4 100644 --- a/include/linux/file.h +++ b/include/linux/file.h @@ -13,6 +13,7 @@ struct file; extern void fput(struct file *); +extern void fput_many(struct file *, unsigned int); struct file_operations; struct vfsmount; @@ -44,6 +45,7 @@ static inline void fdput(struct fd fd) } extern struct file *fget(unsigned int fd); +extern struct file *fget_many(unsigned int fd, unsigned int refs); extern struct file *fget_raw(unsigned int fd); extern unsigned long __fdget(unsigned int fd); extern unsigned long __fdget_raw(unsigned int fd); diff --git a/include/linux/fs.h b/include/linux/fs.h index ccb0b7a63aa5..acaad78b6781 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -952,7 +952,9 @@ static inline struct file *get_file(struct file *f) atomic_long_inc(&f->f_count); return f; } -#define get_file_rcu(x) atomic_long_inc_not_zero(&(x)->f_count) +#define get_file_rcu_many(x, cnt) \ + atomic_long_add_unless(&(x)->f_count, (cnt), 0) +#define get_file_rcu(x) get_file_rcu_many((x), 1) #define fput_atomic(x) atomic_long_add_unless(&(x)->f_count, -1, 1) #define file_count(x) atomic_long_read(&(x)->f_count) From patchwork Sat Jan 12 
21:30:05 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jens Axboe X-Patchwork-Id: 10761133 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 7B4A61390 for ; Sat, 12 Jan 2019 21:30:42 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 6E6132901C for ; Sat, 12 Jan 2019 21:30:42 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 6297A2904F; Sat, 12 Jan 2019 21:30:42 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id CE6D82901C for ; Sat, 12 Jan 2019 21:30:41 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726678AbfALVak (ORCPT ); Sat, 12 Jan 2019 16:30:40 -0500 Received: from mail-pl1-f193.google.com ([209.85.214.193]:37786 "EHLO mail-pl1-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726673AbfALVak (ORCPT ); Sat, 12 Jan 2019 16:30:40 -0500 Received: by mail-pl1-f193.google.com with SMTP id b5so8363055plr.4 for ; Sat, 12 Jan 2019 13:30:39 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=R4F9Nz1irwaEO5ASW/z73EMm/6iuhiIm/ZXZ/FF5J/o=; b=CWZRVSqr2B2+r3QKLqvqFCsUihPbYbXnCCe1w0FbKT5t7C7R89aGNmVpfNBL6hl2hc m13LrjEOVFDT3u5e/pqPIJu7IRR0dks7cB+gaqU2BNuUt4IfzFIZ4o4SsLqHzYSWOt0l vg9v93qc9OGylZE7ujk7LBehsuJimkaBkf6I+Ezm31EYLmocFOgTNhEU3pDE7mR6P7PD 5MG+IsPXOBcFDfhGUEnUE4K/0VAhcpB0a9tHfpdABDBkKfiPkMKmLhAdEoydNBIWtDwA G1lUMye6E6R0R5boqmQxwN77qwxHBB+cUlmRKADZWIe/YBCJoqpzRFoOvu3XR0IPgf52 hMZw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=R4F9Nz1irwaEO5ASW/z73EMm/6iuhiIm/ZXZ/FF5J/o=; b=RvhtPvukB8NoHmOaERTV6c0gYOz0KeNYw+kcn82ZQ+vcSRpPCUA+Clqg8G7TsmL3b6 CvvhAtB/pxFm2Jhj+b1Wv+A6t5PrAQlzgJID3KwwItpcBrsZ/wP3KulFFML/W2MT0qR/ +0sE8mheg2XZz+OgZ6gaUSR9vQgNLYiTLHAV3BHxs5CZuphACn+vcpRuorn4A/2H+AJ2 butXg+2ciyEJyjEZeaxf91CNLH+4KYIn3LYGagOkDVqBmBDRvlay548Leh4stFLQ2saX pIhyw0XidoG465Sbv3SoHRkyKevk55Y3cjad73ecgTfOVoC5gFWWCgLZitFq0ezQaVnp 2glA== X-Gm-Message-State: AJcUukeisxUyyeN8GDeUE6vMixp4x6BJyb2U2iRxnCgk2iPpQGg+q2Bj 2VBrz3ThpxjejPY/fTUbDaabTazk58j1gg== X-Google-Smtp-Source: ALg8bN4MHWp7HTUn7hPFnCAR8uDQEXDZiJQC5w+czOEmU12yGeD1wATWftV8yCWeacPoIiFs5DUJPg== X-Received: by 2002:a17:902:a60f:: with SMTP id u15mr18927689plq.275.1547328638386; Sat, 12 Jan 2019 13:30:38 -0800 (PST) Received: from x1.localdomain (66.29.188.166.static.utbb.net. 
[66.29.188.166]) by smtp.gmail.com with ESMTPSA id y6sm151629818pfd.104.2019.01.12.13.30.36 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sat, 12 Jan 2019 13:30:37 -0800 (PST) From: Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe Subject: [PATCH 10/16] io_uring: use fget/fput_many() for file references Date: Sat, 12 Jan 2019 14:30:05 -0700 Message-Id: <20190112213011.1439-11-axboe@kernel.dk> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190112213011.1439-1-axboe@kernel.dk> References: <20190112213011.1439-1-axboe@kernel.dk> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP On the submission side, add file reference batching to the io_submit_state. We get as many references as the number of iocbs we are submitting, and drop unused ones if we end up switching files. The assumption here is that we're usually only dealing with one fd, and if there are multiple, hopefuly they are at least somewhat ordered. Could trivially be extended to cover multiple fds, if needed. On the completion side we do the same thing, except this is trivially done just locally in io_iopoll_reap(). Signed-off-by: Jens Axboe --- fs/io_uring.c | 98 ++++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 85 insertions(+), 13 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 93905ce360bb..443988474b83 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -139,6 +139,15 @@ struct io_submit_state { */ struct list_multi req_list; unsigned int req_count; + + /* + * File reference cache + */ + struct file *file; + unsigned int fd; + unsigned int has_refs; + unsigned int used_refs; + unsigned int ios_left; }; static struct kmem_cache *req_cachep; @@ -282,8 +291,10 @@ static void io_iopoll_reap(struct io_ring_ctx *ctx, unsigned int *nr_events) { void *reqs[IO_IOPOLL_BATCH]; struct io_kiocb *req, *n; - int to_free = 0; + int file_count, to_free; + struct file *file = NULL; + file_count = to_free = 0; list_for_each_entry_safe(req, n, &ctx->poll_completing.list, ki_list) { if (!(req->ki_flags & REQ_F_IOPOLL_COMPLETED)) continue; @@ -293,10 +304,26 @@ static void io_iopoll_reap(struct io_ring_ctx *ctx, unsigned int *nr_events) list_del(&req->ki_list); reqs[to_free++] = req; - fput(req->rw.ki_filp); + /* + * Batched puts of the same file, to avoid dirtying the + * file usage count multiple times, if avoidable. + */ + if (!file) { + file = req->rw.ki_filp; + file_count = 1; + } else if (file == req->rw.ki_filp) { + file_count++; + } else { + fput_many(file, file_count); + file = req->rw.ki_filp; + file_count = 1; + } + (*nr_events)++; } + if (file) + fput_many(file, file_count); if (to_free) io_free_req_many(ctx, reqs, &to_free); } @@ -557,14 +584,56 @@ static void io_iopoll_req_issued(struct io_submit_state *state, io_iopoll_req_add_state(state, req); } +static void io_file_put(struct io_submit_state *state, struct file *file) +{ + if (!state) { + fput(file); + } else if (state->file) { + int diff = state->has_refs - state->used_refs; + + if (diff) + fput_many(state->file, diff); + state->file = NULL; + } +} + +/* + * Get as many references to a file as we have IOs left in this submission, + * assuming most submissions are for one file, or at least that each file + * has more than one submission. 
+ */ +static struct file *io_file_get(struct io_submit_state *state, int fd) +{ + if (!state) + return fget(fd); + + if (state->file) { + if (state->fd == fd) { + state->used_refs++; + state->ios_left--; + return state->file; + } + io_file_put(state, NULL); + } + state->file = fget_many(fd, state->ios_left); + if (!state->file) + return NULL; + + state->fd = fd; + state->has_refs = state->ios_left; + state->used_refs = 1; + state->ios_left--; + return state->file; +} + static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, - bool force_nonblock) + bool force_nonblock, struct io_submit_state *state) { struct io_ring_ctx *ctx = req->ki_ctx; struct kiocb *kiocb = &req->rw; int ret; - kiocb->ki_filp = fget(sqe->fd); + kiocb->ki_filp = io_file_get(state, sqe->fd); if (unlikely(!kiocb->ki_filp)) return -EBADF; kiocb->ki_pos = sqe->off; @@ -603,7 +672,7 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, } return 0; out_fput: - fput(kiocb->ki_filp); + io_file_put(state, kiocb->ki_filp); return ret; } @@ -628,7 +697,7 @@ static inline void io_rw_done(struct kiocb *req, ssize_t ret) } static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe, - bool force_nonblock) + bool force_nonblock, struct io_submit_state *state) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; void __user *buf = (void __user *) (uintptr_t) sqe->addr; @@ -637,7 +706,7 @@ static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe, struct file *file; ssize_t ret; - ret = io_prep_rw(req, sqe, force_nonblock); + ret = io_prep_rw(req, sqe, force_nonblock, state); if (ret) return ret; file = kiocb->ki_filp; @@ -672,7 +741,7 @@ static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe, } static ssize_t io_write(struct io_kiocb *req, const struct io_uring_sqe *sqe, - bool force_nonblock) + bool force_nonblock, struct io_submit_state *state) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; void __user *buf = (void __user *) (uintptr_t) sqe->addr; @@ -681,7 +750,7 @@ static ssize_t io_write(struct io_kiocb *req, const struct io_uring_sqe *sqe, struct file *file; ssize_t ret; - ret = io_prep_rw(req, sqe, force_nonblock); + ret = io_prep_rw(req, sqe, force_nonblock, state); if (ret) return ret; file = kiocb->ki_filp; @@ -776,10 +845,10 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, ret = -EINVAL; switch (sqe->opcode) { case IORING_OP_READV: - ret = io_read(req, sqe, force_nonblock); + ret = io_read(req, sqe, force_nonblock, state); break; case IORING_OP_WRITEV: - ret = io_write(req, sqe, force_nonblock); + ret = io_write(req, sqe, force_nonblock, state); break; case IORING_OP_FSYNC: ret = io_fsync(req, sqe, force_nonblock); @@ -882,17 +951,20 @@ static void io_submit_state_end(struct io_submit_state *state) blk_finish_plug(&state->plug); if (!list_empty(&state->req_list.list)) io_flush_state_reqs(state->ctx, state); + io_file_put(state, NULL); } /* * Start submission side cache. 
*/ static void io_submit_state_start(struct io_submit_state *state, - struct io_ring_ctx *ctx) + struct io_ring_ctx *ctx, unsigned max_ios) { state->ctx = ctx; INIT_LIST_HEAD(&state->req_list.list); state->req_count = 0; + state->file = NULL; + state->ios_left = max_ios; #ifdef CONFIG_BLOCK state->plug_cb.callback = io_state_unplug; blk_start_plug(&state->plug); @@ -938,7 +1010,7 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) int i, ret = 0, submit = 0; if (to_submit > IO_PLUG_THRESHOLD) { - io_submit_state_start(&state, ctx); + io_submit_state_start(&state, ctx, to_submit); statep = &state; } From patchwork Sat Jan 12 21:30:06 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jens Axboe X-Patchwork-Id: 10761137 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id EEBF7746 for ; Sat, 12 Jan 2019 21:30:43 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id E345D2901C for ; Sat, 12 Jan 2019 21:30:43 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id D7C502904F; Sat, 12 Jan 2019 21:30:43 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 7C0812901C for ; Sat, 12 Jan 2019 21:30:43 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726685AbfALVam (ORCPT ); Sat, 12 Jan 2019 16:30:42 -0500 Received: from mail-pf1-f194.google.com ([209.85.210.194]:37760 "EHLO mail-pf1-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726673AbfALVal (ORCPT ); Sat, 12 Jan 2019 16:30:41 -0500 Received: by mail-pf1-f194.google.com with SMTP id y126so8572485pfb.4 for ; Sat, 12 Jan 2019 13:30:41 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=7UBRv/xk8WPda5LpOZd/9AGkNsfnfzZylGeuJHJ/2Lw=; b=IpRkiSRAYfJaFeo+BgCuen33D9kCf6oLALTk5F5Jgho98iD6FxqH8chICxe9IBzixS bcXcxmhfh8DSqpYo6EE/WVK2CRPdf9Tde5tWtenTKD2rq+cVjY80yCJSM+nanvftOtYL pZ1MyrNV0SCHKwffCVsLOOFharBAL+1jSy6p9Rg08nbzP2AiYcvqins1qGu3Bzp+Kt+u 5IJMvEMd6PwmkKvKBmt+0K8t2gEklyGj+XrcpkF3/52J9vgumHvxlVS1I8594A20PFg+ TDR+WXZ7SeYGy5VqbNofcA17rSqRSBPLtZG3dB65uCnPWv40INPa8kzDIITU/x9yEluo pUrw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=7UBRv/xk8WPda5LpOZd/9AGkNsfnfzZylGeuJHJ/2Lw=; b=HiICYoo6tHrunW0Ph25SBCATiF88kBIO0UWRfx1uiTGIrLrctKT/+uxw4MCev6czr5 +1iMn9tVDa6QMxuSrnPzRilMiArYCtzxJlXW2bRAv+lRPZ6SWxKsukINWTFPLhzh2lPO LN9OqkdAWO2OATX3TWSBRIQvDPaL1xzBfwJoFdd3WvOVV0+3qD09yo6XyLnATQL0cWKE P62A66inblaJcV6EWSoDFd7EnhoLnK/IWp6Ge9LNw7mN7NgpR3Pm5uyv69AyHHzyT5Kh rJpCBBFJSr45nW53wOPaaDpvk+wbXa+V6Z3yhY+wLtGhErcCHluGUGRw1nc+1N0FsZI6 5WJg== X-Gm-Message-State: AJcUukd9rL8CTAniQJVPayBplC5EnRhEULR1iK0hMHhKokjQurcokWVa S5aXci7djLpwp/KDGJ6FuoNm4ZeO3BHNGg== X-Google-Smtp-Source: 
ALg8bN6PvdT1Dm0KL20etIiyRs9ndmJdUvZaRnA+76AmB0sx3/kQMeIexVKa6ursW3YNJ1bDhIbGAg== X-Received: by 2002:a62:6ac8:: with SMTP id f191mr19520989pfc.13.1547328640267; Sat, 12 Jan 2019 13:30:40 -0800 (PST) Received: from x1.localdomain (66.29.188.166.static.utbb.net. [66.29.188.166]) by smtp.gmail.com with ESMTPSA id y6sm151629818pfd.104.2019.01.12.13.30.38 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sat, 12 Jan 2019 13:30:39 -0800 (PST) From: Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe Subject: [PATCH 11/16] io_uring: batch io_kiocb allocation Date: Sat, 12 Jan 2019 14:30:06 -0700 Message-Id: <20190112213011.1439-12-axboe@kernel.dk> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190112213011.1439-1-axboe@kernel.dk> References: <20190112213011.1439-1-axboe@kernel.dk> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Similarly to how we use the state->ios_left to know how many references to get to a file, we can use it to allocate the io_kiocb's we need in bulk. Signed-off-by: Jens Axboe --- fs/io_uring.c | 66 ++++++++++++++++++++++++++++++++++++++------------- 1 file changed, 50 insertions(+), 16 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 443988474b83..3867a93d485f 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -140,6 +140,13 @@ struct io_submit_state { struct list_multi req_list; unsigned int req_count; + /* + * io_kiocb alloc cache + */ + void *reqs[IO_IOPOLL_BATCH]; + unsigned int free_reqs; + unsigned int cur_req; + /* * File reference cache */ @@ -209,29 +216,52 @@ static struct io_uring_cqe *io_peek_cqring(struct io_ring_ctx *ctx) return &ring->cqes[tail & ctx->cq_mask]; } -static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx) +static void io_ring_drop_ctx_refs(struct io_ring_ctx *ctx, unsigned refs) { + percpu_ref_put_many(&ctx->refs, refs); + + if (waitqueue_active(&ctx->wait)) + wake_up(&ctx->wait); +} + +static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx, + struct io_submit_state *state) +{ + gfp_t gfp = GFP_ATOMIC | __GFP_NOWARN; struct io_kiocb *req; if (!percpu_ref_tryget(&ctx->refs)) return NULL; - req = kmem_cache_alloc(req_cachep, GFP_ATOMIC | __GFP_NOWARN); - if (!req) - return NULL; - - req->ki_ctx = ctx; - INIT_LIST_HEAD(&req->ki_list); - req->ki_flags = 0; - return req; -} + if (!state) + req = kmem_cache_alloc(req_cachep, gfp); + else if (!state->free_reqs) { + size_t sz; + int ret; + + sz = min_t(size_t, state->ios_left, ARRAY_SIZE(state->reqs)); + ret = kmem_cache_alloc_bulk(req_cachep, gfp, sz, + state->reqs); + if (ret <= 0) + goto out; + state->free_reqs = ret - 1; + state->cur_req = 1; + req = state->reqs[0]; + } else { + req = state->reqs[state->cur_req]; + state->free_reqs--; + state->cur_req++; + } -static void io_ring_drop_ctx_refs(struct io_ring_ctx *ctx, unsigned refs) -{ - percpu_ref_put_many(&ctx->refs, refs); + if (req) { + req->ki_ctx = ctx; + req->ki_flags = 0; + return req; + } - if (waitqueue_active(&ctx->wait)) - wake_up(&ctx->wait); +out: + io_ring_drop_ctx_refs(ctx, 1); + return NULL; } static void io_free_req_many(struct io_ring_ctx *ctx, void **reqs, int *nr) @@ -915,7 +945,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s, struct io_kiocb *req; ssize_t ret; - req = io_get_req(ctx); + req = io_get_req(ctx, state); if 
(unlikely(!req)) return -EAGAIN; @@ -952,6 +982,9 @@ static void io_submit_state_end(struct io_submit_state *state) if (!list_empty(&state->req_list.list)) io_flush_state_reqs(state->ctx, state); io_file_put(state, NULL); + if (state->free_reqs) + kmem_cache_free_bulk(req_cachep, state->free_reqs, + &state->reqs[state->cur_req]); } /* @@ -963,6 +996,7 @@ static void io_submit_state_start(struct io_submit_state *state, state->ctx = ctx; INIT_LIST_HEAD(&state->req_list.list); state->req_count = 0; + state->free_reqs = 0; state->file = NULL; state->ios_left = max_ios; #ifdef CONFIG_BLOCK From patchwork Sat Jan 12 21:30:07 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jens Axboe X-Patchwork-Id: 10761141 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 573D71390 for ; Sat, 12 Jan 2019 21:30:46 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 4B36B2901C for ; Sat, 12 Jan 2019 21:30:46 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 3F4D12904F; Sat, 12 Jan 2019 21:30:46 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id C17552901C for ; Sat, 12 Jan 2019 21:30:45 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726691AbfALVao (ORCPT ); Sat, 12 Jan 2019 16:30:44 -0500 Received: from mail-pf1-f196.google.com ([209.85.210.196]:32780 "EHLO mail-pf1-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726673AbfALVan (ORCPT ); Sat, 12 Jan 2019 16:30:43 -0500 Received: by mail-pf1-f196.google.com with SMTP id c123so8575429pfb.0 for ; Sat, 12 Jan 2019 13:30:43 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=Lz/3jHz1x36txQZ/D99mDx/1sUz8ZTGETsX8P4ZemoU=; b=tlidfC/Pl2It1OafQF2CPjRlO8krxSlVw6SsWO8bgOw7dO7ve9uqtKG3YdKhEFOvBn yMDGN7TfuuOTSHKmBPDYUwRYzi7fyH62xUhngP3U70wWxfgE2Gla7MWSkWp0PZMdPgcz kQR3qz3tglJ8duu+TQgyU+jj6cZpfxio0Du3Wnt+oc4jFqwKjPKlYL+ns7whmBT3q9lK Kf74QcAduoCTOKRxXEdYC0/91FrO6e1C6GiXoMlMz0lBhjO/cn8aM1a3CoSDWQbNH8PV b1GjWj1IGShhBmHOH0rpT+LLeWPGtkkmF5EuAsgg2ClZkoEGihyaxJC9+KHlcqlEloUp lMdw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=Lz/3jHz1x36txQZ/D99mDx/1sUz8ZTGETsX8P4ZemoU=; b=RGA0Q1qZKhOML/NYzP95ZHlFH/cqCLXBqSCLidq+7aJRGzsCztC//G8NQYSZMu+YWf PaFpDZ174Fdl5mF4NCsWJDXwrBeUxzC3z4yyfhruSi5Z+TkfA92LYLNjvdEHwDteaL0S VIn33ljOfkB+iPElWSCgtxBdthDCXYFDZU9hQLLZNcW2au+eP3n51hpKCsdnAg//MJts OxoeeLFbbi+LRIqw3lNc17vOYi1nqFYSaFovoJzg1emfQASYiOCPX9GRvb29izBLtbcL 3iQqyA+ET4flhItihnEYP1NgHWLW3+W/M2ArhWq64UiKUaq4s6CYHbvQgEP5IVw8wc9e qDtA== X-Gm-Message-State: AJcUukd1bfMLBvXxq5DLfnTYSA1H5EwBLQRYaRIQ0FNuovMvD5VLEwTR Z/uPi/4hUf4WM5nRXtYbK8varSszJhRHlA== X-Google-Smtp-Source: ALg8bN5c97DVkyWK6tTvnU3QKrfe+d/Yc/uDbv7mXTXxJuysDkWYInIqZ0fyXEoVJXS1AUuwq2CA3w== 
X-Received: by 2002:a63:91c1:: with SMTP id l184mr18055816pge.29.1547328642161; Sat, 12 Jan 2019 13:30:42 -0800 (PST) Received: from x1.localdomain (66.29.188.166.static.utbb.net. [66.29.188.166]) by smtp.gmail.com with ESMTPSA id y6sm151629818pfd.104.2019.01.12.13.30.40 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sat, 12 Jan 2019 13:30:41 -0800 (PST) From: Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe Subject: [PATCH 12/16] block: implement bio helper to add iter bvec pages to bio Date: Sat, 12 Jan 2019 14:30:07 -0700 Message-Id: <20190112213011.1439-13-axboe@kernel.dk> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190112213011.1439-1-axboe@kernel.dk> References: <20190112213011.1439-1-axboe@kernel.dk> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP For an ITER_BVEC, we can just iterate the iov and add the pages to the bio directly. This requires that the caller doesn't releases the pages on IO completion, we add a BIO_HOLD_PAGES flag for that. The current two callers of bio_iov_iter_get_pages() are updated to check if they need to release pages on completion. This makes them work with bvecs that contain kernel mapped pages already. Signed-off-by: Jens Axboe --- block/bio.c | 59 ++++++++++++++++++++++++++++++++------- fs/block_dev.c | 5 ++-- fs/iomap.c | 5 ++-- include/linux/blk_types.h | 1 + 4 files changed, 56 insertions(+), 14 deletions(-) diff --git a/block/bio.c b/block/bio.c index 4db1008309ed..7af4f45d2ed6 100644 --- a/block/bio.c +++ b/block/bio.c @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page, } EXPORT_SYMBOL(bio_add_page); +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter) +{ + const struct bio_vec *bv = iter->bvec; + unsigned int len; + size_t size; + + len = min_t(size_t, bv->bv_len, iter->count); + size = bio_add_page(bio, bv->bv_page, len, + bv->bv_offset + iter->iov_offset); + if (size == len) { + iov_iter_advance(iter, size); + return 0; + } + + return -EINVAL; +} + #define PAGE_PTRS_PER_BVEC (sizeof(struct bio_vec) / sizeof(struct page *)) /** @@ -876,23 +893,43 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter) } /** - * bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio + * bio_iov_iter_get_pages - add user or kernel pages to a bio * @bio: bio to add pages to - * @iter: iov iterator describing the region to be mapped + * @iter: iov iterator describing the region to be added + * + * This takes either an iterator pointing to user memory, or one pointing to + * kernel pages (BVEC iterator). If we're adding user pages, we pin them and + * map them into the kernel. On IO completion, the caller should put those + * pages. If we're adding kernel pages, we just have to add the pages to the + * bio directly. We don't grab an extra reference to those pages (the user + * should already have that), and we don't put the page on IO completion. + * The caller needs to check if the bio is flagged BIO_HOLD_PAGES on IO + * completion. If it isn't, then pages should be released. * - * Pins pages from *iter and appends them to @bio's bvec array. The - * pages will have to be released using put_page() when done. 
* The function tries, but does not guarantee, to pin as many pages as - * fit into the bio, or are requested in *iter, whatever is smaller. - * If MM encounters an error pinning the requested pages, it stops. - * Error is returned only if 0 pages could be pinned. + * fit into the bio, or are requested in *iter, whatever is smaller. If + * MM encounters an error pinning the requested pages, it stops. Error + * is returned only if 0 pages could be pinned. */ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter) { + const bool is_bvec = iov_iter_is_bvec(iter); unsigned short orig_vcnt = bio->bi_vcnt; + /* + * If this is a BVEC iter, then the pages are kernel pages. Don't + * release them on IO completion. + */ + if (is_bvec) + bio_set_flag(bio, BIO_HOLD_PAGES); + do { - int ret = __bio_iov_iter_get_pages(bio, iter); + int ret; + + if (is_bvec) + ret = __bio_iov_bvec_add_pages(bio, iter); + else + ret = __bio_iov_iter_get_pages(bio, iter); if (unlikely(ret)) return bio->bi_vcnt > orig_vcnt ? 0 : ret; @@ -1634,7 +1671,8 @@ static void bio_dirty_fn(struct work_struct *work) next = bio->bi_private; bio_set_pages_dirty(bio); - bio_release_pages(bio); + if (!bio_flagged(bio, BIO_HOLD_PAGES)) + bio_release_pages(bio); bio_put(bio); } } @@ -1650,7 +1688,8 @@ void bio_check_pages_dirty(struct bio *bio) goto defer; } - bio_release_pages(bio); + if (!bio_flagged(bio, BIO_HOLD_PAGES)) + bio_release_pages(bio); bio_put(bio); return; defer: diff --git a/fs/block_dev.c b/fs/block_dev.c index 2ebd2a0d7789..b7742014c9de 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -324,8 +324,9 @@ static void blkdev_bio_end_io(struct bio *bio) struct bio_vec *bvec; int i; - bio_for_each_segment_all(bvec, bio, i) - put_page(bvec->bv_page); + if (!bio_flagged(bio, BIO_HOLD_PAGES)) + bio_for_each_segment_all(bvec, bio, i) + put_page(bvec->bv_page); bio_put(bio); } } diff --git a/fs/iomap.c b/fs/iomap.c index 4ee50b76b4a1..0a64c9c51203 100644 --- a/fs/iomap.c +++ b/fs/iomap.c @@ -1582,8 +1582,9 @@ static void iomap_dio_bio_end_io(struct bio *bio) struct bio_vec *bvec; int i; - bio_for_each_segment_all(bvec, bio, i) - put_page(bvec->bv_page); + if (!bio_flagged(bio, BIO_HOLD_PAGES)) + bio_for_each_segment_all(bvec, bio, i) + put_page(bvec->bv_page); bio_put(bio); } } diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index 5c7e7f859a24..97e206855cd3 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -215,6 +215,7 @@ struct bio { /* * bio flags */ +#define BIO_HOLD_PAGES 0 /* don't put O_DIRECT pages */ #define BIO_SEG_VALID 1 /* bi_phys_segments valid */ #define BIO_CLONED 2 /* doesn't own data */ #define BIO_BOUNCED 3 /* bio is a bounce bio */ From patchwork Sat Jan 12 21:30:08 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jens Axboe X-Patchwork-Id: 10761145 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 913841580 for ; Sat, 12 Jan 2019 21:30:49 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 85EB92901C for ; Sat, 12 Jan 2019 21:30:49 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 7A61329053; Sat, 12 Jan 2019 21:30:49 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, 
score=-7.9 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 881832904F for ; Sat, 12 Jan 2019 21:30:48 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726649AbfALVar (ORCPT ); Sat, 12 Jan 2019 16:30:47 -0500 Received: from mail-pl1-f193.google.com ([209.85.214.193]:42602 "EHLO mail-pl1-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726696AbfALVap (ORCPT ); Sat, 12 Jan 2019 16:30:45 -0500 Received: by mail-pl1-f193.google.com with SMTP id y1so8347947plp.9 for ; Sat, 12 Jan 2019 13:30:45 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=XEP6KqqE0cXLJ/gjC0UMvyFFqquWaHcgFrOVWDcHLU8=; b=h2ylMZpiMrky5MBzdUxKQKBmOUmmrwAcyqv4DEgDRI2AJustFw5bz30Eq3w4FP4hs1 4hiOKrnhdahdqQ2w+oEBkGcorExj2KD1JBt1wV4Ew7hQ2izD0/rLD8/pwopn6zphKFk7 BnpjmVZ+2qzTzBLWBc7l11+J6T7tDAg358QfRNExdzI05ralIQxOc2fLPjj6AuIEJUpD br9HU6zwlaorObLtKqp+EzZcX3LOrzTfCctEDd9ZAqOkKQlUrssBOqvlK/JMRrDrewAc sNCt2ZCiO+dDPi5kuuqTqKM71j4U2UERMPt4FEdDsv78ILVKLMgyLtiF6FAnExfn78AY ZBlA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=XEP6KqqE0cXLJ/gjC0UMvyFFqquWaHcgFrOVWDcHLU8=; b=TOqe1JgMgzRb7lmM1RAvV5DFflkOukhprwhFgvViQSj+0fKr8lF1fSEtSPyEx6BvNG vgJ38N2oSS7X+pDUmwZH6xywu8hMKWgFnG4KOMjYrM2db9TApq3zqCS70ffaETrkh+N6 J8GpC7s6ykcZmvBGkFkTmWlYJhuzFv656CrKIT5IgHk49vj5EzhDe0GYPdtMi02lX3H4 uu24LHWcytwSb2Ets5yRAMLhXoaY5XPJJoOtovxEQaiE5HPayJ/Cx/JDvK00XFOFALY3 JGM62n1FwEnpdiU4bl0OxpHAsSVGAJD0hF/Vyj5LS+Xw+e/Pvo6As1j1+r91h6j2NYV4 S+HA== X-Gm-Message-State: AJcUukeiU4vhXTYu3UhvGGTbH1xOLcvt9quRmyaIP54SCtYvPDJ7Qb2/ oObtmDM8QbjDpsWLSWO1VgDzKEB+EqTcLw== X-Google-Smtp-Source: ALg8bN4a75FSMMBTLGRkLQ4/WpZpz/yAPHDyRtYPIEq2RjLZpOqC3rHaMPlXj+4mSzmcOtMRkTXuSw== X-Received: by 2002:a17:902:7005:: with SMTP id y5mr20009535plk.7.1547328644028; Sat, 12 Jan 2019 13:30:44 -0800 (PST) Received: from x1.localdomain (66.29.188.166.static.utbb.net. [66.29.188.166]) by smtp.gmail.com with ESMTPSA id y6sm151629818pfd.104.2019.01.12.13.30.42 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sat, 12 Jan 2019 13:30:43 -0800 (PST) From: Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe Subject: [PATCH 13/16] io_uring: add support for pre-mapped user IO buffers Date: Sat, 12 Jan 2019 14:30:08 -0700 Message-Id: <20190112213011.1439-14-axboe@kernel.dk> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190112213011.1439-1-axboe@kernel.dk> References: <20190112213011.1439-1-axboe@kernel.dk> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP If we have fixed user buffers, we can map them into the kernel when we setup the io_context. That avoids the need to do get_user_pages() for each and every IO. 
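A rough userspace sketch of the flow described in the rest of this message (register one fixed buffer, then issue a READ_FIXED against it). This is not part of the patch itself: the 337 syscall number is the x86-64 io_uring_register entry added below, no libc wrapper is assumed, and the application's own sqe queueing code is omitted.

#include <string.h>
#include <sys/uio.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>	/* uapi header added by this series */

static int register_one_buffer(int ring_fd, void *buf, size_t len)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct io_uring_register_buffers reg = {
		.iovecs = &iov,
		.nr_iovecs = 1,
	};

	/* 337 == io_uring_register on x86-64, per the syscall table hunk */
	return syscall(337, ring_fd, IORING_REGISTER_BUFFERS, &reg);
}

static void prep_read_fixed(struct io_uring_sqe *sqe, int fd, void *buf,
			    unsigned len, unsigned long long offset)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_READ_FIXED;
	sqe->fd = fd;
	sqe->off = offset;
	/* addr..addr+len must stay inside the registered buffer */
	sqe->addr = (unsigned long) buf;
	sqe->len = len;
	sqe->buf_index = 0;	/* index into the registered iovec array */
}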
To utilize this feature, the application must call io_uring_register() after having setup an io_uring context, passing in IORING_REGISTER_BUFFERS as the opcode, and the following struct as the argument: struct io_uring_register_buffers { struct iovec *iovecs; __u32 nr_iovecs; }; If successful, these buffers are now mapped into the kernel, eligible for IO. To use these fixed buffers, the application must use the IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then set sqe->index to the desired buffer index. sqe->addr..sqe->addr+seq->len must point to somewhere inside the indexed buffer. The application may register buffers throughout the lifetime of the io_uring context. It can call io_uring_register() with IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of buffers, and then register a new set. The application need not unregister buffers explicitly before shutting down the io_uring context. It's perfectly valid to setup a larger buffer, and then sometimes only use parts of it for an IO. As long as the range is within the originally mapped region, it will work just fine. RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat arbitrary 1G per buffer size is also imposed. Signed-off-by: Jens Axboe --- arch/x86/entry/syscalls/syscall_64.tbl | 1 + fs/io_uring.c | 318 +++++++++++++++++++++++-- include/linux/syscalls.h | 2 + include/uapi/linux/io_uring.h | 17 +- kernel/sys_ni.c | 1 + 5 files changed, 323 insertions(+), 16 deletions(-) diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index 453ff7a79002..8e05d4f05d88 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -345,6 +345,7 @@ 334 common rseq __x64_sys_rseq 335 common io_uring_setup __x64_sys_io_uring_setup 336 common io_uring_enter __x64_sys_io_uring_enter +337 common io_uring_register __x64_sys_io_uring_register # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/fs/io_uring.c b/fs/io_uring.c index 3867a93d485f..64e590175641 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -23,8 +23,11 @@ #include #include #include +#include #include #include +#include +#include #include #include @@ -60,6 +63,13 @@ struct list_multi { unsigned multi; }; +struct io_mapped_ubuf { + u64 ubuf; + size_t len; + struct bio_vec *bvec; + unsigned int nr_bvecs; +}; + struct io_ring_ctx { struct percpu_ref refs; @@ -82,6 +92,10 @@ struct io_ring_ctx { struct mm_struct *sqo_mm; struct files_struct *sqo_files; + /* if used, fixed mapped user buffers */ + struct io_mapped_ubuf *user_bufs; + unsigned nr_user_bufs; + struct completion ctx_done; /* iopoll submission state */ @@ -726,11 +740,42 @@ static inline void io_rw_done(struct kiocb *req, ssize_t ret) } } +static int io_import_fixed(int rw, struct io_kiocb *req, + const struct io_uring_sqe *sqe, + struct iov_iter *iter) +{ + struct io_ring_ctx *ctx = req->ki_ctx; + struct io_mapped_ubuf *imu; + size_t len = sqe->len; + size_t offset; + int index; + + /* attempt to use fixed buffers without having provided iovecs */ + if (!ctx->user_bufs) + return -EFAULT; + + /* io_submit_sqe() already validated the index */ + index = array_index_nospec(sqe->buf_index, ctx->sq_entries); + imu = &ctx->user_bufs[index]; + if ((unsigned long) sqe->addr < imu->ubuf || + (unsigned long) sqe->addr + len > imu->ubuf + imu->len) + return -EFAULT; + + /* + * May not be a start of buffer, set size appropriately + * and advance us to the beginning. 
+ */ + offset = (unsigned long) sqe->addr - imu->ubuf; + iov_iter_bvec(iter, rw, imu->bvec, imu->nr_bvecs, offset + len); + if (offset) + iov_iter_advance(iter, offset); + return 0; +} + static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe, bool force_nonblock, struct io_submit_state *state) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; - void __user *buf = (void __user *) (uintptr_t) sqe->addr; struct kiocb *kiocb = &req->rw; struct iov_iter iter; struct file *file; @@ -748,7 +793,15 @@ static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (unlikely(!file->f_op->read_iter)) goto out_fput; - ret = import_iovec(READ, buf, sqe->len, UIO_FASTIOV, &iovec, &iter); + if (sqe->opcode == IORING_OP_READ_FIXED) { + ret = io_import_fixed(READ, req, sqe, &iter); + iovec = NULL; + } else { + void __user *buf = (void __user *) (uintptr_t) sqe->addr; + + ret = import_iovec(READ, buf, sqe->len, UIO_FASTIOV, &iovec, + &iter); + } if (ret) goto out_fput; @@ -774,7 +827,6 @@ static ssize_t io_write(struct io_kiocb *req, const struct io_uring_sqe *sqe, bool force_nonblock, struct io_submit_state *state) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; - void __user *buf = (void __user *) (uintptr_t) sqe->addr; struct kiocb *kiocb = &req->rw; struct iov_iter iter; struct file *file; @@ -796,7 +848,15 @@ static ssize_t io_write(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (unlikely(!file->f_op->write_iter)) goto out_fput; - ret = import_iovec(WRITE, buf, sqe->len, UIO_FASTIOV, &iovec, &iter); + if (sqe->opcode == IORING_OP_WRITE_FIXED) { + ret = io_import_fixed(WRITE, req, sqe, &iter); + iovec = NULL; + } else { + void __user *buf = (void __user *) (uintptr_t) sqe->addr; + + ret = import_iovec(WRITE, buf, sqe->len, UIO_FASTIOV, &iovec, + &iter); + } if (ret) goto out_fput; @@ -865,7 +925,7 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, ssize_t ret; /* enforce forwards compatibility on users */ - if (unlikely(sqe->flags || sqe->__pad2)) + if (unlikely(sqe->flags || sqe->__pad2 || sqe->__pad3)) return -EINVAL; if (unlikely(s->index >= ctx->sq_entries)) @@ -875,9 +935,27 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, ret = -EINVAL; switch (sqe->opcode) { case IORING_OP_READV: + if (unlikely(sqe->buf_index)) + return -EINVAL; ret = io_read(req, sqe, force_nonblock, state); break; case IORING_OP_WRITEV: + if (unlikely(sqe->buf_index)) + return -EINVAL; + ret = io_write(req, sqe, force_nonblock, state); + break; + case IORING_OP_READ_FIXED: + if (unlikely(!ctx->user_bufs)) + return -EFAULT; + if (unlikely(sqe->buf_index >= ctx->nr_user_bufs)) + return -EFAULT; + ret = io_read(req, sqe, force_nonblock, state); + break; + case IORING_OP_WRITE_FIXED: + if (unlikely(!ctx->user_bufs)) + return -EFAULT; + if (unlikely(sqe->buf_index >= ctx->nr_user_bufs)) + return -EFAULT; ret = io_write(req, sqe, force_nonblock, state); break; case IORING_OP_FSYNC: @@ -903,9 +981,11 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, static void io_sq_wq_submit_work(struct work_struct *work) { struct io_kiocb *req = container_of(work, struct io_kiocb, work.work); + struct sqe_submit *s = &req->work.submit; struct io_ring_ctx *ctx = req->ki_ctx; - mm_segment_t old_fs = get_fs(); struct files_struct *old_files; + mm_segment_t old_fs; + bool needs_user; int ret; /* @@ -918,19 +998,32 @@ static void io_sq_wq_submit_work(struct work_struct *work) old_files = 
current->files; current->files = ctx->sqo_files; - if (!mmget_not_zero(ctx->sqo_mm)) { - ret = -EFAULT; - goto err; + /* + * If we're doing IO to fixed buffers, we don't need to get/set + * user context + */ + needs_user = true; + if (s->sqe->opcode == IORING_OP_READ_FIXED || + s->sqe->opcode == IORING_OP_WRITE_FIXED) + needs_user = false; + + if (needs_user) { + if (!mmget_not_zero(ctx->sqo_mm)) { + ret = -EFAULT; + goto err; + } + use_mm(ctx->sqo_mm); + old_fs = get_fs(); + set_fs(USER_DS); } - use_mm(ctx->sqo_mm); - set_fs(USER_DS); - ret = __io_submit_sqe(ctx, req, &req->work.submit, false, NULL); - set_fs(old_fs); - unuse_mm(ctx->sqo_mm); - mmput(ctx->sqo_mm); + if (needs_user) { + set_fs(old_fs); + unuse_mm(ctx->sqo_mm); + mmput(ctx->sqo_mm); + } err: if (ret) { io_fill_cq_error(ctx, &req->work.submit, ret); @@ -1173,6 +1266,132 @@ static void io_sq_offload_stop(struct io_ring_ctx *ctx) } } +static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx) +{ + int i, j; + + if (!ctx->user_bufs) + return -EINVAL; + + for (i = 0; i < ctx->sq_entries; i++) { + struct io_mapped_ubuf *imu = &ctx->user_bufs[i]; + + for (j = 0; j < imu->nr_bvecs; j++) + put_page(imu->bvec[j].bv_page); + + kfree(imu->bvec); + imu->nr_bvecs = 0; + } + + kfree(ctx->user_bufs); + ctx->user_bufs = NULL; + return 0; +} + +static int io_sqe_buffer_register(struct io_ring_ctx *ctx, + struct io_uring_register_buffers *reg) +{ + unsigned long total_pages, page_limit; + struct page **pages = NULL; + int i, j, got_pages = 0; + int ret = -EINVAL; + + if (reg->nr_iovecs > USHRT_MAX) + return -EINVAL; + + ctx->user_bufs = kcalloc(reg->nr_iovecs, sizeof(struct io_mapped_ubuf), + GFP_KERNEL); + if (!ctx->user_bufs) + return -ENOMEM; + + /* Don't allow more pages than we can safely lock */ + total_pages = 0; + page_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; + + for (i = 0; i < reg->nr_iovecs; i++) { + struct io_mapped_ubuf *imu = &ctx->user_bufs[i]; + unsigned long off, start, end, ubuf; + int pret, nr_pages; + struct iovec iov; + size_t size; + + ret = -EFAULT; + if (copy_from_user(&iov, ®->iovecs[i], sizeof(iov))) + goto err; + + /* + * Don't impose further limits on the size and buffer + * constraints here, we'll -EINVAL later when IO is + * submitted if they are wrong. 
+ */ + ret = -EFAULT; + if (!iov.iov_base) + goto err; + + /* arbitrary limit, but we need something */ + if (iov.iov_len > SZ_1G) + goto err; + + ubuf = (unsigned long) iov.iov_base; + end = (ubuf + iov.iov_len + PAGE_SIZE - 1) >> PAGE_SHIFT; + start = ubuf >> PAGE_SHIFT; + nr_pages = end - start; + + ret = -ENOMEM; + if (total_pages + nr_pages > page_limit) + goto err; + + if (!pages || nr_pages > got_pages) { + kfree(pages); + pages = kmalloc(nr_pages * sizeof(struct page *), + GFP_KERNEL); + if (!pages) + goto err; + got_pages = nr_pages; + } + + imu->bvec = kmalloc(nr_pages * sizeof(struct bio_vec), + GFP_KERNEL); + if (!imu->bvec) + goto err; + + down_write(¤t->mm->mmap_sem); + pret = get_user_pages_longterm(ubuf, nr_pages, 1, pages, NULL); + up_write(¤t->mm->mmap_sem); + + if (pret < nr_pages) { + if (pret < 0) + ret = pret; + goto err; + } + + off = ubuf & ~PAGE_MASK; + size = iov.iov_len; + for (j = 0; j < nr_pages; j++) { + size_t vec_len; + + vec_len = min_t(size_t, size, PAGE_SIZE - off); + imu->bvec[j].bv_page = pages[j]; + imu->bvec[j].bv_len = vec_len; + imu->bvec[j].bv_offset = off; + off = 0; + size -= vec_len; + } + /* store original address for later verification */ + imu->ubuf = ubuf; + imu->len = iov.iov_len; + imu->nr_bvecs = nr_pages; + total_pages += nr_pages; + } + kfree(pages); + ctx->nr_user_bufs = reg->nr_iovecs; + return 0; +err: + kfree(pages); + io_sqe_buffer_unregister(ctx); + return ret; +} + static void io_free_scq_urings(struct io_ring_ctx *ctx) { if (ctx->sq_ring) { @@ -1194,6 +1413,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx) io_sq_offload_stop(ctx); io_iopoll_reap_events(ctx); io_free_scq_urings(ctx); + io_sqe_buffer_unregister(ctx); percpu_ref_exit(&ctx->refs); kfree(ctx); } @@ -1422,6 +1642,74 @@ SYSCALL_DEFINE2(io_uring_setup, u32, entries, return ret; } +static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, + void __user *arg) +{ + int ret; + + /* Drop our initial ref and wait for the ctx to be fully idle */ + percpu_ref_put(&ctx->refs); + percpu_ref_kill(&ctx->refs); + wait_for_completion(&ctx->ctx_done); + + switch (opcode) { + case IORING_REGISTER_BUFFERS: { + struct io_uring_register_buffers reg; + + ret = -EFAULT; + if (copy_from_user(®, arg, sizeof(reg))) + break; + ret = io_sqe_buffer_register(ctx, ®); + break; + } + case IORING_UNREGISTER_BUFFERS: + ret = -EINVAL; + if (arg) + break; + ret = io_sqe_buffer_unregister(ctx); + break; + default: + ret = -EINVAL; + break; + } + + /* bring the ctx back to life */ + percpu_ref_resurrect(&ctx->refs); + percpu_ref_get(&ctx->refs); + return ret; +} + +SYSCALL_DEFINE3(io_uring_register, unsigned int, fd, unsigned int, opcode, + void __user *, arg) +{ + struct io_ring_ctx *ctx; + long ret = -EBADF; + struct fd f; + + f = fdget(fd); + if (!f.file) + return -EBADF; + + ret = -EOPNOTSUPP; + if (f.file->f_op != &io_scqring_fops) + goto out_fput; + + ret = -EINVAL; + ctx = f.file->private_data; + if (!percpu_ref_tryget(&ctx->refs)) + goto out_fput; + + ret = -EBUSY; + if (mutex_trylock(&ctx->uring_lock)) { + ret = __io_uring_register(ctx, opcode, arg); + mutex_unlock(&ctx->uring_lock); + } + io_ring_drop_ctx_refs(ctx, 1); +out_fput: + fdput(f); + return ret; +} + static int __init io_uring_setup(void) { req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC); diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 542757a4c898..e36c264d74e8 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -314,6 +314,8 @@ asmlinkage long 
sys_io_uring_setup(u32 entries, struct io_uring_params __user *p); asmlinkage long sys_io_uring_enter(unsigned int fd, u32 to_submit, u32 min_complete, u32 flags); +asmlinkage long sys_io_uring_register(unsigned int fd, unsigned op, + void __user *arg); /* fs/xattr.c */ asmlinkage long sys_setxattr(const char __user *path, const char __user *name, diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 506498cc913e..e1eeb3fc7443 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -29,7 +29,9 @@ struct io_uring_sqe { __kernel_rwf_t rw_flags; __u32 fsync_flags; }; - __u64 __pad2; + __u16 buf_index; /* index into fixed buffers, if used */ + __u16 __pad2; + __u32 __pad3; __u64 user_data; /* data to be passed back at completion time */ }; @@ -41,6 +43,8 @@ struct io_uring_sqe { #define IORING_OP_READV 1 #define IORING_OP_WRITEV 2 #define IORING_OP_FSYNC 3 +#define IORING_OP_READ_FIXED 4 +#define IORING_OP_WRITE_FIXED 5 /* * sqe->fsync_flags @@ -104,4 +108,15 @@ struct io_uring_params { struct io_cqring_offsets cq_off; }; +/* + * io_uring_register(2) opcodes and arguments + */ +#define IORING_REGISTER_BUFFERS 0 +#define IORING_UNREGISTER_BUFFERS 1 + +struct io_uring_register_buffers { + struct iovec *iovecs; + __u32 nr_iovecs; +}; + #endif diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index ee5e523564bb..1bb6604dc19f 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -48,6 +48,7 @@ COND_SYSCALL_COMPAT(io_getevents); COND_SYSCALL_COMPAT(io_pgetevents); COND_SYSCALL(io_uring_setup); COND_SYSCALL(io_uring_enter); +COND_SYSCALL(io_uring_register); /* fs/xattr.c */ From patchwork Sat Jan 12 21:30:09 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jens Axboe X-Patchwork-Id: 10761149 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id D27AD1390 for ; Sat, 12 Jan 2019 21:30:51 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id C5BE72904F for ; Sat, 12 Jan 2019 21:30:51 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id B78CF2901C; Sat, 12 Jan 2019 21:30:51 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 0CB5F2901C for ; Sat, 12 Jan 2019 21:30:51 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726712AbfALVau (ORCPT ); Sat, 12 Jan 2019 16:30:50 -0500 Received: from mail-pf1-f194.google.com ([209.85.210.194]:35086 "EHLO mail-pf1-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726700AbfALVar (ORCPT ); Sat, 12 Jan 2019 16:30:47 -0500 Received: by mail-pf1-f194.google.com with SMTP id z9so8569715pfi.2 for ; Sat, 12 Jan 2019 13:30:46 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=8vV4hdIAmsQBc8lORxnHOFoG7gVrwDswoWOOwJ1uzl8=; b=W//oERYASV2gKGngSjzaCFxeXtO6bzuNmTpz30DHSbOlEIEl5ZgO+fV9vbWq0ebKlX 
GHXFxixaz4FXQ50Qo6nWjylHaBPJ9LMoWonYq12Kd1qPmOqnKGr+sDczN1SM56f3QlMg 9yvTCZtCs4G4uW/Kvu3KvKv9r/2W5Xeunc4E6hhTaa3hEPZeXAUHkuf3lRDsIr1w9n2h YwUbmlbIha5jT1Vn1Ow3bgll6Fu/xDR21cV30aNKnahFhOfYF/KpgmXy0DBQK2dXnVyT 3KTLqiFlyaGuG9vk9Q/qCe4xqDbNz+9Vw3LtQVed7dHasg6poWeaIOpVrlcElwPD0G+y 10SQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=8vV4hdIAmsQBc8lORxnHOFoG7gVrwDswoWOOwJ1uzl8=; b=hxje9b9OVuD7jNUkmzoiHb44N55Oo1Ne/spzBJgI7O7/2MUxhA/cA8Qh77HEMDuhRh H75UrvvAcrnx8W9HndIINVfTR3xAzRXP//kdrocMpKJ3JmPS72opeCN8/uAR9cIwmdN2 yP7Ow0O7EBRRaonDG3gV/9K6+CUbia1KQPv6blaiaXJziEM7f1Rd2n9ejcTbaNKj2g70 OnzNE5+N4UyUT3yogqHlUJP3dNScTei9LNMFPV/Rmg237b910Xc2i02WhGokwZRvD+sR vcN/d4V5b0upwI6XvJSBiqdeRpoEl1q9tmocEetNU3c+YqpjmwrhJJfmzfp8OTkuUYry b6rg== X-Gm-Message-State: AJcUukfaHlU2lUk1tBzTUrZfKYkFYpWi7XzJbxfnBXNkQdIHUiUEqy1U IffRqXEY9VFSje2yo7AcfsFgjvlNGPiHMw== X-Google-Smtp-Source: ALg8bN7sJayQSiMCKY8csJqQNzzC++YKFFchwj8Ro8cc99AGb+YY+OPyrRQQsevDXsRmUVH58kd8mg== X-Received: by 2002:a63:2643:: with SMTP id m64mr17626724pgm.35.1547328645930; Sat, 12 Jan 2019 13:30:45 -0800 (PST) Received: from x1.localdomain (66.29.188.166.static.utbb.net. [66.29.188.166]) by smtp.gmail.com with ESMTPSA id y6sm151629818pfd.104.2019.01.12.13.30.44 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sat, 12 Jan 2019 13:30:45 -0800 (PST) From: Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe Subject: [PATCH 14/16] io_uring: add submission polling Date: Sat, 12 Jan 2019 14:30:09 -0700 Message-Id: <20190112213011.1439-15-axboe@kernel.dk> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190112213011.1439-1-axboe@kernel.dk> References: <20190112213011.1439-1-axboe@kernel.dk> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP This enables an application to do IO, without ever entering the kernel. By using the SQ ring to fill in new sqes and watching for completions on the CQ ring, we can submit and reap IOs without doing a single system call. The kernel side thread will poll for new submissions, and in case of HIPRI/polled IO, it'll also poll for completions. Proof of concept. If the thread has been idle for 1 second, it will set sq_ring->flags |= IORING_SQ_NEED_WAKEUP. The application will have to call io_uring_enter() to start things back up again. If IO is kept busy, that will never be needed. Basically an application that has this feature enabled will guard it's io_uring_enter(2) call with: read_barrier(); if (*sq_ring->flags & IORING_SQ_NEED_WAKEUP) io_uring_enter(fd, to_submit, 0, 0); instead of calling it unconditionally. Improvements: 1) Maybe have smarter backoff. Busy loop for X time, then go to monitor/mwait, finally the schedule we have now after an idle second. Might not be worth the complexity. 2) Probably want the application to pass in the appropriate grace period, not hard code it at 1 second. 
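Fleshing out the guard above as a hedged sketch (not part of the patch): sq_flags is assumed to point at the mmap'ed sq_ring->flags word, the SQEs have already been written and the SQ tail updated, and 336 is the x86-64 io_uring_enter syscall number from earlier in the series.

#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>

static void submit_to_sq_poll_thread(int ring_fd, unsigned *sq_flags,
				     unsigned to_submit)
{
	/*
	 * Stand-in for the read_barrier() above: order the earlier SQ tail
	 * store before the flags load.
	 */
	__atomic_thread_fence(__ATOMIC_SEQ_CST);

	/* only enter the kernel if the poll thread flagged that it slept */
	if (*(volatile unsigned *)sq_flags & IORING_SQ_NEED_WAKEUP)
		syscall(336, ring_fd, to_submit, 0, 0);
}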
Signed-off-by: Jens Axboe --- fs/io_uring.c | 215 +++++++++++++++++++++++++++++++++- include/uapi/linux/io_uring.h | 10 +- 2 files changed, 218 insertions(+), 7 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 64e590175641..6f446d13b3d4 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -22,6 +22,7 @@ #include #include #include +#include #include #include #include @@ -89,8 +90,10 @@ struct io_ring_ctx { /* IO offload */ struct workqueue_struct *sqo_wq; + struct task_struct *sqo_thread; /* if using sq thread polling */ struct mm_struct *sqo_mm; struct files_struct *sqo_files; + wait_queue_head_t sqo_wait; /* if used, fixed mapped user buffers */ struct io_mapped_ubuf *user_bufs; @@ -1131,6 +1134,167 @@ static bool io_peek_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s) return false; } +static int io_submit_sqes(struct io_ring_ctx *ctx, struct sqe_submit *sqes, + unsigned int nr, bool mm_fault) +{ + struct io_submit_state state, *statep = NULL; + int ret, i, submitted = 0; + + if (nr > IO_PLUG_THRESHOLD) { + io_submit_state_start(&state, ctx, nr); + statep = &state; + } + + for (i = 0; i < nr; i++) { + if (unlikely(mm_fault)) + ret = -EFAULT; + else + ret = io_submit_sqe(ctx, &sqes[i], statep); + if (!ret) { + submitted++; + continue; + } + + io_fill_cq_error(ctx, &sqes[i], ret); + } + + if (statep) + io_submit_state_end(&state); + + return submitted; +} + +static int io_sq_thread(void *data) +{ + struct sqe_submit sqes[IO_IOPOLL_BATCH]; + struct io_ring_ctx *ctx = data; + struct mm_struct *cur_mm = NULL; + struct files_struct *old_files; + mm_segment_t old_fs; + DEFINE_WAIT(wait); + unsigned inflight; + unsigned long timeout; + + old_files = current->files; + current->files = ctx->sqo_files; + + old_fs = get_fs(); + set_fs(USER_DS); + + timeout = inflight = 0; + while (!kthread_should_stop()) { + bool all_fixed, mm_fault = false; + int i; + + if (inflight) { + unsigned int nr_events = 0; + + /* + * Normal IO, just pretend everything completed. + * We don't have to poll completions for that. + */ + if (ctx->flags & IORING_SETUP_IOPOLL) { + /* + * App should not use IORING_ENTER_GETEVENTS + * with thread polling, but if it does, then + * ensure we are mutually exclusive. + */ + if (mutex_trylock(&ctx->uring_lock)) { + io_iopoll_check(ctx, &nr_events, 0); + mutex_unlock(&ctx->uring_lock); + } + } else { + nr_events = inflight; + } + + inflight -= nr_events; + if (!inflight) + timeout = jiffies + HZ; + } + + if (!io_peek_sqring(ctx, &sqes[0])) { + /* + * We're polling, let us spin for a second without + * work before going to sleep. + */ + if (inflight || !time_after(jiffies, timeout)) { + cpu_relax(); + continue; + } + + /* + * Drop cur_mm before scheduling, we can't hold it for + * long periods (or over schedule()). Do this before + * adding ourselves to the waitqueue, as the unuse/drop + * may sleep. 
+ */ + if (cur_mm) { + unuse_mm(cur_mm); + mmput(cur_mm); + cur_mm = NULL; + } + + prepare_to_wait(&ctx->sqo_wait, &wait, + TASK_INTERRUPTIBLE); + + /* Tell userspace we may need a wakeup call */ + ctx->sq_ring->flags |= IORING_SQ_NEED_WAKEUP; + smp_wmb(); + + if (!io_peek_sqring(ctx, &sqes[0])) { + if (kthread_should_park()) + kthread_parkme(); + if (kthread_should_stop()) { + finish_wait(&ctx->sqo_wait, &wait); + break; + } + if (signal_pending(current)) + flush_signals(current); + schedule(); + finish_wait(&ctx->sqo_wait, &wait); + + ctx->sq_ring->flags &= ~IORING_SQ_NEED_WAKEUP; + smp_wmb(); + continue; + } + finish_wait(&ctx->sqo_wait, &wait); + + ctx->sq_ring->flags &= ~IORING_SQ_NEED_WAKEUP; + smp_wmb(); + } + + i = 0; + all_fixed = true; + do { + if (sqes[i].sqe->opcode != IORING_OP_READ_FIXED && + sqes[i].sqe->opcode != IORING_OP_WRITE_FIXED) + all_fixed = false; + if (i + 1 == ARRAY_SIZE(sqes)) + break; + i++; + io_inc_sqring(ctx); + } while (io_peek_sqring(ctx, &sqes[i])); + + /* Unless all new commands are FIXED regions, grab mm */ + if (!all_fixed && !cur_mm) { + mm_fault = !mmget_not_zero(ctx->sqo_mm); + if (!mm_fault) { + use_mm(ctx->sqo_mm); + cur_mm = ctx->sqo_mm; + } + } + + inflight += io_submit_sqes(ctx, sqes, i, mm_fault); + } + current->files = old_files; + set_fs(old_fs); + if (cur_mm) { + unuse_mm(cur_mm); + mmput(cur_mm); + } + return 0; +} + static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) { struct io_submit_state state, *statep = NULL; @@ -1202,9 +1366,14 @@ static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit, int ret = 0; if (to_submit) { - ret = io_ring_submit(ctx, to_submit); - if (ret < 0) - return ret; + if (ctx->flags & IORING_SETUP_SQPOLL) { + wake_up(&ctx->sqo_wait); + ret = to_submit; + } else { + ret = io_ring_submit(ctx, to_submit); + if (ret < 0) + return ret; + } } if (flags & IORING_ENTER_GETEVENTS) { unsigned nr_events = 0; @@ -1225,10 +1394,12 @@ static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit, return ret; } -static int io_sq_offload_start(struct io_ring_ctx *ctx) +static int io_sq_offload_start(struct io_ring_ctx *ctx, + struct io_uring_params *p) { int ret; + init_waitqueue_head(&ctx->sqo_wait); ctx->sqo_mm = current->mm; /* @@ -1242,6 +1413,27 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx) if (!ctx->sqo_files) goto err; + if (ctx->flags & IORING_SETUP_SQPOLL) { + if (p->flags & IORING_SETUP_SQ_AFF) { + ctx->sqo_thread = kthread_create_on_cpu(io_sq_thread, + ctx, p->sq_thread_cpu, + "io_uring-sq"); + } else { + ctx->sqo_thread = kthread_create(io_sq_thread, ctx, + "io_uring-sq"); + } + if (IS_ERR(ctx->sqo_thread)) { + ret = PTR_ERR(ctx->sqo_thread); + ctx->sqo_thread = NULL; + goto err; + } + wake_up_process(ctx->sqo_thread); + } else if (p->flags & IORING_SETUP_SQ_AFF) { + /* Can't have SQ_AFF without SQPOLL */ + ret = -EINVAL; + goto err; + } + /* Do QD, or 2 * CPUS, whatever is smallest */ ctx->sqo_wq = alloc_workqueue("io_ring-wq", WQ_UNBOUND | WQ_FREEZABLE, min(ctx->sq_entries - 1, 2 * num_online_cpus())); @@ -1252,6 +1444,11 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx) return 0; err: + if (ctx->sqo_thread) { + kthread_park(ctx->sqo_thread); + kthread_stop(ctx->sqo_thread); + ctx->sqo_thread = NULL; + } if (ctx->sqo_files) ctx->sqo_files = NULL; ctx->sqo_mm = NULL; @@ -1260,6 +1457,11 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx) static void io_sq_offload_stop(struct io_ring_ctx *ctx) { + if (ctx->sqo_thread) { + 
kthread_park(ctx->sqo_thread); + kthread_stop(ctx->sqo_thread); + ctx->sqo_thread = NULL; + } if (ctx->sqo_wq) { destroy_workqueue(ctx->sqo_wq); ctx->sqo_wq = NULL; @@ -1594,7 +1796,7 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p) if (ret) goto err; - ret = io_sq_offload_start(ctx); + ret = io_sq_offload_start(ctx, p); if (ret) goto err; @@ -1629,7 +1831,8 @@ SYSCALL_DEFINE2(io_uring_setup, u32, entries, return -EINVAL; } - if (p.flags & ~IORING_SETUP_IOPOLL) + if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_SQPOLL | + IORING_SETUP_SQ_AFF)) return -EINVAL; ret = io_uring_create(entries, &p); diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index e1eeb3fc7443..7f9be0bd84ed 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -39,6 +39,8 @@ struct io_uring_sqe { * io_uring_setup() flags */ #define IORING_SETUP_IOPOLL (1 << 0) /* io_context is polled */ +#define IORING_SETUP_SQPOLL (1 << 1) /* SQ poll thread */ +#define IORING_SETUP_SQ_AFF (1 << 2) /* sq_thread_cpu is valid */ #define IORING_OP_READV 1 #define IORING_OP_WRITEV 2 @@ -81,6 +83,11 @@ struct io_sqring_offsets { __u32 resv[3]; }; +/* + * sq_ring->flags + */ +#define IORING_SQ_NEED_WAKEUP (1 << 0) /* needs io_uring_enter wakeup */ + struct io_cqring_offsets { __u32 head; __u32 tail; @@ -103,7 +110,8 @@ struct io_uring_params { __u32 sq_entries; __u32 cq_entries; __u32 flags; - __u16 resv[10]; + __u16 sq_thread_cpu; + __u16 resv[9]; struct io_sqring_offsets sq_off; struct io_cqring_offsets cq_off; }; From patchwork Sat Jan 12 21:30:10 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jens Axboe X-Patchwork-Id: 10761153 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 8C24D1580 for ; Sat, 12 Jan 2019 21:30:52 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 800F92902C for ; Sat, 12 Jan 2019 21:30:52 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 6D62429054; Sat, 12 Jan 2019 21:30:52 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id D288E29054 for ; Sat, 12 Jan 2019 21:30:51 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726708AbfALVat (ORCPT ); Sat, 12 Jan 2019 16:30:49 -0500 Received: from mail-pf1-f193.google.com ([209.85.210.193]:41216 "EHLO mail-pf1-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726696AbfALVas (ORCPT ); Sat, 12 Jan 2019 16:30:48 -0500 Received: by mail-pf1-f193.google.com with SMTP id b7so8560950pfi.8 for ; Sat, 12 Jan 2019 13:30:48 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=AW7jsIB6hNhqzBATRVyZlWxRkDXvAdFtbb95X+jyjmo=; b=B941y95GgMIADycRcE44Rk58D82fKFMpV+hUFgg3aRbhPofanzduRmenlpsRr9Kqhs xA+kFmktWCf+18x6/usAkBK0iAlu1/KaCN17i7pV5psxxDybo9XVfrEtiCcscUzBDZC6 
qJi5cvePD3/1a44saLuy2tlqPbAHq1wjzZ9DhvA+RQmkbgcMqqovevXcpnoOGeXgUil2 boGyb4BJmv/8teAVGXtIWeWjKJUtyuYqMZY6cLe3Xlj0sgc4fsdabg5VlcETzFMIUHGO iz+lyBzQyLmev/YH+srcvfm/aIxeKnmEwMqEGyaYwV7t2cStbS5zHuJHJaOcI01jC+0f qE2w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=AW7jsIB6hNhqzBATRVyZlWxRkDXvAdFtbb95X+jyjmo=; b=iLuIjNbr3E6IQdk9FM+69NSNoDb5nyWBMNTmg4LH4B24iGxajdKfiCzktQK9Gu9Jsq A5bzTdeyzeADGmR2XTF+hJONnlp7fNPdvNPC1Z9FEbhCeEZWlWgrAJIVRHufY+bjkxhi 04fFQyOpFm9fDaxJNGwAoaOT5BR1/GmYZL60CeCID3Qa+IhdGjHcBGkMXj+lyBtdPqHH X5jU31ctYCL8XAmswjPQu/8bFlVUKBNQ0aFX32xrLgjy4zKSmMj1VUCmFVJRGlJHFF2l ViUrnCTi7UVMBxdAYpopdAOmu9RehwPtoRrO0vwyqjBB84+nZiqpZBY9CR7wRAe24ODx ZtIA== X-Gm-Message-State: AJcUukctULXhx7GvYdz6pcpNKbfH5eJ4xx/LXKWLE56zLBzPXuIT7NqX yqgryCRq1961Te+O3DXa9nL8fe7GuGbmPw== X-Google-Smtp-Source: ALg8bN7CJCZRLqA34VAoQHkLAyT9qeyJtUt5brK1P2LWiWnaAOgSU+j7wk+VcgfAAh3GHpX0yBSdmw== X-Received: by 2002:a63:1e17:: with SMTP id e23mr15596830pge.130.1547328647672; Sat, 12 Jan 2019 13:30:47 -0800 (PST) Received: from x1.localdomain (66.29.188.166.static.utbb.net. [66.29.188.166]) by smtp.gmail.com with ESMTPSA id y6sm151629818pfd.104.2019.01.12.13.30.45 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sat, 12 Jan 2019 13:30:46 -0800 (PST) From: Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe Subject: [PATCH 15/16] io_uring: add file registration Date: Sat, 12 Jan 2019 14:30:10 -0700 Message-Id: <20190112213011.1439-16-axboe@kernel.dk> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190112213011.1439-1-axboe@kernel.dk> References: <20190112213011.1439-1-axboe@kernel.dk> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP We normally have to fget/fput for each IO we do on a file. Even with the batching we do, this atomic inc/dec cost adds up. This adds IORING_REGISTER_FILES, and IORING_UNREGISTER_FILES opcodes for the io_uring_register(2) system call. Pass in an array of fds that are in use by the application, and we'll fget these for the duration of the io_uring context. When used, the application must set IOSQE_FIXED_FILE in the sqe->flags member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd to the index in the array passed in to IORING_REGISTER_FILES. Files are automatically unregistered when the io_uring context is torn down. An application need only unregister if it wishes to register a few set of fds. 
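As a hedged userspace sketch of the above (not part of the patch): register a small fd set once, then point sqe->fd at an index into that set and mark the sqe with IOSQE_FIXED_FILE. The 337 syscall number is the x86-64 io_uring_register entry added earlier in this series; error handling and ring setup are omitted.

#include <string.h>
#include <sys/uio.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>

static int register_files(int ring_fd, int *fds, unsigned nr)
{
	struct io_uring_register_files reg = {
		.fds = fds,
		.nr_fds = nr,
	};

	return syscall(337, ring_fd, IORING_REGISTER_FILES, &reg);
}

static void prep_readv_fixed_file(struct io_uring_sqe *sqe, int file_index,
				  struct iovec *iov, unsigned nr_iov,
				  unsigned long long offset)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_READV;
	sqe->flags = IOSQE_FIXED_FILE;	/* fd below is an index, not a real fd */
	sqe->fd = file_index;
	sqe->off = offset;
	sqe->addr = (unsigned long) iov;
	sqe->len = nr_iov;
}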
Signed-off-by: Jens Axboe --- fs/io_uring.c | 131 +++++++++++++++++++++++++++++----- include/uapi/linux/io_uring.h | 14 +++- 2 files changed, 127 insertions(+), 18 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 6f446d13b3d4..66a76f9fa7af 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -99,6 +99,10 @@ struct io_ring_ctx { struct io_mapped_ubuf *user_bufs; unsigned nr_user_bufs; + /* if used, fixed file set */ + struct file **user_files; + unsigned nr_user_files; + struct completion ctx_done; /* iopoll submission state */ @@ -137,6 +141,7 @@ struct io_kiocb { #define REQ_F_FORCE_NONBLOCK 1 /* inline submission attempt */ #define REQ_F_IOPOLL_COMPLETED 2 /* polled IO has completed */ #define REQ_F_IOPOLL_EAGAIN 4 /* submission got EAGAIN */ +#define REQ_F_FIXED_FILE 8 /* ctx owns file */ u64 ki_user_data; }; @@ -355,15 +360,17 @@ static void io_iopoll_reap(struct io_ring_ctx *ctx, unsigned int *nr_events) * Batched puts of the same file, to avoid dirtying the * file usage count multiple times, if avoidable. */ - if (!file) { - file = req->rw.ki_filp; - file_count = 1; - } else if (file == req->rw.ki_filp) { - file_count++; - } else { - fput_many(file, file_count); - file = req->rw.ki_filp; - file_count = 1; + if (!(req->ki_flags & REQ_F_FIXED_FILE)) { + if (!file) { + file = req->rw.ki_filp; + file_count = 1; + } else if (file == req->rw.ki_filp) { + file_count++; + } else { + fput_many(file, file_count); + file = req->rw.ki_filp; + file_count = 1; + } } (*nr_events)++; @@ -557,13 +564,19 @@ static void io_fill_cq_error(struct io_ring_ctx *ctx, struct sqe_submit *s, wake_up(&ctx->wait); } +static void io_fput(struct io_kiocb *req) +{ + if (!(req->ki_flags & REQ_F_FIXED_FILE)) + fput(req->rw.ki_filp); +} + static void io_complete_scqring_rw(struct kiocb *kiocb, long res, long res2) { struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw); kiocb_end_write(kiocb); - fput(kiocb->ki_filp); + io_fput(req); io_cqring_fill_event(req->ki_ctx, req->ki_user_data, res, 0); io_free_req(req); } @@ -680,7 +693,17 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, struct kiocb *kiocb = &req->rw; int ret; - kiocb->ki_filp = io_file_get(state, sqe->fd); + if (unlikely(sqe->flags & ~IOSQE_FIXED_FILE)) + return -EINVAL; + + if (sqe->flags & IOSQE_FIXED_FILE) { + if (unlikely(!ctx->user_files || sqe->fd >= ctx->nr_user_files)) + return -EBADF; + kiocb->ki_filp = ctx->user_files[sqe->fd]; + req->ki_flags |= REQ_F_FIXED_FILE; + } else { + kiocb->ki_filp = io_file_get(state, sqe->fd); + } if (unlikely(!kiocb->ki_filp)) return -EBADF; kiocb->ki_pos = sqe->off; @@ -719,7 +742,8 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, } return 0; out_fput: - io_file_put(state, kiocb->ki_filp); + if (!(sqe->flags & IOSQE_FIXED_FILE)) + io_file_put(state, kiocb->ki_filp); return ret; } @@ -822,7 +846,7 @@ static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe, kfree(iovec); out_fput: if (unlikely(ret)) - fput(file); + io_fput(req); return ret; } @@ -884,7 +908,7 @@ static ssize_t io_write(struct io_kiocb *req, const struct io_uring_sqe *sqe, } out_fput: if (unlikely(ret)) - fput(file); + io_fput(req); return ret; } @@ -902,19 +926,30 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (unlikely(req->ki_ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; + if (unlikely(sqe->flags & ~IOSQE_FIXED_FILE)) + return -EINVAL; if (unlikely(sqe->addr)) return -EINVAL; if (unlikely(sqe->fsync_flags & 
~IORING_FSYNC_DATASYNC)) return -EINVAL; - file = fget(sqe->fd); + if (sqe->flags & IOSQE_FIXED_FILE) { + if (unlikely(!ctx->user_files || sqe->fd >= ctx->nr_user_files)) + return -EBADF; + file = ctx->user_files[sqe->fd]; + } else { + file = fget(sqe->fd); + } + if (unlikely(!file)) return -EBADF; ret = vfs_fsync_range(file, sqe->off, end > 0 ? end : LLONG_MAX, sqe->fsync_flags & IORING_FSYNC_DATASYNC); - fput(file); + if (!(sqe->flags & IOSQE_FIXED_FILE)) + fput(file); + io_cqring_fill_event(ctx, sqe->user_data, ret, 0); io_free_req(req); return 0; @@ -928,7 +963,7 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, ssize_t ret; /* enforce forwards compatibility on users */ - if (unlikely(sqe->flags || sqe->__pad2 || sqe->__pad3)) + if (unlikely(sqe->__pad2 || sqe->__pad3)) return -EINVAL; if (unlikely(s->index >= ctx->sq_entries)) @@ -1394,6 +1429,52 @@ static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit, return ret; } +static int io_sqe_files_unregister(struct io_ring_ctx *ctx) +{ + int i; + + if (!ctx->user_files) + return -EINVAL; + + for (i = 0; i < ctx->nr_user_files; i++) + fput(ctx->user_files[i]); + + kfree(ctx->user_files); + ctx->user_files = NULL; + ctx->nr_user_files = 0; + return 0; +} + +static int io_sqe_files_register(struct io_ring_ctx *ctx, + struct io_uring_register_files *reg) +{ + int fd, i, ret = 0; + + ctx->user_files = kcalloc(reg->nr_fds, sizeof(struct file *), + GFP_KERNEL); + if (!ctx->user_files) + return -ENOMEM; + + for (i = 0; i < reg->nr_fds; i++) { + ret = -EFAULT; + if (copy_from_user(&fd, ®->fds[i], sizeof(fd))) + break; + + ctx->user_files[i] = fget(fd); + + ret = -EBADF; + if (!ctx->user_files[i]) + break; + ctx->nr_user_files++; + ret = 0; + } + + if (ret) + io_sqe_files_unregister(ctx); + + return ret; +} + static int io_sq_offload_start(struct io_ring_ctx *ctx, struct io_uring_params *p) { @@ -1615,6 +1696,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx) io_sq_offload_stop(ctx); io_iopoll_reap_events(ctx); io_free_scq_urings(ctx); + io_sqe_files_unregister(ctx); io_sqe_buffer_unregister(ctx); percpu_ref_exit(&ctx->refs); kfree(ctx); @@ -1871,6 +1953,21 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, break; ret = io_sqe_buffer_unregister(ctx); break; + case IORING_REGISTER_FILES: { + struct io_uring_register_files reg; + + ret = -EFAULT; + if (copy_from_user(®, arg, sizeof(reg))) + break; + ret = io_sqe_files_register(ctx, ®); + break; + } + case IORING_UNREGISTER_FILES: + ret = -EINVAL; + if (arg) + break; + ret = io_sqe_files_unregister(ctx); + break; default: ret = -EINVAL; break; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 7f9be0bd84ed..ffdaf46f19dd 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -16,7 +16,7 @@ */ struct io_uring_sqe { __u8 opcode; /* type of operation for this sqe */ - __u8 flags; /* as of now unused */ + __u8 flags; /* IOSQE_ flags */ __u16 ioprio; /* ioprio for the request */ __s32 fd; /* file descriptor to do IO on */ __u64 off; /* offset into file */ @@ -35,6 +35,11 @@ struct io_uring_sqe { __u64 user_data; /* data to be passed back at completion time */ }; +/* + * sqe->flags + */ +#define IOSQE_FIXED_FILE (1 << 0) /* use fixed fileset */ + /* * io_uring_setup() flags */ @@ -121,10 +126,17 @@ struct io_uring_params { */ #define IORING_REGISTER_BUFFERS 0 #define IORING_UNREGISTER_BUFFERS 1 +#define IORING_REGISTER_FILES 2 +#define IORING_UNREGISTER_FILES 3 struct 
io_uring_register_buffers { struct iovec *iovecs; __u32 nr_iovecs; }; +struct io_uring_register_files { + __s32 *fds; + __u32 nr_fds; +}; + #endif From patchwork Sat Jan 12 21:30:11 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jens Axboe X-Patchwork-Id: 10761157 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 3213D1390 for ; Sat, 12 Jan 2019 21:30:54 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 25DC82901C for ; Sat, 12 Jan 2019 21:30:54 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 1A6A32904F; Sat, 12 Jan 2019 21:30:54 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id C61F02901C for ; Sat, 12 Jan 2019 21:30:53 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726715AbfALVaw (ORCPT ); Sat, 12 Jan 2019 16:30:52 -0500 Received: from mail-pf1-f196.google.com ([209.85.210.196]:46387 "EHLO mail-pf1-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726616AbfALVav (ORCPT ); Sat, 12 Jan 2019 16:30:51 -0500 Received: by mail-pf1-f196.google.com with SMTP id c73so8543410pfe.13 for ; Sat, 12 Jan 2019 13:30:50 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=+QRhJk0MXmcBlpj1fw6MhS+bnf5QaKKqC3bc/BXBzqQ=; b=UwO2tP/ElljBAGpZmLbt8szeLpll+tp26zd0c7Wwf3R1W36lXSL+Cqs7aDu7lUoM/f /nq9oNIx/eNjo1M3bCoLgbqoARbTJh1lP6+Fcf5Ziwe3gN44JJ7zLGWdKpx6t0sXNoet wtEWWbL/1hGrssQ2vWTAkhhoPdluIusL/B0B8MaizbJf/KYADiwOm1VXVDye4oUnsb9M 3u1AqFOyW86MFao0ckTUJhNEc7FuwPVfjcQg7+y8Whq9A4Mykvt+6I9BFCep3hDe3xou 7oXBER7XKpjKzSJ11hwtHfqHgxZ6uyf6lg15+bcOP8NknnoANUayexidOCpDpE0bDICM wuZA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=+QRhJk0MXmcBlpj1fw6MhS+bnf5QaKKqC3bc/BXBzqQ=; b=iuFFHPG1qoo4fwGNxAWBDDYwx6jWL8viSDVEUYzEDsYoxR8VFl3AHT/9QXB6jhgWon rSM+mvh+4RriIMbeEjPqNxLfgNkaxFZ8OueVNGW62R2hLSFJy9lxRLA6bhzqA8P9YSPi sfhQ2FeZK9hz8WnWhQj0ymR8h7Hp1HsvUMXbqyd72pswRhcFyepOzNDMLGW69iNm5S51 RDBYzwb6fh4JJAzLK4NLDK0WUhEc2VSAWrCTtY9Qlt2weBd2jhpSXGp9ek2/e+f5e1/6 2Yweh0TRl+0x9AAxXHe9smZ+ko2IQWb6YcbBAcFnGEoiCyWdrDnh5j83ehUvIrwpS6dK Gbbg== X-Gm-Message-State: AJcUukfXgtRcNdD6e2CmLvPrDKxp2QV558bg5EDBWJxcHMiBv5ncW661 y5tMGh16optZky/y2p/XHWywR14nxA9OGw== X-Google-Smtp-Source: ALg8bN5sieoHokZROfPSSs9hEYsV4JoJMo19+NEYc2uQ3gv4z9SHweXzhDZAui+itOLQduEgBgYOqQ== X-Received: by 2002:a63:d949:: with SMTP id e9mr18096292pgj.24.1547328649475; Sat, 12 Jan 2019 13:30:49 -0800 (PST) Received: from x1.localdomain (66.29.188.166.static.utbb.net. 
[66.29.188.166]) by smtp.gmail.com with ESMTPSA id y6sm151629818pfd.104.2019.01.12.13.30.47 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sat, 12 Jan 2019 13:30:48 -0800 (PST) From: Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org, linux-arch@vger.kernel.org Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe Subject: [PATCH 16/16] io_uring: add io_uring_event cache hit information Date: Sat, 12 Jan 2019 14:30:11 -0700 Message-Id: <20190112213011.1439-17-axboe@kernel.dk> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190112213011.1439-1-axboe@kernel.dk> References: <20190112213011.1439-1-axboe@kernel.dk> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Add hint on whether a read was served out of the page cache, or if it hit media. This is useful for buffered async IO, O_DIRECT reads would never have this set (for obvious reasons). If the read hit page cache, cqe->flags will have IOCQE_FLAG_CACHEHIT set. Signed-off-by: Jens Axboe --- fs/io_uring.c | 7 ++++++- include/uapi/linux/io_uring.h | 5 +++++ 2 files changed, 11 insertions(+), 1 deletion(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 66a76f9fa7af..5074cdd85f43 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -573,11 +573,16 @@ static void io_fput(struct io_kiocb *req) static void io_complete_scqring_rw(struct kiocb *kiocb, long res, long res2) { struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw); + unsigned ev_flags = 0; kiocb_end_write(kiocb); io_fput(req); - io_cqring_fill_event(req->ki_ctx, req->ki_user_data, res, 0); + + if (res > 0 && (req->ki_flags & REQ_F_FORCE_NONBLOCK)) + ev_flags = IOCQE_FLAG_CACHEHIT; + + io_cqring_fill_event(req->ki_ctx, req->ki_user_data, res, ev_flags); io_free_req(req); } diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index ffdaf46f19dd..74370aed5f71 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -67,6 +67,11 @@ struct io_uring_cqe { __u32 flags; }; +/* + * io_uring_event->flags + */ +#define IOCQE_FLAG_CACHEHIT (1 << 0) /* IO did not hit media */ + /* * Magic offsets for the application to mmap the data it needs */