From patchwork Fri Jan 18 16:12:09 2019
X-Patchwork-Submitter: Jens Axboe
X-Patchwork-Id: 10770765
From: Jens Axboe
To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe
Subject: [PATCH 01/17] fs: add an iopoll method to struct file_operations
Date: Fri, 18 Jan 2019 09:12:09 -0700
Message-Id: <20190118161225.4545-2-axboe@kernel.dk>
In-Reply-To: <20190118161225.4545-1-axboe@kernel.dk>
References: <20190118161225.4545-1-axboe@kernel.dk>

From: Christoph Hellwig

This new method is used to explicitly poll for I/O completion for an
iocb. It must be called for any iocb submitted asynchronously (that is,
with a non-null ki_complete) which has the IOCB_HIPRI flag set.

The method is assisted by a new ki_cookie field in struct kiocb to store
the polling cookie.

TODO: we can probably union ki_cookie with the existing hint and I/O
priority fields to avoid struct kiocb growth.

Reviewed-by: Johannes Thumshirn
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
---
 Documentation/filesystems/vfs.txt | 3 +++
 include/linux/fs.h                | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 8dc8e9c2913f..761c6fd24a53 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -857,6 +857,7 @@ struct file_operations {
 	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
 	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
 	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
+	int (*iopoll)(struct kiocb *kiocb, bool spin);
 	int (*iterate) (struct file *, struct dir_context *);
 	int (*iterate_shared) (struct file *, struct dir_context *);
 	__poll_t (*poll) (struct file *, struct poll_table_struct *);
@@ -902,6 +903,8 @@ otherwise noted.
   write_iter: possibly asynchronous write with iov_iter as source

+  iopoll: called when aio wants to poll for completions on HIPRI iocbs
+
   iterate: called when the VFS needs to read the directory contents

   iterate_shared: called when the VFS needs to read the directory contents

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 811c77743dad..ccb0b7a63aa5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -310,6 +310,7 @@ struct kiocb {
 	int			ki_flags;
 	u16			ki_hint;
 	u16			ki_ioprio; /* See linux/ioprio.h */
+	unsigned int		ki_cookie; /* for ->iopoll */
 } __randomize_layout;

 static inline bool is_sync_kiocb(struct kiocb *kiocb)
@@ -1786,6 +1787,7 @@ struct file_operations {
 	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
 	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
 	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
+	int (*iopoll)(struct kiocb *kiocb, bool spin);
 	int (*iterate) (struct file *, struct dir_context *);
 	int (*iterate_shared) (struct file *, struct dir_context *);
 	__poll_t (*poll) (struct file *, struct poll_table_struct *);
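The contract this establishes is worth making concrete: nothing delivers an
IRQ-style wakeup for a HIPRI iocb, so the submitter itself must drive
->iopoll until the driver finds the completion and invokes ki_complete. A
minimal sketch of such a reap loop follows (hypothetical consumer code, not
part of this series; 'done' stands in for a flag the iocb's ki_complete
handler would set):

#include <linux/fs.h>
#include <linux/sched.h>

/* Busy-poll a HIPRI kiocb until its completion handler has run. */
static int reap_polled_iocb(struct kiocb *kiocb, const bool *done)
{
	struct file *file = kiocb->ki_filp;
	int ret;

	if (!(kiocb->ki_flags & IOCB_HIPRI) || !file->f_op->iopoll)
		return -EOPNOTSUPP;

	while (!READ_ONCE(*done)) {
		/* spin == true: keep polling rather than doing one sweep */
		ret = file->f_op->iopoll(kiocb, true);
		if (ret < 0)
			return ret;
		cond_resched();
	}
	return 0;
}

The spin argument roughly selects between busy-polling for minimum latency
and a single non-blocking sweep of the queue.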
From patchwork Fri Jan 18 16:12:10 2019
X-Patchwork-Submitter: Jens Axboe
X-Patchwork-Id: 10770769
From: Jens Axboe
To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe
Subject: [PATCH 02/17] block: wire up block device iopoll method
Date: Fri, 18 Jan 2019 09:12:10 -0700
Message-Id: <20190118161225.4545-3-axboe@kernel.dk>
In-Reply-To: <20190118161225.4545-1-axboe@kernel.dk>
References: <20190118161225.4545-1-axboe@kernel.dk>

From: Christoph Hellwig

Just call blk_poll on the iocb cookie; we can derive the block device
from the inode trivially.

Reviewed-by: Johannes Thumshirn
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
---
 fs/block_dev.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index c546cdce77e6..5415579f3e14 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -279,6 +279,14 @@ struct blkdev_dio {

 static struct bio_set blkdev_dio_pool;

+static int blkdev_iopoll(struct kiocb *kiocb, bool wait)
+{
+	struct block_device *bdev = I_BDEV(kiocb->ki_filp->f_mapping->host);
+	struct request_queue *q = bdev_get_queue(bdev);
+
+	return blk_poll(q, READ_ONCE(kiocb->ki_cookie), wait);
+}
+
 static void blkdev_bio_end_io(struct bio *bio)
 {
 	struct blkdev_dio *dio = bio->bi_private;
@@ -396,6 +404,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 				bio->bi_opf |= REQ_HIPRI;

 			qc = submit_bio(bio);
+			WRITE_ONCE(iocb->ki_cookie, qc);
 			break;
 		}
@@ -2068,6 +2077,7 @@ const struct file_operations def_blk_fops = {
 	.llseek		= block_llseek,
 	.read_iter	= blkdev_read_iter,
 	.write_iter	= blkdev_write_iter,
+	.iopoll		= blkdev_iopoll,
 	.mmap		= generic_file_mmap,
 	.fsync		= blkdev_fsync,
 	.unlocked_ioctl	= block_ioctl,
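One detail the one-line commit message skips: submission and polling can
race on different CPUs, which is why the cookie handoff uses WRITE_ONCE and
READ_ONCE. Condensed to its essentials, the pairing looks like this (a
sketch of the two sides of this patch with their surrounding context
stripped; the usual blkdev includes are assumed):

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/fs.h>

/* submit side (__blkdev_direct_IO): publish the cookie */
static void publish_cookie(struct kiocb *iocb, struct bio *bio)
{
	blk_qc_t qc = submit_bio(bio);

	WRITE_ONCE(iocb->ki_cookie, qc);
}

/* poll side (blkdev_iopoll), possibly on another CPU: consume it */
static int consume_cookie(struct kiocb *kiocb, bool wait)
{
	struct request_queue *q =
		bdev_get_queue(I_BDEV(kiocb->ki_filp->f_mapping->host));

	return blk_poll(q, READ_ONCE(kiocb->ki_cookie), wait);
}

The ONCE accessors keep the compiler from tearing or caching ki_cookie
across that race; they serve as documentation as much as correctness.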
From patchwork Fri Jan 18 16:12:11 2019
X-Patchwork-Submitter: Jens Axboe
X-Patchwork-Id: 10770773
From: Jens Axboe
To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe
Subject: [PATCH 03/17] block: add bio_set_polled() helper
Date: Fri, 18 Jan 2019 09:12:11 -0700
Message-Id: <20190118161225.4545-4-axboe@kernel.dk>
In-Reply-To: <20190118161225.4545-1-axboe@kernel.dk>
References: <20190118161225.4545-1-axboe@kernel.dk>

For the upcoming async polled IO, we can't sleep allocating requests.
If we do, then we introduce a deadlock where the submitter already has
async polled IO in-flight, but can't wait for it to complete since
polled requests must be actively found and reaped.

Utilize the helper in the blockdev DIRECT_IO code.
Signed-off-by: Jens Axboe
---
 fs/block_dev.c      |  4 ++--
 include/linux/bio.h | 14 ++++++++++++++
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 5415579f3e14..2ebd2a0d7789 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -233,7 +233,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter,
 		task_io_account_write(ret);
 	}
 	if (iocb->ki_flags & IOCB_HIPRI)
-		bio.bi_opf |= REQ_HIPRI;
+		bio_set_polled(&bio, iocb);

 	qc = submit_bio(&bio);
 	for (;;) {
@@ -401,7 +401,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 		nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES);
 		if (!nr_pages) {
 			if (iocb->ki_flags & IOCB_HIPRI)
-				bio->bi_opf |= REQ_HIPRI;
+				bio_set_polled(bio, iocb);

 			qc = submit_bio(bio);
 			WRITE_ONCE(iocb->ki_cookie, qc);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7380b094dcca..f6f0a2b3cbc8 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -823,5 +823,19 @@ static inline int bio_integrity_add_page(struct bio *bio, struct page *page,

 #endif /* CONFIG_BLK_DEV_INTEGRITY */

+/*
+ * Mark a bio as polled. Note that for async polled IO, the caller must
+ * expect -EWOULDBLOCK if we cannot allocate a request (or other resources).
+ * We cannot block waiting for requests on polled IO, as those completions
+ * must be found by the caller. This is different than IRQ driven IO, where
+ * it's safe to wait for IO to complete.
+ */
+static inline void bio_set_polled(struct bio *bio, struct kiocb *kiocb)
+{
+	bio->bi_opf |= REQ_HIPRI;
+	if (!is_sync_kiocb(kiocb))
+		bio->bi_opf |= REQ_NOWAIT;
+}
+
 #endif /* CONFIG_BLOCK */
 #endif /* __LINUX_BIO_H */
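What the REQ_NOWAIT half of the helper buys can be made concrete: an
allocation that would have slept instead fails the bio, and the error
surfaces through the completion path, where the submitter can punt to a
context that is allowed to block. A sketch of that handling (hypothetical
code, not from this series: punt_wq and resubmit_work are stand-ins, and
bi_private is assumed to have been pointed at the kiocb at submission):

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/workqueue.h>

static struct workqueue_struct *punt_wq;	/* stand-in */
static struct work_struct resubmit_work;	/* stand-in */

static void polled_bio_end_io(struct bio *bio)
{
	struct kiocb *iocb = bio->bi_private;

	if (bio->bi_status == BLK_STS_AGAIN) {
		/* REQ_NOWAIT failure: retry from a context that may block */
		queue_work(punt_wq, &resubmit_work);
	} else {
		/* byte count elided; a real completion would pass it here */
		iocb->ki_complete(iocb, blk_status_to_errno(bio->bi_status), 0);
	}
	bio_put(bio);
}

The key point is that the polled submitter never waits inline; a would-block
condition always comes back as a completion it can act on.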
From patchwork Fri Jan 18 16:12:12 2019
X-Patchwork-Submitter: Jens Axboe
X-Patchwork-Id: 10770777
From: Jens Axboe
To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe
Subject: [PATCH 04/17] iomap: wire up the iopoll method
Date: Fri, 18 Jan 2019 09:12:12 -0700
Message-Id: <20190118161225.4545-5-axboe@kernel.dk>
In-Reply-To: <20190118161225.4545-1-axboe@kernel.dk>
References: <20190118161225.4545-1-axboe@kernel.dk>

From: Christoph Hellwig

Store the request queue the last bio was submitted to in the iocb
private data, in addition to the cookie, so that we find the right
block device to poll. Also refactor the common direct I/O bio
submission code into a nice little helper.

Signed-off-by: Christoph Hellwig

Modified to use bio_set_polled().
Signed-off-by: Jens Axboe
---
 fs/gfs2/file.c        |  2 ++
 fs/iomap.c            | 43 ++++++++++++++++++++++++++++---------------
 fs/xfs/xfs_file.c     |  1 +
 include/linux/iomap.h |  1 +
 4 files changed, 32 insertions(+), 15 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index a2dea5bc0427..58a768e59712 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -1280,6 +1280,7 @@ const struct file_operations gfs2_file_fops = {
 	.llseek		= gfs2_llseek,
 	.read_iter	= gfs2_file_read_iter,
 	.write_iter	= gfs2_file_write_iter,
+	.iopoll		= iomap_dio_iopoll,
 	.unlocked_ioctl	= gfs2_ioctl,
 	.mmap		= gfs2_mmap,
 	.open		= gfs2_open,
@@ -1310,6 +1311,7 @@ const struct file_operations gfs2_file_fops_nolock = {
 	.llseek		= gfs2_llseek,
 	.read_iter	= gfs2_file_read_iter,
 	.write_iter	= gfs2_file_write_iter,
+	.iopoll		= iomap_dio_iopoll,
 	.unlocked_ioctl	= gfs2_ioctl,
 	.mmap		= gfs2_mmap,
 	.open		= gfs2_open,
diff --git a/fs/iomap.c b/fs/iomap.c
index a3088fae567b..4ee50b76b4a1 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1454,6 +1454,28 @@ struct iomap_dio {
 	};
 };

+int iomap_dio_iopoll(struct kiocb *kiocb, bool spin)
+{
+	struct request_queue *q = READ_ONCE(kiocb->private);
+
+	if (!q)
+		return 0;
+	return blk_poll(q, READ_ONCE(kiocb->ki_cookie), spin);
+}
+EXPORT_SYMBOL_GPL(iomap_dio_iopoll);
+
+static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap,
+		struct bio *bio)
+{
+	atomic_inc(&dio->ref);
+
+	if (dio->iocb->ki_flags & IOCB_HIPRI)
+		bio_set_polled(bio, dio->iocb);
+
+	dio->submit.last_queue = bdev_get_queue(iomap->bdev);
+	dio->submit.cookie = submit_bio(bio);
+}
+
 static ssize_t iomap_dio_complete(struct iomap_dio *dio)
 {
 	struct kiocb *iocb = dio->iocb;
@@ -1566,7 +1588,7 @@ static void iomap_dio_bio_end_io(struct bio *bio)
 	}
 }

-static blk_qc_t
+static void
 iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
 		unsigned len)
 {
@@ -1580,15 +1602,10 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;

-	if (dio->iocb->ki_flags & IOCB_HIPRI)
-		flags |= REQ_HIPRI;
-
 	get_page(page);
 	__bio_add_page(bio, page, len, 0);
 	bio_set_op_attrs(bio, REQ_OP_WRITE, flags);
-
-	atomic_inc(&dio->ref);
-	return submit_bio(bio);
+	iomap_dio_submit_bio(dio, iomap, bio);
 }

 static loff_t
@@ -1691,9 +1708,6 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 			bio_set_pages_dirty(bio);
 		}

-		if (dio->iocb->ki_flags & IOCB_HIPRI)
-			bio->bi_opf |= REQ_HIPRI;
-
 		iov_iter_advance(dio->submit.iter, n);

 		dio->size += n;
@@ -1701,11 +1715,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 		copied += n;

 		nr_pages = iov_iter_npages(&iter, BIO_MAX_PAGES);
-
-		atomic_inc(&dio->ref);
-
-		dio->submit.last_queue = bdev_get_queue(iomap->bdev);
-		dio->submit.cookie = submit_bio(bio);
+		iomap_dio_submit_bio(dio, iomap, bio);
 	} while (nr_pages);

 	/*
@@ -1916,6 +1926,9 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (dio->flags & IOMAP_DIO_WRITE_FUA)
 		dio->flags &= ~IOMAP_DIO_NEED_SYNC;

+	WRITE_ONCE(iocb->ki_cookie, dio->submit.cookie);
+	WRITE_ONCE(iocb->private, dio->submit.last_queue);
+
 	if (!atomic_dec_and_test(&dio->ref)) {
 		if (!dio->wait_for_completion)
 			return -EIOCBQUEUED;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e47425071e65..60c2da41f0fc 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1203,6 +1203,7 @@ const struct file_operations xfs_file_operations = {
 	.write_iter	= xfs_file_write_iter,
 	.splice_read	= generic_file_splice_read,
 	.splice_write	= iter_file_splice_write,
+	.iopoll		= iomap_dio_iopoll,
 	.unlocked_ioctl	= xfs_file_ioctl,
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= xfs_file_compat_ioctl,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 9a4258154b25..0fefb5455bda 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -162,6 +162,7 @@ typedef int (iomap_dio_end_io_t)(struct kiocb *iocb, ssize_t ret,
 		unsigned flags);
 ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops, iomap_dio_end_io_t end_io);
+int iomap_dio_iopoll(struct kiocb *kiocb, bool spin);

 #ifdef CONFIG_SWAP
 struct file;
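With iomap_dio_iopoll() exported, the gfs2 and xfs hunks above are the whole
recipe: any filesystem whose direct I/O already goes through iomap_dio_rw()
could opt in the same way. A hypothetical wiring ('myfs' and its iter
methods are stand-ins for illustration):

static const struct file_operations myfs_file_operations = {
	.llseek		= generic_file_llseek,
	.read_iter	= myfs_file_read_iter,	/* does DIO via iomap_dio_rw() */
	.write_iter	= myfs_file_write_iter,	/* does DIO via iomap_dio_rw() */
	.iopoll		= iomap_dio_iopoll,	/* the one-line opt-in */
	.mmap		= generic_file_mmap,
};

This works because iomap_dio_rw() itself stores the cookie and last queue in
the kiocb, so the poll side needs no filesystem-specific knowledge.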
From patchwork Fri Jan 18 16:12:13 2019
X-Patchwork-Submitter: Jens Axboe
X-Patchwork-Id: 10770785
From: Jens Axboe
To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe
Subject: [PATCH 05/17] Add io_uring IO interface
Date: Fri, 18 Jan 2019 09:12:13 -0700
Message-Id: <20190118161225.4545-6-axboe@kernel.dk>
In-Reply-To: <20190118161225.4545-1-axboe@kernel.dk>
References: <20190118161225.4545-1-axboe@kernel.dk>

The submission queue (SQ) and completion queue (CQ) rings are shared
between the application and the kernel. This eliminates the need to
copy data back and forth to submit and complete IO.

IO submissions use the io_uring_sqe data structure, and completions
are generated in the form of io_uring_cqe data structures. The SQ
ring is an index into the io_uring_sqe array, which makes it possible
to submit a batch of IOs without them being contiguous in the ring.
The CQ ring is always contiguous, as completion events are inherently
unordered, and hence any io_uring_cqe entry can point back to an
arbitrary submission.

Two new system calls are added for this:

io_uring_setup(entries, params)
	Sets up a context for doing async IO. On success, returns a file
	descriptor that the application can mmap to gain access to the
	SQ ring, CQ ring, and io_uring_sqes.

io_uring_enter(fd, to_submit, min_complete, flags)
	Initiates IO against the rings mapped to this fd, or waits for
	them to complete, or both. The behavior is controlled by the
	parameters passed in. If 'to_submit' is non-zero, then we'll
	try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
	kernel will wait for 'min_complete' events, if they aren't
	already available. It's valid to set IORING_ENTER_GETEVENTS
	and 'min_complete' == 0 at the same time; this allows the
	kernel to return already completed events without waiting for
	them. This is useful only for polling, as for IRQ driven IO,
	the application can just check the CQ ring without entering
	the kernel.

With this setup, it's possible to do async IO with a single system
call. Future developments will enable polled IO with this interface,
and polled submission as well. The latter will enable an application
to do IO without doing ANY system calls at all.

For IRQ driven IO, an application only needs to enter the kernel for
completions if it wants to wait for them to occur.

Each io_uring is backed by a workqueue, to support buffered async IO
as well. We will only punt to an async context if the command would
need to wait for IO on the device side. Any data that can be accessed
directly in the page cache is done inline. This avoids the slowness
issue of usual threadpools, since cached data is accessed as quickly
as a sync interface.
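From userspace, the flow the message describes looks like this. A minimal
sketch using raw syscall(2) against the interface added below; note the
assumptions: syscall numbers 335/336 are the x86-64 assignments from this
patch's tables, the uapi header is the one added below, the memory barriers
real code must pair with the kernel's smp_rmb()/smp_wmb() are reduced to
comments, and mmap error handling is omitted.

#include <linux/io_uring.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	struct io_uring_params p;

	memset(&p, 0, sizeof(p));

	/* 335/336: __NR_io_uring_setup/__NR_io_uring_enter in this series */
	int fd = (int) syscall(335, 4, &p);
	if (fd < 0)
		return 1;

	/* map the SQ ring and the sqe array (the CQ ring maps the same way) */
	char *sq = mmap(NULL, p.sq_off.array + p.sq_entries * sizeof(__u32),
			PROT_READ | PROT_WRITE, MAP_SHARED, fd,
			IORING_OFF_SQ_RING);
	struct io_uring_sqe *sqes = mmap(NULL, p.sq_entries * sizeof(*sqes),
			PROT_READ | PROT_WRITE, MAP_SHARED, fd,
			IORING_OFF_SQES);

	__u32 *tail  = (__u32 *)(sq + p.sq_off.tail);
	__u32 *mask  = (__u32 *)(sq + p.sq_off.ring_mask);
	__u32 *array = (__u32 *)(sq + p.sq_off.array);

	/* queue one NOP: fill sqe 0, publish its index, then bump the tail */
	memset(&sqes[0], 0, sizeof(sqes[0]));
	sqes[0].opcode = IORING_OP_NOP;
	sqes[0].user_data = 0x42;
	array[*tail & *mask] = 0;
	(*tail)++;	/* real code: write barrier before the kernel reads this */

	/* submit one sqe and wait for one cqe in a single system call */
	long ret = syscall(336, fd, 1, 1, IORING_ENTER_GETEVENTS);
	printf("io_uring_enter: %ld\n", ret);
	return 0;
}

With the v1 semantics below, the return value is the number of sqes
submitted (1), and the NOP's cqe, tagged user_data == 0x42, is already
visible in the CQ ring when the call returns.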
Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c

Signed-off-by: Jens Axboe
---
 arch/x86/entry/syscalls/syscall_32.tbl |    2 +
 arch/x86/entry/syscalls/syscall_64.tbl |    2 +
 fs/Makefile                            |    1 +
 fs/io_uring.c                          | 1085 ++++++++++++++++++++++++
 include/linux/syscalls.h               |    5 +
 include/uapi/linux/io_uring.h          |   96 +++
 init/Kconfig                           |    9 +
 kernel/sys_ni.c                        |    3 +
 8 files changed, 1203 insertions(+)
 create mode 100644 fs/io_uring.c
 create mode 100644 include/uapi/linux/io_uring.h

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 3cf7b533b3d1..194e79c0032e 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -398,3 +398,5 @@
 384	i386	arch_prctl	sys_arch_prctl	__ia32_compat_sys_arch_prctl
 385	i386	io_pgetevents	sys_io_pgetevents	__ia32_compat_sys_io_pgetevents
 386	i386	rseq	sys_rseq	__ia32_sys_rseq
+387	i386	io_uring_setup	sys_io_uring_setup	__ia32_compat_sys_io_uring_setup
+388	i386	io_uring_enter	sys_io_uring_enter	__ia32_sys_io_uring_enter
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index f0b1709a5ffb..453ff7a79002 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -343,6 +343,8 @@
 332	common	statx		__x64_sys_statx
 333	common	io_pgetevents	__x64_sys_io_pgetevents
 334	common	rseq		__x64_sys_rseq
+335	common	io_uring_setup	__x64_sys_io_uring_setup
+336	common	io_uring_enter	__x64_sys_io_uring_enter
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/Makefile b/fs/Makefile
index 293733f61594..8e15d6fc4340 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -30,6 +30,7 @@ obj-$(CONFIG_TIMERFD)		+= timerfd.o
 obj-$(CONFIG_EVENTFD)		+= eventfd.o
 obj-$(CONFIG_USERFAULTFD)	+= userfaultfd.o
 obj-$(CONFIG_AIO)		+= aio.o
+obj-$(CONFIG_IO_URING)		+= io_uring.o
 obj-$(CONFIG_FS_DAX)		+= dax.o
 obj-$(CONFIG_FS_ENCRYPTION)	+= crypto/
 obj-$(CONFIG_FILE_LOCKING)	+= locks.o
diff --git a/fs/io_uring.c b/fs/io_uring.c
new file mode 100644
index 000000000000..2952d0a82dd3
--- /dev/null
+++ b/fs/io_uring.c
@@ -0,0 +1,1085 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Shared application/kernel submission and completion ring pairs, for
+ * supporting fast/efficient IO.
+ *
+ * Copyright (C) 2018-2019 Jens Axboe
+ */
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include
+#include
+
+#include
+
+#include "internal.h"
+
+struct io_uring {
+	u32 head ____cacheline_aligned_in_smp;
+	u32 tail ____cacheline_aligned_in_smp;
+};
+
+struct io_sq_ring {
+	struct io_uring		r;
+	u32			ring_mask;
+	u32			ring_entries;
+	u32			dropped;
+	u32			flags;
+	u32			array[];
+};
+
+struct io_cq_ring {
+	struct io_uring		r;
+	u32			ring_mask;
+	u32			ring_entries;
+	u32			overflow;
+	struct io_uring_cqe	cqes[];
+};
+
+struct io_ring_ctx {
+	struct {
+		struct percpu_ref	refs;
+	} ____cacheline_aligned_in_smp;
+
+	struct {
+		unsigned int		flags;
+		bool			compat;
+
+		/* SQ ring */
+		struct io_sq_ring	*sq_ring;
+		unsigned		cached_sq_head;
+		unsigned		sq_entries;
+		unsigned		sq_mask;
+		unsigned		sq_thread_cpu;
+		struct io_uring_sqe	*sq_sqes;
+	} ____cacheline_aligned_in_smp;
+
+	/* IO offload */
+	struct workqueue_struct	*sqo_wq;
+	struct mm_struct	*sqo_mm;
+	struct files_struct	*sqo_files;
+
+	struct {
+		/* CQ ring */
+		struct io_cq_ring	*cq_ring;
+		unsigned		cached_cq_tail;
+		unsigned		cq_entries;
+		unsigned		cq_mask;
+		struct wait_queue_head	cq_wait;
+		struct fasync_struct	*cq_fasync;
+	} ____cacheline_aligned_in_smp;
+
+	struct user_struct	*user;
+
+	struct completion	ctx_done;
+
+	struct {
+		struct mutex		uring_lock;
+		wait_queue_head_t	wait;
+	} ____cacheline_aligned_in_smp;
+
+	struct {
+		spinlock_t		completion_lock;
+	} ____cacheline_aligned_in_smp;
+};
+
+struct sqe_submit {
+	const struct io_uring_sqe	*sqe;
+	unsigned			index;
+};
+
+struct io_kiocb {
+	union {
+		struct kiocb		rw;
+		struct sqe_submit	submit;
+	};
+
+	struct io_ring_ctx	*ctx;
+	struct list_head	list;
+	unsigned int		flags;
+#define REQ_F_FORCE_NONBLOCK	1	/* inline submission attempt */
+	u64			user_data;
+
+	struct work_struct	work;
+};
+
+#define IO_PLUG_THRESHOLD		2
+
+static struct kmem_cache *req_cachep;
+
+static const struct file_operations io_uring_fops;
+
+static void io_ring_ctx_ref_free(struct percpu_ref *ref)
+{
+	struct io_ring_ctx *ctx = container_of(ref, struct io_ring_ctx, refs);
+
+	complete(&ctx->ctx_done);
+}
+
+static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
+{
+	struct io_ring_ctx *ctx;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return NULL;
+
+	if (percpu_ref_init(&ctx->refs, io_ring_ctx_ref_free, 0, GFP_KERNEL)) {
+		kfree(ctx);
+		return NULL;
+	}
+
+	ctx->flags = p->flags;
+	init_waitqueue_head(&ctx->cq_wait);
+	init_completion(&ctx->ctx_done);
+	mutex_init(&ctx->uring_lock);
+	init_waitqueue_head(&ctx->wait);
+	spin_lock_init(&ctx->completion_lock);
+	return ctx;
+}
+
+static void io_commit_cqring(struct io_ring_ctx *ctx)
+{
+	struct io_cq_ring *ring = ctx->cq_ring;
+
+	if (ctx->cached_cq_tail != ring->r.tail) {
+		/* order cqe stores with ring update */
+		smp_wmb();
+		ring->r.tail = ctx->cached_cq_tail;
+		/* write side barrier of tail update, app has read side */
+		smp_wmb();
+
+		if (wq_has_sleeper(&ctx->cq_wait)) {
+			wake_up_interruptible(&ctx->cq_wait);
+			kill_fasync(&ctx->cq_fasync, SIGIO, POLL_IN);
+		}
+	}
+}
+
+static struct io_uring_cqe *io_get_cqring(struct io_ring_ctx *ctx)
+{
+	struct io_cq_ring *ring = ctx->cq_ring;
+	unsigned tail;
+
+	tail = ctx->cached_cq_tail;
+	smp_rmb();
+	if (tail + 1 == READ_ONCE(ring->r.head))
+		return NULL;
+
+	ctx->cached_cq_tail++;
+	return &ring->cqes[tail & ctx->cq_mask];
+}
+
+static void __io_cqring_add_event(struct io_ring_ctx *ctx, u64 ki_user_data,
+				  long res, unsigned ev_flags)
+{
+	struct io_uring_cqe *cqe;
+
+	/*
+	 * If we can't get a cq entry, userspace overflowed the
+	 * submission (by quite a lot). Increment the overflow count in
+	 * the ring.
+	 */
+	cqe = io_get_cqring(ctx);
+	if (cqe) {
+		cqe->user_data = ki_user_data;
+		cqe->res = res;
+		cqe->flags = ev_flags;
+		io_commit_cqring(ctx);
+	} else
+		ctx->cq_ring->overflow++;
+
+	if (waitqueue_active(&ctx->wait))
+		wake_up(&ctx->wait);
+}
+
+static void io_cqring_add_event(struct io_ring_ctx *ctx, u64 ki_user_data,
+				long res, unsigned ev_flags)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&ctx->completion_lock, flags);
+	__io_cqring_add_event(ctx, ki_user_data, res, ev_flags);
+	spin_unlock_irqrestore(&ctx->completion_lock, flags);
+}
+
+static void io_ring_drop_ctx_refs(struct io_ring_ctx *ctx, unsigned refs)
+{
+	percpu_ref_put_many(&ctx->refs, refs);
+
+	if (waitqueue_active(&ctx->wait))
+		wake_up(&ctx->wait);
+}
+
+static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx)
+{
+	struct io_kiocb *req;
+
+	/* safe to use the non tryget, as we're inside ring ref already */
+	percpu_ref_get(&ctx->refs);
+
+	req = kmem_cache_alloc(req_cachep, GFP_ATOMIC | __GFP_NOWARN);
+	if (req) {
+		req->ctx = ctx;
+		req->flags = 0;
+		return req;
+	}
+
+	io_ring_drop_ctx_refs(ctx, 1);
+	return NULL;
+}
+
+static void io_free_req(struct io_kiocb *req)
+{
+	io_ring_drop_ctx_refs(req->ctx, 1);
+	kmem_cache_free(req_cachep, req);
+}
+
+static void kiocb_end_write(struct kiocb *kiocb)
+{
+	if (kiocb->ki_flags & IOCB_WRITE) {
+		struct inode *inode = file_inode(kiocb->ki_filp);
+
+		/*
+		 * Tell lockdep we inherited freeze protection from submission
+		 * thread.
+		 */
+		if (S_ISREG(inode->i_mode))
+			__sb_writers_acquired(inode->i_sb, SB_FREEZE_WRITE);
+		file_end_write(kiocb->ki_filp);
+	}
+}
+
+static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
+{
+	struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw);
+
+	kiocb_end_write(kiocb);
+
+	fput(kiocb->ki_filp);
+	io_cqring_add_event(req->ctx, req->user_data, res, 0);
+	io_free_req(req);
+}
+
+static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
+		      bool force_nonblock)
+{
+	struct kiocb *kiocb = &req->rw;
+	int ret;
+
+	kiocb->ki_filp = fget(sqe->fd);
+	if (unlikely(!kiocb->ki_filp))
+		return -EBADF;
+	kiocb->ki_pos = sqe->off;
+	kiocb->ki_flags = iocb_flags(kiocb->ki_filp);
+	kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp));
+	if (sqe->ioprio) {
+		ret = ioprio_check_cap(sqe->ioprio);
+		if (ret)
+			goto out_fput;
+
+		kiocb->ki_ioprio = sqe->ioprio;
+	} else
+		kiocb->ki_ioprio = get_current_ioprio();
+
+	ret = kiocb_set_rw_flags(kiocb, sqe->rw_flags);
+	if (unlikely(ret))
+		goto out_fput;
+	if (force_nonblock) {
+		kiocb->ki_flags |= IOCB_NOWAIT;
+		req->flags |= REQ_F_FORCE_NONBLOCK;
+	}
+	if (kiocb->ki_flags & IOCB_HIPRI) {
+		ret = -EINVAL;
+		goto out_fput;
+	}
+
+	kiocb->ki_complete = io_complete_rw;
+	return 0;
+out_fput:
+	fput(kiocb->ki_filp);
+	return ret;
+}
+
+static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret)
+{
+	switch (ret) {
+	case -EIOCBQUEUED:
+		break;
+	case -ERESTARTSYS:
+	case -ERESTARTNOINTR:
+	case -ERESTARTNOHAND:
+	case -ERESTART_RESTARTBLOCK:
+		/*
+		 * We can't just restart the syscall, since previously
+		 * submitted sqes may already be in progress. Just fail this
+		 * IO with EINTR.
+		 */
+		ret = -EINTR;
+		/* fall through */
+	default:
+		kiocb->ki_complete(kiocb, ret, 0);
+	}
+}
+
+static int io_import_iovec(struct io_ring_ctx *ctx, int rw,
+			   const struct io_uring_sqe *sqe,
+			   struct iovec **iovec, struct iov_iter *iter)
+{
+	void __user *buf = u64_to_user_ptr(sqe->addr);
+
+#ifdef CONFIG_COMPAT
+	if (ctx->compat)
+		return compat_import_iovec(rw, buf, sqe->len, UIO_FASTIOV,
+						iovec, iter);
+#endif
+	return import_iovec(rw, buf, sqe->len, UIO_FASTIOV, iovec, iter);
+}
+
+static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe,
+		       bool force_nonblock)
+{
+	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
+	struct kiocb *kiocb = &req->rw;
+	struct iov_iter iter;
+	struct file *file;
+	ssize_t ret;
+
+	ret = io_prep_rw(req, sqe, force_nonblock);
+	if (ret)
+		return ret;
+	file = kiocb->ki_filp;
+
+	ret = -EBADF;
+	if (unlikely(!(file->f_mode & FMODE_READ)))
+		goto out_fput;
+	ret = -EINVAL;
+	if (unlikely(!file->f_op->read_iter))
+		goto out_fput;
+
+	ret = io_import_iovec(req->ctx, READ, sqe, &iovec, &iter);
+	if (ret)
+		goto out_fput;
+
+	ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_iter_count(&iter));
+	if (!ret) {
+		ssize_t ret2;
+
+		/* Catch -EAGAIN return for forced non-blocking submission */
+		ret2 = call_read_iter(file, kiocb, &iter);
+		if (!force_nonblock || ret2 != -EAGAIN)
+			io_rw_done(kiocb, ret2);
+		else
+			ret = -EAGAIN;
+	}
+	kfree(iovec);
+out_fput:
+	if (unlikely(ret))
+		fput(file);
+	return ret;
+}
+
+static ssize_t io_write(struct io_kiocb *req, const struct io_uring_sqe *sqe,
+			bool force_nonblock)
+{
+	struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
+	struct kiocb *kiocb = &req->rw;
+	struct iov_iter iter;
+	struct file *file;
+	ssize_t ret;
+
+	ret = io_prep_rw(req, sqe, force_nonblock);
+	if (ret)
+		return ret;
+	file = kiocb->ki_filp;
+
+	ret = -EAGAIN;
+	if (force_nonblock && !(kiocb->ki_flags & IOCB_DIRECT))
+		goto out_fput;
+
+	ret = -EBADF;
+	if (unlikely(!(file->f_mode & FMODE_WRITE)))
+		goto out_fput;
+	ret = -EINVAL;
+	if (unlikely(!file->f_op->write_iter))
+		goto out_fput;
+
+	ret = io_import_iovec(req->ctx, WRITE, sqe, &iovec, &iter);
+	if (ret)
+		goto out_fput;
+
+	ret = rw_verify_area(WRITE, file, &kiocb->ki_pos,
+				iov_iter_count(&iter));
+	if (!ret) {
+		/*
+		 * Open-code file_start_write here to grab freeze protection,
+		 * which will be released by another thread in
+		 * io_complete_rw(). Fool lockdep by telling it the lock got
+		 * released so that it doesn't complain about the held lock when
+		 * we return to userspace.
+		 */
+		if (S_ISREG(file_inode(file)->i_mode)) {
+			__sb_start_write(file_inode(file)->i_sb,
+						SB_FREEZE_WRITE, true);
+			__sb_writers_release(file_inode(file)->i_sb,
+						SB_FREEZE_WRITE);
+		}
+		kiocb->ki_flags |= IOCB_WRITE;
+		io_rw_done(kiocb, call_write_iter(file, kiocb, &iter));
+	}
+out_fput:
+	if (unlikely(ret))
+		fput(file);
+	return ret;
+}
+
+/*
+ * IORING_OP_NOP just posts a completion event, nothing else.
+ */
+static int io_nop(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+
+	__io_cqring_add_event(ctx, sqe->user_data, 0, 0);
+	io_free_req(req);
+	return 0;
+}
+
+static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
+			   struct sqe_submit *s, bool force_nonblock)
+{
+	const struct io_uring_sqe *sqe = s->sqe;
+	ssize_t ret;
+
+	if (unlikely(s->index >= ctx->sq_entries))
+		return -EINVAL;
+	req->user_data = sqe->user_data;
+
+	ret = -EINVAL;
+	switch (sqe->opcode) {
+	case IORING_OP_NOP:
+		ret = io_nop(req, sqe);
+		break;
+	case IORING_OP_READV:
+		ret = io_read(req, sqe, force_nonblock);
+		break;
+	case IORING_OP_WRITEV:
+		ret = io_write(req, sqe, force_nonblock);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
+static void io_sq_wq_submit_work(struct work_struct *work)
+{
+	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
+	struct sqe_submit *s = &req->submit;
+	struct io_ring_ctx *ctx = req->ctx;
+	mm_segment_t old_fs = get_fs();
+	struct files_struct *old_files;
+	int ret;
+
+	/* Ensure we clear previously set forced non-block flag */
+	req->flags &= ~REQ_F_FORCE_NONBLOCK;
+
+	old_files = current->files;
+	current->files = ctx->sqo_files;
+
+	if (!mmget_not_zero(ctx->sqo_mm)) {
+		ret = -EFAULT;
+		goto err;
+	}
+
+	use_mm(ctx->sqo_mm);
+	set_fs(USER_DS);
+
+	ret = __io_submit_sqe(ctx, req, s, false);
+
+	set_fs(old_fs);
+	unuse_mm(ctx->sqo_mm);
+	mmput(ctx->sqo_mm);
+err:
+	if (ret) {
+		io_cqring_add_event(ctx, s->sqe->user_data, ret, 0);
+		io_free_req(req);
+	}
+	current->files = old_files;
+}
+
+static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s)
+{
+	struct io_kiocb *req;
+	ssize_t ret;
+
+	/* enforce forwards compatibility on users */
+	if (unlikely(s->sqe->flags))
+		return -EINVAL;
+
+	req = io_get_req(ctx);
+	if (unlikely(!req))
+		return -EAGAIN;
+
+	ret = __io_submit_sqe(ctx, req, s, true);
+	if (ret == -EAGAIN) {
+		memcpy(&req->submit, s, sizeof(*s));
+		INIT_WORK(&req->work, io_sq_wq_submit_work);
+		queue_work(ctx->sqo_wq, &req->work);
+		ret = 0;
+	}
+	if (ret)
+		io_free_req(req);
+
+	return ret;
+}
+
+static void io_commit_sqring(struct io_ring_ctx *ctx)
+{
+	struct io_sq_ring *ring = ctx->sq_ring;
+
+	if (ctx->cached_sq_head != ring->r.head) {
+		ring->r.head = ctx->cached_sq_head;
+		/* write side barrier of head update, app has read side */
+		smp_wmb();
+	}
+}
+
+/*
+ * Undo last io_get_sqring()
+ */
+static void io_drop_sqring(struct io_ring_ctx *ctx)
+{
+	ctx->cached_sq_head--;
+}
+
+static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
+{
+	struct io_sq_ring *ring = ctx->sq_ring;
+	unsigned head;
+
+	head = ctx->cached_sq_head;
+	smp_rmb();
+	if (head == READ_ONCE(ring->r.tail))
+		return false;
+
+	head = ring->array[head & ctx->sq_mask];
+	if (head < ctx->sq_entries) {
+		s->index = head;
+		s->sqe = &ctx->sq_sqes[head];
+		ctx->cached_sq_head++;
+		return true;
+	}
+
+	/* drop invalid entries */
+	ctx->cached_sq_head++;
+	ring->dropped++;
+	smp_wmb();
+	return false;
+}
+
+static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
+{
+	int i, ret = 0, submit = 0;
+	struct blk_plug plug;
+
+	if (to_submit > IO_PLUG_THRESHOLD)
+		blk_start_plug(&plug);
+
+	for (i = 0; i < to_submit; i++) {
+		struct sqe_submit s;
+
+		if (!io_get_sqring(ctx, &s))
+			break;
+
+		ret = io_submit_sqe(ctx, &s);
+		if (ret) {
+			io_drop_sqring(ctx);
+			break;
+		}
+
+		submit++;
+	}
+	io_commit_sqring(ctx);
+
+	if (to_submit > IO_PLUG_THRESHOLD)
+		blk_finish_plug(&plug);
+
+	return submit ? submit : ret;
+}
+
+/*
+ * Wait until events become available, if we don't already have some. The
+ * application must reap them itself, as they reside on the shared cq ring.
+ */
+static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events)
+{
+	struct io_cq_ring *ring = ctx->cq_ring;
+	DEFINE_WAIT(wait);
+	int ret = 0;
+
+	smp_rmb();
+	if (ring->r.head != ring->r.tail)
+		return 0;
+	if (!min_events)
+		return 0;
+
+	do {
+		prepare_to_wait(&ctx->wait, &wait, TASK_INTERRUPTIBLE);
+
+		ret = 0;
+		smp_rmb();
+		if (ring->r.head != ring->r.tail)
+			break;
+
+		schedule();
+
+		ret = -EINTR;
+		if (signal_pending(current))
+			break;
+	} while (1);
+
+	finish_wait(&ctx->wait, &wait);
+	return ring->r.head == ring->r.tail ? ret : 0;
+}
+
+static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit,
+			    unsigned min_complete, unsigned flags)
+{
+	int ret = 0;
+
+	if (to_submit) {
+		ret = io_ring_submit(ctx, to_submit);
+		if (ret < 0)
+			return ret;
+	}
+	if (flags & IORING_ENTER_GETEVENTS) {
+		int get_ret;
+
+		if (!ret && to_submit)
+			min_complete = 0;
+
+		get_ret = io_cqring_wait(ctx, min_complete);
+		if (get_ret < 0 && !ret)
+			ret = get_ret;
+	}
+
+	return ret;
+}
+
+static int io_sq_offload_start(struct io_ring_ctx *ctx)
+{
+	int ret;
+
+	ctx->sqo_mm = current->mm;
+
+	/*
+	 * This is safe since 'current' has the fd installed, and if that gets
+	 * closed on exit, then fops->release() is invoked which waits for the
+	 * async contexts to flush and exit before exiting.
+	 */
+	ret = -EBADF;
+	ctx->sqo_files = current->files;
+	if (!ctx->sqo_files)
+		goto err;
+
+	/* Do QD, or 2 * CPUS, whatever is smallest */
+	ctx->sqo_wq = alloc_workqueue("io_ring-wq", WQ_UNBOUND | WQ_FREEZABLE,
+			min(ctx->sq_entries - 1, 2 * num_online_cpus()));
+	if (!ctx->sqo_wq) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	return 0;
+err:
+	if (ctx->sqo_files)
+		ctx->sqo_files = NULL;
+	ctx->sqo_mm = NULL;
+	return ret;
+}
+
+static void io_sq_offload_stop(struct io_ring_ctx *ctx)
+{
+	if (ctx->sqo_wq) {
+		destroy_workqueue(ctx->sqo_wq);
+		ctx->sqo_wq = NULL;
+	}
+}
+
+static void __io_unaccount_mem(struct user_struct *user, unsigned long nr_pages)
+{
+	atomic_long_sub(nr_pages, &user->locked_vm);
+}
+
+static void io_unaccount_mem(struct io_ring_ctx *ctx, unsigned long nr_pages)
+{
+	if (ctx->user)
+		__io_unaccount_mem(ctx->user, nr_pages);
+}
+
+static int __io_account_mem(struct user_struct *user, unsigned long nr_pages)
+{
+	unsigned long page_limit, cur_pages, new_pages;
+
+	/* Don't allow more pages than we can safely lock */
+	page_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+
+	do {
+		cur_pages = atomic_long_read(&user->locked_vm);
+		new_pages = cur_pages + nr_pages;
+		if (new_pages > page_limit)
+			return -ENOMEM;
+	} while (atomic_long_cmpxchg(&user->locked_vm, cur_pages,
+					new_pages) != cur_pages);
+
+	return 0;
+}
+
+static unsigned long ring_pages(unsigned sq_entries, unsigned cq_entries)
+{
+	struct io_sq_ring *sq_ring;
+	struct io_cq_ring *cq_ring;
+	size_t bytes;
+
+	bytes = struct_size(sq_ring, array, sq_entries);
+	bytes += array_size(sizeof(struct io_uring_sqe), sq_entries);
+	bytes += struct_size(cq_ring, cqes, cq_entries);
+
+	return (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
+}
+
+static void io_free_scq_urings(struct io_ring_ctx *ctx)
+{
+	if (ctx->sq_ring) {
+		page_frag_free(ctx->sq_ring);
+		ctx->sq_ring = NULL;
+	}
+	if (ctx->sq_sqes) {
+		page_frag_free(ctx->sq_sqes);
+		ctx->sq_sqes = NULL;
+	}
+	if (ctx->cq_ring) {
+		page_frag_free(ctx->cq_ring);
+		ctx->cq_ring = NULL;
+	}
+}
+
+static void io_ring_ctx_free(struct io_ring_ctx *ctx)
+{
+	io_sq_offload_stop(ctx);
+	io_free_scq_urings(ctx);
+	percpu_ref_exit(&ctx->refs);
+	io_unaccount_mem(ctx, ring_pages(ctx->sq_entries, ctx->cq_entries));
+	kfree(ctx);
+}
+
+static __poll_t io_uring_poll(struct file *file, poll_table *wait)
+{
+	struct io_ring_ctx *ctx = file->private_data;
+	__poll_t mask = 0;
+
+	poll_wait(file, &ctx->cq_wait, wait);
+	smp_rmb();
+	if (ctx->sq_ring->r.tail + 1 != ctx->cached_sq_head)
+		mask |= EPOLLOUT | EPOLLWRNORM;
+	if (ctx->cq_ring->r.head != ctx->cached_cq_tail)
+		mask |= EPOLLIN | EPOLLRDNORM;
+
+	return mask;
+}
+
+static int io_uring_fasync(int fd, struct file *file, int on)
+{
+	struct io_ring_ctx *ctx = file->private_data;
+
+	return fasync_helper(fd, file, on, &ctx->cq_fasync);
+}
+
+static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
+{
+	mutex_lock(&ctx->uring_lock);
+	percpu_ref_kill(&ctx->refs);
+	mutex_unlock(&ctx->uring_lock);
+
+	wait_for_completion(&ctx->ctx_done);
+	io_ring_ctx_free(ctx);
+}
+
+static int io_uring_release(struct inode *inode, struct file *file)
+{
+	struct io_ring_ctx *ctx = file->private_data;
+
+	file->private_data = NULL;
+	io_ring_ctx_wait_and_kill(ctx);
+	return 0;
+}
+
+static int io_uring_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	loff_t offset = (loff_t) vma->vm_pgoff << PAGE_SHIFT;
+	unsigned long sz = vma->vm_end - vma->vm_start;
+	struct io_ring_ctx *ctx = file->private_data;
+	unsigned long pfn;
+	struct page *page;
+	void *ptr;
+
+	switch (offset) {
+	case IORING_OFF_SQ_RING:
+		ptr = ctx->sq_ring;
+		break;
+	case IORING_OFF_SQES:
+		ptr = ctx->sq_sqes;
+		break;
+	case IORING_OFF_CQ_RING:
+		ptr = ctx->cq_ring;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	page = virt_to_head_page(ptr);
+	if (sz > (PAGE_SIZE << compound_order(page)))
+		return -EINVAL;
+
+	pfn = virt_to_phys(ptr) >> PAGE_SHIFT;
+	return remap_pfn_range(vma, vma->vm_start, pfn, sz, vma->vm_page_prot);
+}
+
+SYSCALL_DEFINE4(io_uring_enter, unsigned int, fd, u32, to_submit,
+		u32, min_complete, u32, flags)
+{
+	struct io_ring_ctx *ctx;
+	long ret = -EBADF;
+	struct fd f;
+
+	f = fdget(fd);
+	if (!f.file)
+		return -EBADF;
+
+	ret = -EOPNOTSUPP;
+	if (f.file->f_op != &io_uring_fops)
+		goto out_fput;
+
+	ret = -ENXIO;
+	ctx = f.file->private_data;
+	if (!percpu_ref_tryget(&ctx->refs))
+		goto out_fput;
+
+	ret = -EBUSY;
+	if (mutex_trylock(&ctx->uring_lock)) {
+		ret = __io_uring_enter(ctx, to_submit, min_complete, flags);
+		mutex_unlock(&ctx->uring_lock);
+	}
+	io_ring_drop_ctx_refs(ctx, 1);
+out_fput:
+	fdput(f);
+	return ret;
+}
+
+static const struct file_operations io_uring_fops = {
+	.release	= io_uring_release,
+	.mmap		= io_uring_mmap,
+	.poll		= io_uring_poll,
+	.fasync		= io_uring_fasync,
+};
+
+static void *io_mem_alloc(size_t size)
+{
+	gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP |
+				__GFP_NORETRY;
+
+	return (void *) __get_free_pages(gfp_flags, get_order(size));
+}
+
+static int io_allocate_scq_urings(struct io_ring_ctx *ctx,
+				  struct io_uring_params *p)
+{
+	struct io_sq_ring *sq_ring;
+	struct io_cq_ring *cq_ring;
+	size_t size;
+	int ret;
+
+	sq_ring = io_mem_alloc(struct_size(sq_ring, array, p->sq_entries));
+	if (!sq_ring)
+		return -ENOMEM;
+
+	ctx->sq_ring = sq_ring;
+	sq_ring->ring_mask = p->sq_entries - 1;
+	sq_ring->ring_entries = p->sq_entries;
+	ctx->sq_mask = sq_ring->ring_mask;
+	ctx->sq_entries = sq_ring->ring_entries;
+
+	ret = -EOVERFLOW;
+	size = array_size(sizeof(struct io_uring_sqe), p->sq_entries);
+	if (size == SIZE_MAX)
+		goto err;
+	ret = -ENOMEM;
+	ctx->sq_sqes = io_mem_alloc(size);
+	if (!ctx->sq_sqes)
+		goto err;
+
+	cq_ring = io_mem_alloc(struct_size(cq_ring, cqes, p->cq_entries));
+	if (!cq_ring)
+		goto err;
+
+	ctx->cq_ring = cq_ring;
+	cq_ring->ring_mask = p->cq_entries - 1;
+	cq_ring->ring_entries = p->cq_entries;
+	ctx->cq_mask = cq_ring->ring_mask;
+	ctx->cq_entries = cq_ring->ring_entries;
+	return 0;
+err:
+	io_free_scq_urings(ctx);
+	return ret;
+}
+
+static void io_fill_offsets(struct io_uring_params *p)
+{
+	memset(&p->sq_off, 0, sizeof(p->sq_off));
+	p->sq_off.head = offsetof(struct io_sq_ring, r.head);
+	p->sq_off.tail = offsetof(struct io_sq_ring, r.tail);
+	p->sq_off.ring_mask = offsetof(struct io_sq_ring, ring_mask);
+	p->sq_off.ring_entries = offsetof(struct io_sq_ring, ring_entries);
+	p->sq_off.flags = offsetof(struct io_sq_ring, flags);
+	p->sq_off.dropped = offsetof(struct io_sq_ring, dropped);
+	p->sq_off.array = offsetof(struct io_sq_ring, array);
+
+	memset(&p->cq_off, 0, sizeof(p->cq_off));
+	p->cq_off.head = offsetof(struct io_cq_ring, r.head);
+	p->cq_off.tail = offsetof(struct io_cq_ring, r.tail);
+	p->cq_off.ring_mask = offsetof(struct io_cq_ring, ring_mask);
+	p->cq_off.ring_entries = offsetof(struct io_cq_ring, ring_entries);
+	p->cq_off.overflow = offsetof(struct io_cq_ring, overflow);
+	p->cq_off.cqes = offsetof(struct io_cq_ring, cqes);
+}
+
+static int io_uring_create(unsigned entries, struct io_uring_params *p,
+			   bool compat)
+{
+	struct user_struct *user = NULL;
+	struct io_ring_ctx *ctx;
+	int ret;
+
+	if (entries > IORING_MAX_ENTRIES)
+		return -EINVAL;
+
+	/*
+	 * Use twice as many entries for the CQ ring. It's possible for the
+	 * application to drive a higher depth than the size of the SQ ring,
+	 * since the sqes are only used at submission time. This allows for
+	 * some flexibility in overcommitting a bit.
+	 */
+	p->sq_entries = roundup_pow_of_two(entries);
+	p->cq_entries = 2 * p->sq_entries;
+
+	if (!capable(CAP_IPC_LOCK)) {
+		user = get_uid(current_user());
+		ret = __io_account_mem(user, ring_pages(p->sq_entries,
+							p->cq_entries));
+		if (ret) {
+			free_uid(user);
+			return ret;
+		}
+	}
+
+	ctx = io_ring_ctx_alloc(p);
+	if (!ctx)
+		return -ENOMEM;
+	ctx->compat = compat;
+	ctx->user = user;
+
+	ret = io_allocate_scq_urings(ctx, p);
+	if (ret)
+		goto err;
+
+	ret = io_sq_offload_start(ctx);
+	if (ret)
+		goto err;
+
+	ret = anon_inode_getfd("[io_uring]", &io_uring_fops, ctx,
+				O_RDWR | O_CLOEXEC);
+	if (ret < 0)
+		goto err;
+
+	io_fill_offsets(p);
+	return ret;
+err:
+	io_ring_ctx_wait_and_kill(ctx);
+	return ret;
+}
+
+/*
+ * Sets up an aio uring context, and returns the fd. The application asks
+ * for a ring size; we return the actual sq/cq ring sizes (among other
+ * things) in the params structure passed in.
+ */ +static long io_uring_setup(u32 entries, struct io_uring_params __user *params, + bool compat) +{ + struct io_uring_params p; + long ret; + int i; + + if (copy_from_user(&p, params, sizeof(p))) + return -EFAULT; + for (i = 0; i < ARRAY_SIZE(p.resv); i++) { + if (p.resv[i]) + return -EINVAL; + } + + if (p.flags) + return -EINVAL; + + ret = io_uring_create(entries, &p, compat); + if (ret < 0) + return ret; + + if (copy_to_user(params, &p, sizeof(p))) + return -EFAULT; + + return ret; +} + +SYSCALL_DEFINE2(io_uring_setup, u32, entries, + struct io_uring_params __user *, params) +{ + return io_uring_setup(entries, params, false); +} + +#ifdef CONFIG_COMPAT +COMPAT_SYSCALL_DEFINE2(io_uring_setup, u32, entries, + struct io_uring_params __user *, params) +{ + return io_uring_setup(entries, params, true); +} +#endif + +static int __init io_uring_init(void) +{ + req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC); + return 0; +}; +__initcall(io_uring_init); diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 257cccba3062..542757a4c898 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -69,6 +69,7 @@ struct file_handle; struct sigaltstack; struct rseq; union bpf_attr; +struct io_uring_params; #include #include @@ -309,6 +310,10 @@ asmlinkage long sys_io_pgetevents_time32(aio_context_t ctx_id, struct io_event __user *events, struct old_timespec32 __user *timeout, const struct __aio_sigset *sig); +asmlinkage long sys_io_uring_setup(u32 entries, + struct io_uring_params __user *p); +asmlinkage long sys_io_uring_enter(unsigned int fd, u32 to_submit, + u32 min_complete, u32 flags); /* fs/xattr.c */ asmlinkage long sys_setxattr(const char __user *path, const char __user *name, diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h new file mode 100644 index 000000000000..ce65db9269a8 --- /dev/null +++ b/include/uapi/linux/io_uring.h @@ -0,0 +1,96 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +/* + * Header file for the io_uring interface. 
+ * + * Copyright (C) 2019 Jens Axboe + * Copyright (C) 2019 Christoph Hellwig + */ +#ifndef LINUX_IO_URING_H +#define LINUX_IO_URING_H + +#include +#include + +#define IORING_MAX_ENTRIES 4096 + +/* + * IO submission data structure (Submission Queue Entry) + */ +struct io_uring_sqe { + __u8 opcode; /* type of operation for this sqe */ + __u8 flags; /* as of now unused */ + __u16 ioprio; /* ioprio for the request */ + __s32 fd; /* file descriptor to do IO on */ + __u64 off; /* offset into file */ + __u64 addr; /* pointer to buffer or iovecs */ + __u32 len; /* buffer size or number of iovecs */ + union { + __kernel_rwf_t rw_flags; + __u32 __resv; + }; + __u64 user_data; /* data to be passed back at completion time */ + __u64 __pad2[3]; +}; + +#define IORING_OP_NOP 0 +#define IORING_OP_READV 1 +#define IORING_OP_WRITEV 2 + +/* + * IO completion data structure (Completion Queue Entry) + */ +struct io_uring_cqe { + __u64 user_data; /* sqe->data submission passed back */ + __s32 res; /* result code for this event */ + __u32 flags; +}; + +/* + * Magic offsets for the application to mmap the data it needs + */ +#define IORING_OFF_SQ_RING 0ULL +#define IORING_OFF_CQ_RING 0x8000000ULL +#define IORING_OFF_SQES 0x10000000ULL + +/* + * Filled with the offset for mmap(2) + */ +struct io_sqring_offsets { + __u32 head; + __u32 tail; + __u32 ring_mask; + __u32 ring_entries; + __u32 flags; + __u32 dropped; + __u32 array; + __u32 resv[3]; +}; + +struct io_cqring_offsets { + __u32 head; + __u32 tail; + __u32 ring_mask; + __u32 ring_entries; + __u32 overflow; + __u32 cqes; + __u32 resv[4]; +}; + +/* + * io_uring_enter(2) flags + */ +#define IORING_ENTER_GETEVENTS (1 << 0) + +/* + * Passed in for io_uring_setup(2). Copied back with updated info on success + */ +struct io_uring_params { + __u32 sq_entries; + __u32 cq_entries; + __u32 flags; + __u16 resv[10]; + struct io_sqring_offsets sq_off; + struct io_cqring_offsets cq_off; +}; + +#endif diff --git a/init/Kconfig b/init/Kconfig index d47cb77a220e..ce7bd7af9312 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1402,6 +1402,15 @@ config AIO by some high performance threaded applications. Disabling this option saves about 7k. +config IO_URING + bool "Enable IO uring support" if EXPERT + select ANON_INODES + default y + help + This option enables support for the io_uring interface, enabling + applications to submit and completion IO through submission and + completion rings that are shared between the kernel and application. 
+ config ADVISE_SYSCALLS bool "Enable madvise/fadvise syscalls" if EXPERT default y diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index ab9d0e3c6d50..d754811ec780 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -46,6 +46,9 @@ COND_SYSCALL(io_getevents); COND_SYSCALL(io_pgetevents); COND_SYSCALL_COMPAT(io_getevents); COND_SYSCALL_COMPAT(io_pgetevents); +COND_SYSCALL(io_uring_setup); +COND_SYSCALL_COMPAT(io_uring_setup); +COND_SYSCALL(io_uring_enter); /* fs/xattr.c */

From patchwork Fri Jan 18 16:12:14 2019
X-Patchwork-Id: 10770783
From: Jens Axboe
To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe
Subject: [PATCH 06/17] io_uring: add fsync support
Date: Fri, 18 Jan 2019 09:12:14 -0700
Message-Id: <20190118161225.4545-7-axboe@kernel.dk>
In-Reply-To: <20190118161225.4545-1-axboe@kernel.dk>
References: <20190118161225.4545-1-axboe@kernel.dk>

From: Christoph Hellwig

Add a new fsync opcode, which either syncs a range if one is passed, or the whole file if the offset and length fields are both cleared to zero. A flag is provided to use fdatasync semantics, that is, to only force out the metadata required to retrieve the file data, but not other metadata such as timestamps.
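As a userspace illustration (not part of the patch): a minimal sketch of filling an SQE for a ranged fdatasync against the uapi header added earlier in this series. The helper name prep_fsync() is hypothetical, and submission would still go through the mmap'ed SQ ring and io_uring_enter(2) as in the setup patch:

#include <string.h>
#include <linux/io_uring.h>

/* Hypothetical helper: fill an SQE for a ranged fdatasync. */
static void prep_fsync(struct io_uring_sqe *sqe, int fd,
		       __u64 off, __u32 len, __u64 user_data)
{
	/* Zero everything; io_fsync() rejects a non-zero addr or ioprio. */
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_FSYNC;
	sqe->fd = fd;
	sqe->off = off;		/* off == 0 && len == 0 syncs the whole file */
	sqe->len = len;
	sqe->fsync_flags = IORING_FSYNC_DATASYNC;	/* fdatasync semantics */
	sqe->user_data = user_data;	/* echoed back in the CQE */
}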
Signed-off-by: Christoph Hellwig Signed-off-by: Jens Axboe --- fs/io_uring.c | 34 ++++++++++++++++++++++++++++++++++ include/uapi/linux/io_uring.h | 8 +++++++- 2 files changed, 41 insertions(+), 1 deletion(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 2952d0a82dd3..f464f6ca2b7d 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4,6 +4,7 @@ * supporting fast/efficient IO. * * Copyright (C) 2018-2019 Jens Axboe + * Copyright (c) 2018-2019 Christoph Hellwig */ #include #include @@ -465,6 +466,36 @@ static int io_nop(struct io_kiocb *req, const struct io_uring_sqe *sqe) return 0; } +static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe, + bool force_nonblock) +{ + struct io_ring_ctx *ctx = req->ctx; + loff_t end = sqe->off + sqe->len; + struct file *file; + int ret; + + /* fsync always requires a blocking context */ + if (force_nonblock) + return -EAGAIN; + + if (unlikely(sqe->addr || sqe->ioprio)) + return -EINVAL; + if (unlikely(sqe->fsync_flags & ~IORING_FSYNC_DATASYNC)) + return -EINVAL; + + file = fget(sqe->fd); + if (unlikely(!file)) + return -EBADF; + + ret = vfs_fsync_range(file, sqe->off, end > 0 ?
end : LLONG_MAX, + sqe->fsync_flags & IORING_FSYNC_DATASYNC); + + fput(file); + io_cqring_add_event(ctx, sqe->user_data, ret, 0); + io_free_req(req); + return 0; +} + static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, struct sqe_submit *s, bool force_nonblock) { @@ -486,6 +517,9 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, case IORING_OP_WRITEV: ret = io_write(req, sqe, force_nonblock); break; + case IORING_OP_FSYNC: + ret = io_fsync(req, sqe, force_nonblock); + break; default: ret = -EINVAL; break; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index ce65db9269a8..ca503ded73e3 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -26,7 +26,7 @@ struct io_uring_sqe { __u32 len; /* buffer size or number of iovecs */ union { __kernel_rwf_t rw_flags; - __u32 __resv; + __u32 fsync_flags; }; __u64 user_data; /* data to be passed back at completion time */ __u64 __pad2[3]; @@ -35,6 +35,12 @@ struct io_uring_sqe { #define IORING_OP_NOP 0 #define IORING_OP_READV 1 #define IORING_OP_WRITEV 2 +#define IORING_OP_FSYNC 3 + +/* + * sqe->fsync_flags + */ +#define IORING_FSYNC_DATASYNC (1 << 0) /* * IO completion data structure (Completion Queue Entry)

From patchwork Fri Jan 18 16:12:15 2019
X-Patchwork-Id: 10770789
From: Jens Axboe
To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe
Subject: [PATCH 07/17] io_uring: support for IO polling
Date: Fri, 18 Jan 2019 09:12:15 -0700
Message-Id: <20190118161225.4545-8-axboe@kernel.dk>
In-Reply-To: <20190118161225.4545-1-axboe@kernel.dk>
References: <20190118161225.4545-1-axboe@kernel.dk>
Add support for a polled io_uring context. When a read or write is submitted to a polled context, the application must poll for completions on the CQ ring through io_uring_enter(2). Polled IO may not generate IRQ completions; hence completions need to be actively found by the application itself.

To use polling, io_uring_setup() must be used with the IORING_SETUP_IOPOLL flag set. It is illegal to mix and match polled and non-polled IO on an io_uring.
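For illustration, the userspace side of a polled ring looks roughly like the sketch below. The io_uring_setup()/io_uring_enter() functions stand in for raw syscall(2) invocations (no libc wrappers exist at this point in the series), and the reads/writes submitted on such a ring must be O_DIRECT:

#include <string.h>
#include <linux/io_uring.h>

/* Stand-ins for the raw system calls added by this series. */
extern int io_uring_setup(unsigned entries, struct io_uring_params *p);
extern int io_uring_enter(int fd, unsigned to_submit, unsigned min_complete,
			  unsigned flags);

static int setup_polled_ring(unsigned entries)
{
	struct io_uring_params p;

	memset(&p, 0, sizeof(p));
	p.flags = IORING_SETUP_IOPOLL;	/* all IO on this ring is polled */
	return io_uring_setup(entries, &p);
}

static int reap_events(int ring_fd, unsigned min_complete)
{
	/*
	 * With IOPOLL, completions never arrive via IRQ; this call drives
	 * the device poll loop until min_complete events are found.
	 */
	return io_uring_enter(ring_fd, 0, min_complete,
			      IORING_ENTER_GETEVENTS);
}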
Signed-off-by: Jens Axboe --- fs/io_uring.c | 263 ++++++++++++++++++++++++++++++++-- include/uapi/linux/io_uring.h | 5 + 2 files changed, 257 insertions(+), 11 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index f464f6ca2b7d..0218d516d184 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -101,6 +101,8 @@ struct io_ring_ctx { struct { spinlock_t completion_lock; + struct list_head poll_list; + unsigned poll_multi_file; } ____cacheline_aligned_in_smp; }; @@ -119,12 +121,16 @@ struct io_kiocb { struct list_head list; unsigned int flags; #define REQ_F_FORCE_NONBLOCK 1 /* inline submission attempt */ +#define REQ_F_IOPOLL_COMPLETED 2 /* polled IO has completed */ +#define REQ_F_IOPOLL_EAGAIN 4 /* submission got EAGAIN */ u64 user_data; + u64 res; struct work_struct work; }; #define IO_PLUG_THRESHOLD 2 +#define IO_IOPOLL_BATCH 8 static struct kmem_cache *req_cachep; @@ -156,6 +162,7 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) mutex_init(&ctx->uring_lock); init_waitqueue_head(&ctx->wait); spin_lock_init(&ctx->completion_lock); + INIT_LIST_HEAD(&ctx->poll_list); return ctx; } @@ -191,8 +198,8 @@ static struct io_uring_cqe *io_get_cqring(struct io_ring_ctx *ctx) return &ring->cqes[tail & ctx->cq_mask]; } -static void __io_cqring_add_event(struct io_ring_ctx *ctx, u64 ki_user_data, - long res, unsigned ev_flags) +static void io_cqring_fill_event(struct io_ring_ctx *ctx, u64 ki_user_data, + long res, unsigned ev_flags) { struct io_uring_cqe *cqe; @@ -206,9 +213,15 @@ static void __io_cqring_add_event(struct io_ring_ctx *ctx, u64 ki_user_data, cqe->user_data = ki_user_data; cqe->res = res; cqe->flags = ev_flags; - io_commit_cqring(ctx); } else ctx->cq_ring->overflow++; +} + +static void __io_cqring_add_event(struct io_ring_ctx *ctx, u64 ki_user_data, + long res, unsigned ev_flags) +{ + io_cqring_fill_event(ctx, ki_user_data, res, ev_flags); + io_commit_cqring(ctx); if (waitqueue_active(&ctx->wait)) wake_up(&ctx->wait); @@ -250,12 +263,158 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx) return NULL; } +static void io_free_req_many(struct io_ring_ctx *ctx, void **reqs, int *nr) +{ + if (*nr) { + kmem_cache_free_bulk(req_cachep, *nr, reqs); + io_ring_drop_ctx_refs(ctx, *nr); + *nr = 0; + } +} + static void io_free_req(struct io_kiocb *req) { io_ring_drop_ctx_refs(req->ctx, 1); kmem_cache_free(req_cachep, req); } +/* + * Find and free completed poll iocbs + */ +static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events, + struct list_head *done) +{ + void *reqs[IO_IOPOLL_BATCH]; + struct io_kiocb *req; + int to_free = 0; + + while (!list_empty(done)) { + req = list_first_entry(done, struct io_kiocb, list); + list_del(&req->list); + + io_cqring_fill_event(ctx, req->user_data, req->res, 0); + + reqs[to_free++] = req; + (*nr_events)++; + + fput(req->rw.ki_filp); + if (to_free == ARRAY_SIZE(reqs)) + io_free_req_many(ctx, reqs, &to_free); + } + io_commit_cqring(ctx); + + if (to_free) + io_free_req_many(ctx, reqs, &to_free); +} + +static int io_do_iopoll(struct io_ring_ctx *ctx, unsigned int *nr_events, + long min) +{ + struct io_kiocb *req, *tmp; + LIST_HEAD(done); + bool spin; + int ret; + + /* + * Only spin for completions if we don't have multiple devices hanging + * off our complete list, and we're under the requested amount. 
+ */ + spin = !ctx->poll_multi_file && (*nr_events < min); + + ret = 0; + list_for_each_entry_safe(req, tmp, &ctx->poll_list, list) { + struct kiocb *kiocb = &req->rw; + + /* + * Move completed entries to our local list. If we find a + * request that requires polling, break out and complete + * the done list first, if we have entries there. + */ + if (req->flags & REQ_F_IOPOLL_COMPLETED) { + list_move_tail(&req->list, &done); + continue; + } + if (!list_empty(&done)) + break; + + ret = kiocb->ki_filp->f_op->iopoll(kiocb, spin); + if (ret < 0) + break; + + if (ret && spin) + spin = false; + ret = 0; + } + + if (!list_empty(&done)) + io_iopoll_complete(ctx, nr_events, &done); + + return ret; +} + +/* + * Poll for a mininum of 'min' events. Note that if min == 0 we consider that a + * non-spinning poll check - we'll still enter the driver poll loop, but only + * as a non-spinning completion check. + */ +static int io_iopoll_getevents(struct io_ring_ctx *ctx, unsigned int *nr_events, + long min) +{ + int ret; + + do { + if (list_empty(&ctx->poll_list)) + return 0; + + ret = io_do_iopoll(ctx, nr_events, min); + if (ret < 0) + break; + } while (min && *nr_events < min); + + if (ret < 0) + return ret; + + return *nr_events < min; +} + +/* + * We can't just wait for polled events to come to us, we have to actively + * find and complete them. + */ +static void io_iopoll_reap_events(struct io_ring_ctx *ctx) +{ + if (!(ctx->flags & IORING_SETUP_IOPOLL)) + return; + + mutex_lock(&ctx->uring_lock); + while (!list_empty(&ctx->poll_list)) { + unsigned int nr_events = 0; + + io_iopoll_getevents(ctx, &nr_events, 1); + } + mutex_unlock(&ctx->uring_lock); +} + +static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned *nr_events, + long min) +{ + int ret = 0; + + while (!*nr_events || !need_resched()) { + int tmin = 0; + + if (*nr_events < min) + tmin = min - *nr_events; + + ret = io_iopoll_getevents(ctx, nr_events, tmin); + if (ret <= 0) + break; + ret = 0; + } + + return ret; +} + static void kiocb_end_write(struct kiocb *kiocb) { if (kiocb->ki_flags & IOCB_WRITE) { @@ -282,9 +441,60 @@ static void io_complete_rw(struct kiocb *kiocb, long res, long res2) io_free_req(req); } +static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2) +{ + struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw); + + kiocb_end_write(kiocb); + + if (unlikely(res == -EAGAIN)) { + req->flags |= REQ_F_IOPOLL_EAGAIN; + } else { + req->flags |= REQ_F_IOPOLL_COMPLETED; + req->res = res; + } +} + +/* + * After the iocb has been issued, it's safe to be found on the poll list. + * Adding the kiocb to the list AFTER submission ensures that we don't + * find it from a io_iopoll_getevents() thread before the issuer is done + * accessing the kiocb cookie. + */ +static void io_iopoll_req_issued(struct io_kiocb *req) +{ + struct io_ring_ctx *ctx = req->ctx; + + /* + * Track whether we have multiple files in our lists. This will impact + * how we do polling eventually, not spinning if we're on potentially + * different devices. + */ + if (list_empty(&ctx->poll_list)) { + ctx->poll_multi_file = 0; + } else if (!ctx->poll_multi_file) { + struct io_kiocb *list_req; + + list_req = list_first_entry(&ctx->poll_list, struct io_kiocb, + list); + if (list_req->rw.ki_filp != req->rw.ki_filp) + ctx->poll_multi_file = 1; + } + + /* + * For fast devices, IO may have already completed. If it has, add + * it to the front so we find it first. 
+ */ + if (req->flags & REQ_F_IOPOLL_COMPLETED) + list_add(&req->list, &ctx->poll_list); + else + list_add_tail(&req->list, &ctx->poll_list); +} + static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, bool force_nonblock) { + struct io_ring_ctx *ctx = req->ctx; struct kiocb *kiocb = &req->rw; int ret; @@ -310,12 +520,21 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, kiocb->ki_flags |= IOCB_NOWAIT; req->flags |= REQ_F_FORCE_NONBLOCK; } - if (kiocb->ki_flags & IOCB_HIPRI) { - ret = -EINVAL; - goto out_fput; - } + if (ctx->flags & IORING_SETUP_IOPOLL) { + ret = -EOPNOTSUPP; + if (!(kiocb->ki_flags & IOCB_DIRECT) || + !kiocb->ki_filp->f_op->iopoll) + goto out_fput; - kiocb->ki_complete = io_complete_rw; + kiocb->ki_flags |= IOCB_HIPRI; + kiocb->ki_complete = io_complete_rw_iopoll; + } else { + if (kiocb->ki_flags & IOCB_HIPRI) { + ret = -EINVAL; + goto out_fput; + } + kiocb->ki_complete = io_complete_rw; + } return 0; out_fput: fput(kiocb->ki_filp); @@ -461,6 +680,9 @@ static int io_nop(struct io_kiocb *req, const struct io_uring_sqe *sqe) { struct io_ring_ctx *ctx = req->ctx; + if (unlikely(ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL; + __io_cqring_add_event(ctx, sqe->user_data, 0, 0); io_free_req(req); return 0; @@ -478,6 +700,8 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (force_nonblock) return -EAGAIN; + if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) + return -EINVAL; if (unlikely(sqe->addr || sqe->ioprio)) return -EINVAL; if (unlikely(sqe->fsync_flags & ~IORING_FSYNC_DATASYNC)) @@ -525,7 +749,16 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, break; } - return ret; + if (ret) + return ret; + + if (ctx->flags & IORING_SETUP_IOPOLL) { + if (req->flags & REQ_F_IOPOLL_EAGAIN) + return -EAGAIN; + io_iopoll_req_issued(req); + } + + return 0; } static void io_sq_wq_submit_work(struct work_struct *work) @@ -710,12 +943,18 @@ static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit, return ret; } if (flags & IORING_ENTER_GETEVENTS) { + unsigned nr_events = 0; int get_ret; if (!ret && to_submit) min_complete = 0; - get_ret = io_cqring_wait(ctx, min_complete); + if (ctx->flags & IORING_SETUP_IOPOLL) + get_ret = io_iopoll_check(ctx, &nr_events, + min_complete); + else + get_ret = io_cqring_wait(ctx, min_complete); + if (get_ret < 0 && !ret) ret = get_ret; } @@ -824,6 +1063,7 @@ static void io_free_scq_urings(struct io_ring_ctx *ctx) static void io_ring_ctx_free(struct io_ring_ctx *ctx) { io_sq_offload_stop(ctx); + io_iopoll_reap_events(ctx); io_free_scq_urings(ctx); percpu_ref_exit(&ctx->refs); io_unaccount_mem(ctx, ring_pages(ctx->sq_entries, ctx->cq_entries)); @@ -858,6 +1098,7 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) percpu_ref_kill(&ctx->refs); mutex_unlock(&ctx->uring_lock); + io_iopoll_reap_events(ctx); wait_for_completion(&ctx->ctx_done); io_ring_ctx_free(ctx); } @@ -1084,7 +1325,7 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params, return -EINVAL; } - if (p.flags) + if (p.flags & ~IORING_SETUP_IOPOLL) return -EINVAL; ret = io_uring_create(entries, &p, compat); diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index ca503ded73e3..4fc5fbd07688 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -32,6 +32,11 @@ struct io_uring_sqe { __u64 __pad2[3]; }; +/* + * io_uring_setup() flags + */ +#define IORING_SETUP_IOPOLL (1 << 0) /* io_context is 
polled */ + #define IORING_OP_NOP 0 #define IORING_OP_READV 1 #define IORING_OP_WRITEV 2

From patchwork Fri Jan 18 16:12:16 2019
X-Patchwork-Id: 10770793
From: Jens Axboe
To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe
Subject: [PATCH 08/17] fs: add fget_many() and fput_many()
Date: Fri, 18 Jan 2019 09:12:16 -0700
Message-Id: <20190118161225.4545-9-axboe@kernel.dk>
In-Reply-To: <20190118161225.4545-1-axboe@kernel.dk>
References: <20190118161225.4545-1-axboe@kernel.dk>

Some use cases repeatedly get and put references to the same file, but the only exposed interface does these one at a time. As each of these entails an atomic inc or dec on a shared structure, that cost can add up.

Add fget_many(), which works just like fget(), except it takes an argument for how many references to get on the file. Ditto fput_many(), which can drop an arbitrary number of references to a file.
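The intended calling pattern, condensed into one kernel-style sketch (illustrative only; submit_one() is a hypothetical consumer that owns one reference per submission):

/* Sketch of the batching pattern enabled by fget_many()/fput_many(). */
static int submit_batch(int fd, unsigned int nr)
{
	struct file *file;
	unsigned int i;

	file = fget_many(fd, nr);	/* one atomic add instead of nr incs */
	if (!file)
		return -EBADF;

	for (i = 0; i < nr; i++)
		if (submit_one(file))	/* each submission owns one ref */
			break;

	if (i < nr)
		fput_many(file, nr - i);	/* return unused refs at once */
	return 0;
}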
Signed-off-by: Jens Axboe --- fs/file.c | 15 ++++++++++----- fs/file_table.c | 9 +++++++-- include/linux/file.h | 2 ++ include/linux/fs.h | 4 +++- 4 files changed, 22 insertions(+), 8 deletions(-) diff --git a/fs/file.c b/fs/file.c index 3209ee271c41..e0d7ce70e860 100644 --- a/fs/file.c +++ b/fs/file.c @@ -705,7 +705,7 @@ void do_close_on_exec(struct files_struct *files) spin_unlock(&files->file_lock); } -static struct file *__fget(unsigned int fd, fmode_t mask) +static struct file *__fget(unsigned int fd, fmode_t mask, unsigned int refs) { struct files_struct *files = current->files; struct file *file; @@ -720,7 +720,7 @@ static struct file *__fget(unsigned int fd, fmode_t mask) */ if (file->f_mode & mask) file = NULL; - else if (!get_file_rcu(file)) + else if (!get_file_rcu_many(file, refs)) goto loop; } rcu_read_unlock(); @@ -728,15 +728,20 @@ static struct file *__fget(unsigned int fd, fmode_t mask) return file; } +struct file *fget_many(unsigned int fd, unsigned int refs) +{ + return __fget(fd, FMODE_PATH, refs); +} + struct file *fget(unsigned int fd) { - return __fget(fd, FMODE_PATH); + return fget_many(fd, 1); } EXPORT_SYMBOL(fget); struct file *fget_raw(unsigned int fd) { - return __fget(fd, 0); + return __fget(fd, 0, 1); } EXPORT_SYMBOL(fget_raw); @@ -767,7 +772,7 @@ static unsigned long __fget_light(unsigned int fd, fmode_t mask) return 0; return (unsigned long)file; } else { - file = __fget(fd, mask); + file = __fget(fd, mask, 1); if (!file) return 0; return FDPUT_FPUT | (unsigned long)file; diff --git a/fs/file_table.c b/fs/file_table.c index 5679e7fcb6b0..155d7514a094 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -326,9 +326,9 @@ void flush_delayed_fput(void) static DECLARE_DELAYED_WORK(delayed_fput_work, delayed_fput); -void fput(struct file *file) +void fput_many(struct file *file, unsigned int refs) { - if (atomic_long_dec_and_test(&file->f_count)) { + if (atomic_long_sub_and_test(refs, &file->f_count)) { struct task_struct *task = current; if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) { @@ -347,6 +347,11 @@ void fput(struct file *file) } } +void fput(struct file *file) +{ + fput_many(file, 1); +} + /* * synchronous analog of fput(); for kernel threads that might be needed * in some umount() (and thus can't use flush_delayed_fput() without diff --git
a/include/linux/file.h b/include/linux/file.h index 6b2fb032416c..3fcddff56bc4 100644 --- a/include/linux/file.h +++ b/include/linux/file.h @@ -13,6 +13,7 @@ struct file; extern void fput(struct file *); +extern void fput_many(struct file *, unsigned int); struct file_operations; struct vfsmount; @@ -44,6 +45,7 @@ static inline void fdput(struct fd fd) } extern struct file *fget(unsigned int fd); +extern struct file *fget_many(unsigned int fd, unsigned int refs); extern struct file *fget_raw(unsigned int fd); extern unsigned long __fdget(unsigned int fd); extern unsigned long __fdget_raw(unsigned int fd); diff --git a/include/linux/fs.h b/include/linux/fs.h index ccb0b7a63aa5..acaad78b6781 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -952,7 +952,9 @@ static inline struct file *get_file(struct file *f) atomic_long_inc(&f->f_count); return f; } -#define get_file_rcu(x) atomic_long_inc_not_zero(&(x)->f_count) +#define get_file_rcu_many(x, cnt) \ + atomic_long_add_unless(&(x)->f_count, (cnt), 0) +#define get_file_rcu(x) get_file_rcu_many((x), 1) #define fput_atomic(x) atomic_long_add_unless(&(x)->f_count, -1, 1) #define file_count(x) atomic_long_read(&(x)->f_count)

From patchwork Fri Jan 18 16:12:17 2019
X-Patchwork-Id: 10770797
From: Jens Axboe
To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe
Subject: [PATCH 09/17] io_uring: use fget/fput_many() for file references
Date: Fri, 18 Jan 2019 09:12:17 -0700
Message-Id: <20190118161225.4545-10-axboe@kernel.dk>
In-Reply-To: <20190118161225.4545-1-axboe@kernel.dk>
References: <20190118161225.4545-1-axboe@kernel.dk>

Add a separate io_submit_state structure, to cache some of the things we need for IO submission. One such example is file reference batching: we get as many references as the number of sqes we are submitting, and drop unused ones if we end up switching files. The assumption here is that we're usually only dealing with one fd, and if there are multiple, hopefully they are at least somewhat ordered. This could trivially be extended to cover multiple fds, if needed.

On the completion side we do the same thing, except this is trivially done just locally in io_iopoll_reap().
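The lifecycle of the new state, condensed from what io_ring_submit() does in the diff below (a sketch, ignoring the IO_PLUG_THRESHOLD check and error paths of the real code):

/* Condensed view of submission with io_submit_state; illustrative only. */
static int submit_n(struct io_ring_ctx *ctx, struct sqe_submit *sqes,
		    unsigned int nr)
{
	struct io_submit_state state;
	unsigned int i;
	int ret = 0;

	io_submit_state_start(&state, ctx, nr);	/* start plug, reset cache */
	for (i = 0; i < nr; i++) {
		ret = io_submit_sqe(ctx, &sqes[i], &state);
		if (ret)
			break;
	}
	io_submit_state_end(&state);	/* unplug, drop unused file refs */
	return ret;
}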
Signed-off-by: Jens Axboe --- fs/io_uring.c | 139 ++++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 118 insertions(+), 21 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 0218d516d184..668043c87ddb 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -132,6 +132,19 @@ struct io_kiocb { #define IO_PLUG_THRESHOLD 2 #define IO_IOPOLL_BATCH 8 +struct io_submit_state { + struct blk_plug plug; + + /* + * File reference cache + */ + struct file *file; + unsigned int fd; + unsigned int has_refs; + unsigned int used_refs; + unsigned int ios_left; +}; + static struct kmem_cache *req_cachep; static const struct file_operations io_uring_fops; @@ -285,9 +298,11 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events, struct list_head *done) { void *reqs[IO_IOPOLL_BATCH]; + int file_count, to_free; + struct file *file = NULL; struct io_kiocb *req; - int to_free = 0; + file_count = to_free = 0; while (!list_empty(done)) { req = list_first_entry(done, struct io_kiocb, list); list_del(&req->list); @@ -297,12 +312,28 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events, reqs[to_free++] = req; (*nr_events)++; - fput(req->rw.ki_filp); + /* + * Batched puts of the same file, to avoid dirtying the + * file usage count multiple times, if avoidable.
+ */ + if (!file) { + file = req->rw.ki_filp; + file_count = 1; + } else if (file == req->rw.ki_filp) { + file_count++; + } else { + fput_many(file, file_count); + file = req->rw.ki_filp; + file_count = 1; + } + if (to_free == ARRAY_SIZE(reqs)) io_free_req_many(ctx, reqs, &to_free); } io_commit_cqring(ctx); + if (file) + fput_many(file, file_count); if (to_free) io_free_req_many(ctx, reqs, &to_free); } @@ -491,14 +522,56 @@ static void io_iopoll_req_issued(struct io_kiocb *req) list_add_tail(&req->list, &ctx->poll_list); } +static void io_file_put(struct io_submit_state *state, struct file *file) +{ + if (!state) { + fput(file); + } else if (state->file) { + int diff = state->has_refs - state->used_refs; + + if (diff) + fput_many(state->file, diff); + state->file = NULL; + } +} + +/* + * Get as many references to a file as we have IOs left in this submission, + * assuming most submissions are for one file, or at least that each file + * has more than one submission. + */ +static struct file *io_file_get(struct io_submit_state *state, int fd) +{ + if (!state) + return fget(fd); + + if (state->file) { + if (state->fd == fd) { + state->used_refs++; + state->ios_left--; + return state->file; + } + io_file_put(state, NULL); + } + state->file = fget_many(fd, state->ios_left); + if (!state->file) + return NULL; + + state->fd = fd; + state->has_refs = state->ios_left; + state->used_refs = 1; + state->ios_left--; + return state->file; +} + static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, - bool force_nonblock) + bool force_nonblock, struct io_submit_state *state) { struct io_ring_ctx *ctx = req->ctx; struct kiocb *kiocb = &req->rw; int ret; - kiocb->ki_filp = fget(sqe->fd); + kiocb->ki_filp = io_file_get(state, sqe->fd); if (unlikely(!kiocb->ki_filp)) return -EBADF; kiocb->ki_pos = sqe->off; @@ -537,7 +610,7 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, } return 0; out_fput: - fput(kiocb->ki_filp); + io_file_put(state, kiocb->ki_filp); return ret; } @@ -577,7 +650,7 @@ static int io_import_iovec(struct io_ring_ctx *ctx, int rw, } static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe, - bool force_nonblock) + bool force_nonblock, struct io_submit_state *state) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct kiocb *kiocb = &req->rw; @@ -585,7 +658,7 @@ static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe, struct file *file; ssize_t ret; - ret = io_prep_rw(req, sqe, force_nonblock); + ret = io_prep_rw(req, sqe, force_nonblock, state); if (ret) return ret; file = kiocb->ki_filp; @@ -620,7 +693,7 @@ static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe, } static ssize_t io_write(struct io_kiocb *req, const struct io_uring_sqe *sqe, - bool force_nonblock) + bool force_nonblock, struct io_submit_state *state) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct kiocb *kiocb = &req->rw; @@ -628,7 +701,7 @@ static ssize_t io_write(struct io_kiocb *req, const struct io_uring_sqe *sqe, struct file *file; ssize_t ret; - ret = io_prep_rw(req, sqe, force_nonblock); + ret = io_prep_rw(req, sqe, force_nonblock, state); if (ret) return ret; file = kiocb->ki_filp; @@ -721,7 +794,8 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe, } static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, - struct sqe_submit *s, bool force_nonblock) + struct sqe_submit *s, bool force_nonblock, + struct io_submit_state *state) { 
const struct io_uring_sqe *sqe = s->sqe; ssize_t ret; @@ -736,10 +810,10 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, ret = io_nop(req, sqe); break; case IORING_OP_READV: - ret = io_read(req, sqe, force_nonblock); + ret = io_read(req, sqe, force_nonblock, state); break; case IORING_OP_WRITEV: - ret = io_write(req, sqe, force_nonblock); + ret = io_write(req, sqe, force_nonblock, state); break; case IORING_OP_FSYNC: ret = io_fsync(req, sqe, force_nonblock); @@ -784,7 +858,7 @@ static void io_sq_wq_submit_work(struct work_struct *work) use_mm(ctx->sqo_mm); set_fs(USER_DS); - ret = __io_submit_sqe(ctx, req, s, false); + ret = __io_submit_sqe(ctx, req, s, false, NULL); set_fs(old_fs); unuse_mm(ctx->sqo_mm); @@ -797,7 +871,8 @@ static void io_sq_wq_submit_work(struct work_struct *work) current->files = old_files; } -static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s) +static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s, + struct io_submit_state *state) { struct io_kiocb *req; ssize_t ret; @@ -810,7 +885,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s) if (unlikely(!req)) return -EAGAIN; - ret = __io_submit_sqe(ctx, req, s, true); + ret = __io_submit_sqe(ctx, req, s, true, state); if (ret == -EAGAIN) { memcpy(&req->submit, s, sizeof(*s)); INIT_WORK(&req->work, io_sq_wq_submit_work); @@ -823,6 +898,26 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s) return ret; } +/* + * Batched submission is done, ensure local IO is flushed out. + */ +static void io_submit_state_end(struct io_submit_state *state) +{ + blk_finish_plug(&state->plug); + io_file_put(state, NULL); +} + +/* + * Start submission side cache. + */ +static void io_submit_state_start(struct io_submit_state *state, + struct io_ring_ctx *ctx, unsigned max_ios) +{ + blk_start_plug(&state->plug); + state->file = NULL; + state->ios_left = max_ios; +} + static void io_commit_sqring(struct io_ring_ctx *ctx) { struct io_sq_ring *ring = ctx->sq_ring; @@ -869,11 +964,13 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s) static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) { + struct io_submit_state state, *statep = NULL; int i, ret = 0, submit = 0; - struct blk_plug plug; - if (to_submit > IO_PLUG_THRESHOLD) - blk_start_plug(&plug); + if (to_submit > IO_PLUG_THRESHOLD) { + io_submit_state_start(&state, ctx, to_submit); + statep = &state; + } for (i = 0; i < to_submit; i++) { struct sqe_submit s; @@ -881,7 +978,7 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) if (!io_get_sqring(ctx, &s)) break; - ret = io_submit_sqe(ctx, &s); + ret = io_submit_sqe(ctx, &s, statep); if (ret) { io_drop_sqring(ctx); break; @@ -891,8 +988,8 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) } io_commit_sqring(ctx); - if (to_submit > IO_PLUG_THRESHOLD) - blk_finish_plug(&plug); + if (statep) + io_submit_state_end(statep); return submit ? 
submit : ret; }

From patchwork Fri Jan 18 16:12:18 2019
X-Patchwork-Id: 10770801
From: Jens Axboe
To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe
Subject: [PATCH 10/17] io_uring: batch io_kiocb allocation
Date: Fri, 18 Jan 2019 09:12:18 -0700
Message-Id: <20190118161225.4545-11-axboe@kernel.dk>
In-Reply-To: <20190118161225.4545-1-axboe@kernel.dk>
References: <20190118161225.4545-1-axboe@kernel.dk>

Similarly to how we use the state->ios_left to know how many references to get to a file, we can use it to allocate the io_kiocb's we need in bulk.
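The shape of the fast path this adds to io_get_req(), simplified into a standalone sketch (the real code in the diff below hands out reqs[0] inline and keeps slightly different bookkeeping):

/* Simplified picture of the bulk-allocating io_get_req() fast path. */
static struct io_kiocb *get_req_cached(struct io_submit_state *s)
{
	if (!s->free_reqs) {
		size_t sz = min_t(size_t, s->ios_left, ARRAY_SIZE(s->reqs));
		int got = kmem_cache_alloc_bulk(req_cachep,
						GFP_ATOMIC | __GFP_NOWARN,
						sz, s->reqs);
		if (got <= 0)
			return NULL;	/* caller falls back or fails */
		s->free_reqs = got;
		s->cur_req = 0;
	}
	s->free_reqs--;
	return s->reqs[s->cur_req++];	/* one slab call amortized over got reqs */
}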
Signed-off-by: Jens Axboe --- fs/io_uring.c | 39 ++++++++++++++++++++++++++++++++++--- 1 file changed, 36 insertions(+), 3 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 668043c87ddb..666f4cee1a5b 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -135,6 +135,13 @@ struct io_kiocb { struct io_submit_state { struct blk_plug plug; + /* + * io_kiocb alloc cache + */ + void *reqs[IO_IOPOLL_BATCH]; + unsigned int free_reqs; + unsigned int cur_req; + /* * File reference cache */ @@ -258,20 +265,42 @@ static void io_ring_drop_ctx_refs(struct io_ring_ctx *ctx, unsigned refs) wake_up(&ctx->wait); } -static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx) +static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx, + struct io_submit_state *state) { + gfp_t gfp = GFP_ATOMIC | __GFP_NOWARN; struct io_kiocb *req; /* safe to use the non tryget, as we're inside ring ref already */ percpu_ref_get(&ctx->refs); - req = kmem_cache_alloc(req_cachep, GFP_ATOMIC | __GFP_NOWARN); + if (!state) + req = kmem_cache_alloc(req_cachep, gfp); + else if (!state->free_reqs) { + size_t sz; + int ret; + + sz = min_t(size_t, state->ios_left, ARRAY_SIZE(state->reqs)); + ret = kmem_cache_alloc_bulk(req_cachep, gfp, sz, + state->reqs); + if (ret <= 0) + goto out; + state->free_reqs = ret - 1; + state->cur_req = 1; + req = state->reqs[0]; + } else { + req = state->reqs[state->cur_req]; + state->free_reqs--; + state->cur_req++; + } + if (req) { req->ctx = ctx; req->flags = 0; return req; } +out: io_ring_drop_ctx_refs(ctx, 1); return NULL; } @@ -881,7 +910,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s, if (unlikely(s->sqe->flags)) return -EINVAL; - req = io_get_req(ctx); + req = io_get_req(ctx, state); if (unlikely(!req)) return -EAGAIN; @@ -905,6 +934,9 @@ static void io_submit_state_end(struct io_submit_state *state) { blk_finish_plug(&state->plug); io_file_put(state, NULL); + if (state->free_reqs) + kmem_cache_free_bulk(req_cachep, state->free_reqs, + &state->reqs[state->cur_req]); } /* @@ -914,6 +946,7 @@ static void io_submit_state_start(struct io_submit_state *state, struct io_ring_ctx *ctx, unsigned max_ios) { blk_start_plug(&state->plug); + state->free_reqs = 0; state->file = NULL; state->ios_left = max_ios; }

From patchwork Fri Jan 18 16:12:19 2019
X-Patchwork-Id: 10770805
From: Jens Axboe
To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe
Subject: [PATCH 11/17] block: implement bio helper to add iter bvec pages to bio
Date: Fri, 18 Jan 2019 09:12:19 -0700
Message-Id: <20190118161225.4545-12-axboe@kernel.dk>
In-Reply-To: <20190118161225.4545-1-axboe@kernel.dk>
References: <20190118161225.4545-1-axboe@kernel.dk>

For an ITER_BVEC, we can just iterate the iov and add the pages to the bio directly. This requires that the caller doesn't release the pages on IO completion; we add a BIO_HOLD_PAGES flag for that.

The current two callers of bio_iov_iter_get_pages() are updated to check if they need to release pages on completion. This makes them work with bvecs that contain kernel mapped pages already.
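The end-io convention this establishes, condensed from the fs/block_dev.c and fs/iomap.c hunks below into one sketch:

/* The completion-side pattern both callers adopt in this patch. */
static void endio_put_pages(struct bio *bio)
{
	struct bio_vec *bvec;
	int i;

	/* Pages are only ours to put if the bio doesn't hold them. */
	if (!bio_flagged(bio, BIO_HOLD_PAGES))
		bio_for_each_segment_all(bvec, bio, i)
			put_page(bvec->bv_page);
	bio_put(bio);
}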
Signed-off-by: Jens Axboe --- block/bio.c | 59 ++++++++++++++++++++++++++++++++------- fs/block_dev.c | 5 ++-- fs/iomap.c | 5 ++-- include/linux/blk_types.h | 1 + 4 files changed, 56 insertions(+), 14 deletions(-) diff --git a/block/bio.c b/block/bio.c index 4db1008309ed..7af4f45d2ed6 100644 --- a/block/bio.c +++ b/block/bio.c @@ -828,6 +828,23 @@ int bio_add_page(struct bio *bio, struct page *page, } EXPORT_SYMBOL(bio_add_page); +static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter) +{ + const struct bio_vec *bv = iter->bvec; + unsigned int len; + size_t size; + + len = min_t(size_t, bv->bv_len, iter->count); + size = bio_add_page(bio, bv->bv_page, len, + bv->bv_offset + iter->iov_offset); + if (size == len) { + iov_iter_advance(iter, size); + return 0; + } + + return -EINVAL; +} + #define PAGE_PTRS_PER_BVEC (sizeof(struct bio_vec) / sizeof(struct page *)) /** @@ -876,23 +893,43 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter) } /** - * bio_iov_iter_get_pages - pin user or kernel pages and add them to a bio + * bio_iov_iter_get_pages - add user or kernel pages to a bio * @bio: bio to add pages to - * @iter: iov iterator describing the region to be mapped + * @iter: iov iterator describing the region to be added + * + * This takes either an iterator pointing to user memory, or one pointing to + * kernel pages (BVEC iterator). If we're adding user pages, we pin them and + * map them into the kernel. On IO completion, the caller should put those + * pages. If we're adding kernel pages, we just have to add the pages to the + * bio directly. We don't grab an extra reference to those pages (the user + * should already have that), and we don't put the page on IO completion. + * The caller needs to check if the bio is flagged BIO_HOLD_PAGES on IO + * completion. If it isn't, then pages should be released. * - * Pins pages from *iter and appends them to @bio's bvec array. The - * pages will have to be released using put_page() when done. * The function tries, but does not guarantee, to pin as many pages as - * fit into the bio, or are requested in *iter, whatever is smaller. - * If MM encounters an error pinning the requested pages, it stops. - * Error is returned only if 0 pages could be pinned.
+ * fit into the bio, or are requested in *iter, whatever is smaller. If + * MM encounters an error pinning the requested pages, it stops. Error + * is returned only if 0 pages could be pinned. */ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter) { + const bool is_bvec = iov_iter_is_bvec(iter); unsigned short orig_vcnt = bio->bi_vcnt; + /* + * If this is a BVEC iter, then the pages are kernel pages. Don't + * release them on IO completion. + */ + if (is_bvec) + bio_set_flag(bio, BIO_HOLD_PAGES); + do { - int ret = __bio_iov_iter_get_pages(bio, iter); + int ret; + + if (is_bvec) + ret = __bio_iov_bvec_add_pages(bio, iter); + else + ret = __bio_iov_iter_get_pages(bio, iter); if (unlikely(ret)) return bio->bi_vcnt > orig_vcnt ? 0 : ret; @@ -1634,7 +1671,8 @@ static void bio_dirty_fn(struct work_struct *work) next = bio->bi_private; bio_set_pages_dirty(bio); - bio_release_pages(bio); + if (!bio_flagged(bio, BIO_HOLD_PAGES)) + bio_release_pages(bio); bio_put(bio); } } @@ -1650,7 +1688,8 @@ void bio_check_pages_dirty(struct bio *bio) goto defer; } - bio_release_pages(bio); + if (!bio_flagged(bio, BIO_HOLD_PAGES)) + bio_release_pages(bio); bio_put(bio); return; defer: diff --git a/fs/block_dev.c b/fs/block_dev.c index 2ebd2a0d7789..b7742014c9de 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -324,8 +324,9 @@ static void blkdev_bio_end_io(struct bio *bio) struct bio_vec *bvec; int i; - bio_for_each_segment_all(bvec, bio, i) - put_page(bvec->bv_page); + if (!bio_flagged(bio, BIO_HOLD_PAGES)) + bio_for_each_segment_all(bvec, bio, i) + put_page(bvec->bv_page); bio_put(bio); } } diff --git a/fs/iomap.c b/fs/iomap.c index 4ee50b76b4a1..0a64c9c51203 100644 --- a/fs/iomap.c +++ b/fs/iomap.c @@ -1582,8 +1582,9 @@ static void iomap_dio_bio_end_io(struct bio *bio) struct bio_vec *bvec; int i; - bio_for_each_segment_all(bvec, bio, i) - put_page(bvec->bv_page); + if (!bio_flagged(bio, BIO_HOLD_PAGES)) + bio_for_each_segment_all(bvec, bio, i) + put_page(bvec->bv_page); bio_put(bio); } } diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index 5c7e7f859a24..97e206855cd3 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -215,6 +215,7 @@ struct bio { /* * bio flags */ +#define BIO_HOLD_PAGES 0 /* don't put O_DIRECT pages */ #define BIO_SEG_VALID 1 /* bi_phys_segments valid */ #define BIO_CLONED 2 /* doesn't own data */ #define BIO_BOUNCED 3 /* bio is a bounce bio */ From patchwork Fri Jan 18 16:12:20 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jens Axboe X-Patchwork-Id: 10770809 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id DB66D186E for ; Fri, 18 Jan 2019 16:13:04 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id CA6922F432 for ; Fri, 18 Jan 2019 16:13:04 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id BE4822F5C6; Fri, 18 Jan 2019 16:13:04 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id B2C282F562 
From: Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe Subject: [PATCH 12/17] io_uring: add support for pre-mapped user IO buffers Date: Fri, 18 Jan 2019 09:12:20 -0700 Message-Id: <20190118161225.4545-13-axboe@kernel.dk> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190118161225.4545-1-axboe@kernel.dk> References: <20190118161225.4545-1-axboe@kernel.dk>

If we have fixed user buffers, we can map them into the kernel when we set up the io_uring context. That avoids the need to do get_user_pages() for each and every IO. To utilize this feature, the application must call io_uring_register() after having set up an io_uring context, passing in IORING_REGISTER_BUFFERS as the opcode. The argument must be a pointer to an iovec array, and nr_args should contain how many iovecs the application wishes to map. If successful, these buffers are now mapped into the kernel, eligible for IO. To use these fixed buffers, the application must use the IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED opcodes, and then set sqe->buf_index to the desired buffer index.
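To make that concrete, here is a rough userspace sketch (illustration only, not part of the patch): register one 64KB buffer at startup, then read 4KB into the front of it with a fixed-buffer request. The raw syscall() call uses 337, the x86-64 __NR_io_uring_register added by the syscall table hunk below; the sqe pointer is assumed to come from the application's own SQ ring bookkeeping, and error handling is omitted.

    #include <linux/io_uring.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <sys/uio.h>
    #include <unistd.h>

    static char buf[65536];

    void setup_and_prep(int ring_fd, int file_fd, struct io_uring_sqe *sqe)
    {
            struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };

            /* once, at startup: this becomes registered buffer index 0 */
            syscall(337, ring_fd, IORING_REGISTER_BUFFERS, &iov, 1);

            /* per IO: read into the front of the registered buffer */
            memset(sqe, 0, sizeof(*sqe));
            sqe->opcode = IORING_OP_READ_FIXED;
            sqe->fd = file_fd;
            sqe->addr = (unsigned long) buf; /* inside registered buffer 0 */
            sqe->len = 4096;
            sqe->off = 0;
            sqe->buf_index = 0;              /* which registered buffer */
    }
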
sqe->addr..sqe->addr+sqe->len must point to somewhere inside the indexed buffer. The application may register buffers throughout the lifetime of the io_uring context. It can call io_uring_register() with IORING_UNREGISTER_BUFFERS as the opcode to unregister the current set of buffers, and then register a new set. The application need not unregister buffers explicitly before shutting down the io_uring context. It's perfectly valid to set up a larger buffer, and then sometimes only use parts of it for an IO. As long as the range is within the originally mapped region, it will work just fine. For now, buffers must not be file-backed. If file-backed buffers are passed in, the registration will fail with -1/EOPNOTSUPP. This restriction may be relaxed in the future. RLIMIT_MEMLOCK is used to check how much memory we can pin. A somewhat arbitrary 1G per-buffer size limit is also imposed.

Signed-off-by: Jens Axboe --- arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + fs/io_uring.c | 355 ++++++++++++++++++++++++- include/linux/sched/user.h | 2 +- include/linux/syscalls.h | 2 + include/uapi/linux/io_uring.h | 13 +- kernel/sys_ni.c | 1 + 7 files changed, 362 insertions(+), 13 deletions(-) diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 194e79c0032e..7e89016f8118 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -400,3 +400,4 @@ 386 i386 rseq sys_rseq __ia32_sys_rseq 387 i386 io_uring_setup sys_io_uring_setup __ia32_compat_sys_io_uring_setup 388 i386 io_uring_enter sys_io_uring_enter __ia32_sys_io_uring_enter +389 i386 io_uring_register sys_io_uring_register __ia32_sys_io_uring_register diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index 453ff7a79002..8e05d4f05d88 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -345,6 +345,7 @@ 334 common rseq __x64_sys_rseq 335 common io_uring_setup __x64_sys_io_uring_setup 336 common io_uring_enter __x64_sys_io_uring_enter +337 common io_uring_register __x64_sys_io_uring_register # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/fs/io_uring.c b/fs/io_uring.c index 666f4cee1a5b..5fb55784d563 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -25,8 +25,11 @@ #include #include #include +#include #include #include +#include +#include #include #include @@ -57,6 +60,13 @@ struct io_cq_ring { struct io_uring_cqe cqes[]; }; +struct io_mapped_ubuf { + u64 ubuf; + size_t len; + struct bio_vec *bvec; + unsigned int nr_bvecs; +}; + struct io_ring_ctx { struct { struct percpu_ref refs; @@ -90,6 +100,10 @@ struct io_ring_ctx { struct fasync_struct *cq_fasync; } ____cacheline_aligned_in_smp; + /* if used, fixed mapped user buffers */ + unsigned nr_user_bufs; + struct io_mapped_ubuf *user_bufs; + struct user_struct *user; struct completion ctx_done; @@ -664,12 +678,51 @@ static inline void io_rw_done(struct kiocb *kiocb, ssize_t ret) } } +static int io_import_fixed(struct io_ring_ctx *ctx, int rw, + const struct io_uring_sqe *sqe, + struct iov_iter *iter) +{ + struct io_mapped_ubuf *imu; + size_t len = sqe->len; + size_t offset; + int index; + + /* attempt to use fixed buffers without having provided iovecs */ + if (unlikely(!ctx->user_bufs)) + return -EFAULT; + if (unlikely(sqe->buf_index >= ctx->nr_user_bufs)) + return -EFAULT; + + index = array_index_nospec(sqe->buf_index, ctx->nr_user_bufs); + imu =
&ctx->user_bufs[index]; + if ((unsigned long) sqe->addr < imu->ubuf || + (unsigned long) sqe->addr + len > imu->ubuf + imu->len) + return -EFAULT; + + /* + * May not be a start of buffer, set size appropriately + * and advance us to the beginning. + */ + offset = (unsigned long) sqe->addr - imu->ubuf; + iov_iter_bvec(iter, rw, imu->bvec, imu->nr_bvecs, offset + len); + if (offset) + iov_iter_advance(iter, offset); + return 0; +} + static int io_import_iovec(struct io_ring_ctx *ctx, int rw, const struct io_uring_sqe *sqe, struct iovec **iovec, struct iov_iter *iter) { void __user *buf = u64_to_user_ptr(sqe->addr); + if (sqe->opcode == IORING_OP_READ_FIXED || + sqe->opcode == IORING_OP_WRITE_FIXED) { + ssize_t ret = io_import_fixed(ctx, rw, sqe, iter); + *iovec = NULL; + return ret; + } + #ifdef CONFIG_COMPAT if (ctx->compat) return compat_import_iovec(rw, buf, sqe->len, UIO_FASTIOV, @@ -804,7 +857,7 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) return -EINVAL; - if (unlikely(sqe->addr || sqe->ioprio)) + if (unlikely(sqe->addr || sqe->ioprio || sqe->buf_index)) return -EINVAL; if (unlikely(sqe->fsync_flags & ~IORING_FSYNC_DATASYNC)) return -EINVAL; @@ -839,9 +892,19 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, ret = io_nop(req, sqe); break; case IORING_OP_READV: + if (unlikely(sqe->buf_index)) + return -EINVAL; ret = io_read(req, sqe, force_nonblock, state); break; case IORING_OP_WRITEV: + if (unlikely(sqe->buf_index)) + return -EINVAL; + ret = io_write(req, sqe, force_nonblock, state); + break; + case IORING_OP_READ_FIXED: + ret = io_read(req, sqe, force_nonblock, state); + break; + case IORING_OP_WRITE_FIXED: ret = io_write(req, sqe, force_nonblock, state); break; case IORING_OP_FSYNC: @@ -869,8 +932,9 @@ static void io_sq_wq_submit_work(struct work_struct *work) struct io_kiocb *req = container_of(work, struct io_kiocb, work); struct sqe_submit *s = &req->submit; struct io_ring_ctx *ctx = req->ctx; - mm_segment_t old_fs = get_fs(); struct files_struct *old_files; + mm_segment_t old_fs; + bool needs_user; int ret; /* Ensure we clear previously set forced non-block flag */ @@ -879,19 +943,32 @@ static void io_sq_wq_submit_work(struct work_struct *work) old_files = current->files; current->files = ctx->sqo_files; - if (!mmget_not_zero(ctx->sqo_mm)) { - ret = -EFAULT; - goto err; + /* + * If we're doing IO to fixed buffers, we don't need to get/set + * user context + */ + needs_user = true; + if (s->sqe->opcode == IORING_OP_READ_FIXED || + s->sqe->opcode == IORING_OP_WRITE_FIXED) + needs_user = false; + + if (needs_user) { + if (!mmget_not_zero(ctx->sqo_mm)) { + ret = -EFAULT; + goto err; + } + use_mm(ctx->sqo_mm); + old_fs = get_fs(); + set_fs(USER_DS); } - use_mm(ctx->sqo_mm); - set_fs(USER_DS); - ret = __io_submit_sqe(ctx, req, s, false, NULL); - set_fs(old_fs); - unuse_mm(ctx->sqo_mm); - mmput(ctx->sqo_mm); + if (needs_user) { + set_fs(old_fs); + unuse_mm(ctx->sqo_mm); + mmput(ctx->sqo_mm); + } err: if (ret) { io_cqring_add_event(ctx, s->sqe->user_data, ret, 0); @@ -1161,6 +1238,14 @@ static int __io_account_mem(struct user_struct *user, unsigned long nr_pages) return 0; } +static int io_account_mem(struct io_ring_ctx *ctx, unsigned long nr_pages) +{ + if (ctx->user) + return __io_account_mem(ctx->user, nr_pages); + + return 0; +} + static unsigned long ring_pages(unsigned sq_entries, unsigned cq_entries) { struct io_sq_ring *sq_ring; @@ -1174,6 +1259,190 @@ static unsigned 
long ring_pages(unsigned sq_entries, unsigned cq_entries) return (bytes + PAGE_SIZE - 1) / PAGE_SIZE; } +static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx) +{ + int i, j; + + if (!ctx->user_bufs) + return -ENXIO; + + for (i = 0; i < ctx->nr_user_bufs; i++) { + struct io_mapped_ubuf *imu = &ctx->user_bufs[i]; + + for (j = 0; j < imu->nr_bvecs; j++) + put_page(imu->bvec[j].bv_page); + + io_unaccount_mem(ctx, imu->nr_bvecs); + kfree(imu->bvec); + imu->nr_bvecs = 0; + } + + kfree(ctx->user_bufs); + ctx->user_bufs = NULL; + free_uid(ctx->user); + ctx->user = NULL; + return 0; +} + +static int io_copy_iov(struct io_ring_ctx *ctx, struct iovec *dst, + void __user *arg, unsigned index) +{ + struct iovec __user *src; + +#ifdef CONFIG_COMPAT + if (ctx->compat) { + struct compat_iovec __user *ciovs; + struct compat_iovec ciov; + + ciovs = (struct compat_iovec __user *) arg; + if (copy_from_user(&ciov, &ciovs[index], sizeof(ciov))) + return -EFAULT; + + dst->iov_base = (void __user *) (unsigned long) ciov.iov_base; + dst->iov_len = ciov.iov_len; + return 0; + } +#endif + src = (struct iovec __user *) arg; + if (copy_from_user(dst, &src[index], sizeof(*dst))) + return -EFAULT; + return 0; +} + +static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg, + unsigned nr_args) +{ + struct vm_area_struct **vmas = NULL; + struct page **pages = NULL; + int i, j, got_pages = 0; + int ret = -EINVAL; + + if (ctx->user_bufs) + return -EBUSY; + if (!nr_args || nr_args > UIO_MAXIOV) + return -EINVAL; + + ctx->user_bufs = kcalloc(nr_args, sizeof(struct io_mapped_ubuf), + GFP_KERNEL); + if (!ctx->user_bufs) + return -ENOMEM; + + if (!capable(CAP_IPC_LOCK)) + ctx->user = get_uid(current_user()); + + for (i = 0; i < nr_args; i++) { + struct io_mapped_ubuf *imu = &ctx->user_bufs[i]; + unsigned long off, start, end, ubuf; + int pret, nr_pages; + struct iovec iov; + size_t size; + + ret = io_copy_iov(ctx, &iov, arg, i); + if (ret) + break; + + /* + * Don't impose further limits on the size and buffer + * constraints here, we'll -EINVAL later when IO is + * submitted if they are wrong. + */ + ret = -EFAULT; + if (!iov.iov_base) + goto err; + + /* arbitrary limit, but we need something */ + if (iov.iov_len > SZ_1G) + goto err; + + ubuf = (unsigned long) iov.iov_base; + end = (ubuf + iov.iov_len + PAGE_SIZE - 1) >> PAGE_SHIFT; + start = ubuf >> PAGE_SHIFT; + nr_pages = end - start; + + ret = io_account_mem(ctx, nr_pages); + if (ret) + goto err; + + if (!pages || nr_pages > got_pages) { + kfree(vmas); + kfree(pages); + pages = kmalloc_array(nr_pages, sizeof(struct page *), + GFP_KERNEL); + vmas = kmalloc_array(nr_pages, + sizeof(struct vm_area_struct *), + GFP_KERNEL); + if (!pages || !vmas) { + io_unaccount_mem(ctx, nr_pages); + goto err; + } + got_pages = nr_pages; + } + + imu->bvec = kmalloc_array(nr_pages, sizeof(struct bio_vec), + GFP_KERNEL); + if (!imu->bvec) { + io_unaccount_mem(ctx, nr_pages); + goto err; + } + + down_write(&current->mm->mmap_sem); + pret = get_user_pages_longterm(ubuf, nr_pages, FOLL_WRITE, + pages, vmas); + if (pret == nr_pages) { + /* don't support file backed memory */ + for (j = 0; j < nr_pages; j++) { + struct vm_area_struct *vma = vmas[j]; + + if (vma->vm_file) { + ret = -EOPNOTSUPP; + break; + } + } + } else { + ret = pret < 0 ?
pret : -EFAULT; + } + up_write(&current->mm->mmap_sem); + if (ret) { + /* + * if we did partial map, or found file backed vmas, + * release any pages we did get + */ + if (pret > 0) { + for (j = 0; j < pret; j++) + put_page(pages[j]); + } + io_unaccount_mem(ctx, nr_pages); + goto err; + } + + off = ubuf & ~PAGE_MASK; + size = iov.iov_len; + for (j = 0; j < nr_pages; j++) { + size_t vec_len; + + vec_len = min_t(size_t, size, PAGE_SIZE - off); + imu->bvec[j].bv_page = pages[j]; + imu->bvec[j].bv_len = vec_len; + imu->bvec[j].bv_offset = off; + off = 0; + size -= vec_len; + } + /* store original address for later verification */ + imu->ubuf = ubuf; + imu->len = iov.iov_len; + imu->nr_bvecs = nr_pages; + } + kfree(pages); + kfree(vmas); + ctx->nr_user_bufs = nr_args; + return 0; +err: + kfree(pages); + kfree(vmas); + io_sqe_buffer_unregister(ctx); + return ret; +} + static void io_free_scq_urings(struct io_ring_ctx *ctx) { if (ctx->sq_ring) { @@ -1195,6 +1464,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx) io_sq_offload_stop(ctx); io_iopoll_reap_events(ctx); io_free_scq_urings(ctx); + io_sqe_buffer_unregister(ctx); percpu_ref_exit(&ctx->refs); io_unaccount_mem(ctx, ring_pages(ctx->sq_entries, ctx->cq_entries)); kfree(ctx); @@ -1482,6 +1752,69 @@ COMPAT_SYSCALL_DEFINE2(io_uring_setup, u32, entries, } #endif +static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, + void __user *arg, unsigned nr_args) +{ + int ret; + + /* Drop our initial ref and wait for the ctx to be fully idle */ + percpu_ref_put(&ctx->refs); + percpu_ref_kill(&ctx->refs); + wait_for_completion(&ctx->ctx_done); + + switch (opcode) { + case IORING_REGISTER_BUFFERS: + ret = io_sqe_buffer_register(ctx, arg, nr_args); + break; + case IORING_UNREGISTER_BUFFERS: + ret = -EINVAL; + if (arg || nr_args) + break; + ret = io_sqe_buffer_unregister(ctx); + break; + default: + ret = -EINVAL; + break; + } + + /* bring the ctx back to life */ + reinit_completion(&ctx->ctx_done); + percpu_ref_resurrect(&ctx->refs); + percpu_ref_get(&ctx->refs); + return ret; +} + +SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode, + void __user *, arg, unsigned int, nr_args) +{ + struct io_ring_ctx *ctx; + long ret = -EBADF; + struct fd f; + + f = fdget(fd); + if (!f.file) + return -EBADF; + + ret = -EOPNOTSUPP; + if (f.file->f_op != &io_uring_fops) + goto out_fput; + + ret = -ENXIO; + ctx = f.file->private_data; + if (!percpu_ref_tryget(&ctx->refs)) + goto out_fput; + + ret = -EBUSY; + if (mutex_trylock(&ctx->uring_lock)) { + ret = __io_uring_register(ctx, opcode, arg, nr_args); + mutex_unlock(&ctx->uring_lock); + } + io_ring_drop_ctx_refs(ctx, 1); +out_fput: + fdput(f); + return ret; +} + static int __init io_uring_init(void) { req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC); diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h index 39ad98c09c58..c7b5f86b91a1 100644 --- a/include/linux/sched/user.h +++ b/include/linux/sched/user.h @@ -40,7 +40,7 @@ struct user_struct { kuid_t uid; #if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL) || \ - defined(CONFIG_NET) + defined(CONFIG_NET) || defined(CONFIG_IO_URING) atomic_long_t locked_vm; #endif diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 542757a4c898..101f7024d154 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -314,6 +314,8 @@ asmlinkage long sys_io_uring_setup(u32 entries, struct io_uring_params __user *p); asmlinkage long sys_io_uring_enter(unsigned int fd, u32
to_submit, u32 min_complete, u32 flags); +asmlinkage long sys_io_uring_register(unsigned int fd, unsigned int op, + void __user *arg, unsigned int nr_args); /* fs/xattr.c */ asmlinkage long sys_setxattr(const char __user *path, const char __user *name, diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 4fc5fbd07688..03ce7133c3b2 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -29,7 +29,10 @@ struct io_uring_sqe { __u32 fsync_flags; }; __u64 user_data; /* data to be passed back at completion time */ - __u64 __pad2[3]; + union { + __u16 buf_index; /* index into fixed buffers, if used */ + __u64 __pad2[3]; + }; }; /* @@ -41,6 +44,8 @@ struct io_uring_sqe { #define IORING_OP_READV 1 #define IORING_OP_WRITEV 2 #define IORING_OP_FSYNC 3 +#define IORING_OP_READ_FIXED 4 +#define IORING_OP_WRITE_FIXED 5 /* * sqe->fsync_flags @@ -104,4 +109,10 @@ struct io_uring_params { struct io_cqring_offsets cq_off; }; +/* + * io_uring_register(2) opcodes and arguments + */ +#define IORING_REGISTER_BUFFERS 0 +#define IORING_UNREGISTER_BUFFERS 1 + #endif diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index d754811ec780..38567718c397 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -49,6 +49,7 @@ COND_SYSCALL_COMPAT(io_pgetevents); COND_SYSCALL(io_uring_setup); COND_SYSCALL_COMPAT(io_uring_setup); COND_SYSCALL(io_uring_enter); +COND_SYSCALL(io_uring_register); /* fs/xattr.c */

From patchwork Fri Jan 18 16:12:21 2019
From: Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe Subject: [PATCH 13/17] io_uring: add file set registration Date: Fri, 18 Jan 2019 09:12:21 -0700 Message-Id: <20190118161225.4545-14-axboe@kernel.dk> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190118161225.4545-1-axboe@kernel.dk> References: <20190118161225.4545-1-axboe@kernel.dk>

We normally have to fget/fput for each IO we do on a file. Even with the batching we do, the cost of the atomic inc/dec of the file usage count adds up. This adds IORING_REGISTER_FILES and IORING_UNREGISTER_FILES opcodes for the io_uring_register(2) system call. The arguments passed in must be an array of __s32 holding file descriptors, and nr_args should hold the number of file descriptors the application wishes to pin for the duration of the io_uring context (or until IORING_UNREGISTER_FILES is called). When used, the application must set IOSQE_FIXED_FILE in the sqe->flags member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd to the index in the array passed in to IORING_REGISTER_FILES. Files are automatically unregistered when the io_uring context is torn down. An application need only unregister explicitly if it wishes to register a new set of fds.
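As a purely illustrative sketch of that flow (not part of the patch; sock_fd, file_fd, ring_fd, sqe and iov are assumed to already exist, and error handling is omitted):

    /* 337 == __NR_io_uring_register on x86-64, added earlier in
     * this series */
    __s32 fds[2] = { sock_fd, file_fd };
    syscall(337, ring_fd, IORING_REGISTER_FILES, fds, 2);

    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = IORING_OP_READV;
    sqe->flags = IOSQE_FIXED_FILE;
    sqe->fd = 1;                       /* index into fds[], not a raw fd */
    sqe->addr = (unsigned long) &iov;
    sqe->len = 1;                      /* number of iovecs */

With files registered, the per-IO fget/fput goes away; the request references the pinned struct file by index instead.
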
Signed-off-by: Jens Axboe --- fs/io_uring.c | 125 +++++++++++++++++++++++++++++----- include/uapi/linux/io_uring.h | 9 ++- 2 files changed, 116 insertions(+), 18 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 5fb55784d563..6aaa0bf3648c 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -100,6 +100,10 @@ struct io_ring_ctx { struct fasync_struct *cq_fasync; } ____cacheline_aligned_in_smp; + /* if used, fixed file set */ + struct file **user_files; + unsigned nr_user_files; + /* if used, fixed mapped user buffers */ unsigned nr_user_bufs; struct io_mapped_ubuf *user_bufs; @@ -137,6 +141,7 @@ struct io_kiocb { #define REQ_F_FORCE_NONBLOCK 1 /* inline submission attempt */ #define REQ_F_IOPOLL_COMPLETED 2 /* polled IO has completed */ #define REQ_F_IOPOLL_EAGAIN 4 /* submission got EAGAIN */ +#define REQ_F_FIXED_FILE 8 /* ctx owns file */ u64 user_data; u64 res; @@ -359,15 +364,17 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events, * Batched puts of the same file, to avoid dirtying the * file usage count multiple times, if avoidable. */ - if (!file) { - file = req->rw.ki_filp; - file_count = 1; - } else if (file == req->rw.ki_filp) { - file_count++; - } else { - fput_many(file, file_count); - file = req->rw.ki_filp; - file_count = 1; + if (!(req->flags & REQ_F_FIXED_FILE)) { + if (!file) { + file = req->rw.ki_filp; + file_count = 1; + } else if (file == req->rw.ki_filp) { + file_count++; + } else { + fput_many(file, file_count); + file = req->rw.ki_filp; + file_count = 1; + } } if (to_free == ARRAY_SIZE(reqs)) @@ -504,13 +511,19 @@ static void kiocb_end_write(struct kiocb *kiocb) } } +static void io_fput(struct io_kiocb *req) +{ + if (!(req->flags & REQ_F_FIXED_FILE)) + fput(req->rw.ki_filp); +} + static void io_complete_rw(struct kiocb *kiocb, long res, long res2) { struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw); kiocb_end_write(kiocb); - fput(kiocb->ki_filp); + io_fput(req); io_cqring_add_event(req->ctx, req->user_data, res, 0); io_free_req(req); } @@ -614,7 +627,14 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, struct kiocb *kiocb = &req->rw; int ret; - kiocb->ki_filp = io_file_get(state, sqe->fd); + if (sqe->flags & IOSQE_FIXED_FILE) { + if (unlikely(!ctx->user_files || sqe->fd >= ctx->nr_user_files)) + return -EBADF; + kiocb->ki_filp = ctx->user_files[sqe->fd]; + req->flags |= REQ_F_FIXED_FILE; + } else { + kiocb->ki_filp = io_file_get(state, sqe->fd); + } if (unlikely(!kiocb->ki_filp)) return -EBADF; kiocb->ki_pos = sqe->off; @@ -653,7 +673,8 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, } return 0; out_fput: - io_file_put(state, kiocb->ki_filp); + if (!(sqe->flags & IOSQE_FIXED_FILE)) + io_file_put(state, kiocb->ki_filp); return ret; } @@ -770,7 +791,7 @@ static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe, kfree(iovec); out_fput: if (unlikely(ret)) - fput(file); + io_fput(req); return ret; } @@ -824,7 +845,7 @@ static ssize_t io_write(struct io_kiocb *req, const struct io_uring_sqe *sqe, } out_fput: if (unlikely(ret)) - fput(file); + io_fput(req); return ret; } @@ -862,14 +883,23 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe, if (unlikely(sqe->fsync_flags & ~IORING_FSYNC_DATASYNC)) return -EINVAL; - file = fget(sqe->fd); + if (sqe->flags & IOSQE_FIXED_FILE) { + if (unlikely(!ctx->user_files || sqe->fd >= ctx->nr_user_files)) + return -EBADF; + file = ctx->user_files[sqe->fd]; + } else { + file = 
fget(sqe->fd); + } + if (unlikely(!file)) return -EBADF; ret = vfs_fsync_range(file, sqe->off, end > 0 ? end : LLONG_MAX, sqe->fsync_flags & IORING_FSYNC_DATASYNC); - fput(file); + if (!(sqe->flags & IOSQE_FIXED_FILE)) + fput(file); + io_cqring_add_event(ctx, sqe->user_data, ret, 0); io_free_req(req); return 0; @@ -984,7 +1014,7 @@ static int io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s, ssize_t ret; /* enforce forwards compatibility on users */ - if (unlikely(s->sqe->flags)) + if (unlikely(s->sqe->flags & ~IOSQE_FIXED_FILE)) return -EINVAL; req = io_get_req(ctx, state); @@ -1169,6 +1199,57 @@ static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit, return ret; } +static int io_sqe_files_unregister(struct io_ring_ctx *ctx) +{ + int i; + + if (!ctx->user_files) + return -ENXIO; + + for (i = 0; i < ctx->nr_user_files; i++) + fput(ctx->user_files[i]); + + kfree(ctx->user_files); + ctx->user_files = NULL; + ctx->nr_user_files = 0; + return 0; +} + +static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg, + unsigned nr_args) +{ + __s32 __user *fds = (__s32 __user *) arg; + int fd, i, ret = 0; + + if (ctx->user_files) + return -EBUSY; + if (!nr_args) + return -EINVAL; + + ctx->user_files = kcalloc(nr_args, sizeof(struct file *), GFP_KERNEL); + if (!ctx->user_files) + return -ENOMEM; + + for (i = 0; i < nr_args; i++) { + ret = -EFAULT; + if (copy_from_user(&fd, &fds[i], sizeof(fd))) + break; + + ctx->user_files[i] = fget(fd); + + ret = -EBADF; + if (!ctx->user_files[i]) + break; + ctx->nr_user_files++; + ret = 0; + } + + if (ret) + io_sqe_files_unregister(ctx); + + return ret; +} + static int io_sq_offload_start(struct io_ring_ctx *ctx) { int ret; @@ -1464,6 +1545,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx) io_sq_offload_stop(ctx); io_iopoll_reap_events(ctx); io_free_scq_urings(ctx); + io_sqe_files_unregister(ctx); io_sqe_buffer_unregister(ctx); percpu_ref_exit(&ctx->refs); io_unaccount_mem(ctx, ring_pages(ctx->sq_entries, ctx->cq_entries)); @@ -1772,6 +1854,15 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, break; ret = io_sqe_buffer_unregister(ctx); break; + case IORING_REGISTER_FILES: + ret = io_sqe_files_register(ctx, arg, nr_args); + break; + case IORING_UNREGISTER_FILES: + ret = -EINVAL; + if (arg || nr_args) + break; + ret = io_sqe_files_unregister(ctx); + break; default: ret = -EINVAL; break; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 03ce7133c3b2..8323320077ec 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -18,7 +18,7 @@ */ struct io_uring_sqe { __u8 opcode; /* type of operation for this sqe */ - __u8 flags; /* as of now unused */ + __u8 flags; /* IOSQE_ flags */ __u16 ioprio; /* ioprio for the request */ __s32 fd; /* file descriptor to do IO on */ __u64 off; /* offset into file */ @@ -35,6 +35,11 @@ struct io_uring_sqe { }; }; +/* + * sqe->flags + */ +#define IOSQE_FIXED_FILE (1 << 0) /* use fixed fileset */ + /* * io_uring_setup() flags */ @@ -114,5 +119,7 @@ struct io_uring_params { */ #define IORING_REGISTER_BUFFERS 0 #define IORING_UNREGISTER_BUFFERS 1 +#define IORING_REGISTER_FILES 2 +#define IORING_UNREGISTER_FILES 3 #endif From patchwork Fri Jan 18 16:12:22 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jens Axboe X-Patchwork-Id: 10770817 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by 
From: Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe Subject: [PATCH 14/17] io_uring: add submission polling Date: Fri, 18 Jan 2019 09:12:22 -0700 Message-Id: <20190118161225.4545-15-axboe@kernel.dk> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190118161225.4545-1-axboe@kernel.dk> References: <20190118161225.4545-1-axboe@kernel.dk>

This enables an application to do IO without ever entering the kernel. By using the SQ ring to fill in new sqes and watching for completions on the CQ ring, we can submit and reap IOs without doing a single system call. The kernel-side thread will poll for new submissions, and in case of HIPRI/polled IO, it'll also poll for completions. Proof of concept. If the thread has been idle for 1 second, it will set sq_ring->flags |= IORING_SQ_NEED_WAKEUP. The application will have to call io_uring_enter() to start things back up again. If IO is kept busy, that will never be needed. Basically an application that has this feature enabled will guard its io_uring_enter(2) call with:

    read_barrier();
    if (*sq_ring->flags & IORING_SQ_NEED_WAKEUP)
            io_uring_enter(fd, to_submit, 0, 0);

instead of calling it unconditionally. Improvements: 1) Maybe have smarter backoff. Busy loop for X time, then go to monitor/mwait, finally the schedule we have now after an idle second. Might not be worth the complexity. 2) Probably want the application to pass in the appropriate grace period, not hard-code it at 1 second.
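For reference, a minimal setup sketch (illustrative only; 335 is __NR_io_uring_setup on x86-64 per this series, and the subsequent ring mmaps are omitted) requesting a polled SQ thread pinned to CPU 0:

    struct io_uring_params p;

    memset(&p, 0, sizeof(p));
    p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF;
    p.sq_thread_cpu = 0;                 /* honored because SQ_AFF is set */

    int ring_fd = (int) syscall(335, 256, &p);   /* 256 sq entries */
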
Signed-off-by: Jens Axboe --- fs/io_uring.c | 220 +++++++++++++++++++++++++++++++++- include/uapi/linux/io_uring.h | 10 +- 2 files changed, 223 insertions(+), 7 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 6aaa0bf3648c..cdd9873edfe3 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -24,6 +24,7 @@ #include #include #include +#include #include #include #include @@ -87,8 +88,10 @@ struct io_ring_ctx { /* IO offload */ struct workqueue_struct *sqo_wq; + struct task_struct *sqo_thread; /* if using sq thread polling */ struct mm_struct *sqo_mm; struct files_struct *sqo_files; + wait_queue_head_t sqo_wait; struct { /* CQ ring */ @@ -264,6 +267,9 @@ static void __io_cqring_add_event(struct io_ring_ctx *ctx, u64 ki_user_data, if (waitqueue_active(&ctx->wait)) wake_up(&ctx->wait); + if ((ctx->flags & IORING_SETUP_SQPOLL) && + waitqueue_active(&ctx->sqo_wait)) + wake_up(&ctx->sqo_wait); } static void io_cqring_add_event(struct io_ring_ctx *ctx, u64 ki_user_data, @@ -1102,6 +1108,169 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s) return false; } +static int io_submit_sqes(struct io_ring_ctx *ctx, struct sqe_submit *sqes, + unsigned int nr, bool mm_fault) +{ + struct io_submit_state state, *statep = NULL; + int ret, i, submitted = 0; + + if (nr > IO_PLUG_THRESHOLD) { + io_submit_state_start(&state, ctx, nr); + statep = &state; + } + + for (i = 0; i < nr; i++) { + if (unlikely(mm_fault)) + ret = -EFAULT; + else + ret = io_submit_sqe(ctx, &sqes[i], statep); + if (!ret) { + submitted++; + continue; + } + + io_cqring_add_event(ctx, sqes[i].sqe->user_data, ret, 0); + } + + if (statep) + io_submit_state_end(&state); + + return submitted; +} + +static int io_sq_thread(void *data) +{ + struct sqe_submit sqes[IO_IOPOLL_BATCH]; + struct io_ring_ctx *ctx = data; + struct mm_struct *cur_mm = NULL; + struct files_struct *old_files; + mm_segment_t old_fs; + DEFINE_WAIT(wait); + unsigned inflight; + unsigned long timeout; + + old_files = current->files; + current->files = ctx->sqo_files; + + old_fs = get_fs(); + set_fs(USER_DS); + + timeout = inflight = 0; + while (!kthread_should_stop()) { + bool all_fixed, mm_fault = false; + int i; + + if (inflight) { + unsigned int nr_events = 0; + + /* + * Normal IO, just pretend everything completed. + * We don't have to poll completions for that. + */ + if (ctx->flags & IORING_SETUP_IOPOLL) { + /* + * App should not use IORING_ENTER_GETEVENTS + * with thread polling, but if it does, then + * ensure we are mutually exclusive. + */ + if (mutex_trylock(&ctx->uring_lock)) { + io_iopoll_check(ctx, &nr_events, 0); + mutex_unlock(&ctx->uring_lock); + } + } else { + nr_events = inflight; + } + + inflight -= nr_events; + if (!inflight) + timeout = jiffies + HZ; + } + + if (!io_get_sqring(ctx, &sqes[0])) { + /* + * We're polling, let us spin for a second without + * work before going to sleep. + */ + if (inflight || !time_after(jiffies, timeout)) { + cpu_relax(); + continue; + } + + /* + * Drop cur_mm before scheduling, we can't hold it for + * long periods (or over schedule()). Do this before + * adding ourselves to the waitqueue, as the unuse/drop + * may sleep. 
+ */ + if (cur_mm) { + unuse_mm(cur_mm); + mmput(cur_mm); + cur_mm = NULL; + } + + prepare_to_wait(&ctx->sqo_wait, &wait, + TASK_INTERRUPTIBLE); + + /* Tell userspace we may need a wakeup call */ + ctx->sq_ring->flags |= IORING_SQ_NEED_WAKEUP; + smp_wmb(); + + if (!io_get_sqring(ctx, &sqes[0])) { + if (kthread_should_park()) + kthread_parkme(); + if (kthread_should_stop()) { + finish_wait(&ctx->sqo_wait, &wait); + break; + } + if (signal_pending(current)) + flush_signals(current); + schedule(); + finish_wait(&ctx->sqo_wait, &wait); + + ctx->sq_ring->flags &= ~IORING_SQ_NEED_WAKEUP; + smp_wmb(); + continue; + } + finish_wait(&ctx->sqo_wait, &wait); + + ctx->sq_ring->flags &= ~IORING_SQ_NEED_WAKEUP; + smp_wmb(); + } + + i = 0; + all_fixed = true; + do { + if (sqes[i].sqe->opcode != IORING_OP_READ_FIXED && + sqes[i].sqe->opcode != IORING_OP_WRITE_FIXED) + all_fixed = false; + + i++; + if (i == ARRAY_SIZE(sqes)) + break; + } while (io_get_sqring(ctx, &sqes[i])); + + io_commit_sqring(ctx); + + /* Unless all new commands are FIXED regions, grab mm */ + if (!all_fixed && !cur_mm) { + mm_fault = !mmget_not_zero(ctx->sqo_mm); + if (!mm_fault) { + use_mm(ctx->sqo_mm); + cur_mm = ctx->sqo_mm; + } + } + + inflight += io_submit_sqes(ctx, sqes, i, mm_fault); + } + current->files = old_files; + set_fs(old_fs); + if (cur_mm) { + unuse_mm(cur_mm); + mmput(cur_mm); + } + return 0; +} + static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit) { struct io_submit_state state, *statep = NULL; @@ -1175,9 +1344,14 @@ static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit, int ret = 0; if (to_submit) { - ret = io_ring_submit(ctx, to_submit); - if (ret < 0) - return ret; + if (ctx->flags & IORING_SETUP_SQPOLL) { + wake_up(&ctx->sqo_wait); + ret = to_submit; + } else { + ret = io_ring_submit(ctx, to_submit); + if (ret < 0) + return ret; + } } if (flags & IORING_ENTER_GETEVENTS) { unsigned nr_events = 0; @@ -1250,10 +1424,12 @@ static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg, return ret; } -static int io_sq_offload_start(struct io_ring_ctx *ctx) +static int io_sq_offload_start(struct io_ring_ctx *ctx, + struct io_uring_params *p) { int ret; + init_waitqueue_head(&ctx->sqo_wait); ctx->sqo_mm = current->mm; /* @@ -1266,6 +1442,27 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx) if (!ctx->sqo_files) goto err; + if (ctx->flags & IORING_SETUP_SQPOLL) { + if (p->flags & IORING_SETUP_SQ_AFF) { + ctx->sqo_thread = kthread_create_on_cpu(io_sq_thread, + ctx, p->sq_thread_cpu, + "io_uring-sq"); + } else { + ctx->sqo_thread = kthread_create(io_sq_thread, ctx, + "io_uring-sq"); + } + if (IS_ERR(ctx->sqo_thread)) { + ret = PTR_ERR(ctx->sqo_thread); + ctx->sqo_thread = NULL; + goto err; + } + wake_up_process(ctx->sqo_thread); + } else if (p->flags & IORING_SETUP_SQ_AFF) { + /* Can't have SQ_AFF without SQPOLL */ + ret = -EINVAL; + goto err; + } + /* Do QD, or 2 * CPUS, whatever is smallest */ ctx->sqo_wq = alloc_workqueue("io_ring-wq", WQ_UNBOUND | WQ_FREEZABLE, min(ctx->sq_entries - 1, 2 * num_online_cpus())); @@ -1276,6 +1473,11 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx) return 0; err: + if (ctx->sqo_thread) { + kthread_park(ctx->sqo_thread); + kthread_stop(ctx->sqo_thread); + ctx->sqo_thread = NULL; + } if (ctx->sqo_files) ctx->sqo_files = NULL; ctx->sqo_mm = NULL; @@ -1284,6 +1486,11 @@ static int io_sq_offload_start(struct io_ring_ctx *ctx) static void io_sq_offload_stop(struct io_ring_ctx *ctx) { + if (ctx->sqo_thread) { + 
kthread_park(ctx->sqo_thread); + kthread_stop(ctx->sqo_thread); + ctx->sqo_thread = NULL; + } if (ctx->sqo_wq) { destroy_workqueue(ctx->sqo_wq); ctx->sqo_wq = NULL; @@ -1772,7 +1979,7 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p, if (ret) goto err; - ret = io_sq_offload_start(ctx); + ret = io_sq_offload_start(ctx, p); if (ret) goto err; @@ -1807,7 +2014,8 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params, return -EINVAL; } - if (p.flags & ~IORING_SETUP_IOPOLL) + if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_SQPOLL | + IORING_SETUP_SQ_AFF)) return -EINVAL; ret = io_uring_create(entries, &p, compat); diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 8323320077ec..37c7402be9ca 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -44,6 +44,8 @@ struct io_uring_sqe { * io_uring_setup() flags */ #define IORING_SETUP_IOPOLL (1 << 0) /* io_context is polled */ +#define IORING_SETUP_SQPOLL (1 << 1) /* SQ poll thread */ +#define IORING_SETUP_SQ_AFF (1 << 2) /* sq_thread_cpu is valid */ #define IORING_OP_NOP 0 #define IORING_OP_READV 1 @@ -87,6 +89,11 @@ struct io_sqring_offsets { __u32 resv[3]; }; +/* + * sq_ring->flags + */ +#define IORING_SQ_NEED_WAKEUP (1 << 0) /* needs io_uring_enter wakeup */ + struct io_cqring_offsets { __u32 head; __u32 tail; @@ -109,7 +116,8 @@ struct io_uring_params { __u32 sq_entries; __u32 cq_entries; __u32 flags; - __u16 resv[10]; + __u16 sq_thread_cpu; + __u16 resv[9]; struct io_sqring_offsets sq_off; struct io_cqring_offsets cq_off; }; From patchwork Fri Jan 18 16:12:23 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jens Axboe X-Patchwork-Id: 10770821 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 693B213B4 for ; Fri, 18 Jan 2019 16:13:10 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 58DC82F432 for ; Fri, 18 Jan 2019 16:13:10 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 4D94C2F571; Fri, 18 Jan 2019 16:13:10 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 046F92F432 for ; Fri, 18 Jan 2019 16:13:10 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728110AbfARQNJ (ORCPT ); Fri, 18 Jan 2019 11:13:09 -0500 Received: from mail-pg1-f194.google.com ([209.85.215.194]:37963 "EHLO mail-pg1-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727587AbfARQNJ (ORCPT ); Fri, 18 Jan 2019 11:13:09 -0500 Received: by mail-pg1-f194.google.com with SMTP id g189so6260430pgc.5 for ; Fri, 18 Jan 2019 08:13:08 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=DTlkF97vGefqpS0bMWv746R15dMGRBYhpLm7YzqPk8E=; b=Vpd+bVRIeZu+98XKdNR5oAaJQdU3jBp1Sxi8qU7IQAPY7TIokkpLgrKpmwc3uy6iH1 
From: Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe Subject: [PATCH 15/17] io_uring: add io_kiocb ref count Date: Fri, 18 Jan 2019 09:12:23 -0700 Message-Id: <20190118161225.4545-16-axboe@kernel.dk> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190118161225.4545-1-axboe@kernel.dk> References: <20190118161225.4545-1-axboe@kernel.dk>

We'll use this for the POLL implementation. Regular requests will NOT be using references, so initialize it to 0. Any real use of the io_kiocb ref will initialize it to at least 2.
Signed-off-by: Jens Axboe --- fs/io_uring.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index cdd9873edfe3..4f13b3371156 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -141,6 +141,7 @@ struct io_kiocb { struct io_ring_ctx *ctx; struct list_head list; unsigned int flags; + refcount_t refs; #define REQ_F_FORCE_NONBLOCK 1 /* inline submission attempt */ #define REQ_F_IOPOLL_COMPLETED 2 /* polled IO has completed */ #define REQ_F_IOPOLL_EAGAIN 4 /* submission got EAGAIN */ @@ -322,6 +323,7 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx, if (req) { req->ctx = ctx; req->flags = 0; + refcount_set(&req->refs, 0); return req; } @@ -341,8 +343,10 @@ static void io_free_req_many(struct io_ring_ctx *ctx, void **reqs, int *nr) static void io_free_req(struct io_kiocb *req) { - io_ring_drop_ctx_refs(req->ctx, 1); - kmem_cache_free(req_cachep, req); + if (!refcount_read(&req->refs) || refcount_dec_and_test(&req->refs)) { + io_ring_drop_ctx_refs(req->ctx, 1); + kmem_cache_free(req_cachep, req); + } } /*

From patchwork Fri Jan 18 16:12:24 2019
From: Jens Axboe To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe Subject: [PATCH 16/17] io_uring: add support for IORING_OP_POLL Date: Fri, 18 Jan 2019 09:12:24 -0700 Message-Id: <20190118161225.4545-17-axboe@kernel.dk> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190118161225.4545-1-axboe@kernel.dk> References: <20190118161225.4545-1-axboe@kernel.dk>

This is basically a direct port of bfe4037e722e, which implements a one-shot poll command through aio. Description below is based on that commit as well. However, instead of adding a POLL command and relying on io_cancel(2) to remove it, we mimic the epoll(2) interface of having a command to add a poll notification, IORING_OP_POLL_ADD, and one to remove it again, IORING_OP_POLL_REMOVE. To poll for a file descriptor the application should submit an sqe of type IORING_OP_POLL_ADD. It will poll the fd for the events specified in the poll_events field. Unlike poll or epoll without EPOLLONESHOT, this interface always works in one-shot mode; that is, once the sqe is completed, it will have to be resubmitted.
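A sketch of arming one such poll (illustrative only, reusing the assumed sqe/sock_fd variables from the earlier sketches):

    #include <poll.h>

    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = IORING_OP_POLL_ADD;
    sqe->fd = sock_fd;
    sqe->poll_events = POLLIN;
    sqe->user_data = 0x1234;   /* an IORING_OP_POLL_REMOVE sqe cancels
                                * this by setting sqe->addr to 0x1234 */

Once the completion for this sqe is posted, the poll is disarmed and must be submitted again to keep watching the fd.
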
Based-on-code-from: Christoph Hellwig
Signed-off-by: Jens Axboe
---
 fs/io_uring.c                 | 245 ++++++++++++++++++++++++++++++++++
 include/uapi/linux/io_uring.h |   3 +
 2 files changed, 248 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 4f13b3371156..4709a19d692b 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -124,6 +124,7 @@ struct io_ring_ctx {
 		spinlock_t completion_lock;
 		struct list_head poll_list;
 		unsigned poll_multi_file;
+		struct list_head cancel_list;
 	} ____cacheline_aligned_in_smp;
 };

@@ -132,9 +133,19 @@ struct sqe_submit {
 	unsigned index;
 };

+struct io_poll_iocb {
+	struct file *file;
+	struct wait_queue_head *head;
+	__poll_t events;
+	bool woken;
+	bool canceled;
+	struct wait_queue_entry wait;
+};
+
 struct io_kiocb {
 	union {
 		struct kiocb rw;
+		struct io_poll_iocb poll;
 		struct sqe_submit submit;
 	};

@@ -206,6 +217,7 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	init_waitqueue_head(&ctx->wait);
 	spin_lock_init(&ctx->completion_lock);
 	INIT_LIST_HEAD(&ctx->poll_list);
+	INIT_LIST_HEAD(&ctx->cancel_list);
 	return ctx;
 }

@@ -915,6 +927,232 @@ static int io_fsync(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	return 0;
 }

+static void io_poll_remove_one(struct io_kiocb *req)
+{
+	struct io_poll_iocb *poll = &req->poll;
+
+	spin_lock(&poll->head->lock);
+	WRITE_ONCE(poll->canceled, true);
+	if (!list_empty(&poll->wait.entry)) {
+		list_del_init(&poll->wait.entry);
+		queue_work(req->ctx->sqo_wq, &req->work);
+	}
+	spin_unlock(&poll->head->lock);
+
+	list_del_init(&req->list);
+}
+
+static void io_poll_remove_all(struct io_ring_ctx *ctx)
+{
+	struct io_kiocb *req;
+
+	spin_lock_irq(&ctx->completion_lock);
+	while (!list_empty(&ctx->cancel_list)) {
+		req = list_first_entry(&ctx->cancel_list, struct io_kiocb, list);
+		io_poll_remove_one(req);
+	}
+	spin_unlock_irq(&ctx->completion_lock);
+}
+
+/*
+ * Find a running poll command that matches one specified in sqe->addr,
+ * and remove it if found.
+ */
+static int io_poll_remove(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+	struct io_kiocb *poll_req, *next;
+	int ret = -ENOENT;
+
+	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
+		return -EINVAL;
+	if (sqe->ioprio || sqe->off || sqe->len || sqe->buf_index ||
+	    sqe->poll_events)
+		return -EINVAL;
+
+	spin_lock_irq(&ctx->completion_lock);
+	list_for_each_entry_safe(poll_req, next, &ctx->cancel_list, list) {
+		if (sqe->addr == poll_req->user_data) {
+			io_poll_remove_one(poll_req);
+			ret = 0;
+			break;
+		}
+	}
+	spin_unlock_irq(&ctx->completion_lock);
+
+	io_cqring_add_event(req->ctx, sqe->user_data, ret, 0);
+	io_free_req(req);
+	return 0;
+}
+
+static void io_poll_complete(struct io_kiocb *req, __poll_t mask)
+{
+	io_cqring_add_event(req->ctx, req->user_data, mangle_poll(mask), 0);
+	io_fput(req);
+	io_free_req(req);
+}
+
+static void io_poll_complete_work(struct work_struct *work)
+{
+	struct io_kiocb *req = container_of(work, struct io_kiocb, work);
+	struct io_poll_iocb *poll = &req->poll;
+	struct poll_table_struct pt = { ._key = poll->events };
+	struct io_ring_ctx *ctx = req->ctx;
+	__poll_t mask = 0;
+
+	if (!READ_ONCE(poll->canceled))
+		mask = vfs_poll(poll->file, &pt) & poll->events;
+
+	/*
+	 * Note that ->ki_cancel callers also delete iocb from active_reqs after
+	 * calling ->ki_cancel. We need the ctx_lock roundtrip here to
+	 * synchronize with them. In the cancellation case the list_del_init
+	 * itself is not actually needed, but harmless so we keep it in to
+	 * avoid further branches in the fast path.
+	 */
+	spin_lock_irq(&ctx->completion_lock);
+	if (!mask && !READ_ONCE(poll->canceled)) {
+		add_wait_queue(poll->head, &poll->wait);
+		spin_unlock_irq(&ctx->completion_lock);
+		return;
+	}
+	list_del_init(&req->list);
+	spin_unlock_irq(&ctx->completion_lock);
+
+	io_poll_complete(req, mask);
+}
+
+static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
+			void *key)
+{
+	struct io_poll_iocb *poll = container_of(wait, struct io_poll_iocb,
+						 wait);
+	struct io_kiocb *req = container_of(poll, struct io_kiocb, poll);
+	struct io_ring_ctx *ctx = req->ctx;
+	__poll_t mask = key_to_poll(key);
+
+	poll->woken = true;
+
+	/* for instances that support it check for an event match first: */
+	if (mask) {
+		if (!(mask & poll->events))
+			return 0;
+
+		/* try to complete the iocb inline if we can: */
+		if (spin_trylock(&ctx->completion_lock)) {
+			list_del(&req->list);
+			spin_unlock(&ctx->completion_lock);
+
+			list_del_init(&poll->wait.entry);
+			io_poll_complete(req, mask);
+			return 1;
+		}
+	}
+
+	list_del_init(&poll->wait.entry);
+	queue_work(ctx->sqo_wq, &req->work);
+	return 1;
+}
+
+struct io_poll_table {
+	struct poll_table_struct pt;
+	struct io_kiocb *req;
+	int error;
+};
+
+static void io_poll_queue_proc(struct file *file, struct wait_queue_head *head,
+			       struct poll_table_struct *p)
+{
+	struct io_poll_table *pt = container_of(p, struct io_poll_table, pt);
+
+	if (unlikely(pt->req->poll.head)) {
+		pt->error = -EINVAL;
+		return;
+	}
+
+	pt->error = 0;
+	pt->req->poll.head = head;
+	add_wait_queue(head, &pt->req->poll.wait);
+}
+
+static int io_poll_add(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_poll_iocb *poll = &req->poll;
+	struct io_ring_ctx *ctx = req->ctx;
+	struct io_poll_table ipt;
+	__poll_t mask;
+
+	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
+		return -EINVAL;
+	if (sqe->addr || sqe->ioprio || sqe->off || sqe->len || sqe->buf_index)
+		return -EINVAL;
+
+	INIT_WORK(&req->work, io_poll_complete_work);
+	poll->events = demangle_poll(sqe->poll_events) | EPOLLERR | EPOLLHUP;
+
+	if (sqe->flags & IOSQE_FIXED_FILE) {
+		if (unlikely(!ctx->user_files || sqe->fd >= ctx->nr_user_files))
+			return -EBADF;
+		poll->file = ctx->user_files[sqe->fd];
+		req->flags |= REQ_F_FIXED_FILE;
+	} else {
+		poll->file = fget(sqe->fd);
+	}
+	if (unlikely(!poll->file))
+		return -EBADF;
+
+	poll->head = NULL;
+	poll->woken = false;
+	poll->canceled = false;
+
+	ipt.pt._qproc = io_poll_queue_proc;
+	ipt.pt._key = poll->events;
+	ipt.req = req;
+	ipt.error = -EINVAL; /* same as no support for IOCB_CMD_POLL */
+
+	/* initialize the list so that we can do list_empty checks */
+	INIT_LIST_HEAD(&poll->wait.entry);
+	init_waitqueue_func_entry(&poll->wait, io_poll_wake);
+
+	/* one for removal from waitqueue, one for this function */
+	refcount_set(&req->refs, 2);
+
+	mask = vfs_poll(poll->file, &ipt.pt) & poll->events;
+	if (unlikely(!poll->head)) {
+		/* we did not manage to set up a waitqueue, done */
+		goto out;
+	}
+
+	spin_lock_irq(&ctx->completion_lock);
+	spin_lock(&poll->head->lock);
+	if (poll->woken) {
+		/* wake_up context handles the rest */
+		mask = 0;
+		ipt.error = 0;
+	} else if (mask || ipt.error) {
+		/* if we get an error or a mask we are done */
+		WARN_ON_ONCE(list_empty(&poll->wait.entry));
+		list_del_init(&poll->wait.entry);
+	} else {
+		/* actually waiting for an event */
+		list_add_tail(&req->list, &ctx->cancel_list);
+	}
+	spin_unlock(&poll->head->lock);
+	spin_unlock_irq(&ctx->completion_lock);
+
+out:
+	if (unlikely(ipt.error)) {
+		if (!(sqe->flags & IOSQE_FIXED_FILE))
+			fput(poll->file);
+		return ipt.error;
+	}
+
+	if (mask)
+		io_poll_complete(req, mask);
+	io_free_req(req);
+	return 0;
+}
+
 static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 			   struct sqe_submit *s, bool force_nonblock,
 			   struct io_submit_state *state)
@@ -950,6 +1188,12 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	case IORING_OP_FSYNC:
 		ret = io_fsync(req, sqe, force_nonblock);
 		break;
+	case IORING_OP_POLL_ADD:
+		ret = io_poll_add(req, sqe);
+		break;
+	case IORING_OP_POLL_REMOVE:
+		ret = io_poll_remove(req, sqe);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -1791,6 +2035,7 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
 	percpu_ref_kill(&ctx->refs);
 	mutex_unlock(&ctx->uring_lock);

+	io_poll_remove_all(ctx);
 	io_iopoll_reap_events(ctx);
 	wait_for_completion(&ctx->ctx_done);
 	io_ring_ctx_free(ctx);

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 37c7402be9ca..60b52c551c87 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -27,6 +27,7 @@ struct io_uring_sqe {
 	union {
 		__kernel_rwf_t rw_flags;
 		__u32 fsync_flags;
+		__u16 poll_events;
 	};
 	__u64 user_data;	/* data to be passed back at completion time */
 	union {
@@ -53,6 +54,8 @@ struct io_uring_sqe {
 #define IORING_OP_FSYNC		3
 #define IORING_OP_READ_FIXED	4
 #define IORING_OP_WRITE_FIXED	5
+#define IORING_OP_POLL_ADD	6
+#define IORING_OP_POLL_REMOVE	7

 /*
  * sqe->fsync_flags
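On the completion side, io_poll_complete() above posts mangle_poll(mask),
so the cqe for an IORING_OP_POLL_ADD carries the triggered events in
classic poll(2) format in cqe->res (or a negative errno). A hypothetical
sketch of how a reaper might act on it; how the cqe is read off the CQ
ring is elided:

#include <poll.h>
#include <linux/io_uring.h>

static void handle_poll_completion(const struct io_uring_cqe *cqe)
{
	if (cqe->res < 0)
		return;			/* e.g. -ENOENT from POLL_REMOVE */

	if (cqe->res & POLLIN) {
		/*
		 * The fd tagged by cqe->user_data is readable. The poll was
		 * one-shot, so submit a fresh IORING_OP_POLL_ADD to re-arm.
		 */
	}
}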
From patchwork Fri Jan 18 16:12:25 2019
X-Patchwork-Submitter: Jens Axboe
X-Patchwork-Id: 10770829
From: Jens Axboe
To: linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-block@vger.kernel.org
Cc: hch@lst.de, jmoyer@redhat.com, avi@scylladb.com, Jens Axboe
Subject: [PATCH 17/17] io_uring: add io_uring_event cache hit information
Date: Fri, 18 Jan 2019 09:12:25 -0700
Message-Id: <20190118161225.4545-18-axboe@kernel.dk>
In-Reply-To: <20190118161225.4545-1-axboe@kernel.dk>
References: <20190118161225.4545-1-axboe@kernel.dk>

Add a hint on whether a read was served out of the page cache, or whether
it hit media. This is useful for buffered async IO; O_DIRECT reads will
never have this set (for obvious reasons).

If the read hit the page cache, cqe->flags will have IOCQE_FLAG_CACHEHIT
set.
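For illustration (not part of the patch), a reaper could use the flag to
keep simple hit/miss statistics for buffered reads; obtaining the cqe is
elided and the function name is hypothetical:

#include <linux/io_uring.h>

static void account_buffered_read(const struct io_uring_cqe *cqe,
				  unsigned long *hits, unsigned long *misses)
{
	if (cqe->res <= 0)
		return;		/* error or EOF: nothing to account */

	if (cqe->flags & IOCQE_FLAG_CACHEHIT)
		(*hits)++;	/* read was served from the page cache */
	else
		(*misses)++;	/* read had to touch media */
}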
Signed-off-by: Jens Axboe
---
 fs/io_uring.c                 | 7 ++++++-
 include/uapi/linux/io_uring.h | 5 +++++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 4709a19d692b..aec7c3db32c5 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -542,11 +542,16 @@ static void io_fput(struct io_kiocb *req)
 static void io_complete_rw(struct kiocb *kiocb, long res, long res2)
 {
 	struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw);
+	unsigned ev_flags = 0;

 	kiocb_end_write(kiocb);

 	io_fput(req);
-	io_cqring_add_event(req->ctx, req->user_data, res, 0);
+
+	if (res > 0 && (req->flags & REQ_F_FORCE_NONBLOCK))
+		ev_flags = IOCQE_FLAG_CACHEHIT;
+
+	io_cqring_add_event(req->ctx, req->user_data, res, ev_flags);
 	io_free_req(req);
 }

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 60b52c551c87..3b8d623031ad 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -71,6 +71,11 @@ struct io_uring_cqe {
 	__u32 flags;
 };

+/*
+ * io_uring_event->flags
+ */
+#define IOCQE_FLAG_CACHEHIT	(1 << 0)	/* IO did not hit media */
+
 /*
  * Magic offsets for the application to mmap the data it needs
  */