From patchwork Mon May 17 18:35:19 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Omar Sandoval X-Patchwork-Id: 12262797 From: Omar Sandoval To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org, Al Viro , Christoph Hellwig Cc: Linus Torvalds , Dave Chinner , Jann Horn , Amir Goldstein , Aleksa Sarai , linux-api@vger.kernel.org, kernel-team@fb.com Subject: [PATCH RERESEND v9 1/9] iov_iter: add copy_struct_from_iter() Date: Mon, 17 May 2021 11:35:19 -0700 Message-Id: <80b5f8bc277912222121f6ab9a9796d7f20998eb.1621276134.git.osandov@fb.com> X-Mailer: git-send-email 2.31.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: 
linux-fsdevel@vger.kernel.org From: Omar Sandoval This is essentially copy_struct_from_user() but for an iov_iter. Suggested-by: Aleksa Sarai Reviewed-by: Josef Bacik Signed-off-by: Omar Sandoval --- include/linux/uio.h | 1 + lib/iov_iter.c | 91 +++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 92 insertions(+) diff --git a/include/linux/uio.h b/include/linux/uio.h index d3ec87706d75..cbaf6b3bfcbc 100644 --- a/include/linux/uio.h +++ b/include/linux/uio.h @@ -129,6 +129,7 @@ size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes, struct iov_iter *i); size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes, struct iov_iter *i); +int copy_struct_from_iter(void *dst, size_t ksize, struct iov_iter *i); size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i); size_t _copy_from_iter(void *addr, size_t bytes, struct iov_iter *i); diff --git a/lib/iov_iter.c b/lib/iov_iter.c index c701b7a187f2..129f264416ff 100644 --- a/lib/iov_iter.c +++ b/lib/iov_iter.c @@ -995,6 +995,97 @@ size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes, } EXPORT_SYMBOL(copy_page_from_iter); +/** + * copy_struct_from_iter - copy a struct from an iov_iter + * @dst: Destination buffer. + * @ksize: Size of @dst struct. + * @i: Source iterator. + * + * Copies a struct from an iov_iter in a way that guarantees + * backwards-compatibility for struct arguments in an iovec (as long as the + * rules for copy_struct_from_user() are followed). + * + * The source struct is assumed to be stored in the current segment of the + * iov_iter, and its size is the size of the current segment. The iov_iter must + * be positioned at the beginning of the current segment. 
+ * + * The recommended usage is something like the following: + * + * int do_foo(struct iov_iter *i) + * { + * size_t usize = iov_iter_single_seg_count(i); + * struct foo karg; + * int err; + * + * if (usize > PAGE_SIZE) + * return -E2BIG; + * if (usize < FOO_SIZE_VER0) + * return -EINVAL; + * err = copy_struct_from_iter(&karg, sizeof(karg), i); + * if (err) + * return err; + * + * // ... + * } + * + * Returns 0 on success or one of the following errors: + * * -E2BIG: (size of current segment > @ksize) and there are non-zero + * trailing bytes in the current segment. + * * -EFAULT: access to userspace failed. + * * -EINVAL: the iterator is not at the beginning of the current segment. + * + * On success, the iterator is advanced to the next segment. On error, the + * iterator is not advanced. + */ +int copy_struct_from_iter(void *dst, size_t ksize, struct iov_iter *i) +{ + size_t usize; + int ret; + + if (i->iov_offset != 0) + return -EINVAL; + if (iter_is_iovec(i)) { + usize = i->iov->iov_len; + might_fault(); + if (copyin(dst, i->iov->iov_base, min(ksize, usize))) + return -EFAULT; + if (usize > ksize) { + ret = check_zeroed_user(i->iov->iov_base + ksize, + usize - ksize); + if (ret < 0) + return ret; + else if (ret == 0) + return -E2BIG; + } + } else if (iov_iter_is_kvec(i)) { + usize = i->kvec->iov_len; + memcpy(dst, i->kvec->iov_base, min(ksize, usize)); + if (usize > ksize && + memchr_inv(i->kvec->iov_base + ksize, 0, usize - ksize)) + return -E2BIG; + } else if (iov_iter_is_bvec(i)) { + char *p; + + usize = i->bvec->bv_len; + p = kmap_local_page(i->bvec->bv_page); + memcpy(dst, p + i->bvec->bv_offset, min(ksize, usize)); + if (usize > ksize && + memchr_inv(p + i->bvec->bv_offset + ksize, 0, + usize - ksize)) { + kunmap_local(p); + return -E2BIG; + } + kunmap_local(p); + } else { + return -EFAULT; + } + if (usize < ksize) + memset(dst + usize, 0, ksize - usize); + iov_iter_advance(i, usize); + return 0; +} +EXPORT_SYMBOL_GPL(copy_struct_from_iter); + static 
size_t pipe_zero(size_t bytes, struct iov_iter *i) { struct pipe_inode_info *pipe = i->pipe; From patchwork Mon May 17 18:35:20 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Omar Sandoval X-Patchwork-Id: 12262795 From: Omar Sandoval To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org, Al Viro , Christoph Hellwig Cc: Linus Torvalds , Dave Chinner , Jann Horn , Amir Goldstein , Aleksa Sarai , linux-api@vger.kernel.org, kernel-team@fb.com Subject: [PATCH RERESEND v9 2/9] fs: add O_ALLOW_ENCODED open flag Date: Mon, 17 May 2021 11:35:20 -0700 Message-Id: 
<0bdad6c8f03be64bc0a7ea9fcd525df7fa5b3ca5.1621276134.git.osandov@fb.com> X-Mailer: git-send-email 2.31.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Omar Sandoval The upcoming RWF_ENCODED operation introduces some security concerns: 1. Compressed writes will pass arbitrary data to decompression algorithms in the kernel. 2. Compressed reads can leak truncated/hole punched data. Therefore, we need to require privilege for RWF_ENCODED. It's not possible to do the permissions checks at the time of the read or write because, e.g., io_uring submits IO from a worker thread. So, add an open flag which requires CAP_SYS_ADMIN. It can also be set and cleared with fcntl(). The flag is not cleared in any way on fork or exec. Note that the usual issue that unknown open flags are ignored doesn't really matter for O_ALLOW_ENCODED; if the kernel doesn't support O_ALLOW_ENCODED, then it doesn't support RWF_ENCODED, either. Reviewed-by: Josef Bacik Signed-off-by: Omar Sandoval --- arch/alpha/include/uapi/asm/fcntl.h | 1 + arch/parisc/include/uapi/asm/fcntl.h | 1 + arch/sparc/include/uapi/asm/fcntl.h | 1 + fs/fcntl.c | 10 ++++++++-- fs/namei.c | 4 ++++ include/linux/fcntl.h | 2 +- include/uapi/asm-generic/fcntl.h | 4 ++++ 7 files changed, 20 insertions(+), 3 deletions(-) diff --git a/arch/alpha/include/uapi/asm/fcntl.h b/arch/alpha/include/uapi/asm/fcntl.h index 50bdc8e8a271..391e0d112e41 100644 --- a/arch/alpha/include/uapi/asm/fcntl.h +++ b/arch/alpha/include/uapi/asm/fcntl.h @@ -34,6 +34,7 @@ #define O_PATH 040000000 #define __O_TMPFILE 0100000000 +#define O_ALLOW_ENCODED 0200000000 #define F_GETLK 7 #define F_SETLK 8 diff --git a/arch/parisc/include/uapi/asm/fcntl.h b/arch/parisc/include/uapi/asm/fcntl.h index 03dee816cb13..0feb31faaefa 100644 --- a/arch/parisc/include/uapi/asm/fcntl.h +++ b/arch/parisc/include/uapi/asm/fcntl.h @@ -19,6 +19,7 @@ #define O_PATH 020000000 #define __O_TMPFILE 040000000 
+#define O_ALLOW_ENCODED 0100000000 #define F_GETLK64 8 #define F_SETLK64 9 diff --git a/arch/sparc/include/uapi/asm/fcntl.h b/arch/sparc/include/uapi/asm/fcntl.h index 67dae75e5274..ac3e8c9cb32c 100644 --- a/arch/sparc/include/uapi/asm/fcntl.h +++ b/arch/sparc/include/uapi/asm/fcntl.h @@ -37,6 +37,7 @@ #define O_PATH 0x1000000 #define __O_TMPFILE 0x2000000 +#define O_ALLOW_ENCODED 0x8000000 #define F_GETOWN 5 /* for sockets. */ #define F_SETOWN 6 /* for sockets. */ diff --git a/fs/fcntl.c b/fs/fcntl.c index dfc72f15be7f..eca4eb008194 100644 --- a/fs/fcntl.c +++ b/fs/fcntl.c @@ -31,7 +31,8 @@ #include #include -#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME) +#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME | \ + O_ALLOW_ENCODED) static int setfl(int fd, struct file * filp, unsigned long arg) { @@ -50,6 +51,11 @@ static int setfl(int fd, struct file * filp, unsigned long arg) if (!inode_owner_or_capable(file_mnt_user_ns(filp), inode)) return -EPERM; + /* O_ALLOW_ENCODED can only be set by superuser */ + if ((arg & O_ALLOW_ENCODED) && !(filp->f_flags & O_ALLOW_ENCODED) && + !capable(CAP_SYS_ADMIN)) + return -EPERM; + /* required for strict SunOS emulation */ if (O_NONBLOCK != O_NDELAY) if (arg & O_NDELAY) @@ -1043,7 +1049,7 @@ static int __init fcntl_init(void) * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY * is defined as O_NONBLOCK on some platforms and not on others. 
*/ - BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ != + BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32( (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) | __FMODE_EXEC | __FMODE_NONOTIFY)); diff --git a/fs/namei.c b/fs/namei.c index 79b0ff9b151e..b05f121b3947 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -2997,6 +2997,10 @@ static int may_open(struct user_namespace *mnt_userns, const struct path *path, if (flag & O_NOATIME && !inode_owner_or_capable(mnt_userns, inode)) return -EPERM; + /* O_ALLOW_ENCODED can only be set by superuser */ + if ((flag & O_ALLOW_ENCODED) && !capable(CAP_SYS_ADMIN)) + return -EPERM; + return 0; } diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h index 766fcd973beb..2cd6a9185d4c 100644 --- a/include/linux/fcntl.h +++ b/include/linux/fcntl.h @@ -10,7 +10,7 @@ (O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \ O_APPEND | O_NDELAY | O_NONBLOCK | __O_SYNC | O_DSYNC | \ FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \ - O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE) + O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_ALLOW_ENCODED) /* List of all valid flags for the how->upgrade_mask argument: */ #define VALID_UPGRADE_FLAGS \ diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h index 9dc0bf0c5a6e..75321c7a66ac 100644 --- a/include/uapi/asm-generic/fcntl.h +++ b/include/uapi/asm-generic/fcntl.h @@ -89,6 +89,10 @@ #define __O_TMPFILE 020000000 #endif +#ifndef O_ALLOW_ENCODED +#define O_ALLOW_ENCODED 040000000 +#endif + /* a horrid kludge trying to make sure that this will fail on old kernels */ #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY) #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT) From patchwork Mon May 17 18:35:21 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Omar Sandoval X-Patchwork-Id: 12262799 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on 
aws-us-west-2-korg-lkml-1.web.codeaurora.org From: Omar Sandoval To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org, Al Viro , Christoph Hellwig Cc: Linus Torvalds , Dave Chinner , Jann Horn , Amir Goldstein , Aleksa Sarai , linux-api@vger.kernel.org, kernel-team@fb.com Subject: [PATCH RERESEND v9 3/9] fs: add RWF_ENCODED for reading/writing compressed data Date: Mon, 17 May 2021 11:35:21 -0700 Message-Id: <85abdef1969c6502960abe41830ef0dcdb4db0dc.1621276134.git.osandov@fb.com> X-Mailer: git-send-email 2.31.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Omar Sandoval Btrfs supports transparent compression: data written by the user can be compressed when written to disk and decompressed when read back. 
However, we'd like to add an interface to write pre-compressed data directly to the filesystem, and the matching interface to read compressed data without decompressing it. This adds support for so-called "encoded I/O" via preadv2() and pwritev2(). A new RWF_ENCODED flag indicates that a read or write is "encoded". If this flag is set, iov[0].iov_base points to a struct encoded_iov which is used for metadata: namely, the compression algorithm, unencoded (i.e., decompressed) length, and what subrange of the unencoded data should be used (needed for truncated or hole-punched extents and when reading in the middle of an extent). For reads, the filesystem returns this information; for writes, the caller provides it to the filesystem. iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be used to extend the interface in the future a la copy_struct_from_user(). The remaining iovecs contain the encoded extent. This adds the VFS helpers for supporting encoded I/O and documentation. 
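For illustration only, the iovec layout described above might be assembled in userspace roughly as follows. This is a sketch under stated assumptions, not part of the patch: struct encoded_iov_sketch is a hypothetical local mirror of the struct encoded_iov fields listed in the documentation added by this patch, and encoded_iov_range_ok() and build_encoded_iovec() are made-up helper names. The range check follows the documented rules that unencoded_offset is at most unencoded_len and len is at most unencoded_len - unencoded_offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/*
 * Hypothetical userspace mirror of struct encoded_iov (assumption: field
 * order and widths as in the documentation in this patch). A real program
 * would use the UAPI header from a kernel carrying this series instead.
 */
struct encoded_iov_sketch {
	uint64_t len;              /* logical length of the data in the file */
	uint64_t unencoded_len;    /* length of the fully decoded extent */
	uint64_t unencoded_offset; /* start of the logical data within it */
	uint32_t compression;      /* zero means not compressed */
	uint32_t encryption;       /* zero means not encrypted */
};

/* Return 1 if the subrange described by the metadata is self-consistent. */
static int encoded_iov_range_ok(const struct encoded_iov_sketch *e)
{
	if (e->unencoded_offset > e->unencoded_len)
		return 0;
	return e->len <= e->unencoded_len - e->unencoded_offset;
}

/*
 * Fill a two-element iovec: iov[0] carries the metadata struct and iov[1]
 * carries the encoded payload. Made-up helper, for illustration only.
 */
static void build_encoded_iovec(struct iovec iov[2],
				struct encoded_iov_sketch *meta,
				void *data, size_t data_len)
{
	assert(encoded_iov_range_ok(meta));
	iov[0].iov_base = meta;
	iov[0].iov_len = sizeof(*meta); /* struct size doubles as the ABI version */
	iov[1].iov_base = data;
	iov[1].iov_len = data_len;
}
```

On a kernel carrying this series, the resulting array would presumably be passed to pwritev2() or preadv2() with iovcnt set to 2 and RWF_ENCODED in the flags argument; that call is left out here because the flag is only defined by this series.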
Reviewed-by: Josef Bacik Signed-off-by: Omar Sandoval --- Documentation/filesystems/encoded_io.rst | 240 +++++++++++++++++++++++ Documentation/filesystems/index.rst | 1 + fs/read_write.c | 168 ++++++++++++++-- include/linux/encoded_io.h | 17 ++ include/linux/fs.h | 13 ++ include/uapi/linux/encoded_io.h | 30 +++ include/uapi/linux/fs.h | 5 +- 7 files changed, 460 insertions(+), 14 deletions(-) create mode 100644 Documentation/filesystems/encoded_io.rst create mode 100644 include/linux/encoded_io.h create mode 100644 include/uapi/linux/encoded_io.h diff --git a/Documentation/filesystems/encoded_io.rst b/Documentation/filesystems/encoded_io.rst new file mode 100644 index 000000000000..38f1dc940331 --- /dev/null +++ b/Documentation/filesystems/encoded_io.rst @@ -0,0 +1,240 @@ +=========== +Encoded I/O +=========== + +Several filesystems (e.g., Btrfs) support transparent encoding (e.g., +compression, encryption) of data on disk: written data is encoded by the kernel +before it is written to disk, and read data is decoded before being returned to +the user. In some cases, it is useful to skip this encoding step. For example, +the user may want to read the compressed contents of a file or write +pre-compressed data directly to a file. This is referred to as "encoded I/O". + +User API +======== + +Encoded I/O is specified with the ``RWF_ENCODED`` flag to ``preadv2()`` and +``pwritev2()``. If ``RWF_ENCODED`` is specified, then ``iov[0].iov_base`` +points to an ``encoded_iov`` structure, defined in ``<linux/encoded_io.h>`` +as:: + + struct encoded_iov { + __aligned_u64 len; + __aligned_u64 unencoded_len; + __aligned_u64 unencoded_offset; + __u32 compression; + __u32 encryption; + }; + +This may be extended in the future, so ``iov[0].iov_len`` must be set to +``sizeof(struct encoded_iov)`` for forward/backward compatibility. The +remaining buffers contain the encoded data. + +``compression`` and ``encryption`` are the encoding fields. 
``compression`` is +``ENCODED_IOV_COMPRESSION_NONE`` (zero) or a filesystem-specific +``ENCODED_IOV_COMPRESSION_*`` constant; see `Filesystem support`_ below. +``encryption`` is currently always ``ENCODED_IOV_ENCRYPTION_NONE`` (zero). + +``unencoded_len`` is the length of the unencoded (i.e., decrypted and +decompressed) data. ``unencoded_offset`` is the offset from the first byte of +the unencoded data to the first byte of logical data in the file (less than or +equal to ``unencoded_len``). ``len`` is the length of the data in the file +(less than or equal to ``unencoded_len - unencoded_offset``). See `Extent +layout`_ below for some examples. + +If the unencoded data is actually longer than ``unencoded_len``, then it is +truncated; if it is shorter, then it is extended with zeroes. + +``pwritev2()`` uses the metadata specified in ``iov[0]``, writes the encoded +data from the remaining buffers, and returns the number of encoded bytes +written (that is, the sum of ``iov[n].iov_len for 1 <= n < iovcnt``; partial +writes will not occur). At least one encoding field must be non-zero. Note that +the encoded data is not validated when it is written; if it is not valid (e.g., +it cannot be decompressed), then a subsequent read may return an error. If the +offset argument to ``pwritev2()`` is -1, then the file offset is incremented by +``len``. If ``iov[0].iov_len`` is less than ``sizeof(struct encoded_iov)`` in +the kernel, then any fields unknown to user space are treated as if they were +zero; if it is greater and any fields unknown to the kernel are non-zero, then +``pwritev2()`` returns -1 and sets errno to ``E2BIG``. + +``preadv2()`` populates the metadata in ``iov[0]``, the encoded data in the +remaining buffers, and returns the number of encoded bytes read. This will only +return one extent per call. This can also read data which is not encoded; all +encoding fields will be zero in that case. 
If the offset argument to +``preadv2()`` is -1, then the file offset is incremented by ``len``. If +``iov[0].iov_len`` is less than ``sizeof(struct encoded_iov)`` in the kernel +and any fields unknown to user space are non-zero, then ``preadv2()`` returns +-1 and sets errno to ``E2BIG``; if it is greater, then any fields unknown to +the kernel are returned as zero. If the provided buffers are not large enough +to return an entire encoded extent, then ``preadv2()`` returns -1 and sets +errno to ``ENOBUFS``. + +As the filesystem page cache typically contains decoded data, encoded I/O +bypasses the page cache. + +Extent layout +------------- + +By using ``len``, ``unencoded_len``, and ``unencoded_offset``, it is possible +to refer to a subset of an unencoded extent. + +In the simplest case, ``len`` is equal to ``unencoded_len`` and +``unencoded_offset`` is zero. This means that the entire unencoded extent is +used. + +However, suppose we read 50 bytes into a file which contains a single +compressed extent. The filesystem must still return the entire compressed +extent for us to be able to decompress it, so ``unencoded_len`` would be the +length of the entire decompressed extent. However, because the read was at +offset 50, the first 50 bytes should be ignored. Therefore, +``unencoded_offset`` would be 50, and ``len`` would accordingly be +``unencoded_len - 50``. + +Additionally, suppose we want to create an encrypted file with length 500, but +the file is encrypted with a block cipher using a block size of 4096. The +unencoded data would therefore include the appropriate padding, and +``unencoded_len`` would be 4096. However, to represent the logical size of the +file, ``len`` would be 500 (and ``unencoded_offset`` would be 0). 
+ +Similar situations can arise in other cases: + +* If the filesystem pads data to the filesystem block size before compressing, + then compressed files with a size unaligned to the filesystem block size will + end with an extent with ``len < unencoded_len``. + +* Extents cloned from the middle of a larger encoded extent with + ``FICLONERANGE`` may have a non-zero ``unencoded_offset`` and/or + ``len < unencoded_len``. + +* If the middle of an encoded extent is overwritten, the filesystem may create + extents with a non-zero ``unencoded_offset`` and/or ``len < unencoded_len`` + for the parts that were not overwritten. + +Security +-------- + +Encoded I/O creates the potential for some security issues: + +* Encoded writes allow writing arbitrary data which the kernel will decode on a + subsequent read. Decompression algorithms are complex and may have bugs that + can be exploited by maliciously crafted data. +* Encoded reads may return data that is not logically present in the file (see + the discussion of ``len`` vs ``unencoded_len`` above). It may not be intended + for this data to be readable. + +Therefore, encoded I/O requires privilege. Namely, the ``RWF_ENCODED`` flag may +only be used if the file description has the ``O_ALLOW_ENCODED`` file status +flag set, and the ``O_ALLOW_ENCODED`` flag may only be set by a thread with the +``CAP_SYS_ADMIN`` capability. The ``O_ALLOW_ENCODED`` flag can be set by +``open()`` or ``fcntl()``. It can also be cleared by ``fcntl()``; clearing it +does not require ``CAP_SYS_ADMIN``. Note that it is not cleared on ``fork()`` +or ``execve()``. One may wish to use ``O_CLOEXEC`` with ``O_ALLOW_ENCODED``. + +Filesystem support +------------------ + +Encoded I/O is supported on the following filesystems: + +Btrfs (since Linux 5.14) +~~~~~~~~~~~~~~~~~~~~~~~~ + +Btrfs supports encoded reads and writes of compressed data. 
The data is encoded +as follows: + +* If ``compression`` is ``ENCODED_IOV_COMPRESSION_BTRFS_ZLIB``, then the encoded + data is a single zlib stream. +* If ``compression`` is ``ENCODED_IOV_COMPRESSION_BTRFS_ZSTD``, then the + encoded data is a single zstd frame compressed with the windowLog compression + parameter set to no more than 17. +* If ``compression`` is one of ``ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K``, + ``ENCODED_IOV_COMPRESSION_BTRFS_LZO_8K``, + ``ENCODED_IOV_COMPRESSION_BTRFS_LZO_16K``, + ``ENCODED_IOV_COMPRESSION_BTRFS_LZO_32K``, or + ``ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K``, then the encoded data is + compressed page by page (using the page size indicated by the name of the + constant) with LZO1X and wrapped in the format documented in the Linux kernel + source file ``fs/btrfs/lzo.c``. + +Additionally, there are some restrictions on ``pwritev2()``: + +* ``offset`` (or the current file offset if ``offset`` is -1) must be aligned + to the sector size of the filesystem. +* ``len`` must be aligned to the sector size of the filesystem unless the data + ends at or beyond the current end of the file. +* ``unencoded_len`` and the length of the encoded data must each be no more + than 128 KiB. This limit may increase in the future. +* The length of the encoded data must be less than or equal to + ``unencoded_len.`` +* If using LZO, the filesystem's page size must match the compression page + size. + +Implementation +============== + +This section describes the requirements for filesystems implementing encoded +I/O. + +First of all, a filesystem supporting encoded I/O must indicate this by setting +the ``FMODE_ENCODED_IO`` flag in its ``file_open`` file operation:: + + static int foo_file_open(struct inode *inode, struct file *filp) + { + ... + filep->f_mode |= FMODE_ENCODED_IO; + ... + } + +Encoded I/O goes through ``read_iter`` and ``write_iter``, designated by the +``IOCB_ENCODED`` flag in ``kiocb->ki_flags``. 
+ +Reads +----- + +Encoded ``read_iter`` should: + +1. Call ``generic_encoded_read_checks()`` to validate the file and buffers + provided by userspace. +2. Initialize the ``encoded_iov`` appropriately. +3. Copy it to the user with ``copy_encoded_iov_to_iter()``. +4. Copy the encoded data to the user. +5. Advance ``kiocb->ki_pos`` by ``encoded_iov->len``. +6. Return the size of the encoded data read, not including the ``encoded_iov``. + +There are a few details to be aware of: + +* Encoded ``read_iter`` should support reading unencoded data if the extent is + not encoded. +* If the buffers provided by the user are not large enough to contain an entire + encoded extent, then ``read_iter`` should return ``-ENOBUFS``. This is to + avoid confusing userspace with truncated data that cannot be properly + decoded. +* Reads in the middle of an encoded extent can be returned by setting + ``encoded_iov->unencoded_offset`` to non-zero. +* Truncated unencoded data (e.g., because the file does not end on a block + boundary) may be returned by setting ``encoded_iov->len`` to a value smaller + than ``encoded_iov->unencoded_len - encoded_iov->unencoded_offset``. + +Writes +------ + +Encoded ``write_iter`` should (in addition to the usual accounting/checks done +by ``write_iter``): + +1. Call ``copy_encoded_iov_from_iter()`` to get and validate the + ``encoded_iov``. +2. Call ``generic_encoded_write_checks()`` instead of + ``generic_write_checks()``. +3. Check that the provided encoding in ``encoded_iov`` is supported. +4. Advance ``kiocb->ki_pos`` by ``encoded_iov->len``. +5. Return the size of the encoded data written. + +Again, there are a few details: + +* Encoded ``write_iter`` doesn't need to support writing unencoded data. +* ``write_iter`` should either write all of the encoded data or none of it; it + must not do partial writes. +* ``write_iter`` doesn't need to validate the encoded data; a subsequent read + may return, e.g., ``-EIO`` if the data is not valid. 
+* The user may lie about the unencoded size of the data; a subsequent read + should truncate or zero-extend the unencoded data rather than returning an + error. +* Be careful of page cache coherency. diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index d4853cb919d2..670c673c5956 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -54,6 +54,7 @@ filesystem implementations. fscrypt fsverity netfs_library + encoded_io Filesystems =========== diff --git a/fs/read_write.c b/fs/read_write.c index 9db7adf160d2..f8db16e01227 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -20,6 +20,7 @@ #include #include #include +#include #include "internal.h" #include @@ -1632,24 +1633,15 @@ int generic_write_check_limits(struct file *file, loff_t pos, loff_t *count) return 0; } -/* - * Performs necessary checks before doing a write - * - * Can adjust writing position or amount of bytes to write. - * Returns appropriate error code that caller should return or - * zero in case that write should be allowed. - */ -ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from) +static int generic_write_checks_common(struct kiocb *iocb, loff_t *count) { struct file *file = iocb->ki_filp; struct inode *inode = file->f_mapping->host; - loff_t count; - int ret; if (IS_SWAPFILE(inode)) return -ETXTBSY; - if (!iov_iter_count(from)) + if (!*count) return 0; /* FIXME: this is for backwards compatibility with 2.4 */ @@ -1659,8 +1651,22 @@ ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from) if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT)) return -EINVAL; - count = iov_iter_count(from); - ret = generic_write_check_limits(file, iocb->ki_pos, &count); + return generic_write_check_limits(iocb->ki_filp, iocb->ki_pos, count); +} + +/* + * Performs necessary checks before doing a write + * + * Can adjust writing position or amount of bytes to write. 
+ * Returns appropriate error code that caller should return or + * zero in case that write should be allowed. + */ +ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from) +{ + loff_t count = iov_iter_count(from); + int ret; + + ret = generic_write_checks_common(iocb, &count); if (ret) return ret; @@ -1691,3 +1697,139 @@ int generic_file_rw_checks(struct file *file_in, struct file *file_out) return 0; } + +/** + * generic_encoded_write_checks() - check an encoded write + * @iocb: I/O context. + * @encoded: Encoding metadata. + * + * This should be called by RWF_ENCODED write implementations rather than + * generic_write_checks(). Unlike generic_write_checks(), it returns -EFBIG + * instead of adjusting the size of the write. + * + * Return: 0 on success, -errno on error. + */ +int generic_encoded_write_checks(struct kiocb *iocb, + const struct encoded_iov *encoded) +{ + loff_t count = encoded->len; + int ret; + + if (!(iocb->ki_filp->f_flags & O_ALLOW_ENCODED)) + return -EPERM; + + ret = generic_write_checks_common(iocb, &count); + if (ret) + return ret; + + if (count != encoded->len) { + /* + * The write got truncated by generic_write_checks_common(). We + * can't do a partial encoded write. + */ + return -EFBIG; + } + return 0; +} +EXPORT_SYMBOL(generic_encoded_write_checks); + +/** + * copy_encoded_iov_from_iter() - copy a &struct encoded_iov from userspace + * @encoded: Returned encoding metadata. + * @from: Source iterator. + * + * This copies in the &struct encoded_iov and does some basic sanity checks. + * This should always be used rather than a plain copy_from_iter(), as it does + * the proper handling for backward- and forward-compatibility. + * + * Return: 0 on success, -EFAULT if access to userspace failed, -E2BIG if the + * copied structure contained non-zero fields that this kernel doesn't + * support, -EINVAL if the copied structure was invalid. 
+ */ +int copy_encoded_iov_from_iter(struct encoded_iov *encoded, + struct iov_iter *from) +{ + size_t usize; + int ret; + + usize = iov_iter_single_seg_count(from); + if (usize > PAGE_SIZE) + return -E2BIG; + if (usize < ENCODED_IOV_SIZE_VER0) + return -EINVAL; + ret = copy_struct_from_iter(encoded, sizeof(*encoded), from); + if (ret) + return ret; + + if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE && + encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE) + return -EINVAL; + if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES || + encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES) + return -EINVAL; + if (encoded->unencoded_offset > encoded->unencoded_len) + return -EINVAL; + if (encoded->len > encoded->unencoded_len - encoded->unencoded_offset) + return -EINVAL; + return 0; +} +EXPORT_SYMBOL(copy_encoded_iov_from_iter); + +/** + * generic_encoded_read_checks() - sanity check an RWF_ENCODED read + * @iocb: I/O context. + * @iter: Destination iterator for read. + * + * This should always be called by RWF_ENCODED read implementations before + * returning any data. + * + * Return: Number of bytes available to return encoded data in @iter on success, + * -EPERM if the file was not opened with O_ALLOW_ENCODED, -EINVAL if + * the size of the &struct encoded_iov iovec is invalid. + */ +ssize_t generic_encoded_read_checks(struct kiocb *iocb, struct iov_iter *iter) +{ + size_t usize; + + if (!(iocb->ki_filp->f_flags & O_ALLOW_ENCODED)) + return -EPERM; + usize = iov_iter_single_seg_count(iter); + if (usize > PAGE_SIZE || usize < ENCODED_IOV_SIZE_VER0) + return -EINVAL; + return iov_iter_count(iter) - usize; +} +EXPORT_SYMBOL(generic_encoded_read_checks); + +/** + * copy_encoded_iov_to_iter() - copy a &struct encoded_iov to userspace + * @encoded: Encoding metadata to return. + * @to: Destination iterator. 
+ * + * This should always be used by RWF_ENCODED read implementations rather than a + * plain copy_to_iter(), as it does the proper handling for backward- and + * forward-compatibility. The iterator must be sanity-checked with + * generic_encoded_read_checks() before this is called. + * + * Return: 0 on success, -EFAULT if access to userspace failed, -E2BIG if there + * were non-zero fields in @encoded that the user buffer could not + * accommodate. + */ +int copy_encoded_iov_to_iter(const struct encoded_iov *encoded, + struct iov_iter *to) +{ + size_t ksize = sizeof(*encoded); + size_t usize = iov_iter_single_seg_count(to); + size_t size = min(ksize, usize); + + /* We already sanity-checked usize in generic_encoded_read_checks(). */ + + if (usize < ksize && + memchr_inv((char *)encoded + usize, 0, ksize - usize)) + return -E2BIG; + if (copy_to_iter(encoded, size, to) != size || + (usize > ksize && + iov_iter_zero(usize - ksize, to) != usize - ksize)) + return -EFAULT; + return 0; +} +EXPORT_SYMBOL(copy_encoded_iov_to_iter); diff --git a/include/linux/encoded_io.h b/include/linux/encoded_io.h new file mode 100644 index 000000000000..a8cfc0108ba0 --- /dev/null +++ b/include/linux/encoded_io.h @@ -0,0 +1,17 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_ENCODED_IO_H +#define _LINUX_ENCODED_IO_H + +#include + +struct encoded_iov; +struct iov_iter; +struct kiocb; +extern int generic_encoded_write_checks(struct kiocb *, + const struct encoded_iov *); +extern int copy_encoded_iov_from_iter(struct encoded_iov *, struct iov_iter *); +extern ssize_t generic_encoded_read_checks(struct kiocb *, struct iov_iter *); +extern int copy_encoded_iov_to_iter(const struct encoded_iov *, + struct iov_iter *); + +#endif /* _LINUX_ENCODED_IO_H */ diff --git a/include/linux/fs.h b/include/linux/fs.h index c3c88fdb9b2a..2a9ab11baaed 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -181,6 +181,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset, /* File 
supports async buffered reads */ #define FMODE_BUF_RASYNC ((__force fmode_t)0x40000000) +/* File supports encoded IO */ +#define FMODE_ENCODED_IO ((__force fmode_t)0x80000000) + /* * Attribute flags. These should be or-ed together to figure out what * has been changed! @@ -311,6 +314,7 @@ enum rw_hint { #define IOCB_SYNC (__force int) RWF_SYNC #define IOCB_NOWAIT (__force int) RWF_NOWAIT #define IOCB_APPEND (__force int) RWF_APPEND +#define IOCB_ENCODED (__force int) RWF_ENCODED /* non-RWF related bits - start at 16 */ #define IOCB_EVENTFD (1 << 16) @@ -3223,6 +3227,13 @@ extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *); extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *); extern int generic_write_check_limits(struct file *file, loff_t pos, loff_t *count); +struct encoded_iov; +extern int generic_encoded_write_checks(struct kiocb *, + const struct encoded_iov *); +extern int copy_encoded_iov_from_iter(struct encoded_iov *, struct iov_iter *); +extern ssize_t generic_encoded_read_checks(struct kiocb *, struct iov_iter *); +extern int copy_encoded_iov_to_iter(const struct encoded_iov *, + struct iov_iter *); extern int generic_file_rw_checks(struct file *file_in, struct file *file_out); ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *to, ssize_t already_read); @@ -3528,6 +3539,8 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags) return -EOPNOTSUPP; kiocb_flags |= IOCB_NOIO; } + if ((flags & RWF_ENCODED) && !(ki->ki_filp->f_mode & FMODE_ENCODED_IO)) + return -EOPNOTSUPP; kiocb_flags |= (__force int) (flags & RWF_SUPPORTED); if (flags & RWF_SYNC) kiocb_flags |= IOCB_DSYNC; diff --git a/include/uapi/linux/encoded_io.h b/include/uapi/linux/encoded_io.h new file mode 100644 index 000000000000..cf741453dba4 --- /dev/null +++ b/include/uapi/linux/encoded_io.h @@ -0,0 +1,30 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _UAPI_LINUX_ENCODED_IO_H +#define 
_UAPI_LINUX_ENCODED_IO_H + +#include + +#define ENCODED_IOV_COMPRESSION_NONE 0 +#define ENCODED_IOV_COMPRESSION_BTRFS_ZLIB 1 +#define ENCODED_IOV_COMPRESSION_BTRFS_ZSTD 2 +#define ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K 3 +#define ENCODED_IOV_COMPRESSION_BTRFS_LZO_8K 4 +#define ENCODED_IOV_COMPRESSION_BTRFS_LZO_16K 5 +#define ENCODED_IOV_COMPRESSION_BTRFS_LZO_32K 6 +#define ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K 7 +#define ENCODED_IOV_COMPRESSION_TYPES 8 + +#define ENCODED_IOV_ENCRYPTION_NONE 0 +#define ENCODED_IOV_ENCRYPTION_TYPES 1 + +struct encoded_iov { + __aligned_u64 len; + __aligned_u64 unencoded_len; + __aligned_u64 unencoded_offset; + __u32 compression; + __u32 encryption; +}; + +#define ENCODED_IOV_SIZE_VER0 32 + +#endif /* _UAPI_LINUX_ENCODED_IO_H */ diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h index 4c32e97dcdf0..0ef3a073c9b4 100644 --- a/include/uapi/linux/fs.h +++ b/include/uapi/linux/fs.h @@ -300,8 +300,11 @@ typedef int __bitwise __kernel_rwf_t; /* per-IO O_APPEND */ #define RWF_APPEND ((__force __kernel_rwf_t)0x00000010) +/* encoded (e.g., compressed and/or encrypted) IO */ +#define RWF_ENCODED ((__force __kernel_rwf_t)0x00000020) + /* mask of flags supported by the kernel */ #define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\ - RWF_APPEND) + RWF_APPEND | RWF_ENCODED) #endif /* _UAPI_LINUX_FS_H */
From patchwork Mon May 17 18:35:22 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Omar Sandoval X-Patchwork-Id: 12262801 From: Omar Sandoval To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org, Al Viro , Christoph Hellwig Cc: Linus Torvalds , Dave Chinner , Jann Horn , Amir Goldstein , Aleksa Sarai , linux-api@vger.kernel.org, kernel-team@fb.com Subject: [PATCH RERESEND v9 4/9] btrfs: don't advance offset for compressed bios in btrfs_csum_one_bio() Date: Mon, 17 May 2021 11:35:22 -0700 Message-Id: <7d5c96bc72802a96b3793ac11abbb97bcbd218ab.1621276134.git.osandov@fb.com> X-Mailer: git-send-email 2.31.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org
From: Omar Sandoval btrfs_csum_one_bio() loops over each filesystem block in the bio while keeping a cursor of its current logical position in the file in order to look up the ordered extent to add the checksums to. However, this doesn't make much sense for compressed extents, as a sector on disk does not correspond to a sector of decompressed file data. It happens to work because 1) the compressed bio always covers one ordered extent and 2) the size of the bio is always less than the size of the ordered extent. However, the second point will not always be true for encoded writes.
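To illustrate the point about the cursor, here is a minimal standalone sketch. It is not the kernel code: struct ordered_range and cursor_leaves_extent() are hypothetical names made up for this example, and the loop only models the per-block cursor and the range check against the ordered extent. The one_ordered argument is assumed to behave like the boolean parameter this patch adds.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct btrfs_ordered_extent. */
struct ordered_range {
	uint64_t file_offset;	/* logical start in the file */
	uint64_t num_bytes;	/* logical length in the file */
};

/*
 * Walk a bio of bio_bytes in blocks of sectorsize bytes and report
 * whether the logical cursor ever falls outside the ordered extent.
 * If one_ordered is true, the cursor is not advanced and the range
 * check is skipped, so the walk can never leave the extent.
 */
static bool cursor_leaves_extent(const struct ordered_range *ordered,
				 uint64_t offset, uint64_t bio_bytes,
				 uint32_t sectorsize, bool one_ordered)
{
	uint64_t done;

	for (done = 0; done < bio_bytes; done += sectorsize) {
		if (!one_ordered &&
		    (offset >= ordered->file_offset + ordered->num_bytes ||
		     offset < ordered->file_offset))
			return true;
		if (!one_ordered)
			offset += sectorsize;
	}
	return false;
}
```

With a 128 KiB ordered extent and an 8 KiB compressed bio, the cursor stays in range. If the ordered extent covers only 4 KiB of file data while the encoded bio is 8 KiB, the cursor walks past the end of the extent on the second block unless one_ordered is set.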
Let's add a boolean parameter to btrfs_csum_one_bio() to indicate that it can assume that the bio only covers one ordered extent. Since we're already changing the signature, let's get rid of the contig parameter and make it implied by the offset parameter, similar to the change we recently made to btrfs_lookup_bio_sums(). Additionally, let's rename nr_sectors to blockcount to make it clear that it's the number of filesystem blocks, not the number of 512-byte sectors. Reviewed-by: Josef Bacik Signed-off-by: Omar Sandoval --- fs/btrfs/compression.c | 5 +++-- fs/btrfs/ctree.h | 2 +- fs/btrfs/file-item.c | 35 ++++++++++++++++------------------- fs/btrfs/inode.c | 8 ++++---- 4 files changed, 24 insertions(+), 26 deletions(-) diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index 2bea01d23a5b..b6d9a9657c3a 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -454,7 +454,8 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start, BUG_ON(ret); /* -ENOMEM */ if (!skip_sum) { - ret = btrfs_csum_one_bio(inode, bio, start, 1); + ret = btrfs_csum_one_bio(inode, bio, start, + true); BUG_ON(ret); /* -ENOMEM */ } @@ -486,7 +487,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start, BUG_ON(ret); /* -ENOMEM */ if (!skip_sum) { - ret = btrfs_csum_one_bio(inode, bio, start, 1); + ret = btrfs_csum_one_bio(inode, bio, start, true); BUG_ON(ret); /* -ENOMEM */ } diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 938d8ebf4cf3..178ad516eaaa 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3082,7 +3082,7 @@ int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct btrfs_ordered_sum *sums); blk_status_t btrfs_csum_one_bio(struct btrfs_inode *inode, struct bio *bio, - u64 file_start, int contig); + u64 offset, bool one_ordered); int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end, struct list_head *list, int search_commit); void 
btrfs_extent_item_to_extent_map(struct btrfs_inode *inode, diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c index 294602f139ef..8f755ef20aaa 100644 --- a/fs/btrfs/file-item.c +++ b/fs/btrfs/file-item.c @@ -615,28 +615,28 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end, * btrfs_csum_one_bio - Calculates checksums of the data contained inside a bio * @inode: Owner of the data inside the bio * @bio: Contains the data to be checksummed - * @file_start: offset in file this bio begins to describe - * @contig: Boolean. If true/1 means all bio vecs in this bio are - * contiguous and they begin at @file_start in the file. False/0 - * means this bio can contains potentially discontigous bio vecs - * so the logical offset of each should be calculated separately. + * @offset: If (u64)-1, @bio may contain discontiguous bio vecs, so the + * file offsets are determined from the page offsets in the bio. + * Otherwise, this is the starting file offset of the bio vecs in + * @bio, which must be contiguous. + * @one_ordered: If true, @bio only refers to one ordered extent. 
*/ blk_status_t btrfs_csum_one_bio(struct btrfs_inode *inode, struct bio *bio, - u64 file_start, int contig) + u64 offset, bool one_ordered) { struct btrfs_fs_info *fs_info = inode->root->fs_info; SHASH_DESC_ON_STACK(shash, fs_info->csum_shash); struct btrfs_ordered_sum *sums; struct btrfs_ordered_extent *ordered = NULL; + const bool page_offsets = (offset == (u64)-1); char *data; struct bvec_iter iter; struct bio_vec bvec; int index; - int nr_sectors; + int blockcount; unsigned long total_bytes = 0; unsigned long this_sum_bytes = 0; int i; - u64 offset; unsigned nofs_flag; nofs_flag = memalloc_nofs_save(); @@ -650,18 +650,13 @@ blk_status_t btrfs_csum_one_bio(struct btrfs_inode *inode, struct bio *bio, sums->len = bio->bi_iter.bi_size; INIT_LIST_HEAD(&sums->list); - if (contig) - offset = file_start; - else - offset = 0; /* shut up gcc */ - sums->bytenr = bio->bi_iter.bi_sector << 9; index = 0; shash->tfm = fs_info->csum_shash; bio_for_each_segment(bvec, bio, iter) { - if (!contig) + if (page_offsets) offset = page_offset(bvec.bv_page) + bvec.bv_offset; if (!ordered) { @@ -669,13 +664,14 @@ blk_status_t btrfs_csum_one_bio(struct btrfs_inode *inode, struct bio *bio, BUG_ON(!ordered); /* Logic error */ } - nr_sectors = BTRFS_BYTES_TO_BLKS(fs_info, + blockcount = BTRFS_BYTES_TO_BLKS(fs_info, bvec.bv_len + fs_info->sectorsize - 1); - for (i = 0; i < nr_sectors; i++) { - if (offset >= ordered->file_offset + ordered->num_bytes || - offset < ordered->file_offset) { + for (i = 0; i < blockcount; i++) { + if (!one_ordered && + (offset >= ordered->file_offset + ordered->num_bytes || + offset < ordered->file_offset)) { unsigned long bytes_left; sums->len = this_sum_bytes; @@ -706,7 +702,8 @@ blk_status_t btrfs_csum_one_bio(struct btrfs_inode *inode, struct bio *bio, sums->sums + index); kunmap_atomic(data); index += fs_info->csum_size; - offset += fs_info->sectorsize; + if (!one_ordered) + offset += fs_info->sectorsize; this_sum_bytes += fs_info->sectorsize; total_bytes += 
fs_info->sectorsize; } diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 955d0f5849e3..b96206bb5b52 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -2230,7 +2230,7 @@ int btrfs_bio_fits_in_stripe(struct page *page, size_t size, struct bio *bio, static blk_status_t btrfs_submit_bio_start(struct inode *inode, struct bio *bio, u64 dio_file_offset) { - return btrfs_csum_one_bio(BTRFS_I(inode), bio, 0, 0); + return btrfs_csum_one_bio(BTRFS_I(inode), bio, (u64)-1, false); } bool btrfs_bio_fits_in_ordered_extent(struct page *page, struct bio *bio, @@ -2420,7 +2420,7 @@ blk_status_t btrfs_submit_data_bio(struct inode *inode, struct bio *bio, 0, btrfs_submit_bio_start); goto out; } else if (!skip_sum) { - ret = btrfs_csum_one_bio(BTRFS_I(inode), bio, 0, 0); + ret = btrfs_csum_one_bio(BTRFS_I(inode), bio, (u64)-1, false); if (ret) goto out; } @@ -7999,7 +7999,7 @@ static blk_status_t btrfs_submit_bio_start_direct_io(struct inode *inode, struct bio *bio, u64 dio_file_offset) { - return btrfs_csum_one_bio(BTRFS_I(inode), bio, dio_file_offset, 1); + return btrfs_csum_one_bio(BTRFS_I(inode), bio, dio_file_offset, false); } static void btrfs_end_dio_bio(struct bio *bio) @@ -8058,7 +8058,7 @@ static inline blk_status_t btrfs_submit_dio_bio(struct bio *bio, * If we aren't doing async submit, calculate the csum of the * bio now. 
*/ - ret = btrfs_csum_one_bio(BTRFS_I(inode), bio, file_offset, 1); + ret = btrfs_csum_one_bio(BTRFS_I(inode), bio, file_offset, false); if (ret) goto err; } else {
From patchwork Mon May 17 18:35:23 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Omar Sandoval X-Patchwork-Id: 12262803 From: Omar Sandoval To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org, Al Viro , Christoph Hellwig Cc: Linus Torvalds , Dave Chinner , Jann Horn , Amir Goldstein , Aleksa Sarai , linux-api@vger.kernel.org, kernel-team@fb.com Subject: [PATCH RERESEND v9 5/9] btrfs: add ram_bytes and offset to
btrfs_ordered_extent Date: Mon, 17 May 2021 11:35:23 -0700 Message-Id: X-Mailer: git-send-email 2.31.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Omar Sandoval Currently, we only create ordered extents when ram_bytes == num_bytes and offset == 0. However, RWF_ENCODED writes may create extents which only refer to a subset of the full unencoded extent, so we need to plumb these fields through the ordered extent infrastructure and pass them down to insert_reserved_file_extent(). Since we're changing the btrfs_add_ordered_extent* signature, let's get rid of the trivial wrappers and add a kernel-doc. Reviewed-by: Nikolay Borisov Reviewed-by: Josef Bacik Signed-off-by: Omar Sandoval --- fs/btrfs/inode.c | 56 +++++++++++--------- fs/btrfs/ordered-data.c | 112 +++++++++++----------------------------- fs/btrfs/ordered-data.h | 22 ++++---- 3 files changed, 76 insertions(+), 114 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index b96206bb5b52..894d8bd33288 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -912,12 +912,12 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk) goto out_free_reserve; free_extent_map(em); - ret = btrfs_add_ordered_extent_compress(inode, - async_extent->start, - ins.objectid, - async_extent->ram_size, - ins.offset, - async_extent->compress_type); + ret = btrfs_add_ordered_extent(inode, async_extent->start, + async_extent->ram_size, + async_extent->ram_size, + ins.objectid, ins.offset, 0, + 1 << BTRFS_ORDERED_COMPRESSED, + async_extent->compress_type); if (ret) { btrfs_drop_extent_cache(inode, async_extent->start, async_extent->start + @@ -1122,9 +1122,9 @@ static noinline int cow_file_range(struct btrfs_inode *inode, } free_extent_map(em); - ret = btrfs_add_ordered_extent(inode, start, ins.objectid, - ram_size, cur_alloc_size, - BTRFS_ORDERED_REGULAR); + ret = btrfs_add_ordered_extent(inode, start, ram_size, ram_size, + 
ins.objectid, cur_alloc_size, 0, + 0, BTRFS_COMPRESS_NONE); if (ret) goto out_drop_extent_cache; @@ -1784,10 +1784,11 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode, goto error; } free_extent_map(em); - ret = btrfs_add_ordered_extent(inode, cur_offset, - disk_bytenr, num_bytes, - num_bytes, - BTRFS_ORDERED_PREALLOC); + ret = btrfs_add_ordered_extent(inode, + cur_offset, num_bytes, num_bytes, + disk_bytenr, num_bytes, 0, + 1 << BTRFS_ORDERED_PREALLOC, + BTRFS_COMPRESS_NONE); if (ret) { btrfs_drop_extent_cache(inode, cur_offset, cur_offset + num_bytes - 1, @@ -1796,9 +1797,11 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode, } } else { ret = btrfs_add_ordered_extent(inode, cur_offset, + num_bytes, num_bytes, disk_bytenr, num_bytes, - num_bytes, - BTRFS_ORDERED_NOCOW); + 0, + 1 << BTRFS_ORDERED_NOCOW, + BTRFS_COMPRESS_NONE); if (ret) goto error; } @@ -2724,6 +2727,7 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans, struct btrfs_key ins; u64 disk_num_bytes = btrfs_stack_file_extent_disk_num_bytes(stack_fi); u64 disk_bytenr = btrfs_stack_file_extent_disk_bytenr(stack_fi); + u64 offset = btrfs_stack_file_extent_offset(stack_fi); u64 num_bytes = btrfs_stack_file_extent_num_bytes(stack_fi); u64 ram_bytes = btrfs_stack_file_extent_ram_bytes(stack_fi); struct btrfs_drop_extents_args drop_args = { 0 }; @@ -2798,7 +2802,8 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans, goto out; ret = btrfs_alloc_reserved_file_extent(trans, root, btrfs_ino(inode), - file_pos, qgroup_reserved, &ins); + file_pos - offset, + qgroup_reserved, &ins); out: btrfs_free_path(path); @@ -2824,20 +2829,20 @@ static int insert_ordered_extent_file_extent(struct btrfs_trans_handle *trans, struct btrfs_ordered_extent *oe) { struct btrfs_file_extent_item stack_fi; - u64 logical_len; bool update_inode_bytes; + u64 num_bytes = oe->num_bytes; + u64 ram_bytes = oe->ram_bytes; memset(&stack_fi, 0, sizeof(stack_fi)); 
btrfs_set_stack_file_extent_type(&stack_fi, BTRFS_FILE_EXTENT_REG); btrfs_set_stack_file_extent_disk_bytenr(&stack_fi, oe->disk_bytenr); btrfs_set_stack_file_extent_disk_num_bytes(&stack_fi, oe->disk_num_bytes); + btrfs_set_stack_file_extent_offset(&stack_fi, oe->offset); if (test_bit(BTRFS_ORDERED_TRUNCATED, &oe->flags)) - logical_len = oe->truncated_len; - else - logical_len = oe->num_bytes; - btrfs_set_stack_file_extent_num_bytes(&stack_fi, logical_len); - btrfs_set_stack_file_extent_ram_bytes(&stack_fi, logical_len); + num_bytes = ram_bytes = oe->truncated_len; + btrfs_set_stack_file_extent_num_bytes(&stack_fi, num_bytes); + btrfs_set_stack_file_extent_ram_bytes(&stack_fi, ram_bytes); btrfs_set_stack_file_extent_compression(&stack_fi, oe->compress_type); /* Encryption and other encoding is reserved and all 0 */ @@ -7221,8 +7226,11 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode, if (IS_ERR(em)) goto out; } - ret = btrfs_add_ordered_extent_dio(inode, start, block_start, len, - block_len, type); + ret = btrfs_add_ordered_extent(inode, start, len, len, block_start, + block_len, 0, + (1 << type) | + (1 << BTRFS_ORDERED_DIRECT), + BTRFS_COMPRESS_NONE); if (ret) { if (em) { free_extent_map(em); diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c index 6c413bb451a3..57dc2b90fee8 100644 --- a/fs/btrfs/ordered-data.c +++ b/fs/btrfs/ordered-data.c @@ -142,16 +142,27 @@ static inline struct rb_node *tree_search(struct btrfs_ordered_inode_tree *tree, return ret; } -/* - * Allocate and add a new ordered_extent into the per-inode tree. +/** + * btrfs_add_ordered_extent - Add an ordered extent to the per-inode tree. + * @inode: inode that this extent is for. + * @file_offset: Logical offset in file where the extent starts. + * @num_bytes: Logical length of extent in file. + * @ram_bytes: Full length of unencoded data. + * @disk_bytenr: Offset of extent on disk. + * @disk_num_bytes: Size of extent on disk. 
+ * @offset: Offset into unencoded data where file data starts. + * @flags: Flags specifying type of extent (1 << BTRFS_ORDERED_*). + * @compress_type: Compression algorithm used for data. * - * The tree is given a single reference on the ordered extent that was - * inserted. + * Most of these parameters correspond to &struct btrfs_file_extent_item. The + * tree is given a single reference on the ordered extent that was inserted. + * + * Return: 0 or -ENOMEM. */ -static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset, - u64 disk_bytenr, u64 num_bytes, - u64 disk_num_bytes, int type, int dio, - int compress_type) +int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset, + u64 num_bytes, u64 ram_bytes, u64 disk_bytenr, + u64 disk_num_bytes, u64 offset, int flags, + int compress_type) { struct btrfs_root *root = inode->root; struct btrfs_fs_info *fs_info = root->fs_info; @@ -160,7 +171,8 @@ static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset struct btrfs_ordered_extent *entry; int ret; - if (type == BTRFS_ORDERED_NOCOW || type == BTRFS_ORDERED_PREALLOC) { + if (flags & + ((1 << BTRFS_ORDERED_NOCOW) | (1 << BTRFS_ORDERED_PREALLOC))) { /* For nocow write, we can release the qgroup rsv right now */ ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes); if (ret < 0) @@ -180,9 +192,11 @@ static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset return -ENOMEM; entry->file_offset = file_offset; - entry->disk_bytenr = disk_bytenr; entry->num_bytes = num_bytes; + entry->ram_bytes = ram_bytes; + entry->disk_bytenr = disk_bytenr; entry->disk_num_bytes = disk_num_bytes; + entry->offset = offset; entry->bytes_left = num_bytes; entry->inode = igrab(&inode->vfs_inode); entry->compress_type = compress_type; @@ -192,18 +206,12 @@ static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset entry->disk = NULL; entry->partno = (u8)-1; - ASSERT(type == 
BTRFS_ORDERED_REGULAR || - type == BTRFS_ORDERED_NOCOW || - type == BTRFS_ORDERED_PREALLOC || - type == BTRFS_ORDERED_COMPRESSED); - set_bit(type, &entry->flags); + ASSERT((flags & ~BTRFS_ORDERED_TYPE_FLAGS) == 0); + entry->flags = flags; percpu_counter_add_batch(&fs_info->ordered_bytes, num_bytes, fs_info->delalloc_batch); - if (dio) - set_bit(BTRFS_ORDERED_DIRECT, &entry->flags); - /* one ref for the tree */ refcount_set(&entry->refs, 1); init_waitqueue_head(&entry->wait); @@ -248,41 +256,6 @@ static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset return 0; } -int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset, - u64 disk_bytenr, u64 num_bytes, u64 disk_num_bytes, - int type) -{ - ASSERT(type == BTRFS_ORDERED_REGULAR || - type == BTRFS_ORDERED_NOCOW || - type == BTRFS_ORDERED_PREALLOC); - return __btrfs_add_ordered_extent(inode, file_offset, disk_bytenr, - num_bytes, disk_num_bytes, type, 0, - BTRFS_COMPRESS_NONE); -} - -int btrfs_add_ordered_extent_dio(struct btrfs_inode *inode, u64 file_offset, - u64 disk_bytenr, u64 num_bytes, - u64 disk_num_bytes, int type) -{ - ASSERT(type == BTRFS_ORDERED_REGULAR || - type == BTRFS_ORDERED_NOCOW || - type == BTRFS_ORDERED_PREALLOC); - return __btrfs_add_ordered_extent(inode, file_offset, disk_bytenr, - num_bytes, disk_num_bytes, type, 1, - BTRFS_COMPRESS_NONE); -} - -int btrfs_add_ordered_extent_compress(struct btrfs_inode *inode, u64 file_offset, - u64 disk_bytenr, u64 num_bytes, - u64 disk_num_bytes, int compress_type) -{ - ASSERT(compress_type != BTRFS_COMPRESS_NONE); - return __btrfs_add_ordered_extent(inode, file_offset, disk_bytenr, - num_bytes, disk_num_bytes, - BTRFS_ORDERED_COMPRESSED, 0, - compress_type); -} - /* * Add a struct btrfs_ordered_sum into the list of checksums to be inserted * when an ordered extent is finished. 
If the list covers more than one @@ -919,35 +892,12 @@ static int clone_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pos, struct inode *inode = ordered->inode; u64 file_offset = ordered->file_offset + pos; u64 disk_bytenr = ordered->disk_bytenr + pos; - u64 num_bytes = len; - u64 disk_num_bytes = len; - int type; - unsigned long flags_masked = ordered->flags & ~(1 << BTRFS_ORDERED_DIRECT); - int compress_type = ordered->compress_type; - unsigned long weight; - int ret; + unsigned long flags = ordered->flags & BTRFS_ORDERED_TYPE_FLAGS; - weight = hweight_long(flags_masked); - WARN_ON_ONCE(weight > 1); - if (!weight) - type = 0; - else - type = __ffs(flags_masked); - - if (test_bit(BTRFS_ORDERED_COMPRESSED, &ordered->flags)) { - WARN_ON_ONCE(1); - ret = btrfs_add_ordered_extent_compress(BTRFS_I(inode), - file_offset, disk_bytenr, num_bytes, - disk_num_bytes, compress_type); - } else if (test_bit(BTRFS_ORDERED_DIRECT, &ordered->flags)) { - ret = btrfs_add_ordered_extent_dio(BTRFS_I(inode), file_offset, - disk_bytenr, num_bytes, disk_num_bytes, type); - } else { - ret = btrfs_add_ordered_extent(BTRFS_I(inode), file_offset, - disk_bytenr, num_bytes, disk_num_bytes, type); - } - - return ret; + WARN_ON_ONCE(flags & (1 << BTRFS_ORDERED_COMPRESSED)); + return btrfs_add_ordered_extent(BTRFS_I(inode), file_offset, len, len, + disk_bytenr, len, 0, flags, + ordered->compress_type); } int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre, diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h index e60c07f36427..fe1c50da373c 100644 --- a/fs/btrfs/ordered-data.h +++ b/fs/btrfs/ordered-data.h @@ -76,6 +76,13 @@ enum { BTRFS_ORDERED_PENDING, }; +/* BTRFS_ORDERED_* flags that specify the type of the extent. 
*/ +#define BTRFS_ORDERED_TYPE_FLAGS ((1UL << BTRFS_ORDERED_REGULAR) | \ + (1UL << BTRFS_ORDERED_NOCOW) | \ + (1UL << BTRFS_ORDERED_PREALLOC) | \ + (1UL << BTRFS_ORDERED_COMPRESSED) | \ + (1UL << BTRFS_ORDERED_DIRECT)) + struct btrfs_ordered_extent { /* logical offset in the file */ u64 file_offset; @@ -84,9 +91,11 @@ struct btrfs_ordered_extent { * These fields directly correspond to the same fields in * btrfs_file_extent_item. */ - u64 disk_bytenr; u64 num_bytes; + u64 ram_bytes; + u64 disk_bytenr; u64 disk_num_bytes; + u64 offset; /* number of bytes that still need writing */ u64 bytes_left; @@ -180,14 +189,9 @@ bool btrfs_dec_test_first_ordered_pending(struct btrfs_inode *inode, u64 *file_offset, u64 io_size, int uptodate); int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset, - u64 disk_bytenr, u64 num_bytes, u64 disk_num_bytes, - int type); -int btrfs_add_ordered_extent_dio(struct btrfs_inode *inode, u64 file_offset, - u64 disk_bytenr, u64 num_bytes, - u64 disk_num_bytes, int type); -int btrfs_add_ordered_extent_compress(struct btrfs_inode *inode, u64 file_offset, - u64 disk_bytenr, u64 num_bytes, - u64 disk_num_bytes, int compress_type); + u64 num_bytes, u64 ram_bytes, u64 disk_bytenr, + u64 disk_num_bytes, u64 offset, int flags, + int compress_type); void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry, struct btrfs_ordered_sum *sum); struct btrfs_ordered_extent *btrfs_lookup_ordered_extent(struct btrfs_inode *inode, From patchwork Mon May 17 18:35:24 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Omar Sandoval X-Patchwork-Id: 12262805 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-13.9 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, 
MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,UNWANTED_LANGUAGE_BODY, URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id CB9E1C43462 for ; Mon, 17 May 2021 18:36:01 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id AE31A611EE for ; Mon, 17 May 2021 18:36:01 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S240123AbhEQShR (ORCPT ); Mon, 17 May 2021 14:37:17 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40044 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S240581AbhEQShP (ORCPT ); Mon, 17 May 2021 14:37:15 -0400 Received: from mail-pf1-x436.google.com (mail-pf1-x436.google.com [IPv6:2607:f8b0:4864:20::436]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A2C38C06175F for ; Mon, 17 May 2021 11:35:58 -0700 (PDT) Received: by mail-pf1-x436.google.com with SMTP id s19so3316933pfe.8 for ; Mon, 17 May 2021 11:35:58 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=osandov-com.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=pGd6ZOqutiISEoK7GMZmMSVggIwSSrM5KWFADxrvzP4=; b=k+Zn73Nve87VEch/b5Zm8bYBQfeLvKtO+k2kyxJT7gsnJtl6IBjf/E+Yn9vrTQ/zYd JD9pzJCMkHB9NLKONKQjoqdjduoloDOzn9sgOEThtKO/i4PK3AfYpIb1Yw/YvVh4ueCM OcWOHThrEHiawkchgxDOZu3ztlQj/NLB2CQrbWo7nCwLl+OQLA7bG7eW9+woO8zACwCx SM//pg8Td5QAshyK5/+ZFcY3dj+xgdLMMQNowB0dTJuO77oemZhr+gu04aQTCzAAC/Ry VTrijOgAkFGXYKUcASrZ7gr0XkXx2MBgBkCInCQt4hUAm23CvgU+0n+19YDaVc567qIW 9ClA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; 
 bh=pGd6ZOqutiISEoK7GMZmMSVggIwSSrM5KWFADxrvzP4=;
 b=Z6R7F3KH8KgR072H9wYArqEjHAW3wes+3wSs6SnKmmQgJ8NkD4/a4bQtJ/0Vx1Gp2Y
 FJ32AJQDEBB44u7Z5RbfMdRUsXE/wggOBSRa2eWXKGWNBZNvTJv75BBOXrcX2Q71JQVH
 qQDisEWFbKJrlsgyYDZ7opMMlJvw6Hk3kjE+2a6HL4zR9i3jM2UPqVBZaKJUG+UBm57x
 483q4FqbXdDl+Y6bEtiwqew4bkHIvzA9TWlg9QMLxzZqFH4KXnpIFPIoVLWYU1JkkFGg
 2NRFFLcEY/SFg4QmlUU25Xpo+OKpUtWEFE4SWd5uHhL7XrTnHPHJOhoVserhAgCJnQlj
 vj1w==
X-Gm-Message-State: AOAM531IU97sCDxFwHzy+uXz48jJpbiNCOC5IeRzx+TXYfEPds9v3aHJ
 7p+pgr93HoYi49WuKA0H8VULoFdsdBEqXA==
X-Google-Smtp-Source: ABdhPJzl4LuY6FdGC/AKshsuPUSfYSsYMIqMVZfIUpPkyPXehenG73k67WOVkiZeacTzyDH8cQ5TPQ==
X-Received: by 2002:a65:6156:: with SMTP id o22mr852362pgv.71.1621276557692;
 Mon, 17 May 2021 11:35:57 -0700 (PDT)
Received: from relinquished.tfbnw.net ([2620:10d:c090:400::5:19a9])
 by smtp.gmail.com with ESMTPSA id v15sm5498763pfm.187.2021.05.17.11.35.55
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Mon, 17 May 2021 11:35:56 -0700 (PDT)
From: Omar Sandoval
To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org,
 Al Viro, Christoph Hellwig
Cc: Linus Torvalds, Dave Chinner, Jann Horn, Amir Goldstein, Aleksa Sarai,
 linux-api@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH RERESEND v9 6/9] btrfs: support different disk extent size for delalloc
Date: Mon, 17 May 2021 11:35:24 -0700
Message-Id: <8d021cdbf61270ce6bb318a8286b0bff42f84610.1621276134.git.osandov@fb.com>
X-Mailer: git-send-email 2.31.1
In-Reply-To:
References:
MIME-Version: 1.0
Precedence: bulk
List-ID:
X-Mailing-List: linux-fsdevel@vger.kernel.org

From: Omar Sandoval

Currently, we always reserve the same extent size in the file and extent
size on disk for delalloc because the former is the worst case for the
latter. For RWF_ENCODED writes, we know the exact size of the extent on
disk, which may be less than or greater than (for bookends) the size in
the file.
Add a disk_num_bytes parameter to btrfs_delalloc_reserve_metadata() so that we can reserve the correct amount of csum bytes. No functional change. Reviewed-by: Nikolay Borisov Reviewed-by: Josef Bacik Signed-off-by: Omar Sandoval --- fs/btrfs/ctree.h | 3 ++- fs/btrfs/delalloc-space.c | 18 ++++++++++-------- fs/btrfs/file.c | 3 ++- fs/btrfs/inode.c | 2 +- fs/btrfs/relocation.c | 4 ++-- 5 files changed, 17 insertions(+), 13 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 178ad516eaaa..3a5cf06d9860 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -2784,7 +2784,8 @@ void btrfs_subvolume_release_metadata(struct btrfs_root *root, struct btrfs_block_rsv *rsv); void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes); -int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes); +int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes, + u64 disk_num_bytes); u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo); int btrfs_error_unpin_extent_range(struct btrfs_fs_info *fs_info, u64 start, u64 end); diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c index 56642ca7af10..3af8a477a5cc 100644 --- a/fs/btrfs/delalloc-space.c +++ b/fs/btrfs/delalloc-space.c @@ -267,11 +267,11 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info, } static void calc_inode_reservations(struct btrfs_fs_info *fs_info, - u64 num_bytes, u64 *meta_reserve, - u64 *qgroup_reserve) + u64 num_bytes, u64 disk_num_bytes, + u64 *meta_reserve, u64 *qgroup_reserve) { u64 nr_extents = count_max_extents(num_bytes); - u64 csum_leaves = btrfs_csum_bytes_to_leaves(fs_info, num_bytes); + u64 csum_leaves = btrfs_csum_bytes_to_leaves(fs_info, disk_num_bytes); u64 inode_update = btrfs_calc_metadata_size(fs_info, 1); *meta_reserve = btrfs_calc_insert_metadata_size(fs_info, @@ -285,7 +285,8 @@ static void calc_inode_reservations(struct btrfs_fs_info *fs_info, 
*qgroup_reserve = nr_extents * fs_info->nodesize; } -int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes) +int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes, + u64 disk_num_bytes) { struct btrfs_root *root = inode->root; struct btrfs_fs_info *fs_info = root->fs_info; @@ -315,6 +316,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes) } num_bytes = ALIGN(num_bytes, fs_info->sectorsize); + disk_num_bytes = ALIGN(disk_num_bytes, fs_info->sectorsize); /* * We always want to do it this way, every other way is wrong and ends @@ -326,8 +328,8 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes) * everything out and try again, which is bad. This way we just * over-reserve slightly, and clean up the mess when we are done. */ - calc_inode_reservations(fs_info, num_bytes, &meta_reserve, - &qgroup_reserve); + calc_inode_reservations(fs_info, num_bytes, disk_num_bytes, + &meta_reserve, &qgroup_reserve); ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_reserve, true); if (ret) return ret; @@ -346,7 +348,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes) spin_lock(&inode->lock); nr_extents = count_max_extents(num_bytes); btrfs_mod_outstanding_extents(inode, nr_extents); - inode->csum_bytes += num_bytes; + inode->csum_bytes += disk_num_bytes; btrfs_calculate_inode_block_rsv_size(fs_info, inode); spin_unlock(&inode->lock); @@ -451,7 +453,7 @@ int btrfs_delalloc_reserve_space(struct btrfs_inode *inode, ret = btrfs_check_data_free_space(inode, reserved, start, len); if (ret < 0) return ret; - ret = btrfs_delalloc_reserve_metadata(inode, len); + ret = btrfs_delalloc_reserve_metadata(inode, len, len); if (ret < 0) btrfs_free_reserved_data_space(inode, *reserved, start, len); return ret; diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 3b10d98b4ebb..a97eff337570 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1727,7 +1727,8 @@ 
static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb, fs_info->sectorsize); WARN_ON(reserve_bytes == 0); ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), - reserve_bytes); + reserve_bytes, + reserve_bytes); if (ret) { if (!only_release_metadata) btrfs_free_reserved_data_space(BTRFS_I(inode), diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 894d8bd33288..d7bdbdc13826 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -4870,7 +4870,7 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len, goto out; } } - ret = btrfs_delalloc_reserve_metadata(inode, blocksize); + ret = btrfs_delalloc_reserve_metadata(inode, blocksize, blocksize); if (ret < 0) { if (!only_release_metadata) btrfs_free_reserved_data_space(inode, data_reserved, diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c index b70be2ac2e9e..414e362cb020 100644 --- a/fs/btrfs/relocation.c +++ b/fs/btrfs/relocation.c @@ -2921,8 +2921,8 @@ static int relocate_file_extent_cluster(struct inode *inode, index = (cluster->start - offset) >> PAGE_SHIFT; last_index = (cluster->end - offset) >> PAGE_SHIFT; while (index <= last_index) { - ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), - PAGE_SIZE); + ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), PAGE_SIZE, + PAGE_SIZE); if (ret) goto out; From patchwork Mon May 17 18:35:25 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Omar Sandoval X-Patchwork-Id: 12262807 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by 
smtp.lore.kernel.org (Postfix) with ESMTP id AA1C4C43461 for ; Mon, 17 May 2021 18:36:03 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 8B694611CC for ; Mon, 17 May 2021 18:36:03 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S240656AbhEQShS (ORCPT ); Mon, 17 May 2021 14:37:18 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40054 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S240787AbhEQShR (ORCPT ); Mon, 17 May 2021 14:37:17 -0400 Received: from mail-pg1-x531.google.com (mail-pg1-x531.google.com [IPv6:2607:f8b0:4864:20::531]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D025EC061573 for ; Mon, 17 May 2021 11:36:00 -0700 (PDT) Received: by mail-pg1-x531.google.com with SMTP id 6so5244303pgk.5 for ; Mon, 17 May 2021 11:36:00 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=osandov-com.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=PO8XOaX4zXCwvpyXZ0fHL1sw/mqAVbmWN23SOMkLe2Y=; b=kI++/2Z3AuDpZHYbhfRzPOMJTicXnsiBFXYnKN52LupSMiVZCxbYsuawNR9aaAcKLB oQlxYLwmho73QxvsbUmpVrJ0mrG1agS3Ujzct1c2DfuesYac2HwIbJLz5WUvCvvQ98Ry D0djHzHuqljF07FTjDcxbZZT2btMnY/hw/0fSo/SxPyts6j/jAIu9YOPmgo1co7q3MCk vfY3tkZPz1YxHbRTmTHoUZ2Lxdhejp3rNAt2ALaHMPN6wYAwji/fAjhYYbI3yv65yo1l 9Q2aUbUJLp803xLiaQQBvAyTGCaKp2FZcleyHr2KwtT2ijzKl5GpMitv2bN0y1RyzkA9 9m+A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=PO8XOaX4zXCwvpyXZ0fHL1sw/mqAVbmWN23SOMkLe2Y=; b=gXW5tKc0td7ebBucoEt4Arnw8d4i5vSuczlYYNTCxqLD9Husp7orv6MDFjaAwC2MIq BNFFznW7XnyYivW4XG3JZWjhZn1qOJapBl200+6yJuCXM/GnvdGApMAH3YB2l5CSKnz1 Tqtux4O29uqTLgiE+3mBmL6FUPQcfY2XmopMD+3m1hE7NYrXEfnor5uMeU7NXwNyVKIN 
 cbSj2uZsKECx5+hqX9OkmPppnhuFpMHBR7+EfLKQpnhWNetpBnjvieiXlGnxKBwt9zho
 0N+yzmS76bDvWM30jWTPwjkcK4D7s8hJz/EGHfjY7dQAkzCErYAoTjfPrj+AIh98NIMu
 ul/w==
X-Gm-Message-State: AOAM532vkXTVTVEFcPeFVSFG9YJFTAL6qStQq5GjsteLY3ck4zh6J6yP
 /b+DGjkEcGvu1qMzFVT6GiSbrxsA899nOw==
X-Google-Smtp-Source: ABdhPJwT0bCe8hO9VAEpPHdtOZh3JeYcfulD4CNYBM1kee7u3zq0FMM4wKll/hFI9cqE/pQT6gFTYQ==
X-Received: by 2002:a63:58f:: with SMTP id 137mr859057pgf.241.1621276559658;
 Mon, 17 May 2021 11:35:59 -0700 (PDT)
Received: from relinquished.tfbnw.net ([2620:10d:c090:400::5:19a9])
 by smtp.gmail.com with ESMTPSA id v15sm5498763pfm.187.2021.05.17.11.35.57
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Mon, 17 May 2021 11:35:58 -0700 (PDT)
From: Omar Sandoval
To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org,
 Al Viro, Christoph Hellwig
Cc: Linus Torvalds, Dave Chinner, Jann Horn, Amir Goldstein, Aleksa Sarai,
 linux-api@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH RERESEND v9 7/9] btrfs: optionally extend i_size in cow_file_range_inline()
Date: Mon, 17 May 2021 11:35:25 -0700
Message-Id:
X-Mailer: git-send-email 2.31.1
In-Reply-To:
References:
MIME-Version: 1.0
Precedence: bulk
List-ID:
X-Mailing-List: linux-fsdevel@vger.kernel.org

From: Omar Sandoval

Currently, an inline extent is always created after i_size is extended
from btrfs_dirty_pages(). However, for encoded writes, we only want to
update i_size after we successfully created the inline extent. Add an
update_i_size parameter to cow_file_range_inline() and
insert_inline_extent() and pass in the size of the extent rather than
determining it from i_size. Since the start parameter is always passed
as 0, get rid of it and simplify the logic in these two functions. While
we're here, let's document the requirements for creating an inline
extent.
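
For illustration only, the requirements and the i_size handling described
above can be sketched in plain userspace C. This is not the kernel code:
struct inline_limits, can_create_inline_extent() and
disk_i_size_after_insert() are hypothetical names, and the three limit
fields merely stand in for the sector size, the leaf capacity for inline
data, and the configured maximum inline size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the filesystem limits the check consults. */
struct inline_limits {
	uint64_t sectorsize;      /* one sector, e.g. 4096 */
	uint64_t leaf_max_inline; /* most inline data a leaf can hold */
	uint64_t max_inline;      /* configured maximum inline size */
};

/*
 * An inline extent may be created if it ends at or beyond the current
 * i_size, is no larger than a sector when decompressed, and its
 * (possibly compressed) data fits both in a leaf and under the
 * configured maximum inline size.
 */
static bool can_create_inline_extent(const struct inline_limits *lim,
				     uint64_t i_size, uint64_t size,
				     uint64_t compressed_size)
{
	/* Compressed data, when present, is what gets stored in the leaf. */
	uint64_t data_len = compressed_size ? compressed_size : size;

	assert(lim);
	if (size < i_size)
		return false;
	if (size > lim->sectorsize)
		return false;
	if (data_len > lim->leaf_max_inline)
		return false;
	if (data_len > lim->max_inline)
		return false;
	return true;
}

/*
 * After the extent is inserted, i_size only grows, and only when the
 * caller asked for it. The returned value is what would be recorded as
 * the on-disk i_size.
 */
static uint64_t disk_i_size_after_insert(uint64_t i_size, uint64_t size,
					 bool update_i_size)
{
	if (update_i_size && size > i_size)
		i_size = size;
	return i_size;
}
```

In this sketch, existing callers would pass update_i_size = false and see
no change in behavior, while an encoded write would pass true so that
i_size is extended only once the inline extent exists.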
Reviewed-by: Josef Bacik Signed-off-by: Omar Sandoval --- fs/btrfs/inode.c | 100 +++++++++++++++++++++++------------------------ 1 file changed, 48 insertions(+), 52 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index d7bdbdc13826..40fe9602f622 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -209,9 +209,10 @@ static int btrfs_init_inode_security(struct btrfs_trans_handle *trans, static int insert_inline_extent(struct btrfs_trans_handle *trans, struct btrfs_path *path, bool extent_inserted, struct btrfs_root *root, struct inode *inode, - u64 start, size_t size, size_t compressed_size, + size_t size, size_t compressed_size, int compress_type, - struct page **compressed_pages) + struct page **compressed_pages, + bool update_i_size) { struct extent_buffer *leaf; struct page *page = NULL; @@ -220,7 +221,7 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans, struct btrfs_file_extent_item *ei; int ret; size_t cur_size = size; - unsigned long offset; + u64 i_size; ASSERT((compressed_size > 0 && compressed_pages) || (compressed_size == 0 && !compressed_pages)); @@ -233,7 +234,7 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans, size_t datasize; key.objectid = btrfs_ino(BTRFS_I(inode)); - key.offset = start; + key.offset = 0; key.type = BTRFS_EXTENT_DATA_KEY; datasize = btrfs_file_extent_calc_inline_size(cur_size); @@ -271,12 +272,10 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans, btrfs_set_file_extent_compression(leaf, ei, compress_type); } else { - page = find_get_page(inode->i_mapping, - start >> PAGE_SHIFT); + page = find_get_page(inode->i_mapping, 0); btrfs_set_file_extent_compression(leaf, ei, 0); kaddr = kmap_atomic(page); - offset = offset_in_page(start); - write_extent_buffer(leaf, kaddr + offset, ptr, size); + write_extent_buffer(leaf, kaddr, ptr, size); kunmap_atomic(kaddr); put_page(page); } @@ -287,8 +286,8 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans, * We 
align size to sectorsize for inline extents just for simplicity * sake. */ - size = ALIGN(size, root->fs_info->sectorsize); - ret = btrfs_inode_set_file_extent_range(BTRFS_I(inode), start, size); + ret = btrfs_inode_set_file_extent_range(BTRFS_I(inode), 0, + ALIGN(size, root->fs_info->sectorsize)); if (ret) goto fail; @@ -301,7 +300,13 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans, * before we unlock the pages. Otherwise we * could end up racing with unlink. */ - BTRFS_I(inode)->disk_i_size = inode->i_size; + i_size = i_size_read(inode); + if (update_i_size && size > i_size) { + i_size_write(inode, size); + i_size = size; + } + BTRFS_I(inode)->disk_i_size = i_size; + fail: return ret; } @@ -312,35 +317,31 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans, * does the checks required to make sure the data is small enough * to fit as an inline extent. */ -static noinline int cow_file_range_inline(struct btrfs_inode *inode, u64 start, - u64 end, size_t compressed_size, +static noinline int cow_file_range_inline(struct btrfs_inode *inode, u64 size, + size_t compressed_size, int compress_type, - struct page **compressed_pages) + struct page **compressed_pages, + bool update_i_size) { struct btrfs_drop_extents_args drop_args = { 0 }; struct btrfs_root *root = inode->root; struct btrfs_fs_info *fs_info = root->fs_info; struct btrfs_trans_handle *trans; - u64 isize = i_size_read(&inode->vfs_inode); - u64 actual_end = min(end + 1, isize); - u64 inline_len = actual_end - start; - u64 aligned_end = ALIGN(end, fs_info->sectorsize); - u64 data_len = inline_len; + u64 data_len = compressed_size ? 
compressed_size : size; int ret; struct btrfs_path *path; - if (compressed_size) - data_len = compressed_size; - - if (start > 0 || - actual_end > fs_info->sectorsize || + /* + * We can create an inline extent if it ends at or beyond the current + * i_size, is no larger than a sector (decompressed), and the (possibly + * compressed) data fits in a leaf and the configured maximum inline + * size. + */ + if (size < i_size_read(&inode->vfs_inode) || + size > fs_info->sectorsize || data_len > BTRFS_MAX_INLINE_DATA_SIZE(fs_info) || - (!compressed_size && - (actual_end & (fs_info->sectorsize - 1)) == 0) || - end + 1 < isize || - data_len > fs_info->max_inline) { + data_len > fs_info->max_inline) return 1; - } path = btrfs_alloc_path(); if (!path) @@ -354,30 +355,21 @@ static noinline int cow_file_range_inline(struct btrfs_inode *inode, u64 start, trans->block_rsv = &inode->block_rsv; drop_args.path = path; - drop_args.start = start; - drop_args.end = aligned_end; + drop_args.start = 0; + drop_args.end = fs_info->sectorsize; drop_args.drop_cache = true; drop_args.replace_extent = true; - - if (compressed_size && compressed_pages) - drop_args.extent_item_size = btrfs_file_extent_calc_inline_size( - compressed_size); - else - drop_args.extent_item_size = btrfs_file_extent_calc_inline_size( - inline_len); - + drop_args.extent_item_size = btrfs_file_extent_calc_inline_size(data_len); ret = btrfs_drop_extents(trans, root, inode, &drop_args); if (ret) { btrfs_abort_transaction(trans, ret); goto out; } - if (isize > actual_end) - inline_len = min_t(u64, isize, actual_end); - ret = insert_inline_extent(trans, path, drop_args.extent_inserted, - root, &inode->vfs_inode, start, - inline_len, compressed_size, - compress_type, compressed_pages); + ret = insert_inline_extent(trans, path, drop_args.extent_inserted, root, + &inode->vfs_inode, size, compressed_size, + compress_type, compressed_pages, + update_i_size); if (ret && ret != -ENOSPC) { btrfs_abort_transaction(trans, ret); goto 
out; @@ -386,7 +378,7 @@ static noinline int cow_file_range_inline(struct btrfs_inode *inode, u64 start, goto out; } - btrfs_update_inode_bytes(inode, inline_len, drop_args.bytes_found); + btrfs_update_inode_bytes(inode, size, drop_args.bytes_found); ret = btrfs_update_inode(trans, root, inode); if (ret && ret != -ENOSPC) { btrfs_abort_transaction(trans, ret); @@ -662,14 +654,15 @@ static noinline int compress_file_range(struct async_chunk *async_chunk) /* we didn't compress the entire range, try * to make an uncompressed inline extent. */ - ret = cow_file_range_inline(BTRFS_I(inode), start, end, + ret = cow_file_range_inline(BTRFS_I(inode), actual_end, 0, BTRFS_COMPRESS_NONE, - NULL); + NULL, false); } else { /* try making a compressed inline extent */ - ret = cow_file_range_inline(BTRFS_I(inode), start, end, + ret = cow_file_range_inline(BTRFS_I(inode), actual_end, total_compressed, - compress_type, pages); + compress_type, pages, + false); } if (ret <= 0) { unsigned long clear_flags = EXTENT_DELALLOC | @@ -1054,9 +1047,12 @@ static noinline int cow_file_range(struct btrfs_inode *inode, inode_should_defrag(inode, start, end, num_bytes, SZ_64K); if (start == 0) { + u64 actual_end = min_t(u64, i_size_read(&inode->vfs_inode), + end + 1); + /* lets try to make an inline extent */ - ret = cow_file_range_inline(inode, start, end, 0, - BTRFS_COMPRESS_NONE, NULL); + ret = cow_file_range_inline(inode, actual_end, 0, + BTRFS_COMPRESS_NONE, NULL, false); if (ret == 0) { /* * We use DO_ACCOUNTING here because we need the From patchwork Mon May 17 18:35:26 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Omar Sandoval X-Patchwork-Id: 12262809 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, 
DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1F09DC43461 for ; Mon, 17 May 2021 18:36:07 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 07309611CC for ; Mon, 17 May 2021 18:36:07 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S241210AbhEQShW (ORCPT ); Mon, 17 May 2021 14:37:22 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40076 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S241004AbhEQShU (ORCPT ); Mon, 17 May 2021 14:37:20 -0400 Received: from mail-pj1-x102d.google.com (mail-pj1-x102d.google.com [IPv6:2607:f8b0:4864:20::102d]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 66DA9C06175F for ; Mon, 17 May 2021 11:36:03 -0700 (PDT) Received: by mail-pj1-x102d.google.com with SMTP id n6-20020a17090ac686b029015d2f7aeea8so119469pjt.1 for ; Mon, 17 May 2021 11:36:03 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=osandov-com.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=wZqB3J/+YXyyasSgOhF1p98f+WcWqppeBYm04HkW81Q=; b=TfnZZDpPWxKbrB5OPnU6mmOENbJ71zu7g4MXEIVpp6CzdQxIaEbp+V8BTSYt/0XL7k y2O6ToSrwIo3DDEoRsEGBEYnUqY97SREQMXKUfh+tqeLoppXFe+tz6Yr56MZBMScjMB1 95bwOBN2/GcgPzFhoTNGQLLFOmNVzkSEqNEb0htmc4I1sr0Q1M9I5ocW+kN/+7ffo5bN 0EGyt+Yp1V26JxfmFMyOv4+AOUNglz+ZjPnF3juP6pSa4m4rpospCh2+KzmzT2SsfwAP uxq/IvvkLHsW+Gpmz8Ps20Nqjy992udIDUJ4Mj2g4G06aPRm3UEIEdSndNksO0ps1Crl 0AAg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to 
 :references:mime-version:content-transfer-encoding;
 bh=wZqB3J/+YXyyasSgOhF1p98f+WcWqppeBYm04HkW81Q=;
 b=cusxv+Wuc1Dj7Iqw1+uc38reWZQ2WARkADiJyqQmABfUN3CpNCC7Xm0CmYZsNofwfL
 Ktxl9MI2sN5/GsGfiHv85BSaLGClxWsvn5WNLVzG1zXLL1zZNnkamcs01pXVxlMAZrkO
 UzZ9XfU6otAL6Iml0ykUdeMx0fUWPSASrq6pIWXLSRTBg9Tu5PYXrjAg3a3w4uLe7NG2
 rhtODhXZ063j9eAAaGYPY+bIzj7TykS6fffiYHrMPsf2Z69MggAdlFNgdRdLG7K++jcR
 aeItOq+SJi7PBV/pl3BtkA4h11IiwzIR0VmRxzOGqEQe1TVB3mWioXo2CPU3WEc5VQxi
 inQg==
X-Gm-Message-State: AOAM532g5/bJo08JVYzApr0S9eM+ccnNO1mSK16ax708oy2sZVwqWLJi
 ebGi3+4kAh+aocOO39HUU/gTnB2cncxl9g==
X-Google-Smtp-Source: ABdhPJyx/EStVSNGFqWamSVJY0lqxW/zcTwkKdP1R7R/b6AXHa1Y65b3wTEllkryMQsyYmsbPrPWHQ==
X-Received: by 2002:a17:902:8d83:b029:ef:9dd8:4d9 with SMTP id
 v3-20020a1709028d83b02900ef9dd804d9mr1458459plo.40.1621276562137;
 Mon, 17 May 2021 11:36:02 -0700 (PDT)
Received: from relinquished.tfbnw.net ([2620:10d:c090:400::5:19a9])
 by smtp.gmail.com with ESMTPSA id v15sm5498763pfm.187.2021.05.17.11.35.59
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Mon, 17 May 2021 11:36:01 -0700 (PDT)
From: Omar Sandoval
To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org,
 Al Viro, Christoph Hellwig
Cc: Linus Torvalds, Dave Chinner, Jann Horn, Amir Goldstein, Aleksa Sarai,
 linux-api@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH RERESEND v9 8/9] btrfs: implement RWF_ENCODED reads
Date: Mon, 17 May 2021 11:35:26 -0700
Message-Id: <33e1ff65aca4007732de2adf9bd8273aeb150263.1621276134.git.osandov@fb.com>
X-Mailer: git-send-email 2.31.1
In-Reply-To:
References:
MIME-Version: 1.0
Precedence: bulk
List-ID:
X-Mailing-List: linux-fsdevel@vger.kernel.org

From: Omar Sandoval

There are 4 main cases:

1. Inline extents: we copy the data straight out of the extent buffer.
2. Hole/preallocated extents: we fill in zeroes.
3. Regular, uncompressed extents: we read the sectors we need directly
   from disk.
4. Regular, compressed extents: we read the entire compressed extent
   from disk and indicate what subset of the decompressed extent is in
   the file.

This initial implementation simplifies a few things that can be improved
in the future:

- We hold the inode lock during the operation.
- Cases 1, 3, and 4 allocate temporary memory to read into before
  copying out to userspace.
- We don't do read repair, because it turns out that read repair is
  currently broken for compressed data.

Reviewed-by: Josef Bacik
Signed-off-by: Omar Sandoval
---
 fs/btrfs/ctree.h |   2 +
 fs/btrfs/file.c  |   5 +
 fs/btrfs/inode.c | 504 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 511 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 3a5cf06d9860..00612695d57a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3190,6 +3190,8 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct page *locked_page
 int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end);
 void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start,
 					  u64 end, int uptodate);
+ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter);
+
 extern const struct dentry_operations btrfs_dentry_operations;
 extern const struct iomap_ops btrfs_dio_iomap_ops;
 extern const struct iomap_dio_ops btrfs_dio_ops;
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index a97eff337570..2e6b47f866b7 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3636,6 +3636,11 @@ static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
 	ssize_t ret = 0;
 
+	if (iocb->ki_flags & IOCB_ENCODED) {
+		if (iocb->ki_flags & IOCB_NOWAIT)
+			return -EOPNOTSUPP;
+		return btrfs_encoded_read(iocb, to);
+	}
 	if (iocb->ki_flags & IOCB_DIRECT) {
 		ret = btrfs_direct_read(iocb, to);
 		if (ret < 0 || !iov_iter_count(to) ||
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 40fe9602f622..fc4a288257a5 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6,6 +6,7 @@
 #include
 #include
#include +#include #include #include #include @@ -10208,6 +10209,509 @@ void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end) } } +static int encoded_iov_compression_from_btrfs(unsigned int compress_type) +{ + switch (compress_type) { + case BTRFS_COMPRESS_NONE: + return ENCODED_IOV_COMPRESSION_NONE; + case BTRFS_COMPRESS_ZLIB: + return ENCODED_IOV_COMPRESSION_BTRFS_ZLIB; + case BTRFS_COMPRESS_LZO: + /* + * The LZO format depends on the page size. 64k is the maximum + * sectorsize (and thus page size) that we support. + */ + if (PAGE_SIZE < SZ_4K || PAGE_SIZE > SZ_64K) + return -EINVAL; + return ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K + (PAGE_SHIFT - 12); + case BTRFS_COMPRESS_ZSTD: + return ENCODED_IOV_COMPRESSION_BTRFS_ZSTD; + default: + return -EUCLEAN; + } +} + +static ssize_t btrfs_encoded_read_inline(struct kiocb *iocb, + struct iov_iter *iter, u64 start, + u64 lockend, + struct extent_state **cached_state, + u64 extent_start, size_t count, + struct encoded_iov *encoded, + bool *unlocked) +{ + struct inode *inode = file_inode(iocb->ki_filp); + struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree; + struct btrfs_path *path; + struct extent_buffer *leaf; + struct btrfs_file_extent_item *item; + u64 ram_bytes; + unsigned long ptr; + void *tmp; + ssize_t ret; + + path = btrfs_alloc_path(); + if (!path) { + ret = -ENOMEM; + goto out; + } + ret = btrfs_lookup_file_extent(NULL, BTRFS_I(inode)->root, path, + btrfs_ino(BTRFS_I(inode)), extent_start, + 0); + if (ret) { + if (ret > 0) { + /* The extent item disappeared? 
*/ + ret = -EIO; + } + goto out; + } + leaf = path->nodes[0]; + item = btrfs_item_ptr(leaf, path->slots[0], + struct btrfs_file_extent_item); + + ram_bytes = btrfs_file_extent_ram_bytes(leaf, item); + ptr = btrfs_file_extent_inline_start(item); + + encoded->len = (min_t(u64, extent_start + ram_bytes, inode->i_size) - + iocb->ki_pos); + ret = encoded_iov_compression_from_btrfs( + btrfs_file_extent_compression(leaf, item)); + if (ret < 0) + goto out; + encoded->compression = ret; + if (encoded->compression) { + size_t inline_size; + + inline_size = btrfs_file_extent_inline_item_len(leaf, + btrfs_item_nr(path->slots[0])); + if (inline_size > count) { + ret = -ENOBUFS; + goto out; + } + count = inline_size; + encoded->unencoded_len = ram_bytes; + encoded->unencoded_offset = iocb->ki_pos - extent_start; + } else { + encoded->len = encoded->unencoded_len = count = + min_t(u64, count, encoded->len); + ptr += iocb->ki_pos - extent_start; + } + + tmp = kmalloc(count, GFP_NOFS); + if (!tmp) { + ret = -ENOMEM; + goto out; + } + read_extent_buffer(leaf, tmp, ptr, count); + btrfs_release_path(path); + unlock_extent_cached(io_tree, start, lockend, cached_state); + inode_unlock_shared(inode); + *unlocked = true; + + ret = copy_encoded_iov_to_iter(encoded, iter); + if (ret) + goto out_free; + ret = copy_to_iter(tmp, count, iter); + if (ret != count) + ret = -EFAULT; +out_free: + kfree(tmp); +out: + btrfs_free_path(path); + return ret; +} + +struct btrfs_encoded_read_private { + struct inode *inode; + wait_queue_head_t wait; + atomic_t pending; + blk_status_t status; + bool skip_csum; +}; + +static blk_status_t submit_encoded_read_bio(struct inode *inode, + struct bio *bio, int mirror_num, + unsigned long bio_flags) +{ + struct btrfs_encoded_read_private *priv = bio->bi_private; + struct btrfs_io_bio *io_bio = btrfs_io_bio(bio); + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + blk_status_t ret; + + if (!priv->skip_csum) { + ret = btrfs_lookup_bio_sums(inode, bio, NULL); + 
if (ret) + return ret; + } + + ret = btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA); + if (ret) { + btrfs_io_bio_free_csum(io_bio); + return ret; + } + + atomic_inc(&priv->pending); + ret = btrfs_map_bio(fs_info, bio, mirror_num); + if (ret) { + atomic_dec(&priv->pending); + btrfs_io_bio_free_csum(io_bio); + } + return ret; +} + +static blk_status_t btrfs_encoded_read_check_bio(struct btrfs_io_bio *io_bio) +{ + const bool uptodate = io_bio->bio.bi_status == BLK_STS_OK; + struct btrfs_encoded_read_private *priv = io_bio->bio.bi_private; + struct inode *inode = priv->inode; + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + u32 sectorsize = fs_info->sectorsize; + struct bio_vec *bvec; + struct bvec_iter_all iter_all; + u64 start = io_bio->logical; + u32 bio_offset = 0; + + if (priv->skip_csum || !uptodate) + return io_bio->bio.bi_status; + + bio_for_each_segment_all(bvec, &io_bio->bio, iter_all) { + unsigned int i, nr_sectors, pgoff; + + nr_sectors = BTRFS_BYTES_TO_BLKS(fs_info, bvec->bv_len); + pgoff = bvec->bv_offset; + for (i = 0; i < nr_sectors; i++) { + ASSERT(pgoff < PAGE_SIZE); + if (check_data_csum(inode, io_bio, bio_offset, + bvec->bv_page, pgoff, start)) + return BLK_STS_IOERR; + start += sectorsize; + bio_offset += sectorsize; + pgoff += sectorsize; + } + } + return BLK_STS_OK; +} + +static void btrfs_encoded_read_endio(struct bio *bio) +{ + struct btrfs_encoded_read_private *priv = bio->bi_private; + struct btrfs_io_bio *io_bio = btrfs_io_bio(bio); + blk_status_t status; + + status = btrfs_encoded_read_check_bio(io_bio); + if (status) { + /* + * The memory barrier implied by the atomic_dec_return() here + * pairs with the memory barrier implied by the + * atomic_dec_return() or io_wait_event() in + * btrfs_encoded_read_regular_fill_pages() to ensure that this + * write is observed before the load of status in + * btrfs_encoded_read_regular_fill_pages(). 
+ */ + WRITE_ONCE(priv->status, status); + } + if (!atomic_dec_return(&priv->pending)) + wake_up(&priv->wait); + btrfs_io_bio_free_csum(io_bio); + bio_put(bio); +} + +static int btrfs_encoded_read_regular_fill_pages(struct inode *inode, u64 offset, + u64 disk_io_size, struct page **pages) +{ + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + struct btrfs_encoded_read_private priv = { + .inode = inode, + .pending = ATOMIC_INIT(1), + .skip_csum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM, + }; + unsigned long i = 0; + u64 cur = 0; + int ret; + + init_waitqueue_head(&priv.wait); + /* + * Submit bios for the extent, splitting due to bio or stripe limits as + * necessary. + */ + while (cur < disk_io_size) { + struct extent_map *em; + struct btrfs_io_geometry geom; + struct bio *bio = NULL; + u64 remaining; + + em = btrfs_get_chunk_map(fs_info, offset + cur, + disk_io_size - cur); + if (IS_ERR(em)) { + ret = PTR_ERR(em); + } else { + ret = btrfs_get_io_geometry(fs_info, em, BTRFS_MAP_READ, + offset + cur, + disk_io_size - cur, &geom); + } + if (ret) { + WRITE_ONCE(priv.status, errno_to_blk_status(ret)); + break; + } + remaining = min(geom.len, disk_io_size - cur); + while (bio || remaining) { + size_t bytes = min_t(u64, remaining, PAGE_SIZE); + + if (!bio) { + bio = btrfs_bio_alloc(offset + cur); + bio->bi_end_io = btrfs_encoded_read_endio; + bio->bi_private = &priv; + bio->bi_opf = REQ_OP_READ; + } + + if (!bytes || + bio_add_page(bio, pages[i], bytes, 0) < bytes) { + blk_status_t status; + + status = submit_encoded_read_bio(inode, bio, 0, + 0); + if (status) { + WRITE_ONCE(priv.status, status); + bio_put(bio); + goto out; + } + bio = NULL; + continue; + } + + i++; + cur += bytes; + remaining -= bytes; + } + } + +out: + if (atomic_dec_return(&priv.pending)) + io_wait_event(priv.wait, !atomic_read(&priv.pending)); + /* See btrfs_encoded_read_endio() for ordering. 
*/ + return blk_status_to_errno(READ_ONCE(priv.status)); +} + +static ssize_t btrfs_encoded_read_regular(struct kiocb *iocb, + struct iov_iter *iter, + u64 start, u64 lockend, + struct extent_state **cached_state, + u64 offset, u64 disk_io_size, + size_t count, + const struct encoded_iov *encoded, + bool *unlocked) +{ + struct inode *inode = file_inode(iocb->ki_filp); + struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree; + struct page **pages; + unsigned long nr_pages, i; + u64 cur; + size_t page_offset; + ssize_t ret; + + nr_pages = DIV_ROUND_UP(disk_io_size, PAGE_SIZE); + pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS); + if (!pages) + return -ENOMEM; + for (i = 0; i < nr_pages; i++) { + pages[i] = alloc_page(GFP_NOFS | __GFP_HIGHMEM); + if (!pages[i]) { + ret = -ENOMEM; + goto out; + } + } + + ret = btrfs_encoded_read_regular_fill_pages(inode, offset, disk_io_size, + pages); + if (ret) + goto out; + + unlock_extent_cached(io_tree, start, lockend, cached_state); + inode_unlock_shared(inode); + *unlocked = true; + + ret = copy_encoded_iov_to_iter(encoded, iter); + if (ret) + goto out; + if (encoded->compression) { + i = 0; + page_offset = 0; + } else { + i = (iocb->ki_pos - start) >> PAGE_SHIFT; + page_offset = (iocb->ki_pos - start) & (PAGE_SIZE - 1); + } + cur = 0; + while (cur < count) { + size_t bytes = min_t(size_t, count - cur, + PAGE_SIZE - page_offset); + + if (copy_page_to_iter(pages[i], page_offset, bytes, + iter) != bytes) { + ret = -EFAULT; + goto out; + } + i++; + cur += bytes; + page_offset = 0; + } + ret = count; +out: + for (i = 0; i < nr_pages; i++) { + if (pages[i]) + __free_page(pages[i]); + } + kfree(pages); + return ret; +} + +ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter) +{ + struct inode *inode = file_inode(iocb->ki_filp); + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree; + ssize_t ret; + size_t count; + u64 start, lockend, offset, 
disk_io_size; + struct extent_state *cached_state = NULL; + struct extent_map *em; + struct encoded_iov encoded = {}; + bool unlocked = false; + + ret = generic_encoded_read_checks(iocb, iter); + if (ret < 0) + return ret; + if (ret == 0) + return copy_encoded_iov_to_iter(&encoded, iter); + count = ret; + + file_accessed(iocb->ki_filp); + + inode_lock_shared(inode); + + if (iocb->ki_pos >= inode->i_size) { + inode_unlock_shared(inode); + return copy_encoded_iov_to_iter(&encoded, iter); + } + start = ALIGN_DOWN(iocb->ki_pos, fs_info->sectorsize); + /* + * We don't know how long the extent containing iocb->ki_pos is, but if + * it's compressed we know that it won't be longer than this. + */ + lockend = start + BTRFS_MAX_UNCOMPRESSED - 1; + + for (;;) { + struct btrfs_ordered_extent *ordered; + + ret = btrfs_wait_ordered_range(inode, start, + lockend - start + 1); + if (ret) + goto out_unlock_inode; + lock_extent_bits(io_tree, start, lockend, &cached_state); + ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start, + lockend - start + 1); + if (!ordered) + break; + btrfs_put_ordered_extent(ordered); + unlock_extent_cached(io_tree, start, lockend, &cached_state); + cond_resched(); + } + + em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start, + lockend - start + 1); + if (IS_ERR(em)) { + ret = PTR_ERR(em); + goto out_unlock_extent; + } + + if (em->block_start == EXTENT_MAP_INLINE) { + u64 extent_start = em->start; + + /* + * For inline extents we get everything we need out of the + * extent item. + */ + free_extent_map(em); + em = NULL; + ret = btrfs_encoded_read_inline(iocb, iter, start, lockend, + &cached_state, extent_start, + count, &encoded, &unlocked); + goto out; + } + + /* + * We only want to return up to EOF even if the extent extends beyond + * that. 
+ */ + encoded.len = (min_t(u64, extent_map_end(em), inode->i_size) - + iocb->ki_pos); + if (em->block_start == EXTENT_MAP_HOLE || + test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) { + offset = EXTENT_MAP_HOLE; + encoded.len = encoded.unencoded_len = count = + min_t(u64, count, encoded.len); + } else if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) { + offset = em->block_start; + /* + * Bail if the buffer isn't large enough to return the whole + * compressed extent. + */ + if (em->block_len > count) { + ret = -ENOBUFS; + goto out_em; + } + disk_io_size = count = em->block_len; + encoded.unencoded_len = em->ram_bytes; + encoded.unencoded_offset = iocb->ki_pos - em->orig_start; + ret = encoded_iov_compression_from_btrfs(em->compress_type); + if (ret < 0) + goto out_em; + encoded.compression = ret; + } else { + offset = em->block_start + (start - em->start); + if (encoded.len > count) + encoded.len = count; + /* + * Don't read beyond what we locked. This also limits the page + * allocations that we'll do. 
+ */ + disk_io_size = min(lockend + 1, iocb->ki_pos + encoded.len) - start; + encoded.len = encoded.unencoded_len = count = + start + disk_io_size - iocb->ki_pos; + disk_io_size = ALIGN(disk_io_size, fs_info->sectorsize); + } + free_extent_map(em); + em = NULL; + + if (offset == EXTENT_MAP_HOLE) { + unlock_extent_cached(io_tree, start, lockend, &cached_state); + inode_unlock_shared(inode); + unlocked = true; + ret = copy_encoded_iov_to_iter(&encoded, iter); + if (ret) + goto out; + ret = iov_iter_zero(count, iter); + if (ret != count) + ret = -EFAULT; + } else { + ret = btrfs_encoded_read_regular(iocb, iter, start, lockend, + &cached_state, offset, + disk_io_size, count, &encoded, + &unlocked); + } + +out: + if (ret >= 0) + iocb->ki_pos += encoded.len; +out_em: + free_extent_map(em); +out_unlock_extent: + if (!unlocked) + unlock_extent_cached(io_tree, start, lockend, &cached_state); +out_unlock_inode: + if (!unlocked) + inode_unlock_shared(inode); + return ret; +} + #ifdef CONFIG_SWAP /* * Add an entry indicating a block group or device which is pinned by a From patchwork Mon May 17 18:35:27 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Omar Sandoval X-Patchwork-Id: 12262811 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4C124C433ED for ; Mon, 17 May 2021 18:36:09 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 319C0611F1 for ; Mon, 17 May 2021 18:36:09 +0000 
(UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S241442AbhEQShZ (ORCPT ); Mon, 17 May 2021 14:37:25 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40096 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S241312AbhEQShY (ORCPT ); Mon, 17 May 2021 14:37:24 -0400 Received: from mail-pl1-x62d.google.com (mail-pl1-x62d.google.com [IPv6:2607:f8b0:4864:20::62d]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A6212C061573 for ; Mon, 17 May 2021 11:36:07 -0700 (PDT) Received: by mail-pl1-x62d.google.com with SMTP id p6so3652886plr.11 for ; Mon, 17 May 2021 11:36:07 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=osandov-com.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=7cXsLsHnJzVBsPEV5SCSRP4oIh+BEG387qT3grJ77BE=; b=kfMW015XHsn/T4Dj21sfsZ4S1fnWSBgsQuzP6nCA3Ct574RoDI/IARciUnpWWg9KLY 1fNdJ1XXaclpg81u6nPdH7F31QmSB8zPlglNGz+ckRt/uHQwhgQEB1gSmvlnDajrYHG7 C4kKQoY1NEhpo+EciWAFlKTjbz19RoArZZsOqlK6xOrWepe3EtaehCaL/PgmzyqZJkiA MTc5Df9/uadVhYy5sayLRbV3crTLy1RGGqhnnnoVr5B96TRaEguHgl6/ECX6dl4U0o60 peieMo8HNaqboQqZuIoeSJSlXNSuN+xFlwhUEwGPD2msK0XWD+M4xb+vemLqCnBpOpzf UNsQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=7cXsLsHnJzVBsPEV5SCSRP4oIh+BEG387qT3grJ77BE=; b=OKkTFHCZhdxn7U9DVethXxfKTcJ75z/nDlORPqyC9NETQILboXx3CEYVqOt0QIYe7Q 8xOAE1O5Q/x0a6ZsCAhCo4IjZ0JpNP5aQ4jX3LqMfvh9Im++03GSc+R3yJV8hXmyrrL9 xqRhyI8kzgc4q2skubzKJA+5XC+4xZp88HOb2x8IRxQAfMs3ioEN6Wq9uNmzQWi6/Z7y I5jvzmVOOcFn2MPjCF5+z76GR3RWjeEp3kzK1UhAn+w25thltXXF4qprlM5wdPLfBIkL Aew2x5cqjvqpJPWEWPAZSejYQXY2MuDCnq9LgYJzVgIkl1nEYoRL2ZieI6XaSbIfb7QV wknQ== X-Gm-Message-State: AOAM532IeTGbMc/CsuZ9infkjFH+dIB3lUA7TLa/apxXdbmfQ135yU5m 
kFknXekK8puUs7SmdiI84NaIgYnCWMD5mw==
X-Google-Smtp-Source: ABdhPJwWlYmXsZ4UvptCMaVmlM8upiRge1tfWY/hG5NZCX0pBF1nseU/+Kt2wN/IZmcj0TvQmWF8NA==
X-Received: by 2002:a17:90b:1c0b:: with SMTP id oc11mr837383pjb.156.1621276565320;
 Mon, 17 May 2021 11:36:05 -0700 (PDT)
Received: from relinquished.tfbnw.net ([2620:10d:c090:400::5:19a9])
 by smtp.gmail.com with ESMTPSA id v15sm5498763pfm.187.2021.05.17.11.36.02
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Mon, 17 May 2021 11:36:04 -0700 (PDT)
From: Omar Sandoval
To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org, Al Viro , Christoph Hellwig
Cc: Linus Torvalds , Dave Chinner , Jann Horn , Amir Goldstein , Aleksa Sarai , linux-api@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH RERESEND v9 9/9] btrfs: implement RWF_ENCODED writes
Date: Mon, 17 May 2021 11:35:27 -0700
Message-Id:
X-Mailer: git-send-email 2.31.1
In-Reply-To:
References:
MIME-Version: 1.0
Precedence: bulk
List-ID:
X-Mailing-List: linux-fsdevel@vger.kernel.org

From: Omar Sandoval

The implementation resembles direct I/O: we have to flush any ordered
extents, invalidate the page cache, and do the io tree/delalloc/extent
map/ordered extent dance. From there, we can reuse the compression code
with a minor modification to distinguish the write from writeback. This
also creates inline extents when possible.

Now that read and write are implemented, this also sets the
FMODE_ENCODED_IO flag in btrfs_file_open().
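
For illustration only (this is not part of the patch): before any of the
locking and reservation work starts, btrfs_do_encoded_write() rejects
writes whose sizes or alignment are not sane. The sketch below mirrors
those checks in userspace. The struct and function names are
hypothetical, the compression-type and encryption checks are left out,
and the two 128 KiB limits are assumed to match BTRFS_MAX_UNCOMPRESSED
and BTRFS_MAX_COMPRESSED.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed limits for this sketch (128 KiB each); not taken from the patch. */
#define SKETCH_MAX_UNCOMPRESSED (128u * 1024u)
#define SKETCH_MAX_COMPRESSED   (128u * 1024u)

/* Hypothetical userspace mirror of the values the kernel checks look at. */
struct sketch_encoded_write {
	uint64_t pos;              /* file offset of the write (iocb->ki_pos) */
	uint64_t len;              /* bytes of file data the extent covers */
	uint64_t unencoded_len;    /* size of the data once decompressed */
	uint64_t unencoded_offset; /* offset of len within the decompressed data */
	uint64_t compressed_len;   /* bytes of compressed payload supplied */
};

static int sketch_is_aligned(uint64_t x, uint32_t align)
{
	return (x & ((uint64_t)align - 1)) == 0;
}

/* Returns 0 if the write passes the size/alignment checks, else -EINVAL. */
static int sketch_encoded_write_check(const struct sketch_encoded_write *w,
				      uint64_t i_size, uint32_t sectorsize)
{
	/* The alignment test only works for a power-of-two sector size. */
	assert(sectorsize != 0 && (sectorsize & (sectorsize - 1)) == 0);

	/* The extent size must be sane. */
	if (w->unencoded_len > SKETCH_MAX_UNCOMPRESSED ||
	    w->compressed_len > SKETCH_MAX_COMPRESSED ||
	    w->compressed_len == 0)
		return -EINVAL;

	/* The compressed data must be smaller than the decompressed data. */
	if (w->compressed_len >= w->unencoded_len)
		return -EINVAL;

	/* The extent must start on a sector boundary. */
	if (!sketch_is_aligned(w->pos, sectorsize))
		return -EINVAL;

	/* An unaligned end is only allowed at or beyond the current i_size. */
	if (w->pos + w->len < i_size &&
	    !sketch_is_aligned(w->pos + w->len, sectorsize))
		return -EINVAL;

	/* The offset into the unencoded data must be sector-aligned. */
	if (!sketch_is_aligned(w->unencoded_offset, sectorsize))
		return -EINVAL;

	return 0;
}
```

With a 4096-byte sector size, a write at offset 0 with len 5000 passes
when it ends at or beyond i_size, and fails with -EINVAL when it ends
before i_size, because the end is then not sector-aligned.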
Reviewed-by: Josef Bacik Signed-off-by: Omar Sandoval --- fs/btrfs/compression.c | 7 +- fs/btrfs/compression.h | 6 +- fs/btrfs/ctree.h | 2 + fs/btrfs/file.c | 38 +++++- fs/btrfs/inode.c | 259 +++++++++++++++++++++++++++++++++++++++- fs/btrfs/ordered-data.c | 12 +- fs/btrfs/ordered-data.h | 5 +- 7 files changed, 316 insertions(+), 13 deletions(-) diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index b6d9a9657c3a..8757541e8909 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -354,7 +354,8 @@ static void end_compressed_bio_write(struct bio *bio) bio->bi_status == BLK_STS_OK); cb->compressed_pages[0]->mapping = NULL; - end_compressed_writeback(inode, cb); + if (cb->writeback) + end_compressed_writeback(inode, cb); /* note, our inode could be gone now */ /* @@ -390,7 +391,8 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start, struct page **compressed_pages, unsigned long nr_pages, unsigned int write_flags, - struct cgroup_subsys_state *blkcg_css) + struct cgroup_subsys_state *blkcg_css, + bool writeback) { struct btrfs_fs_info *fs_info = inode->root->fs_info; struct bio *bio = NULL; @@ -414,6 +416,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start, cb->mirror_num = 0; cb->compressed_pages = compressed_pages; cb->compressed_len = compressed_len; + cb->writeback = writeback; cb->orig_bio = NULL; cb->nr_pages = nr_pages; diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h index 8001b700ea3a..f95cdc16f503 100644 --- a/fs/btrfs/compression.h +++ b/fs/btrfs/compression.h @@ -49,6 +49,9 @@ struct compressed_bio { /* the compression algorithm for this bio */ int compress_type; + /* Whether this is a write for writeback. 
*/ + bool writeback; + /* number of compressed pages in the array */ unsigned long nr_pages; @@ -96,7 +99,8 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start, struct page **compressed_pages, unsigned long nr_pages, unsigned int write_flags, - struct cgroup_subsys_state *blkcg_css); + struct cgroup_subsys_state *blkcg_css, + bool writeback); blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio, int mirror_num, unsigned long bio_flags); diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 00612695d57a..c22f75d1266c 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3191,6 +3191,8 @@ int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end); void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start, u64 end, int uptodate); ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter); +ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from, + struct encoded_iov *encoded); extern const struct dentry_operations btrfs_dentry_operations; extern const struct iomap_ops btrfs_dio_iomap_ops; diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 2e6b47f866b7..201b6ce0267c 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -3,6 +3,7 @@ * Copyright (C) 2007 Oracle. All rights reserved. */ +#include #include #include #include @@ -1987,6 +1988,32 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from) return written ? 
written : err; } +static ssize_t btrfs_encoded_write(struct kiocb *iocb, struct iov_iter *from) +{ + struct file *file = iocb->ki_filp; + struct inode *inode = file_inode(file); + struct encoded_iov encoded; + ssize_t ret; + + ret = copy_encoded_iov_from_iter(&encoded, from); + if (ret) + return ret; + + btrfs_inode_lock(inode, 0); + ret = generic_encoded_write_checks(iocb, &encoded); + if (ret || encoded.len == 0) + goto out; + + ret = btrfs_write_check(iocb, from, encoded.len); + if (ret < 0) + goto out; + + ret = btrfs_do_encoded_write(iocb, from, &encoded); +out: + btrfs_inode_unlock(inode, 0); + return ret; +} + static ssize_t btrfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from) { @@ -2003,14 +2030,17 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb, if (test_bit(BTRFS_FS_STATE_ERROR, &inode->root->fs_info->fs_state)) return -EROFS; - if (!(iocb->ki_flags & IOCB_DIRECT) && - (iocb->ki_flags & IOCB_NOWAIT)) + if ((iocb->ki_flags & IOCB_NOWAIT) && + (!(iocb->ki_flags & IOCB_DIRECT) || + (iocb->ki_flags & IOCB_ENCODED))) return -EOPNOTSUPP; if (sync) atomic_inc(&inode->sync_writers); - if (iocb->ki_flags & IOCB_DIRECT) + if (iocb->ki_flags & IOCB_ENCODED) + num_written = btrfs_encoded_write(iocb, from); + else if (iocb->ki_flags & IOCB_DIRECT) num_written = btrfs_direct_write(iocb, from); else num_written = btrfs_buffered_write(iocb, from); @@ -3594,7 +3624,7 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence) static int btrfs_file_open(struct inode *inode, struct file *filp) { - filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC; + filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC | FMODE_ENCODED_IO; return generic_file_open(inode, filp); } diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index fc4a288257a5..2201fd6b9344 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -934,7 +934,7 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk) ins.offset, async_extent->pages, 
async_extent->nr_pages, async_chunk->write_flags, - async_chunk->blkcg_css)) { + async_chunk->blkcg_css, true)) { struct page *p = async_extent->pages[0]; const u64 start = async_extent->start; const u64 end = start + async_extent->ram_size - 1; @@ -2850,6 +2850,7 @@ static int insert_ordered_extent_file_extent(struct btrfs_trans_handle *trans, * except if the ordered extent was truncated. */ update_inode_bytes = test_bit(BTRFS_ORDERED_DIRECT, &oe->flags) || + test_bit(BTRFS_ORDERED_ENCODED, &oe->flags) || test_bit(BTRFS_ORDERED_TRUNCATED, &oe->flags); return insert_reserved_file_extent(trans, BTRFS_I(oe->inode), @@ -2884,7 +2885,8 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent) if (!test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags) && !test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags) && - !test_bit(BTRFS_ORDERED_DIRECT, &ordered_extent->flags)) + !test_bit(BTRFS_ORDERED_DIRECT, &ordered_extent->flags) && + !test_bit(BTRFS_ORDERED_ENCODED, &ordered_extent->flags)) clear_bits |= EXTENT_DELALLOC_NEW; freespace_inode = btrfs_is_free_space_inode(inode); @@ -10712,6 +10714,259 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter) return ret; } +ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from, + struct encoded_iov *encoded) +{ + struct inode *inode = file_inode(iocb->ki_filp); + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + struct btrfs_root *root = BTRFS_I(inode)->root; + struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree; + struct extent_changeset *data_reserved = NULL; + struct extent_state *cached_state = NULL; + int compression; + size_t orig_count; + u64 start, end; + u64 num_bytes, ram_bytes, disk_num_bytes; + unsigned long nr_pages, i; + struct page **pages; + struct btrfs_key ins; + bool extent_reserved = false; + struct extent_map *em; + ssize_t ret; + + switch (encoded->compression) { + case ENCODED_IOV_COMPRESSION_BTRFS_ZLIB: + compression = 
BTRFS_COMPRESS_ZLIB; + break; + case ENCODED_IOV_COMPRESSION_BTRFS_ZSTD: + compression = BTRFS_COMPRESS_ZSTD; + break; + case ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K: + case ENCODED_IOV_COMPRESSION_BTRFS_LZO_8K: + case ENCODED_IOV_COMPRESSION_BTRFS_LZO_16K: + case ENCODED_IOV_COMPRESSION_BTRFS_LZO_32K: + case ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K: + /* The page size must match for LZO. */ + if (encoded->compression - + ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K + 12 != PAGE_SHIFT) + return -EINVAL; + compression = BTRFS_COMPRESS_LZO; + break; + default: + return -EINVAL; + } + if (encoded->encryption != ENCODED_IOV_ENCRYPTION_NONE) + return -EINVAL; + + orig_count = iov_iter_count(from); + + /* The extent size must be sane. */ + if (encoded->unencoded_len > BTRFS_MAX_UNCOMPRESSED || + orig_count > BTRFS_MAX_COMPRESSED || orig_count == 0) + return -EINVAL; + + /* + * The compressed data must be smaller than the decompressed data. + * + * It's of course possible for data to compress to larger or the same + * size, but the buffered I/O path falls back to no compression for such + * data, and we don't want to break any assumptions by creating these + * extents. + * + * Note that this is less strict than the current check we have that the + * compressed data must be at least one sector smaller than the + * decompressed data. We only want to enforce the weaker requirement + * from old kernels that it is at least one byte smaller. + */ + if (orig_count >= encoded->unencoded_len) + return -EINVAL; + + /* The extent must start on a sector boundary. */ + start = iocb->ki_pos; + if (!IS_ALIGNED(start, fs_info->sectorsize)) + return -EINVAL; + + /* + * The extent must end on a sector boundary. However, we allow a write + * which ends at or extends i_size to have an unaligned length; we round + * up the extent size and set i_size to the unaligned end. 
+	 */
+	if (start + encoded->len < inode->i_size &&
+	    !IS_ALIGNED(start + encoded->len, fs_info->sectorsize))
+		return -EINVAL;
+
+	/* Finally, the offset in the unencoded data must be sector-aligned. */
+	if (!IS_ALIGNED(encoded->unencoded_offset, fs_info->sectorsize))
+		return -EINVAL;
+
+	num_bytes = ALIGN(encoded->len, fs_info->sectorsize);
+	ram_bytes = ALIGN(encoded->unencoded_len, fs_info->sectorsize);
+	end = start + num_bytes - 1;
+
+	/*
+	 * If the extent cannot be inline, the compressed data on disk must be
+	 * sector-aligned. For convenience, we extend it with zeroes if it
+	 * isn't.
+	 */
+	disk_num_bytes = ALIGN(orig_count, fs_info->sectorsize);
+	nr_pages = DIV_ROUND_UP(disk_num_bytes, PAGE_SIZE);
+	pages = kvcalloc(nr_pages, sizeof(struct page *), GFP_KERNEL_ACCOUNT);
+	if (!pages)
+		return -ENOMEM;
+	for (i = 0; i < nr_pages; i++) {
+		size_t bytes = min_t(size_t, PAGE_SIZE, iov_iter_count(from));
+		char *kaddr;
+
+		pages[i] = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_HIGHMEM);
+		if (!pages[i]) {
+			ret = -ENOMEM;
+			goto out_pages;
+		}
+		kaddr = kmap(pages[i]);
+		if (copy_from_iter(kaddr, bytes, from) != bytes) {
+			kunmap(pages[i]);
+			ret = -EFAULT;
+			goto out_pages;
+		}
+		if (bytes < PAGE_SIZE)
+			memset(kaddr + bytes, 0, PAGE_SIZE - bytes);
+		kunmap(pages[i]);
+	}
+
+	for (;;) {
+		struct btrfs_ordered_extent *ordered;
+
+		ret = btrfs_wait_ordered_range(inode, start, num_bytes);
+		if (ret)
+			goto out_pages;
+		ret = invalidate_inode_pages2_range(inode->i_mapping,
+						    start >> PAGE_SHIFT,
+						    end >> PAGE_SHIFT);
+		if (ret)
+			goto out_pages;
+		lock_extent_bits(io_tree, start, end, &cached_state);
+		ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start,
+						     num_bytes);
+		if (!ordered &&
+		    !filemap_range_has_page(inode->i_mapping, start, end))
+			break;
+		if (ordered)
+			btrfs_put_ordered_extent(ordered);
+		unlock_extent_cached(io_tree, start, end, &cached_state);
+		cond_resched();
+	}
+
+	/*
+	 * We don't use the higher-level delalloc space functions because our
+	 * num_bytes and disk_num_bytes are different.
+	 */
+	ret = btrfs_alloc_data_chunk_ondemand(BTRFS_I(inode), disk_num_bytes);
+	if (ret)
+		goto out_unlock;
+	ret = btrfs_qgroup_reserve_data(BTRFS_I(inode), &data_reserved, start,
+					num_bytes);
+	if (ret)
+		goto out_free_data_space;
+	ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), num_bytes,
+					      disk_num_bytes);
+	if (ret)
+		goto out_qgroup_free_data;
+
+	/* Try an inline extent first. */
+	if (start == 0 && encoded->unencoded_len == encoded->len &&
+	    encoded->unencoded_offset == 0) {
+		ret = cow_file_range_inline(BTRFS_I(inode), encoded->len,
+					    orig_count, compression, pages,
+					    true);
+		if (ret <= 0) {
+			if (ret == 0)
+				ret = orig_count;
+			goto out_delalloc_release;
+		}
+	}
+
+	ret = btrfs_reserve_extent(root, disk_num_bytes, disk_num_bytes,
+				   disk_num_bytes, 0, 0, &ins, 1, 1);
+	if (ret)
+		goto out_delalloc_release;
+	extent_reserved = true;
+
+	em = create_io_em(BTRFS_I(inode), start, num_bytes,
+			  start - encoded->unencoded_offset, ins.objectid,
+			  ins.offset, ins.offset, ram_bytes, compression,
+			  BTRFS_ORDERED_COMPRESSED);
+	if (IS_ERR(em)) {
+		ret = PTR_ERR(em);
+		goto out_free_reserved;
+	}
+	free_extent_map(em);
+
+	ret = btrfs_add_ordered_extent(BTRFS_I(inode), start, num_bytes,
+				       ram_bytes, ins.objectid, ins.offset,
+				       encoded->unencoded_offset,
+				       (1 << BTRFS_ORDERED_ENCODED) |
+				       (1 << BTRFS_ORDERED_COMPRESSED),
+				       compression);
+	if (ret) {
+		btrfs_drop_extent_cache(BTRFS_I(inode), start, end, 0);
+		goto out_free_reserved;
+	}
+	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
+
+	if (start + encoded->len > inode->i_size)
+		i_size_write(inode, start + encoded->len);
+
+	unlock_extent_cached(io_tree, start, end, &cached_state);
+
+	btrfs_delalloc_release_extents(BTRFS_I(inode), num_bytes);
+
+	if (btrfs_submit_compressed_write(BTRFS_I(inode), start, num_bytes,
+					  ins.objectid, ins.offset, pages,
+					  nr_pages, 0, NULL, false)) {
+		struct page *page = pages[0];
+
+		page->mapping = inode->i_mapping;
+		btrfs_writepage_endio_finish_ordered(page, start, end, 0);
+		page->mapping = NULL;
+		ret = -EIO;
+		goto out_pages;
+	}
+	ret = orig_count;
+	goto out;
+
+out_free_reserved:
+	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
+	btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1);
+out_delalloc_release:
+	btrfs_delalloc_release_extents(BTRFS_I(inode), num_bytes);
+	btrfs_delalloc_release_metadata(BTRFS_I(inode), disk_num_bytes,
+					ret < 0);
+out_qgroup_free_data:
+	if (ret < 0) {
+		btrfs_qgroup_free_data(BTRFS_I(inode), data_reserved, start,
+				       num_bytes);
+	}
+out_free_data_space:
+	/*
+	 * If btrfs_reserve_extent() succeeded, then we already decremented
+	 * bytes_may_use.
+	 */
+	if (!extent_reserved)
+		btrfs_free_reserved_data_space_noquota(fs_info, disk_num_bytes);
+out_unlock:
+	unlock_extent_cached(io_tree, start, end, &cached_state);
+out_pages:
+	for (i = 0; i < nr_pages; i++) {
+		if (pages[i])
+			__free_page(pages[i]);
+	}
+	kvfree(pages);
+out:
+	if (ret >= 0)
+		iocb->ki_pos += encoded->len;
+	return ret;
+}
+
 #ifdef CONFIG_SWAP
 /*
  * Add an entry indicating a block group or device which is pinned by a
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 57dc2b90fee8..266a0918bbbf 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -464,9 +464,15 @@ void btrfs_remove_ordered_extent(struct btrfs_inode *btrfs_inode,
 	spin_lock(&btrfs_inode->lock);
 	btrfs_mod_outstanding_extents(btrfs_inode, -1);
 	spin_unlock(&btrfs_inode->lock);
-	if (root != fs_info->tree_root)
-		btrfs_delalloc_release_metadata(btrfs_inode, entry->num_bytes,
-						false);
+	if (root != fs_info->tree_root) {
+		u64 release;
+
+		if (test_bit(BTRFS_ORDERED_ENCODED, &entry->flags))
+			release = entry->disk_num_bytes;
+		else
+			release = entry->num_bytes;
+		btrfs_delalloc_release_metadata(btrfs_inode, release, false);
+	}
 
 	percpu_counter_add_batch(&fs_info->ordered_bytes, -entry->num_bytes,
 				 fs_info->delalloc_batch);
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index fe1c50da373c..abed13023094 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -74,6 +74,8 @@ enum {
 	BTRFS_ORDERED_LOGGED_CSUM,
 	/* We wait for this extent to complete in the current transaction */
 	BTRFS_ORDERED_PENDING,
+	/* RWF_ENCODED I/O */
+	BTRFS_ORDERED_ENCODED,
 };
 
 /* BTRFS_ORDERED_* flags that specify the type of the extent. */
@@ -81,7 +83,8 @@ enum {
 	(1UL << BTRFS_ORDERED_NOCOW) |		\
 	(1UL << BTRFS_ORDERED_PREALLOC) |	\
 	(1UL << BTRFS_ORDERED_COMPRESSED) |	\
-	(1UL << BTRFS_ORDERED_DIRECT))
+	(1UL << BTRFS_ORDERED_DIRECT) |		\
+	(1UL << BTRFS_ORDERED_ENCODED))
 
 struct btrfs_ordered_extent {
 	/* logical offset in the file */