From patchwork Fri Jan 22 20:46:48 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Omar Sandoval X-Patchwork-Id: 12040635 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4FEA4C433E0 for ; Fri, 22 Jan 2021 21:16:37 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 1E24323AF8 for ; Fri, 22 Jan 2021 21:16:37 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729371AbhAVVQD (ORCPT ); Fri, 22 Jan 2021 16:16:03 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49332 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728807AbhAVUtW (ORCPT ); Fri, 22 Jan 2021 15:49:22 -0500 Received: from mail-pj1-x102a.google.com (mail-pj1-x102a.google.com [IPv6:2607:f8b0:4864:20::102a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9FC9CC0617AB for ; Fri, 22 Jan 2021 12:47:21 -0800 (PST) Received: by mail-pj1-x102a.google.com with SMTP id u4so4642664pjn.4 for ; Fri, 22 Jan 2021 12:47:21 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=osandov-com.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=rUrSv5a3k4qaJA6xCjOA8O8ZkzvHiOJp37R6oMRhd/s=; b=cy0QcBWEJOpNlrEEH0Rtvrozscq6t7muLe7G4vOHIQmvaG6O34slLP+UhdcVfMS2Ij m50JAjx5fsKG6cHJiQ1VHW5woqouctcQZDdXSy8ZXhzewz3YtBTXfMdIyYe2Y2K3wWVN 6/Nch2PFRFvWM6ApcJD+68fu26qelaHpN7z5kuw6HVAiBssB6r1QRILGkS6+9SnQ5yfT 0di5h77BoSxmZD10Mjk+1KvTyHmRR6DHI7QghfEo6IsYdhYBA+VjQtAMTLq96j4leXhK riaGoCpLAWX1SeVu0bJgSpbDsthPNPd/sstc3ceGHV3/a1SJl4L2pYHh3767mev3QHbF bLGg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=rUrSv5a3k4qaJA6xCjOA8O8ZkzvHiOJp37R6oMRhd/s=; b=QZyK3C6MOMLp8VUofWIVhCCO5FbJvMbGVdr81lrwhOc3XzJlb0PbKapar/7Nl+znJl RjYCG9XTx/5NSGnV4mk0cFSPEP4vb6ME/bOe4AAqnNzlMwTSx6zaXzvHNx2p5ErVxp9s uKSPI3OjkB/j62fYTKLbIsS6t5xb4glKGFYHeSjZPmYzLuew6j7JYE485CJT2azClY3V sNRV5W+OA2lAXydq0URex8pVTMdS0k84YHBviw6YZqQaAUu9UDzN2ScfPr4QvWCrE24z wt3uXQJGT6oM7XHKbXFCUt5NMGjuQAUWbtds+a7H+GB8vDhG6E8xjyBw2p6Rj/743ekC D7iA== X-Gm-Message-State: AOAM530r6mi6ORkfRpGIo3r0xfqJfwknYp2GGPc0Aiou8RJRgylQHRi+ eqw6A4jkwci9HeAYuQTgl+8QAA== X-Google-Smtp-Source: ABdhPJz1B4n6Rk7dvmWq3RNevUOFnZeWA8G4FgacZGyIY3z/iihHeNfBrAFENgHBLu0CfBfWS3X5tQ== X-Received: by 2002:a17:90b:1a86:: with SMTP id ng6mr7239385pjb.113.1611348441102; Fri, 22 Jan 2021 12:47:21 -0800 (PST) Received: from relinquished.tfbnw.net ([2620:10d:c090:400::5:ea88]) by smtp.gmail.com with ESMTPSA id j18sm4092900pfc.99.2021.01.22.12.47.18 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 22 Jan 2021 12:47:20 -0800 (PST) From: Omar Sandoval To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org, Al Viro , Christoph Hellwig Cc: Dave Chinner , Jann Horn , Amir Goldstein , Aleksa Sarai , linux-api@vger.kernel.org, kernel-team@fb.com Subject: [PATCH v7 01/10] iov_iter: add copy_struct_from_iter() Date: Fri, 22 Jan 2021 12:46:48 -0800 Message-Id: X-Mailer: git-send-email 2.30.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Omar Sandoval This is essentially copy_struct_from_user() but for an iov_iter. Suggested-by: Aleksa Sarai Reviewed-by: Josef Bacik Signed-off-by: Omar Sandoval --- include/linux/uio.h | 2 ++ lib/iov_iter.c | 82 +++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 84 insertions(+) diff --git a/include/linux/uio.h b/include/linux/uio.h index 72d88566694e..f4e6ea85a269 100644 --- a/include/linux/uio.h +++ b/include/linux/uio.h @@ -121,6 +121,8 @@ size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes, struct iov_iter *i); size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes, struct iov_iter *i); +int copy_struct_from_iter(void *dst, size_t ksize, struct iov_iter *i, + size_t usize); size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i); size_t _copy_from_iter(void *addr, size_t bytes, struct iov_iter *i); diff --git a/lib/iov_iter.c b/lib/iov_iter.c index a21e6a5792c5..f45826ed7528 100644 --- a/lib/iov_iter.c +++ b/lib/iov_iter.c @@ -948,6 +948,88 @@ size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes, } EXPORT_SYMBOL(copy_page_from_iter); +/** + * copy_struct_from_iter - copy a struct from an iov_iter + * @dst: Destination buffer. + * @ksize: Size of @dst struct. + * @i: Source iterator. + * @usize: (Alleged) size of struct in @i. + * + * Copies a struct from an iov_iter in a way that guarantees + * backwards-compatibility for struct arguments in an iovec (as long as the + * rules for copy_struct_from_user() are followed). + * + * The recommended usage is that @usize be taken from the current segment: + * + * int do_foo(struct iov_iter *i) + * { + * size_t usize = iov_iter_single_seg_count(i); + * struct foo karg; + * int err; + * + * if (usize > PAGE_SIZE) + * return -E2BIG; + * if (usize < FOO_SIZE_VER0) + * return -EINVAL; + * err = copy_struct_from_iter(&karg, sizeof(karg), i, usize); + * if (err) + * return err; + * + * // ... + * } + * + * Return: 0 on success, -errno on error (see copy_struct_from_user()). + * + * On success, the iterator is advanced @usize bytes. On error, the iterator is + * not advanced. + */ +int copy_struct_from_iter(void *dst, size_t ksize, struct iov_iter *i, + size_t usize) +{ + if (usize <= ksize) { + if (!copy_from_iter_full(dst, usize, i)) + return -EFAULT; + memset(dst + usize, 0, ksize - usize); + } else { + size_t copied = 0, copy; + int ret; + + if (WARN_ON(iov_iter_is_pipe(i)) || unlikely(i->count < usize)) + return -EFAULT; + if (iter_is_iovec(i)) + might_fault(); + iterate_all_kinds(i, usize, v, ({ + copy = min(ksize - copied, v.iov_len); + if (copy && copyin(dst + copied, v.iov_base, copy)) + return -EFAULT; + copied += copy; + ret = check_zeroed_user(v.iov_base + copy, + v.iov_len - copy); + if (ret <= 0) + return ret ?: -E2BIG; + 0;}), ({ + char *addr = kmap_atomic(v.bv_page); + copy = min_t(size_t, ksize - copied, v.bv_len); + memcpy(dst + copied, addr + v.bv_offset, copy); + copied += copy; + ret = memchr_inv(addr + v.bv_offset + copy, 0, + v.bv_len - copy) ? -E2BIG : 0; + kunmap_atomic(addr); + if (ret) + return ret; + }), ({ + copy = min(ksize - copied, v.iov_len); + memcpy(dst + copied, v.iov_base, copy); + if (memchr_inv(v.iov_base, 0, v.iov_len)) + return -E2BIG; + }) + ) + iov_iter_advance(i, usize); + } + return 0; +} +EXPORT_SYMBOL_GPL(copy_struct_from_iter); + static size_t pipe_zero(size_t bytes, struct iov_iter *i) { struct pipe_inode_info *pipe = i->pipe; From patchwork Fri Jan 22 20:46:49 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Omar Sandoval X-Patchwork-Id: 12040631 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0F44AC433DB for ; Fri, 22 Jan 2021 21:16:06 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id CDC7323B09 for ; Fri, 22 Jan 2021 21:16:05 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730161AbhAVVPz (ORCPT ); Fri, 22 Jan 2021 16:15:55 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49342 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729468AbhAVUtW (ORCPT ); Fri, 22 Jan 2021 15:49:22 -0500 Received: from mail-pl1-x629.google.com (mail-pl1-x629.google.com [IPv6:2607:f8b0:4864:20::629]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id CB40FC061352 for ; Fri, 22 Jan 2021 12:47:23 -0800 (PST) Received: by mail-pl1-x629.google.com with SMTP id h15so1482992pli.8 for ; Fri, 22 Jan 2021 12:47:23 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=osandov-com.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=ebBFLla+oq6Fu8NNd3NcGHArvFeQ09WYnlC9IkZrxIc=; b=CRAF4UcnAm/BtIvVRfBo4+C4xSKqe1kvd5ymqv1SnomF7F+lPPFW3nV2U014DKZ8/m gaInPQXQnY7Kuk4NQ/xioc90wp6AhyUMUn2STCF1ZMP66n9zzOCuUrXhU/I6xofMKjSl 2fVQQHKy7GEdsijw6IeX/rSTujpvHskGJ0arnytgw8N+ZDxHidc8PY/BAhqaJO+5iAKM h6FD00W5lpR+pXoQGg8T4sgCRCC0hytnlCibvHPRdCd9G2gIy5mMvkmOf8pt2dNwIKTC FJ9qNFyTLo0RLmDkV81EarSPHAGWoykD2mriPKAF6HO2j1Eo3i0PbHFhTG/6gnyTd+dQ lLvQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=ebBFLla+oq6Fu8NNd3NcGHArvFeQ09WYnlC9IkZrxIc=; b=WJzIctJHpY8oN4wrZQ0Gj27QUqsWCIvSVDBNRymDOEuq64goERTc9VG6OyUqcu+Vqg O40jjkzCHTvElZpZcP5nZSQ5lC0lm8ya7ZkgHXdnAch71F0H6GWz30/xy4QpjIIALKzU NNDHGb1np1cWJxOQiWzPSqy3GR40gQ5APSUPQAYu7Mnl3UwS7Jkq461s1TviFCdyJUab GCpNBbwoPrg4QuZ6K7zFw+ghquIo1rjnM50dc2nYXHjoSTuiB839yb8VN+ihWs2+01MR 8L9gv5ijylK0+vjhqr9mPmVZ2h3Q+G9N4r4WrSCmqiZgb34bhIJFbyeyI+CDrEnoGEiU QMrg== X-Gm-Message-State: AOAM53214kzFPSdKvIyJAWBwmpgDsuNn+jzsCVAu28foR9YsO0glqTcF EVUi3MuWsdEEf8JqOcHnHc3DYg== X-Google-Smtp-Source: ABdhPJzlU6VoXtH6Y1tNDhS0ekWFfZXU5C+5K1BPn9WzwPAHc8pTtFnqs9nco3ZT5ZQORSBVeehdnQ== X-Received: by 2002:a17:90b:78f:: with SMTP id l15mr214949pjz.234.1611348443355; Fri, 22 Jan 2021 12:47:23 -0800 (PST) Received: from relinquished.tfbnw.net ([2620:10d:c090:400::5:ea88]) by smtp.gmail.com with ESMTPSA id j18sm4092900pfc.99.2021.01.22.12.47.21 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 22 Jan 2021 12:47:22 -0800 (PST) From: Omar Sandoval To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org, Al Viro , Christoph Hellwig Cc: Dave Chinner , Jann Horn , Amir Goldstein , Aleksa Sarai , linux-api@vger.kernel.org, kernel-team@fb.com Subject: [PATCH v7 02/10] fs: add O_ALLOW_ENCODED open flag Date: Fri, 22 Jan 2021 12:46:49 -0800 Message-Id: <09988d880282a6ef0dd04d1fce7db1dbbd2d335c.1611346706.git.osandov@fb.com> X-Mailer: git-send-email 2.30.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Omar Sandoval The upcoming RWF_ENCODED operation introduces some security concerns: 1. Compressed writes will pass arbitrary data to decompression algorithms in the kernel. 2. Compressed reads can leak truncated/hole punched data. Therefore, we need to require privilege for RWF_ENCODED. It's not possible to do the permissions checks at the time of the read or write because, e.g., io_uring submits IO from a worker thread. So, add an open flag which requires CAP_SYS_ADMIN. It can also be set and cleared with fcntl(). The flag is not cleared in any way on fork or exec. Note that the usual issue that unknown open flags are ignored doesn't really matter for O_ALLOW_ENCODED; if the kernel doesn't support O_ALLOW_ENCODED, then it doesn't support RWF_ENCODED, either. Signed-off-by: Omar Sandoval Reviewed-by: Josef Bacik --- arch/alpha/include/uapi/asm/fcntl.h | 1 + arch/parisc/include/uapi/asm/fcntl.h | 1 + arch/sparc/include/uapi/asm/fcntl.h | 1 + fs/fcntl.c | 10 ++++++++-- fs/namei.c | 4 ++++ include/linux/fcntl.h | 2 +- include/uapi/asm-generic/fcntl.h | 4 ++++ 7 files changed, 20 insertions(+), 3 deletions(-) diff --git a/arch/alpha/include/uapi/asm/fcntl.h b/arch/alpha/include/uapi/asm/fcntl.h index 50bdc8e8a271..391e0d112e41 100644 --- a/arch/alpha/include/uapi/asm/fcntl.h +++ b/arch/alpha/include/uapi/asm/fcntl.h @@ -34,6 +34,7 @@ #define O_PATH 040000000 #define __O_TMPFILE 0100000000 +#define O_ALLOW_ENCODED 0200000000 #define F_GETLK 7 #define F_SETLK 8 diff --git a/arch/parisc/include/uapi/asm/fcntl.h b/arch/parisc/include/uapi/asm/fcntl.h index 03dee816cb13..72ea9bdf5f04 100644 --- a/arch/parisc/include/uapi/asm/fcntl.h +++ b/arch/parisc/include/uapi/asm/fcntl.h @@ -19,6 +19,7 @@ #define O_PATH 020000000 #define __O_TMPFILE 040000000 +#define O_ALLOW_ENCODED 100000000 #define F_GETLK64 8 #define F_SETLK64 9 diff --git a/arch/sparc/include/uapi/asm/fcntl.h b/arch/sparc/include/uapi/asm/fcntl.h index 67dae75e5274..ac3e8c9cb32c 100644 --- a/arch/sparc/include/uapi/asm/fcntl.h +++ b/arch/sparc/include/uapi/asm/fcntl.h @@ -37,6 +37,7 @@ #define O_PATH 0x1000000 #define __O_TMPFILE 0x2000000 +#define O_ALLOW_ENCODED 0x8000000 #define F_GETOWN 5 /* for sockets. */ #define F_SETOWN 6 /* for sockets. */ diff --git a/fs/fcntl.c b/fs/fcntl.c index 05b36b28f2e8..2c1f0580a497 100644 --- a/fs/fcntl.c +++ b/fs/fcntl.c @@ -30,7 +30,8 @@ #include #include -#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME) +#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME | \ + O_ALLOW_ENCODED) static int setfl(int fd, struct file * filp, unsigned long arg) { @@ -49,6 +50,11 @@ static int setfl(int fd, struct file * filp, unsigned long arg) if (!inode_owner_or_capable(inode)) return -EPERM; + /* O_ALLOW_ENCODED can only be set by superuser */ + if ((arg & O_ALLOW_ENCODED) && !(filp->f_flags & O_ALLOW_ENCODED) && + !capable(CAP_SYS_ADMIN)) + return -EPERM; + /* required for strict SunOS emulation */ if (O_NONBLOCK != O_NDELAY) if (arg & O_NDELAY) @@ -1035,7 +1041,7 @@ static int __init fcntl_init(void) * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY * is defined as O_NONBLOCK on some platforms and not on others. */ - BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ != + BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32( (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) | __FMODE_EXEC | __FMODE_NONOTIFY)); diff --git a/fs/namei.c b/fs/namei.c index 78443a85480a..73b925b133a0 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -2892,6 +2892,10 @@ static int may_open(const struct path *path, int acc_mode, int flag) if (flag & O_NOATIME && !inode_owner_or_capable(inode)) return -EPERM; + /* O_ALLOW_ENCODED can only be set by superuser */ + if ((flag & O_ALLOW_ENCODED) && !capable(CAP_SYS_ADMIN)) + return -EPERM; + return 0; } diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h index 921e750843e6..dc66c557b7d0 100644 --- a/include/linux/fcntl.h +++ b/include/linux/fcntl.h @@ -10,7 +10,7 @@ (O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \ O_APPEND | O_NDELAY | O_NONBLOCK | __O_SYNC | O_DSYNC | \ FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \ - O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE) + O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_ALLOW_ENCODED) /* List of all valid flags for the how->upgrade_mask argument: */ #define VALID_UPGRADE_FLAGS \ diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h index 9dc0bf0c5a6e..75321c7a66ac 100644 --- a/include/uapi/asm-generic/fcntl.h +++ b/include/uapi/asm-generic/fcntl.h @@ -89,6 +89,10 @@ #define __O_TMPFILE 020000000 #endif +#ifndef O_ALLOW_ENCODED +#define O_ALLOW_ENCODED 040000000 +#endif + /* a horrid kludge trying to make sure that this will fail on old kernels */ #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY) #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT) From patchwork Fri Jan 22 20:46:50 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Omar Sandoval X-Patchwork-Id: 12040633 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id E75D9C433E0 for ; Fri, 22 Jan 2021 21:16:05 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id A780923AF8 for ; Fri, 22 Jan 2021 21:16:05 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729052AbhAVVPl (ORCPT ); Fri, 22 Jan 2021 16:15:41 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49250 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730013AbhAVUtu (ORCPT ); Fri, 22 Jan 2021 15:49:50 -0500 Received: from mail-pf1-x429.google.com (mail-pf1-x429.google.com [IPv6:2607:f8b0:4864:20::429]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 685B6C0612F2 for ; Fri, 22 Jan 2021 12:47:26 -0800 (PST) Received: by mail-pf1-x429.google.com with SMTP id m6so4613971pfm.6 for ; Fri, 22 Jan 2021 12:47:26 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=osandov-com.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=hkMakF1H9sKSx+14xS5xdHWBqio1iUifywoB5U1bl1E=; b=LQmuf9fSR7/TEu759ocKYpGqSAAsuOtpjJlbknbdan5HqrUo/FrJ1G2qZm340w92EF 3BvvWhjViA9CGjA4B2IOgr9qZDfN049ISO/9TvFwFsURTc/mLICDilgKs2po6gS97bTd X+nHYezvZiaAzfS8wGK2tbc95nDmlEmv3c92uPxGh8+hAboDQlR4GPA+8Z2gmcTjqEAw Ga5U5pinYAPcA+9VxH3tjKCowrjyOANbP5Z9wl50+KlEeEEBBZ1EQqfydXVu1lXSbTpT wUGT/fRMn4gtCz1cYQGdHA76+GOEsV7SwNTOAx40pXfKPYMXVvMvLYHXKbsHDQc7r6fK k7Hw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=hkMakF1H9sKSx+14xS5xdHWBqio1iUifywoB5U1bl1E=; b=C6Z34WAF0MVwsrdkl4/5PwK8tRR4DO0T37jbwKl5KTS+HI7YiZKIelyQhVCGKsdWXo K0GpzEE+sSqLZMo2kQ1ofDJKy/UCFn0rsXNei/iaxtnSbBa2rwBCltLSx7RH9sukY0iq /joN8DaizCjBgRd+2zrJuqtSyu1VjdaN7Xg7ef8861ctpqL73luOwxbAtN5fhNObARuw 0R5jv+zKYpAEmCT0R44+1SQ6fCguKI5A8t3Kqyy55Sc/KqpRgsKMWyMP4pY4+Mfe1ov6 H8Ge2B4F9M+X+E2AH6FktxnDduiy5vRKKqMvTpYJ1Ur6IDaQOGCQUnkmsH64EU8BSDSF 96uA== X-Gm-Message-State: AOAM530IooeQlC6fTwMZilIV001UHrk3wS+HbeNRYY1k8eeEG3kwLqhR Ph7ENPMPVhocX1hIbXrwfguFPQ== X-Google-Smtp-Source: ABdhPJwBiKA7/Mmzk9/51oMteCPkAI0afMkOBM9RnvjCsxUar/QEXog+LXtc+TbdMyhtHKCfAqb8qQ== X-Received: by 2002:a62:e30e:0:b029:1b9:3823:4b3a with SMTP id g14-20020a62e30e0000b02901b938234b3amr290238pfh.15.1611348445782; Fri, 22 Jan 2021 12:47:25 -0800 (PST) Received: from relinquished.tfbnw.net ([2620:10d:c090:400::5:ea88]) by smtp.gmail.com with ESMTPSA id j18sm4092900pfc.99.2021.01.22.12.47.23 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 22 Jan 2021 12:47:24 -0800 (PST) From: Omar Sandoval To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org, Al Viro , Christoph Hellwig Cc: Dave Chinner , Jann Horn , Amir Goldstein , Aleksa Sarai , linux-api@vger.kernel.org, kernel-team@fb.com Subject: [PATCH v7 03/10] fs: add RWF_ENCODED for reading/writing compressed data Date: Fri, 22 Jan 2021 12:46:50 -0800 Message-Id: <5700071a93a4d660e1df719fb63c0c73177cec06.1611346706.git.osandov@fb.com> X-Mailer: git-send-email 2.30.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Omar Sandoval Btrfs supports transparent compression: data written by the user can be compressed when written to disk and decompressed when read back. However, we'd like to add an interface to write pre-compressed data directly to the filesystem, and the matching interface to read compressed data without decompressing it. This adds support for so-called "encoded I/O" via preadv2() and pwritev2(). A new RWF_ENCODED flags indicates that a read or write is "encoded". If this flag is set, iov[0].iov_base points to a struct encoded_iov which is used for metadata: namely, the compression algorithm, unencoded (i.e., decompressed) length, and what subrange of the unencoded data should be used (needed for truncated or hole-punched extents and when reading in the middle of an extent). For reads, the filesystem returns this information; for writes, the caller provides it to the filesystem. iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be used to extend the interface in the future a la copy_struct_from_user(). The remaining iovecs contain the encoded extent. This adds the VFS helpers for supporting encoded I/O and documentation for filesystem support. Reviewed-by: Josef Bacik Signed-off-by: Omar Sandoval --- Documentation/filesystems/encoded_io.rst | 74 ++++++++++ Documentation/filesystems/index.rst | 1 + fs/read_write.c | 168 +++++++++++++++++++++-- include/linux/encoded_io.h | 17 +++ include/linux/fs.h | 13 ++ include/uapi/linux/encoded_io.h | 30 ++++ include/uapi/linux/fs.h | 5 +- 7 files changed, 294 insertions(+), 14 deletions(-) create mode 100644 Documentation/filesystems/encoded_io.rst create mode 100644 include/linux/encoded_io.h create mode 100644 include/uapi/linux/encoded_io.h diff --git a/Documentation/filesystems/encoded_io.rst b/Documentation/filesystems/encoded_io.rst new file mode 100644 index 000000000000..50405276d866 --- /dev/null +++ b/Documentation/filesystems/encoded_io.rst @@ -0,0 +1,74 @@ +=========== +Encoded I/O +=========== + +Encoded I/O is a mechanism for reading and writing encoded (e.g., compressed +and/or encrypted) data directly from/to the filesystem. The userspace interface +is thoroughly described in the :manpage:`encoded_io(7)` man page; this document +describes the requirements for filesystem support. + +First of all, a filesystem supporting encoded I/O must indicate this by setting +the ``FMODE_ENCODED_IO`` flag in its ``file_open`` file operation:: + + static int foo_file_open(struct inode *inode, struct file *filp) + { + ... + filep->f_mode |= FMODE_ENCODED_IO; + ... + } + +Encoded I/O goes through ``read_iter`` and ``write_iter``, designated by the +``IOCB_ENCODED`` flag in ``kiocb->ki_flags``. + +Reads +===== + +Encoded ``read_iter`` should: + +1. Call ``generic_encoded_read_checks()`` to validate the file and buffers + provided by userspace. +2. Initialize the ``encoded_iov`` appropriately. +3. Copy it to the user with ``copy_encoded_iov_to_iter()``. +4. Copy the encoded data to the user. +5. Advance ``kiocb->ki_pos`` by ``encoded_iov->len``. +6. Return the size of the encoded data read, not including the ``encoded_iov``. + +There are a few details to be aware of: + +* Encoded ``read_iter`` should support reading unencoded data if the extent is + not encoded. +* If the buffers provided by the user are not large enough to contain an entire + encoded extent, then ``read_iter`` should return ``-ENOBUFS``. This is to + avoid confusing userspace with truncated data that cannot be properly + decoded. +* Reads in the middle of an encoded extent can be returned by setting + ``encoded_iov->unencoded_offset`` to non-zero. +* Truncated unencoded data (e.g., because the file does not end on a block + boundary) may be returned by setting ``encoded_iov->len`` to a value smaller + value than ``encoded_iov->unencoded_len - encoded_iov->unencoded_offset``. + +Writes +====== + +Encoded ``write_iter`` should (in addition to the usual accounting/checks done +by ``write_iter``): + +1. Call ``copy_encoded_iov_from_iter()`` to get and validate the + ``encoded_iov``. +2. Call ``generic_encoded_write_checks()`` instead of + ``generic_write_checks()``. +3. Check that the provided encoding in ``encoded_iov`` is supported. +4. Advance ``kiocb->ki_pos`` by ``encoded_iov->len``. +5. Return the size of the encoded data written. + +Again, there are a few details: + +* Encoded ``write_iter`` doesn't need to support writing unencoded data. +* ``write_iter`` should either write all of the encoded data or none of it; it + must not do partial writes. +* ``write_iter`` doesn't need to validate the encoded data; a subsequent read + may return, e.g., ``-EIO`` if the data is not valid. +* The user may lie about the unencoded size of the data; a subsequent read + should truncate or zero-extend the unencoded data rather than returning an + error. +* Be careful of page cache coherency. diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 7be9b46d85d9..258b745ea3d8 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -53,6 +53,7 @@ filesystem implementations. journalling fscrypt fsverity + encoded_io Filesystems =========== diff --git a/fs/read_write.c b/fs/read_write.c index 75f764b43418..daa3d131ea99 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -20,6 +20,7 @@ #include #include #include +#include #include "internal.h" #include @@ -1625,24 +1626,15 @@ int generic_write_check_limits(struct file *file, loff_t pos, loff_t *count) return 0; } -/* - * Performs necessary checks before doing a write - * - * Can adjust writing position or amount of bytes to write. - * Returns appropriate error code that caller should return or - * zero in case that write should be allowed. - */ -ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from) +static int generic_write_checks_common(struct kiocb *iocb, loff_t *count) { struct file *file = iocb->ki_filp; struct inode *inode = file->f_mapping->host; - loff_t count; - int ret; if (IS_SWAPFILE(inode)) return -ETXTBSY; - if (!iov_iter_count(from)) + if (!*count) return 0; /* FIXME: this is for backwards compatibility with 2.4 */ @@ -1652,8 +1644,22 @@ ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from) if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT)) return -EINVAL; - count = iov_iter_count(from); - ret = generic_write_check_limits(file, iocb->ki_pos, &count); + return generic_write_check_limits(iocb->ki_filp, iocb->ki_pos, count); +} + +/* + * Performs necessary checks before doing a write + * + * Can adjust writing position or amount of bytes to write. + * Returns appropriate error code that caller should return or + * zero in case that write should be allowed. + */ +ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from) +{ + loff_t count = iov_iter_count(from); + int ret; + + ret = generic_write_checks_common(iocb, &count); if (ret) return ret; @@ -1684,3 +1690,139 @@ int generic_file_rw_checks(struct file *file_in, struct file *file_out) return 0; } + +/** + * generic_encoded_write_checks() - check an encoded write + * @iocb: I/O context. + * @encoded: Encoding metadata. + * + * This should be called by RWF_ENCODED write implementations rather than + * generic_write_checks(). Unlike generic_write_checks(), it returns -EFBIG + * instead of adjusting the size of the write. + * + * Return: 0 on success, -errno on error. + */ +int generic_encoded_write_checks(struct kiocb *iocb, + const struct encoded_iov *encoded) +{ + loff_t count = encoded->len; + int ret; + + if (!(iocb->ki_filp->f_flags & O_ALLOW_ENCODED)) + return -EPERM; + + ret = generic_write_checks_common(iocb, &count); + if (ret) + return ret; + + if (count != encoded->len) { + /* + * The write got truncated by generic_write_checks_common(). We + * can't do a partial encoded write. + */ + return -EFBIG; + } + return 0; +} +EXPORT_SYMBOL(generic_encoded_write_checks); + +/** + * copy_encoded_iov_from_iter() - copy a &struct encoded_iov from userspace + * @encoded: Returned encoding metadata. + * @from: Source iterator. + * + * This copies in the &struct encoded_iov and does some basic sanity checks. + * This should always be used rather than a plain copy_from_iter(), as it does + * the proper handling for backward- and forward-compatibility. + * + * Return: 0 on success, -EFAULT if access to userspace failed, -E2BIG if the + * copied structure contained non-zero fields that this kernel doesn't + * support, -EINVAL if the copied structure was invalid. + */ +int copy_encoded_iov_from_iter(struct encoded_iov *encoded, + struct iov_iter *from) +{ + size_t usize; + int ret; + + usize = iov_iter_single_seg_count(from); + if (usize > PAGE_SIZE) + return -E2BIG; + if (usize < ENCODED_IOV_SIZE_VER0) + return -EINVAL; + ret = copy_struct_from_iter(encoded, sizeof(*encoded), from, usize); + if (ret) + return ret; + + if (encoded->compression == ENCODED_IOV_COMPRESSION_NONE && + encoded->encryption == ENCODED_IOV_ENCRYPTION_NONE) + return -EINVAL; + if (encoded->compression > ENCODED_IOV_COMPRESSION_TYPES || + encoded->encryption > ENCODED_IOV_ENCRYPTION_TYPES) + return -EINVAL; + if (encoded->unencoded_offset > encoded->unencoded_len) + return -EINVAL; + if (encoded->len > encoded->unencoded_len - encoded->unencoded_offset) + return -EINVAL; + return 0; +} +EXPORT_SYMBOL(copy_encoded_iov_from_iter); + +/** + * generic_encoded_read_checks() - sanity check an RWF_ENCODED read + * @iocb: I/O context. + * @iter: Destination iterator for read. + * + * This should always be called by RWF_ENCODED read implementations before + * returning any data. + * + * Return: Number of bytes available to return encoded data in @iter on success, + * -EPERM if the file was not opened with O_ALLOW_ENCODED, -EINVAL if + * the size of the &struct encoded_iov iovec is invalid. + */ +ssize_t generic_encoded_read_checks(struct kiocb *iocb, struct iov_iter *iter) +{ + size_t usize; + + if (!(iocb->ki_filp->f_flags & O_ALLOW_ENCODED)) + return -EPERM; + usize = iov_iter_single_seg_count(iter); + if (usize > PAGE_SIZE || usize < ENCODED_IOV_SIZE_VER0) + return -EINVAL; + return iov_iter_count(iter) - usize; +} +EXPORT_SYMBOL(generic_encoded_read_checks); + +/** + * copy_encoded_iov_to_iter() - copy a &struct encoded_iov to userspace + * @encoded: Encoding metadata to return. + * @to: Destination iterator. + * + * This should always be used by RWF_ENCODED read implementations rather than a + * plain copy_to_iter(), as it does the proper handling for backward- and + * forward-compatibility. The iterator must be sanity-checked with + * generic_encoded_read_checks() before this is called. + * + * Return: 0 on success, -EFAULT if access to userspace failed, -E2BIG if there + * were non-zero fields in @encoded that the user buffer could not + * accommodate. + */ +int copy_encoded_iov_to_iter(const struct encoded_iov *encoded, + struct iov_iter *to) +{ + size_t ksize = sizeof(*encoded); + size_t usize = iov_iter_single_seg_count(to); + size_t size = min(ksize, usize); + + /* We already sanity-checked usize in generic_encoded_read_checks(). */ + + if (usize < ksize && + memchr_inv((char *)encoded + usize, 0, ksize - usize)) + return -E2BIG; + if (copy_to_iter(encoded, size, to) != size || + (usize > ksize && + iov_iter_zero(usize - ksize, to) != usize - ksize)) + return -EFAULT; + return 0; +} +EXPORT_SYMBOL(copy_encoded_iov_to_iter); diff --git a/include/linux/encoded_io.h b/include/linux/encoded_io.h new file mode 100644 index 000000000000..a8cfc0108ba0 --- /dev/null +++ b/include/linux/encoded_io.h @@ -0,0 +1,17 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_ENCODED_IO_H +#define _LINUX_ENCODED_IO_H + +#include + +struct encoded_iov; +struct iov_iter; +struct kiocb; +extern int generic_encoded_write_checks(struct kiocb *, + const struct encoded_iov *); +extern int copy_encoded_iov_from_iter(struct encoded_iov *, struct iov_iter *); +extern ssize_t generic_encoded_read_checks(struct kiocb *, struct iov_iter *); +extern int copy_encoded_iov_to_iter(const struct encoded_iov *, + struct iov_iter *); + +#endif /* _LINUX_ENCODED_IO_H */ diff --git a/include/linux/fs.h b/include/linux/fs.h index fd47deea7c17..2a0700e4efcc 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -178,6 +178,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset, /* File supports async buffered reads */ #define FMODE_BUF_RASYNC ((__force fmode_t)0x40000000) +/* File supports encoded IO */ +#define FMODE_ENCODED_IO ((__force fmode_t)0x80000000) + /* * Attribute flags. These should be or-ed together to figure out what * has been changed! @@ -308,6 +311,7 @@ enum rw_hint { #define IOCB_SYNC (__force int) RWF_SYNC #define IOCB_NOWAIT (__force int) RWF_NOWAIT #define IOCB_APPEND (__force int) RWF_APPEND +#define IOCB_ENCODED (__force int) RWF_ENCODED /* non-RWF related bits - start at 16 */ #define IOCB_EVENTFD (1 << 16) @@ -2961,6 +2965,13 @@ extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *); extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *); extern int generic_write_check_limits(struct file *file, loff_t pos, loff_t *count); +struct encoded_iov; +extern int generic_encoded_write_checks(struct kiocb *, + const struct encoded_iov *); +extern int copy_encoded_iov_from_iter(struct encoded_iov *, struct iov_iter *); +extern ssize_t generic_encoded_read_checks(struct kiocb *, struct iov_iter *); +extern int copy_encoded_iov_to_iter(const struct encoded_iov *, + struct iov_iter *); extern int generic_file_rw_checks(struct file *file_in, struct file *file_out); extern ssize_t generic_file_buffered_read(struct kiocb *iocb, struct iov_iter *to, ssize_t already_read); @@ -3267,6 +3278,8 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags) return -EOPNOTSUPP; kiocb_flags |= IOCB_NOIO; } + if ((flags & RWF_ENCODED) && !(ki->ki_filp->f_mode & FMODE_ENCODED_IO)) + return -EOPNOTSUPP; kiocb_flags |= (__force int) (flags & RWF_SUPPORTED); if (flags & RWF_SYNC) kiocb_flags |= IOCB_DSYNC; diff --git a/include/uapi/linux/encoded_io.h b/include/uapi/linux/encoded_io.h new file mode 100644 index 000000000000..cf741453dba4 --- /dev/null +++ b/include/uapi/linux/encoded_io.h @@ -0,0 +1,30 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _UAPI_LINUX_ENCODED_IO_H +#define _UAPI_LINUX_ENCODED_IO_H + +#include + +#define ENCODED_IOV_COMPRESSION_NONE 0 +#define ENCODED_IOV_COMPRESSION_BTRFS_ZLIB 1 +#define ENCODED_IOV_COMPRESSION_BTRFS_ZSTD 2 +#define ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K 3 +#define ENCODED_IOV_COMPRESSION_BTRFS_LZO_8K 4 +#define ENCODED_IOV_COMPRESSION_BTRFS_LZO_16K 5 +#define ENCODED_IOV_COMPRESSION_BTRFS_LZO_32K 6 +#define ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K 7 +#define ENCODED_IOV_COMPRESSION_TYPES 8 + +#define ENCODED_IOV_ENCRYPTION_NONE 0 +#define ENCODED_IOV_ENCRYPTION_TYPES 1 + +struct encoded_iov { + __aligned_u64 len; + __aligned_u64 unencoded_len; + __aligned_u64 unencoded_offset; + __u32 compression; + __u32 encryption; +}; + +#define ENCODED_IOV_SIZE_VER0 32 + +#endif /* _UAPI_LINUX_ENCODED_IO_H */ diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h index f44eb0a04afd..0800b01524e8 100644 --- a/include/uapi/linux/fs.h +++ b/include/uapi/linux/fs.h @@ -300,8 +300,11 @@ typedef int __bitwise __kernel_rwf_t; /* per-IO O_APPEND */ #define RWF_APPEND ((__force __kernel_rwf_t)0x00000010) +/* encoded (e.g., compressed and/or encrypted) IO */ +#define RWF_ENCODED ((__force __kernel_rwf_t)0x00000020) + /* mask of flags supported by the kernel */ #define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\ - RWF_APPEND) + RWF_APPEND | RWF_ENCODED) #endif /* _UAPI_LINUX_FS_H */ From patchwork Fri Jan 22 20:46:51 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Omar Sandoval X-Patchwork-Id: 12040459 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 66537C43381 for ; Fri, 22 Jan 2021 20:51:33 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 2AAAF23B06 for ; Fri, 22 Jan 2021 20:51:33 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730109AbhAVUvF (ORCPT ); Fri, 22 Jan 2021 15:51:05 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49276 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730229AbhAVUul (ORCPT ); Fri, 22 Jan 2021 15:50:41 -0500 Received: from mail-pg1-x52e.google.com (mail-pg1-x52e.google.com [IPv6:2607:f8b0:4864:20::52e]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E7A51C06121F for ; Fri, 22 Jan 2021 12:47:28 -0800 (PST) Received: by mail-pg1-x52e.google.com with SMTP id i7so4607730pgc.8 for ; Fri, 22 Jan 2021 12:47:28 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=osandov-com.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=6h5hjUeTADVbhi8eHCMzrRhrLIPDNaWWcRgC7w9ijyo=; b=NSlH8Ne2gMtUEugZ8xzh7OlBaW2jT7JgdKEkjoSpfFEnbdGNk/E4akR24gYwxrdqw0 fnXQQc9GpBsdfwJYPIUJtbLP1C/01acNe/9vsGGsBXJ8ScPiy9A9/3N0LOtXygCf6Mpt 0atYzKkGBBCCo3+HHGeXGwGdzWAU9lnRumvLzEykLE5cAkTrvMNhkjHeoLdH5lUWHyoC sBhkF4Fi5tKTGNBHyyGgJVgd4d61bTjwSEeEm0soRXU7J4MNzEZd3XPRpULLlP2bTqIh YfothBa/R0Iv8fSdnhJd5WujYalhtKB+lgnpUlNZEiFCy/NCBTQBkBwmFepVgj/DY2QH 5LQw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=6h5hjUeTADVbhi8eHCMzrRhrLIPDNaWWcRgC7w9ijyo=; b=ZePns+cw8IwfsRVufJoSS2+HFXOcLTRSaUQeVWBGv+LfrGCKSeeJU017tZLQuS3Kif yjiAgQ19FoIpQueZiz6dLYxcUihBw6zk9CC9WjkFyUqfwVjKB9y4dU7MCnmFtD+zboss a70Fn/YTQK1tjHdSFmTwllhsEi+7dPnfFqrCYSrfUrg9GyaQk7mo47j65+CxL096GmAn hD+HCJX5C8nw2/NPEDEbPK1ZzqZvR9GlNzZJSwamqM+9FEf7GbR8idg/8RZfJ87s84/J lpogRx4h2EV96KqPB1U+25e3qWA6k7cQRpMR8UNUbi6VdEi56ayzktwR3FfXxwiJtym9 CtKg== X-Gm-Message-State: AOAM531frM0r7DU1fbDpzEmSLfqoyKPquJ3mhua6fVyflwPegkCvkXvE j7L4pp8T/P1YZjxNVCABMPUqew== X-Google-Smtp-Source: ABdhPJzhrRQ2JKaTHVfq9C0+YrZj8Zv/nyNIO/WRAiVWyezpYTZdzbsa9CF7k6pZ/z2BekWOvSyXyw== X-Received: by 2002:a63:4b5c:: with SMTP id k28mr66913pgl.294.1611348448453; Fri, 22 Jan 2021 12:47:28 -0800 (PST) Received: from relinquished.tfbnw.net ([2620:10d:c090:400::5:ea88]) by smtp.gmail.com with ESMTPSA id j18sm4092900pfc.99.2021.01.22.12.47.26 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 22 Jan 2021 12:47:27 -0800 (PST) From: Omar Sandoval To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org, Al Viro , Christoph Hellwig Cc: Dave Chinner , Jann Horn , Amir Goldstein , Aleksa Sarai , linux-api@vger.kernel.org, kernel-team@fb.com Subject: [PATCH v7 04/10] btrfs: fix check_data_csum() error message for direct I/O Date: Fri, 22 Jan 2021 12:46:51 -0800 Message-Id: <3a20de6d6ea2a8ebbed0637480f9aa8fff8da19c.1611346706.git.osandov@fb.com> X-Mailer: git-send-email 2.30.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Omar Sandoval Commit 1dae796aabf6 ("btrfs: inode: sink parameter start and len to check_data_csum()") replaced the start parameter to check_data_csum() with page_offset(), but page_offset() is not meaningful for direct I/O pages. Bring back the start parameter. Fixes: 265d4ac03fdf ("btrfs: sink parameter start and len to check_data_csum") Signed-off-by: Omar Sandoval Reviewed-by: Josef Bacik --- fs/btrfs/inode.c | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index ef6cb7b620d0..d2ece8554416 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -2947,11 +2947,13 @@ void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start, * @bio_offset: offset to the beginning of the bio (in bytes) * @page: page where is the data to be verified * @pgoff: offset inside the page + * @start: logical offset in the file * * The length of such check is always one sector size. */ static int check_data_csum(struct inode *inode, struct btrfs_io_bio *io_bio, - u32 bio_offset, struct page *page, u32 pgoff) + u32 bio_offset, struct page *page, u32 pgoff, + u64 start) { struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); SHASH_DESC_ON_STACK(shash, fs_info->csum_shash); @@ -2978,8 +2980,8 @@ static int check_data_csum(struct inode *inode, struct btrfs_io_bio *io_bio, kunmap_atomic(kaddr); return 0; zeroit: - btrfs_print_data_csum_error(BTRFS_I(inode), page_offset(page) + pgoff, - csum, csum_expected, io_bio->mirror_num); + btrfs_print_data_csum_error(BTRFS_I(inode), start, csum, csum_expected, + io_bio->mirror_num); if (io_bio->device) btrfs_dev_stat_inc_and_print(io_bio->device, BTRFS_DEV_STAT_CORRUPTION_ERRS); @@ -3032,7 +3034,8 @@ int btrfs_verify_data_csum(struct btrfs_io_bio *io_bio, u32 bio_offset, pg_off += sectorsize, bio_offset += sectorsize) { int ret; - ret = check_data_csum(inode, io_bio, bio_offset, page, pg_off); + ret = check_data_csum(inode, io_bio, bio_offset, page, pg_off, + page_offset(page) + pg_off); if (ret < 0) return -EIO; } @@ -7742,7 +7745,8 @@ static blk_status_t btrfs_check_read_dio_bio(struct inode *inode, ASSERT(pgoff < PAGE_SIZE); if (uptodate && (!csum || !check_data_csum(inode, io_bio, - bio_offset, bvec.bv_page, pgoff))) { + bio_offset, bvec.bv_page, + pgoff, start))) { clean_io_failure(fs_info, failure_tree, io_tree, start, bvec.bv_page, btrfs_ino(BTRFS_I(inode)), From patchwork Fri Jan 22 20:46:52 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Omar Sandoval X-Patchwork-Id: 12040629 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id B30E1C433E0 for ; Fri, 22 Jan 2021 21:15:24 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 7D8C523AF8 for ; Fri, 22 Jan 2021 21:15:24 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729609AbhAVVPU (ORCPT ); Fri, 22 Jan 2021 16:15:20 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49278 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728585AbhAVUus (ORCPT ); Fri, 22 Jan 2021 15:50:48 -0500 Received: from mail-pf1-x432.google.com (mail-pf1-x432.google.com [IPv6:2607:f8b0:4864:20::432]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 91086C061221 for ; Fri, 22 Jan 2021 12:47:31 -0800 (PST) Received: by mail-pf1-x432.google.com with SMTP id q20so4602705pfu.8 for ; Fri, 22 Jan 2021 12:47:31 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=osandov-com.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=VhHGYUrQEFiQHVA8AXG0InFmphCs9/EXfiOgqK1QhVc=; b=bnXW9+hb6TDls7gQhd4oXm/0OkYlqL5pS5rGyf+a5bisR39L4vxDexZY062D/v0SYu x5EIf0X1JLUKwFm61p5/weeRfiQLEFCGA9xY+ouwMo8f8h71jkTe7uOHUARQ4jCzOzQz bpFSLj/ODB/eQsc6N5vFFVvem0LUX2WCIfosPXRWDx91mZn//nDLSzZh6uqNeGoZQIGm CWmTtWpd4gBb4sw4WOZ4KuxrJHH3nMEIY62xNNL2cSUXIrzYZNnz/E8lAV0NrVU6lXmw d0LEYxXU+oa3f/Dt6A28/WIowKBS47Vn7gyE7ezjQ2WspA5E2eKai6JwEN+csC+onK/y rVJw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=VhHGYUrQEFiQHVA8AXG0InFmphCs9/EXfiOgqK1QhVc=; b=rh2/l3vQVOBYqwTUI5xvHSbliYU0xXXPvJYWiisXsfLOypA9WdM7fMdOKYrlhlr1s/ 8oWUnBFQaWpsQq0dFIlutHr7nx8hA/obQybCDhMtdgX8r/rAvFx4Kaz2IEpV/2J7OcY7 mgiM7kwwW9UDlw1ghWXTcXi2lIxtErkIvCgUJmr7jnjwOFJwj7tqY+liTw2QLyjcxgkV xg7dQ7HD3Gou6dePqkyep1w7SuJAyYtbt7kVK8oVbJgrZA7Xz6wzx1DJx624PJBB/SOj TGQHgrnNPa2VcqTxRJR0nsAqWUwVlWVV8I/YNhO7JCZvyXgVJWPHJThNP2pgRtafKz/7 QPlA== X-Gm-Message-State: AOAM531tYEVfv4NUTcMB0AZseKD4/U9uMRr4yQASTs4xgYTrhBsTMnkN rbVXo7sCNONeIN9lJHRf1XnXeA== X-Google-Smtp-Source: ABdhPJwtq/3iY2Ut2QFubzDKCMRWCMZbHi/jmXgZ5Y3vXB5qHMGtMAkBjYTxSZOOkMUt1UqjhbHvjQ== X-Received: by 2002:a63:3303:: with SMTP id z3mr752033pgz.302.1611348451040; Fri, 22 Jan 2021 12:47:31 -0800 (PST) Received: from relinquished.tfbnw.net ([2620:10d:c090:400::5:ea88]) by smtp.gmail.com with ESMTPSA id j18sm4092900pfc.99.2021.01.22.12.47.28 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 22 Jan 2021 12:47:29 -0800 (PST) From: Omar Sandoval To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org, Al Viro , Christoph Hellwig Cc: Dave Chinner , Jann Horn , Amir Goldstein , Aleksa Sarai , linux-api@vger.kernel.org, kernel-team@fb.com Subject: [PATCH v7 05/10] btrfs: don't advance offset for compressed bios in btrfs_csum_one_bio() Date: Fri, 22 Jan 2021 12:46:52 -0800 Message-Id: <455a42339058dab8c4b9c77547dde8daad7f9981.1611346706.git.osandov@fb.com> X-Mailer: git-send-email 2.30.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Omar Sandoval btrfs_csum_one_bio() loops over each filesystem block in the bio while keeping a cursor of its current logical position in the file in order to look up the ordered extent to add the checksums to. However, this doesn't make much sense for compressed extents, as a sector on disk does not correspond to a sector of decompressed file data. It happens to work because 1) the compressed bio always covers one ordered extent and 2) the size of the bio is always less than the size of the ordered extent. However, the second point will not always be true for encoded writes. Let's add a boolean parameter to btrfs_csum_one_bio() to indicate that it can assume that the bio only covers one ordered extent. Since we're already changing the signature, let's get rid of the contig parameter and make it implied by the offset parameter, similar to the change we recently made to btrfs_lookup_bio_sums(). Additionally, let's rename nr_sectors to blockcount to make it clear that it's the number of filesystem blocks, not the number of 512-byte sectors. Reviewed-by: Josef Bacik Signed-off-by: Omar Sandoval --- fs/btrfs/compression.c | 5 +++-- fs/btrfs/ctree.h | 2 +- fs/btrfs/file-item.c | 35 ++++++++++++++++------------------- fs/btrfs/inode.c | 8 ++++---- 4 files changed, 24 insertions(+), 26 deletions(-) diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index 5ae3fa0386b7..d9d1d9e38fc5 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -436,7 +436,8 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start, BUG_ON(ret); /* -ENOMEM */ if (!skip_sum) { - ret = btrfs_csum_one_bio(inode, bio, start, 1); + ret = btrfs_csum_one_bio(inode, bio, start, + true); BUG_ON(ret); /* -ENOMEM */ } @@ -468,7 +469,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start, BUG_ON(ret); /* -ENOMEM */ if (!skip_sum) { - ret = btrfs_csum_one_bio(inode, bio, start, 1); + ret = btrfs_csum_one_bio(inode, bio, start, true); BUG_ON(ret); /* -ENOMEM */ } diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 1bd416e5a731..f1b9a9e42cc6 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3051,7 +3051,7 @@ int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct btrfs_ordered_sum *sums); blk_status_t btrfs_csum_one_bio(struct btrfs_inode *inode, struct bio *bio, - u64 file_start, int contig); + u64 offset, bool one_ordered); int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end, struct list_head *list, int search_commit); void btrfs_extent_item_to_extent_map(struct btrfs_inode *inode, diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c index 6ccfc019ad90..82d110c8dc03 100644 --- a/fs/btrfs/file-item.c +++ b/fs/btrfs/file-item.c @@ -608,28 +608,28 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end, * btrfs_csum_one_bio - Calculates checksums of the data contained inside a bio * @inode: Owner of the data inside the bio * @bio: Contains the data to be checksummed - * @file_start: offset in file this bio begins to describe - * @contig: Boolean. If true/1 means all bio vecs in this bio are - * contiguous and they begin at @file_start in the file. False/0 - * means this bio can contains potentially discontigous bio vecs - * so the logical offset of each should be calculated separately. + * @offset: If (u64)-1, @bio may contain discontiguous bio vecs, so the + * file offsets are determined from the page offsets in the bio. + * Otherwise, this is the starting file offset of the bio vecs in + * @bio, which must be contiguous. + * @one_ordered: If true, @bio only refers to one ordered extent. */ blk_status_t btrfs_csum_one_bio(struct btrfs_inode *inode, struct bio *bio, - u64 file_start, int contig) + u64 offset, bool one_ordered) { struct btrfs_fs_info *fs_info = inode->root->fs_info; SHASH_DESC_ON_STACK(shash, fs_info->csum_shash); struct btrfs_ordered_sum *sums; struct btrfs_ordered_extent *ordered = NULL; + const bool page_offsets = (offset == (u64)-1); char *data; struct bvec_iter iter; struct bio_vec bvec; int index; - int nr_sectors; + int blockcount; unsigned long total_bytes = 0; unsigned long this_sum_bytes = 0; int i; - u64 offset; unsigned nofs_flag; nofs_flag = memalloc_nofs_save(); @@ -643,18 +643,13 @@ blk_status_t btrfs_csum_one_bio(struct btrfs_inode *inode, struct bio *bio, sums->len = bio->bi_iter.bi_size; INIT_LIST_HEAD(&sums->list); - if (contig) - offset = file_start; - else - offset = 0; /* shut up gcc */ - sums->bytenr = bio->bi_iter.bi_sector << 9; index = 0; shash->tfm = fs_info->csum_shash; bio_for_each_segment(bvec, bio, iter) { - if (!contig) + if (page_offsets) offset = page_offset(bvec.bv_page) + bvec.bv_offset; if (!ordered) { @@ -662,13 +657,14 @@ blk_status_t btrfs_csum_one_bio(struct btrfs_inode *inode, struct bio *bio, BUG_ON(!ordered); /* Logic error */ } - nr_sectors = BTRFS_BYTES_TO_BLKS(fs_info, + blockcount = BTRFS_BYTES_TO_BLKS(fs_info, bvec.bv_len + fs_info->sectorsize - 1); - for (i = 0; i < nr_sectors; i++) { - if (offset >= ordered->file_offset + ordered->num_bytes || - offset < ordered->file_offset) { + for (i = 0; i < blockcount; i++) { + if (!one_ordered && + (offset >= ordered->file_offset + ordered->num_bytes || + offset < ordered->file_offset)) { unsigned long bytes_left; sums->len = this_sum_bytes; @@ -699,7 +695,8 @@ blk_status_t btrfs_csum_one_bio(struct btrfs_inode *inode, struct bio *bio, sums->sums + index); kunmap_atomic(data); index += fs_info->csum_size; - offset += fs_info->sectorsize; + if (!one_ordered) + offset += fs_info->sectorsize; this_sum_bytes += fs_info->sectorsize; total_bytes += fs_info->sectorsize; } diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index d2ece8554416..b92c4a1d6d2f 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -2214,7 +2214,7 @@ int btrfs_bio_fits_in_stripe(struct page *page, size_t size, struct bio *bio, static blk_status_t btrfs_submit_bio_start(struct inode *inode, struct bio *bio, u64 dio_file_offset) { - return btrfs_csum_one_bio(BTRFS_I(inode), bio, 0, 0); + return btrfs_csum_one_bio(BTRFS_I(inode), bio, (u64)-1, false); } /* @@ -2282,7 +2282,7 @@ blk_status_t btrfs_submit_data_bio(struct inode *inode, struct bio *bio, 0, btrfs_submit_bio_start); goto out; } else if (!skip_sum) { - ret = btrfs_csum_one_bio(BTRFS_I(inode), bio, 0, 0); + ret = btrfs_csum_one_bio(BTRFS_I(inode), bio, (u64)-1, false); if (ret) goto out; } @@ -7820,7 +7820,7 @@ static blk_status_t btrfs_submit_bio_start_direct_io(struct inode *inode, struct bio *bio, u64 dio_file_offset) { - return btrfs_csum_one_bio(BTRFS_I(inode), bio, dio_file_offset, 1); + return btrfs_csum_one_bio(BTRFS_I(inode), bio, dio_file_offset, false); } static void btrfs_end_dio_bio(struct bio *bio) @@ -7877,7 +7877,7 @@ static inline blk_status_t btrfs_submit_dio_bio(struct bio *bio, * If we aren't doing async submit, calculate the csum of the * bio now. */ - ret = btrfs_csum_one_bio(BTRFS_I(inode), bio, file_offset, 1); + ret = btrfs_csum_one_bio(BTRFS_I(inode), bio, file_offset, false); if (ret) goto err; } else { From patchwork Fri Jan 22 20:46:53 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Omar Sandoval X-Patchwork-Id: 12040627 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6928EC433E6 for ; Fri, 22 Jan 2021 21:15:17 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 35D3823B1B for ; Fri, 22 Jan 2021 21:15:17 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730338AbhAVVOv (ORCPT ); Fri, 22 Jan 2021 16:14:51 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49332 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729609AbhAVUuv (ORCPT ); Fri, 22 Jan 2021 15:50:51 -0500 Received: from mail-pg1-x530.google.com (mail-pg1-x530.google.com [IPv6:2607:f8b0:4864:20::530]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E5CB4C061225 for ; Fri, 22 Jan 2021 12:47:33 -0800 (PST) Received: by mail-pg1-x530.google.com with SMTP id p18so4594327pgm.11 for ; Fri, 22 Jan 2021 12:47:33 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=osandov-com.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=sf/VGTkBOVp1k0vMJxptrl+aS+095yzOgG7VoqiuH4Y=; b=WWXEeqHPa1xzbG17c1b/owMUIb4QMiwRaq2nbKjxEH9kkcTUk4YEMcg9ZDJgiy3Uoe nXUg9OZ6hikIs/BI3dL/WSGECltRSH/qpLlPKEFwg8WpTy6ARxsayhVyhim1modGCyP2 dxnHGJB4c4+zk2f0pKULtlTpW7/JskrCdY4peTaY3h3QDsIjSGWkmpWVifvgVkQL/Mo0 vLFevXV7CO0wkGzHuAf4xaxw+y6qd5J5DQbqLiBTTssXAMu+vRlNXgGvUd6qyPNgfmsJ IoWkMlE5TRDxU6Q4WbzLrGBoKHwF1MEA7gI0BlkxyFf5PfdTRqgpFcmvg+UWlsgcv2N3 YEHw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=sf/VGTkBOVp1k0vMJxptrl+aS+095yzOgG7VoqiuH4Y=; b=pUU9Xar9vDWgYpF1XKgzdiis+5ymICOJU1L55zzmJoM0v98xLKo84u9wifISmaxL6C V1PWHkUUQyfWqqkdVpiSKS1Hz29Cqi2cF4CKMX9uDa/Ff5Z6BsimhrbgRujaiyT1TGof bRrdQSPGykBntTYnXZoo7BXlu/p1SPHjKXEfHAHiw5Ogkip4OTF9/2aExcIndA+TBXyA VKoaCf8RpCReDM1s9UpL1PUOUL2mVp6aSXRnP9TgBKdehYTZngtatce1HVBVozj6tEun DUGVjqfJnrq+TcLQaErz7e5hyObsybpqCQp4XIXabrRjalOR8yHJRu8wA5mn8TCai+Nb u2GQ== X-Gm-Message-State: AOAM531LFzzuSJXuCst4VcP5c/9ahypx9Adit11rmbajWYispBZhh58u 7s3USgN9AzOZkswi0nFxtklOJg== X-Google-Smtp-Source: ABdhPJxKfikyhz2z3V+/3bzXSwYRlDsAuMWnkSgGkQarXU4Vdt2cgxRUmP6O5yxGUXkuK3mmCgLNFA== X-Received: by 2002:aa7:978e:0:b029:1bd:f965:66dd with SMTP id o14-20020aa7978e0000b02901bdf96566ddmr3196881pfp.46.1611348453371; Fri, 22 Jan 2021 12:47:33 -0800 (PST) Received: from relinquished.tfbnw.net ([2620:10d:c090:400::5:ea88]) by smtp.gmail.com with ESMTPSA id j18sm4092900pfc.99.2021.01.22.12.47.31 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 22 Jan 2021 12:47:32 -0800 (PST) From: Omar Sandoval To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org, Al Viro , Christoph Hellwig Cc: Dave Chinner , Jann Horn , Amir Goldstein , Aleksa Sarai , linux-api@vger.kernel.org, kernel-team@fb.com Subject: [PATCH v7 06/10] btrfs: add ram_bytes and offset to btrfs_ordered_extent Date: Fri, 22 Jan 2021 12:46:53 -0800 Message-Id: <5c1bfbc1277255b334edb8f8b319fc7bd1856460.1611346706.git.osandov@fb.com> X-Mailer: git-send-email 2.30.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Omar Sandoval Currently, we only create ordered extents when ram_bytes == num_bytes and offset == 0. However, RWF_ENCODED writes may create extents which only refer to a subset of the full unencoded extent, so we need to plumb these fields through the ordered extent infrastructure and pass them down to insert_reserved_file_extent(). Since we're changing the btrfs_add_ordered_extent* signature, let's get rid of the trivial wrappers and add a kernel-doc. Reviewed-by: Nikolay Borisov Reviewed-by: Josef Bacik Signed-off-by: Omar Sandoval --- fs/btrfs/inode.c | 56 ++++++++++++++++++--------------- fs/btrfs/ordered-data.c | 68 ++++++++++++++++------------------------- fs/btrfs/ordered-data.h | 16 ++++------ 3 files changed, 64 insertions(+), 76 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index b92c4a1d6d2f..139d690f171e 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -912,13 +912,12 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk) goto out_free_reserve; free_extent_map(em); - ret = btrfs_add_ordered_extent_compress(inode, - async_extent->start, - ins.objectid, - async_extent->ram_size, - ins.offset, - BTRFS_ORDERED_COMPRESSED, - async_extent->compress_type); + ret = btrfs_add_ordered_extent(inode, async_extent->start, + async_extent->ram_size, + async_extent->ram_size, + ins.objectid, ins.offset, 0, + 1 << BTRFS_ORDERED_COMPRESSED, + async_extent->compress_type); if (ret) { btrfs_drop_extent_cache(inode, async_extent->start, async_extent->start + @@ -1126,8 +1125,9 @@ static noinline int cow_file_range(struct btrfs_inode *inode, } free_extent_map(em); - ret = btrfs_add_ordered_extent(inode, start, ins.objectid, - ram_size, cur_alloc_size, 0); + ret = btrfs_add_ordered_extent(inode, start, ram_size, ram_size, + ins.objectid, cur_alloc_size, 0, + 0, BTRFS_COMPRESS_NONE); if (ret) goto out_drop_extent_cache; @@ -1768,10 +1768,11 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode, goto error; } free_extent_map(em); - ret = btrfs_add_ordered_extent(inode, cur_offset, - disk_bytenr, num_bytes, - num_bytes, - BTRFS_ORDERED_PREALLOC); + ret = btrfs_add_ordered_extent(inode, + cur_offset, num_bytes, num_bytes, + disk_bytenr, num_bytes, 0, + 1 << BTRFS_ORDERED_PREALLOC, + BTRFS_COMPRESS_NONE); if (ret) { btrfs_drop_extent_cache(inode, cur_offset, cur_offset + num_bytes - 1, @@ -1780,9 +1781,11 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode, } } else { ret = btrfs_add_ordered_extent(inode, cur_offset, + num_bytes, num_bytes, disk_bytenr, num_bytes, - num_bytes, - BTRFS_ORDERED_NOCOW); + 0, + 1 << BTRFS_ORDERED_NOCOW, + BTRFS_COMPRESS_NONE); if (ret) goto error; } @@ -2586,6 +2589,7 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans, struct btrfs_key ins; u64 disk_num_bytes = btrfs_stack_file_extent_disk_num_bytes(stack_fi); u64 disk_bytenr = btrfs_stack_file_extent_disk_bytenr(stack_fi); + u64 offset = btrfs_stack_file_extent_offset(stack_fi); u64 num_bytes = btrfs_stack_file_extent_num_bytes(stack_fi); u64 ram_bytes = btrfs_stack_file_extent_ram_bytes(stack_fi); struct btrfs_drop_extents_args drop_args = { 0 }; @@ -2660,7 +2664,8 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans, goto out; ret = btrfs_alloc_reserved_file_extent(trans, root, btrfs_ino(inode), - file_pos, qgroup_reserved, &ins); + file_pos - offset, + qgroup_reserved, &ins); out: btrfs_free_path(path); @@ -2686,20 +2691,20 @@ static int insert_ordered_extent_file_extent(struct btrfs_trans_handle *trans, struct btrfs_ordered_extent *oe) { struct btrfs_file_extent_item stack_fi; - u64 logical_len; bool update_inode_bytes; + u64 num_bytes = oe->num_bytes; + u64 ram_bytes = oe->ram_bytes; memset(&stack_fi, 0, sizeof(stack_fi)); btrfs_set_stack_file_extent_type(&stack_fi, BTRFS_FILE_EXTENT_REG); btrfs_set_stack_file_extent_disk_bytenr(&stack_fi, oe->disk_bytenr); btrfs_set_stack_file_extent_disk_num_bytes(&stack_fi, oe->disk_num_bytes); + btrfs_set_stack_file_extent_offset(&stack_fi, oe->offset); if (test_bit(BTRFS_ORDERED_TRUNCATED, &oe->flags)) - logical_len = oe->truncated_len; - else - logical_len = oe->num_bytes; - btrfs_set_stack_file_extent_num_bytes(&stack_fi, logical_len); - btrfs_set_stack_file_extent_ram_bytes(&stack_fi, logical_len); + num_bytes = ram_bytes = oe->truncated_len; + btrfs_set_stack_file_extent_num_bytes(&stack_fi, num_bytes); + btrfs_set_stack_file_extent_ram_bytes(&stack_fi, ram_bytes); btrfs_set_stack_file_extent_compression(&stack_fi, oe->compress_type); /* Encryption and other encoding is reserved and all 0 */ @@ -7053,8 +7058,11 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode, if (IS_ERR(em)) goto out; } - ret = btrfs_add_ordered_extent_dio(inode, start, block_start, len, - block_len, type); + ret = btrfs_add_ordered_extent(inode, start, len, len, block_start, + block_len, 0, + (1 << type) | + (1 << BTRFS_ORDERED_DIRECT), + BTRFS_COMPRESS_NONE); if (ret) { if (em) { free_extent_map(em); diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c index d5d326c674b1..e11a517fab10 100644 --- a/fs/btrfs/ordered-data.c +++ b/fs/btrfs/ordered-data.c @@ -153,16 +153,27 @@ static inline struct rb_node *tree_search(struct btrfs_ordered_inode_tree *tree, return ret; } -/* - * Allocate and add a new ordered_extent into the per-inode tree. +/** + * btrfs_add_ordered_extent - Add an ordered extent to the per-inode tree. + * @inode: inode that this extent is for. + * @file_offset: Logical offset in file where the extent starts. + * @num_bytes: Logical length of extent in file. + * @ram_bytes: Full length of unencoded data. + * @disk_bytenr: Offset of extent on disk. + * @disk_num_bytes: Size of extent on disk. + * @offset: Offset into unencoded data where file data starts. + * @flags: Flags specifying type of extent (1 << BTRFS_ORDERED_*). + * @compress_type: Compression algorithm used for data. * - * The tree is given a single reference on the ordered extent that was - * inserted. + * Most of these parameters correspond to &struct btrfs_file_extent_item. The + * tree is given a single reference on the ordered extent that was inserted. + * + * Return: 0 or -ENOMEM. */ -static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset, - u64 disk_bytenr, u64 num_bytes, - u64 disk_num_bytes, int type, int dio, - int compress_type) +int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset, + u64 num_bytes, u64 ram_bytes, u64 disk_bytenr, + u64 disk_num_bytes, u64 offset, int flags, + int compress_type) { struct btrfs_root *root = inode->root; struct btrfs_fs_info *fs_info = root->fs_info; @@ -171,7 +182,8 @@ static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset struct btrfs_ordered_extent *entry; int ret; - if (type == BTRFS_ORDERED_NOCOW || type == BTRFS_ORDERED_PREALLOC) { + if (flags & + ((1 << BTRFS_ORDERED_NOCOW) | (1 << BTRFS_ORDERED_PREALLOC))) { /* For nocow write, we can release the qgroup rsv right now */ ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes); if (ret < 0) @@ -191,21 +203,21 @@ static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset return -ENOMEM; entry->file_offset = file_offset; - entry->disk_bytenr = disk_bytenr; entry->num_bytes = num_bytes; + entry->ram_bytes = ram_bytes; + entry->disk_bytenr = disk_bytenr; entry->disk_num_bytes = disk_num_bytes; + entry->offset = offset; entry->bytes_left = num_bytes; entry->inode = igrab(&inode->vfs_inode); entry->compress_type = compress_type; entry->truncated_len = (u64)-1; entry->qgroup_rsv = ret; - if (type != BTRFS_ORDERED_IO_DONE && type != BTRFS_ORDERED_COMPLETE) - set_bit(type, &entry->flags); - if (dio) { + entry->flags = flags; + if (flags & (1 << BTRFS_ORDERED_DIRECT)) { percpu_counter_add_batch(&fs_info->dio_bytes, num_bytes, fs_info->delalloc_batch); - set_bit(BTRFS_ORDERED_DIRECT, &entry->flags); } /* one ref for the tree */ @@ -252,34 +264,6 @@ static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset return 0; } -int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset, - u64 disk_bytenr, u64 num_bytes, u64 disk_num_bytes, - int type) -{ - return __btrfs_add_ordered_extent(inode, file_offset, disk_bytenr, - num_bytes, disk_num_bytes, type, 0, - BTRFS_COMPRESS_NONE); -} - -int btrfs_add_ordered_extent_dio(struct btrfs_inode *inode, u64 file_offset, - u64 disk_bytenr, u64 num_bytes, - u64 disk_num_bytes, int type) -{ - return __btrfs_add_ordered_extent(inode, file_offset, disk_bytenr, - num_bytes, disk_num_bytes, type, 1, - BTRFS_COMPRESS_NONE); -} - -int btrfs_add_ordered_extent_compress(struct btrfs_inode *inode, u64 file_offset, - u64 disk_bytenr, u64 num_bytes, - u64 disk_num_bytes, int type, - int compress_type) -{ - return __btrfs_add_ordered_extent(inode, file_offset, disk_bytenr, - num_bytes, disk_num_bytes, type, 0, - compress_type); -} - /* * Add a struct btrfs_ordered_sum into the list of checksums to be inserted * when an ordered extent is finished. If the list covers more than one diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h index 46194c2c05d4..8c8d6ed7a350 100644 --- a/fs/btrfs/ordered-data.h +++ b/fs/btrfs/ordered-data.h @@ -72,9 +72,11 @@ struct btrfs_ordered_extent { * These fields directly correspond to the same fields in * btrfs_file_extent_item. */ - u64 disk_bytenr; u64 num_bytes; + u64 ram_bytes; + u64 disk_bytenr; u64 disk_num_bytes; + u64 offset; /* number of bytes that still need writing */ u64 bytes_left; @@ -160,15 +162,9 @@ bool btrfs_dec_test_first_ordered_pending(struct btrfs_inode *inode, u64 *file_offset, u64 io_size, int uptodate); int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset, - u64 disk_bytenr, u64 num_bytes, u64 disk_num_bytes, - int type); -int btrfs_add_ordered_extent_dio(struct btrfs_inode *inode, u64 file_offset, - u64 disk_bytenr, u64 num_bytes, - u64 disk_num_bytes, int type); -int btrfs_add_ordered_extent_compress(struct btrfs_inode *inode, u64 file_offset, - u64 disk_bytenr, u64 num_bytes, - u64 disk_num_bytes, int type, - int compress_type); + u64 num_bytes, u64 ram_bytes, u64 disk_bytenr, + u64 disk_num_bytes, u64 offset, int flags, + int compress_type); void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry, struct btrfs_ordered_sum *sum); struct btrfs_ordered_extent *btrfs_lookup_ordered_extent(struct btrfs_inode *inode, From patchwork Fri Jan 22 20:46:54 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Omar Sandoval X-Patchwork-Id: 12040625 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-13.9 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,UNWANTED_LANGUAGE_BODY, URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 34897C433E0 for ; Fri, 22 Jan 2021 21:15:17 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id ECB4323B09 for ; Fri, 22 Jan 2021 21:15:16 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730042AbhAVVOj (ORCPT ); Fri, 22 Jan 2021 16:14:39 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49340 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729626AbhAVUuw (ORCPT ); Fri, 22 Jan 2021 15:50:52 -0500 Received: from mail-pl1-x636.google.com (mail-pl1-x636.google.com [IPv6:2607:f8b0:4864:20::636]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4DF97C0611C0 for ; Fri, 22 Jan 2021 12:47:36 -0800 (PST) Received: by mail-pl1-x636.google.com with SMTP id d4so3979986plh.5 for ; Fri, 22 Jan 2021 12:47:36 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=osandov-com.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=Kw3F5rWDMcL6nOv1EKwx/gx++jd0GsSbmo5x3yxoxT0=; b=0Idg9QUUqpDhVaSM4ZrUNVzHPYRC9BwAS1R/vQP+EGu1+AtS6iFj5eHoyqtnTOZXZk ID3mwfTUpUa2zNhB1zOyxdbCPoEyQL5yzHoREmzTfYDPT5jwJybSKTZgMpWoOZKoJapI UyQ/M5r7mQl+nZfYxu4/3MQxldAQCEXuxIZbp0YpmA/SFT3mm4i4W12FnMaFH6fXcsNR WqmJnB7ygbXpqWLNcPZkfscUAgbpKYPWbWV+3Hx3aA3e5g8hXReBzUqsf4/zP6Uh5PWc haqqGpQ+MtYdVTs8p6SGdeWX7OX82ULXD1JIKkHBLdtZixTku8sWt486YDIglboAdpnN n1/A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=Kw3F5rWDMcL6nOv1EKwx/gx++jd0GsSbmo5x3yxoxT0=; b=TVuMfn3uErSkqYLIx74dKk5y0BGpE1g0dOjYAgg2sJoDC7gb+SU2F6nrwnNwUqRohD xRUxu2C3tbTU+Oc5sjyuN00SOa4Mb+emwj+x+zbZmmcZocz9tx3SAGNY7hvigM5IC5Gn AT73xvugn7N+q+3k3EEVGs3IqPVqPp3XqpsI2ZdHBfLSRqcN+VrxQ7FTqBVUeHtJRxG3 biRl950LLc5dQbjFeoKgKCq9DztCWdNKCKKhkF98xCRQv8VxJEn48uGP/cVzJe4qH85p 3V2ZLjzkNqQ1DCSp/l9u7C20cd93LTsS1d0p6SwcgW6isV2Zii0fwd4cHbIDCJHILPFW 5gHg== X-Gm-Message-State: AOAM5324dbl0zsc59Vdlhju+YlgoCA51PwGtD4EX89JUzQHOcSnx5Lsq gjDniXl5/pkfYnGw+92bfzuJVA== X-Google-Smtp-Source: ABdhPJx2Kbm4mC28XaaS1EC0zEMrJVAG7ReNdBuiUIt311GT2CnHbpO7sKgNAuS3rmUIQtSzt0qrmw== X-Received: by 2002:a17:902:7102:b029:de:aa85:e04e with SMTP id a2-20020a1709027102b02900deaa85e04emr6425718pll.23.1611348455821; Fri, 22 Jan 2021 12:47:35 -0800 (PST) Received: from relinquished.tfbnw.net ([2620:10d:c090:400::5:ea88]) by smtp.gmail.com with ESMTPSA id j18sm4092900pfc.99.2021.01.22.12.47.33 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 22 Jan 2021 12:47:34 -0800 (PST) From: Omar Sandoval To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org, Al Viro , Christoph Hellwig Cc: Dave Chinner , Jann Horn , Amir Goldstein , Aleksa Sarai , linux-api@vger.kernel.org, kernel-team@fb.com Subject: [PATCH v7 07/10] btrfs: support different disk extent size for delalloc Date: Fri, 22 Jan 2021 12:46:54 -0800 Message-Id: <781bf3370a61ad265cff3b89fc8c168d6146ad01.1611346706.git.osandov@fb.com> X-Mailer: git-send-email 2.30.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Omar Sandoval Currently, we always reserve the same extent size in the file and extent size on disk for delalloc because the former is the worst case for the latter. For RWF_ENCODED writes, we know the exact size of the extent on disk, which may be less than or greater than (for bookends) the size in the file. Add a disk_num_bytes parameter to btrfs_delalloc_reserve_metadata() so that we can reserve the correct amount of csum bytes. No functional change. Reviewed-by: Nikolay Borisov Reviewed-by: Josef Bacik Signed-off-by: Omar Sandoval --- fs/btrfs/ctree.h | 3 ++- fs/btrfs/delalloc-space.c | 18 ++++++++++-------- fs/btrfs/file.c | 3 ++- fs/btrfs/inode.c | 2 +- fs/btrfs/relocation.c | 4 ++-- 5 files changed, 17 insertions(+), 13 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index f1b9a9e42cc6..3a0c4fb4c657 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -2746,7 +2746,8 @@ void btrfs_subvolume_release_metadata(struct btrfs_root *root, struct btrfs_block_rsv *rsv); void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes); -int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes); +int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes, + u64 disk_num_bytes); u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo); int btrfs_error_unpin_extent_range(struct btrfs_fs_info *fs_info, u64 start, u64 end); diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c index bacee09b7bfd..948b78f03f63 100644 --- a/fs/btrfs/delalloc-space.c +++ b/fs/btrfs/delalloc-space.c @@ -265,11 +265,11 @@ static void btrfs_calculate_inode_block_rsv_size(struct btrfs_fs_info *fs_info, } static void calc_inode_reservations(struct btrfs_fs_info *fs_info, - u64 num_bytes, u64 *meta_reserve, - u64 *qgroup_reserve) + u64 num_bytes, u64 disk_num_bytes, + u64 *meta_reserve, u64 *qgroup_reserve) { u64 nr_extents = count_max_extents(num_bytes); - u64 csum_leaves = btrfs_csum_bytes_to_leaves(fs_info, num_bytes); + u64 csum_leaves = btrfs_csum_bytes_to_leaves(fs_info, disk_num_bytes); u64 inode_update = btrfs_calc_metadata_size(fs_info, 1); *meta_reserve = btrfs_calc_insert_metadata_size(fs_info, @@ -283,7 +283,8 @@ static void calc_inode_reservations(struct btrfs_fs_info *fs_info, *qgroup_reserve = nr_extents * fs_info->nodesize; } -int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes) +int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes, + u64 disk_num_bytes) { struct btrfs_root *root = inode->root; struct btrfs_fs_info *fs_info = root->fs_info; @@ -313,6 +314,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes) } num_bytes = ALIGN(num_bytes, fs_info->sectorsize); + disk_num_bytes = ALIGN(disk_num_bytes, fs_info->sectorsize); /* * We always want to do it this way, every other way is wrong and ends @@ -324,8 +326,8 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes) * everything out and try again, which is bad. This way we just * over-reserve slightly, and clean up the mess when we are done. */ - calc_inode_reservations(fs_info, num_bytes, &meta_reserve, - &qgroup_reserve); + calc_inode_reservations(fs_info, num_bytes, disk_num_bytes, + &meta_reserve, &qgroup_reserve); ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_reserve, true); if (ret) return ret; @@ -344,7 +346,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes) spin_lock(&inode->lock); nr_extents = count_max_extents(num_bytes); btrfs_mod_outstanding_extents(inode, nr_extents); - inode->csum_bytes += num_bytes; + inode->csum_bytes += disk_num_bytes; btrfs_calculate_inode_block_rsv_size(fs_info, inode); spin_unlock(&inode->lock); @@ -448,7 +450,7 @@ int btrfs_delalloc_reserve_space(struct btrfs_inode *inode, ret = btrfs_check_data_free_space(inode, reserved, start, len); if (ret < 0) return ret; - ret = btrfs_delalloc_reserve_metadata(inode, len); + ret = btrfs_delalloc_reserve_metadata(inode, len, len); if (ret < 0) btrfs_free_reserved_data_space(inode, *reserved, start, len); return ret; diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index d81ae1f518f2..32578f31cf43 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1733,7 +1733,8 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb, fs_info->sectorsize); WARN_ON(reserve_bytes == 0); ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), - reserve_bytes); + reserve_bytes, + reserve_bytes); if (ret) { if (!only_release_metadata) btrfs_free_reserved_data_space(BTRFS_I(inode), diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 139d690f171e..3c1f1297c9d9 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -4712,7 +4712,7 @@ int btrfs_truncate_block(struct btrfs_inode *inode, loff_t from, loff_t len, goto out; } } - ret = btrfs_delalloc_reserve_metadata(inode, blocksize); + ret = btrfs_delalloc_reserve_metadata(inode, blocksize, blocksize); if (ret < 0) { if (!only_release_metadata) btrfs_free_reserved_data_space(inode, data_reserved, diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c index 9f2289bcdde6..85e9e7287bfd 100644 --- a/fs/btrfs/relocation.c +++ b/fs/btrfs/relocation.c @@ -2660,8 +2660,8 @@ static int relocate_file_extent_cluster(struct inode *inode, index = (cluster->start - offset) >> PAGE_SHIFT; last_index = (cluster->end - offset) >> PAGE_SHIFT; while (index <= last_index) { - ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), - PAGE_SIZE); + ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), PAGE_SIZE, + PAGE_SIZE); if (ret) goto out; From patchwork Fri Jan 22 20:46:55 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Omar Sandoval X-Patchwork-Id: 12040509 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 14ACDC433E9 for ; Fri, 22 Jan 2021 20:52:13 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id D1F3323B08 for ; Fri, 22 Jan 2021 20:52:12 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730645AbhAVUwJ (ORCPT ); Fri, 22 Jan 2021 15:52:09 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49776 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730791AbhAVUvU (ORCPT ); Fri, 22 Jan 2021 15:51:20 -0500 Received: from mail-pf1-x435.google.com (mail-pf1-x435.google.com [IPv6:2607:f8b0:4864:20::435]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 23CB0C0611C3 for ; Fri, 22 Jan 2021 12:47:39 -0800 (PST) Received: by mail-pf1-x435.google.com with SMTP id j12so4592712pfj.12 for ; Fri, 22 Jan 2021 12:47:39 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=osandov-com.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=CKZzF4yC72/mGdDfVW+0y5Y0vRVlRm12hctIRk+n9zk=; b=CnlU8cDk7vOZvo3jtxN1dw2hD8lI7Ua6iLh6t0aSwS43O3v7C/Y/j2boamRNwsC8az wQ0INuOVo666YQ6MuTsOb3wP/KWe4GQT0CAt++l7kQar/lQpw2STvIV/zVvQOJE/+1nB 506mpMpRLcfieZQ5tKj9Lyk591+bRnaKSccUzwnw7AWbALUlnkBe5BB5MGNjBG8tn6O2 1n9SCUtYpbwIF2DK5SOS4TcmEbTi6BXK8wrXwBz8OUAqx6m2nvIXB3eTWcNtJYxk1iO6 XYouyPozrQgy3eEJD7iP7pIIRNPo6IUYM31TC6L/BcEtwkVgdemjVWj5N+pYItuu4+o7 GCMg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=CKZzF4yC72/mGdDfVW+0y5Y0vRVlRm12hctIRk+n9zk=; b=CeXRhtLVZjZJQrEGFmvX5+HeKJa3pd1cT0F3B2fAGdUdq1t5UgSREzuDD19YmxIsyS 1ON1u8e5ATbkeR8M9Rgzc0dehg+lZN1LcypaPJMWjTi3HAIsNexm3wu7pnOITabEn07g Zw5uiwJkvKuf0cm0vaqhjVcUKCCoL1BC54J60cnfFcaoeZTeo/qOm6c9knQryouX1cbl YRqv/1FUew60SF2DUDdYtLQlh3mts4A2V2XLI84NpU5xLzGlzQDbXOXLNTn8ZHZzd6J/ DpR3+cq28AsaYc7Z3LGaD4xiPxqu4t6L/ljWy/nEH7ZDK3mjiin2PWkFR5RLlj0u9sIE Kghw== X-Gm-Message-State: AOAM530BX1nhTrwOLXKrEacinDjMcs7OeYaL92AsOiWaqVDP9xohM07K qng+7rurATT/uiHe6QRscctSEA== X-Google-Smtp-Source: ABdhPJxMUyhebHxm+DW5OavGApjzSj618wIFbO3LWFjQBbFUHW1AgYfEZe5aZ/7W4bmcf7UMFqj3XQ== X-Received: by 2002:a62:5547:0:b029:1a4:cb2a:2833 with SMTP id j68-20020a6255470000b02901a4cb2a2833mr6536895pfb.35.1611348458605; Fri, 22 Jan 2021 12:47:38 -0800 (PST) Received: from relinquished.tfbnw.net ([2620:10d:c090:400::5:ea88]) by smtp.gmail.com with ESMTPSA id j18sm4092900pfc.99.2021.01.22.12.47.36 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 22 Jan 2021 12:47:37 -0800 (PST) From: Omar Sandoval To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org, Al Viro , Christoph Hellwig Cc: Dave Chinner , Jann Horn , Amir Goldstein , Aleksa Sarai , linux-api@vger.kernel.org, kernel-team@fb.com Subject: [PATCH v7 08/10] btrfs: optionally extend i_size in cow_file_range_inline() Date: Fri, 22 Jan 2021 12:46:55 -0800 Message-Id: <99661081363ab004c73361aaca4a418a11fc10fc.1611346706.git.osandov@fb.com> X-Mailer: git-send-email 2.30.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Omar Sandoval Currently, an inline extent is always created after i_size is extended from btrfs_dirty_pages(). However, for encoded writes, we only want to update i_size after we successfully created the inline extent. Add an update_i_size parameter to cow_file_range_inline() and insert_inline_extent() and pass in the size of the extent rather than determining it from i_size. Since the start parameter is always passed as 0, get rid of it and simplify the logic in these two functions. While we're here, let's document the requirements for creating an inline extent. Reviewed-by: Josef Bacik Signed-off-by: Omar Sandoval --- fs/btrfs/inode.c | 100 +++++++++++++++++++++++------------------------ 1 file changed, 48 insertions(+), 52 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 3c1f1297c9d9..b362e73af840 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -203,9 +203,10 @@ static int btrfs_init_inode_security(struct btrfs_trans_handle *trans, static int insert_inline_extent(struct btrfs_trans_handle *trans, struct btrfs_path *path, bool extent_inserted, struct btrfs_root *root, struct inode *inode, - u64 start, size_t size, size_t compressed_size, + size_t size, size_t compressed_size, int compress_type, - struct page **compressed_pages) + struct page **compressed_pages, + bool update_i_size) { struct extent_buffer *leaf; struct page *page = NULL; @@ -214,7 +215,7 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans, struct btrfs_file_extent_item *ei; int ret; size_t cur_size = size; - unsigned long offset; + u64 i_size; ASSERT((compressed_size > 0 && compressed_pages) || (compressed_size == 0 && !compressed_pages)); @@ -227,7 +228,7 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans, size_t datasize; key.objectid = btrfs_ino(BTRFS_I(inode)); - key.offset = start; + key.offset = 0; key.type = BTRFS_EXTENT_DATA_KEY; datasize = btrfs_file_extent_calc_inline_size(cur_size); @@ -265,12 +266,10 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans, btrfs_set_file_extent_compression(leaf, ei, compress_type); } else { - page = find_get_page(inode->i_mapping, - start >> PAGE_SHIFT); + page = find_get_page(inode->i_mapping, 0); btrfs_set_file_extent_compression(leaf, ei, 0); kaddr = kmap_atomic(page); - offset = offset_in_page(start); - write_extent_buffer(leaf, kaddr + offset, ptr, size); + write_extent_buffer(leaf, kaddr, ptr, size); kunmap_atomic(kaddr); put_page(page); } @@ -281,8 +280,8 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans, * We align size to sectorsize for inline extents just for simplicity * sake. */ - size = ALIGN(size, root->fs_info->sectorsize); - ret = btrfs_inode_set_file_extent_range(BTRFS_I(inode), start, size); + ret = btrfs_inode_set_file_extent_range(BTRFS_I(inode), 0, + ALIGN(size, root->fs_info->sectorsize)); if (ret) goto fail; @@ -295,7 +294,13 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans, * before we unlock the pages. Otherwise we * could end up racing with unlink. */ - BTRFS_I(inode)->disk_i_size = inode->i_size; + i_size = i_size_read(inode); + if (update_i_size && size > i_size) { + i_size_write(inode, size); + i_size = size; + } + BTRFS_I(inode)->disk_i_size = i_size; + fail: return ret; } @@ -306,35 +311,31 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans, * does the checks required to make sure the data is small enough * to fit as an inline extent. */ -static noinline int cow_file_range_inline(struct btrfs_inode *inode, u64 start, - u64 end, size_t compressed_size, +static noinline int cow_file_range_inline(struct btrfs_inode *inode, u64 size, + size_t compressed_size, int compress_type, - struct page **compressed_pages) + struct page **compressed_pages, + bool update_i_size) { struct btrfs_drop_extents_args drop_args = { 0 }; struct btrfs_root *root = inode->root; struct btrfs_fs_info *fs_info = root->fs_info; struct btrfs_trans_handle *trans; - u64 isize = i_size_read(&inode->vfs_inode); - u64 actual_end = min(end + 1, isize); - u64 inline_len = actual_end - start; - u64 aligned_end = ALIGN(end, fs_info->sectorsize); - u64 data_len = inline_len; + u64 data_len = compressed_size ? compressed_size : size; int ret; struct btrfs_path *path; - if (compressed_size) - data_len = compressed_size; - - if (start > 0 || - actual_end > fs_info->sectorsize || + /* + * We can create an inline extent if it ends at or beyond the current + * i_size, is no larger than a sector (decompressed), and the (possibly + * compressed) data fits in a leaf and the configured maximum inline + * size. + */ + if (size < i_size_read(&inode->vfs_inode) || + size > fs_info->sectorsize || data_len > BTRFS_MAX_INLINE_DATA_SIZE(fs_info) || - (!compressed_size && - (actual_end & (fs_info->sectorsize - 1)) == 0) || - end + 1 < isize || - data_len > fs_info->max_inline) { + data_len > fs_info->max_inline) return 1; - } path = btrfs_alloc_path(); if (!path) @@ -348,30 +349,21 @@ static noinline int cow_file_range_inline(struct btrfs_inode *inode, u64 start, trans->block_rsv = &inode->block_rsv; drop_args.path = path; - drop_args.start = start; - drop_args.end = aligned_end; + drop_args.start = 0; + drop_args.end = fs_info->sectorsize; drop_args.drop_cache = true; drop_args.replace_extent = true; - - if (compressed_size && compressed_pages) - drop_args.extent_item_size = btrfs_file_extent_calc_inline_size( - compressed_size); - else - drop_args.extent_item_size = btrfs_file_extent_calc_inline_size( - inline_len); - + drop_args.extent_item_size = btrfs_file_extent_calc_inline_size(data_len); ret = btrfs_drop_extents(trans, root, inode, &drop_args); if (ret) { btrfs_abort_transaction(trans, ret); goto out; } - if (isize > actual_end) - inline_len = min_t(u64, isize, actual_end); - ret = insert_inline_extent(trans, path, drop_args.extent_inserted, - root, &inode->vfs_inode, start, - inline_len, compressed_size, - compress_type, compressed_pages); + ret = insert_inline_extent(trans, path, drop_args.extent_inserted, root, + &inode->vfs_inode, size, compressed_size, + compress_type, compressed_pages, + update_i_size); if (ret && ret != -ENOSPC) { btrfs_abort_transaction(trans, ret); goto out; @@ -380,7 +372,7 @@ static noinline int cow_file_range_inline(struct btrfs_inode *inode, u64 start, goto out; } - btrfs_update_inode_bytes(inode, inline_len, drop_args.bytes_found); + btrfs_update_inode_bytes(inode, size, drop_args.bytes_found); ret = btrfs_update_inode(trans, root, inode); if (ret && ret != -ENOSPC) { btrfs_abort_transaction(trans, ret); @@ -661,14 +653,15 @@ static noinline int compress_file_range(struct async_chunk *async_chunk) /* we didn't compress the entire range, try * to make an uncompressed inline extent. */ - ret = cow_file_range_inline(BTRFS_I(inode), start, end, + ret = cow_file_range_inline(BTRFS_I(inode), actual_end, 0, BTRFS_COMPRESS_NONE, - NULL); + NULL, false); } else { /* try making a compressed inline extent */ - ret = cow_file_range_inline(BTRFS_I(inode), start, end, + ret = cow_file_range_inline(BTRFS_I(inode), actual_end, total_compressed, - compress_type, pages); + compress_type, pages, + false); } if (ret <= 0) { unsigned long clear_flags = EXTENT_DELALLOC | @@ -1056,9 +1049,12 @@ static noinline int cow_file_range(struct btrfs_inode *inode, inode_should_defrag(inode, start, end, num_bytes, SZ_64K); if (start == 0) { + u64 actual_end = min_t(u64, i_size_read(&inode->vfs_inode), + end + 1); + /* lets try to make an inline extent */ - ret = cow_file_range_inline(inode, start, end, 0, - BTRFS_COMPRESS_NONE, NULL); + ret = cow_file_range_inline(inode, actual_end, 0, + BTRFS_COMPRESS_NONE, NULL, false); if (ret == 0) { /* * We use DO_ACCOUNTING here because we need the From patchwork Fri Jan 22 20:46:56 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Omar Sandoval X-Patchwork-Id: 12040511 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9624EC433E0 for ; Fri, 22 Jan 2021 20:52:17 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 5B43823B06 for ; Fri, 22 Jan 2021 20:52:17 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730214AbhAVUwN (ORCPT ); Fri, 22 Jan 2021 15:52:13 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49788 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730010AbhAVUvZ (ORCPT ); Fri, 22 Jan 2021 15:51:25 -0500 Received: from mail-pf1-x42a.google.com (mail-pf1-x42a.google.com [IPv6:2607:f8b0:4864:20::42a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3733EC061A10 for ; Fri, 22 Jan 2021 12:47:42 -0800 (PST) Received: by mail-pf1-x42a.google.com with SMTP id f63so4594480pfa.13 for ; Fri, 22 Jan 2021 12:47:42 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=osandov-com.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=TUZdv8lxBSmweb1rMFRfNKKi+9BkXN8FPRaotQfYADg=; b=QZ9seJF4f5ecUWZgBkL5NZtBqQSGDSnM9Uuze26BzKRsGk4+lOJyz5YFzlGSZWCHM9 v9vNUCfMk3gMk1Fo7+PFiuMwSQHyWxgIoX25peW/yOKCVmuP7Z1iHJCnOvF7Cm+T9Zlu wpasYSf7TY7idmqv9ElSHSfA8lchcQN+cdfNERjQdp9ufr8hfwbFFLVHDmYHQVkfnYyZ JwJUmtpjZQN3CN8mwjU/NijyZROTlKMwksarQniE8t5dF8CYYerzlQMaLyZrvF6n+RTy 3ED1k0rQkywCZtVxDBk2GfrC5+D0UBK3P145fjFKXdlr9Zbva3aH7qk7zxMJ2vx5VkeS cy4w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=TUZdv8lxBSmweb1rMFRfNKKi+9BkXN8FPRaotQfYADg=; b=qBZ3lqsVVb4NDNQXlKnZCQaYIDrjeQ9IknbSV/6Sejt96hwAzg760CqvT0iWk3gxOA ysyrVpVXkdrkoJvVFbmaiPxaeGV960q9bsMMsb1qyqHB9XmJBlAQHGfLt+6eRSr6Hagq NLNVqegeny4uPepxOFCpyHWimcboTt85VANTVkbJKEd3R5/Sg1c18lgMk1vcgQVIVNhM U5YnmDQAF8UF/uc0bFfDiKxPdpxcGBQzDz9mYDTw3xIa7hUGDsu/G3LbvIWpKsjecmtO jkgQp9cbSoQ/oZz39eQR9uv65v3ycRxLH1upKJ0J3CHGMWjcHZc/C6jMJQUZnSYT7Z9D uA0A== X-Gm-Message-State: AOAM533Znw9DAeJsJi8hahCqUwWZI/p0j1sVLzhp6Ao1r0ICj4jYIwCv AE6ra2BYauKkuVSXEPwW+yf+ww== X-Google-Smtp-Source: ABdhPJw0ZSwR9N0gXjnHO8+hnboKm+ut2fdUOemEN8EwXxzpgtGKJg/XCp8DUIzXWk4bO7sHirkVAQ== X-Received: by 2002:a62:3886:0:b029:1b9:4214:68e3 with SMTP id f128-20020a6238860000b02901b9421468e3mr6692924pfa.3.1611348461624; Fri, 22 Jan 2021 12:47:41 -0800 (PST) Received: from relinquished.tfbnw.net ([2620:10d:c090:400::5:ea88]) by smtp.gmail.com with ESMTPSA id j18sm4092900pfc.99.2021.01.22.12.47.38 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 22 Jan 2021 12:47:40 -0800 (PST) From: Omar Sandoval To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org, Al Viro , Christoph Hellwig Cc: Dave Chinner , Jann Horn , Amir Goldstein , Aleksa Sarai , linux-api@vger.kernel.org, kernel-team@fb.com Subject: [PATCH v7 09/10] btrfs: implement RWF_ENCODED reads Date: Fri, 22 Jan 2021 12:46:56 -0800 Message-Id: X-Mailer: git-send-email 2.30.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Omar Sandoval There are 4 main cases: 1. Inline extents: we copy the data straight out of the extent buffer. 2. Hole/preallocated extents: we fill in zeroes. 3. Regular, uncompressed extents: we read the sectors we need directly from disk. 4. Regular, compressed extents: we read the entire compressed extent from disk and indicate what subset of the decompressed extent is in the file. This initial implementation simplifies a few things that can be improved in the future: - We hold the inode lock during the operation. - Cases 1, 3, and 4 allocate temporary memory to read into before copying out to userspace. - We don't do read repair, because it turns out that read repair is currently broken for compressed data. Reviewed-by: Josef Bacik Signed-off-by: Omar Sandoval --- fs/btrfs/ctree.h | 2 + fs/btrfs/file.c | 5 + fs/btrfs/inode.c | 497 +++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 504 insertions(+) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 3a0c4fb4c657..d093f7a900d7 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3157,6 +3157,8 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct page *locked_page int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end); void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start, u64 end, int uptodate); +ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter); + extern const struct dentry_operations btrfs_dentry_operations; extern const struct iomap_ops btrfs_dio_iomap_ops; extern const struct iomap_dio_ops btrfs_dio_ops; diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 32578f31cf43..88eb4c5ddd52 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -3630,6 +3630,11 @@ static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to) { ssize_t ret = 0; + if (iocb->ki_flags & IOCB_ENCODED) { + if (iocb->ki_flags & IOCB_NOWAIT) + return -EOPNOTSUPP; + return btrfs_encoded_read(iocb, to); + } if (iocb->ki_flags & IOCB_DIRECT) { ret = btrfs_direct_read(iocb, to); if (ret < 0 || !iov_iter_count(to) || diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index b362e73af840..913f1eba3a92 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -6,6 +6,7 @@ #include #include #include +#include #include #include #include @@ -9981,6 +9982,502 @@ void btrfs_set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end) } } +static int encoded_iov_compression_from_btrfs(unsigned int compress_type) +{ + switch (compress_type) { + case BTRFS_COMPRESS_NONE: + return ENCODED_IOV_COMPRESSION_NONE; + case BTRFS_COMPRESS_ZLIB: + return ENCODED_IOV_COMPRESSION_BTRFS_ZLIB; + case BTRFS_COMPRESS_LZO: + /* + * The LZO format depends on the page size. 64k is the maximum + * sectorsize (and thus page size) that we support. + */ + if (PAGE_SIZE < SZ_4K || PAGE_SIZE > SZ_64K) + return -EINVAL; + return ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K + (PAGE_SHIFT - 12); + case BTRFS_COMPRESS_ZSTD: + return ENCODED_IOV_COMPRESSION_BTRFS_ZSTD; + default: + return -EUCLEAN; + } +} + +static ssize_t btrfs_encoded_read_inline(struct kiocb *iocb, + struct iov_iter *iter, u64 start, + u64 lockend, + struct extent_state **cached_state, + u64 extent_start, size_t count, + struct encoded_iov *encoded, + bool *unlocked) +{ + struct inode *inode = file_inode(iocb->ki_filp); + struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree; + struct btrfs_path *path; + struct extent_buffer *leaf; + struct btrfs_file_extent_item *item; + u64 ram_bytes; + unsigned long ptr; + void *tmp; + ssize_t ret; + + path = btrfs_alloc_path(); + if (!path) { + ret = -ENOMEM; + goto out; + } + ret = btrfs_lookup_file_extent(NULL, BTRFS_I(inode)->root, path, + btrfs_ino(BTRFS_I(inode)), extent_start, + 0); + if (ret) { + if (ret > 0) { + /* The extent item disappeared? */ + ret = -EIO; + } + goto out; + } + leaf = path->nodes[0]; + item = btrfs_item_ptr(leaf, path->slots[0], + struct btrfs_file_extent_item); + + ram_bytes = btrfs_file_extent_ram_bytes(leaf, item); + ptr = btrfs_file_extent_inline_start(item); + + encoded->len = (min_t(u64, extent_start + ram_bytes, inode->i_size) - + iocb->ki_pos); + ret = encoded_iov_compression_from_btrfs( + btrfs_file_extent_compression(leaf, item)); + if (ret < 0) + goto out; + encoded->compression = ret; + if (encoded->compression) { + size_t inline_size; + + inline_size = btrfs_file_extent_inline_item_len(leaf, + btrfs_item_nr(path->slots[0])); + if (inline_size > count) { + ret = -ENOBUFS; + goto out; + } + count = inline_size; + encoded->unencoded_len = ram_bytes; + encoded->unencoded_offset = iocb->ki_pos - extent_start; + } else { + encoded->len = encoded->unencoded_len = count = + min_t(u64, count, encoded->len); + ptr += iocb->ki_pos - extent_start; + } + + tmp = kmalloc(count, GFP_NOFS); + if (!tmp) { + ret = -ENOMEM; + goto out; + } + read_extent_buffer(leaf, tmp, ptr, count); + btrfs_release_path(path); + unlock_extent_cached(io_tree, start, lockend, cached_state); + inode_unlock_shared(inode); + *unlocked = true; + + ret = copy_encoded_iov_to_iter(encoded, iter); + if (ret) + goto out_free; + ret = copy_to_iter(tmp, count, iter); + if (ret != count) + ret = -EFAULT; +out_free: + kfree(tmp); +out: + btrfs_free_path(path); + return ret; +} + +struct btrfs_encoded_read_private { + struct inode *inode; + wait_queue_head_t wait; + atomic_t pending; + blk_status_t status; + bool skip_csum; +}; + +static blk_status_t submit_encoded_read_bio(struct inode *inode, + struct bio *bio, int mirror_num, + unsigned long bio_flags) +{ + struct btrfs_encoded_read_private *priv = bio->bi_private; + struct btrfs_io_bio *io_bio = btrfs_io_bio(bio); + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + blk_status_t ret; + + if (!priv->skip_csum) { + ret = btrfs_lookup_bio_sums(inode, bio, NULL); + if (ret) + return ret; + } + + ret = btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA); + if (ret) { + btrfs_io_bio_free_csum(io_bio); + return ret; + } + + atomic_inc(&priv->pending); + ret = btrfs_map_bio(fs_info, bio, mirror_num); + if (ret) { + atomic_dec(&priv->pending); + btrfs_io_bio_free_csum(io_bio); + } + return ret; +} + +static blk_status_t btrfs_encoded_read_check_bio(struct btrfs_io_bio *io_bio) +{ + const bool uptodate = io_bio->bio.bi_status == BLK_STS_OK; + struct btrfs_encoded_read_private *priv = io_bio->bio.bi_private; + struct inode *inode = priv->inode; + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + u32 sectorsize = fs_info->sectorsize; + struct bio_vec *bvec; + struct bvec_iter_all iter_all; + u64 start = io_bio->logical; + u32 bio_offset = 0; + + if (priv->skip_csum || !uptodate) + return io_bio->bio.bi_status; + + bio_for_each_segment_all(bvec, &io_bio->bio, iter_all) { + unsigned int i, nr_sectors, pgoff; + + nr_sectors = BTRFS_BYTES_TO_BLKS(fs_info, bvec->bv_len); + pgoff = bvec->bv_offset; + for (i = 0; i < nr_sectors; i++) { + ASSERT(pgoff < PAGE_SIZE); + if (check_data_csum(inode, io_bio, bio_offset, + bvec->bv_page, pgoff, start)) + return BLK_STS_IOERR; + start += sectorsize; + bio_offset += sectorsize; + pgoff += sectorsize; + } + } + return BLK_STS_OK; +} + +static void btrfs_encoded_read_endio(struct bio *bio) +{ + struct btrfs_encoded_read_private *priv = bio->bi_private; + struct btrfs_io_bio *io_bio = btrfs_io_bio(bio); + blk_status_t status; + + status = btrfs_encoded_read_check_bio(io_bio); + if (status) { + /* + * The memory barrier implied by the atomic_dec_return() here + * pairs with the memory barrier implied by the + * atomic_dec_return() or io_wait_event() in + * btrfs_encoded_read_regular_fill_pages() to ensure that this + * write is observed before the load of status in + * btrfs_encoded_read_regular_fill_pages(). + */ + WRITE_ONCE(priv->status, status); + } + if (!atomic_dec_return(&priv->pending)) + wake_up(&priv->wait); + btrfs_io_bio_free_csum(io_bio); + bio_put(bio); +} + +static int btrfs_encoded_read_regular_fill_pages(struct inode *inode, u64 offset, + u64 disk_io_size, struct page **pages) +{ + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + struct btrfs_encoded_read_private priv = { + .inode = inode, + .pending = ATOMIC_INIT(1), + .skip_csum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM, + }; + unsigned long i = 0; + u64 cur = 0; + int ret; + + init_waitqueue_head(&priv.wait); + /* + * Submit bios for the extent, splitting due to bio or stripe limits as + * necessary. + */ + while (cur < disk_io_size) { + struct btrfs_io_geometry geom; + struct bio *bio = NULL; + u64 remaining; + + ret = btrfs_get_io_geometry(fs_info, BTRFS_MAP_READ, + offset + cur, disk_io_size - cur, + &geom); + if (ret) { + WRITE_ONCE(priv.status, errno_to_blk_status(ret)); + break; + } + remaining = min(geom.len, disk_io_size - cur); + while (bio || remaining) { + size_t bytes = min_t(u64, remaining, PAGE_SIZE); + + if (!bio) { + bio = btrfs_bio_alloc(offset + cur); + bio->bi_end_io = btrfs_encoded_read_endio; + bio->bi_private = &priv; + bio->bi_opf = REQ_OP_READ; + } + + if (!bytes || + bio_add_page(bio, pages[i], bytes, 0) < bytes) { + blk_status_t status; + + status = submit_encoded_read_bio(inode, bio, 0, + 0); + if (status) { + WRITE_ONCE(priv.status, status); + bio_put(bio); + goto out; + } + bio = NULL; + continue; + } + + i++; + cur += bytes; + remaining -= bytes; + } + } + +out: + if (atomic_dec_return(&priv.pending)) + io_wait_event(priv.wait, !atomic_read(&priv.pending)); + /* See btrfs_encoded_read_endio() for ordering. */ + return blk_status_to_errno(READ_ONCE(priv.status)); +} + +static ssize_t btrfs_encoded_read_regular(struct kiocb *iocb, + struct iov_iter *iter, + u64 start, u64 lockend, + struct extent_state **cached_state, + u64 offset, u64 disk_io_size, + size_t count, + const struct encoded_iov *encoded, + bool *unlocked) +{ + struct inode *inode = file_inode(iocb->ki_filp); + struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree; + struct page **pages; + unsigned long nr_pages, i; + u64 cur; + size_t page_offset; + ssize_t ret; + + nr_pages = DIV_ROUND_UP(disk_io_size, PAGE_SIZE); + pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS); + if (!pages) + return -ENOMEM; + for (i = 0; i < nr_pages; i++) { + pages[i] = alloc_page(GFP_NOFS | __GFP_HIGHMEM); + if (!pages[i]) { + ret = -ENOMEM; + goto out; + } + } + + ret = btrfs_encoded_read_regular_fill_pages(inode, offset, disk_io_size, + pages); + if (ret) + goto out; + + unlock_extent_cached(io_tree, start, lockend, cached_state); + inode_unlock_shared(inode); + *unlocked = true; + + ret = copy_encoded_iov_to_iter(encoded, iter); + if (ret) + goto out; + if (encoded->compression) { + i = 0; + page_offset = 0; + } else { + i = (iocb->ki_pos - start) >> PAGE_SHIFT; + page_offset = (iocb->ki_pos - start) & (PAGE_SIZE - 1); + } + cur = 0; + while (cur < count) { + size_t bytes = min_t(size_t, count - cur, + PAGE_SIZE - page_offset); + + if (copy_page_to_iter(pages[i], page_offset, bytes, + iter) != bytes) { + ret = -EFAULT; + goto out; + } + i++; + cur += bytes; + page_offset = 0; + } + ret = count; +out: + for (i = 0; i < nr_pages; i++) { + if (pages[i]) + __free_page(pages[i]); + } + kfree(pages); + return ret; +} + +ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter) +{ + struct inode *inode = file_inode(iocb->ki_filp); + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree; + ssize_t ret; + size_t count; + u64 start, lockend, offset, disk_io_size; + struct extent_state *cached_state = NULL; + struct extent_map *em; + struct encoded_iov encoded = {}; + bool unlocked = false; + + ret = generic_encoded_read_checks(iocb, iter); + if (ret < 0) + return ret; + if (ret == 0) + return copy_encoded_iov_to_iter(&encoded, iter); + count = ret; + + file_accessed(iocb->ki_filp); + + inode_lock_shared(inode); + + if (iocb->ki_pos >= inode->i_size) { + inode_unlock_shared(inode); + return copy_encoded_iov_to_iter(&encoded, iter); + } + start = ALIGN_DOWN(iocb->ki_pos, fs_info->sectorsize); + /* + * We don't know how long the extent containing iocb->ki_pos is, but if + * it's compressed we know that it won't be longer than this. + */ + lockend = start + BTRFS_MAX_UNCOMPRESSED - 1; + + for (;;) { + struct btrfs_ordered_extent *ordered; + + ret = btrfs_wait_ordered_range(inode, start, + lockend - start + 1); + if (ret) + goto out_unlock_inode; + lock_extent_bits(io_tree, start, lockend, &cached_state); + ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start, + lockend - start + 1); + if (!ordered) + break; + btrfs_put_ordered_extent(ordered); + unlock_extent_cached(io_tree, start, lockend, &cached_state); + cond_resched(); + } + + em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start, + lockend - start + 1); + if (IS_ERR(em)) { + ret = PTR_ERR(em); + goto out_unlock_extent; + } + + if (em->block_start == EXTENT_MAP_INLINE) { + u64 extent_start = em->start; + + /* + * For inline extents we get everything we need out of the + * extent item. + */ + free_extent_map(em); + em = NULL; + ret = btrfs_encoded_read_inline(iocb, iter, start, lockend, + &cached_state, extent_start, + count, &encoded, &unlocked); + goto out; + } + + /* + * We only want to return up to EOF even if the extent extends beyond + * that. + */ + encoded.len = (min_t(u64, extent_map_end(em), inode->i_size) - + iocb->ki_pos); + if (em->block_start == EXTENT_MAP_HOLE || + test_bit(EXTENT_FLAG_PREALLOC, &em->flags)) { + offset = EXTENT_MAP_HOLE; + encoded.len = encoded.unencoded_len = count = + min_t(u64, count, encoded.len); + } else if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) { + offset = em->block_start; + /* + * Bail if the buffer isn't large enough to return the whole + * compressed extent. + */ + if (em->block_len > count) { + ret = -ENOBUFS; + goto out_em; + } + disk_io_size = count = em->block_len; + encoded.unencoded_len = em->ram_bytes; + encoded.unencoded_offset = iocb->ki_pos - em->orig_start; + ret = encoded_iov_compression_from_btrfs(em->compress_type); + if (ret < 0) + goto out_em; + encoded.compression = ret; + } else { + offset = em->block_start + (start - em->start); + if (encoded.len > count) + encoded.len = count; + /* + * Don't read beyond what we locked. This also limits the page + * allocations that we'll do. + */ + disk_io_size = min(lockend + 1, iocb->ki_pos + encoded.len) - start; + encoded.len = encoded.unencoded_len = count = + start + disk_io_size - iocb->ki_pos; + disk_io_size = ALIGN(disk_io_size, fs_info->sectorsize); + } + free_extent_map(em); + em = NULL; + + if (offset == EXTENT_MAP_HOLE) { + unlock_extent_cached(io_tree, start, lockend, &cached_state); + inode_unlock_shared(inode); + unlocked = true; + ret = copy_encoded_iov_to_iter(&encoded, iter); + if (ret) + goto out; + ret = iov_iter_zero(count, iter); + if (ret != count) + ret = -EFAULT; + } else { + ret = btrfs_encoded_read_regular(iocb, iter, start, lockend, + &cached_state, offset, + disk_io_size, count, &encoded, + &unlocked); + } + +out: + if (ret >= 0) + iocb->ki_pos += encoded.len; +out_em: + free_extent_map(em); +out_unlock_extent: + if (!unlocked) + unlock_extent_cached(io_tree, start, lockend, &cached_state); +out_unlock_inode: + if (!unlocked) + inode_unlock_shared(inode); + return ret; +} + #ifdef CONFIG_SWAP /* * Add an entry indicating a block group or device which is pinned by a From patchwork Fri Jan 22 20:46:57 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Omar Sandoval X-Patchwork-Id: 12040623 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 36891C433DB for ; Fri, 22 Jan 2021 21:15:13 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id EB18323AF8 for ; Fri, 22 Jan 2021 21:15:12 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728559AbhAVVN4 (ORCPT ); Fri, 22 Jan 2021 16:13:56 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49252 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730161AbhAVUvG (ORCPT ); Fri, 22 Jan 2021 15:51:06 -0500 Received: from mail-pl1-x62b.google.com (mail-pl1-x62b.google.com [IPv6:2607:f8b0:4864:20::62b]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 82664C061A28 for ; Fri, 22 Jan 2021 12:47:44 -0800 (PST) Received: by mail-pl1-x62b.google.com with SMTP id b8so3996807plx.0 for ; Fri, 22 Jan 2021 12:47:44 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=osandov-com.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=CXYMjzRfe5QIfTe4WA5J5AC/HDZYFKkTHb/rvq9bhT0=; b=ZXAbafth1wP5gELP78gyC6SQiGXwQG01DX2CQJ8VhNZz4IIsj8d1xKEfqlOHZhhPbz qo/huNADOaJKK2n6iOZCGHBAHp/+mnEPoQJIg1j3tB9atN6pANb8geMTr1ucptXHAcOP FhcuwE3d1YL9s2ASfcnNQsmuVfCTbE1EJoW3fHxOmq5C1RTTFmDjHIpWwHybIvVbtO9u Kqrz1aRfhVkNAKIJ3KuES20EZF60lMw6A0XbfFB/+dwUhnH6iZBxjiUairRtr4eoZENS aU+1mmZkRoi8KXSOcG6rNAAOhneSBuaSo9/jgU1cycHYRZ/hEHrWEZxpi+/15BzAal/N 4TSA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=CXYMjzRfe5QIfTe4WA5J5AC/HDZYFKkTHb/rvq9bhT0=; b=ov4oGZyyGKKFWwuLSm4M1qeRqWhw68GYJ9Qxtw5cHyXQCC5J8DgPl64gfBu/Dq/FJT UtYJabdVwoKub5OYOY/HgZ1hxALOgKrjW8JGb1Frj0yn34HZ8ezP0qFEFkqQ+MvIzDU3 LnMHmt9OCdEVYFRZS2SdvlIKiMDSGcFnxfN4cJs0DtA+3K802sTstCQvJWA8hyeXBzCl ZhyD/RPT9AUj/vj6lBkrlbkPtxHx1kqVp/btVoR4wBai9BohjINW4HtLpJHFB9jLbENu ol/UiU3SHjAIHd7oABgluwCRAuMj2ke997lXzeAmToRQhKhJo40xJ7SXrobpusJB5PNB 0P4w== X-Gm-Message-State: AOAM5329wPdsCk2RE3HEyT/aso/fJRgKOJ8e8gFI59gg0u56nsgtBJN/ yPjvL5uJF32575Zji2DoDoMpLw== X-Google-Smtp-Source: ABdhPJxNBUOBZdAC1UuRMRhazxXWGa4ctweh0WaiKa+AKypkRppdVdYd7tGqYAjBn7gQBzVmFSUsdQ== X-Received: by 2002:a17:902:8b8b:b029:de:3a7f:299d with SMTP id ay11-20020a1709028b8bb02900de3a7f299dmr6480850plb.16.1611348463969; Fri, 22 Jan 2021 12:47:43 -0800 (PST) Received: from relinquished.tfbnw.net ([2620:10d:c090:400::5:ea88]) by smtp.gmail.com with ESMTPSA id j18sm4092900pfc.99.2021.01.22.12.47.41 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 22 Jan 2021 12:47:42 -0800 (PST) From: Omar Sandoval To: linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org, Al Viro , Christoph Hellwig Cc: Dave Chinner , Jann Horn , Amir Goldstein , Aleksa Sarai , linux-api@vger.kernel.org, kernel-team@fb.com Subject: [PATCH v7 10/10] btrfs: implement RWF_ENCODED writes Date: Fri, 22 Jan 2021 12:46:57 -0800 Message-Id: <19c9a45a76310918b8bbdf376ba76718cdbb40d8.1611346706.git.osandov@fb.com> X-Mailer: git-send-email 2.30.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Omar Sandoval The implementation resembles direct I/O: we have to flush any ordered extents, invalidate the page cache, and do the io tree/delalloc/extent map/ordered extent dance. From there, we can reuse the compression code with a minor modification to distinguish the write from writeback. This also creates inline extents when possible. Now that read and write are implemented, this also sets the FMODE_ENCODED_IO flag in btrfs_file_open(). Reviewed-by: Josef Bacik Signed-off-by: Omar Sandoval --- fs/btrfs/compression.c | 7 +- fs/btrfs/compression.h | 6 +- fs/btrfs/ctree.h | 2 + fs/btrfs/file.c | 38 +++++- fs/btrfs/inode.c | 259 +++++++++++++++++++++++++++++++++++++++- fs/btrfs/ordered-data.c | 12 +- fs/btrfs/ordered-data.h | 2 + 7 files changed, 314 insertions(+), 12 deletions(-) diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index d9d1d9e38fc5..4cfd03918ffc 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -336,7 +336,8 @@ static void end_compressed_bio_write(struct bio *bio) bio->bi_status == BLK_STS_OK); cb->compressed_pages[0]->mapping = NULL; - end_compressed_writeback(inode, cb); + if (cb->writeback) + end_compressed_writeback(inode, cb); /* note, our inode could be gone now */ /* @@ -372,7 +373,8 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start, struct page **compressed_pages, unsigned long nr_pages, unsigned int write_flags, - struct cgroup_subsys_state *blkcg_css) + struct cgroup_subsys_state *blkcg_css, + bool writeback) { struct btrfs_fs_info *fs_info = inode->root->fs_info; struct bio *bio = NULL; @@ -396,6 +398,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start, cb->mirror_num = 0; cb->compressed_pages = compressed_pages; cb->compressed_len = compressed_len; + cb->writeback = writeback; cb->orig_bio = NULL; cb->nr_pages = nr_pages; diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h index 8001b700ea3a..f95cdc16f503 100644 --- a/fs/btrfs/compression.h +++ b/fs/btrfs/compression.h @@ -49,6 +49,9 @@ struct compressed_bio { /* the compression algorithm for this bio */ int compress_type; + /* Whether this is a write for writeback. */ + bool writeback; + /* number of compressed pages in the array */ unsigned long nr_pages; @@ -96,7 +99,8 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start, struct page **compressed_pages, unsigned long nr_pages, unsigned int write_flags, - struct cgroup_subsys_state *blkcg_css); + struct cgroup_subsys_state *blkcg_css, + bool writeback); blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio, int mirror_num, unsigned long bio_flags); diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index d093f7a900d7..33a08ab5cb0e 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3158,6 +3158,8 @@ int btrfs_writepage_cow_fixup(struct page *page, u64 start, u64 end); void btrfs_writepage_endio_finish_ordered(struct page *page, u64 start, u64 end, int uptodate); ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter); +ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from, + struct encoded_iov *encoded); extern const struct dentry_operations btrfs_dentry_operations; extern const struct iomap_ops btrfs_dio_iomap_ops; diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 88eb4c5ddd52..dd7fec85fea1 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -3,6 +3,7 @@ * Copyright (C) 2007 Oracle. All rights reserved. */ +#include #include #include #include @@ -1993,6 +1994,32 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from) return written ? written : err; } +static ssize_t btrfs_encoded_write(struct kiocb *iocb, struct iov_iter *from) +{ + struct file *file = iocb->ki_filp; + struct inode *inode = file_inode(file); + struct encoded_iov encoded; + ssize_t ret; + + ret = copy_encoded_iov_from_iter(&encoded, from); + if (ret) + return ret; + + btrfs_inode_lock(inode, 0); + ret = generic_encoded_write_checks(iocb, &encoded); + if (ret || encoded.len == 0) + goto out; + + ret = btrfs_write_check(iocb, from, encoded.len); + if (ret < 0) + goto out; + + ret = btrfs_do_encoded_write(iocb, from, &encoded); +out: + btrfs_inode_unlock(inode, 0); + return ret; +} + static ssize_t btrfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from) { @@ -2009,14 +2036,17 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb, if (test_bit(BTRFS_FS_STATE_ERROR, &inode->root->fs_info->fs_state)) return -EROFS; - if (!(iocb->ki_flags & IOCB_DIRECT) && - (iocb->ki_flags & IOCB_NOWAIT)) + if ((iocb->ki_flags & IOCB_NOWAIT) && + (!(iocb->ki_flags & IOCB_DIRECT) || + (iocb->ki_flags & IOCB_ENCODED))) return -EOPNOTSUPP; if (sync) atomic_inc(&inode->sync_writers); - if (iocb->ki_flags & IOCB_DIRECT) + if (iocb->ki_flags & IOCB_ENCODED) + num_written = btrfs_encoded_write(iocb, from); + else if (iocb->ki_flags & IOCB_DIRECT) num_written = btrfs_direct_write(iocb, from); else num_written = btrfs_buffered_write(iocb, from); @@ -3587,7 +3617,7 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence) static int btrfs_file_open(struct inode *inode, struct file *filp) { - filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC; + filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC | FMODE_ENCODED_IO; return generic_file_open(inode, filp); } diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 913f1eba3a92..c2fe76f57bf5 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -935,7 +935,7 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk) ins.offset, async_extent->pages, async_extent->nr_pages, async_chunk->write_flags, - async_chunk->blkcg_css)) { + async_chunk->blkcg_css, true)) { struct page *p = async_extent->pages[0]; const u64 start = async_extent->start; const u64 end = start + async_extent->ram_size - 1; @@ -2712,6 +2712,7 @@ static int insert_ordered_extent_file_extent(struct btrfs_trans_handle *trans, * except if the ordered extent was truncated. */ update_inode_bytes = test_bit(BTRFS_ORDERED_DIRECT, &oe->flags) || + test_bit(BTRFS_ORDERED_ENCODED, &oe->flags) || test_bit(BTRFS_ORDERED_TRUNCATED, &oe->flags); return insert_reserved_file_extent(trans, BTRFS_I(oe->inode), @@ -2746,7 +2747,8 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent) if (!test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags) && !test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags) && - !test_bit(BTRFS_ORDERED_DIRECT, &ordered_extent->flags)) + !test_bit(BTRFS_ORDERED_DIRECT, &ordered_extent->flags) && + !test_bit(BTRFS_ORDERED_ENCODED, &ordered_extent->flags)) clear_bits |= EXTENT_DELALLOC_NEW; freespace_inode = btrfs_is_free_space_inode(inode); @@ -10478,6 +10480,259 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter) return ret; } +ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from, + struct encoded_iov *encoded) +{ + struct inode *inode = file_inode(iocb->ki_filp); + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + struct btrfs_root *root = BTRFS_I(inode)->root; + struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree; + struct extent_changeset *data_reserved = NULL; + struct extent_state *cached_state = NULL; + int compression; + size_t orig_count; + u64 start, end; + u64 num_bytes, ram_bytes, disk_num_bytes; + unsigned long nr_pages, i; + struct page **pages; + struct btrfs_key ins; + bool extent_reserved = false; + struct extent_map *em; + ssize_t ret; + + switch (encoded->compression) { + case ENCODED_IOV_COMPRESSION_BTRFS_ZLIB: + compression = BTRFS_COMPRESS_ZLIB; + break; + case ENCODED_IOV_COMPRESSION_BTRFS_ZSTD: + compression = BTRFS_COMPRESS_ZSTD; + break; + case ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K: + case ENCODED_IOV_COMPRESSION_BTRFS_LZO_8K: + case ENCODED_IOV_COMPRESSION_BTRFS_LZO_16K: + case ENCODED_IOV_COMPRESSION_BTRFS_LZO_32K: + case ENCODED_IOV_COMPRESSION_BTRFS_LZO_64K: + /* The page size must match for LZO. */ + if (encoded->compression - + ENCODED_IOV_COMPRESSION_BTRFS_LZO_4K + 12 != PAGE_SHIFT) + return -EINVAL; + compression = BTRFS_COMPRESS_LZO; + break; + default: + return -EINVAL; + } + if (encoded->encryption != ENCODED_IOV_ENCRYPTION_NONE) + return -EINVAL; + + orig_count = iov_iter_count(from); + + /* The extent size must be sane. */ + if (encoded->unencoded_len > BTRFS_MAX_UNCOMPRESSED || + orig_count > BTRFS_MAX_COMPRESSED || orig_count == 0) + return -EINVAL; + + /* + * The compressed data must be smaller than the decompressed data. + * + * It's of course possible for data to compress to larger or the same + * size, but the buffered I/O path falls back to no compression for such + * data, and we don't want to break any assumptions by creating these + * extents. + * + * Note that this is less strict than the current check we have that the + * compressed data must be at least one sector smaller than the + * decompressed data. We only want to enforce the weaker requirement + * from old kernels that it is at least one byte smaller. + */ + if (orig_count >= encoded->unencoded_len) + return -EINVAL; + + /* The extent must start on a sector boundary. */ + start = iocb->ki_pos; + if (!IS_ALIGNED(start, fs_info->sectorsize)) + return -EINVAL; + + /* + * The extent must end on a sector boundary. However, we allow a write + * which ends at or extends i_size to have an unaligned length; we round + * up the extent size and set i_size to the unaligned end. + */ + if (start + encoded->len < inode->i_size && + !IS_ALIGNED(start + encoded->len, fs_info->sectorsize)) + return -EINVAL; + + /* Finally, the offset in the unencoded data must be sector-aligned. */ + if (!IS_ALIGNED(encoded->unencoded_offset, fs_info->sectorsize)) + return -EINVAL; + + num_bytes = ALIGN(encoded->len, fs_info->sectorsize); + ram_bytes = ALIGN(encoded->unencoded_len, fs_info->sectorsize); + end = start + num_bytes - 1; + + /* + * If the extent cannot be inline, the compressed data on disk must be + * sector-aligned. For convenience, we extend it with zeroes if it + * isn't. + */ + disk_num_bytes = ALIGN(orig_count, fs_info->sectorsize); + nr_pages = DIV_ROUND_UP(disk_num_bytes, PAGE_SIZE); + pages = kvcalloc(nr_pages, sizeof(struct page *), GFP_KERNEL_ACCOUNT); + if (!pages) + return -ENOMEM; + for (i = 0; i < nr_pages; i++) { + size_t bytes = min_t(size_t, PAGE_SIZE, iov_iter_count(from)); + char *kaddr; + + pages[i] = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_HIGHMEM); + if (!pages[i]) { + ret = -ENOMEM; + goto out_pages; + } + kaddr = kmap(pages[i]); + if (copy_from_iter(kaddr, bytes, from) != bytes) { + kunmap(pages[i]); + ret = -EFAULT; + goto out_pages; + } + if (bytes < PAGE_SIZE) + memset(kaddr + bytes, 0, PAGE_SIZE - bytes); + kunmap(pages[i]); + } + + for (;;) { + struct btrfs_ordered_extent *ordered; + + ret = btrfs_wait_ordered_range(inode, start, num_bytes); + if (ret) + goto out_pages; + ret = invalidate_inode_pages2_range(inode->i_mapping, + start >> PAGE_SHIFT, + end >> PAGE_SHIFT); + if (ret) + goto out_pages; + lock_extent_bits(io_tree, start, end, &cached_state); + ordered = btrfs_lookup_ordered_range(BTRFS_I(inode), start, + num_bytes); + if (!ordered && + !filemap_range_has_page(inode->i_mapping, start, end)) + break; + if (ordered) + btrfs_put_ordered_extent(ordered); + unlock_extent_cached(io_tree, start, end, &cached_state); + cond_resched(); + } + + /* + * We don't use the higher-level delalloc space functions because our + * num_bytes and disk_num_bytes are different. + */ + ret = btrfs_alloc_data_chunk_ondemand(BTRFS_I(inode), disk_num_bytes); + if (ret) + goto out_unlock; + ret = btrfs_qgroup_reserve_data(BTRFS_I(inode), &data_reserved, start, + num_bytes); + if (ret) + goto out_free_data_space; + ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), num_bytes, + disk_num_bytes); + if (ret) + goto out_qgroup_free_data; + + /* Try an inline extent first. */ + if (start == 0 && encoded->unencoded_len == encoded->len && + encoded->unencoded_offset == 0) { + ret = cow_file_range_inline(BTRFS_I(inode), encoded->len, + orig_count, compression, pages, + true); + if (ret <= 0) { + if (ret == 0) + ret = orig_count; + goto out_delalloc_release; + } + } + + ret = btrfs_reserve_extent(root, disk_num_bytes, disk_num_bytes, + disk_num_bytes, 0, 0, &ins, 1, 1); + if (ret) + goto out_delalloc_release; + extent_reserved = true; + + em = create_io_em(BTRFS_I(inode), start, num_bytes, + start - encoded->unencoded_offset, ins.objectid, + ins.offset, ins.offset, ram_bytes, compression, + BTRFS_ORDERED_COMPRESSED); + if (IS_ERR(em)) { + ret = PTR_ERR(em); + goto out_free_reserved; + } + free_extent_map(em); + + ret = btrfs_add_ordered_extent(BTRFS_I(inode), start, num_bytes, + ram_bytes, ins.objectid, ins.offset, + encoded->unencoded_offset, + (1 << BTRFS_ORDERED_ENCODED) | + (1 << BTRFS_ORDERED_COMPRESSED), + compression); + if (ret) { + btrfs_drop_extent_cache(BTRFS_I(inode), start, end, 0); + goto out_free_reserved; + } + btrfs_dec_block_group_reservations(fs_info, ins.objectid); + + if (start + encoded->len > inode->i_size) + i_size_write(inode, start + encoded->len); + + unlock_extent_cached(io_tree, start, end, &cached_state); + + btrfs_delalloc_release_extents(BTRFS_I(inode), num_bytes); + + if (btrfs_submit_compressed_write(BTRFS_I(inode), start, num_bytes, + ins.objectid, ins.offset, pages, + nr_pages, 0, NULL, false)) { + struct page *page = pages[0]; + + page->mapping = inode->i_mapping; + btrfs_writepage_endio_finish_ordered(page, start, end, 0); + page->mapping = NULL; + ret = -EIO; + goto out_pages; + } + ret = orig_count; + goto out; + +out_free_reserved: + btrfs_dec_block_group_reservations(fs_info, ins.objectid); + btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1); +out_delalloc_release: + btrfs_delalloc_release_extents(BTRFS_I(inode), num_bytes); + btrfs_delalloc_release_metadata(BTRFS_I(inode), disk_num_bytes, + ret < 0); +out_qgroup_free_data: + if (ret < 0) { + btrfs_qgroup_free_data(BTRFS_I(inode), data_reserved, start, + num_bytes); + } +out_free_data_space: + /* + * If btrfs_reserve_extent() succeeded, then we already decremented + * bytes_may_use. + */ + if (!extent_reserved) + btrfs_free_reserved_data_space_noquota(fs_info, disk_num_bytes); +out_unlock: + unlock_extent_cached(io_tree, start, end, &cached_state); +out_pages: + for (i = 0; i < nr_pages; i++) { + if (pages[i]) + __free_page(pages[i]); + } + kvfree(pages); +out: + if (ret >= 0) + iocb->ki_pos += encoded->len; + return ret; +} + #ifdef CONFIG_SWAP /* * Add an entry indicating a block group or device which is pinned by a diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c index e11a517fab10..00329d82337e 100644 --- a/fs/btrfs/ordered-data.c +++ b/fs/btrfs/ordered-data.c @@ -472,9 +472,15 @@ void btrfs_remove_ordered_extent(struct btrfs_inode *btrfs_inode, spin_lock(&btrfs_inode->lock); btrfs_mod_outstanding_extents(btrfs_inode, -1); spin_unlock(&btrfs_inode->lock); - if (root != fs_info->tree_root) - btrfs_delalloc_release_metadata(btrfs_inode, entry->num_bytes, - false); + if (root != fs_info->tree_root) { + u64 release; + + if (test_bit(BTRFS_ORDERED_ENCODED, &entry->flags)) + release = entry->disk_num_bytes; + else + release = entry->num_bytes; + btrfs_delalloc_release_metadata(btrfs_inode, release, false); + } if (test_bit(BTRFS_ORDERED_DIRECT, &entry->flags)) percpu_counter_add_batch(&fs_info->dio_bytes, -entry->num_bytes, diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h index 8c8d6ed7a350..67c72485bd90 100644 --- a/fs/btrfs/ordered-data.h +++ b/fs/btrfs/ordered-data.h @@ -62,6 +62,8 @@ enum { BTRFS_ORDERED_LOGGED_CSUM, /* We wait for this extent to complete in the current transaction */ BTRFS_ORDERED_PENDING, + /* RWF_ENCODED I/O */ + BTRFS_ORDERED_ENCODED, }; struct btrfs_ordered_extent {