From patchwork Mon Oct 7 14:23:57 2024
X-Patchwork-Submitter: Christian Brauner
X-Patchwork-Id: 13824777
From: Christian Brauner
Date: Mon, 07 Oct 2024 16:23:57 +0200
Subject: [PATCH v2 1/3] fs: protect backing files with rcu
Message-Id: <20241007-brauner-file-rcuref-v2-1-387e24dc9163@kernel.org>
References: <20241007-brauner-file-rcuref-v2-0-387e24dc9163@kernel.org>
In-Reply-To: <20241007-brauner-file-rcuref-v2-0-387e24dc9163@kernel.org>
To: Linus Torvalds
Cc: linux-fsdevel@vger.kernel.org, Thomas Gleixner, Jens Axboe,
    Christian Brauner

Currently backing files are not under any form of rcu protection.
Switching to file_ref requires rcu protection and so does the
speculative vma lookup. Switch backing files to the same rcu slab type
as regular files.
There should be no additional magic required as the lifetime of a
backing file is always tied to a regular file.

Signed-off-by: Christian Brauner
---
 fs/file_table.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/fs/file_table.c b/fs/file_table.c
index eed5ffad9997c24e533f88285deb537ddf9429ed..4b23eb7b79dd9d4ec779f4c01ba2e902988895dc 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -40,13 +40,17 @@ static struct files_stat_struct files_stat = {

 /* SLAB cache for file structures */
 static struct kmem_cache *filp_cachep __ro_after_init;
+static struct kmem_cache *bfilp_cachep __ro_after_init;

 static struct percpu_counter nr_files __cacheline_aligned_in_smp;

 /* Container for backing file with optional user path */
 struct backing_file {
 	struct file file;
-	struct path user_path;
+	union {
+		struct path user_path;
+		freeptr_t bf_freeptr;
+	};
 };

 static inline struct backing_file *backing_file(struct file *f)
@@ -68,7 +72,7 @@ static inline void file_free(struct file *f)
 	put_cred(f->f_cred);
 	if (unlikely(f->f_mode & FMODE_BACKING)) {
 		path_put(backing_file_user_path(f));
-		kfree(backing_file(f));
+		kmem_cache_free(bfilp_cachep, backing_file(f));
 	} else {
 		kmem_cache_free(filp_cachep, f);
 	}
@@ -267,13 +271,13 @@ struct file *alloc_empty_backing_file(int flags, const struct cred *cred)
 	struct backing_file *ff;
 	int error;

-	ff = kzalloc(sizeof(struct backing_file), GFP_KERNEL);
+	ff = kmem_cache_zalloc(bfilp_cachep, GFP_KERNEL);
 	if (unlikely(!ff))
 		return ERR_PTR(-ENOMEM);

 	error = init_file(&ff->file, flags, cred);
 	if (unlikely(error)) {
-		kfree(ff);
+		kmem_cache_free(bfilp_cachep, ff);
 		return ERR_PTR(error);
 	}

@@ -529,6 +533,11 @@ void __init files_init(void)
 	filp_cachep = kmem_cache_create("filp", sizeof(struct file), &args,
 				SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT |
 				SLAB_TYPESAFE_BY_RCU);
+
+	args.freeptr_offset = offsetof(struct backing_file, bf_freeptr);
+	bfilp_cachep = kmem_cache_create("bfilp", sizeof(struct backing_file),
+				&args, SLAB_HWCACHE_ALIGN | SLAB_PANIC |
+				SLAB_ACCOUNT | SLAB_TYPESAFE_BY_RCU);
 	percpu_counter_init(&nr_files, 0, GFP_KERNEL);
 }

From patchwork Mon Oct 7 14:23:58 2024
X-Patchwork-Submitter: Christian Brauner
X-Patchwork-Id: 13824778
From: Christian Brauner
Date: Mon, 07 Oct 2024 16:23:58 +0200
Subject: [PATCH v2 2/3] fs: add file_ref
Message-Id: <20241007-brauner-file-rcuref-v2-2-387e24dc9163@kernel.org>
References: <20241007-brauner-file-rcuref-v2-0-387e24dc9163@kernel.org>
In-Reply-To: <20241007-brauner-file-rcuref-v2-0-387e24dc9163@kernel.org>
To: Linus Torvalds
Cc: linux-fsdevel@vger.kernel.org, Thomas Gleixner, Jens Axboe,
    Christian Brauner

As atomic_inc_not_zero() is implemented with a try_cmpxchg() loop it has
O(N^2) behaviour under contention with N concurrent operations, and it
is in a hot path in __fget_files_rcu().

The rcuref infrastructure remedies this problem by using an
unconditional increment, relying on safe and dead zones to make this
work and requiring rcu protection for the data structure in question.
This not only scales better, it also introduces overflow protection.

However, in contrast to generic rcuref, files require a memory barrier
and thus cannot rely on *_relaxed() atomic operations. They also need
to be built on atomic_long_t, as having massive amounts of references
isn't unheard of, even if it is just an attack.

As suggested by Linus, add a file-specific variant instead of making
this a generic library.

Files are SLAB_TYPESAFE_BY_RCU and thus don't have "regular" rcu
protection. In short, freeing of files isn't delayed until a grace
period has elapsed. Instead, they are freed immediately and thus can be
reused (multiple times) within the same grace period.

So when picking a file from the file descriptor table via its file
descriptor number it is possible to see an elevated reference count on
file->f_count even though the file has already been recycled, possibly
multiple times, by another task.

To guard against this the vfs will pick the file from the file
descriptor table twice: once before the refcount increment and once
after, to compare the pointers (grossly simplified). If they match then
the file is still valid. If not, the caller needs to fput() it.
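To make that double lookup concrete, here is a minimal sketch of the
pattern. The function and variable names are made up for illustration,
several corner cases are omitted, and it still uses the pre-file_ref
atomic_long_inc_not_zero() that patch 3 replaces with file_ref_get();
the real implementation is __fget_files_rcu() in fs/file.c:

#include <linux/fdtable.h>
#include <linux/file.h>
#include <linux/rcupdate.h>

static struct file *fd_lookup_sketch(struct files_struct *files, unsigned int fd)
{
	struct file *file;

	rcu_read_lock();
	for (;;) {
		struct fdtable *fdt = rcu_dereference(files->fdt);
		struct file __rcu **fdentry;

		if (unlikely(fd >= fdt->max_fds)) {
			file = NULL;
			break;
		}
		fdentry = fdt->fd + fd;

		/* First load: may observe a file that was freed and recycled. */
		file = rcu_dereference(*fdentry);
		if (unlikely(!file))
			break;

		/*
		 * Take a reference. With SLAB_TYPESAFE_BY_RCU this can succeed
		 * on a recycled object, so success alone proves nothing.
		 */
		if (unlikely(!atomic_long_inc_not_zero(&file->f_count)))
			continue;

		/*
		 * Second load: if the descriptor slot still points to the same
		 * struct file, the reference belongs to this fd and the lookup
		 * is valid. Otherwise drop the spurious reference and retry.
		 */
		if (likely(file == rcu_dereference(*fdentry)))
			break;
		fput(file);
	}
	rcu_read_unlock();
	return file;
}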
The unconditional increment makes the following race possible as
illustrated by rcuref:

> Deconstruction race
> ===================
>
> The release operation must be protected by prohibiting a grace period in
> order to prevent a possible use after free:
>
>	T1				T2
>	put()				get()
>	// ref->refcnt = ONEREF
>	if (!atomic_add_negative(-1, &ref->refcnt))
>		return false;				<- Not taken
>
>	// ref->refcnt == NOREF
>	--> preemption
>					// Elevates ref->refcnt to ONEREF
>					if (!atomic_add_negative(1, &ref->refcnt))
>						return true;	<- taken
>
>	if (put(&p->ref)) { <-- Succeeds
>		remove_pointer(p);
>		kfree_rcu(p, rcu);
>	}
>
>	RCU grace period ends, object is freed
>
>	atomic_cmpxchg(&ref->refcnt, NOREF, DEAD);	<- UAF
>
> [...] it prevents the grace period which keeps the object alive until
> all put() operations complete.

Having files be SLAB_TYPESAFE_BY_RCU shouldn't cause any problems for
this deconstruction race. Afaict, the only interesting case would be
someone freeing the file and someone immediately recycling it within
the same grace period and reinitializing file->f_count to ONEREF while
a concurrent fput() is doing atomic_cmpxchg(&ref->refcnt, NOREF, DEAD)
as in the race above. But this is safe from SLAB_TYPESAFE_BY_RCU's
perspective and it should be safe from rcuref's perspective.

	T1				T2				T3
	fput()				fget()
	// f_count->refcnt = ONEREF
	if (!atomic_add_negative(-1, &f_count->refcnt))
		return false;				<- Not taken

	// f_count->refcnt == NOREF
	--> preemption
					// Elevates f_count->refcnt to ONEREF
					if (!atomic_add_negative(1, &f_count->refcnt))
						return true;	<- taken

	if (put(&f_count)) { <-- Succeeds
		remove_pointer(p);
		/*
		 * Cache is SLAB_TYPESAFE_BY_RCU
		 * so this is freed without a grace period.
		 */
		kmem_cache_free(p);
	}

								kmem_cache_alloc()
								init_file() {
									// Sets f_count->refcnt to ONEREF
									rcuref_long_init(&f->f_count, 1);
								}

	Object has been reused within the same grace period via
	kmem_cache_alloc()'s SLAB_TYPESAFE_BY_RCU.

	/*
	 * With SLAB_TYPESAFE_BY_RCU this would be a safe UAF access and it
	 * would work correctly because the atomic_cmpxchg() will fail, as
	 * the refcount has been reset to ONEREF by T3.
	 */
	atomic_cmpxchg(&ref->refcnt, NOREF, DEAD);	<- UAF

I've been testing this with will-it-scale using fstat() on a machine
that Jens gave me access to (thank you very much!):

	processor	: 511
	vendor_id	: AuthenticAMD
	cpu family	: 25
	model		: 160
	model name	: AMD EPYC 9754 128-Core Processor

and I consistently get a 3-5% improvement on 256+ threads.

Signed-off-by: Christian Brauner
---
 fs/file.c                | 106 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/file_ref.h | 116 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 222 insertions(+)

diff --git a/fs/file.c b/fs/file.c
index 5125607d040a2ff073e170d043124db5f444a90a..1b5fc867d8ddff856501ba49d8c751f888810330 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -20,10 +20,116 @@
 #include
 #include
 #include
+#include
 #include

 #include "internal.h"

+/**
+ * __file_ref_get - Slowpath of file_ref_get()
+ * @ref:	Pointer to the reference count
+ *
+ * Invoked when the reference count is outside of the valid zone.
+ *
+ * Return:
+ *	False if the reference count was already marked dead
+ *
+ *	True if the reference count is saturated, which prevents the
+ *	object from being deconstructed ever.
+ */
+bool __file_ref_get(file_ref_t *ref)
+{
+	unsigned long cnt;
+
+	/*
+	 * If the reference count was already marked dead, undo the
+	 * increment so it stays in the middle of the dead zone and return
+	 * fail.
+	 */
+	cnt = atomic_long_read(&ref->refcnt);
+	if (cnt >= FILE_REF_RELEASED) {
+		atomic_long_set(&ref->refcnt, FILE_REF_DEAD);
+		return false;
+	}
+
+	/*
+	 * If it was saturated, warn and mark it so. In case the increment
+	 * was already on a saturated value restore the saturation
+	 * marker. This keeps it in the middle of the saturation zone and
+	 * prevents the reference count from overflowing. This leaks the
+	 * object memory, but prevents the obvious reference count overflow
+	 * damage.
+	 */
+	if (WARN_ONCE(cnt > FILE_REF_MAXREF, "leaking memory because file reference count is saturated"))
+		atomic_long_set(&ref->refcnt, FILE_REF_SATURATED);
+	return true;
+}
+EXPORT_SYMBOL_GPL(__file_ref_get);
+
+/**
+ * __file_ref_put - Slowpath of file_ref_put()
+ * @ref:	Pointer to the reference count
+ *
+ * Invoked when the reference count is outside of the valid zone.
+ *
+ * Return:
+ *	True if this was the last reference with no future references
+ *	possible. This signals the caller that it can safely schedule the
+ *	object, which is protected by the reference counter, for
+ *	deconstruction.
+ *
+ *	False if there are still active references or the put() raced
+ *	with a concurrent get()/put() pair. Caller is not allowed to
+ *	deconstruct the protected object.
+ */
+bool __file_ref_put(file_ref_t *ref)
+{
+	unsigned long cnt;
+
+	/* Did this drop the last reference? */
+	cnt = atomic_long_read(&ref->refcnt);
+	if (likely(cnt == FILE_REF_NOREF)) {
+		/*
+		 * Carefully try to set the reference count to FILE_REF_DEAD.
+		 *
+		 * This can fail if a concurrent get() operation has
+		 * elevated it again or the corresponding put() even marked
+		 * it dead already. Both are valid situations and do not
+		 * require a retry. If this fails the caller is not
+		 * allowed to deconstruct the object.
+		 */
+		if (!atomic_long_try_cmpxchg_release(&ref->refcnt, &cnt, FILE_REF_DEAD))
+			return false;
+
+		/*
+		 * The caller can safely schedule the object for
+		 * deconstruction. Provide acquire ordering.
+		 */
+		smp_acquire__after_ctrl_dep();
+		return true;
+	}
+
+	/*
+	 * If the reference count was already in the dead zone, then this
+	 * put() operation is imbalanced. Warn, put the reference count back to
+	 * DEAD and tell the caller to not deconstruct the object.
+	 */
+	if (WARN_ONCE(cnt >= FILE_REF_RELEASED, "imbalanced put on file reference count")) {
+		atomic_long_set(&ref->refcnt, FILE_REF_DEAD);
+		return false;
+	}
+
+	/*
+	 * This is a put() operation on a saturated refcount. Restore the
+	 * mean saturation value and tell the caller to not deconstruct the
+	 * object.
+	 */
+	if (cnt > FILE_REF_MAXREF)
+		atomic_long_set(&ref->refcnt, FILE_REF_SATURATED);
+	return false;
+}
+EXPORT_SYMBOL_GPL(__file_ref_put);
+
 unsigned int sysctl_nr_open __read_mostly = 1024*1024;
 unsigned int sysctl_nr_open_min = BITS_PER_LONG;
 /* our min() is unusable in constant expressions ;-/ */
diff --git a/include/linux/file_ref.h b/include/linux/file_ref.h
new file mode 100644
index 0000000000000000000000000000000000000000..ff3f36eac0efffd21f6298c42102e41c1422212d
--- /dev/null
+++ b/include/linux/file_ref.h
@@ -0,0 +1,116 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _LINUX_FILE_REF_H
+#define _LINUX_FILE_REF_H
+
+#ifdef CONFIG_64BIT
+#define FILE_REF_ONEREF		0x0000000000000000U
+#define FILE_REF_MAXREF		0x7FFFFFFFFFFFFFFFU
+#define FILE_REF_SATURATED	0xA000000000000000U
+#define FILE_REF_RELEASED	0xC000000000000000U
+#define FILE_REF_DEAD		0xE000000000000000U
+#define FILE_REF_NOREF		0xFFFFFFFFFFFFFFFFU
+#else
+#define FILE_REF_ONEREF		0x00000000U
+#define FILE_REF_MAXREF		0x7FFFFFFFU
+#define FILE_REF_SATURATED	0xA0000000U
+#define FILE_REF_RELEASED	0xC0000000U
+#define FILE_REF_DEAD		0xE0000000U
+#define FILE_REF_NOREF		0xFFFFFFFFU
+#endif
+
+typedef struct {
+#ifdef CONFIG_64BIT
+	atomic64_t refcnt;
+#else
+	atomic_t refcnt;
+#endif
+} file_ref_t;
+
+/**
+ * file_ref_init - Initialize a file reference count
+ * @ref:	Pointer to the reference count
+ * @cnt:	The initial reference count typically '1'
+ */
+static inline void file_ref_init(file_ref_t *ref, unsigned long cnt)
+{
+	atomic_long_set(&ref->refcnt, cnt - 1);
+}
+
+bool __file_ref_get(file_ref_t *ref);
+bool __file_ref_put(file_ref_t *ref);
+
+/**
+ * file_ref_get - Acquire one reference on a file
+ * @ref:	Pointer to the reference count
+ *
+ * Similar to atomic_inc_not_zero() but saturates at FILE_REF_MAXREF.
+ *
+ * Provides full memory ordering.
+ *
+ * Return: False if the attempt to acquire a reference failed. This happens
+ *         when the last reference has been put already. True if a reference
+ *         was successfully acquired
+ */
+static __always_inline __must_check bool file_ref_get(file_ref_t *ref)
+{
+	/*
+	 * Unconditionally increase the reference count with full
+	 * ordering. The saturation and dead zones provide enough
+	 * tolerance for this.
+	 */
+	if (likely(!atomic_long_add_negative(1, &ref->refcnt)))
+		return true;
+
+	/* Handle the cases inside the saturation and dead zones */
+	return __file_ref_get(ref);
+}
+
+/**
+ * file_ref_put -- Release a file reference
+ * @ref:	Pointer to the reference count
+ *
+ * Provides release memory ordering, such that prior loads and stores
+ * are done before, and provides an acquire ordering on success such
+ * that free() must come after.
+ *
+ * Return: True if this was the last reference with no future references
+ *         possible. This signals the caller that it can safely release
+ *         the object which is protected by the reference counter.
+ *         False if there are still active references or the put() raced
+ *         with a concurrent get()/put() pair. Caller is not allowed to
+ *         release the protected object.
+ */
+static __always_inline __must_check bool file_ref_put(file_ref_t *ref)
+{
+	bool released;
+
+	preempt_disable();
+	/*
+	 * Unconditionally decrease the reference count. The saturation
+	 * and dead zones provide enough tolerance for this. If this
+	 * fails then we need to handle the last reference drop and
+	 * cases inside the saturation and dead zones.
+	 */
+	if (likely(!atomic_long_add_negative_release(-1, &ref->refcnt)))
+		released = false;
+	else
+		released = __file_ref_put(ref);
+	preempt_enable();
+	return released;
+}
+
+/**
+ * file_ref_read - Read the number of file references
+ * @ref:	Pointer to the reference count
+ *
+ * Return: The number of held references (0 ... N)
+ */
+static inline unsigned long file_ref_read(file_ref_t *ref)
+{
+	unsigned long c = atomic_long_read(&ref->refcnt);
+
+	/* Return 0 if within the DEAD zone. */
+	return c >= FILE_REF_RELEASED ? 0 : c + 1;
+}
+
+#endif

From patchwork Mon Oct 7 14:23:59 2024
X-Patchwork-Submitter: Christian Brauner
X-Patchwork-Id: 13824779
From: Christian Brauner
Date: Mon, 07 Oct 2024 16:23:59 +0200
Subject: [PATCH v2 3/3] fs: port files to file_ref
Message-Id: <20241007-brauner-file-rcuref-v2-3-387e24dc9163@kernel.org>
References: <20241007-brauner-file-rcuref-v2-0-387e24dc9163@kernel.org>
In-Reply-To: <20241007-brauner-file-rcuref-v2-0-387e24dc9163@kernel.org>
To: Linus Torvalds
Cc: linux-fsdevel@vger.kernel.org, Thomas Gleixner, Jens Axboe,
    Christian Brauner
Port files to rely on file_ref to improve scaling and gain overflow
protection.

Signed-off-by: Christian Brauner
---
 drivers/gpu/drm/i915/gt/shmem_utils.c |  2 +-
 drivers/gpu/drm/vmwgfx/ttm_object.c   |  2 +-
 fs/eventpoll.c                        |  2 +-
 fs/file.c                             | 14 +++++++-------
 fs/file_table.c                       |  6 +++---
 include/linux/fs.h                    |  9 +++++----
 6 files changed, 18 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/shmem_utils.c b/drivers/gpu/drm/i915/gt/shmem_utils.c
index 1fb6ff77fd899111a0797dac0edd3f2cfa01f42d..bb696b29ee2c992c6b6d0ec5ae538f9ebbb9ed29 100644
--- a/drivers/gpu/drm/i915/gt/shmem_utils.c
+++ b/drivers/gpu/drm/i915/gt/shmem_utils.c
@@ -40,7 +40,7 @@ struct file *shmem_create_from_object(struct drm_i915_gem_object *obj)

 	if (i915_gem_object_is_shmem(obj)) {
 		file = obj->base.filp;
-		atomic_long_inc(&file->f_count);
+		get_file(file);
 		return file;
 	}

diff --git a/drivers/gpu/drm/vmwgfx/ttm_object.c b/drivers/gpu/drm/vmwgfx/ttm_object.c
index 3353e97687d1d5d0e05bdc8f26ae4b0aae53a997..a17e62867f3b33cd1aafade244d387b43bb66b51 100644
--- a/drivers/gpu/drm/vmwgfx/ttm_object.c
+++ b/drivers/gpu/drm/vmwgfx/ttm_object.c
@@ -471,7 +471,7 @@ void ttm_object_device_release(struct ttm_object_device **p_tdev)
  */
 static bool __must_check get_dma_buf_unless_doomed(struct dma_buf *dmabuf)
 {
-	return atomic_long_inc_not_zero(&dmabuf->file->f_count) != 0L;
+	return file_ref_get(&dmabuf->file->f_ref);
 }

 /**
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 1ae4542f0bd88b07e323d0dd75be6c0fe9fff54f..212383cefe6c9fe13a38061c2c81e5b6ff857925 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1002,7 +1002,7 @@ static struct file *epi_fget(const struct epitem *epi)
 	struct file *file;

 	file = epi->ffd.file;
-	if (!atomic_long_inc_not_zero(&file->f_count))
+	if (!file_ref_get(&file->f_ref))
 		file = NULL;
 	return file;
 }
diff --git a/fs/file.c b/fs/file.c
index 1b5fc867d8ddff856501ba49d8c751f888810330..f4df09e92b6158ee40a794a4e0462f57acf595cf 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -972,7 +972,7 @@ static struct file *__get_file_rcu(struct file __rcu **f)
 	if (!file)
 		return NULL;

-	if (unlikely(!atomic_long_inc_not_zero(&file->f_count)))
+	if (unlikely(!file_ref_get(&file->f_ref)))
 		return ERR_PTR(-EAGAIN);

 	file_reloaded = rcu_dereference_raw(*f);
@@ -986,8 +986,8 @@ static struct file *__get_file_rcu(struct file __rcu **f)
 	OPTIMIZER_HIDE_VAR(file_reloaded_cmp);

 	/*
-	 * atomic_long_inc_not_zero() above provided a full memory
-	 * barrier when we acquired a reference.
+	 * file_ref_get() above provided a full memory barrier when we
+	 * acquired a reference.
 	 *
 	 * This is paired with the write barrier from assigning to the
 	 * __rcu protected file pointer so that if that pointer still
@@ -1085,11 +1085,11 @@ static inline struct file *__fget_files_rcu(struct files_struct *files,
 		 * We need to confirm it by incrementing the refcount
 		 * and then check the lookup again.
 		 *
-		 * atomic_long_inc_not_zero() gives us a full memory
-		 * barrier. We only really need an 'acquire' one to
-		 * protect the loads below, but we don't have that.
+		 * file_ref_get() gives us a full memory barrier. We
+		 * only really need an 'acquire' one to protect the
+		 * loads below, but we don't have that.
		 */
-		if (unlikely(!atomic_long_inc_not_zero(&file->f_count)))
+		if (unlikely(!file_ref_get(&file->f_ref)))
 			continue;

 		/*
diff --git a/fs/file_table.c b/fs/file_table.c
index 4b23eb7b79dd9d4ec779f4c01ba2e902988895dc..3f5dc4176b21ff82cc9440ed92a0ad962fdb2046 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -178,7 +178,7 @@ static int init_file(struct file *f, int flags, const struct cred *cred)
 	 * fget-rcu pattern users need to be able to handle spurious
 	 * refcount bumps we should reinitialize the reused file first.
 	 */
-	atomic_long_set(&f->f_count, 1);
+	file_ref_init(&f->f_ref, 1);
 	return 0;
 }

@@ -483,7 +483,7 @@ static DECLARE_DELAYED_WORK(delayed_fput_work, delayed_fput);

 void fput(struct file *file)
 {
-	if (atomic_long_dec_and_test(&file->f_count)) {
+	if (file_ref_put(&file->f_ref)) {
 		struct task_struct *task = current;

 		if (unlikely(!(file->f_mode & (FMODE_BACKING | FMODE_OPENED)))) {
@@ -516,7 +516,7 @@ void fput(struct file *file)
  */
 void __fput_sync(struct file *file)
 {
-	if (atomic_long_dec_and_test(&file->f_count))
+	if (file_ref_put(&file->f_ref))
 		__fput(file);
 }

diff --git a/include/linux/fs.h b/include/linux/fs.h
index e3c603d01337650d562405500013f5c4cfed8eb6..fece097d41a8fb47a1483e5418f1e7319405bba2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -45,6 +45,7 @@
 #include
 #include
 #include
+#include
 #include
 #include

@@ -1030,7 +1031,7 @@ static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
  * @f_freeptr: Pointer used by SLAB_TYPESAFE_BY_RCU file cache (don't touch.)
  */
 struct file {
-	atomic_long_t			f_count;
+	file_ref_t			f_ref;
 	spinlock_t			f_lock;
 	fmode_t				f_mode;
 	const struct file_operations	*f_op;
@@ -1078,15 +1079,15 @@ struct file_handle {

 static inline struct file *get_file(struct file *f)
 {
-	long prior = atomic_long_fetch_inc_relaxed(&f->f_count);
-	WARN_ONCE(!prior, "struct file::f_count incremented from zero; use-after-free condition present!\n");
+	WARN_ONCE(!file_ref_get(&f->f_ref),
+		  "struct file::f_ref incremented from zero; use-after-free condition present!\n");
 	return f;
 }

 struct file *get_file_rcu(struct file __rcu **f);
 struct file *get_file_active(struct file **f);

-#define file_count(x)	atomic_long_read(&(x)->f_count)
+#define file_count(f)	file_ref_read(&(f)->f_ref)

 #define MAX_NON_LFS	((1UL<<31) - 1)
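As a rough illustration of the counter scheme file_ref.h uses, here is
a small user-space model of the zone arithmetic, assuming a 64-bit
unsigned long. The names and the program are invented for illustration;
only the fast paths are modelled with plain C11 atomics, and none of
this is kernel code or part of the series:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define ONEREF   0x0000000000000000UL
#define MAXREF   0x7FFFFFFFFFFFFFFFUL
#define RELEASED 0xC000000000000000UL
#define DEAD     0xE000000000000000UL
#define NOREF    0xFFFFFFFFFFFFFFFFUL

typedef struct { atomic_ulong refcnt; } ref_t;

static void ref_init(ref_t *r, unsigned long cnt)
{
	/* cnt == 1 stores 0: "one reference" is the all-zero bit pattern. */
	atomic_store(&r->refcnt, cnt - 1);
}

static bool ref_get(ref_t *r)
{
	/* Unconditional increment; a negative result would mean the slowpath. */
	unsigned long old = atomic_fetch_add(&r->refcnt, 1);
	return (long)(old + 1) >= 0;
}

static bool ref_put(ref_t *r)
{
	/* Dropping from ONEREF to NOREF signals the last put. */
	unsigned long old = atomic_fetch_sub(&r->refcnt, 1);
	return old - 1 == NOREF;
}

static unsigned long ref_read(ref_t *r)
{
	unsigned long c = atomic_load(&r->refcnt);

	/* Anything at or above RELEASED reads as "no references". */
	return c >= RELEASED ? 0 : c + 1;
}

int main(void)
{
	ref_t r;

	ref_init(&r, 1);	/* raw 0x0 (ONEREF), ref_read() == 1 */
	ref_get(&r);		/* raw 0x1, ref_read() == 2 */
	printf("refs after get: %lu\n", ref_read(&r));
	printf("first put is last? %s\n", ref_put(&r) ? "yes" : "no");	/* no */
	printf("second put is last? %s\n", ref_put(&r) ? "yes" : "no");	/* yes */
	return 0;
}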