[v2] fuse: Abort connection if FUSE server get stuck

Message ID	20241211194522.31977-2-etmartin4313@gmail.com (mailing list archive)
State	New
Headers	show Received: from mail-qv1-f48.google.com (mail-qv1-f48.google.com [209.85.219.48]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id BE8C52594BD for <linux-fsdevel@vger.kernel.org>; Wed, 11 Dec 2024 19:45:56 +0000 (UTC) From: etmartin4313@gmail.com To: linux-fsdevel@vger.kernel.org, miklos@szeredi.hu Cc: joannelkoong@gmail.com, bernd.schubert@fastmail.fm, jefflexu@linux.alibaba.com, josef@toxicpanda.com, laoar.shao@gmail.com, senozhatsky@chromium.org, etmartin@cisco.com, Etienne Martineau <etmartin4313@gmail.com> Subject: [PATCH v2] fuse: Abort connection if FUSE server get stuck Date: Wed, 11 Dec 2024 14:45:22 -0500 Message-Id: <20241211194522.31977-2-etmartin4313@gmail.com> In-Reply-To: <20241211194522.31977-1-etmartin4313@gmail.com> References: <20241211194522.31977-1-etmartin4313@gmail.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	[v2] fuse: Abort connection if FUSE server get stuck \| expand [v2] fuse: Abort connection if FUSE server get stuck

Etienne Martineau Dec. 11, 2024, 7:45 p.m. UTC

From: Etienne Martineau <etmartin4313@gmail.com>

This patch abort connection if HUNG_TASK_PANIC is set and a FUSE server
is getting stuck for too long. A slow FUSE server may tripped the
hang check timer for legitimate reasons hence consider disabling
HUNG_TASK_PANIC in that scenario.

Without this patch, an unresponsive / buggy / malicious FUSE server can
leave the clients in D state for a long period of time and on system where
HUNG_TASK_PANIC is set, trigger a catastrophic reload.

So, if HUNG_TASK_PANIC checking is enabled, we should wake up periodically
to abort connections that exceed the timeout value which is define to be
half the HUNG_TASK_TIMEOUT period, which keeps overhead low. The timer
is per connection and runs only if there are active FUSE request pending.

A FUSE client can get into D state as such ( see below scenario #1 / #2 )
 1) request_wait_answer() -> wait_event() is UNINTERRUPTIBLE
    OR
 2) request_wait_answer() -> wait_event_(interruptible / killable) is head
    of line blocking for subsequent clients accessing the same file

	scenario #1:
	2716 pts/2    D+     0:00 cat
	$ cat /proc/2716/stack
		[<0>] request_wait_answer+0x22e/0x340
		[<0>] __fuse_simple_request+0xd8/0x2c0
		[<0>] fuse_perform_write+0x3ec/0x760
		[<0>] fuse_file_write_iter+0x3d5/0x3f0
		[<0>] vfs_write+0x313/0x430
		[<0>] ksys_write+0x6a/0xf0
		[<0>] __x64_sys_write+0x19/0x30
		[<0>] x64_sys_call+0x18ab/0x26f0
		[<0>] do_syscall_64+0x7c/0x170
		[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

	scenario #2:
	2962 pts/2    S+     0:00 cat
	2963 pts/3    D+     0:00 cat
	$ cat /proc/2962/stack
		[<0>] request_wait_answer+0x140/0x340
		[<0>] __fuse_simple_request+0xd8/0x2c0
		[<0>] fuse_perform_write+0x3ec/0x760
		[<0>] fuse_file_write_iter+0x3d5/0x3f0
		[<0>] vfs_write+0x313/0x430
		[<0>] ksys_write+0x6a/0xf0
		[<0>] __x64_sys_write+0x19/0x30
		[<0>] x64_sys_call+0x18ab/0x26f0
		[<0>] do_syscall_64+0x7c/0x170
		[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
	$ cat /proc/2963/stack
		[<0>] fuse_file_write_iter+0x252/0x3f0
		[<0>] vfs_write+0x313/0x430
		[<0>] ksys_write+0x6a/0xf0
		[<0>] __x64_sys_write+0x19/0x30
		[<0>] x64_sys_call+0x18ab/0x26f0
		[<0>] do_syscall_64+0x7c/0x170
		[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

Please note that this patch doesn't prevent the HUNG_TASK_WARNING.

Signed-off-by: Etienne Martineau <etmartin4313@gmail.com>
---
 fs/fuse/dev.c                | 100 +++++++++++++++++++++++++++++++++++
 fs/fuse/fuse_i.h             |   8 +++
 fs/fuse/inode.c              |   3 ++
 include/linux/sched/sysctl.h |   8 ++-
 kernel/hung_task.c           |   3 +-
 5 files changed, 119 insertions(+), 3 deletions(-)

Joanne Koong Dec. 11, 2024, 10:03 p.m. UTC | #1

On Wed, Dec 11, 2024 at 11:45 AM <etmartin4313@gmail.com> wrote:
>
> From: Etienne Martineau <etmartin4313@gmail.com>
>
> This patch abort connection if HUNG_TASK_PANIC is set and a FUSE server
> is getting stuck for too long. A slow FUSE server may tripped the
> hang check timer for legitimate reasons hence consider disabling
> HUNG_TASK_PANIC in that scenario.
>
> Without this patch, an unresponsive / buggy / malicious FUSE server can
> leave the clients in D state for a long period of time and on system where
> HUNG_TASK_PANIC is set, trigger a catastrophic reload.
>
> So, if HUNG_TASK_PANIC checking is enabled, we should wake up periodically
> to abort connections that exceed the timeout value which is define to be
> half the HUNG_TASK_TIMEOUT period, which keeps overhead low. The timer
> is per connection and runs only if there are active FUSE request pending.

Hi Etienne,

For your use case, does the generic request timeouts logic and
max_request_timeout systemctl implemented in [1] and [2] not suffice?
IMO I don't think we should have logic specifically checking for hung
task timeouts in fuse, if the generic solution can be used.

Thanks,
Joanne

[1] https://lore.kernel.org/linux-fsdevel/20241114191332.669127-1-joannelkoong@gmail.com/
[2] https://lore.kernel.org/linux-fsdevel/20241114191332.669127-4-joannelkoong@gmail.com/

>
> A FUSE client can get into D state as such ( see below scenario #1 / #2 )
>  1) request_wait_answer() -> wait_event() is UNINTERRUPTIBLE
>     OR
>  2) request_wait_answer() -> wait_event_(interruptible / killable) is head
>     of line blocking for subsequent clients accessing the same file
>
>         scenario #1:
>         2716 pts/2    D+     0:00 cat
>         $ cat /proc/2716/stack
>                 [<0>] request_wait_answer+0x22e/0x340
>                 [<0>] __fuse_simple_request+0xd8/0x2c0
>                 [<0>] fuse_perform_write+0x3ec/0x760
>                 [<0>] fuse_file_write_iter+0x3d5/0x3f0
>                 [<0>] vfs_write+0x313/0x430
>                 [<0>] ksys_write+0x6a/0xf0
>                 [<0>] __x64_sys_write+0x19/0x30
>                 [<0>] x64_sys_call+0x18ab/0x26f0
>                 [<0>] do_syscall_64+0x7c/0x170
>                 [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
>         scenario #2:
>         2962 pts/2    S+     0:00 cat
>         2963 pts/3    D+     0:00 cat
>         $ cat /proc/2962/stack
>                 [<0>] request_wait_answer+0x140/0x340
>                 [<0>] __fuse_simple_request+0xd8/0x2c0
>                 [<0>] fuse_perform_write+0x3ec/0x760
>                 [<0>] fuse_file_write_iter+0x3d5/0x3f0
>                 [<0>] vfs_write+0x313/0x430
>                 [<0>] ksys_write+0x6a/0xf0
>                 [<0>] __x64_sys_write+0x19/0x30
>                 [<0>] x64_sys_call+0x18ab/0x26f0
>                 [<0>] do_syscall_64+0x7c/0x170
>                 [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>         $ cat /proc/2963/stack
>                 [<0>] fuse_file_write_iter+0x252/0x3f0
>                 [<0>] vfs_write+0x313/0x430
>                 [<0>] ksys_write+0x6a/0xf0
>                 [<0>] __x64_sys_write+0x19/0x30
>                 [<0>] x64_sys_call+0x18ab/0x26f0
>                 [<0>] do_syscall_64+0x7c/0x170
>                 [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> Please note that this patch doesn't prevent the HUNG_TASK_WARNING.
>
> Signed-off-by: Etienne Martineau <etmartin4313@gmail.com>
> ---
>  fs/fuse/dev.c                | 100 +++++++++++++++++++++++++++++++++++
>  fs/fuse/fuse_i.h             |   8 +++
>  fs/fuse/inode.c              |   3 ++
>  include/linux/sched/sysctl.h |   8 ++-
>  kernel/hung_task.c           |   3 +-
>  5 files changed, 119 insertions(+), 3 deletions(-)
>
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 27ccae63495d..73d19de14e51 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -21,6 +21,8 @@
>  #include <linux/swap.h>
>  #include <linux/splice.h>
>  #include <linux/sched.h>
> +#include <linux/completion.h>
> +#include <linux/sched/sysctl.h>
>
>  #define CREATE_TRACE_POINTS
>  #include "fuse_trace.h"
> @@ -45,14 +47,104 @@ static struct fuse_dev *fuse_get_dev(struct file *file)
>         return READ_ONCE(file->private_data);
>  }
>
> +static bool request_expired(struct fuse_conn *fc, struct fuse_req *req,
> +               int timeout)
> +{
> +       return time_after(jiffies, req->create_time + timeout);
> +}
> +
> +/*
> + * Prevent hung task timer from firing at us
> + * Check if any requests aren't being completed by the specified request
> + * timeout. To do so, we:
> + * - check the fiq pending list
> + * - check the bg queue
> + * - check the fpq io and processing lists
> + *
> + * To make this fast, we only check against the head request on each list since
> + * these are generally queued in order of creation time (eg newer requests get
> + * queued to the tail). We might miss a few edge cases (eg requests transitioning
> + * between lists, re-sent requests at the head of the pending list having a
> + * later creation time than other requests on that list, etc.) but that is fine
> + * since if the request never gets fulfilled, it will eventually be caught.
> + */
> +void fuse_check_timeout(struct work_struct *wk)
> +{
> +       unsigned long hang_check_timer = sysctl_hung_task_timeout_secs * (HZ / 2);
> +       struct fuse_conn *fc = container_of(wk, struct fuse_conn, work.work);
> +       struct fuse_iqueue *fiq = &fc->iq;
> +       struct fuse_req *req;
> +       struct fuse_dev *fud;
> +       struct fuse_pqueue *fpq;
> +       bool expired = false;
> +       int i;
> +
> +       spin_lock(&fiq->lock);
> +       req = list_first_entry_or_null(&fiq->pending, struct fuse_req, list);
> +       if (req)
> +               expired = request_expired(fc, req, hang_check_timer);
> +       spin_unlock(&fiq->lock);
> +       if (expired)
> +               goto abort_conn;
> +
> +       spin_lock(&fc->bg_lock);
> +       req = list_first_entry_or_null(&fc->bg_queue, struct fuse_req, list);
> +       if (req)
> +               expired = request_expired(fc, req, hang_check_timer);
> +       spin_unlock(&fc->bg_lock);
> +       if (expired)
> +               goto abort_conn;
> +
> +       spin_lock(&fc->lock);
> +       if (!fc->connected) {
> +               spin_unlock(&fc->lock);
> +               return;
> +       }
> +       list_for_each_entry(fud, &fc->devices, entry) {
> +               fpq = &fud->pq;
> +               spin_lock(&fpq->lock);
> +               req = list_first_entry_or_null(&fpq->io, struct fuse_req, list);
> +               if (req && request_expired(fc, req, hang_check_timer))
> +                       goto fpq_abort;
> +
> +               for (i = 0; i < FUSE_PQ_HASH_SIZE; i++) {
> +                       req = list_first_entry_or_null(&fpq->processing[i], struct fuse_req, list);
> +                       if (req && request_expired(fc, req, hang_check_timer))
> +                               goto fpq_abort;
> +               }
> +               spin_unlock(&fpq->lock);
> +       }
> +       /* Keep the ball rolling */
> +       if (atomic_read(&fc->num_waiting) != 0)
> +               queue_delayed_work(system_wq, &fc->work, hang_check_timer);
> +       spin_unlock(&fc->lock);
> +       return;
> +
> +fpq_abort:
> +       spin_unlock(&fpq->lock);
> +       spin_unlock(&fc->lock);
> +abort_conn:
> +       fuse_abort_conn(fc);
> +}
> +
>  static void fuse_request_init(struct fuse_mount *fm, struct fuse_req *req)
>  {
> +       struct fuse_conn *fc = fm->fc;
>         INIT_LIST_HEAD(&req->list);
>         INIT_LIST_HEAD(&req->intr_entry);
>         init_waitqueue_head(&req->waitq);
>         refcount_set(&req->count, 1);
>         __set_bit(FR_PENDING, &req->flags);
>         req->fm = fm;
> +       req->create_time = jiffies;
> +
> +       if (sysctl_hung_task_panic) {
> +               spin_lock(&fc->lock);
> +               /* Get the ball rolling */
> +               queue_delayed_work(system_wq, &fc->work,
> +                               sysctl_hung_task_timeout_secs * (HZ / 2));
> +               spin_unlock(&fc->lock);
> +       }
>  }
>
>  static struct fuse_req *fuse_request_alloc(struct fuse_mount *fm, gfp_t flags)
> @@ -200,6 +292,14 @@ static void fuse_put_request(struct fuse_req *req)
>                         fuse_drop_waiting(fc);
>                 }
>
> +               if (sysctl_hung_task_panic) {
> +                       spin_lock(&fc->lock);
> +                       /* Stop the timeout check if we are the last request */
> +                       if (atomic_read(&fc->num_waiting) == 0)
> +                               cancel_delayed_work_sync(&fc->work);
> +                       spin_unlock(&fc->lock);
> +               }
> +
>                 fuse_request_free(req);
>         }
>  }
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 74744c6f2860..aba3ffd0fb67 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -438,6 +438,9 @@ struct fuse_req {
>
>         /** fuse_mount this request belongs to */
>         struct fuse_mount *fm;
> +
> +       /** When (in jiffies) the request was created */
> +       unsigned long create_time;
>  };
>
>  struct fuse_iqueue;
> @@ -923,6 +926,9 @@ struct fuse_conn {
>         /** IDR for backing files ids */
>         struct idr backing_files_map;
>  #endif
> +
> +       /** Request wait timeout check */
> +       struct delayed_work work;
>  };
>
>  /*
> @@ -1190,6 +1196,8 @@ void fuse_request_end(struct fuse_req *req);
>  /* Abort all requests */
>  void fuse_abort_conn(struct fuse_conn *fc);
>  void fuse_wait_aborted(struct fuse_conn *fc);
> +/* Connection timeout */
> +void fuse_check_timeout(struct work_struct *wk);
>
>  /**
>   * Invalidate inode attributes
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 3ce4f4e81d09..ed96154df0fd 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -23,6 +23,7 @@
>  #include <linux/exportfs.h>
>  #include <linux/posix_acl.h>
>  #include <linux/pid_namespace.h>
> +#include <linux/completion.h>
>  #include <uapi/linux/magic.h>
>
>  MODULE_AUTHOR("Miklos Szeredi <miklos@szeredi.hu>");
> @@ -964,6 +965,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
>         INIT_LIST_HEAD(&fc->entry);
>         INIT_LIST_HEAD(&fc->devices);
>         atomic_set(&fc->num_waiting, 0);
> +       INIT_DELAYED_WORK(&fc->work, fuse_check_timeout);
>         fc->max_background = FUSE_DEFAULT_MAX_BACKGROUND;
>         fc->congestion_threshold = FUSE_DEFAULT_CONGESTION_THRESHOLD;
>         atomic64_set(&fc->khctr, 0);
> @@ -1015,6 +1017,7 @@ void fuse_conn_put(struct fuse_conn *fc)
>                 if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
>                         fuse_backing_files_free(fc);
>                 call_rcu(&fc->rcu, delayed_release);
> +               cancel_delayed_work_sync(&fc->work);
>         }
>  }
>  EXPORT_SYMBOL_GPL(fuse_conn_put);
> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> index 5a64582b086b..1ed3a511060d 100644
> --- a/include/linux/sched/sysctl.h
> +++ b/include/linux/sched/sysctl.h
> @@ -5,11 +5,15 @@
>  #include <linux/types.h>
>
>  #ifdef CONFIG_DETECT_HUNG_TASK
> -/* used for hung_task and block/ */
> +/* used for hung_task, block/ and fuse */
>  extern unsigned long sysctl_hung_task_timeout_secs;
> +extern unsigned int sysctl_hung_task_panic;
>  #else
>  /* Avoid need for ifdefs elsewhere in the code */
> -enum { sysctl_hung_task_timeout_secs = 0 };
> +enum {
> +       sysctl_hung_task_timeout_secs = 0,
> +       sysctl_hung_task_panic = 0,
> +};
>  #endif
>
>  enum sched_tunable_scaling {
> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> index c18717189f32..16602d3754b1 100644
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -78,8 +78,9 @@ static unsigned int __read_mostly sysctl_hung_task_all_cpu_backtrace;
>   * Should we panic (and reboot, if panic_timeout= is set) when a
>   * hung task is detected:
>   */
> -static unsigned int __read_mostly sysctl_hung_task_panic =
> +unsigned int __read_mostly sysctl_hung_task_panic =
>         IS_ENABLED(CONFIG_BOOTPARAM_HUNG_TASK_PANIC);
> +EXPORT_SYMBOL_GPL(sysctl_hung_task_panic);
>
>  static int
>  hung_task_panic(struct notifier_block *this, unsigned long event, void *ptr)
> --
> 2.34.1
>

Etienne Martineau Dec. 11, 2024, 11:04 p.m. UTC | #2

On Wed, Dec 11, 2024 at 5:04 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Wed, Dec 11, 2024 at 11:45 AM <etmartin4313@gmail.com> wrote:
> >
> > From: Etienne Martineau <etmartin4313@gmail.com>
> >
> > This patch abort connection if HUNG_TASK_PANIC is set and a FUSE server
> > is getting stuck for too long. A slow FUSE server may tripped the
> > hang check timer for legitimate reasons hence consider disabling
> > HUNG_TASK_PANIC in that scenario.
> >
> > Without this patch, an unresponsive / buggy / malicious FUSE server can
> > leave the clients in D state for a long period of time and on system where
> > HUNG_TASK_PANIC is set, trigger a catastrophic reload.
> >
> > So, if HUNG_TASK_PANIC checking is enabled, we should wake up periodically
> > to abort connections that exceed the timeout value which is define to be
> > half the HUNG_TASK_TIMEOUT period, which keeps overhead low. The timer
> > is per connection and runs only if there are active FUSE request pending.
>
> Hi Etienne,
>
> For your use case, does the generic request timeouts logic and
> max_request_timeout systemctl implemented in [1] and [2] not suffice?
> IMO I don't think we should have logic specifically checking for hung
> task timeouts in fuse, if the generic solution can be used.
>
> Thanks,
> Joanne

We need a way to avoid catastrophic reloads on systems where HUNG_TASK_PANIC
is set while a buggy / malicious FUSE server stops responding.
I would argue that this is much needed in stable branches as well...

For that reason, I believe we need to keep things simple for step #1
e.g. there is no
need to introduce another knob as we already have HUNG_TASK_TIMEOUT which
holds the source of truth.

IMO introducing those new knobs will put an unnecessary burden on sysadmins into
something that is error prone because unlike
  CONFIG_DETECT_HUNG_TASK=y
  CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=120
which is built-in, the "default_request_timeout" /
"max_request_timeout" needs to be
set appropriately after every reboot and failure to do so may have
nasty consequences.

For the more generic cases then yes those timeouts make sense to me.

Thanks,
Etienne

>
> [1] https://lore.kernel.org/linux-fsdevel/20241114191332.669127-1-joannelkoong@gmail.com/
> [2] https://lore.kernel.org/linux-fsdevel/20241114191332.669127-4-joannelkoong@gmail.com/
>
> >
> > A FUSE client can get into D state as such ( see below scenario #1 / #2 )
> >  1) request_wait_answer() -> wait_event() is UNINTERRUPTIBLE
> >     OR
> >  2) request_wait_answer() -> wait_event_(interruptible / killable) is head
> >     of line blocking for subsequent clients accessing the same file
> >
> >         scenario #1:
> >         2716 pts/2    D+     0:00 cat
> >         $ cat /proc/2716/stack
> >                 [<0>] request_wait_answer+0x22e/0x340
> >                 [<0>] __fuse_simple_request+0xd8/0x2c0
> >                 [<0>] fuse_perform_write+0x3ec/0x760
> >                 [<0>] fuse_file_write_iter+0x3d5/0x3f0
> >                 [<0>] vfs_write+0x313/0x430
> >                 [<0>] ksys_write+0x6a/0xf0
> >                 [<0>] __x64_sys_write+0x19/0x30
> >                 [<0>] x64_sys_call+0x18ab/0x26f0
> >                 [<0>] do_syscall_64+0x7c/0x170
> >                 [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >
> >         scenario #2:
> >         2962 pts/2    S+     0:00 cat
> >         2963 pts/3    D+     0:00 cat
> >         $ cat /proc/2962/stack
> >                 [<0>] request_wait_answer+0x140/0x340
> >                 [<0>] __fuse_simple_request+0xd8/0x2c0
> >                 [<0>] fuse_perform_write+0x3ec/0x760
> >                 [<0>] fuse_file_write_iter+0x3d5/0x3f0
> >                 [<0>] vfs_write+0x313/0x430
> >                 [<0>] ksys_write+0x6a/0xf0
> >                 [<0>] __x64_sys_write+0x19/0x30
> >                 [<0>] x64_sys_call+0x18ab/0x26f0
> >                 [<0>] do_syscall_64+0x7c/0x170
> >                 [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >         $ cat /proc/2963/stack
> >                 [<0>] fuse_file_write_iter+0x252/0x3f0
> >                 [<0>] vfs_write+0x313/0x430
> >                 [<0>] ksys_write+0x6a/0xf0
> >                 [<0>] __x64_sys_write+0x19/0x30
> >                 [<0>] x64_sys_call+0x18ab/0x26f0
> >                 [<0>] do_syscall_64+0x7c/0x170
> >                 [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >
> > Please note that this patch doesn't prevent the HUNG_TASK_WARNING.
> >
> > Signed-off-by: Etienne Martineau <etmartin4313@gmail.com>
> > ---
> >  fs/fuse/dev.c                | 100 +++++++++++++++++++++++++++++++++++
> >  fs/fuse/fuse_i.h             |   8 +++
> >  fs/fuse/inode.c              |   3 ++
> >  include/linux/sched/sysctl.h |   8 ++-
> >  kernel/hung_task.c           |   3 +-
> >  5 files changed, 119 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> > index 27ccae63495d..73d19de14e51 100644
> > --- a/fs/fuse/dev.c
> > +++ b/fs/fuse/dev.c
> > @@ -21,6 +21,8 @@
> >  #include <linux/swap.h>
> >  #include <linux/splice.h>
> >  #include <linux/sched.h>
> > +#include <linux/completion.h>
> > +#include <linux/sched/sysctl.h>
> >
> >  #define CREATE_TRACE_POINTS
> >  #include "fuse_trace.h"
> > @@ -45,14 +47,104 @@ static struct fuse_dev *fuse_get_dev(struct file *file)
> >         return READ_ONCE(file->private_data);
> >  }
> >
> > +static bool request_expired(struct fuse_conn *fc, struct fuse_req *req,
> > +               int timeout)
> > +{
> > +       return time_after(jiffies, req->create_time + timeout);
> > +}
> > +
> > +/*
> > + * Prevent hung task timer from firing at us
> > + * Check if any requests aren't being completed by the specified request
> > + * timeout. To do so, we:
> > + * - check the fiq pending list
> > + * - check the bg queue
> > + * - check the fpq io and processing lists
> > + *
> > + * To make this fast, we only check against the head request on each list since
> > + * these are generally queued in order of creation time (eg newer requests get
> > + * queued to the tail). We might miss a few edge cases (eg requests transitioning
> > + * between lists, re-sent requests at the head of the pending list having a
> > + * later creation time than other requests on that list, etc.) but that is fine
> > + * since if the request never gets fulfilled, it will eventually be caught.
> > + */
> > +void fuse_check_timeout(struct work_struct *wk)
> > +{
> > +       unsigned long hang_check_timer = sysctl_hung_task_timeout_secs * (HZ / 2);
> > +       struct fuse_conn *fc = container_of(wk, struct fuse_conn, work.work);
> > +       struct fuse_iqueue *fiq = &fc->iq;
> > +       struct fuse_req *req;
> > +       struct fuse_dev *fud;
> > +       struct fuse_pqueue *fpq;
> > +       bool expired = false;
> > +       int i;
> > +
> > +       spin_lock(&fiq->lock);
> > +       req = list_first_entry_or_null(&fiq->pending, struct fuse_req, list);
> > +       if (req)
> > +               expired = request_expired(fc, req, hang_check_timer);
> > +       spin_unlock(&fiq->lock);
> > +       if (expired)
> > +               goto abort_conn;
> > +
> > +       spin_lock(&fc->bg_lock);
> > +       req = list_first_entry_or_null(&fc->bg_queue, struct fuse_req, list);
> > +       if (req)
> > +               expired = request_expired(fc, req, hang_check_timer);
> > +       spin_unlock(&fc->bg_lock);
> > +       if (expired)
> > +               goto abort_conn;
> > +
> > +       spin_lock(&fc->lock);
> > +       if (!fc->connected) {
> > +               spin_unlock(&fc->lock);
> > +               return;
> > +       }
> > +       list_for_each_entry(fud, &fc->devices, entry) {
> > +               fpq = &fud->pq;
> > +               spin_lock(&fpq->lock);
> > +               req = list_first_entry_or_null(&fpq->io, struct fuse_req, list);
> > +               if (req && request_expired(fc, req, hang_check_timer))
> > +                       goto fpq_abort;
> > +
> > +               for (i = 0; i < FUSE_PQ_HASH_SIZE; i++) {
> > +                       req = list_first_entry_or_null(&fpq->processing[i], struct fuse_req, list);
> > +                       if (req && request_expired(fc, req, hang_check_timer))
> > +                               goto fpq_abort;
> > +               }
> > +               spin_unlock(&fpq->lock);
> > +       }
> > +       /* Keep the ball rolling */
> > +       if (atomic_read(&fc->num_waiting) != 0)
> > +               queue_delayed_work(system_wq, &fc->work, hang_check_timer);
> > +       spin_unlock(&fc->lock);
> > +       return;
> > +
> > +fpq_abort:
> > +       spin_unlock(&fpq->lock);
> > +       spin_unlock(&fc->lock);
> > +abort_conn:
> > +       fuse_abort_conn(fc);
> > +}
> > +
> >  static void fuse_request_init(struct fuse_mount *fm, struct fuse_req *req)
> >  {
> > +       struct fuse_conn *fc = fm->fc;
> >         INIT_LIST_HEAD(&req->list);
> >         INIT_LIST_HEAD(&req->intr_entry);
> >         init_waitqueue_head(&req->waitq);
> >         refcount_set(&req->count, 1);
> >         __set_bit(FR_PENDING, &req->flags);
> >         req->fm = fm;
> > +       req->create_time = jiffies;
> > +
> > +       if (sysctl_hung_task_panic) {
> > +               spin_lock(&fc->lock);
> > +               /* Get the ball rolling */
> > +               queue_delayed_work(system_wq, &fc->work,
> > +                               sysctl_hung_task_timeout_secs * (HZ / 2));
> > +               spin_unlock(&fc->lock);
> > +       }
> >  }
> >
> >  static struct fuse_req *fuse_request_alloc(struct fuse_mount *fm, gfp_t flags)
> > @@ -200,6 +292,14 @@ static void fuse_put_request(struct fuse_req *req)
> >                         fuse_drop_waiting(fc);
> >                 }
> >
> > +               if (sysctl_hung_task_panic) {
> > +                       spin_lock(&fc->lock);
> > +                       /* Stop the timeout check if we are the last request */
> > +                       if (atomic_read(&fc->num_waiting) == 0)
> > +                               cancel_delayed_work_sync(&fc->work);
> > +                       spin_unlock(&fc->lock);
> > +               }
> > +
> >                 fuse_request_free(req);
> >         }
> >  }
> > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > index 74744c6f2860..aba3ffd0fb67 100644
> > --- a/fs/fuse/fuse_i.h
> > +++ b/fs/fuse/fuse_i.h
> > @@ -438,6 +438,9 @@ struct fuse_req {
> >
> >         /** fuse_mount this request belongs to */
> >         struct fuse_mount *fm;
> > +
> > +       /** When (in jiffies) the request was created */
> > +       unsigned long create_time;
> >  };
> >
> >  struct fuse_iqueue;
> > @@ -923,6 +926,9 @@ struct fuse_conn {
> >         /** IDR for backing files ids */
> >         struct idr backing_files_map;
> >  #endif
> > +
> > +       /** Request wait timeout check */
> > +       struct delayed_work work;
> >  };
> >
> >  /*
> > @@ -1190,6 +1196,8 @@ void fuse_request_end(struct fuse_req *req);
> >  /* Abort all requests */
> >  void fuse_abort_conn(struct fuse_conn *fc);
> >  void fuse_wait_aborted(struct fuse_conn *fc);
> > +/* Connection timeout */
> > +void fuse_check_timeout(struct work_struct *wk);
> >
> >  /**
> >   * Invalidate inode attributes
> > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > index 3ce4f4e81d09..ed96154df0fd 100644
> > --- a/fs/fuse/inode.c
> > +++ b/fs/fuse/inode.c
> > @@ -23,6 +23,7 @@
> >  #include <linux/exportfs.h>
> >  #include <linux/posix_acl.h>
> >  #include <linux/pid_namespace.h>
> > +#include <linux/completion.h>
> >  #include <uapi/linux/magic.h>
> >
> >  MODULE_AUTHOR("Miklos Szeredi <miklos@szeredi.hu>");
> > @@ -964,6 +965,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
> >         INIT_LIST_HEAD(&fc->entry);
> >         INIT_LIST_HEAD(&fc->devices);
> >         atomic_set(&fc->num_waiting, 0);
> > +       INIT_DELAYED_WORK(&fc->work, fuse_check_timeout);
> >         fc->max_background = FUSE_DEFAULT_MAX_BACKGROUND;
> >         fc->congestion_threshold = FUSE_DEFAULT_CONGESTION_THRESHOLD;
> >         atomic64_set(&fc->khctr, 0);
> > @@ -1015,6 +1017,7 @@ void fuse_conn_put(struct fuse_conn *fc)
> >                 if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> >                         fuse_backing_files_free(fc);
> >                 call_rcu(&fc->rcu, delayed_release);
> > +               cancel_delayed_work_sync(&fc->work);
> >         }
> >  }
> >  EXPORT_SYMBOL_GPL(fuse_conn_put);
> > diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> > index 5a64582b086b..1ed3a511060d 100644
> > --- a/include/linux/sched/sysctl.h
> > +++ b/include/linux/sched/sysctl.h
> > @@ -5,11 +5,15 @@
> >  #include <linux/types.h>
> >
> >  #ifdef CONFIG_DETECT_HUNG_TASK
> > -/* used for hung_task and block/ */
> > +/* used for hung_task, block/ and fuse */
> >  extern unsigned long sysctl_hung_task_timeout_secs;
> > +extern unsigned int sysctl_hung_task_panic;
> >  #else
> >  /* Avoid need for ifdefs elsewhere in the code */
> > -enum { sysctl_hung_task_timeout_secs = 0 };
> > +enum {
> > +       sysctl_hung_task_timeout_secs = 0,
> > +       sysctl_hung_task_panic = 0,
> > +};
> >  #endif
> >
> >  enum sched_tunable_scaling {
> > diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> > index c18717189f32..16602d3754b1 100644
> > --- a/kernel/hung_task.c
> > +++ b/kernel/hung_task.c
> > @@ -78,8 +78,9 @@ static unsigned int __read_mostly sysctl_hung_task_all_cpu_backtrace;
> >   * Should we panic (and reboot, if panic_timeout= is set) when a
> >   * hung task is detected:
> >   */
> > -static unsigned int __read_mostly sysctl_hung_task_panic =
> > +unsigned int __read_mostly sysctl_hung_task_panic =
> >         IS_ENABLED(CONFIG_BOOTPARAM_HUNG_TASK_PANIC);
> > +EXPORT_SYMBOL_GPL(sysctl_hung_task_panic);
> >
> >  static int
> >  hung_task_panic(struct notifier_block *this, unsigned long event, void *ptr)
> > --
> > 2.34.1
> >

Lance Yang Dec. 12, 2024, 5:26 a.m. UTC | #3

On Thu, Dec 12, 2024 at 7:04 AM Etienne Martineau
<etmartin4313@gmail.com> wrote:
>
> On Wed, Dec 11, 2024 at 5:04 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Wed, Dec 11, 2024 at 11:45 AM <etmartin4313@gmail.com> wrote:
> > >
> > > From: Etienne Martineau <etmartin4313@gmail.com>
> > >
> > > This patch abort connection if HUNG_TASK_PANIC is set and a FUSE server
> > > is getting stuck for too long. A slow FUSE server may tripped the

A FUSE server is getting stuck for longer than hung_task_timeout_secs
(the default is two minutes). Is it not buggy in the real-world?

> > > hang check timer for legitimate reasons hence consider disabling
> > > HUNG_TASK_PANIC in that scenario.

Why not just consider increasing hung_task_timeout_secs if necessary?

> > >
> > > Without this patch, an unresponsive / buggy / malicious FUSE server can
> > > leave the clients in D state for a long period of time and on system where
> > > HUNG_TASK_PANIC is set, trigger a catastrophic reload.

Sorry, I don't see any sense in knowing whether HUNG_TASK_PANIC is enabled for
FUSE servers. Or am I possibly missing something important?

If HUNG_TASK_PANIC is set, we should do a reload when a hung task is detected;
this is working as expected IHMO.

Thanks,
Lance

> > >
> > > So, if HUNG_TASK_PANIC checking is enabled, we should wake up periodically
> > > to abort connections that exceed the timeout value which is define to be
> > > half the HUNG_TASK_TIMEOUT period, which keeps overhead low. The timer
> > > is per connection and runs only if there are active FUSE request pending.
> >
> > Hi Etienne,
> >
> > For your use case, does the generic request timeouts logic and
> > max_request_timeout systemctl implemented in [1] and [2] not suffice?
> > IMO I don't think we should have logic specifically checking for hung
> > task timeouts in fuse, if the generic solution can be used.
> >
> > Thanks,
> > Joanne
>
> We need a way to avoid catastrophic reloads on systems where HUNG_TASK_PANIC
> is set while a buggy / malicious FUSE server stops responding.
> I would argue that this is much needed in stable branches as well...
>
> For that reason, I believe we need to keep things simple for step #1
> e.g. there is no
> need to introduce another knob as we already have HUNG_TASK_TIMEOUT which
> holds the source of truth.
>
> IMO introducing those new knobs will put an unnecessary burden on sysadmins into
> something that is error prone because unlike
>   CONFIG_DETECT_HUNG_TASK=y
>   CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=120
> which is built-in, the "default_request_timeout" /
> "max_request_timeout" needs to be
> set appropriately after every reboot and failure to do so may have
> nasty consequences.
>
> For the more generic cases then yes those timeouts make sense to me.
>
> Thanks,
> Etienne
>
> >
> > [1] https://lore.kernel.org/linux-fsdevel/20241114191332.669127-1-joannelkoong@gmail.com/
> > [2] https://lore.kernel.org/linux-fsdevel/20241114191332.669127-4-joannelkoong@gmail.com/
> >
> > >
> > > A FUSE client can get into D state as such ( see below scenario #1 / #2 )
> > >  1) request_wait_answer() -> wait_event() is UNINTERRUPTIBLE
> > >     OR
> > >  2) request_wait_answer() -> wait_event_(interruptible / killable) is head
> > >     of line blocking for subsequent clients accessing the same file
> > >
> > >         scenario #1:
> > >         2716 pts/2    D+     0:00 cat
> > >         $ cat /proc/2716/stack
> > >                 [<0>] request_wait_answer+0x22e/0x340
> > >                 [<0>] __fuse_simple_request+0xd8/0x2c0
> > >                 [<0>] fuse_perform_write+0x3ec/0x760
> > >                 [<0>] fuse_file_write_iter+0x3d5/0x3f0
> > >                 [<0>] vfs_write+0x313/0x430
> > >                 [<0>] ksys_write+0x6a/0xf0
> > >                 [<0>] __x64_sys_write+0x19/0x30
> > >                 [<0>] x64_sys_call+0x18ab/0x26f0
> > >                 [<0>] do_syscall_64+0x7c/0x170
> > >                 [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > >
> > >         scenario #2:
> > >         2962 pts/2    S+     0:00 cat
> > >         2963 pts/3    D+     0:00 cat
> > >         $ cat /proc/2962/stack
> > >                 [<0>] request_wait_answer+0x140/0x340
> > >                 [<0>] __fuse_simple_request+0xd8/0x2c0
> > >                 [<0>] fuse_perform_write+0x3ec/0x760
> > >                 [<0>] fuse_file_write_iter+0x3d5/0x3f0
> > >                 [<0>] vfs_write+0x313/0x430
> > >                 [<0>] ksys_write+0x6a/0xf0
> > >                 [<0>] __x64_sys_write+0x19/0x30
> > >                 [<0>] x64_sys_call+0x18ab/0x26f0
> > >                 [<0>] do_syscall_64+0x7c/0x170
> > >                 [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > >         $ cat /proc/2963/stack
> > >                 [<0>] fuse_file_write_iter+0x252/0x3f0
> > >                 [<0>] vfs_write+0x313/0x430
> > >                 [<0>] ksys_write+0x6a/0xf0
> > >                 [<0>] __x64_sys_write+0x19/0x30
> > >                 [<0>] x64_sys_call+0x18ab/0x26f0
> > >                 [<0>] do_syscall_64+0x7c/0x170
> > >                 [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > >
> > > Please note that this patch doesn't prevent the HUNG_TASK_WARNING.
> > >
> > > Signed-off-by: Etienne Martineau <etmartin4313@gmail.com>
> > > ---
> > >  fs/fuse/dev.c                | 100 +++++++++++++++++++++++++++++++++++
> > >  fs/fuse/fuse_i.h             |   8 +++
> > >  fs/fuse/inode.c              |   3 ++
> > >  include/linux/sched/sysctl.h |   8 ++-
> > >  kernel/hung_task.c           |   3 +-
> > >  5 files changed, 119 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> > > index 27ccae63495d..73d19de14e51 100644
> > > --- a/fs/fuse/dev.c
> > > +++ b/fs/fuse/dev.c
> > > @@ -21,6 +21,8 @@
> > >  #include <linux/swap.h>
> > >  #include <linux/splice.h>
> > >  #include <linux/sched.h>
> > > +#include <linux/completion.h>
> > > +#include <linux/sched/sysctl.h>
> > >
> > >  #define CREATE_TRACE_POINTS
> > >  #include "fuse_trace.h"
> > > @@ -45,14 +47,104 @@ static struct fuse_dev *fuse_get_dev(struct file *file)
> > >         return READ_ONCE(file->private_data);
> > >  }
> > >
> > > +static bool request_expired(struct fuse_conn *fc, struct fuse_req *req,
> > > +               int timeout)
> > > +{
> > > +       return time_after(jiffies, req->create_time + timeout);
> > > +}
> > > +
> > > +/*
> > > + * Prevent hung task timer from firing at us
> > > + * Check if any requests aren't being completed by the specified request
> > > + * timeout. To do so, we:
> > > + * - check the fiq pending list
> > > + * - check the bg queue
> > > + * - check the fpq io and processing lists
> > > + *
> > > + * To make this fast, we only check against the head request on each list since
> > > + * these are generally queued in order of creation time (eg newer requests get
> > > + * queued to the tail). We might miss a few edge cases (eg requests transitioning
> > > + * between lists, re-sent requests at the head of the pending list having a
> > > + * later creation time than other requests on that list, etc.) but that is fine
> > > + * since if the request never gets fulfilled, it will eventually be caught.
> > > + */
> > > +void fuse_check_timeout(struct work_struct *wk)
> > > +{
> > > +       unsigned long hang_check_timer = sysctl_hung_task_timeout_secs * (HZ / 2);
> > > +       struct fuse_conn *fc = container_of(wk, struct fuse_conn, work.work);
> > > +       struct fuse_iqueue *fiq = &fc->iq;
> > > +       struct fuse_req *req;
> > > +       struct fuse_dev *fud;
> > > +       struct fuse_pqueue *fpq;
> > > +       bool expired = false;
> > > +       int i;
> > > +
> > > +       spin_lock(&fiq->lock);
> > > +       req = list_first_entry_or_null(&fiq->pending, struct fuse_req, list);
> > > +       if (req)
> > > +               expired = request_expired(fc, req, hang_check_timer);
> > > +       spin_unlock(&fiq->lock);
> > > +       if (expired)
> > > +               goto abort_conn;
> > > +
> > > +       spin_lock(&fc->bg_lock);
> > > +       req = list_first_entry_or_null(&fc->bg_queue, struct fuse_req, list);
> > > +       if (req)
> > > +               expired = request_expired(fc, req, hang_check_timer);
> > > +       spin_unlock(&fc->bg_lock);
> > > +       if (expired)
> > > +               goto abort_conn;
> > > +
> > > +       spin_lock(&fc->lock);
> > > +       if (!fc->connected) {
> > > +               spin_unlock(&fc->lock);
> > > +               return;
> > > +       }
> > > +       list_for_each_entry(fud, &fc->devices, entry) {
> > > +               fpq = &fud->pq;
> > > +               spin_lock(&fpq->lock);
> > > +               req = list_first_entry_or_null(&fpq->io, struct fuse_req, list);
> > > +               if (req && request_expired(fc, req, hang_check_timer))
> > > +                       goto fpq_abort;
> > > +
> > > +               for (i = 0; i < FUSE_PQ_HASH_SIZE; i++) {
> > > +                       req = list_first_entry_or_null(&fpq->processing[i], struct fuse_req, list);
> > > +                       if (req && request_expired(fc, req, hang_check_timer))
> > > +                               goto fpq_abort;
> > > +               }
> > > +               spin_unlock(&fpq->lock);
> > > +       }
> > > +       /* Keep the ball rolling */
> > > +       if (atomic_read(&fc->num_waiting) != 0)
> > > +               queue_delayed_work(system_wq, &fc->work, hang_check_timer);
> > > +       spin_unlock(&fc->lock);
> > > +       return;
> > > +
> > > +fpq_abort:
> > > +       spin_unlock(&fpq->lock);
> > > +       spin_unlock(&fc->lock);
> > > +abort_conn:
> > > +       fuse_abort_conn(fc);
> > > +}
> > > +
> > >  static void fuse_request_init(struct fuse_mount *fm, struct fuse_req *req)
> > >  {
> > > +       struct fuse_conn *fc = fm->fc;
> > >         INIT_LIST_HEAD(&req->list);
> > >         INIT_LIST_HEAD(&req->intr_entry);
> > >         init_waitqueue_head(&req->waitq);
> > >         refcount_set(&req->count, 1);
> > >         __set_bit(FR_PENDING, &req->flags);
> > >         req->fm = fm;
> > > +       req->create_time = jiffies;
> > > +
> > > +       if (sysctl_hung_task_panic) {
> > > +               spin_lock(&fc->lock);
> > > +               /* Get the ball rolling */
> > > +               queue_delayed_work(system_wq, &fc->work,
> > > +                               sysctl_hung_task_timeout_secs * (HZ / 2));
> > > +               spin_unlock(&fc->lock);
> > > +       }
> > >  }
> > >
> > >  static struct fuse_req *fuse_request_alloc(struct fuse_mount *fm, gfp_t flags)
> > > @@ -200,6 +292,14 @@ static void fuse_put_request(struct fuse_req *req)
> > >                         fuse_drop_waiting(fc);
> > >                 }
> > >
> > > +               if (sysctl_hung_task_panic) {
> > > +                       spin_lock(&fc->lock);
> > > +                       /* Stop the timeout check if we are the last request */
> > > +                       if (atomic_read(&fc->num_waiting) == 0)
> > > +                               cancel_delayed_work_sync(&fc->work);
> > > +                       spin_unlock(&fc->lock);
> > > +               }
> > > +
> > >                 fuse_request_free(req);
> > >         }
> > >  }
> > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > index 74744c6f2860..aba3ffd0fb67 100644
> > > --- a/fs/fuse/fuse_i.h
> > > +++ b/fs/fuse/fuse_i.h
> > > @@ -438,6 +438,9 @@ struct fuse_req {
> > >
> > >         /** fuse_mount this request belongs to */
> > >         struct fuse_mount *fm;
> > > +
> > > +       /** When (in jiffies) the request was created */
> > > +       unsigned long create_time;
> > >  };
> > >
> > >  struct fuse_iqueue;
> > > @@ -923,6 +926,9 @@ struct fuse_conn {
> > >         /** IDR for backing files ids */
> > >         struct idr backing_files_map;
> > >  #endif
> > > +
> > > +       /** Request wait timeout check */
> > > +       struct delayed_work work;
> > >  };
> > >
> > >  /*
> > > @@ -1190,6 +1196,8 @@ void fuse_request_end(struct fuse_req *req);
> > >  /* Abort all requests */
> > >  void fuse_abort_conn(struct fuse_conn *fc);
> > >  void fuse_wait_aborted(struct fuse_conn *fc);
> > > +/* Connection timeout */
> > > +void fuse_check_timeout(struct work_struct *wk);
> > >
> > >  /**
> > >   * Invalidate inode attributes
> > > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > > index 3ce4f4e81d09..ed96154df0fd 100644
> > > --- a/fs/fuse/inode.c
> > > +++ b/fs/fuse/inode.c
> > > @@ -23,6 +23,7 @@
> > >  #include <linux/exportfs.h>
> > >  #include <linux/posix_acl.h>
> > >  #include <linux/pid_namespace.h>
> > > +#include <linux/completion.h>
> > >  #include <uapi/linux/magic.h>
> > >
> > >  MODULE_AUTHOR("Miklos Szeredi <miklos@szeredi.hu>");
> > > @@ -964,6 +965,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
> > >         INIT_LIST_HEAD(&fc->entry);
> > >         INIT_LIST_HEAD(&fc->devices);
> > >         atomic_set(&fc->num_waiting, 0);
> > > +       INIT_DELAYED_WORK(&fc->work, fuse_check_timeout);
> > >         fc->max_background = FUSE_DEFAULT_MAX_BACKGROUND;
> > >         fc->congestion_threshold = FUSE_DEFAULT_CONGESTION_THRESHOLD;
> > >         atomic64_set(&fc->khctr, 0);
> > > @@ -1015,6 +1017,7 @@ void fuse_conn_put(struct fuse_conn *fc)
> > >                 if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> > >                         fuse_backing_files_free(fc);
> > >                 call_rcu(&fc->rcu, delayed_release);
> > > +               cancel_delayed_work_sync(&fc->work);
> > >         }
> > >  }
> > >  EXPORT_SYMBOL_GPL(fuse_conn_put);
> > > diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> > > index 5a64582b086b..1ed3a511060d 100644
> > > --- a/include/linux/sched/sysctl.h
> > > +++ b/include/linux/sched/sysctl.h
> > > @@ -5,11 +5,15 @@
> > >  #include <linux/types.h>
> > >
> > >  #ifdef CONFIG_DETECT_HUNG_TASK
> > > -/* used for hung_task and block/ */
> > > +/* used for hung_task, block/ and fuse */
> > >  extern unsigned long sysctl_hung_task_timeout_secs;
> > > +extern unsigned int sysctl_hung_task_panic;
> > >  #else
> > >  /* Avoid need for ifdefs elsewhere in the code */
> > > -enum { sysctl_hung_task_timeout_secs = 0 };
> > > +enum {
> > > +       sysctl_hung_task_timeout_secs = 0,
> > > +       sysctl_hung_task_panic = 0,
> > > +};
> > >  #endif
> > >
> > >  enum sched_tunable_scaling {
> > > diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> > > index c18717189f32..16602d3754b1 100644
> > > --- a/kernel/hung_task.c
> > > +++ b/kernel/hung_task.c
> > > @@ -78,8 +78,9 @@ static unsigned int __read_mostly sysctl_hung_task_all_cpu_backtrace;
> > >   * Should we panic (and reboot, if panic_timeout= is set) when a
> > >   * hung task is detected:
> > >   */
> > > -static unsigned int __read_mostly sysctl_hung_task_panic =
> > > +unsigned int __read_mostly sysctl_hung_task_panic =
> > >         IS_ENABLED(CONFIG_BOOTPARAM_HUNG_TASK_PANIC);
> > > +EXPORT_SYMBOL_GPL(sysctl_hung_task_panic);
> > >
> > >  static int
> > >  hung_task_panic(struct notifier_block *this, unsigned long event, void *ptr)
> > > --
> > > 2.34.1
> > >

Etienne Martineau Dec. 12, 2024, 1:30 p.m. UTC | #4

On Thu, Dec 12, 2024 at 12:27 AM Lance Yang <ioworker0@gmail.com> wrote:
>
> On Thu, Dec 12, 2024 at 7:04 AM Etienne Martineau
> <etmartin4313@gmail.com> wrote:
> >
> > On Wed, Dec 11, 2024 at 5:04 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > On Wed, Dec 11, 2024 at 11:45 AM <etmartin4313@gmail.com> wrote:
> > > >
> > > > From: Etienne Martineau <etmartin4313@gmail.com>
> > > >
> > > > This patch abort connection if HUNG_TASK_PANIC is set and a FUSE server
> > > > is getting stuck for too long. A slow FUSE server may tripped the
>
> A FUSE server is getting stuck for longer than hung_task_timeout_secs
> (the default is two minutes). Is it not buggy in the real-world?
Can be buggy OR malicious yes. FUSE server is a user-space process so all
bets are off.

> > > > hang check timer for legitimate reasons hence consider disabling
> > > > HUNG_TASK_PANIC in that scenario.
>
> Why not just consider increasing hung_task_timeout_secs if necessary?
What value is the right value then?

The timeout is global, so by increasing the timeout it will take
longer for the kernel
to report and act on different hung task scenarios.

HUNG_TASK_PANIC is a failsafe mechanism that helps prevent your system from
running headless for too long. Say you are in crypto-mining and your
box is getting
stuck on some FPGA drivers -- the process calling into that driver is
stuck in an
UNINTERRUPTIBLE wait somehow. Without HUNG_TASK_PANIC you'll lose
money because your box may end up sitting idling for hours doing nothing.

>
> > > >
> > > > Without this patch, an unresponsive / buggy / malicious FUSE server can
> > > > leave the clients in D state for a long period of time and on system where
> > > > HUNG_TASK_PANIC is set, trigger a catastrophic reload.
>
> Sorry, I don't see any sense in knowing whether HUNG_TASK_PANIC is enabled for
> FUSE servers. Or am I possibly missing something important?
Regular file-system drivers handles everything internally but FUSE on
the other hands,
delegate the file system operation to a user process ( FUSE server )
If the FUSE server is turning bad, you don't want to reload right?

A non-privileged user can  potentially exploit this flaw and trigger a
reload. I'm
surprised that this didn't get flagged before ( maybe I'm missing something ? )
IMO this is why I think something needs to be done for the stable
branch as well.

>
> If HUNG_TASK_PANIC is set, we should do a reload when a hung task is detected;
> this is working as expected IHMO.
Say when your browser hangs on your system, do you reload? FUSE server
is just another
process.

thanks
Etienne

> Thanks,
> Lance
>
> > > >
> > > > So, if HUNG_TASK_PANIC checking is enabled, we should wake up periodically
> > > > to abort connections that exceed the timeout value which is define to be
> > > > half the HUNG_TASK_TIMEOUT period, which keeps overhead low. The timer
> > > > is per connection and runs only if there are active FUSE request pending.
> > >
> > > Hi Etienne,
> > >
> > > For your use case, does the generic request timeouts logic and
> > > max_request_timeout systemctl implemented in [1] and [2] not suffice?
> > > IMO I don't think we should have logic specifically checking for hung
> > > task timeouts in fuse, if the generic solution can be used.
> > >
> > > Thanks,
> > > Joanne
> >
> > We need a way to avoid catastrophic reloads on systems where HUNG_TASK_PANIC
> > is set while a buggy / malicious FUSE server stops responding.
> > I would argue that this is much needed in stable branches as well...
> >
> > For that reason, I believe we need to keep things simple for step #1
> > e.g. there is no
> > need to introduce another knob as we already have HUNG_TASK_TIMEOUT which
> > holds the source of truth.
> >
> > IMO introducing those new knobs will put an unnecessary burden on sysadmins into
> > something that is error prone because unlike
> >   CONFIG_DETECT_HUNG_TASK=y
> >   CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=120
> > which is built-in, the "default_request_timeout" /
> > "max_request_timeout" needs to be
> > set appropriately after every reboot and failure to do so may have
> > nasty consequences.
> >
> > For the more generic cases then yes those timeouts make sense to me.
> >
> > Thanks,
> > Etienne
> >
> > >
> > > [1] https://lore.kernel.org/linux-fsdevel/20241114191332.669127-1-joannelkoong@gmail.com/
> > > [2] https://lore.kernel.org/linux-fsdevel/20241114191332.669127-4-joannelkoong@gmail.com/
> > >
> > > >
> > > > A FUSE client can get into D state as such ( see below scenario #1 / #2 )
> > > >  1) request_wait_answer() -> wait_event() is UNINTERRUPTIBLE
> > > >     OR
> > > >  2) request_wait_answer() -> wait_event_(interruptible / killable) is head
> > > >     of line blocking for subsequent clients accessing the same file
> > > >
> > > >         scenario #1:
> > > >         2716 pts/2    D+     0:00 cat
> > > >         $ cat /proc/2716/stack
> > > >                 [<0>] request_wait_answer+0x22e/0x340
> > > >                 [<0>] __fuse_simple_request+0xd8/0x2c0
> > > >                 [<0>] fuse_perform_write+0x3ec/0x760
> > > >                 [<0>] fuse_file_write_iter+0x3d5/0x3f0
> > > >                 [<0>] vfs_write+0x313/0x430
> > > >                 [<0>] ksys_write+0x6a/0xf0
> > > >                 [<0>] __x64_sys_write+0x19/0x30
> > > >                 [<0>] x64_sys_call+0x18ab/0x26f0
> > > >                 [<0>] do_syscall_64+0x7c/0x170
> > > >                 [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > > >
> > > >         scenario #2:
> > > >         2962 pts/2    S+     0:00 cat
> > > >         2963 pts/3    D+     0:00 cat
> > > >         $ cat /proc/2962/stack
> > > >                 [<0>] request_wait_answer+0x140/0x340
> > > >                 [<0>] __fuse_simple_request+0xd8/0x2c0
> > > >                 [<0>] fuse_perform_write+0x3ec/0x760
> > > >                 [<0>] fuse_file_write_iter+0x3d5/0x3f0
> > > >                 [<0>] vfs_write+0x313/0x430
> > > >                 [<0>] ksys_write+0x6a/0xf0
> > > >                 [<0>] __x64_sys_write+0x19/0x30
> > > >                 [<0>] x64_sys_call+0x18ab/0x26f0
> > > >                 [<0>] do_syscall_64+0x7c/0x170
> > > >                 [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > > >         $ cat /proc/2963/stack
> > > >                 [<0>] fuse_file_write_iter+0x252/0x3f0
> > > >                 [<0>] vfs_write+0x313/0x430
> > > >                 [<0>] ksys_write+0x6a/0xf0
> > > >                 [<0>] __x64_sys_write+0x19/0x30
> > > >                 [<0>] x64_sys_call+0x18ab/0x26f0
> > > >                 [<0>] do_syscall_64+0x7c/0x170
> > > >                 [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > > >
> > > > Please note that this patch doesn't prevent the HUNG_TASK_WARNING.
> > > >
> > > > Signed-off-by: Etienne Martineau <etmartin4313@gmail.com>
> > > > ---
> > > >  fs/fuse/dev.c                | 100 +++++++++++++++++++++++++++++++++++
> > > >  fs/fuse/fuse_i.h             |   8 +++
> > > >  fs/fuse/inode.c              |   3 ++
> > > >  include/linux/sched/sysctl.h |   8 ++-
> > > >  kernel/hung_task.c           |   3 +-
> > > >  5 files changed, 119 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> > > > index 27ccae63495d..73d19de14e51 100644
> > > > --- a/fs/fuse/dev.c
> > > > +++ b/fs/fuse/dev.c
> > > > @@ -21,6 +21,8 @@
> > > >  #include <linux/swap.h>
> > > >  #include <linux/splice.h>
> > > >  #include <linux/sched.h>
> > > > +#include <linux/completion.h>
> > > > +#include <linux/sched/sysctl.h>
> > > >
> > > >  #define CREATE_TRACE_POINTS
> > > >  #include "fuse_trace.h"
> > > > @@ -45,14 +47,104 @@ static struct fuse_dev *fuse_get_dev(struct file *file)
> > > >         return READ_ONCE(file->private_data);
> > > >  }
> > > >
> > > > +static bool request_expired(struct fuse_conn *fc, struct fuse_req *req,
> > > > +               int timeout)
> > > > +{
> > > > +       return time_after(jiffies, req->create_time + timeout);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Prevent hung task timer from firing at us
> > > > + * Check if any requests aren't being completed by the specified request
> > > > + * timeout. To do so, we:
> > > > + * - check the fiq pending list
> > > > + * - check the bg queue
> > > > + * - check the fpq io and processing lists
> > > > + *
> > > > + * To make this fast, we only check against the head request on each list since
> > > > + * these are generally queued in order of creation time (eg newer requests get
> > > > + * queued to the tail). We might miss a few edge cases (eg requests transitioning
> > > > + * between lists, re-sent requests at the head of the pending list having a
> > > > + * later creation time than other requests on that list, etc.) but that is fine
> > > > + * since if the request never gets fulfilled, it will eventually be caught.
> > > > + */
> > > > +void fuse_check_timeout(struct work_struct *wk)
> > > > +{
> > > > +       unsigned long hang_check_timer = sysctl_hung_task_timeout_secs * (HZ / 2);
> > > > +       struct fuse_conn *fc = container_of(wk, struct fuse_conn, work.work);
> > > > +       struct fuse_iqueue *fiq = &fc->iq;
> > > > +       struct fuse_req *req;
> > > > +       struct fuse_dev *fud;
> > > > +       struct fuse_pqueue *fpq;
> > > > +       bool expired = false;
> > > > +       int i;
> > > > +
> > > > +       spin_lock(&fiq->lock);
> > > > +       req = list_first_entry_or_null(&fiq->pending, struct fuse_req, list);
> > > > +       if (req)
> > > > +               expired = request_expired(fc, req, hang_check_timer);
> > > > +       spin_unlock(&fiq->lock);
> > > > +       if (expired)
> > > > +               goto abort_conn;
> > > > +
> > > > +       spin_lock(&fc->bg_lock);
> > > > +       req = list_first_entry_or_null(&fc->bg_queue, struct fuse_req, list);
> > > > +       if (req)
> > > > +               expired = request_expired(fc, req, hang_check_timer);
> > > > +       spin_unlock(&fc->bg_lock);
> > > > +       if (expired)
> > > > +               goto abort_conn;
> > > > +
> > > > +       spin_lock(&fc->lock);
> > > > +       if (!fc->connected) {
> > > > +               spin_unlock(&fc->lock);
> > > > +               return;
> > > > +       }
> > > > +       list_for_each_entry(fud, &fc->devices, entry) {
> > > > +               fpq = &fud->pq;
> > > > +               spin_lock(&fpq->lock);
> > > > +               req = list_first_entry_or_null(&fpq->io, struct fuse_req, list);
> > > > +               if (req && request_expired(fc, req, hang_check_timer))
> > > > +                       goto fpq_abort;
> > > > +
> > > > +               for (i = 0; i < FUSE_PQ_HASH_SIZE; i++) {
> > > > +                       req = list_first_entry_or_null(&fpq->processing[i], struct fuse_req, list);
> > > > +                       if (req && request_expired(fc, req, hang_check_timer))
> > > > +                               goto fpq_abort;
> > > > +               }
> > > > +               spin_unlock(&fpq->lock);
> > > > +       }
> > > > +       /* Keep the ball rolling */
> > > > +       if (atomic_read(&fc->num_waiting) != 0)
> > > > +               queue_delayed_work(system_wq, &fc->work, hang_check_timer);
> > > > +       spin_unlock(&fc->lock);
> > > > +       return;
> > > > +
> > > > +fpq_abort:
> > > > +       spin_unlock(&fpq->lock);
> > > > +       spin_unlock(&fc->lock);
> > > > +abort_conn:
> > > > +       fuse_abort_conn(fc);
> > > > +}
> > > > +
> > > >  static void fuse_request_init(struct fuse_mount *fm, struct fuse_req *req)
> > > >  {
> > > > +       struct fuse_conn *fc = fm->fc;
> > > >         INIT_LIST_HEAD(&req->list);
> > > >         INIT_LIST_HEAD(&req->intr_entry);
> > > >         init_waitqueue_head(&req->waitq);
> > > >         refcount_set(&req->count, 1);
> > > >         __set_bit(FR_PENDING, &req->flags);
> > > >         req->fm = fm;
> > > > +       req->create_time = jiffies;
> > > > +
> > > > +       if (sysctl_hung_task_panic) {
> > > > +               spin_lock(&fc->lock);
> > > > +               /* Get the ball rolling */
> > > > +               queue_delayed_work(system_wq, &fc->work,
> > > > +                               sysctl_hung_task_timeout_secs * (HZ / 2));
> > > > +               spin_unlock(&fc->lock);
> > > > +       }
> > > >  }
> > > >
> > > >  static struct fuse_req *fuse_request_alloc(struct fuse_mount *fm, gfp_t flags)
> > > > @@ -200,6 +292,14 @@ static void fuse_put_request(struct fuse_req *req)
> > > >                         fuse_drop_waiting(fc);
> > > >                 }
> > > >
> > > > +               if (sysctl_hung_task_panic) {
> > > > +                       spin_lock(&fc->lock);
> > > > +                       /* Stop the timeout check if we are the last request */
> > > > +                       if (atomic_read(&fc->num_waiting) == 0)
> > > > +                               cancel_delayed_work_sync(&fc->work);
> > > > +                       spin_unlock(&fc->lock);
> > > > +               }
> > > > +
> > > >                 fuse_request_free(req);
> > > >         }
> > > >  }
> > > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > > index 74744c6f2860..aba3ffd0fb67 100644
> > > > --- a/fs/fuse/fuse_i.h
> > > > +++ b/fs/fuse/fuse_i.h
> > > > @@ -438,6 +438,9 @@ struct fuse_req {
> > > >
> > > >         /** fuse_mount this request belongs to */
> > > >         struct fuse_mount *fm;
> > > > +
> > > > +       /** When (in jiffies) the request was created */
> > > > +       unsigned long create_time;
> > > >  };
> > > >
> > > >  struct fuse_iqueue;
> > > > @@ -923,6 +926,9 @@ struct fuse_conn {
> > > >         /** IDR for backing files ids */
> > > >         struct idr backing_files_map;
> > > >  #endif
> > > > +
> > > > +       /** Request wait timeout check */
> > > > +       struct delayed_work work;
> > > >  };
> > > >
> > > >  /*
> > > > @@ -1190,6 +1196,8 @@ void fuse_request_end(struct fuse_req *req);
> > > >  /* Abort all requests */
> > > >  void fuse_abort_conn(struct fuse_conn *fc);
> > > >  void fuse_wait_aborted(struct fuse_conn *fc);
> > > > +/* Connection timeout */
> > > > +void fuse_check_timeout(struct work_struct *wk);
> > > >
> > > >  /**
> > > >   * Invalidate inode attributes
> > > > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > > > index 3ce4f4e81d09..ed96154df0fd 100644
> > > > --- a/fs/fuse/inode.c
> > > > +++ b/fs/fuse/inode.c
> > > > @@ -23,6 +23,7 @@
> > > >  #include <linux/exportfs.h>
> > > >  #include <linux/posix_acl.h>
> > > >  #include <linux/pid_namespace.h>
> > > > +#include <linux/completion.h>
> > > >  #include <uapi/linux/magic.h>
> > > >
> > > >  MODULE_AUTHOR("Miklos Szeredi <miklos@szeredi.hu>");
> > > > @@ -964,6 +965,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
> > > >         INIT_LIST_HEAD(&fc->entry);
> > > >         INIT_LIST_HEAD(&fc->devices);
> > > >         atomic_set(&fc->num_waiting, 0);
> > > > +       INIT_DELAYED_WORK(&fc->work, fuse_check_timeout);
> > > >         fc->max_background = FUSE_DEFAULT_MAX_BACKGROUND;
> > > >         fc->congestion_threshold = FUSE_DEFAULT_CONGESTION_THRESHOLD;
> > > >         atomic64_set(&fc->khctr, 0);
> > > > @@ -1015,6 +1017,7 @@ void fuse_conn_put(struct fuse_conn *fc)
> > > >                 if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> > > >                         fuse_backing_files_free(fc);
> > > >                 call_rcu(&fc->rcu, delayed_release);
> > > > +               cancel_delayed_work_sync(&fc->work);
> > > >         }
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(fuse_conn_put);
> > > > diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> > > > index 5a64582b086b..1ed3a511060d 100644
> > > > --- a/include/linux/sched/sysctl.h
> > > > +++ b/include/linux/sched/sysctl.h
> > > > @@ -5,11 +5,15 @@
> > > >  #include <linux/types.h>
> > > >
> > > >  #ifdef CONFIG_DETECT_HUNG_TASK
> > > > -/* used for hung_task and block/ */
> > > > +/* used for hung_task, block/ and fuse */
> > > >  extern unsigned long sysctl_hung_task_timeout_secs;
> > > > +extern unsigned int sysctl_hung_task_panic;
> > > >  #else
> > > >  /* Avoid need for ifdefs elsewhere in the code */
> > > > -enum { sysctl_hung_task_timeout_secs = 0 };
> > > > +enum {
> > > > +       sysctl_hung_task_timeout_secs = 0,
> > > > +       sysctl_hung_task_panic = 0,
> > > > +};
> > > >  #endif
> > > >
> > > >  enum sched_tunable_scaling {
> > > > diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> > > > index c18717189f32..16602d3754b1 100644
> > > > --- a/kernel/hung_task.c
> > > > +++ b/kernel/hung_task.c
> > > > @@ -78,8 +78,9 @@ static unsigned int __read_mostly sysctl_hung_task_all_cpu_backtrace;
> > > >   * Should we panic (and reboot, if panic_timeout= is set) when a
> > > >   * hung task is detected:
> > > >   */
> > > > -static unsigned int __read_mostly sysctl_hung_task_panic =
> > > > +unsigned int __read_mostly sysctl_hung_task_panic =
> > > >         IS_ENABLED(CONFIG_BOOTPARAM_HUNG_TASK_PANIC);
> > > > +EXPORT_SYMBOL_GPL(sysctl_hung_task_panic);
> > > >
> > > >  static int
> > > >  hung_task_panic(struct notifier_block *this, unsigned long event, void *ptr)
> > > > --
> > > > 2.34.1
> > > >

Lance Yang Dec. 12, 2024, 3:08 p.m. UTC | #5

CC+ Andrew
CC+ David
CC+ Matthew
CC+ Barry
CC+ Ryan

On Thu, Dec 12, 2024 at 9:30 PM Etienne Martineau
<etmartin4313@gmail.com> wrote:
>
> On Thu, Dec 12, 2024 at 12:27 AM Lance Yang <ioworker0@gmail.com> wrote:
> >
> > On Thu, Dec 12, 2024 at 7:04 AM Etienne Martineau
> > <etmartin4313@gmail.com> wrote:
> > >
> > > On Wed, Dec 11, 2024 at 5:04 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > >
> > > > On Wed, Dec 11, 2024 at 11:45 AM <etmartin4313@gmail.com> wrote:
> > > > >
> > > > > From: Etienne Martineau <etmartin4313@gmail.com>
> > > > >
> > > > > This patch abort connection if HUNG_TASK_PANIC is set and a FUSE server
> > > > > is getting stuck for too long. A slow FUSE server may tripped the
> >
> > A FUSE server is getting stuck for longer than hung_task_timeout_secs
> > (the default is two minutes). Is it not buggy in the real-world?
> Can be buggy OR malicious yes. FUSE server is a user-space process so all
> bets are off.

Yep, all bets are off ;)

>
> > > > > hang check timer for legitimate reasons hence consider disabling
> > > > > HUNG_TASK_PANIC in that scenario.
> >
> > Why not just consider increasing hung_task_timeout_secs if necessary?
> What value is the right value then?
>
> The timeout is global, so by increasing the timeout it will take
> longer for the kernel
> to report and act on different hung task scenarios.
>
> HUNG_TASK_PANIC is a failsafe mechanism that helps prevent your system from
> running headless for too long. Say you are in crypto-mining and your
> box is getting
> stuck on some FPGA drivers -- the process calling into that driver is
> stuck in an
> UNINTERRUPTIBLE wait somehow. Without HUNG_TASK_PANIC you'll lose
> money because your box may end up sitting idling for hours doing nothing.

Agreed, which is why HUNG_TASK_PANIC is enabled by default, IIRC.

>
> >
> > > > >
> > > > > Without this patch, an unresponsive / buggy / malicious FUSE server can
> > > > > leave the clients in D state for a long period of time and on system where
> > > > > HUNG_TASK_PANIC is set, trigger a catastrophic reload.
> >
> > Sorry, I don't see any sense in knowing whether HUNG_TASK_PANIC is enabled for
> > FUSE servers. Or am I possibly missing something important?
> Regular file-system drivers handles everything internally but FUSE on
> the other hands,
> delegate the file system operation to a user process ( FUSE server )
> If the FUSE server is turning bad, you don't want to reload right?

To me, it makes sense to reload the system if HUNG_TASK_PANIC is
enabled. Doing so allows me to notice the issue in time and resolve it
through the kernel dump, IHMO.

>
> A non-privileged user can  potentially exploit this flaw and trigger a
> reload. I'm
> surprised that this didn't get flagged before ( maybe I'm missing something ? )
> IMO this is why I think something needs to be done for the stable
> branch as well.

AFAIK, besides this, a non-privileged user has other ways to cause some
processes to stay in the D state for a long period of time.

>
> >
> > If HUNG_TASK_PANIC is set, we should do a reload when a hung task is detected;
> > this is working as expected IHMO.
> Say when your browser hangs on your system, do you reload? FUSE server
> is just another
> process.

Hmm... the choice to enable HUNG_TASK_PANIC should be up to the user, while
the decision to reload the system should be up to the hung task detector ;)

Thanks a lot for including me. It seems like we're not on the same page and I'm
also not a FUSE expert. So, let's hear the views of others.

Thanks,
Lance


>
> thanks
> Etienne
>
> > Thanks,
> > Lance
> >
> > > > >
> > > > > So, if HUNG_TASK_PANIC checking is enabled, we should wake up periodically
> > > > > to abort connections that exceed the timeout value which is define to be
> > > > > half the HUNG_TASK_TIMEOUT period, which keeps overhead low. The timer
> > > > > is per connection and runs only if there are active FUSE request pending.
> > > >
> > > > Hi Etienne,
> > > >
> > > > For your use case, does the generic request timeouts logic and
> > > > max_request_timeout systemctl implemented in [1] and [2] not suffice?
> > > > IMO I don't think we should have logic specifically checking for hung
> > > > task timeouts in fuse, if the generic solution can be used.
> > > >
> > > > Thanks,
> > > > Joanne
> > >
> > > We need a way to avoid catastrophic reloads on systems where HUNG_TASK_PANIC
> > > is set while a buggy / malicious FUSE server stops responding.
> > > I would argue that this is much needed in stable branches as well...
> > >
> > > For that reason, I believe we need to keep things simple for step #1
> > > e.g. there is no
> > > need to introduce another knob as we already have HUNG_TASK_TIMEOUT which
> > > holds the source of truth.
> > >
> > > IMO introducing those new knobs will put an unnecessary burden on sysadmins into
> > > something that is error prone because unlike
> > >   CONFIG_DETECT_HUNG_TASK=y
> > >   CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=120
> > > which is built-in, the "default_request_timeout" /
> > > "max_request_timeout" needs to be
> > > set appropriately after every reboot and failure to do so may have
> > > nasty consequences.
> > >
> > > For the more generic cases then yes those timeouts make sense to me.
> > >
> > > Thanks,
> > > Etienne
> > >
> > > >
> > > > [1] https://lore.kernel.org/linux-fsdevel/20241114191332.669127-1-joannelkoong@gmail.com/
> > > > [2] https://lore.kernel.org/linux-fsdevel/20241114191332.669127-4-joannelkoong@gmail.com/
> > > >
> > > > >
> > > > > A FUSE client can get into D state as such ( see below scenario #1 / #2 )
> > > > >  1) request_wait_answer() -> wait_event() is UNINTERRUPTIBLE
> > > > >     OR
> > > > >  2) request_wait_answer() -> wait_event_(interruptible / killable) is head
> > > > >     of line blocking for subsequent clients accessing the same file
> > > > >
> > > > >         scenario #1:
> > > > >         2716 pts/2    D+     0:00 cat
> > > > >         $ cat /proc/2716/stack
> > > > >                 [<0>] request_wait_answer+0x22e/0x340
> > > > >                 [<0>] __fuse_simple_request+0xd8/0x2c0
> > > > >                 [<0>] fuse_perform_write+0x3ec/0x760
> > > > >                 [<0>] fuse_file_write_iter+0x3d5/0x3f0
> > > > >                 [<0>] vfs_write+0x313/0x430
> > > > >                 [<0>] ksys_write+0x6a/0xf0
> > > > >                 [<0>] __x64_sys_write+0x19/0x30
> > > > >                 [<0>] x64_sys_call+0x18ab/0x26f0
> > > > >                 [<0>] do_syscall_64+0x7c/0x170
> > > > >                 [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > > > >
> > > > >         scenario #2:
> > > > >         2962 pts/2    S+     0:00 cat
> > > > >         2963 pts/3    D+     0:00 cat
> > > > >         $ cat /proc/2962/stack
> > > > >                 [<0>] request_wait_answer+0x140/0x340
> > > > >                 [<0>] __fuse_simple_request+0xd8/0x2c0
> > > > >                 [<0>] fuse_perform_write+0x3ec/0x760
> > > > >                 [<0>] fuse_file_write_iter+0x3d5/0x3f0
> > > > >                 [<0>] vfs_write+0x313/0x430
> > > > >                 [<0>] ksys_write+0x6a/0xf0
> > > > >                 [<0>] __x64_sys_write+0x19/0x30
> > > > >                 [<0>] x64_sys_call+0x18ab/0x26f0
> > > > >                 [<0>] do_syscall_64+0x7c/0x170
> > > > >                 [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > > > >         $ cat /proc/2963/stack
> > > > >                 [<0>] fuse_file_write_iter+0x252/0x3f0
> > > > >                 [<0>] vfs_write+0x313/0x430
> > > > >                 [<0>] ksys_write+0x6a/0xf0
> > > > >                 [<0>] __x64_sys_write+0x19/0x30
> > > > >                 [<0>] x64_sys_call+0x18ab/0x26f0
> > > > >                 [<0>] do_syscall_64+0x7c/0x170
> > > > >                 [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > > > >
> > > > > Please note that this patch doesn't prevent the HUNG_TASK_WARNING.
> > > > >
> > > > > Signed-off-by: Etienne Martineau <etmartin4313@gmail.com>
> > > > > ---
> > > > >  fs/fuse/dev.c                | 100 +++++++++++++++++++++++++++++++++++
> > > > >  fs/fuse/fuse_i.h             |   8 +++
> > > > >  fs/fuse/inode.c              |   3 ++
> > > > >  include/linux/sched/sysctl.h |   8 ++-
> > > > >  kernel/hung_task.c           |   3 +-
> > > > >  5 files changed, 119 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> > > > > index 27ccae63495d..73d19de14e51 100644
> > > > > --- a/fs/fuse/dev.c
> > > > > +++ b/fs/fuse/dev.c
> > > > > @@ -21,6 +21,8 @@
> > > > >  #include <linux/swap.h>
> > > > >  #include <linux/splice.h>
> > > > >  #include <linux/sched.h>
> > > > > +#include <linux/completion.h>
> > > > > +#include <linux/sched/sysctl.h>
> > > > >
> > > > >  #define CREATE_TRACE_POINTS
> > > > >  #include "fuse_trace.h"
> > > > > @@ -45,14 +47,104 @@ static struct fuse_dev *fuse_get_dev(struct file *file)
> > > > >         return READ_ONCE(file->private_data);
> > > > >  }
> > > > >
> > > > > +static bool request_expired(struct fuse_conn *fc, struct fuse_req *req,
> > > > > +               int timeout)
> > > > > +{
> > > > > +       return time_after(jiffies, req->create_time + timeout);
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * Prevent hung task timer from firing at us
> > > > > + * Check if any requests aren't being completed by the specified request
> > > > > + * timeout. To do so, we:
> > > > > + * - check the fiq pending list
> > > > > + * - check the bg queue
> > > > > + * - check the fpq io and processing lists
> > > > > + *
> > > > > + * To make this fast, we only check against the head request on each list since
> > > > > + * these are generally queued in order of creation time (eg newer requests get
> > > > > + * queued to the tail). We might miss a few edge cases (eg requests transitioning
> > > > > + * between lists, re-sent requests at the head of the pending list having a
> > > > > + * later creation time than other requests on that list, etc.) but that is fine
> > > > > + * since if the request never gets fulfilled, it will eventually be caught.
> > > > > + */
> > > > > +void fuse_check_timeout(struct work_struct *wk)
> > > > > +{
> > > > > +       unsigned long hang_check_timer = sysctl_hung_task_timeout_secs * (HZ / 2);
> > > > > +       struct fuse_conn *fc = container_of(wk, struct fuse_conn, work.work);
> > > > > +       struct fuse_iqueue *fiq = &fc->iq;
> > > > > +       struct fuse_req *req;
> > > > > +       struct fuse_dev *fud;
> > > > > +       struct fuse_pqueue *fpq;
> > > > > +       bool expired = false;
> > > > > +       int i;
> > > > > +
> > > > > +       spin_lock(&fiq->lock);
> > > > > +       req = list_first_entry_or_null(&fiq->pending, struct fuse_req, list);
> > > > > +       if (req)
> > > > > +               expired = request_expired(fc, req, hang_check_timer);
> > > > > +       spin_unlock(&fiq->lock);
> > > > > +       if (expired)
> > > > > +               goto abort_conn;
> > > > > +
> > > > > +       spin_lock(&fc->bg_lock);
> > > > > +       req = list_first_entry_or_null(&fc->bg_queue, struct fuse_req, list);
> > > > > +       if (req)
> > > > > +               expired = request_expired(fc, req, hang_check_timer);
> > > > > +       spin_unlock(&fc->bg_lock);
> > > > > +       if (expired)
> > > > > +               goto abort_conn;
> > > > > +
> > > > > +       spin_lock(&fc->lock);
> > > > > +       if (!fc->connected) {
> > > > > +               spin_unlock(&fc->lock);
> > > > > +               return;
> > > > > +       }
> > > > > +       list_for_each_entry(fud, &fc->devices, entry) {
> > > > > +               fpq = &fud->pq;
> > > > > +               spin_lock(&fpq->lock);
> > > > > +               req = list_first_entry_or_null(&fpq->io, struct fuse_req, list);
> > > > > +               if (req && request_expired(fc, req, hang_check_timer))
> > > > > +                       goto fpq_abort;
> > > > > +
> > > > > +               for (i = 0; i < FUSE_PQ_HASH_SIZE; i++) {
> > > > > +                       req = list_first_entry_or_null(&fpq->processing[i], struct fuse_req, list);
> > > > > +                       if (req && request_expired(fc, req, hang_check_timer))
> > > > > +                               goto fpq_abort;
> > > > > +               }
> > > > > +               spin_unlock(&fpq->lock);
> > > > > +       }
> > > > > +       /* Keep the ball rolling */
> > > > > +       if (atomic_read(&fc->num_waiting) != 0)
> > > > > +               queue_delayed_work(system_wq, &fc->work, hang_check_timer);
> > > > > +       spin_unlock(&fc->lock);
> > > > > +       return;
> > > > > +
> > > > > +fpq_abort:
> > > > > +       spin_unlock(&fpq->lock);
> > > > > +       spin_unlock(&fc->lock);
> > > > > +abort_conn:
> > > > > +       fuse_abort_conn(fc);
> > > > > +}
> > > > > +
> > > > >  static void fuse_request_init(struct fuse_mount *fm, struct fuse_req *req)
> > > > >  {
> > > > > +       struct fuse_conn *fc = fm->fc;
> > > > >         INIT_LIST_HEAD(&req->list);
> > > > >         INIT_LIST_HEAD(&req->intr_entry);
> > > > >         init_waitqueue_head(&req->waitq);
> > > > >         refcount_set(&req->count, 1);
> > > > >         __set_bit(FR_PENDING, &req->flags);
> > > > >         req->fm = fm;
> > > > > +       req->create_time = jiffies;
> > > > > +
> > > > > +       if (sysctl_hung_task_panic) {
> > > > > +               spin_lock(&fc->lock);
> > > > > +               /* Get the ball rolling */
> > > > > +               queue_delayed_work(system_wq, &fc->work,
> > > > > +                               sysctl_hung_task_timeout_secs * (HZ / 2));
> > > > > +               spin_unlock(&fc->lock);
> > > > > +       }
> > > > >  }
> > > > >
> > > > >  static struct fuse_req *fuse_request_alloc(struct fuse_mount *fm, gfp_t flags)
> > > > > @@ -200,6 +292,14 @@ static void fuse_put_request(struct fuse_req *req)
> > > > >                         fuse_drop_waiting(fc);
> > > > >                 }
> > > > >
> > > > > +               if (sysctl_hung_task_panic) {
> > > > > +                       spin_lock(&fc->lock);
> > > > > +                       /* Stop the timeout check if we are the last request */
> > > > > +                       if (atomic_read(&fc->num_waiting) == 0)
> > > > > +                               cancel_delayed_work_sync(&fc->work);
> > > > > +                       spin_unlock(&fc->lock);
> > > > > +               }
> > > > > +
> > > > >                 fuse_request_free(req);
> > > > >         }
> > > > >  }
> > > > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > > > index 74744c6f2860..aba3ffd0fb67 100644
> > > > > --- a/fs/fuse/fuse_i.h
> > > > > +++ b/fs/fuse/fuse_i.h
> > > > > @@ -438,6 +438,9 @@ struct fuse_req {
> > > > >
> > > > >         /** fuse_mount this request belongs to */
> > > > >         struct fuse_mount *fm;
> > > > > +
> > > > > +       /** When (in jiffies) the request was created */
> > > > > +       unsigned long create_time;
> > > > >  };
> > > > >
> > > > >  struct fuse_iqueue;
> > > > > @@ -923,6 +926,9 @@ struct fuse_conn {
> > > > >         /** IDR for backing files ids */
> > > > >         struct idr backing_files_map;
> > > > >  #endif
> > > > > +
> > > > > +       /** Request wait timeout check */
> > > > > +       struct delayed_work work;
> > > > >  };
> > > > >
> > > > >  /*
> > > > > @@ -1190,6 +1196,8 @@ void fuse_request_end(struct fuse_req *req);
> > > > >  /* Abort all requests */
> > > > >  void fuse_abort_conn(struct fuse_conn *fc);
> > > > >  void fuse_wait_aborted(struct fuse_conn *fc);
> > > > > +/* Connection timeout */
> > > > > +void fuse_check_timeout(struct work_struct *wk);
> > > > >
> > > > >  /**
> > > > >   * Invalidate inode attributes
> > > > > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > > > > index 3ce4f4e81d09..ed96154df0fd 100644
> > > > > --- a/fs/fuse/inode.c
> > > > > +++ b/fs/fuse/inode.c
> > > > > @@ -23,6 +23,7 @@
> > > > >  #include <linux/exportfs.h>
> > > > >  #include <linux/posix_acl.h>
> > > > >  #include <linux/pid_namespace.h>
> > > > > +#include <linux/completion.h>
> > > > >  #include <uapi/linux/magic.h>
> > > > >
> > > > >  MODULE_AUTHOR("Miklos Szeredi <miklos@szeredi.hu>");
> > > > > @@ -964,6 +965,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
> > > > >         INIT_LIST_HEAD(&fc->entry);
> > > > >         INIT_LIST_HEAD(&fc->devices);
> > > > >         atomic_set(&fc->num_waiting, 0);
> > > > > +       INIT_DELAYED_WORK(&fc->work, fuse_check_timeout);
> > > > >         fc->max_background = FUSE_DEFAULT_MAX_BACKGROUND;
> > > > >         fc->congestion_threshold = FUSE_DEFAULT_CONGESTION_THRESHOLD;
> > > > >         atomic64_set(&fc->khctr, 0);
> > > > > @@ -1015,6 +1017,7 @@ void fuse_conn_put(struct fuse_conn *fc)
> > > > >                 if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> > > > >                         fuse_backing_files_free(fc);
> > > > >                 call_rcu(&fc->rcu, delayed_release);
> > > > > +               cancel_delayed_work_sync(&fc->work);
> > > > >         }
> > > > >  }
> > > > >  EXPORT_SYMBOL_GPL(fuse_conn_put);
> > > > > diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> > > > > index 5a64582b086b..1ed3a511060d 100644
> > > > > --- a/include/linux/sched/sysctl.h
> > > > > +++ b/include/linux/sched/sysctl.h
> > > > > @@ -5,11 +5,15 @@
> > > > >  #include <linux/types.h>
> > > > >
> > > > >  #ifdef CONFIG_DETECT_HUNG_TASK
> > > > > -/* used for hung_task and block/ */
> > > > > +/* used for hung_task, block/ and fuse */
> > > > >  extern unsigned long sysctl_hung_task_timeout_secs;
> > > > > +extern unsigned int sysctl_hung_task_panic;
> > > > >  #else
> > > > >  /* Avoid need for ifdefs elsewhere in the code */
> > > > > -enum { sysctl_hung_task_timeout_secs = 0 };
> > > > > +enum {
> > > > > +       sysctl_hung_task_timeout_secs = 0,
> > > > > +       sysctl_hung_task_panic = 0,
> > > > > +};
> > > > >  #endif
> > > > >
> > > > >  enum sched_tunable_scaling {
> > > > > diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> > > > > index c18717189f32..16602d3754b1 100644
> > > > > --- a/kernel/hung_task.c
> > > > > +++ b/kernel/hung_task.c
> > > > > @@ -78,8 +78,9 @@ static unsigned int __read_mostly sysctl_hung_task_all_cpu_backtrace;
> > > > >   * Should we panic (and reboot, if panic_timeout= is set) when a
> > > > >   * hung task is detected:
> > > > >   */
> > > > > -static unsigned int __read_mostly sysctl_hung_task_panic =
> > > > > +unsigned int __read_mostly sysctl_hung_task_panic =
> > > > >         IS_ENABLED(CONFIG_BOOTPARAM_HUNG_TASK_PANIC);
> > > > > +EXPORT_SYMBOL_GPL(sysctl_hung_task_panic);
> > > > >
> > > > >  static int
> > > > >  hung_task_panic(struct notifier_block *this, unsigned long event, void *ptr)
> > > > > --
> > > > > 2.34.1
> > > > >

Etienne Martineau Dec. 12, 2024, 5:52 p.m. UTC | #6

On Thu, Dec 12, 2024 at 10:09 AM Lance Yang <ioworker0@gmail.com> wrote:
>
> CC+ Andrew
> CC+ David
> CC+ Matthew
> CC+ Barry
> CC+ Ryan
>

> > Regular file-system drivers handles everything internally but FUSE on
> > the other hands,
> > delegate the file system operation to a user process ( FUSE server )
> > If the FUSE server is turning bad, you don't want to reload right?
>
> To me, it makes sense to reload the system if HUNG_TASK_PANIC is
> enabled. Doing so allows me to notice the issue in time and resolve it
> through the kernel dump, IHMO.
>
Going thru the kdump and extract the gcore of the FUSE server is a bit
convoluted.
Maybe we should SIGABRT the server directly then?

> >
> > A non-privileged user can  potentially exploit this flaw and trigger a
> > reload. I'm
> > surprised that this didn't get flagged before ( maybe I'm missing something ? )
> > IMO this is why I think something needs to be done for the stable
> > branch as well.
>
> AFAIK, besides this, a non-privileged user has other ways to cause some
> processes to stay in the D state for a long period of time.
>
On older releases it used to be possible to trip the timer by banging
on some USB devices
but I believe this is fixed. Do you have an example?

> >
> > >
> > > If HUNG_TASK_PANIC is set, we should do a reload when a hung task is detected;
> > > this is working as expected IHMO.
> > Say when your browser hangs on your system, do you reload? FUSE server
> > is just another
> > process.
>
> Hmm... the choice to enable HUNG_TASK_PANIC should be up to the user, while
> the decision to reload the system should be up to the hung task detector ;)
>
> Thanks a lot for including me. It seems like we're not on the same page and I'm
> also not a FUSE expert. So, let's hear the views of others.
>
> Thanks,
> Lance
>
>
thanks,
Etienne

Joanne Koong Dec. 12, 2024, 9:48 p.m. UTC | #7

On Wed, Dec 11, 2024 at 3:04 PM Etienne Martineau
<etmartin4313@gmail.com> wrote:
>
> On Wed, Dec 11, 2024 at 5:04 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Wed, Dec 11, 2024 at 11:45 AM <etmartin4313@gmail.com> wrote:
> > >
> > > From: Etienne Martineau <etmartin4313@gmail.com>
> > >
> > > This patch abort connection if HUNG_TASK_PANIC is set and a FUSE server
> > > is getting stuck for too long. A slow FUSE server may tripped the
> > > hang check timer for legitimate reasons hence consider disabling
> > > HUNG_TASK_PANIC in that scenario.
> > >
> > > Without this patch, an unresponsive / buggy / malicious FUSE server can
> > > leave the clients in D state for a long period of time and on system where
> > > HUNG_TASK_PANIC is set, trigger a catastrophic reload.
> > >
> > > So, if HUNG_TASK_PANIC checking is enabled, we should wake up periodically
> > > to abort connections that exceed the timeout value which is define to be
> > > half the HUNG_TASK_TIMEOUT period, which keeps overhead low. The timer
> > > is per connection and runs only if there are active FUSE request pending.
> >
> > Hi Etienne,
> >
> > For your use case, does the generic request timeouts logic and
> > max_request_timeout systemctl implemented in [1] and [2] not suffice?
> > IMO I don't think we should have logic specifically checking for hung
> > task timeouts in fuse, if the generic solution can be used.
> >
> > Thanks,
> > Joanne
>
> We need a way to avoid catastrophic reloads on systems where HUNG_TASK_PANIC
> is set while a buggy / malicious FUSE server stops responding.
> I would argue that this is much needed in stable branches as well...
>
> For that reason, I believe we need to keep things simple for step #1
> e.g. there is no
> need to introduce another knob as we already have HUNG_TASK_TIMEOUT which
> holds the source of truth.
>
> IMO introducing those new knobs will put an unnecessary burden on sysadmins into
> something that is error prone because unlike
>   CONFIG_DETECT_HUNG_TASK=y
>   CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=120
> which is built-in, the "default_request_timeout" /
> "max_request_timeout" needs to be
> set appropriately after every reboot and failure to do so may have
> nasty consequences.

imo, it is not important to directly defend against the hung task case
inside the fuse code itself. imo, the generic timeout should be used
instead. As I understand it, hung task panic is mostly enabled for
debug purposes and is enabled through a sysctl. imo, if the system
admin enables the hung task panic sysctl value, then it is not too
much of a burden for them to also set the fuse max request timeout
sysctl.


Thanks,
Joanne

>
> For the more generic cases then yes those timeouts make sense to me.
>
> Thanks,
> Etienne
>
> >
> > [1] https://lore.kernel.org/linux-fsdevel/20241114191332.669127-1-joannelkoong@gmail.com/
> > [2] https://lore.kernel.org/linux-fsdevel/20241114191332.669127-4-joannelkoong@gmail.com/
> >
> > >
> > > A FUSE client can get into D state as such ( see below scenario #1 / #2 )
> > >  1) request_wait_answer() -> wait_event() is UNINTERRUPTIBLE
> > >     OR
> > >  2) request_wait_answer() -> wait_event_(interruptible / killable) is head
> > >     of line blocking for subsequent clients accessing the same file
> > >
> > >         scenario #1:
> > >         2716 pts/2    D+     0:00 cat
> > >         $ cat /proc/2716/stack
> > >                 [<0>] request_wait_answer+0x22e/0x340
> > >                 [<0>] __fuse_simple_request+0xd8/0x2c0
> > >                 [<0>] fuse_perform_write+0x3ec/0x760
> > >                 [<0>] fuse_file_write_iter+0x3d5/0x3f0
> > >                 [<0>] vfs_write+0x313/0x430
> > >                 [<0>] ksys_write+0x6a/0xf0
> > >                 [<0>] __x64_sys_write+0x19/0x30
> > >                 [<0>] x64_sys_call+0x18ab/0x26f0
> > >                 [<0>] do_syscall_64+0x7c/0x170
> > >                 [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > >
> > >         scenario #2:
> > >         2962 pts/2    S+     0:00 cat
> > >         2963 pts/3    D+     0:00 cat
> > >         $ cat /proc/2962/stack
> > >                 [<0>] request_wait_answer+0x140/0x340
> > >                 [<0>] __fuse_simple_request+0xd8/0x2c0
> > >                 [<0>] fuse_perform_write+0x3ec/0x760
> > >                 [<0>] fuse_file_write_iter+0x3d5/0x3f0
> > >                 [<0>] vfs_write+0x313/0x430
> > >                 [<0>] ksys_write+0x6a/0xf0
> > >                 [<0>] __x64_sys_write+0x19/0x30
> > >                 [<0>] x64_sys_call+0x18ab/0x26f0
> > >                 [<0>] do_syscall_64+0x7c/0x170
> > >                 [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > >         $ cat /proc/2963/stack
> > >                 [<0>] fuse_file_write_iter+0x252/0x3f0
> > >                 [<0>] vfs_write+0x313/0x430
> > >                 [<0>] ksys_write+0x6a/0xf0
> > >                 [<0>] __x64_sys_write+0x19/0x30
> > >                 [<0>] x64_sys_call+0x18ab/0x26f0
> > >                 [<0>] do_syscall_64+0x7c/0x170
> > >                 [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > >
> > > Please note that this patch doesn't prevent the HUNG_TASK_WARNING.
> > >
> > > Signed-off-by: Etienne Martineau <etmartin4313@gmail.com>
> > > ---
> > >  fs/fuse/dev.c                | 100 +++++++++++++++++++++++++++++++++++
> > >  fs/fuse/fuse_i.h             |   8 +++
> > >  fs/fuse/inode.c              |   3 ++
> > >  include/linux/sched/sysctl.h |   8 ++-
> > >  kernel/hung_task.c           |   3 +-
> > >  5 files changed, 119 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> > > index 27ccae63495d..73d19de14e51 100644
> > > --- a/fs/fuse/dev.c
> > > +++ b/fs/fuse/dev.c
> > > @@ -21,6 +21,8 @@
> > >  #include <linux/swap.h>
> > >  #include <linux/splice.h>
> > >  #include <linux/sched.h>
> > > +#include <linux/completion.h>
> > > +#include <linux/sched/sysctl.h>
> > >
> > >  #define CREATE_TRACE_POINTS
> > >  #include "fuse_trace.h"
> > > @@ -45,14 +47,104 @@ static struct fuse_dev *fuse_get_dev(struct file *file)
> > >         return READ_ONCE(file->private_data);
> > >  }
> > >
> > > +static bool request_expired(struct fuse_conn *fc, struct fuse_req *req,
> > > +               int timeout)
> > > +{
> > > +       return time_after(jiffies, req->create_time + timeout);
> > > +}
> > > +
> > > +/*
> > > + * Prevent hung task timer from firing at us
> > > + * Check if any requests aren't being completed by the specified request
> > > + * timeout. To do so, we:
> > > + * - check the fiq pending list
> > > + * - check the bg queue
> > > + * - check the fpq io and processing lists
> > > + *
> > > + * To make this fast, we only check against the head request on each list since
> > > + * these are generally queued in order of creation time (eg newer requests get
> > > + * queued to the tail). We might miss a few edge cases (eg requests transitioning
> > > + * between lists, re-sent requests at the head of the pending list having a
> > > + * later creation time than other requests on that list, etc.) but that is fine
> > > + * since if the request never gets fulfilled, it will eventually be caught.
> > > + */
> > > +void fuse_check_timeout(struct work_struct *wk)
> > > +{
> > > +       unsigned long hang_check_timer = sysctl_hung_task_timeout_secs * (HZ / 2);
> > > +       struct fuse_conn *fc = container_of(wk, struct fuse_conn, work.work);
> > > +       struct fuse_iqueue *fiq = &fc->iq;
> > > +       struct fuse_req *req;
> > > +       struct fuse_dev *fud;
> > > +       struct fuse_pqueue *fpq;
> > > +       bool expired = false;
> > > +       int i;
> > > +
> > > +       spin_lock(&fiq->lock);
> > > +       req = list_first_entry_or_null(&fiq->pending, struct fuse_req, list);
> > > +       if (req)
> > > +               expired = request_expired(fc, req, hang_check_timer);
> > > +       spin_unlock(&fiq->lock);
> > > +       if (expired)
> > > +               goto abort_conn;
> > > +
> > > +       spin_lock(&fc->bg_lock);
> > > +       req = list_first_entry_or_null(&fc->bg_queue, struct fuse_req, list);
> > > +       if (req)
> > > +               expired = request_expired(fc, req, hang_check_timer);
> > > +       spin_unlock(&fc->bg_lock);
> > > +       if (expired)
> > > +               goto abort_conn;
> > > +
> > > +       spin_lock(&fc->lock);
> > > +       if (!fc->connected) {
> > > +               spin_unlock(&fc->lock);
> > > +               return;
> > > +       }
> > > +       list_for_each_entry(fud, &fc->devices, entry) {
> > > +               fpq = &fud->pq;
> > > +               spin_lock(&fpq->lock);
> > > +               req = list_first_entry_or_null(&fpq->io, struct fuse_req, list);
> > > +               if (req && request_expired(fc, req, hang_check_timer))
> > > +                       goto fpq_abort;
> > > +
> > > +               for (i = 0; i < FUSE_PQ_HASH_SIZE; i++) {
> > > +                       req = list_first_entry_or_null(&fpq->processing[i], struct fuse_req, list);
> > > +                       if (req && request_expired(fc, req, hang_check_timer))
> > > +                               goto fpq_abort;
> > > +               }
> > > +               spin_unlock(&fpq->lock);
> > > +       }
> > > +       /* Keep the ball rolling */
> > > +       if (atomic_read(&fc->num_waiting) != 0)
> > > +               queue_delayed_work(system_wq, &fc->work, hang_check_timer);
> > > +       spin_unlock(&fc->lock);
> > > +       return;
> > > +
> > > +fpq_abort:
> > > +       spin_unlock(&fpq->lock);
> > > +       spin_unlock(&fc->lock);
> > > +abort_conn:
> > > +       fuse_abort_conn(fc);
> > > +}
> > > +
> > >  static void fuse_request_init(struct fuse_mount *fm, struct fuse_req *req)
> > >  {
> > > +       struct fuse_conn *fc = fm->fc;
> > >         INIT_LIST_HEAD(&req->list);
> > >         INIT_LIST_HEAD(&req->intr_entry);
> > >         init_waitqueue_head(&req->waitq);
> > >         refcount_set(&req->count, 1);
> > >         __set_bit(FR_PENDING, &req->flags);
> > >         req->fm = fm;
> > > +       req->create_time = jiffies;
> > > +
> > > +       if (sysctl_hung_task_panic) {
> > > +               spin_lock(&fc->lock);
> > > +               /* Get the ball rolling */
> > > +               queue_delayed_work(system_wq, &fc->work,
> > > +                               sysctl_hung_task_timeout_secs * (HZ / 2));
> > > +               spin_unlock(&fc->lock);
> > > +       }
> > >  }
> > >
> > >  static struct fuse_req *fuse_request_alloc(struct fuse_mount *fm, gfp_t flags)
> > > @@ -200,6 +292,14 @@ static void fuse_put_request(struct fuse_req *req)
> > >                         fuse_drop_waiting(fc);
> > >                 }
> > >
> > > +               if (sysctl_hung_task_panic) {
> > > +                       spin_lock(&fc->lock);
> > > +                       /* Stop the timeout check if we are the last request */
> > > +                       if (atomic_read(&fc->num_waiting) == 0)
> > > +                               cancel_delayed_work_sync(&fc->work);
> > > +                       spin_unlock(&fc->lock);
> > > +               }
> > > +
> > >                 fuse_request_free(req);
> > >         }
> > >  }
> > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > index 74744c6f2860..aba3ffd0fb67 100644
> > > --- a/fs/fuse/fuse_i.h
> > > +++ b/fs/fuse/fuse_i.h
> > > @@ -438,6 +438,9 @@ struct fuse_req {
> > >
> > >         /** fuse_mount this request belongs to */
> > >         struct fuse_mount *fm;
> > > +
> > > +       /** When (in jiffies) the request was created */
> > > +       unsigned long create_time;
> > >  };
> > >
> > >  struct fuse_iqueue;
> > > @@ -923,6 +926,9 @@ struct fuse_conn {
> > >         /** IDR for backing files ids */
> > >         struct idr backing_files_map;
> > >  #endif
> > > +
> > > +       /** Request wait timeout check */
> > > +       struct delayed_work work;
> > >  };
> > >
> > >  /*
> > > @@ -1190,6 +1196,8 @@ void fuse_request_end(struct fuse_req *req);
> > >  /* Abort all requests */
> > >  void fuse_abort_conn(struct fuse_conn *fc);
> > >  void fuse_wait_aborted(struct fuse_conn *fc);
> > > +/* Connection timeout */
> > > +void fuse_check_timeout(struct work_struct *wk);
> > >
> > >  /**
> > >   * Invalidate inode attributes
> > > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > > index 3ce4f4e81d09..ed96154df0fd 100644
> > > --- a/fs/fuse/inode.c
> > > +++ b/fs/fuse/inode.c
> > > @@ -23,6 +23,7 @@
> > >  #include <linux/exportfs.h>
> > >  #include <linux/posix_acl.h>
> > >  #include <linux/pid_namespace.h>
> > > +#include <linux/completion.h>
> > >  #include <uapi/linux/magic.h>
> > >
> > >  MODULE_AUTHOR("Miklos Szeredi <miklos@szeredi.hu>");
> > > @@ -964,6 +965,7 @@ void fuse_conn_init(struct fuse_conn *fc, struct fuse_mount *fm,
> > >         INIT_LIST_HEAD(&fc->entry);
> > >         INIT_LIST_HEAD(&fc->devices);
> > >         atomic_set(&fc->num_waiting, 0);
> > > +       INIT_DELAYED_WORK(&fc->work, fuse_check_timeout);
> > >         fc->max_background = FUSE_DEFAULT_MAX_BACKGROUND;
> > >         fc->congestion_threshold = FUSE_DEFAULT_CONGESTION_THRESHOLD;
> > >         atomic64_set(&fc->khctr, 0);
> > > @@ -1015,6 +1017,7 @@ void fuse_conn_put(struct fuse_conn *fc)
> > >                 if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
> > >                         fuse_backing_files_free(fc);
> > >                 call_rcu(&fc->rcu, delayed_release);
> > > +               cancel_delayed_work_sync(&fc->work);
> > >         }
> > >  }
> > >  EXPORT_SYMBOL_GPL(fuse_conn_put);
> > > diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> > > index 5a64582b086b..1ed3a511060d 100644
> > > --- a/include/linux/sched/sysctl.h
> > > +++ b/include/linux/sched/sysctl.h
> > > @@ -5,11 +5,15 @@
> > >  #include <linux/types.h>
> > >
> > >  #ifdef CONFIG_DETECT_HUNG_TASK
> > > -/* used for hung_task and block/ */
> > > +/* used for hung_task, block/ and fuse */
> > >  extern unsigned long sysctl_hung_task_timeout_secs;
> > > +extern unsigned int sysctl_hung_task_panic;
> > >  #else
> > >  /* Avoid need for ifdefs elsewhere in the code */
> > > -enum { sysctl_hung_task_timeout_secs = 0 };
> > > +enum {
> > > +       sysctl_hung_task_timeout_secs = 0,
> > > +       sysctl_hung_task_panic = 0,
> > > +};
> > >  #endif
> > >
> > >  enum sched_tunable_scaling {
> > > diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> > > index c18717189f32..16602d3754b1 100644
> > > --- a/kernel/hung_task.c
> > > +++ b/kernel/hung_task.c
> > > @@ -78,8 +78,9 @@ static unsigned int __read_mostly sysctl_hung_task_all_cpu_backtrace;
> > >   * Should we panic (and reboot, if panic_timeout= is set) when a
> > >   * hung task is detected:
> > >   */
> > > -static unsigned int __read_mostly sysctl_hung_task_panic =
> > > +unsigned int __read_mostly sysctl_hung_task_panic =
> > >         IS_ENABLED(CONFIG_BOOTPARAM_HUNG_TASK_PANIC);
> > > +EXPORT_SYMBOL_GPL(sysctl_hung_task_panic);
> > >
> > >  static int
> > >  hung_task_panic(struct notifier_block *this, unsigned long event, void *ptr)
> > > --
> > > 2.34.1
> > >

Etienne Martineau Dec. 12, 2024, 10:45 p.m. UTC | #8

On Thu, Dec 12, 2024 at 4:48 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Wed, Dec 11, 2024 at 3:04 PM Etienne Martineau
> <etmartin4313@gmail.com> wrote:
> >
> > On Wed, Dec 11, 2024 at 5:04 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > On Wed, Dec 11, 2024 at 11:45 AM <etmartin4313@gmail.com> wrote:
> > > >
> > > > From: Etienne Martineau <etmartin4313@gmail.com>
> > > >
> > > > This patch abort connection if HUNG_TASK_PANIC is set and a FUSE server
> > > > is getting stuck for too long. A slow FUSE server may tripped the
> > > > hang check timer for legitimate reasons hence consider disabling
> > > > HUNG_TASK_PANIC in that scenario.
> > > >
> > > > Without this patch, an unresponsive / buggy / malicious FUSE server can
> > > > leave the clients in D state for a long period of time and on system where
> > > > HUNG_TASK_PANIC is set, trigger a catastrophic reload.
> > > >
> > > > So, if HUNG_TASK_PANIC checking is enabled, we should wake up periodically
> > > > to abort connections that exceed the timeout value which is define to be
> > > > half the HUNG_TASK_TIMEOUT period, which keeps overhead low. The timer
> > > > is per connection and runs only if there are active FUSE request pending.
> > >
> > > Hi Etienne,
> > >
> > > For your use case, does the generic request timeouts logic and
> > > max_request_timeout systemctl implemented in [1] and [2] not suffice?
> > > IMO I don't think we should have logic specifically checking for hung
> > > task timeouts in fuse, if the generic solution can be used.
> > >
> > > Thanks,
> > > Joanne
> >
> > We need a way to avoid catastrophic reloads on systems where HUNG_TASK_PANIC
> > is set while a buggy / malicious FUSE server stops responding.
> > I would argue that this is much needed in stable branches as well...
> >
> > For that reason, I believe we need to keep things simple for step #1
> > e.g. there is no
> > need to introduce another knob as we already have HUNG_TASK_TIMEOUT which
> > holds the source of truth.
> >
> > IMO introducing those new knobs will put an unnecessary burden on sysadmins into
> > something that is error prone because unlike
> >   CONFIG_DETECT_HUNG_TASK=y
> >   CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=120
> > which is built-in, the "default_request_timeout" /
> > "max_request_timeout" needs to be
> > set appropriately after every reboot and failure to do so may have
> > nasty consequences.
>
> imo, it is not important to directly defend against the hung task case
> inside the fuse code itself. imo, the generic timeout should be used
> instead. As I understand it, hung task panic is mostly enabled for
> debug purposes and is enabled through a sysctl. imo, if the system
> admin enables the hung task panic sysctl value, then it is not too
> much of a burden for them to also set the fuse max request timeout
> sysctl.
>
>
> Thanks,
> Joanne
>
Yes, based on the comments received so far I agree that generic timeout is the
prefered approach. Looks like we are amongst the few that run production
systems with hung task panic set. So yeah, I will match fuse max request
timeout with hung task timeout to get the equivalent behavior.

On a slightly different matter, I realized that in some scenarios
there is no benefit
in stopping the timer when reaching the last request because another
request can come
right after and then we have to start the timer once again which keeps bouncing
between cancel_delayed_work_sync() and queue_delayed_work().

So I think it's best to stick with your approach of starting the timer
when the connection
is initially established. I can re-work this patch if needed?

I've been doing some testing and so far I hit timeout in bg_queue and
fpq->processing
but I cannot trigger timeouts in fiq->pending somehow?

thanks
Etienne

Joanne Koong Dec. 12, 2024, 11:30 p.m. UTC | #9

On Thu, Dec 12, 2024 at 2:46 PM Etienne Martineau
<etmartin4313@gmail.com> wrote:
>
> On Thu, Dec 12, 2024 at 4:48 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Wed, Dec 11, 2024 at 3:04 PM Etienne Martineau
> > <etmartin4313@gmail.com> wrote:
> > >
> > > On Wed, Dec 11, 2024 at 5:04 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > >
> > > > On Wed, Dec 11, 2024 at 11:45 AM <etmartin4313@gmail.com> wrote:
> > > > >
> > > > > From: Etienne Martineau <etmartin4313@gmail.com>
> > > > >
> > > > > This patch abort connection if HUNG_TASK_PANIC is set and a FUSE server
> > > > > is getting stuck for too long. A slow FUSE server may tripped the
> > > > > hang check timer for legitimate reasons hence consider disabling
> > > > > HUNG_TASK_PANIC in that scenario.
> > > > >
> > > > > Without this patch, an unresponsive / buggy / malicious FUSE server can
> > > > > leave the clients in D state for a long period of time and on system where
> > > > > HUNG_TASK_PANIC is set, trigger a catastrophic reload.
> > > > >
> > > > > So, if HUNG_TASK_PANIC checking is enabled, we should wake up periodically
> > > > > to abort connections that exceed the timeout value which is define to be
> > > > > half the HUNG_TASK_TIMEOUT period, which keeps overhead low. The timer
> > > > > is per connection and runs only if there are active FUSE request pending.
> > > >
> > > > Hi Etienne,
> > > >
> > > > For your use case, does the generic request timeouts logic and
> > > > max_request_timeout systemctl implemented in [1] and [2] not suffice?
> > > > IMO I don't think we should have logic specifically checking for hung
> > > > task timeouts in fuse, if the generic solution can be used.
> > > >
> > > > Thanks,
> > > > Joanne
> > >
> > > We need a way to avoid catastrophic reloads on systems where HUNG_TASK_PANIC
> > > is set while a buggy / malicious FUSE server stops responding.
> > > I would argue that this is much needed in stable branches as well...
> > >
> > > For that reason, I believe we need to keep things simple for step #1
> > > e.g. there is no
> > > need to introduce another knob as we already have HUNG_TASK_TIMEOUT which
> > > holds the source of truth.
> > >
> > > IMO introducing those new knobs will put an unnecessary burden on sysadmins into
> > > something that is error prone because unlike
> > >   CONFIG_DETECT_HUNG_TASK=y
> > >   CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=120
> > > which is built-in, the "default_request_timeout" /
> > > "max_request_timeout" needs to be
> > > set appropriately after every reboot and failure to do so may have
> > > nasty consequences.
> >
> > imo, it is not important to directly defend against the hung task case
> > inside the fuse code itself. imo, the generic timeout should be used
> > instead. As I understand it, hung task panic is mostly enabled for
> > debug purposes and is enabled through a sysctl. imo, if the system
> > admin enables the hung task panic sysctl value, then it is not too
> > much of a burden for them to also set the fuse max request timeout
> > sysctl.
> >
> >
> > Thanks,
> > Joanne
> >
> Yes, based on the comments received so far I agree that generic timeout is the
> prefered approach. Looks like we are amongst the few that run production
> systems with hung task panic set. So yeah, I will match fuse max request
> timeout with hung task timeout to get the equivalent behavior.

Sounds great. Just FYI, the timeouts in fuse won't be 100% precise -
they'll have an upper margin of error associated with it (this is
included in the documentation for the sysctl, since it's very
non-obvious). For example, if the max request timeout is set to 600
seconds, it may fire off a little after 600 seconds. So it'd be best
if you set the fuse max request timeout to be below the hung task
timeout to be sure. IIRC, Sergey is doing the same thing [1].


[1] https://lore.kernel.org/linux-fsdevel/20241128115455.GG10431@google.com/

>
> On a slightly different matter, I realized that in some scenarios
> there is no benefit
> in stopping the timer when reaching the last request because another
> request can come
> right after and then we have to start the timer once again which keeps bouncing
> between cancel_delayed_work_sync() and queue_delayed_work().
>
> So I think it's best to stick with your approach of starting the timer
> when the connection
> is initially established. I can re-work this patch if needed?

Thanks for testing out the timeout functionality. I'm planning to
submit v10 of the generic timeout patch to use workqueues early next
week. The time granularity will also be changed to work in seconds
instead of minutes, as noted for Sergey's and your use case. I'll make
sure you get cc'ed on that patchset.

>
> I've been doing some testing and so far I hit timeout in bg_queue and
> fpq->processing
> but I cannot trigger timeouts in fiq->pending somehow?

You can trigger the fiq->pending timeout by having your fuse server
never read from /dev/fuse, which will keep the request on the
fiq->pending list when the timeout hits. The request is only taken off
the fiq->pending list when fuse reads a request into the server's
buffer (see fuse_dev_do_read()).


Thanks,
Joanne
>
> thanks
> Etienne

Etienne Martineau Dec. 16, 2024, 2:07 a.m. UTC | #10

On Thu, Dec 12, 2024 at 6:30 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> > Yes, based on the comments received so far I agree that generic timeout is the
> > prefered approach. Looks like we are amongst the few that run production
> > systems with hung task panic set. So yeah, I will match fuse max request
> > timeout with hung task timeout to get the equivalent behavior.
>
> Sounds great. Just FYI, the timeouts in fuse won't be 100% precise -
> they'll have an upper margin of error associated with it (this is
> included in the documentation for the sysctl, since it's very
> non-obvious). For example, if the max request timeout is set to 600
> seconds, it may fire off a little after 600 seconds. So it'd be best
> if you set the fuse max request timeout to be below the hung task
> timeout to be sure. IIRC, Sergey is doing the same thing [1].
>
Understood yes.
>
> [1] https://lore.kernel.org/linux-fsdevel/20241128115455.GG10431@google.com/
>
> >
> > On a slightly different matter, I realized that in some scenarios
> > there is no benefit
> > in stopping the timer when reaching the last request because another
> > request can come
> > right after and then we have to start the timer once again which keeps bouncing
> > between cancel_delayed_work_sync() and queue_delayed_work().
> >
> > So I think it's best to stick with your approach of starting the timer
> > when the connection
> > is initially established. I can re-work this patch if needed?
>
> Thanks for testing out the timeout functionality. I'm planning to
> submit v10 of the generic timeout patch to use workqueues early next
> week. The time granularity will also be changed to work in seconds
> instead of minutes, as noted for Sergey's and your use case. I'll make
> sure you get cc'ed on that patchset.
Thank you very much. I'll take a look
thanks
Etienne

[v2] fuse: Abort connection if FUSE server get stuck

Commit Message

Comments

Patch