
[v7,2/3] fuse: add optional kernel-enforced timeout for requests

Message ID 20241007184258.2837492-3-joannelkoong@gmail.com (mailing list archive)
State New
Series fuse: add kernel-enforced request timeout option

Commit Message

Joanne Koong Oct. 7, 2024, 6:42 p.m. UTC
There are situations where fuse servers can become unresponsive or
stuck, for example if the server is deadlocked. Currently, there's no
good way to detect if a server is stuck and needs to be killed manually.

This commit adds an option for enforcing a timeout (in minutes) on
requests: if the timeout elapses without the server responding to
the request, the connection will be automatically aborted.

Please note that these timeouts are not 100% precise. The request may
take an extra FUSE_TIMEOUT_TIMER_FREQ seconds beyond the set max timeout
due to how it's internally implemented.
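The imprecision comes from the check being a coarse periodic timer rather than a per-request timer: only the oldest outstanding request is inspected, once per period. A minimal userspace sketch of that scheme, with hypothetical names and a plain integer clock in seconds standing in for jiffies:

```c
#include <assert.h>
#include <stdbool.h>

#define TIMER_FREQ 60 /* seconds between periodic checks */

struct request {
	unsigned long create_time; /* when the request was queued */
};

/*
 * The periodic check only looks at the oldest outstanding request;
 * if there is none, the timer is simply not rearmed.
 */
static bool check_expired(const struct request *oldest, unsigned long now,
			  unsigned long req_timeout)
{
	return oldest && now > oldest->create_time + req_timeout;
}

/*
 * Worst case: a request expires just after a check ran, so it is only
 * noticed one full timer period later.
 */
static unsigned long worst_case_latency(unsigned long req_timeout)
{
	return req_timeout + TIMER_FREQ;
}
```

The trade-off is one timer and one list peek per period instead of one armed timer per request.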

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
 fs/fuse/dev.c    | 63 +++++++++++++++++++++++++++++++++++++++++++++++-
 fs/fuse/fuse_i.h | 45 ++++++++++++++++++++++++++++++++++
 fs/fuse/inode.c  | 22 +++++++++++++++++
 3 files changed, 129 insertions(+), 1 deletion(-)

Comments

Miklos Szeredi Oct. 9, 2024, 8:14 a.m. UTC | #1
On Mon, 7 Oct 2024 at 20:43, Joanne Koong <joannelkoong@gmail.com> wrote:
>
> There are situations where fuse servers can become unresponsive or
> stuck, for example if the server is deadlocked. Currently, there's no
> good way to detect if a server is stuck and needs to be killed manually.
>
> This commit adds an option for enforcing a timeout (in minutes) on
> requests: if the timeout elapses without the server responding to
> the request, the connection will be automatically aborted.
>
> Please note that these timeouts are not 100% precise. The request may
> take an extra FUSE_TIMEOUT_TIMER_FREQ seconds beyond the set max timeout
> due to how it's internally implemented.

One thing I worry about is adding more roadblocks on the way to making
request queuing more scalable.

Currently there's fc->num_waiting that's touched on all requests and
bg_queue/bg_lock that are touched on background requests.  We should
be trying to fix these bottlenecks instead of adding more.

Can't we use the existing lists to scan requests?

It's more complex, obviously, but at least it doesn't introduce yet
another per-fc list to worry about.

Thanks,
Miklos
Joanne Koong Oct. 10, 2024, 12:44 a.m. UTC | #2
On Wed, Oct 9, 2024 at 1:14 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Mon, 7 Oct 2024 at 20:43, Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > There are situations where fuse servers can become unresponsive or
> > stuck, for example if the server is deadlocked. Currently, there's no
> > good way to detect if a server is stuck and needs to be killed manually.
> >
> > This commit adds an option for enforcing a timeout (in minutes) on
> > requests: if the timeout elapses without the server responding to
> > the request, the connection will be automatically aborted.
> >
> > Please note that these timeouts are not 100% precise. The request may
> > take an extra FUSE_TIMEOUT_TIMER_FREQ seconds beyond the set max timeout
> > due to how it's internally implemented.
>
> One thing I worry about is adding more roadblocks on the way to making
> request queuing more scalable.
>
> Currently there's fc->num_waiting that's touched on all requests and
> bg_queue/bg_lock that are touched on background requests.  We should
> be trying to fix these bottlenecks instead of adding more.
>
> Can't we use the existing lists to scan requests?

Hi Miklos,

The existing lists we have are:
* fiq pending list (per connection)
* fpq io list and fpq processing (for its associated hash) list (per fuse dev)
* bg queue (per connection)

If we wanted to reuse existing lists, we could do this in the timeout handler:
* grab the fiq lock, check the head entry of the pending list, release the lock
* grab the bg lock, check the head entry of the bg_queue, release the lock
* for each connection's fuse dev, grab the fpq lock, check the head
entry of the fpq->io list, iterate through the fpq->processing lists
for all 256 hashes and check each head entry, then release the lock

but some requests could slip through for the following cases:
-- resend:
* Request is on the resend's to_queue list when the timeout handler
check runs, in which case if that request is expired we won't get to
that until the next time the timeout handler kicks in
* A re-sent request may be moved to the head of the fiq->pending list,
but have a creation time newer than other entries on the fiq->pending
list, in which case we would not time out and abort the connection
when we should be doing so
-- transitioning between lists
* A request that is between lists (eg fpq->io and fpq->processing)
could be missed when the timeout handler check runs, but will probably
be caught the next time the timeout handler kicks in. (We could also
modify the logic in dev_do_read to use list_move to avoid this case.)
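The per-list scan could look roughly like the following. This is an illustrative userspace model, not the real fiq/fpq/bg_queue structures; flat arrays stand in for the lists, and the queue layout is a hypothetical simplification:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Sketch of the "reuse the existing lists" idea: the timeout handler
 * peeks at the head (oldest) entry of each per-queue list instead of
 * maintaining one dedicated timeout list. In the kernel, each peek
 * would happen under that queue's own lock.
 */
struct queue {
	const unsigned long *create_times; /* oldest entry at index 0 */
	size_t len;
};

/*
 * Returns true if any queue's head entry has exceeded the timeout, in
 * which case the connection would be aborted. Note this assumes the
 * head really is the oldest entry, which a re-sent request moved to
 * the head of fiq->pending can violate (the edge case above).
 */
static bool any_queue_expired(const struct queue *queues, size_t nqueues,
			      unsigned long now, unsigned long timeout)
{
	size_t i;

	for (i = 0; i < nqueues; i++) {
		if (queues[i].len &&
		    now > queues[i].create_times[0] + timeout)
			return true;
	}
	return false;
}
```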

I think it's fine for these edge cases to slip through since most of
them will be caught eventually by the subsequent timeout handler runs,
but I was more worried about the increased lock contention while
iterating through all hashes of the fpq->processing list. But I think
for that we could just run the timeout handler less frequently (eg
once every 5 minutes instead of once every minute).

Do you think something like this sounds more reasonable?

Alternatively, I also still like the idea of something looser with
just periodically (according to whatever specified timeout) checking
if any requests are being serviced at all when fc->num_waiting is
non-zero. However, this would only protect against fully deadlocked
servers and miss malicious ones or half-deadlocked ones (eg
multithreaded fuse servers where only some threads are deadlocked).
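That looser scheme could be sketched as a progress watchdog. The names here are hypothetical stand-ins (in FUSE this would key off fc->num_waiting and some completion count), and as noted it misses servers that keep completing some requests while others are stuck:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * A periodic watchdog that only asks: "has the server completed
 * anything at all since the last check, while requests were waiting?"
 */
struct watchdog {
	unsigned long last_completed; /* completion count at previous check */
};

/* Returns true if requests are waiting but no progress was made. */
static bool looks_stuck(struct watchdog *w, unsigned long num_waiting,
			unsigned long completed)
{
	bool stuck = num_waiting > 0 && completed == w->last_completed;

	w->last_completed = completed;
	return stuck;
}
```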


Thanks,
Joanne
>
> It's more complex, obviously, but at least it doesn't introduce yet
> another per-fc list to worry about.
>
> Thanks,
> Miklos
Miklos Szeredi Oct. 10, 2024, 8:21 a.m. UTC | #3
On Thu, 10 Oct 2024 at 02:45, Joanne Koong <joannelkoong@gmail.com> wrote:
> I think it's fine for these edge cases to slip through since most of
> them will be caught eventually by the subsequent timeout handler runs,
> but I was more worried about the increased lock contention while
> iterating through all hashes of the fpq->processing list. But I think
> for that we could just run the timeout handler less frequently
> (eg once every 5 minutes instead of once every minute)

Yeah, edge cases shouldn't matter.  I think even 1/s frequency
wouldn't be too bad, this is just a quick scan of a (hopefully) not
too long list.

BTW, the cumulative lock contention would be exactly the same with the
separate timeout list, I wouldn't worry too much about it.

> Alternatively, I also still like the idea of something looser with
> just periodically (according to whatever specified timeout) checking
> if any requests are being serviced at all when fc->num_waiting is
> non-zero. However, this would only protect against fully deadlocked
> servers and miss malicious ones or half-deadlocked ones (eg
> multithreaded fuse servers where only some threads are deadlocked).

I don't have a preference.  Whichever is simpler to implement.

Thanks,
Miklos
Joanne Koong Oct. 10, 2024, 11:08 p.m. UTC | #4
On Thu, Oct 10, 2024 at 1:21 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Thu, 10 Oct 2024 at 02:45, Joanne Koong <joannelkoong@gmail.com> wrote:
> > I think it's fine for these edge cases to slip through since most of
> > them will be caught eventually by the subsequent timeout handler runs,
> > but I was more worried about the increased lock contention while
> > iterating through all hashes of the fpq->processing list. But I think
> > for that we could just run the timeout handler less frequently
> > (eg once every 5 minutes instead of once every minute)
>
> Yeah, edge cases shouldn't matter.  I think even 1/s frequency
> wouldn't be too bad, this is just a quick scan of a (hopefully) not
> too long list.
>
> BTW, the cumulative lock contention would be exactly the same with the
> separate timeout list, I wouldn't worry too much about it.
>
> > Alternatively, I also still like the idea of something looser with
> > just periodically (according to whatever specified timeout) checking
> > if any requests are being serviced at all when fc->num_waiting is
> > non-zero. However, this would only protect against fully deadlocked
> > servers and miss malicious ones or half-deadlocked ones (eg
> > multithreaded fuse servers where only some threads are deadlocked).
>
> I don't have a preference.  Whichever is simpler to implement.
>

Sounds great, thanks for the feedback. I'll submit v8 with these new
changes of reusing the existing lists.


Thanks,
Joanne

> Thanks,
> Miklos

Patch

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 1f64ae6d7a69..429c91c59e3a 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -45,14 +45,62 @@  static struct fuse_dev *fuse_get_dev(struct file *file)
 	return READ_ONCE(file->private_data);
 }
 
+void fuse_check_timeout(struct timer_list *timer)
+{
+	struct fuse_conn *fc = container_of(timer, struct fuse_conn, timeout.timer);
+	struct fuse_req *req;
+	bool expired = false;
+
+	spin_lock(&fc->timeout.lock);
+	req = list_first_entry_or_null(&fc->timeout.list, struct fuse_req,
+				       timer_entry);
+	if (req)
+		expired = time_after(jiffies, req->create_time + fc->timeout.req_timeout);
+	spin_unlock(&fc->timeout.lock);
+
+	/*
+	 * Don't rearm the timer if the list was empty in case the filesystem
+	 * is inactive. When the next request is sent, it'll rearm the timer.
+	 */
+	if (!req)
+		return;
+
+	if (expired)
+		fuse_abort_conn(fc);
+	else
+		mod_timer(&fc->timeout.timer, jiffies + FUSE_TIMEOUT_TIMER_FREQ);
+}
+
+static void add_req_timeout_entry(struct fuse_conn *fc, struct fuse_req *req)
+{
+	spin_lock(&fc->timeout.lock);
+	if (!timer_pending(&fc->timeout.timer))
+		mod_timer(&fc->timeout.timer, jiffies + FUSE_TIMEOUT_TIMER_FREQ);
+	list_add_tail(&req->timer_entry, &fc->timeout.list);
+	spin_unlock(&fc->timeout.lock);
+}
+
+static void remove_req_timeout_entry(struct fuse_conn *fc, struct fuse_req *req)
+{
+	spin_lock(&fc->timeout.lock);
+	list_del(&req->timer_entry);
+	spin_unlock(&fc->timeout.lock);
+}
+
 static void fuse_request_init(struct fuse_mount *fm, struct fuse_req *req)
 {
+	struct fuse_conn *fc = fm->fc;
+
 	INIT_LIST_HEAD(&req->list);
 	INIT_LIST_HEAD(&req->intr_entry);
 	init_waitqueue_head(&req->waitq);
 	refcount_set(&req->count, 1);
 	__set_bit(FR_PENDING, &req->flags);
 	req->fm = fm;
+	if (fc->timeout.req_timeout) {
+		INIT_LIST_HEAD(&req->timer_entry);
+		req->create_time = jiffies;
+	}
 }
 
 static struct fuse_req *fuse_request_alloc(struct fuse_mount *fm, gfp_t flags)
@@ -359,6 +407,9 @@  void fuse_request_end(struct fuse_req *req)
 	if (test_and_set_bit(FR_FINISHED, &req->flags))
 		goto put_request;
 
+	if (fc->timeout.req_timeout)
+		remove_req_timeout_entry(fc, req);
+
 	trace_fuse_request_end(req);
 	/*
 	 * test_and_set_bit() implies smp_mb() between bit
@@ -450,6 +501,8 @@  static void request_wait_answer(struct fuse_req *req)
 		if (test_bit(FR_PENDING, &req->flags)) {
 			list_del(&req->list);
 			spin_unlock(&fiq->lock);
+			if (fc->timeout.req_timeout)
+				remove_req_timeout_entry(fc, req);
 			__fuse_put_request(req);
 			req->out.h.error = -EINTR;
 			return;
@@ -466,13 +519,16 @@  static void request_wait_answer(struct fuse_req *req)
 
 static void __fuse_request_send(struct fuse_req *req)
 {
-	struct fuse_iqueue *fiq = &req->fm->fc->iq;
+	struct fuse_conn *fc = req->fm->fc;
+	struct fuse_iqueue *fiq = &fc->iq;
 
 	BUG_ON(test_bit(FR_BACKGROUND, &req->flags));
 
 	/* acquire extra reference, since request is still needed after
 	   fuse_request_end() */
 	__fuse_get_request(req);
+	if (fc->timeout.req_timeout)
+		add_req_timeout_entry(fc, req);
 	fuse_send_one(fiq, req);
 
 	request_wait_answer(req);
@@ -598,6 +654,8 @@  static bool fuse_request_queue_background(struct fuse_req *req)
 		if (fc->num_background == fc->max_background)
 			fc->blocked = 1;
 		list_add_tail(&req->list, &fc->bg_queue);
+		if (fc->timeout.req_timeout)
+			add_req_timeout_entry(fc, req);
 		flush_bg_queue(fc);
 		queued = true;
 	}
@@ -2296,6 +2354,9 @@  void fuse_abort_conn(struct fuse_conn *fc)
 		spin_unlock(&fc->lock);
 
 		end_requests(&to_end);
+
+		if (fc->timeout.req_timeout)
+			timer_delete(&fc->timeout.timer);
 	} else {
 		spin_unlock(&fc->lock);
 	}
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 7ff00bae4a84..4c3998c28519 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -435,6 +435,16 @@  struct fuse_req {
 
 	/** fuse_mount this request belongs to */
 	struct fuse_mount *fm;
+
+	/*
+	 * These fields are only used if the connection enforces request
+	 * timeouts.
+	 *
+	 * timer_entry is the entry on the fuse connection's timeout list.
+	 * create_time (in jiffies) tracks when the request was created.
+	 */
+	struct list_head timer_entry;
+	unsigned long create_time;
 };
 
 struct fuse_iqueue;
@@ -525,6 +535,33 @@  struct fuse_pqueue {
 	struct list_head io;
 };
 
+/* Frequency (in jiffies) of request timeout checks, if opted into */
+#define FUSE_TIMEOUT_TIMER_FREQ (60 * HZ)
+
+/*
+ * If the connection enforces request timeouts, then all requests get
+ * added to the list when created and removed from the list when fulfilled by
+ * the server.
+ *
+ * The timer fires every FUSE_TIMEOUT_TIMER_FREQ jiffies (60 seconds). It
+ * will check whether the create_time of the request at the head of the
+ * list exceeds the timeout. If so, then the connection will be aborted.
+ *
+ * In the worst case, this means a request may exceed its specified
+ * timeout by up to FUSE_TIMEOUT_TIMER_FREQ before the abort triggers.
+ */
+struct fuse_timeout {
+	struct timer_list timer;
+
+	/* Request timeout (in jiffies). 0 = no timeout */
+	unsigned long req_timeout;
+
+	spinlock_t lock;
+
+	/* List of all requests that haven't been completed yet */
+	struct list_head list;
+};
+
 /**
  * Fuse device instance
  */
@@ -571,6 +608,8 @@  struct fuse_fs_context {
 	enum fuse_dax_mode dax_mode;
 	unsigned int max_read;
 	unsigned int blksize;
+	/*  Request timeout (in minutes). 0 = no timeout (infinite wait) */
+	unsigned int req_timeout;
 	const char *subtype;
 
 	/* DAX device, may be NULL */
@@ -914,6 +953,9 @@  struct fuse_conn {
 	/** IDR for backing files ids */
 	struct idr backing_files_map;
 #endif
+
+	/** Only used if the connection enforces request timeouts */
+	struct fuse_timeout timeout;
 };
 
 /*
@@ -1175,6 +1217,9 @@  void fuse_request_end(struct fuse_req *req);
 void fuse_abort_conn(struct fuse_conn *fc);
 void fuse_wait_aborted(struct fuse_conn *fc);
 
+/* Check if any requests timed out */
+void fuse_check_timeout(struct timer_list *timer);
+
 /**
  * Invalidate inode attributes
  */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index f1779ff3f8d1..e5c7a214a222 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -735,6 +735,7 @@  enum {
 	OPT_ALLOW_OTHER,
 	OPT_MAX_READ,
 	OPT_BLKSIZE,
+	OPT_REQUEST_TIMEOUT,
 	OPT_ERR
 };
 
@@ -749,6 +750,7 @@  static const struct fs_parameter_spec fuse_fs_parameters[] = {
 	fsparam_u32	("max_read",		OPT_MAX_READ),
 	fsparam_u32	("blksize",		OPT_BLKSIZE),
 	fsparam_string	("subtype",		OPT_SUBTYPE),
+	fsparam_u16	("request_timeout",	OPT_REQUEST_TIMEOUT),
 	{}
 };
 
@@ -844,6 +846,10 @@  static int fuse_parse_param(struct fs_context *fsc, struct fs_parameter *param)
 		ctx->blksize = result.uint_32;
 		break;
 
+	case OPT_REQUEST_TIMEOUT:
+		ctx->req_timeout = result.uint_16;
+		break;
+
 	default:
 		return -EINVAL;
 	}
@@ -973,6 +979,8 @@  void fuse_conn_put(struct fuse_conn *fc)
 
 		if (IS_ENABLED(CONFIG_FUSE_DAX))
 			fuse_dax_conn_free(fc);
+		if (fc->timeout.req_timeout)
+			timer_delete_sync(&fc->timeout.timer);
 		if (fiq->ops->release)
 			fiq->ops->release(fiq);
 		put_pid_ns(fc->pid_ns);
@@ -1691,6 +1699,19 @@  int fuse_init_fs_context_submount(struct fs_context *fsc)
 }
 EXPORT_SYMBOL_GPL(fuse_init_fs_context_submount);
 
+static void fuse_init_fc_timeout(struct fuse_conn *fc, struct fuse_fs_context *ctx)
+{
+	if (ctx->req_timeout) {
+		if (check_mul_overflow(ctx->req_timeout * 60, HZ, &fc->timeout.req_timeout))
+			fc->timeout.req_timeout = U32_MAX;
+		INIT_LIST_HEAD(&fc->timeout.list);
+		spin_lock_init(&fc->timeout.lock);
+		timer_setup(&fc->timeout.timer, fuse_check_timeout, 0);
+	} else {
+		fc->timeout.req_timeout = 0;
+	}
+}
+
 int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
 {
 	struct fuse_dev *fud = NULL;
@@ -1753,6 +1774,7 @@  int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
 	fc->destroy = ctx->destroy;
 	fc->no_control = ctx->no_control;
 	fc->no_force_umount = ctx->no_force_umount;
+	fuse_init_fc_timeout(fc, ctx);
 
 	err = -ENOMEM;
 	root = fuse_get_root_inode(sb, ctx->rootmode);