diff mbox series

[f2fs-dev,v2,1/2] workqueue: new struct io_work

Message ID 20240701075138.1144575-2-yi.sun@unisoc.com (mailing list archive)
State New
Headers show
Series new struct io_work and use it in f2fs fsverity work | expand

Commit Message

Yi Sun July 1, 2024, 7:51 a.m. UTC
Many works will go to submit_bio(), and in many cases the io priority
of kworker cannot meet the real-time requirements of this work.

So create a new struct io_work, which contains the io priority that
the kworker thread can adjust its own io priority according to.

And,
new function set_io_work_ioprio() to set the io priority of io work,
new function may_adjust_io_work_task_ioprio() to adjust kworker's io
priority,
new function restore_io_work_task_ioprio() to restore kworker's io
priority.

Signed-off-by: Yi Sun <yi.sun@unisoc.com>
---
 include/linux/workqueue.h | 47 ++++++++++++++++++++++++++++++++++++
 kernel/workqueue.c        | 50 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 97 insertions(+)

Comments

Tejun Heo July 1, 2024, 5:32 p.m. UTC | #1
Hello,

On Mon, Jul 01, 2024 at 03:51:37PM +0800, Yi Sun wrote:
> +/*
> + * If a work may do disk IO, it is recommended to use struct io_work
> + * instead of struct work_struct.
> + */
> +struct io_work {
> +	struct work_struct work;
> +
> +	/* If the work does submit_bio, io priority may be needed. */
> +	unsigned short ioprio;
> +	/* Record kworker's original io priority. */
> +	unsigned short ori_ioprio;
> +	/* Whether the work has set io priority? */
> +	long ioprio_flag;
> +};

There are fundamental limitations to this approach in terms of
prioritization. If you tag each work items with priority and then send them
to the same workqueue, there's nothing preventing a low priority issuer from
flooding the workqueue and causing a priority inversion. ie. To solve this
properly, you need per-issuer-class workqueue so that the concurrency limit
is not shared across different priorities.

Now, this limited implementation, while incomplete and easy to defeat, may
still be useful. After all, ioprio itself, I think, is flawed in the same
way. If f2fs wants to implement this internally, that's okay, I suppose, but
as a generic mechanism, I don't think this makes a lot of sense.

Thanks.
Kent Overstreet July 2, 2024, 3:53 a.m. UTC | #2
On Mon, Jul 01, 2024 at 07:32:23AM GMT, Tejun Heo wrote:
> Hello,
> 
> On Mon, Jul 01, 2024 at 03:51:37PM +0800, Yi Sun wrote:
> > +/*
> > + * If a work may do disk IO, it is recommended to use struct io_work
> > + * instead of struct work_struct.
> > + */
> > +struct io_work {
> > +	struct work_struct work;
> > +
> > +	/* If the work does submit_bio, io priority may be needed. */
> > +	unsigned short ioprio;
> > +	/* Record kworker's original io priority. */
> > +	unsigned short ori_ioprio;
> > +	/* Whether the work has set io priority? */
> > +	long ioprio_flag;
> > +};
> 
> There are fundamental limitations to this approach in terms of
> prioritization. If you tag each work items with priority and then send them
> to the same workqueue, there's nothing preventing a low priority issuer from
> flooding the workqueue and causing a priority inversion. ie. To solve this
> properly, you need per-issuer-class workqueue so that the concurrency limit
> is not shared across different priorities.
> 
> Now, this limited implementation, while incomplete and easy to defeat, may
> still be useful. After all, ioprio itself, I think, is flawed in the same
> way. If f2fs wants to implement this internally, that's okay, I suppose, but
> as a generic mechanism, I don't think this makes a lot of sense.

And I wonder if the reason for submitting from a workqueue isn't also
priority inversion?

I haven't looked at the f2fs code, but that comes up in bcachefs; we
have IOs that we submit from worqueue context because they're submitted
in contexts where we _really_ cannot block - they're metadata IOs, and
thus also high priority IOs. But if the queue is already full with lower
priority IOs...

perhaps what we need is a bio flag to say "do not block in the
submission path, queue is allowed to exceed normal limits for this (high
priority) IO"
yi sun July 2, 2024, 9:27 a.m. UTC | #3
Yes, adding the io priority attribute to work will bring huge benefits in the
following scenarios:
The system has huge IO pressure (these IOs may all be low-priority IOs),
and a work (we hope to complete quickly) is also doing IO. If this work
does not set IO priority, it will be blocked because it is difficult to get IO
resources. And if this work sets a high IO priority and passes the IO priority
to kworker, then this work will be completed quickly (as we expect).

On Tue, Jul 2, 2024 at 11:53 AM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Mon, Jul 01, 2024 at 07:32:23AM GMT, Tejun Heo wrote:
> > Hello,
> >
> > On Mon, Jul 01, 2024 at 03:51:37PM +0800, Yi Sun wrote:
> > > +/*
> > > + * If a work may do disk IO, it is recommended to use struct io_work
> > > + * instead of struct work_struct.
> > > + */
> > > +struct io_work {
> > > +   struct work_struct work;
> > > +
> > > +   /* If the work does submit_bio, io priority may be needed. */
> > > +   unsigned short ioprio;
> > > +   /* Record kworker's original io priority. */
> > > +   unsigned short ori_ioprio;
> > > +   /* Whether the work has set io priority? */
> > > +   long ioprio_flag;
> > > +};
> >
> > There are fundamental limitations to this approach in terms of
> > prioritization. If you tag each work items with priority and then send them
> > to the same workqueue, there's nothing preventing a low priority issuer from
> > flooding the workqueue and causing a priority inversion. ie. To solve this
> > properly, you need per-issuer-class workqueue so that the concurrency limit
> > is not shared across different priorities.
> >
> > Now, this limited implementation, while incomplete and easy to defeat, may
> > still be useful. After all, ioprio itself, I think, is flawed in the same
> > way. If f2fs wants to implement this internally, that's okay, I suppose, but
> > as a generic mechanism, I don't think this makes a lot of sense.
>
> And I wonder if the reason for submitting from a workqueue isn't also
> priority inversion?
>
> I haven't looked at the f2fs code, but that comes up in bcachefs; we
> have IOs that we submit from worqueue context because they're submitted
> in contexts where we _really_ cannot block - they're metadata IOs, and
> thus also high priority IOs. But if the queue is already full with lower
> priority IOs...
>
> perhaps what we need is a bio flag to say "do not block in the
> submission path, queue is allowed to exceed normal limits for this (high
> priority) IO"
Tejun Heo July 2, 2024, 4:40 p.m. UTC | #4
Hello,

On Tue, Jul 02, 2024 at 05:27:19PM +0800, yi sun wrote:
> Yes, adding the io priority attribute to work will bring huge benefits in the
> following scenarios:
> The system has huge IO pressure (these IOs may all be low-priority IOs),
> and a work (we hope to complete quickly) is also doing IO. If this work
> does not set IO priority, it will be blocked because it is difficult to get IO
> resources. And if this work sets a high IO priority and passes the IO priority
> to kworker, then this work will be completed quickly (as we expect).

As I wrote previously, you can still get trapped in a pretty bad priority
inversion whether from workqueue concurrency or queue depth exhaustion. I'm
sure that there's some spectrum of contention conditions that can be helped
with just setting ioprio, but it's a pretty partial solution.

Thanks.
Kent Overstreet July 2, 2024, 4:45 p.m. UTC | #5
On Tue, Jul 02, 2024 at 05:27:19PM GMT, yi sun wrote:
> Yes, adding the io priority attribute to work will bring huge benefits in the
> following scenarios:
> The system has huge IO pressure (these IOs may all be low-priority IOs),
> and a work (we hope to complete quickly) is also doing IO. If this work
> does not set IO priority, it will be blocked because it is difficult to get IO
> resources. And if this work sets a high IO priority and passes the IO priority
> to kworker, then this work will be completed quickly (as we expect).

Why are you submitting IO from a workqueue in the first place?
kernel test robot July 2, 2024, 9:51 p.m. UTC | #6
Hi Yi,

kernel test robot noticed the following build errors:

[auto build test ERROR on jaegeuk-f2fs/dev-test]
[also build test ERROR on jaegeuk-f2fs/dev tj-wq/for-next linus/master v6.10-rc6 next-20240702]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Yi-Sun/workqueue-new-struct-io_work/20240701-155343
base:   https://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs.git dev-test
patch link:    https://lore.kernel.org/r/20240701075138.1144575-2-yi.sun%40unisoc.com
patch subject: [PATCH v2 1/2] workqueue: new struct io_work
config: arc-tb10x_defconfig (https://download.01.org/0day-ci/archive/20240703/202407030547.MbODytSE-lkp@intel.com/config)
compiler: arc-elf-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240703/202407030547.MbODytSE-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202407030547.MbODytSE-lkp@intel.com/

All errors (new ones prefixed by >>):

   arc-elf-ld: kernel/workqueue.o: in function `may_adjust_io_work_task_ioprio':
   kernel/workqueue.c:2682:(.text+0xbb0): undefined reference to `set_task_ioprio'
>> arc-elf-ld: kernel/workqueue.c:2682:(.text+0xbb0): undefined reference to `set_task_ioprio'
   arc-elf-ld: kernel/workqueue.o: in function `restore_io_work_task_ioprio':
   kernel/workqueue.c:2698:(.text+0xbc8): undefined reference to `set_task_ioprio'
   arc-elf-ld: kernel/workqueue.c:2698:(.text+0xbc8): undefined reference to `set_task_ioprio'
diff mbox series

Patch

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index d9968bfc8eac..4b2cb54a68b2 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -127,6 +127,21 @@  struct rcu_work {
 	struct workqueue_struct *wq;
 };
 
+/*
+ * If a work may do disk IO, it is recommended to use struct io_work
+ * instead of struct work_struct.
+ */
+struct io_work {
+	struct work_struct work;
+
+	/* If the work does submit_bio, io priority may be needed. */
+	unsigned short ioprio;
+	/* Record kworker's original io priority. */
+	unsigned short ori_ioprio;
+	/* Whether the work has set io priority? */
+	long ioprio_flag;
+};
+
 enum wq_affn_scope {
 	WQ_AFFN_DFL,			/* use system default */
 	WQ_AFFN_CPU,			/* one pod per CPU */
@@ -218,6 +233,11 @@  static inline struct rcu_work *to_rcu_work(struct work_struct *work)
 	return container_of(work, struct rcu_work, work);
 }
 
+static inline struct io_work *to_io_work(struct work_struct *work)
+{
+	return container_of(work, struct io_work, work);
+}
+
 struct execute_work {
 	struct work_struct work;
 };
@@ -347,6 +367,18 @@  static inline unsigned int work_static(struct work_struct *work) { return 0; }
 #define INIT_RCU_WORK_ONSTACK(_work, _func)				\
 	INIT_WORK_ONSTACK(&(_work)->work, (_func))
 
+#define INIT_IO_WORK(_work, _func)					\
+	do {								\
+		INIT_WORK(&(_work)->work, (_func));			\
+		(_work)->ioprio_flag = 0;				\
+	} while (0)
+
+#define INIT_IO_WORK_ONSTACK(_work, _func)				\
+	do {								\
+		INIT_WORK_ONSTACK(&(_work)->work, (_func));		\
+		(_work)->ioprio_flag = 0;				\
+	} while (0)
+
 /**
  * work_pending - Find out whether a work item is currently pending
  * @work: The work item in question
@@ -552,6 +584,10 @@  extern bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq,
 			struct delayed_work *dwork, unsigned long delay);
 extern bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork);
 
+extern void set_io_work_ioprio(struct io_work *work, unsigned short ioprio);
+extern void may_adjust_io_work_task_ioprio(struct io_work *work);
+extern void restore_io_work_task_ioprio(struct io_work *work);
+
 extern void __flush_workqueue(struct workqueue_struct *wq);
 extern void drain_workqueue(struct workqueue_struct *wq);
 
@@ -636,6 +672,17 @@  static inline bool queue_delayed_work(struct workqueue_struct *wq,
 	return queue_delayed_work_on(WORK_CPU_UNBOUND, wq, dwork, delay);
 }
 
+/**
+ * queue_io_work - queue io work on a workqueue
+ * @wq: workqueue to use
+ * @iowork: io work to queue
+ */
+static inline bool queue_io_work(struct workqueue_struct *wq,
+				      struct io_work *iowork)
+{
+	return queue_work(wq, &(iowork->work));
+}
+
 /**
  * mod_delayed_work - modify delay of or queue a delayed work
  * @wq: workqueue to use
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 3fbaecfc88c2..a55b74d5f560 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2652,6 +2652,56 @@  bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork)
 }
 EXPORT_SYMBOL(queue_rcu_work);
 
+/**
+ * set_io_work_ioprio - set io priority for the current io work
+ * @iowork: the io work to be set
+ * @ioprio: desired io priority
+ *
+ * This function can be called after INIT_IO_WORK if the io priority
+ * of the io work needs to adjust. And it is recommended to use this
+ * function together with may_adjust_io_work_task_ioprio() and
+ * restore_io_work_task_ioprio().
+ */
+void set_io_work_ioprio(struct io_work *iowork, unsigned short ioprio)
+{
+	iowork->ioprio = ioprio;
+	iowork->ioprio_flag = 1;
+}
+EXPORT_SYMBOL(set_io_work_ioprio);
+
+/**
+ * may_adjust_io_work_task_ioprio - maybe adjust the io priority of kworker
+ * @iowork: the io work that kworker will do
+ *
+ * It is recommended to use this function together with set_io_work_ioprio()
+ * and restore_io_work_task_ioprio().
+ */
+void may_adjust_io_work_task_ioprio(struct io_work *iowork)
+{
+	if (iowork->ioprio_flag) {
+		iowork->ori_ioprio = get_current_ioprio();
+		set_task_ioprio(current, iowork->ioprio);
+	}
+}
+EXPORT_SYMBOL(may_adjust_io_work_task_ioprio);
+
+/**
+ * restore_io_work_task_ioprio - restore the io priority of kworker
+ * @iowork: the io work that kworker just did
+ *
+ * When kworker finishes the io work, the original io priority of
+ * kworker should be restored. It is recommended to use this function
+ * together with set_io_work_ioprio() and may_adjust_io_work_task_ioprio().
+ */
+void restore_io_work_task_ioprio(struct io_work *iowork)
+{
+	if (iowork->ioprio_flag) {
+		set_task_ioprio(current, iowork->ori_ioprio);
+		iowork->ioprio_flag = 0;
+	}
+}
+EXPORT_SYMBOL(restore_io_work_task_ioprio);
+
 static struct worker *alloc_worker(int node)
 {
 	struct worker *worker;