
[RFC,v4,01/13] ktask: add documentation

Message ID 20181105165558.11698-2-daniel.m.jordan@oracle.com (mailing list archive)
State New, archived
Series ktask: multithread CPU-intensive kernel work

Commit Message

Daniel Jordan Nov. 5, 2018, 4:55 p.m. UTC
Motivates and explains the ktask API for kernel clients.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 Documentation/core-api/index.rst |   1 +
 Documentation/core-api/ktask.rst | 213 +++++++++++++++++++++++++++++++
 2 files changed, 214 insertions(+)
 create mode 100644 Documentation/core-api/ktask.rst

Comments

Randy Dunlap Nov. 5, 2018, 9:19 p.m. UTC | #1
On 11/5/18 8:55 AM, Daniel Jordan wrote:
> Motivates and explains the ktask API for kernel clients.
> 
> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> ---
>  Documentation/core-api/index.rst |   1 +
>  Documentation/core-api/ktask.rst | 213 +++++++++++++++++++++++++++++++
>  2 files changed, 214 insertions(+)
>  create mode 100644 Documentation/core-api/ktask.rst

Hi,

> diff --git a/Documentation/core-api/ktask.rst b/Documentation/core-api/ktask.rst
> new file mode 100644
> index 000000000000..c3c00e1f802f
> --- /dev/null
> +++ b/Documentation/core-api/ktask.rst
> @@ -0,0 +1,213 @@
> +.. SPDX-License-Identifier: GPL-2.0+
> +
> +============================================
> +ktask: parallelize CPU-intensive kernel work
> +============================================
> +
> +:Date: November, 2018
> +:Author: Daniel Jordan <daniel.m.jordan@oracle.com>
> +
> +
> +Introduction
> +============

[snip]


> +Resource Limits
> +===============
> +
> +ktask has resource limits on the number of work items it sends to workqueue.

                                                                  to a workqueue.
or:                                                               to workqueues.

> +In ktask, a workqueue item is a thread that runs chunks of the task until the
> +task is finished.
> +
> +These limits support the different ways ktask uses workqueues:
> + - ktask_run to run threads on the calling thread's node.
> + - ktask_run_numa to run threads on the node(s) specified.
> + - ktask_run_numa with nid=NUMA_NO_NODE to run threads on any node in the
> +   system.
> +
> +To support these different ways of queueing work while maintaining an efficient
> +concurrency level, we need both system-wide and per-node limits on the number

I would prefer to refer to ktask as ktask instead of "we", so
s/we need/ktask needs/


> +of threads.  Without per-node limits, a node might become oversubscribed
> +despite ktask staying within the system-wide limit, and without a system-wide
> +limit, we can't properly account for work that can run on any node.

s/we/ktask/

> +
> +The system-wide limit is based on the total number of CPUs, and the per-node
> +limit on the CPU count for each node.  A per-node work item counts against the
> +system-wide limit.  Workqueue's max_active can't accommodate both types of
> +limit, no matter how many workqueues are used, so ktask implements its own.
> +
> +If a per-node limit is reached, the work item is allowed to run anywhere on the
> +machine to avoid overwhelming the node.  If the global limit is also reached,
> +ktask won't queue additional work items until we fall below the limit again.

s/we fall/ktask falls/
or s/we fall/it falls/

> +
> +These limits apply only to workqueue items--that is, helper threads beyond the
> +one starting the task.  That way, one thread per task is always allowed to run.


thanks.
Daniel Jordan Nov. 6, 2018, 2:27 a.m. UTC | #2
On Mon, Nov 05, 2018 at 01:19:50PM -0800, Randy Dunlap wrote:
> On 11/5/18 8:55 AM, Daniel Jordan wrote:
> 
> Hi,
> 
> > +Resource Limits
> > +===============
> > +
> > +ktask has resource limits on the number of work items it sends to workqueue.
> 
>                                                                   to a workqueue.
> or:                                                               to workqueues.

Ok, I'll do "to workqueues" since ktask uses two internally (NUMA-aware and
non-NUMA-aware).

> 
> > +In ktask, a workqueue item is a thread that runs chunks of the task until the
> > +task is finished.
> > +
> > +These limits support the different ways ktask uses workqueues:
> > + - ktask_run to run threads on the calling thread's node.
> > + - ktask_run_numa to run threads on the node(s) specified.
> > + - ktask_run_numa with nid=NUMA_NO_NODE to run threads on any node in the
> > +   system.
> > +
> > +To support these different ways of queueing work while maintaining an efficient
> > +concurrency level, we need both system-wide and per-node limits on the number
> 
> I would prefer to refer to ktask as ktask instead of "we", so
> s/we need/ktask needs/

Good idea, I'll change it.

> > +of threads.  Without per-node limits, a node might become oversubscribed
> > +despite ktask staying within the system-wide limit, and without a system-wide
> > +limit, we can't properly account for work that can run on any node.
> 
> s/we/ktask/

Ok.

> > +
> > +The system-wide limit is based on the total number of CPUs, and the per-node
> > +limit on the CPU count for each node.  A per-node work item counts against the
> > +system-wide limit.  Workqueue's max_active can't accommodate both types of
> > +limit, no matter how many workqueues are used, so ktask implements its own.
> > +
> > +If a per-node limit is reached, the work item is allowed to run anywhere on the
> > +machine to avoid overwhelming the node.  If the global limit is also reached,
> > +ktask won't queue additional work items until we fall below the limit again.
> 
> s/we fall/ktask falls/
> or s/we fall/it falls/

'ktask.'  Will change.

> > +
> > +These limits apply only to workqueue items--that is, helper threads beyond the
> > +one starting the task.  That way, one thread per task is always allowed to run.
> 
> 
> thanks.

Appreciate the feedback!
Peter Zijlstra Nov. 6, 2018, 8:49 a.m. UTC | #3
On Mon, Nov 05, 2018 at 11:55:46AM -0500, Daniel Jordan wrote:
> +Concept
> +=======
> +
> +ktask is built on unbound workqueues to take advantage of the thread management
> +facilities it provides: creation, destruction, flushing, priority setting, and
> +NUMA affinity.
> +
> +A little terminology up front:  A 'task' is the total work there is to do and a
> +'chunk' is a unit of work given to a thread.

So I hate on the task naming. We already have a task, let's not overload
that name.

> +To complete a task using the ktask framework, a client provides a thread
> +function that is responsible for completing one chunk.  The thread function is
> +defined in a standard way, with start and end arguments that delimit the chunk
> +as well as an argument that the client uses to pass data specific to the task.
> +
> +In addition, the client supplies an object representing the start of the task
> +and an iterator function that knows how to advance some number of units in the
> +task to yield another object representing the new task position.  The framework
> +uses the start object and iterator internally to divide the task into chunks.
> +
> +Finally, the client passes the total task size and a minimum chunk size to
> +indicate the minimum amount of work that's appropriate to do in one chunk.  The
> +sizes are given in task-specific units (e.g. pages, inodes, bytes).  The
> +framework uses these sizes, along with the number of online CPUs and an
> +internal maximum number of threads, to decide how many threads to start and how
> +many chunks to divide the task into.
> +
> +For example, consider the task of clearing a gigantic page.  This used to be
> +done in a single thread with a for loop that calls a page clearing function for
> +each constituent base page.  To parallelize with ktask, the client first moves
> +the for loop to the thread function, adapting it to operate on the range passed
> +to the function.  In this simple case, the thread function's start and end
> +arguments are just addresses delimiting the portion of the gigantic page to
> +clear.  Then, where the for loop used to be, the client calls into ktask with
> +the start address of the gigantic page, the total size of the gigantic page,
> +and the thread function.  Internally, ktask will divide the address range into
> +an appropriate number of chunks and start an appropriate number of threads to
> +complete these chunks.
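For illustration only, the gigantic page example quoted above might be modeled in plain userspace C as below. This is a sketch under assumptions, not the ktask interface: `chunk_func`, `clear_range()`, `run_in_chunks()` and `BASE_PAGE_SIZE` are hypothetical names, the `nthreads` parameter stands in for whatever the framework derives from online CPUs and its internal thread maximum, and the chunks run sequentially here instead of on helper threads.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BASE_PAGE_SIZE 4096UL	/* assumed base page size */

/* Hypothetical thread function type: a [start, end) range plus client data. */
typedef int (*chunk_func)(void *start, void *end, void *arg);

/* The former for loop, adapted to clear only the range it is handed. */
int clear_range(void *start, void *end, void *arg)
{
	size_t *pages_cleared = arg;
	char *p;

	for (p = start; p < (char *)end; p += BASE_PAGE_SIZE) {
		memset(p, 0, BASE_PAGE_SIZE);
		(*pages_cleared)++;
	}
	return 0;
}

/*
 * Stand-in for the framework: split [start, start + size) into chunks of
 * at least min_chunk bytes and call func once per chunk.  Returns the
 * number of chunks.  A real framework would distribute the chunks across
 * helper threads; this sketch runs them one after another.
 */
size_t run_in_chunks(char *start, size_t size, size_t min_chunk,
		     size_t nthreads, chunk_func func, void *arg)
{
	size_t chunk, off, nchunks = 0;

	assert(min_chunk > 0 && nthreads > 0);
	assert(size % BASE_PAGE_SIZE == 0 && min_chunk % BASE_PAGE_SIZE == 0);

	chunk = size / nthreads;
	chunk -= chunk % BASE_PAGE_SIZE;	/* keep chunks page-aligned */
	if (chunk < min_chunk)
		chunk = min_chunk;

	for (off = 0; off < size; off += chunk) {
		size_t len = size - off < chunk ? size - off : chunk;

		func(start + off, start + off + len, arg);
		nchunks++;
	}
	return nchunks;
}
```

With a 16-page buffer, a two-page minimum chunk and four notional threads, this yields four chunks of four pages each; the caller's side shrinks to a single `run_in_chunks()` call where the loop used to be.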

I see no mention of padata anywhere; I also don't see mention of the
async init stuff. Both appear to me to share, at least in part, the same
reason for existence.

> +Scheduler Interaction
> +=====================
> +
> +Even within the resource limits, ktask must take care to run a number of
> +threads appropriate for the system's current CPU load.  Under high CPU usage,
> +starting excessive helper threads may disturb other tasks, unfairly taking CPU
> +time away from them for the sake of an optimized kernel code path.
> +
> +ktask plays nicely in this case by setting helper threads to the lowest
> +scheduling priority on the system (MAX_NICE).  This way, helpers' CPU time is
> +appropriately throttled on a busy system and other tasks are not disturbed.
> +
> +The main thread initiating the task remains at its original priority so that it
> +still makes progress on a busy system.
> +
> +It is possible for a helper thread to start running and then be forced off-CPU
> +by a higher priority thread.  With the helper's CPU time curtailed by MAX_NICE,
> +the main thread may wait longer for the task to finish than it would have had
> +it not started any helpers, so to ensure forward progress at a single-threaded
> +pace, once the main thread is finished with all outstanding work in the task,
> +the main thread wills its priority to one helper thread at a time.  At least
> +one thread will then always be running at the priority of the calling thread.

What isn't clear is if this calling thread is waiting or not. Only do
this inheritance trick if it is actually waiting on the work. If it is
not, nobody cares.

> +Cgroup Awareness
> +================
> +
> +Given the potentially large amount of CPU time ktask threads may consume, they
> +should be aware of the cgroup of the task that called into ktask and
> +appropriately throttled.
> +
> +TODO: Implement cgroup-awareness in unbound workqueues.

Yes.. that needs done.

> +Power Management
> +================
> +
> +Starting additional helper threads may cause the system to consume more energy,
> +which is undesirable on energy-conscious devices.  Therefore ktask needs to be
> +aware of cpufreq policies and scaling governors.
> +
> +If an energy-conscious policy is in use (e.g. powersave, conservative) on any
> +part of the system, that is a signal that the user has strong power management
> +preferences, in which case ktask is disabled.
> +
> +TODO: Implement this.

No, don't do that, it's broken. Also, we're trying to move to a single
cpufreq governor for all.

Sure we'll retain 'performance', but powersave and conservative and all
that nonsense should go away eventually.

That's not saying you don't need a knob for this; but don't look at
cpufreq for this.
Daniel Jordan Nov. 6, 2018, 8:34 p.m. UTC | #4
On Tue, Nov 06, 2018 at 09:49:11AM +0100, Peter Zijlstra wrote:
> On Mon, Nov 05, 2018 at 11:55:46AM -0500, Daniel Jordan wrote:
> > +Concept
> > +=======
> > +
> > +ktask is built on unbound workqueues to take advantage of the thread management
> > +facilities it provides: creation, destruction, flushing, priority setting, and
> > +NUMA affinity.
> > +
> > +A little terminology up front:  A 'task' is the total work there is to do and a
> > +'chunk' is a unit of work given to a thread.
> 
> So I hate on the task naming. We already have a task, let's not overload
> that name.

Ok, agreed, it's a crowded field with 'task', 'work', 'thread'...

Maybe 'job', since nothing seems to have taken that in kernel/.

> I see no mention of padata anywhere; I also don't see mention of the
> async init stuff. Both appear to me to share, at least in part, the same
> reason for existence.

padata is news to me.  From reading its doc, it comes with some special
requirements of its own, like softirqs disabled during the parallel callback,
and some ktask users need to sleep.  I'll check whether it could be reworked to
handle this.

And yes, async shares the same basic infrastructure, but ktask callers need to
wait, so the two seem fundamentally at odds.  I'll add this explanation in.

> 
> > +Scheduler Interaction
> > +=====================
...
> > +It is possible for a helper thread to start running and then be forced off-CPU
> > +by a higher priority thread.  With the helper's CPU time curtailed by MAX_NICE,
> > +the main thread may wait longer for the task to finish than it would have had
> > +it not started any helpers, so to ensure forward progress at a single-threaded
> > +pace, once the main thread is finished with all outstanding work in the task,
> > +the main thread wills its priority to one helper thread at a time.  At least
> > +one thread will then always be running at the priority of the calling thread.
> 
> What isn't clear is if this calling thread is waiting or not. Only do
> this inheritance trick if it is actually waiting on the work. If it is
> not, nobody cares.

The calling thread waits.  Even if it didn't though, the inheritance trick
would still be desirable for timely completion of the job.

> 
> > +Cgroup Awareness
> > +================
> > +
> > +Given the potentially large amount of CPU time ktask threads may consume, they
> > +should be aware of the cgroup of the task that called into ktask and
> > +appropriately throttled.
> > +
> > +TODO: Implement cgroup-awareness in unbound workqueues.
> 
> Yes.. that needs done.

Great.

> 
> > +Power Management
> > +================
> > +
> > +Starting additional helper threads may cause the system to consume more energy,
> > +which is undesirable on energy-conscious devices.  Therefore ktask needs to be
> > +aware of cpufreq policies and scaling governors.
> > +
> > +If an energy-conscious policy is in use (e.g. powersave, conservative) on any
> > +part of the system, that is a signal that the user has strong power management
> > +preferences, in which case ktask is disabled.
> > +
> > +TODO: Implement this.
> 
> No, don't do that, it's broken. Also, we're trying to move to a single
> cpufreq governor for all.
>
> Sure we'll retain 'performance', but powersave and conservative and all
> that nonsense should go away eventually.

Ok, good to know.

> That's not saying you don't need a knob for this; but don't look at
> cpufreq for this.

Ok, I'll dig through power management to see what else is there.  Maybe there's
some way to ask "is this machine energy conscious?"

Thanks for looking through this!
Jason Gunthorpe Nov. 6, 2018, 8:51 p.m. UTC | #5
On Tue, Nov 06, 2018 at 12:34:11PM -0800, Daniel Jordan wrote:

> > What isn't clear is if this calling thread is waiting or not. Only do
> > this inheritance trick if it is actually waiting on the work. If it is
> > not, nobody cares.
> 
> The calling thread waits.  Even if it didn't though, the inheritance trick
> would still be desirable for timely completion of the job.

Can you make lockdep aware that this is synchronous?

ie if I do

  mutex_lock()
  ktask_run()
  mutex_unlock()

Can lockdep know that all the workers are running under that lock?

I'm thinking particularly about rtnl_lock as a possible case, but
it could also make some sense to hold the read side of the mm_sem
or similar like the above.

Jason
Peter Zijlstra Nov. 7, 2018, 10:27 a.m. UTC | #6
On Tue, Nov 06, 2018 at 08:51:54PM +0000, Jason Gunthorpe wrote:
> On Tue, Nov 06, 2018 at 12:34:11PM -0800, Daniel Jordan wrote:
> 
> > > What isn't clear is if this calling thread is waiting or not. Only do
> > > this inheritance trick if it is actually waiting on the work. If it is
> > > not, nobody cares.
> > 
> > The calling thread waits.  Even if it didn't though, the inheritance trick
> > would still be desirable for timely completion of the job.
> 
> Can you make lockdep aware that this is synchronous?
> 
> ie if I do
> 
>   mutex_lock()
>   ktask_run()
>   mutex_unlock()
> 
> Can lockdep know that all the workers are running under that lock?
> 
> I'm thinking particularly about rtnl_lock as a possible case, but
> it could also make some sense to hold the read side of the mm_sem
> or similar like the above.

Yes, the normal trick is adding a fake lock to ktask_run and holding
that over the actual job. See lock_map* in flush_workqueue() vs
process_one_work().
Peter Zijlstra Nov. 7, 2018, 10:35 a.m. UTC | #7
On Tue, Nov 06, 2018 at 12:34:11PM -0800, Daniel Jordan wrote:
> On Tue, Nov 06, 2018 at 09:49:11AM +0100, Peter Zijlstra wrote:
> > On Mon, Nov 05, 2018 at 11:55:46AM -0500, Daniel Jordan wrote:
> > > +Concept
> > > +=======
> > > +
> > > +ktask is built on unbound workqueues to take advantage of the thread management
> > > +facilities it provides: creation, destruction, flushing, priority setting, and
> > > +NUMA affinity.
> > > +
> > > +A little terminology up front:  A 'task' is the total work there is to do and a
> > > +'chunk' is a unit of work given to a thread.
> > 
> > So I hate on the task naming. We already have a task, let's not overload
> > that name.
> 
> Ok, agreed, it's a crowded field with 'task', 'work', 'thread'...
> 
> Maybe 'job', since nothing seems to have taken that in kernel/.

Do we want to somehow convey the fundamentally parallel nature of the
thing?

> > I see no mention of padata anywhere; I also don't see mention of the
> > async init stuff. Both appear to me to share, at least in part, the same
> > reason for existence.
> 
> padata is news to me.  From reading its doc, it comes with some special
> requirements of its own, like softirqs disabled during the parallel callback,
> and some ktask users need to sleep.  I'll check whether it could be reworked to
> handle this.

Right, padata is something that came from the network stack I think.
It's a bit of an odd thing, but it would be nice if we can fold it into
something larger.

> And yes, async shares the same basic infrastructure, but ktask callers need to
> wait, so the two seem fundamentally at odds.  I'll add this explanation in.

Why does ktask have to be fundamentally async?

> > > +Scheduler Interaction
> > > +=====================
> ...
> > > +It is possible for a helper thread to start running and then be forced off-CPU
> > > +by a higher priority thread.  With the helper's CPU time curtailed by MAX_NICE,
> > > +the main thread may wait longer for the task to finish than it would have had
> > > +it not started any helpers, so to ensure forward progress at a single-threaded
> > > +pace, once the main thread is finished with all outstanding work in the task,
> > > +the main thread wills its priority to one helper thread at a time.  At least
> > > +one thread will then always be running at the priority of the calling thread.
> > 
> > What isn't clear is if this calling thread is waiting or not. Only do
> > this inheritance trick if it is actually waiting on the work. If it is
> > not, nobody cares.
> 
> The calling thread waits.  Even if it didn't though, the inheritance trick
> would still be desirable for timely completion of the job.

No, if nobody is waiting on it, it really doesn't matter.

> > > +Power Management
> > > +================
> > > +
> > > +Starting additional helper threads may cause the system to consume more energy,
> > > +which is undesirable on energy-conscious devices.  Therefore ktask needs to be
> > > +aware of cpufreq policies and scaling governors.
> > > +
> > > +If an energy-conscious policy is in use (e.g. powersave, conservative) on any
> > > +part of the system, that is a signal that the user has strong power management
> > > +preferences, in which case ktask is disabled.
> > > +
> > > +TODO: Implement this.
> > 
> > No, don't do that, it's broken. Also, we're trying to move to a single
> > cpufreq governor for all.
> >
> > Sure we'll retain 'performance', but powersave and conservative and all
> > that nonsense should go away eventually.
> 
> Ok, good to know.
> 
> > That's not saying you don't need a knob for this; but don't look at
> > cpufreq for this.
> 
> Ok, I'll dig through power management to see what else is there.  Maybe there's
> some way to ask "is this machine energy conscious?"

IIRC you're presenting at LPC, drop by the Power Management and
Energy-awareness MC.
Daniel Jordan Nov. 7, 2018, 8:21 p.m. UTC | #8
On Wed, Nov 07, 2018 at 11:27:52AM +0100, Peter Zijlstra wrote:
> On Tue, Nov 06, 2018 at 08:51:54PM +0000, Jason Gunthorpe wrote:
> > On Tue, Nov 06, 2018 at 12:34:11PM -0800, Daniel Jordan wrote:
> > 
> > > > What isn't clear is if this calling thread is waiting or not. Only do
> > > > this inheritance trick if it is actually waiting on the work. If it is
> > > > not, nobody cares.
> > > 
> > > The calling thread waits.  Even if it didn't though, the inheritance trick
> > > would still be desirable for timely completion of the job.
> > 
> > Can you make lockdep aware that this is synchronous?
> > 
> > ie if I do
> > 
> >   mutex_lock()
> >   ktask_run()
> >   mutex_unlock()
> > 
> > Can lockdep know that all the workers are running under that lock?
> > 
> > I'm thinking particularly about rtnl_lock as a possible case, but
> > it could also make some sense to hold the read side of the mm_sem
> > or similar like the above.
> 
> Yes, the normal trick is adding a fake lock to ktask_run and holding
> that over the actual job. See lock_map* in flush_workqueue() vs
> process_one_work().

I'll add that for the next version.
Daniel Jordan Nov. 7, 2018, 9:20 p.m. UTC | #9
On Wed, Nov 07, 2018 at 11:35:54AM +0100, Peter Zijlstra wrote:
> On Tue, Nov 06, 2018 at 12:34:11PM -0800, Daniel Jordan wrote:
> > On Tue, Nov 06, 2018 at 09:49:11AM +0100, Peter Zijlstra wrote:
> > > On Mon, Nov 05, 2018 at 11:55:46AM -0500, Daniel Jordan wrote:
> > > > +Concept
> > > > +=======
> > > > +
> > > > +ktask is built on unbound workqueues to take advantage of the thread management
> > > > +facilities it provides: creation, destruction, flushing, priority setting, and
> > > > +NUMA affinity.
> > > > +
> > > > +A little terminology up front:  A 'task' is the total work there is to do and a
> > > > +'chunk' is a unit of work given to a thread.
> > > 
> > > So I hate on the task naming. We already have a task, let's not overload
> > > that name.
> > 
> > Ok, agreed, it's a crowded field with 'task', 'work', 'thread'...
> > 
> > Maybe 'job', since nothing seems to have taken that in kernel/.
> 
> Do we want to somehow convey the fundamentally parallel nature of the
> thing?

Ok, I've consulted my thesaurus and everything.  Best I can come up with is
just 'parallel', so for example parallel_run would be the interface.  Or
para_run.

Going to think about this more, and I'm open to suggestions.

> 
> > > I see no mention of padata anywhere; I also don't see mention of the
> > > async init stuff. Both appear to me to share, at least in part, the same
> > > reason for existence.
> > 
> > padata is news to me.  From reading its doc, it comes with some special
> > requirements of its own, like softirqs disabled during the parallel callback,
> > and some ktask users need to sleep.  I'll check whether it could be reworked to
> > handle this.
> 
> Right, padata is something that came from the network stack I think.
> It's a bit of an odd thing, but it would be nice if we can fold it into
> something larger.

Sure, I'll see how it goes.

> > And yes, async shares the same basic infrastructure, but ktask callers need to
> > wait, so the two seem fundamentally at odds.  I'll add this explanation in.
> 
> Why does ktask have to be fundamentally async?

Assuming you mean sync.  It doesn't have to be synchronous always, but some
users need that.  A caller that clears a page shouldn't return to userland
before the job is done.

Anyway, sure, it may come out clean to encapsulate the async parts of async.c
into their own paths and then find a new name for that file.  I'll see how this
goes too.

>
> > > > +Scheduler Interaction
> > > > +=====================
> > ...
> > > > +It is possible for a helper thread to start running and then be forced off-CPU
> > > > +by a higher priority thread.  With the helper's CPU time curtailed by MAX_NICE,
> > > > +the main thread may wait longer for the task to finish than it would have had
> > > > +it not started any helpers, so to ensure forward progress at a single-threaded
> > > > +pace, once the main thread is finished with all outstanding work in the task,
> > > > +the main thread wills its priority to one helper thread at a time.  At least
> > > > +one thread will then always be running at the priority of the calling thread.
> > > 
> > > What isn't clear is if this calling thread is waiting or not. Only do
> > > this inheritance trick if it is actually waiting on the work. If it is
> > > not, nobody cares.
> > 
> > The calling thread waits.  Even if it didn't though, the inheritance trick
> > would still be desirable for timely completion of the job.
> 
> No, if nobody is waiting on it, it really doesn't matter.

Ok, I think I see what you meant.  If nobody is waiting on it, as in, it's
not desirable from a performance POV.  Agree.

> 
> > > > +Power Management
> > > > +================
> > > > +
> > > > +Starting additional helper threads may cause the system to consume more energy,
> > > > +which is undesirable on energy-conscious devices.  Therefore ktask needs to be
> > > > +aware of cpufreq policies and scaling governors.
> > > > +
> > > > +If an energy-conscious policy is in use (e.g. powersave, conservative) on any
> > > > +part of the system, that is a signal that the user has strong power management
> > > > +preferences, in which case ktask is disabled.
> > > > +
> > > > +TODO: Implement this.
> > > 
> > > No, don't do that, it's broken. Also, we're trying to move to a single
> > > cpufreq governor for all.
> > >
> > > Sure we'll retain 'performance', but powersave and conservative and all
> > > that nonsense should go away eventually.
> > 
> > Ok, good to know.
> > 
> > > That's not saying you don't need a knob for this; but don't look at
> > > cpufreq for this.
> > 
> > Ok, I'll dig through power management to see what else is there.  Maybe there's
> > some way to ask "is this machine energy conscious?"
> 
> IIRC you're presenting at LPC, drop by the Power Management and
> Energy-awareness MC.

Will do.
Jonathan Corbet Nov. 8, 2018, 5:26 p.m. UTC | #10
On Mon,  5 Nov 2018 11:55:46 -0500
Daniel Jordan <daniel.m.jordan@oracle.com> wrote:

> Motivates and explains the ktask API for kernel clients.

A couple of quick thoughts:

- Agree with Peter on the use of "task"; something like "job" would be far
  less likely to create confusion.  Maybe you could even call it a "batch
  job" to give us old-timers warm fuzzies...:)

- You have kerneldoc comments for the API functions, but you don't pull
  those into the documentation itself.  Adding some kernel-doc directives
  could help to fill things out nicely with little effort.

Thanks,

jon
Daniel Jordan Nov. 8, 2018, 7:15 p.m. UTC | #11
On Thu, Nov 08, 2018 at 10:26:38AM -0700, Jonathan Corbet wrote:
> On Mon,  5 Nov 2018 11:55:46 -0500
> Daniel Jordan <daniel.m.jordan@oracle.com> wrote:
> 
> > Motivates and explains the ktask API for kernel clients.
> 
> A couple of quick thoughts:
> 
> - Agree with Peter on the use of "task"; something like "job" would be far
>   less likely to create confusion.  Maybe you could even call it a "batch
>   job" to give us old-timers warm fuzzies...:)

smp_job?  Or smp_batch, for that retro flavor?  :)

> 
> - You have kerneldoc comments for the API functions, but you don't pull
>   those into the documentation itself.  Adding some kernel-doc directives
>   could help to fill things out nicely with little effort.

I thought this part of ktask.rst handled that, or am I not doing it right?

    Interface
    =========
    
    .. kernel-doc:: include/linux/ktask.h

Thanks for the comments,
Daniel
Jonathan Corbet Nov. 8, 2018, 7:24 p.m. UTC | #12
On Thu, 8 Nov 2018 11:15:53 -0800
Daniel Jordan <daniel.m.jordan@oracle.com> wrote:

> > - You have kerneldoc comments for the API functions, but you don't pull
> >   those into the documentation itself.  Adding some kernel-doc directives
> >   could help to fill things out nicely with little effort.  
> 
> I thought this part of ktask.rst handled that, or am I not doing it right?
> 
>     Interface
>     =========
>     
>     .. kernel-doc:: include/linux/ktask.h

Sigh, no, you're doing it just fine, and I clearly wasn't sufficiently
caffeinated.  Apologies for the noise.

jon
Daniel Jordan Nov. 28, 2018, 4:56 p.m. UTC | #13
On Tue, Nov 27, 2018 at 08:50:08PM +0100, Pavel Machek wrote:
> Hi!

Hi, Pavel.

> > +============================================
> > +ktask: parallelize CPU-intensive kernel work
> > +============================================
> > +
> > +:Date: November, 2018
> > +:Author: Daniel Jordan <daniel.m.jordan@oracle.com>
> 
> 
> > +For example, consider the task of clearing a gigantic page.  This used to be
> > +done in a single thread with a for loop that calls a page clearing function for
> > +each constituent base page.  To parallelize with ktask, the client first moves
> > +the for loop to the thread function, adapting it to operate on the range passed
> > +to the function.  In this simple case, the thread function's start and end
> > +arguments are just addresses delimiting the portion of the gigantic page to
> > +clear.  Then, where the for loop used to be, the client calls into ktask with
> > +the start address of the gigantic page, the total size of the gigantic page,
> > +and the thread function.  Internally, ktask will divide the address range into
> > +an appropriate number of chunks and start an appropriate number of threads to
> > +complete these chunks.
> 
> Great, so my little task is bound to CPUs 1-4 and uses gigantic
> pages. Kernel clears them for me.
> 
> a) Do all the CPUs work for me, or just CPUs I was assigned to?

In ktask's current form, all the CPUs.  This is an existing limitation of
workqueues, which ktask is built on: unbound workqueue workers don't honor the
cpumask of the queueing task (...absent a wq user applying a cpumask wq attr
beforehand, which nobody in-tree does...).

But good point, the helper threads should only run on the CPUs the task is
bound to.  I'm working on cgroup-aware workqueues but hadn't considered a
task's cpumask outside of cgroup/cpuset, so I'll try adding support for this
too.

> b) Will "time my_little_task" show the system time including the
> worker threads?

No, system time of kworkers isn't accounted to the user tasks they're working
on behalf of.  This time is already visible to userland in kworkers, and it
would be confusing to account it to a userland task instead.

Thanks for the questions.

Daniel

Patch

diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index 3adee82be311..c143a280a5b1 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -18,6 +18,7 @@  Core utilities
    refcount-vs-atomic
    cpu_hotplug
    idr
+   ktask
    local_ops
    workqueue
    genericirq
diff --git a/Documentation/core-api/ktask.rst b/Documentation/core-api/ktask.rst
new file mode 100644
index 000000000000..c3c00e1f802f
--- /dev/null
+++ b/Documentation/core-api/ktask.rst
@@ -0,0 +1,213 @@ 
+.. SPDX-License-Identifier: GPL-2.0+
+
+============================================
+ktask: parallelize CPU-intensive kernel work
+============================================
+
+:Date: November, 2018
+:Author: Daniel Jordan <daniel.m.jordan@oracle.com>
+
+
+Introduction
+============
+
+ktask is a generic framework for parallelizing CPU-intensive work in the
+kernel.  The intended use is for big machines that can use their CPU power to
+speed up large tasks that can't otherwise be multithreaded in userland.  The
+API is generic enough to add concurrency to many different kinds of tasks--for
+example, page clearing over an address range or freeing a list of pages--and
+aims to save its clients the trouble of splitting up the work, choosing the
+number of helper threads to use, maintaining an efficient concurrency level,
+starting these threads, and load balancing the work between them.
+
+
+Motivation
+==========
+
+A single CPU can spend an excessive amount of time in the kernel operating on
+large amounts of data.  Often these situations arise during initialization- and
+destruction-related tasks, where the data involved scales with system size.
+These long-running jobs can slow startup and shutdown of applications and the
+system itself while extra CPUs sit idle.
+
+To ensure that applications and the kernel continue to perform well as core
+counts and memory sizes increase, the kernel harnesses these idle CPUs to
+complete such jobs more quickly.
+
+For example, when booting a large NUMA machine, ktask uses additional CPUs that
+would otherwise be idle until the machine is fully up.  This avoids a needless
+bottleneck during system boot and lets the kernel take advantage of unused
+memory bandwidth.  Similarly, when starting a large VM using VFIO, ktask takes
+advantage of the VM's idle CPUs during VFIO page pinning rather than have the
+VM's boot blocked on one thread doing all the work.
+
+ktask is not a substitute for single-threaded optimization.  However, there is
+a point where a single CPU hits a wall despite performance tuning, so
+parallelize!
+
+
+Concept
+=======
+
+ktask is built on unbound workqueues to take advantage of the thread management
+facilities they provide: creation, destruction, flushing, priority setting, and
+NUMA affinity.
+
+A little terminology up front:  A 'task' is the total work there is to do and a
+'chunk' is a unit of work given to a thread.
+
+To complete a task using the ktask framework, a client provides a thread
+function that is responsible for completing one chunk.  The thread function is
+defined in a standard way, with start and end arguments that delimit the chunk
+as well as an argument that the client uses to pass data specific to the task.
+
+In addition, the client supplies an object representing the start of the task
+and an iterator function that knows how to advance some number of units in the
+task to yield another object representing the new task position.  The framework
+uses the start object and iterator internally to divide the task into chunks.
+
+Finally, the client passes the total task size and a minimum chunk size to
+indicate the minimum amount of work that's appropriate to do in one chunk.  The
+sizes are given in task-specific units (e.g. pages, inodes, bytes).  The
+framework uses these sizes, along with the number of online CPUs and an
+internal maximum number of threads, to decide how many threads to start and how
+many chunks to divide the task into.
+
+For example, consider the task of clearing a gigantic page.  This used to be
+done in a single thread with a for loop that calls a page clearing function for
+each constituent base page.  To parallelize with ktask, the client first moves
+the for loop to the thread function, adapting it to operate on the range passed
+to the function.  In this simple case, the thread function's start and end
+arguments are just addresses delimiting the portion of the gigantic page to
+clear.  Then, where the for loop used to be, the client calls into ktask with
+the start address of the gigantic page, the total size of the gigantic page,
+and the thread function.  Internally, ktask will divide the address range into
+an appropriate number of chunks and start an appropriate number of threads to
+complete these chunks.
+
+
+Configuration
+=============
+
+To use ktask, configure the kernel with CONFIG_KTASK=y.
+
+If CONFIG_KTASK=n, calls to the ktask API are simply #define'd to run the
+thread function that the client provides so that the task is completed without
+concurrency in the current thread.
+
+
+Interface
+=========
+
+.. kernel-doc:: include/linux/ktask.h
+
+
+Resource Limits
+===============
+
+ktask has resource limits on the number of work items it sends to workqueues.
+In ktask, a workqueue item is a thread that runs chunks of the task until the
+task is finished.
+
+These limits support the different ways ktask uses workqueues:
+
+ - ktask_run to run threads on the calling thread's node.
+ - ktask_run_numa to run threads on the node(s) specified.
+ - ktask_run_numa with nid=NUMA_NO_NODE to run threads on any node.
+
+To support these different ways of queueing work while maintaining an efficient
+concurrency level, we need both system-wide and per-node limits on the number
+of threads.  Without per-node limits, a node might become oversubscribed
+despite ktask staying within the system-wide limit, and without a system-wide
+limit, we can't properly account for work that can run on any node.
+
+The system-wide limit is based on the total number of CPUs, and the per-node
+limit on the CPU count for each node.  A per-node work item counts against the
+system-wide limit.  Workqueue's max_active can't accommodate both types of
+limit, no matter how many workqueues are used, so ktask implements its own.
+
+If a per-node limit is reached, the work item is allowed to run anywhere on the
+machine to avoid overwhelming the node.  If the system-wide limit is also
+reached, ktask won't queue more work items until we fall below the limit again.
+
+These limits apply only to workqueue items--that is, helper threads beyond the
+one starting the task.  That way, one thread per task is always allowed to run.
+
+
+Scheduler Interaction
+=====================
+
+Even within the resource limits, ktask must take care to run a number of
+threads appropriate for the system's current CPU load.  Under high CPU usage,
+starting excessive helper threads may disturb other tasks, unfairly taking CPU
+time away from them for the sake of an optimized kernel code path.
+
+ktask plays nicely in this case by setting helper threads to the lowest
+scheduling priority on the system (MAX_NICE).  This way, helpers' CPU time is
+appropriately throttled on a busy system and other tasks are not disturbed.
+
+The main thread initiating the task remains at its original priority so that it
+still makes progress on a busy system.
+
+It is possible for a helper thread to start running and then be forced off-CPU
+by a higher-priority thread.  With the helper's CPU time curtailed by MAX_NICE,
+the main thread may wait longer for the task to finish than it would have had
+it not started any helpers.  To ensure forward progress at a single-threaded
+pace, once the main thread is finished with all outstanding work in the task,
+it wills its priority to one helper thread at a time.  At least one thread
+will then always be running at the priority of the calling thread.
+
+
+Cgroup Awareness
+================
+
+Given the potentially large amount of CPU time ktask threads may consume, they
+should be aware of the cgroup of the task that called into ktask and be
+throttled appropriately.
+
+TODO: Implement cgroup-awareness in unbound workqueues.
+
+
+Power Management
+================
+
+Starting additional helper threads may cause the system to consume more energy,
+which is undesirable on energy-conscious devices.  Therefore ktask needs to be
+aware of cpufreq policies and scaling governors.
+
+If an energy-conscious policy is in use (e.g. powersave, conservative) on any
+part of the system, that is a signal that the user has strong power management
+preferences, in which case ktask is disabled.
+
+TODO: Implement this.
+
+
+Backward Compatibility
+======================
+
+ktask is written so that existing calls to the API will be backwards compatible
+should the API gain new features in the future.  This is accomplished by
+restricting API changes to members of struct ktask_ctl and having clients make
+an opaque initialization call (DEFINE_KTASK_CTL).  This initialization can then
+be modified to include any new arguments so that existing call sites stay the
+same.
+
+
+Error Handling
+==============
+
+Calls to ktask fail only if the provided thread function fails.  In particular,
+ktask avoids allocating memory internally during a task, so it's safe to use in
+sensitive contexts.
+
+Tasks can fail midway through their work.  To recover, the finished chunks of
+work need to be undone in a task-specific way, so ktask allows clients to pass
+an "undo" callback that is responsible for undoing one chunk of work.  To avoid
+multiple levels of error handling, this "undo" callback must not fail.  For
+simplicity, and because it is a slow path, undoing is done serially rather than
+in multiple threads.
+
+Each call to ktask_run and ktask_run_numa returns a single value,
+KTASK_RETURN_SUCCESS or a client-specific value.  Since threads can fail for
+different reasons, however, ktask may need the ability to return
+thread-specific error information.  This can be added later if needed.
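
The error-handling flow described in the Error Handling section can be
illustrated with a small userland C sketch.  This is not the ktask
implementation and does not use the ktask API: sketch_run, chunk_fn, undo_fn,
SKETCH_SUCCESS (a stand-in for KTASK_RETURN_SUCCESS), struct pin_state,
pin_chunk and unpin_chunk are all hypothetical names made up for illustration.
Chunks run one after another in the calling thread to keep the sketch short,
whereas ktask would hand them to helper threads.  It also assumes that a
failing thread function cleans up its own partially completed chunk, which the
documentation above does not specify.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for KTASK_RETURN_SUCCESS; the real value is an assumption here. */
#define SKETCH_SUCCESS 0

/* Hypothetical callback types: each one handles a single chunk [start, end). */
typedef int (*chunk_fn)(size_t start, size_t end, void *arg);
typedef void (*undo_fn)(size_t start, size_t end, void *arg);

/*
 * Run the task [0, task_size) one chunk at a time.  If a chunk fails, undo
 * the chunks that already finished, serially, and return the failing chunk's
 * value.  The undo callback returns void because it must not fail.
 */
static int sketch_run(size_t task_size, size_t chunk_size,
		      chunk_fn fn, undo_fn undo, void *arg)
{
	size_t done = 0;	/* units completed so far */
	int ret = SKETCH_SUCCESS;

	assert(chunk_size > 0);

	while (done < task_size) {
		size_t left = task_size - done;
		size_t end = done + (left < chunk_size ? left : chunk_size);

		ret = fn(done, end, arg);
		if (ret != SKETCH_SUCCESS)
			break;
		done = end;
	}

	if (ret != SKETCH_SUCCESS) {
		size_t start = 0;

		/* Slow path: walk the finished chunks in a single thread. */
		while (start < done) {
			size_t left = done - start;
			size_t end = start + (left < chunk_size ? left : chunk_size);

			undo(start, end, arg);
			start = end;
		}
	}
	return ret;
}

/* Example client: "pin" units by setting flags; unit fail_at cannot be pinned. */
struct pin_state {
	unsigned char pinned[16];
	size_t fail_at;
};

static int pin_chunk(size_t start, size_t end, void *arg)
{
	struct pin_state *s = arg;
	size_t i;

	for (i = start; i < end; i++) {
		if (i == s->fail_at) {
			/* Assumption: clean up this partial chunk before failing. */
			while (i > start)
				s->pinned[--i] = 0;
			return -5;	/* client-specific error value */
		}
		s->pinned[i] = 1;
	}
	return SKETCH_SUCCESS;
}

static void unpin_chunk(size_t start, size_t end, void *arg)
{
	struct pin_state *s = arg;
	size_t i;

	for (i = start; i < end; i++)
		s->pinned[i] = 0;
}
```

With a struct pin_state whose fail_at is 10, calling
sketch_run(16, 4, pin_chunk, unpin_chunk, &state) returns the client-specific
value -5 and leaves every flag cleared: the two finished chunks are undone by
unpin_chunk and the failing chunk cleans up after itself.  The caller sees only
that single return value, which matches the limitation noted above about
thread-specific error information.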