diff mbox

[v4,2/2] ksm: provide support to use deferrable timers for scanner thread

Message ID alpine.LSU.2.11.1409080023100.1610@eggly.anvils (mailing list archive)
State Not Applicable, archived
Headers show

Commit Message

Hugh Dickins Sept. 8, 2014, 8:25 a.m. UTC
On Wed, 3 Sep 2014, Peter Zijlstra wrote:
> On Wed, Aug 27, 2014 at 11:02:20PM -0700, Hugh Dickins wrote:
> > On Wed, 20 Aug 2014, Chintan Pandya wrote:
> > 
> > > KSM thread to scan pages is scheduled on definite timeout. That wakes up
> > > CPU from idle state and hence may affect the power consumption. Provide
> > > an optional support to use deferrable timer which suites low-power
> > > use-cases.
> > > 
> > > Typically, on our setup we observed, 10% less power consumption with some
> > > use-cases in which CPU goes to power collapse frequently. For example,
> > > playing audio on Soc which has HW based Audio encoder/decoder, CPU
> > > remains idle for longer duration of time. This idle state will save
> > > significant CPU power consumption if KSM don't wakes them up
> > > periodically.
> > > 
> > > Note that, deferrable timers won't be deferred if any CPU is active and
> > > not in IDLE state.
> > > 
> > > By default, deferrable timers is enabled. To disable deferrable timers,
> > > $ echo 0 > /sys/kernel/mm/ksm/deferrable_timer
> > 
> > I have now experimented.  And, much as I wanted to eliminate the
> > tunable, and just have deferrable timers on, I have come right back
> > to your original position.
> > 
> > I was impressed by how quiet ksmd goes when there's nothing much
> > happening on the machine; but equally, disappointed in how slow
> > it then is to fulfil the outstanding merge work.  I agree with your
> > original assessment, that not everybody will want deferrable timer,
> > the way it is working at present.
> > 
> > I expect that can be fixed, partly by doing more work on wakeup from
> > a deferred timer, according to how long it has been deferred; and
> > partly by not deferring on idle until two passes of the list have been
> > completed.  But that's easier said than done, and might turn out to
> 
> So why not have the timer cancel itself when there is no more work to do
> and start itself up again when there's work added?

Well, yes, but... how do we know when there is no more work to do?
Further down I said:

> > But fixing that might require ksm hooks in hot locations where nobody
> > else would want them: I'm rather hoping we can strike a good enough
> > balance with your deferrable timer, that nobody will need any better.

Thomas has given reason why KSM might simply fail to do its job if we
rely on the deferrable timer.  So I've tried another approach, patch
below; but I do not expect you to jump for joy at the sight of it!

I've tried to minimize the offensive KSM hook in context_switch().
Why place it there, rather than do something near profile_tick() or
account_process_tick()?  Because KSM is aware of mms not tasks, and
context_switch() should have the next mm cachelines hot (if not, a
slight regrouping in mm_struct should do it); whereas I can find
no reference whatever to mm_struct in kernel/time, so hooking to
KSM from there would drag in another few cachelines every tick.

(Another approach would be to set up KSM hint faulting, along the
lines of NUMA hint faulting.  Not a path I'm keen to go down.)

I'm not thrilled with this patch, I think it's somewhat defective
in several ways.  But maybe in practice it will prove good enough,
and if so then I'd rather not waste effort on complicating it.

My own testing is not realistic, nor representative of real KSM users;
and I have no idea what values of pages_to_scan and sleep_millisecs
people really use (and those may make quite a difference to how
well it works).

Chintan, even if the scheduler guys turn out to hate it, please would
you give the patch below a try, to see how well it works in your
environment, whether it seems to go better or worse than your own patch.

If it works well enough for you, maybe we can come up with ideas to
make it more palatable.  I do think your issue is an important one
to fix, one way or another.

Thanks,
Hugh

[PATCH] ksm: avoid periodic wakeup while mergeable mms are quiet

Description yet to be written!

Reported-by: Chintan Pandya <cpandya@codeaurora.org>
Not-Signed-off-by: Hugh Dickins <hughd@google.com>
---

 include/linux/ksm.h   |   14 +++++++++++
 include/linux/sched.h |    1 
 kernel/sched/core.c   |    9 ++++++-
 mm/ksm.c              |   50 ++++++++++++++++++++++++++++------------
 4 files changed, 58 insertions(+), 16 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-arm-msm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Peter Zijlstra Sept. 8, 2014, 9:39 a.m. UTC | #1
On Mon, Sep 08, 2014 at 01:25:36AM -0700, Hugh Dickins wrote:
> Well, yes, but... how do we know when there is no more work to do?

Yeah, I figured that out _after_ I send that email..

> Thomas has given reason why KSM might simply fail to do its job if we
> rely on the deferrable timer.  So I've tried another approach, patch
> below; but I do not expect you to jump for joy at the sight of it!

Indeed :/

> I've tried to minimize the offensive KSM hook in context_switch().
> Why place it there, rather than do something near profile_tick() or
> account_process_tick()?  Because KSM is aware of mms not tasks, and
> context_switch() should have the next mm cachelines hot (if not, a
> slight regrouping in mm_struct should do it); whereas I can find
> no reference whatever to mm_struct in kernel/time, so hooking to
> KSM from there would drag in another few cachelines every tick.
> 
> (Another approach would be to set up KSM hint faulting, along the
> lines of NUMA hint faulting.  Not a path I'm keen to go down.)
> 
> I'm not thrilled with this patch, I think it's somewhat defective
> in several ways.  But maybe in practice it will prove good enough,
> and if so then I'd rather not waste effort on complicating it.
> 
> My own testing is not realistic, nor representative of real KSM users;
> and I have no idea what values of pages_to_scan and sleep_millisecs
> people really use (and those may make quite a difference to how
> well it works).
> 
> Chintan, even if the scheduler guys turn out to hate it, please would
> you give the patch below a try, to see how well it works in your
> environment, whether it seems to go better or worse than your own patch.
> 
> If it works well enough for you, maybe we can come up with ideas to
> make it more palatable.  I do think your issue is an important one
> to fix, one way or another.
> 
> Thanks,
> Hugh
> 
> [PATCH] ksm: avoid periodic wakeup while mergeable mms are quiet
> 
> Description yet to be written!
> 
> Reported-by: Chintan Pandya <cpandya@codeaurora.org>
> Not-Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
> 
>  include/linux/ksm.h   |   14 +++++++++++
>  include/linux/sched.h |    1 
>  kernel/sched/core.c   |    9 ++++++-
>  mm/ksm.c              |   50 ++++++++++++++++++++++++++++------------
>  4 files changed, 58 insertions(+), 16 deletions(-)
> 
> --- 3.17-rc4/include/linux/ksm.h	2014-03-30 20:40:15.000000000 -0700
> +++ linux/include/linux/ksm.h	2014-09-07 11:54:41.528003316 -0700

> @@ -87,6 +96,11 @@ static inline void ksm_exit(struct mm_st
>  {
>  }
>  
> +static inline wait_queue_head_t *ksm_switch(struct mm_struct *mm)

s/ksm_switch/__&/

> +{
> +	return NULL;
> +}
> +
>  static inline int PageKsm(struct page *page)
>  {
>  	return 0;

> --- 3.17-rc4/kernel/sched/core.c	2014-08-16 16:00:54.062189063 -0700
> +++ linux/kernel/sched/core.c	2014-09-07 11:54:41.528003316 -0700

> @@ -2304,6 +2305,7 @@ context_switch(struct rq *rq, struct tas
>  	       struct task_struct *next)
>  {
>  	struct mm_struct *mm, *oldmm;
> +	wait_queue_head_t *wake_ksm = NULL;
>  
>  	prepare_task_switch(rq, prev, next);
>  
> @@ -2320,8 +2322,10 @@ context_switch(struct rq *rq, struct tas
>  		next->active_mm = oldmm;
>  		atomic_inc(&oldmm->mm_count);
>  		enter_lazy_tlb(oldmm, next);
> -	} else
> +	} else {
>  		switch_mm(oldmm, mm, next);
> +		wake_ksm = ksm_switch(mm);

Is this the right mm? We've just switched the stack, so we're looking at
next->mm when we switched away from current. That might not exist
anymore.

> +	}
>  
>  	if (!prev->mm) {
>  		prev->active_mm = NULL;
> @@ -2348,6 +2352,9 @@ context_switch(struct rq *rq, struct tas
>  	 * frame will be invalid.
>  	 */
>  	finish_task_switch(this_rq(), prev);
> +
> +	if (wake_ksm)
> +		wake_up_interruptible(wake_ksm);
>  }

Quite horrible for sure. I really hate seeing KSM cruft all the way down
here. Can't we create a new (timer) infrastructure that does the right
thing? Surely this isn't the only such case.

I know both RCU and some NOHZ_FULL muck already track when the system is
completely idle. This is yet another case of that.
Chintan Pandya Sept. 9, 2014, 2:52 p.m. UTC | #2
Hello All,

Before its too late to discuss this basic question, allow me to share my 
view on the deferrable timer approach.

I believe KSM at this point is tunable with predictable outcomes. When 
it will get triggered, how many pages it will scan etc. This aggression 
control is with user who can implement any complex logic based on its 
own flashy parameters. Along the same line, I was seeing this deferrable 
timer knob.

Here I was hoping the same level of predictability with this knob. 
Kernel still won't do smart work and user is free to play smart/complex 
with the knob. I believe that there are many use-cases where a single 
strategy of "to make KSM perform better while saving power at the same 
time" may not work. So,


> On Mon, Sep 08, 2014 at 01:25:36AM -0700, Hugh Dickins wrote:
>> Well, yes, but... how do we know when there is no more work to do?
>
> Yeah, I figured that out _after_ I send that email..
>
>> Thomas has given reason why KSM might simply fail to do its job if we
>> rely on the deferrable timer.

With deferrable timer, KSM thread will be scheduled on the 'active' CPU 
at that very same time. Yes, I understood from Thomas's clarification 
that if that very CPU goes IDLE, KSM task will get deferred even if at 
the timeout, we have some CPUs running. I think, this situation can be 
avoided potentially very small timeout value (?). But in totality, this 
is where KSM will be idle for sure, may be that is unwanted.

>>
>> Chintan, even if the scheduler guys turn out to hate it, please would
>> you give the patch below a try, to see how well it works in your
>> environment, whether it seems to go better or worse than your own patch.
>>
>> If it works well enough for you, maybe we can come up with ideas to
>> make it more palatable.  I do think your issue is an important one
>> to fix, one way or another.

It is taking a little more time for me to grasp your change :) So, after 
Peter's comment, do you want me to try this out or you are looking 
forward for even better idea ? BTW, if deferrable timer patch gets any 
green signal, I will publish new patch with your comments on v4.

>>
>> Thanks,
>> Hugh
>>
>> [PATCH] ksm: avoid periodic wakeup while mergeable mms are quiet
>>
>> Description yet to be written!
>>
>> Reported-by: Chintan Pandya<cpandya@codeaurora.org>
>> Not-Signed-off-by: Hugh Dickins<hughd@google.com>


 >>> So looking at Hughs test results I'm quite sure that the deferrable
 >>> timer is just another tunable bandaid with dubious value and the
 >>> potential of predictable bug/regresssion reports.

Here I am naive in understanding the obvious disadvantages of 'one more 
knob'. And hence was inclined towards deferrable timer knob. Thomas, 
could you explain what kind of bug/regression you foresee with such 
approach ?
Hugh Dickins Sept. 9, 2014, 8:14 p.m. UTC | #3
On Mon, 8 Sep 2014, Peter Zijlstra wrote:
> On Mon, Sep 08, 2014 at 01:25:36AM -0700, Hugh Dickins wrote:
> > 
> > --- 3.17-rc4/include/linux/ksm.h	2014-03-30 20:40:15.000000000 -0700
> > +++ linux/include/linux/ksm.h	2014-09-07 11:54:41.528003316 -0700
> 
> > @@ -87,6 +96,11 @@ static inline void ksm_exit(struct mm_st
> >  {
> >  }
> >  
> > +static inline wait_queue_head_t *ksm_switch(struct mm_struct *mm)
> 
> s/ksm_switch/__&/

I don't think so (and I did try building with CONFIG_KSM off too).

> 
> > +{
> > +	return NULL;
> > +}
> > +
> >  static inline int PageKsm(struct page *page)
> >  {
> >  	return 0;
> 
> > --- 3.17-rc4/kernel/sched/core.c	2014-08-16 16:00:54.062189063 -0700
> > +++ linux/kernel/sched/core.c	2014-09-07 11:54:41.528003316 -0700
> 
> > @@ -2304,6 +2305,7 @@ context_switch(struct rq *rq, struct tas
> >  	       struct task_struct *next)
> >  {
> >  	struct mm_struct *mm, *oldmm;
> > +	wait_queue_head_t *wake_ksm = NULL;
> >  
> >  	prepare_task_switch(rq, prev, next);
> >  
> > @@ -2320,8 +2322,10 @@ context_switch(struct rq *rq, struct tas
> >  		next->active_mm = oldmm;
> >  		atomic_inc(&oldmm->mm_count);
> >  		enter_lazy_tlb(oldmm, next);
> > -	} else
> > +	} else {
> >  		switch_mm(oldmm, mm, next);
> > +		wake_ksm = ksm_switch(mm);
> 
> Is this the right mm?

It's next->mm, that's the one I intended (though the patch might
be equally workable using prev->mm instead: given free rein, I'd
have opted for hooking into both prev and next, but free rein is
definitely not what should be granted around here!).

> We've just switched the stack,

I thought that came in switch_to() a few lines further down,
but don't think it matters for this.

> so we're looing at next->mm when we switched away from current.
> That might not exist anymore.

I fail to see how that can be.  Looking at the x86 switch_mm(),
I can see it referencing (unsurprisingly!) both old and new mms
at this point, and no reference to an mm is dropped before the
ksm_switch().  oldmm (there called mm) is mmdropped later in
finish_task_switch().

> 
> > +	}
> >  
> >  	if (!prev->mm) {
> >  		prev->active_mm = NULL;
> > @@ -2348,6 +2352,9 @@ context_switch(struct rq *rq, struct tas
> >  	 * frame will be invalid.
> >  	 */
> >  	finish_task_switch(this_rq(), prev);
> > +
> > +	if (wake_ksm)
> > +		wake_up_interruptible(wake_ksm);
> >  }
> 
> Quite horrible for sure. I really hate seeing KSM cruft all the way down

Yes, I expected that, and I would certainly feel the same way.

And even worse, imagine if this were successful, we might come along
and ask to do something similar for khugepaged.  Though if it comes to
that, I'm sure we would generalize into one hook which does not say
"ksm" or "khugepaged" on it, but would still a present a single unlikely
flag to be tested at this level.  Maybe you would even prefer the
generalized version, but I don't want to complicate the prototype yet.

If it weren't for the "we already have the mm cachelines here" argument,
I by now feel fairly sure that I would be going for hooking into timer
tick instead (where Thomas could then express his horror!).

Do you think I should just forget about cacheline micro-optimizations
and go in that direction instead?

> here. Can't we create a new (timer) infrastructure that does the right
> thing? Surely this isn't the only such case.

A sleep-walking timer, that goes to sleep in one bed, but may wake in
another; and defers while beds are empty?  I'd be happy to try using
that for KSM if it already existed, and no doubt Chintan would too

But I don't think KSM presents a very good case for developing it.
I think KSM's use of a sleep_millisecs timer is really just an apology
for the amount of often wasted work that it does, and dates from before
we niced it down 5.  I prefer the idea of a KSM which waits on activity
amongst the restricted set of tasks it is tracking: as this patch tries.

But my preference may be naive: doing lots of unnecessary work doesn't
matter as much as waking cpus from deep sleep.

> 
> I know both RCU and some NOHZ_FULL muck already track when the system is
> completely idle. This is yet another case of that.

Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-arm-msm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hugh Dickins Sept. 9, 2014, 8:37 p.m. UTC | #4
On Tue, 9 Sep 2014, Chintan Pandya wrote:
> Hello All,
> 
> Before its too late to discuss this basic question, allow me to share my view
> on the deferrable timer approach.
> 
> I believe KSM at this point is tunable with predictable outcomes. When it
> will get triggered, how many pages it will scan etc. This aggression control
> is with user who can implement any complex logic based on its own flashy
> parameters. Along the same line, I was seeing this deferrable timer knob.
> 
> Here I was hoping the same level of predictability with this knob. Kernel
> still won't do smart work and user is free to play smart/complex with the
> knob. I believe that there are many use-cases where a single strategy of "to
> make KSM perform better while saving power at the same time" may not work.
> So,

Understood.  Whereas seasoned developers grow increasingly sick and
sceptical of such knobs, preferring to make an effort to get things
working well without them.  Both attitudes are valid.

> 
> > On Mon, Sep 08, 2014 at 01:25:36AM -0700, Hugh Dickins wrote:
> > > Well, yes, but... how do we know when there is no more work to do?
> > 
> > Yeah, I figured that out _after_ I send that email..
> > 
> > > Thomas has given reason why KSM might simply fail to do its job if we
> > > rely on the deferrable timer.
> 
> With deferrable timer, KSM thread will be scheduled on the 'active' CPU at
> that very same time. Yes, I understood from Thomas's clarification that if
> that very CPU goes IDLE, KSM task will get deferred even if at the timeout,
> we have some CPUs running. I think, this situation can be avoided potentially
> very small timeout value (?). But in totality, this is where KSM will be idle
> for sure, may be that is unwanted.

I don't get how a very small timeout value would solve that.  But I am
all for a KSM which works well even with timeout value 0: responsive,
but not power hungry.

> 
> > > 
> > > Chintan, even if the scheduler guys turn out to hate it, please would
> > > you give the patch below a try, to see how well it works in your
> > > environment, whether it seems to go better or worse than your own patch.
> > > 
> > > If it works well enough for you, maybe we can come up with ideas to
> > > make it more palatable.  I do think your issue is an important one
> > > to fix, one way or another.
> 
> It is taking a little more time for me to grasp your change :)

Grasping the intent should be easy: I thought that was in the title,
"ksm: avoid periodic wakeup while mergeable mms are quiet": just go
to sleep until at least one of the MMF_VM_MERGEABLE tasks gets run.

Checking the details, whether I'm accomplishing that intent, yes,
that needs more understanding of ksm.c, which your deferrable timer
approach did not need to get involved with at all (to its credit).

> So, after Peter's comment, do you want me to try this out

Certainly yes, please do.  I'm not saying that's a patch which will
ever go into the kernel itself, but we do want to know whether it does
the job for you or not, whether it's a worthwhile attempt in the right
direction.  Does it save as much as your patch?  Does it save more?

> or you are looking forward for even better idea ?

I'd love a better idea: assuming mine works, is there some other way
of accomplishing it, which does not pollute sched/core.c, but does
not drag in mm_struct cachelines for a frequent flags check?

> BTW, if deferrable timer patch gets any green signal,
> I will publish new patch with your comments on v4.

It seems to be a red light at present: but lights do change ;)

> 
> > > 
> > > Thanks,
> > > Hugh
> > > 
> > > [PATCH] ksm: avoid periodic wakeup while mergeable mms are quiet
> > > 
> > > Description yet to be written!
> > > 
> > > Reported-by: Chintan Pandya<cpandya@codeaurora.org>
> > > Not-Signed-off-by: Hugh Dickins<hughd@google.com>
> 
> 
> >>> So looking at Hughs test results I'm quite sure that the deferrable
> >>> timer is just another tunable bandaid with dubious value and the
> >>> potential of predictable bug/regresssion reports.
> 
> Here I am naive in understanding the obvious disadvantages of 'one more
> knob'. And hence was inclined towards deferrable timer knob. Thomas, could
> you explain what kind of bug/regression you foresee with such approach ?
> 
> -- 
> Chintan Pandya
> 
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
> member of the Code Aurora Forum, hosted by The Linux Foundation
--
To unsubscribe from this list: send the line "unsubscribe linux-arm-msm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Zijlstra Sept. 10, 2014, 8:10 a.m. UTC | #5
On Tue, Sep 09, 2014 at 01:14:50PM -0700, Hugh Dickins wrote:
> On Mon, 8 Sep 2014, Peter Zijlstra wrote:
> > >  		switch_mm(oldmm, mm, next);
> > > +		wake_ksm = ksm_switch(mm);
> > 
> > Is this the right mm?
> 
> It's next->mm, that's the one I intended (though the patch might
> be equally workable using prev->mm instead: given free rein, I'd
> have opted for hooking into both prev and next, but free rein is
> definitely not what should be granted around here!).
> 
> > We've just switched the stack,
> 
> I thought that came in switch_to() a few lines further down,
> but don't think it matters for this.

Ah, yes. Got my task and mm separation messed up.

> > so we're looing at next->mm when we switched away from current.
> > That might not exist anymore.
> 
> I fail to see how that can be.  Looking at the x86 switch_mm(),
> I can see it referencing (unsurprisingly!) both old and new mms
> at this point, and no reference to an mm is dropped before the
> ksm_switch().  oldmm (there called mm) is mmdropped later in
> finish_task_switch().

Well, see the above confusion about switch_mm vs switch_to :-/

So if this were switch_to(), we'd see next->mm as before the last
context switch. And since that switch fully happened, it would also
already have done the finish_task_switch() -> mmdrop().
Peter Zijlstra Sept. 10, 2014, 8:27 a.m. UTC | #6
On Tue, Sep 09, 2014 at 01:14:50PM -0700, Hugh Dickins wrote:
> > Quite horrible for sure. I really hate seeing KSM cruft all the way down
> 
> Yes, I expected that, and I would certainly feel the same way.
> 
> And even worse, imagine if this were successful, we might come along
> and ask to do something similar for khugepaged.  Though if it comes to
> that, I'm sure we would generalize into one hook which does not say
> "ksm" or "khugepaged" on it, but would still a present a single unlikely
> flag to be tested at this level.  Maybe you would even prefer the
> generalized version, but I don't want to complicate the prototype yet.
> 
> If it weren't for the "we already have the mm cachelines here" argument,
> I by now feel fairly sure that I would be going for hooking into timer
> tick instead (where Thomas could then express his horror!).
> 
> Do you think I should just forget about cacheline micro-optimizations
> and go in that direction instead?

Not really either I'm afraid. Slimming down the tick has been on my todo
list like forever. There's a lot of low hanging fruit there.

> > here. Can't we create a new (timer) infrastructure that does the right
> > thing? Surely this isn't the only such case.
> 
> A sleep-walking timer, that goes to sleep in one bed, but may wake in
> another; and defers while beds are empty?  I'd be happy to try using
> that for KSM if it already existed, and no doubt Chintan would too
> 
> But I don't think KSM presents a very good case for developing it.
> I think KSM's use of a sleep_millisecs timer is really just an apology
> for the amount of often wasted work that it does, and dates from before
> we niced it down 5.  I prefer the idea of a KSM which waits on activity
> amongst the restricted set of tasks it is tracking: as this patch tries.
> 
> But my preference may be naive: doing lots of unnecessary work doesn't
> matter as much as waking cpus from deep sleep.

Ah yes, I see your point. So far the mentioned goal has been to not
unduly wake CPUs and waste energy, but your goal is better still. And
would indeed not be met by a globally deferred timer.

Does it make sense to drive both KSM and khugepage the same way we drive
the numa scanning? It has the benefit of getting rid of these threads,
which pushes the work into the right accountable context (the task its
doing the scanning for) and makes the scanning frequency depend on the
actual task activity.
Chintan Pandya Sept. 11, 2014, 8:01 a.m. UTC | #7
I don't mean to divert the thread too much. But just one suggestion 
offered by Harshad.

Why can't we stop invoking more of a KSM scanner thread when we are 
saturating from savings ? But again, to check whether savings are 
saturated or not, we may still want to rely upon timers and we have to 
wake the CPUs up from IDLE state.

>> here. Can't we create a new (timer) infrastructure that does the right
>> thing? Surely this isn't the only such case.
>
> A sleep-walking timer, that goes to sleep in one bed, but may wake in
> another; and defers while beds are empty?  I'd be happy to try using
> that for KSM if it already existed, and no doubt Chintan would too

This is interesting for sure :)

>
> But I don't think KSM presents a very good case for developing it.
> I think KSM's use of a sleep_millisecs timer is really just an apology
> for the amount of often wasted work that it does, and dates from before
> we niced it down 5.  I prefer the idea of a KSM which waits on activity
> amongst the restricted set of tasks it is tracking: as this patch tries.
>
> But my preference may be naive: doing lots of unnecessary work doesn't
> matter as much as waking cpus from deep sleep.

This is exactly the preference we are looking for. But yes, cannot be 
generalized for all.

>
>>
>> I know both RCU and some NOHZ_FULL muck already track when the system is
>> completely idle. This is yet another case of that.
>
> Hugh
Hugh Dickins Sept. 11, 2014, 12:59 p.m. UTC | #8
On Wed, 10 Sep 2014, Peter Zijlstra wrote:
> 
> Does it make sense to drive both KSM and khugepage the same way we drive
> the numa scanning? It has the benefit of getting rid of these threads,
> which pushes the work into the right accountable context (the task its
> doing the scanning for) and makes the scanning frequency depend on the
> actual task activity.

I expect it would be possible: but more work than I'd ever find time
to complete myself, with uncertain benefit.

khugepaged would probably be easier to convert, since it is dealing
with independent mms anyway.  Whereas ksmd is establishing sharing
between unrelated mms, so cannot deal with single mms in isolation.

But what's done by a single daemon today, could be passed from task
to task under mutex instead; with probably very different handling
of KSM's "unstable" tree (at present the old one is forgotten at the
start of each cycle, and the new one rebuilt from scratch: I expect
that would have to change, to removing rb entries one by one).

How well it would work out, I'm not confident to say.  And I think
we shall need an answer to the power question sooner than we can
turn the design of KSM on its head.  Vendors will go with what works
for them, never mind what our priniciples dictate.

Your suggestion of following the NUMA scanning did make me wonder
if I could use task_work: if there were already a re-arming task_work,
I could probably use that, and escape your gaze :)  But I don't think
it exists at present, and I don't think it's an extension that would
be welcomed, and I don't think it would present an efficient solution.

The most satisfying solution aesthetically, would be for KSM to write
protect the VM_MERGEABLE areas at some stage (when they "approach
stability", whatever I mean by that), and let itself be woken by the
faults (and if there are no write faults on any of the areas, then
there is no need for it to be awoken).

But I think that all those new faults would pose a very significant
regression in performance.

I don't have a good idea of where else to hook in at present.

Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-arm-msm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hugh Dickins Sept. 11, 2014, 1:25 p.m. UTC | #9
On Thu, 11 Sep 2014, Chintan Pandya wrote:

> I don't mean to divert the thread too much. But just one suggestion offered
> by Harshad.
> 
> Why can't we stop invoking more of a KSM scanner thread when we are
> saturating from savings ? But again, to check whether savings are saturated
> or not, we may still want to rely upon timers and we have to wake the CPUs up
> from IDLE state.

I agree that it should make sense for KSM to slow down when it sees it's
making no progress (though that would depart from the pages_to_scan and
sleep_millisecs prescription - perhaps could be tied to sleep_millisecs 0).

But not stop.  That's the problem we're mainly concerned with here:
to save power we need it to stop, but then how to wake up, without
putting nasty hooks in hot paths for a minority interest?
I don't see an answer to that above.

Hugh
--
To unsubscribe from this list: send the line "unsubscribe linux-arm-msm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

--- 3.17-rc4/include/linux/ksm.h	2014-03-30 20:40:15.000000000 -0700
+++ linux/include/linux/ksm.h	2014-09-07 11:54:41.528003316 -0700
@@ -12,6 +12,7 @@ 
 #include <linux/pagemap.h>
 #include <linux/rmap.h>
 #include <linux/sched.h>
+#include <linux/wait.h>
 
 struct stable_node;
 struct mem_cgroup;
@@ -21,6 +22,7 @@  int ksm_madvise(struct vm_area_struct *v
 		unsigned long end, int advice, unsigned long *vm_flags);
 int __ksm_enter(struct mm_struct *mm);
 void __ksm_exit(struct mm_struct *mm);
+wait_queue_head_t *__ksm_switch(struct mm_struct *mm);
 
 static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
 {
@@ -35,6 +37,13 @@  static inline void ksm_exit(struct mm_st
 		__ksm_exit(mm);
 }
 
+static inline wait_queue_head_t *ksm_switch(struct mm_struct *mm)
+{
+	if (unlikely(test_bit(MMF_SWITCH_TO_KSM, &mm->flags)))
+		return __ksm_switch(mm);
+	return NULL;
+}
+
 /*
  * A KSM page is one of those write-protected "shared pages" or "merged pages"
  * which KSM maps into multiple mms, wherever identical anonymous page content
@@ -87,6 +96,11 @@  static inline void ksm_exit(struct mm_st
 {
 }
 
+static inline wait_queue_head_t *ksm_switch(struct mm_struct *mm)
+{
+	return NULL;
+}
+
 static inline int PageKsm(struct page *page)
 {
 	return 0;
--- 3.17-rc4/include/linux/sched.h	2014-08-16 16:00:53.909189060 -0700
+++ linux/include/linux/sched.h	2014-09-07 11:54:41.528003316 -0700
@@ -453,6 +453,7 @@  static inline int get_dumpable(struct mm
 
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
+#define MMF_SWITCH_TO_KSM	21	/* notify KSM of switch to this mm */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
--- 3.17-rc4/kernel/sched/core.c	2014-08-16 16:00:54.062189063 -0700
+++ linux/kernel/sched/core.c	2014-09-07 11:54:41.528003316 -0700
@@ -61,6 +61,7 @@ 
 #include <linux/times.h>
 #include <linux/tsacct_kern.h>
 #include <linux/kprobes.h>
+#include <linux/ksm.h>
 #include <linux/delayacct.h>
 #include <linux/unistd.h>
 #include <linux/pagemap.h>
@@ -2304,6 +2305,7 @@  context_switch(struct rq *rq, struct tas
 	       struct task_struct *next)
 {
 	struct mm_struct *mm, *oldmm;
+	wait_queue_head_t *wake_ksm = NULL;
 
 	prepare_task_switch(rq, prev, next);
 
@@ -2320,8 +2322,10 @@  context_switch(struct rq *rq, struct tas
 		next->active_mm = oldmm;
 		atomic_inc(&oldmm->mm_count);
 		enter_lazy_tlb(oldmm, next);
-	} else
+	} else {
 		switch_mm(oldmm, mm, next);
+		wake_ksm = ksm_switch(mm);
+	}
 
 	if (!prev->mm) {
 		prev->active_mm = NULL;
@@ -2348,6 +2352,9 @@  context_switch(struct rq *rq, struct tas
 	 * frame will be invalid.
 	 */
 	finish_task_switch(this_rq(), prev);
+
+	if (wake_ksm)
+		wake_up_interruptible(wake_ksm);
 }
 
 /*
--- 3.17-rc4/mm/ksm.c	2014-08-16 16:00:54.132189065 -0700
+++ linux/mm/ksm.c	2014-09-07 11:54:41.528003316 -0700
@@ -205,6 +205,9 @@  static struct kmem_cache *rmap_item_cach
 static struct kmem_cache *stable_node_cache;
 static struct kmem_cache *mm_slot_cache;
 
+/* The number of mergeable mms which have recently run */
+static atomic_t active_mergeable_mms = ATOMIC_INIT(0);
+
 /* The number of nodes in the stable tree */
 static unsigned long ksm_pages_shared;
 
@@ -313,9 +316,13 @@  static inline struct mm_slot *alloc_mm_s
 	return kmem_cache_zalloc(mm_slot_cache, GFP_KERNEL);
 }
 
-static inline void free_mm_slot(struct mm_slot *mm_slot)
+static void free_mm_slot(struct mm_struct *mm, struct mm_slot *mm_slot)
 {
 	kmem_cache_free(mm_slot_cache, mm_slot);
+
+	clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+	if (!test_and_clear_bit(MMF_SWITCH_TO_KSM, &mm->flags))
+		atomic_dec(&active_mergeable_mms);
 }
 
 static struct mm_slot *get_mm_slot(struct mm_struct *mm)
@@ -801,8 +808,7 @@  static int unmerge_and_remove_all_rmap_i
 			list_del(&mm_slot->mm_list);
 			spin_unlock(&ksm_mmlist_lock);
 
-			free_mm_slot(mm_slot);
-			clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+			free_mm_slot(mm, mm_slot);
 			up_read(&mm->mmap_sem);
 			mmdrop(mm);
 		} else {
@@ -1668,12 +1674,20 @@  next_mm:
 		list_del(&slot->mm_list);
 		spin_unlock(&ksm_mmlist_lock);
 
-		free_mm_slot(slot);
-		clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+		free_mm_slot(mm, slot);
 		up_read(&mm->mmap_sem);
 		mmdrop(mm);
 	} else {
 		spin_unlock(&ksm_mmlist_lock);
+		/*
+		 * After completing its scan, assume this mm to be inactive,
+		 * but set a flag for context_switch() to notify us as soon
+		 * as it is used again: see ksm_switch().  If the number of
+		 * active_mergeable_mms goes down to zero, ksmd will sleep
+		 * to save power, until awoken by mergeable context_switch().
+		 */
+		if (!test_and_set_bit(MMF_SWITCH_TO_KSM, &mm->flags))
+			atomic_dec(&active_mergeable_mms);
 		up_read(&mm->mmap_sem);
 	}
 
@@ -1707,7 +1721,7 @@  static void ksm_do_scan(unsigned int sca
 
 static int ksmd_should_run(void)
 {
-	return (ksm_run & KSM_RUN_MERGE) && !list_empty(&ksm_mm_head.mm_list);
+	return (ksm_run & KSM_RUN_MERGE) && atomic_read(&active_mergeable_mms);
 }
 
 static int ksm_scan_thread(void *nothing)
@@ -1785,15 +1799,11 @@  int ksm_madvise(struct vm_area_struct *v
 int __ksm_enter(struct mm_struct *mm)
 {
 	struct mm_slot *mm_slot;
-	int needs_wakeup;
 
 	mm_slot = alloc_mm_slot();
 	if (!mm_slot)
 		return -ENOMEM;
 
-	/* Check ksm_run too?  Would need tighter locking */
-	needs_wakeup = list_empty(&ksm_mm_head.mm_list);
-
 	spin_lock(&ksm_mmlist_lock);
 	insert_to_mm_slots_hash(mm, mm_slot);
 	/*
@@ -1812,10 +1822,9 @@  int __ksm_enter(struct mm_struct *mm)
 		list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list);
 	spin_unlock(&ksm_mmlist_lock);
 
-	set_bit(MMF_VM_MERGEABLE, &mm->flags);
 	atomic_inc(&mm->mm_count);
-
-	if (needs_wakeup)
+	set_bit(MMF_VM_MERGEABLE, &mm->flags);
+	if (atomic_inc_return(&active_mergeable_mms) == 1)
 		wake_up_interruptible(&ksm_thread_wait);
 
 	return 0;
@@ -1850,8 +1859,7 @@  void __ksm_exit(struct mm_struct *mm)
 	spin_unlock(&ksm_mmlist_lock);
 
 	if (easy_to_free) {
-		free_mm_slot(mm_slot);
-		clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+		free_mm_slot(mm, mm_slot);
 		mmdrop(mm);
 	} else if (mm_slot) {
 		down_write(&mm->mmap_sem);
@@ -1859,6 +1867,18 @@  void __ksm_exit(struct mm_struct *mm)
 	}
 }
 
+wait_queue_head_t *__ksm_switch(struct mm_struct *mm)
+{
+	/*
+	 * Called by context_switch() to a hitherto inactive mergeable mm:
+	 * scheduler locks forbid immediate wakeup so leave that to caller.
+	 */
+	if (test_and_clear_bit(MMF_SWITCH_TO_KSM, &mm->flags) &&
+	    atomic_inc_return(&active_mergeable_mms) == 1)
+		return &ksm_thread_wait;
+	return NULL;
+}
+
 struct page *ksm_might_need_to_copy(struct page *page,
 			struct vm_area_struct *vma, unsigned long address)
 {