Message ID | alpine.LSU.2.11.1409080023100.1610@eggly.anvils (mailing list archive) |
---|---|
State | Not Applicable, archived |
On Mon, Sep 08, 2014 at 01:25:36AM -0700, Hugh Dickins wrote:
> Well, yes, but... how do we know when there is no more work to do?

Yeah, I figured that out _after_ I sent that email..

> Thomas has given reason why KSM might simply fail to do its job if we
> rely on the deferrable timer. So I've tried another approach, patch
> below; but I do not expect you to jump for joy at the sight of it!

Indeed :/

> I've tried to minimize the offensive KSM hook in context_switch().
> Why place it there, rather than do something near profile_tick() or
> account_process_tick()? Because KSM is aware of mms not tasks, and
> context_switch() should have the next mm cachelines hot (if not, a
> slight regrouping in mm_struct should do it); whereas I can find
> no reference whatever to mm_struct in kernel/time, so hooking to
> KSM from there would drag in another few cachelines every tick.
>
> (Another approach would be to set up KSM hint faulting, along the
> lines of NUMA hint faulting. Not a path I'm keen to go down.)
>
> I'm not thrilled with this patch, I think it's somewhat defective
> in several ways. But maybe in practice it will prove good enough,
> and if so then I'd rather not waste effort on complicating it.
>
> My own testing is not realistic, nor representative of real KSM users;
> and I have no idea what values of pages_to_scan and sleep_millisecs
> people really use (and those may make quite a difference to how
> well it works).
>
> Chintan, even if the scheduler guys turn out to hate it, please would
> you give the patch below a try, to see how well it works in your
> environment, whether it seems to go better or worse than your own patch.
>
> If it works well enough for you, maybe we can come up with ideas to
> make it more palatable. I do think your issue is an important one
> to fix, one way or another.
>
> Thanks,
> Hugh
>
> [PATCH] ksm: avoid periodic wakeup while mergeable mms are quiet
>
> Description yet to be written!
>
> Reported-by: Chintan Pandya <cpandya@codeaurora.org>
> Not-Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>
>  include/linux/ksm.h   | 14 +++++++++++
>  include/linux/sched.h |  1
>  kernel/sched/core.c   |  9 ++++++-
>  mm/ksm.c              | 50 ++++++++++++++++++++++++++++------------
>  4 files changed, 58 insertions(+), 16 deletions(-)
>
> --- 3.17-rc4/include/linux/ksm.h	2014-03-30 20:40:15.000000000 -0700
> +++ linux/include/linux/ksm.h	2014-09-07 11:54:41.528003316 -0700
> @@ -87,6 +96,11 @@ static inline void ksm_exit(struct mm_st
>  {
>  }
>
> +static inline wait_queue_head_t *ksm_switch(struct mm_struct *mm)

s/ksm_switch/__&/

> +{
> +	return NULL;
> +}
> +
>  static inline int PageKsm(struct page *page)
>  {
>  	return 0;

> --- 3.17-rc4/kernel/sched/core.c	2014-08-16 16:00:54.062189063 -0700
> +++ linux/kernel/sched/core.c	2014-09-07 11:54:41.528003316 -0700
> @@ -2304,6 +2305,7 @@ context_switch(struct rq *rq, struct tas
>  	       struct task_struct *next)
>  {
>  	struct mm_struct *mm, *oldmm;
> +	wait_queue_head_t *wake_ksm = NULL;
>
>  	prepare_task_switch(rq, prev, next);
>
> @@ -2320,8 +2322,10 @@ context_switch(struct rq *rq, struct tas
>  		next->active_mm = oldmm;
>  		atomic_inc(&oldmm->mm_count);
>  		enter_lazy_tlb(oldmm, next);
> -	} else
> +	} else {
>  		switch_mm(oldmm, mm, next);
> +		wake_ksm = ksm_switch(mm);

Is this the right mm? We've just switched the stack, so we're looking
at next->mm when we switched away from current. That might not exist
anymore.

> +	}
>
>  	if (!prev->mm) {
>  		prev->active_mm = NULL;
> @@ -2348,6 +2352,9 @@ context_switch(struct rq *rq, struct tas
>  	 * frame will be invalid.
>  	 */
>  	finish_task_switch(this_rq(), prev);
> +
> +	if (wake_ksm)
> +		wake_up_interruptible(wake_ksm);
>  }

Quite horrible for sure. I really hate seeing KSM cruft all the way down
here. Can't we create a new (timer) infrastructure that does the right
thing? Surely this isn't the only such case.

I know both RCU and some NOHZ_FULL muck already track when the system is
completely idle. This is yet another case of that.
Hello All,

Before it's too late to discuss this basic question, allow me to share
my view on the deferrable timer approach.

I believe KSM at this point is tunable with predictable outcomes: when
it will get triggered, how many pages it will scan, etc. This aggression
control is with the user, who can implement any complex logic based on
their own parameters. Along the same lines, I was seeing this deferrable
timer knob.

Here I was hoping for the same level of predictability with this knob.
The kernel still won't do the smart work, and the user is free to play
smart/complex with the knob. I believe that there are many use-cases
where a single strategy of "make KSM perform better while saving power
at the same time" may not work. So,

> On Mon, Sep 08, 2014 at 01:25:36AM -0700, Hugh Dickins wrote:
>> Well, yes, but... how do we know when there is no more work to do?
>
> Yeah, I figured that out _after_ I sent that email..
>
>> Thomas has given reason why KSM might simply fail to do its job if we
>> rely on the deferrable timer.

With a deferrable timer, the KSM thread will be scheduled on the
'active' CPU at that very same time. Yes, I understood from Thomas's
clarification that if that very CPU goes idle, the KSM task will get
deferred even if, at the timeout, we have some CPUs running. I think
this situation can potentially be avoided with a very small timeout
value (?). But in totality, this is where KSM will be idle for sure,
and maybe that is unwanted.

>> Chintan, even if the scheduler guys turn out to hate it, please would
>> you give the patch below a try, to see how well it works in your
>> environment, whether it seems to go better or worse than your own patch.
>>
>> If it works well enough for you, maybe we can come up with ideas to
>> make it more palatable. I do think your issue is an important one
>> to fix, one way or another.

It is taking a little more time for me to grasp your change :) So,
after Peter's comment, do you want me to try this out, or are you
looking for an even better idea? BTW, if the deferrable timer patch
gets any green signal, I will publish a new patch with your comments
on v4.

>> Thanks,
>> Hugh
>>
>> [PATCH] ksm: avoid periodic wakeup while mergeable mms are quiet
>>
>> Description yet to be written!
>>
>> Reported-by: Chintan Pandya <cpandya@codeaurora.org>
>> Not-Signed-off-by: Hugh Dickins <hughd@google.com>

>>> So looking at Hugh's test results I'm quite sure that the deferrable
>>> timer is just another tunable bandaid with dubious value and the
>>> potential of predictable bug/regression reports.

Here I am naive in understanding the obvious disadvantages of 'one more
knob', and hence was inclined towards the deferrable timer knob. Thomas,
could you explain what kind of bug/regression you foresee with such an
approach?
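[Editorial note: for readers who have not seen the patch being referred to,
a minimal sketch of what the deferrable-timer approach under discussion
might look like, modelled on the existing ksm_scan_thread() loop in
mm/ksm.c. This is not Chintan's actual patch; ksm_sleep_timer,
ksm_timer_fired and ksm_sleep_timer_fn are invented names, while
ksm_thread_wait, ksm_thread_mutex, ksm_thread_sleep_millisecs,
ksm_thread_pages_to_scan, ksmd_should_run() and ksm_do_scan() are the
existing mm/ksm.c symbols. Error handling, freezer and hotplug details
are omitted.]

/* Sketch only: replace ksmd's schedule_timeout_interruptible() sleep
 * with a deferrable timer, so an idle CPU is not woken just for KSM. */
static struct timer_list ksm_sleep_timer;
static bool ksm_timer_fired;

static void ksm_sleep_timer_fn(unsigned long data)
{
	ksm_timer_fired = true;
	wake_up_interruptible(&ksm_thread_wait);
}

static int ksm_scan_thread(void *nothing)
{
	init_timer_deferrable(&ksm_sleep_timer);
	ksm_sleep_timer.function = ksm_sleep_timer_fn;

	while (!kthread_should_stop()) {
		mutex_lock(&ksm_thread_mutex);
		if (ksmd_should_run())
			ksm_do_scan(ksm_thread_pages_to_scan);
		mutex_unlock(&ksm_thread_mutex);

		if (ksmd_should_run()) {
			/*
			 * A deferrable timer does not force an idle CPU out
			 * of its sleep state: if the CPU it is queued on is
			 * idle when the timeout expires, the wakeup waits
			 * for that CPU's next non-idle tick -- which is the
			 * behaviour Thomas objects to, since other CPUs may
			 * well be busy meanwhile.
			 */
			ksm_timer_fired = false;
			mod_timer(&ksm_sleep_timer, jiffies +
				msecs_to_jiffies(ksm_thread_sleep_millisecs));
			wait_event_freezable(ksm_thread_wait,
				ksm_timer_fired || kthread_should_stop());
		} else {
			wait_event_freezable(ksm_thread_wait,
				ksmd_should_run() || kthread_should_stop());
		}
	}
	del_timer_sync(&ksm_sleep_timer);
	return 0;
}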
On Mon, 8 Sep 2014, Peter Zijlstra wrote:
> On Mon, Sep 08, 2014 at 01:25:36AM -0700, Hugh Dickins wrote:
> >
> > --- 3.17-rc4/include/linux/ksm.h	2014-03-30 20:40:15.000000000 -0700
> > +++ linux/include/linux/ksm.h	2014-09-07 11:54:41.528003316 -0700
>
> > @@ -87,6 +96,11 @@ static inline void ksm_exit(struct mm_st
> >  {
> >  }
> >
> > +static inline wait_queue_head_t *ksm_switch(struct mm_struct *mm)
>
> s/ksm_switch/__&/

I don't think so (and I did try building with CONFIG_KSM off too).

> > +{
> > +	return NULL;
> > +}
> > +
> >  static inline int PageKsm(struct page *page)
> >  {
> >  	return 0;
>
> > --- 3.17-rc4/kernel/sched/core.c	2014-08-16 16:00:54.062189063 -0700
> > +++ linux/kernel/sched/core.c	2014-09-07 11:54:41.528003316 -0700
>
> > @@ -2304,6 +2305,7 @@ context_switch(struct rq *rq, struct tas
> >  	       struct task_struct *next)
> >  {
> >  	struct mm_struct *mm, *oldmm;
> > +	wait_queue_head_t *wake_ksm = NULL;
> >
> >  	prepare_task_switch(rq, prev, next);
> >
> > @@ -2320,8 +2322,10 @@ context_switch(struct rq *rq, struct tas
> >  		next->active_mm = oldmm;
> >  		atomic_inc(&oldmm->mm_count);
> >  		enter_lazy_tlb(oldmm, next);
> > -	} else
> > +	} else {
> >  		switch_mm(oldmm, mm, next);
> > +		wake_ksm = ksm_switch(mm);
>
> Is this the right mm?

It's next->mm, that's the one I intended (though the patch might
be equally workable using prev->mm instead: given free rein, I'd
have opted for hooking into both prev and next, but free rein is
definitely not what should be granted around here!).

> We've just switched the stack,

I thought that came in switch_to() a few lines further down,
but don't think it matters for this.

> so we're looking at next->mm when we switched away from current.
> That might not exist anymore.

I fail to see how that can be. Looking at the x86 switch_mm(),
I can see it referencing (unsurprisingly!) both old and new mms
at this point, and no reference to an mm is dropped before the
ksm_switch(). oldmm (there called mm) is mmdropped later in
finish_task_switch().

> > +	}
> >
> >  	if (!prev->mm) {
> >  		prev->active_mm = NULL;
> > @@ -2348,6 +2352,9 @@ context_switch(struct rq *rq, struct tas
> >  	 * frame will be invalid.
> >  	 */
> >  	finish_task_switch(this_rq(), prev);
> > +
> > +	if (wake_ksm)
> > +		wake_up_interruptible(wake_ksm);
> >  }
>
> Quite horrible for sure. I really hate seeing KSM cruft all the way down

Yes, I expected that, and I would certainly feel the same way.

And even worse, imagine if this were successful, we might come along
and ask to do something similar for khugepaged. Though if it comes to
that, I'm sure we would generalize into one hook which does not say
"ksm" or "khugepaged" on it, but would still present a single unlikely
flag to be tested at this level. Maybe you would even prefer the
generalized version, but I don't want to complicate the prototype yet.

If it weren't for the "we already have the mm cachelines here" argument,
I by now feel fairly sure that I would be going for hooking into timer
tick instead (where Thomas could then express his horror!).

Do you think I should just forget about cacheline micro-optimizations
and go in that direction instead?

> here. Can't we create a new (timer) infrastructure that does the right
> thing? Surely this isn't the only such case.

A sleep-walking timer, that goes to sleep in one bed, but may wake in
another; and defers while beds are empty? I'd be happy to try using
that for KSM if it already existed, and no doubt Chintan would too.

But I don't think KSM presents a very good case for developing it.
I think KSM's use of a sleep_millisecs timer is really just an apology
for the amount of often wasted work that it does, and dates from before
we niced it down to 5. I prefer the idea of a KSM which waits on activity
amongst the restricted set of tasks it is tracking: as this patch tries.

But my preference may be naive: doing lots of unnecessary work doesn't
matter as much as waking cpus from deep sleep.

> I know both RCU and some NOHZ_FULL muck already track when the system is
> completely idle. This is yet another case of that.

Hugh
On Tue, 9 Sep 2014, Chintan Pandya wrote:
> Hello All,
>
> Before it's too late to discuss this basic question, allow me to share
> my view on the deferrable timer approach.
>
> I believe KSM at this point is tunable with predictable outcomes: when
> it will get triggered, how many pages it will scan, etc. This aggression
> control is with the user, who can implement any complex logic based on
> their own parameters. Along the same lines, I was seeing this deferrable
> timer knob.
>
> Here I was hoping for the same level of predictability with this knob.
> The kernel still won't do the smart work, and the user is free to play
> smart/complex with the knob. I believe that there are many use-cases
> where a single strategy of "make KSM perform better while saving power
> at the same time" may not work. So,

Understood. Whereas seasoned developers grow increasingly sick and
sceptical of such knobs, preferring to make an effort to get things
working well without them. Both attitudes are valid.

> > On Mon, Sep 08, 2014 at 01:25:36AM -0700, Hugh Dickins wrote:
> > > Well, yes, but... how do we know when there is no more work to do?
> >
> > Yeah, I figured that out _after_ I sent that email..
> >
> > > Thomas has given reason why KSM might simply fail to do its job if we
> > > rely on the deferrable timer.
>
> With a deferrable timer, the KSM thread will be scheduled on the
> 'active' CPU at that very same time. Yes, I understood from Thomas's
> clarification that if that very CPU goes idle, the KSM task will get
> deferred even if, at the timeout, we have some CPUs running. I think
> this situation can potentially be avoided with a very small timeout
> value (?). But in totality, this is where KSM will be idle for sure,
> and maybe that is unwanted.

I don't get how a very small timeout value would solve that. But I am
all for a KSM which works well even with timeout value 0: responsive,
but not power hungry.

> > > Chintan, even if the scheduler guys turn out to hate it, please would
> > > you give the patch below a try, to see how well it works in your
> > > environment, whether it seems to go better or worse than your own patch.
> > >
> > > If it works well enough for you, maybe we can come up with ideas to
> > > make it more palatable. I do think your issue is an important one
> > > to fix, one way or another.
>
> It is taking a little more time for me to grasp your change :)

Grasping the intent should be easy: I thought that was in the title,
"ksm: avoid periodic wakeup while mergeable mms are quiet": just go to
sleep until at least one of the MMF_VM_MERGEABLE tasks gets run.

Checking the details, whether I'm accomplishing that intent, yes, that
needs more understanding of ksm.c, which your deferrable timer approach
did not need to get involved with at all (to its credit).

> So, after Peter's comment, do you want me to try this out,

Certainly yes, please do. I'm not saying that's a patch which will ever
go into the kernel itself, but we do want to know whether it does the
job for you or not, whether it's a worthwhile attempt in the right
direction. Does it save as much as your patch? Does it save more?

> or are you looking for an even better idea?

I'd love a better idea: assuming mine works, is there some other way of
accomplishing it, which does not pollute sched/core.c, but does not
drag in mm_struct cachelines for a frequent flags check?

> BTW, if the deferrable timer patch gets any green signal,
> I will publish a new patch with your comments on v4.

It seems to be a red light at present: but lights do change ;)

> > > Thanks,
> > > Hugh
> > >
> > > [PATCH] ksm: avoid periodic wakeup while mergeable mms are quiet
> > >
> > > Description yet to be written!
> > >
> > > Reported-by: Chintan Pandya <cpandya@codeaurora.org>
> > > Not-Signed-off-by: Hugh Dickins <hughd@google.com>
>
> >>> So looking at Hugh's test results I'm quite sure that the deferrable
> >>> timer is just another tunable bandaid with dubious value and the
> >>> potential of predictable bug/regression reports.
>
> Here I am naive in understanding the obvious disadvantages of 'one more
> knob', and hence was inclined towards the deferrable timer knob. Thomas,
> could you explain what kind of bug/regression you foresee with such an
> approach?
>
> --
> Chintan Pandya
>
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
> member of the Code Aurora Forum, hosted by The Linux Foundation
On Tue, Sep 09, 2014 at 01:14:50PM -0700, Hugh Dickins wrote:
> On Mon, 8 Sep 2014, Peter Zijlstra wrote:
> > >  		switch_mm(oldmm, mm, next);
> > > +		wake_ksm = ksm_switch(mm);
> >
> > Is this the right mm?
>
> It's next->mm, that's the one I intended (though the patch might
> be equally workable using prev->mm instead: given free rein, I'd
> have opted for hooking into both prev and next, but free rein is
> definitely not what should be granted around here!).
>
> > We've just switched the stack,
>
> I thought that came in switch_to() a few lines further down,
> but don't think it matters for this.

Ah, yes. Got my task and mm separation messed up.

> > so we're looking at next->mm when we switched away from current.
> > That might not exist anymore.
>
> I fail to see how that can be. Looking at the x86 switch_mm(),
> I can see it referencing (unsurprisingly!) both old and new mms
> at this point, and no reference to an mm is dropped before the
> ksm_switch(). oldmm (there called mm) is mmdropped later in
> finish_task_switch().

Well, see the above confusion about switch_mm vs switch_to :-/

So if this were switch_to(), we'd see next->mm as before the last
context switch. And since that switch fully happened, it would also
already have done the finish_task_switch() -> mmdrop().
On Tue, Sep 09, 2014 at 01:14:50PM -0700, Hugh Dickins wrote:
> > Quite horrible for sure. I really hate seeing KSM cruft all the way down
>
> Yes, I expected that, and I would certainly feel the same way.
>
> And even worse, imagine if this were successful, we might come along
> and ask to do something similar for khugepaged. Though if it comes to
> that, I'm sure we would generalize into one hook which does not say
> "ksm" or "khugepaged" on it, but would still present a single unlikely
> flag to be tested at this level. Maybe you would even prefer the
> generalized version, but I don't want to complicate the prototype yet.
>
> If it weren't for the "we already have the mm cachelines here" argument,
> I by now feel fairly sure that I would be going for hooking into timer
> tick instead (where Thomas could then express his horror!).
>
> Do you think I should just forget about cacheline micro-optimizations
> and go in that direction instead?

Not really either I'm afraid. Slimming down the tick has been on my
todo list like forever. There's a lot of low hanging fruit there.

> > here. Can't we create a new (timer) infrastructure that does the right
> > thing? Surely this isn't the only such case.
>
> A sleep-walking timer, that goes to sleep in one bed, but may wake in
> another; and defers while beds are empty? I'd be happy to try using
> that for KSM if it already existed, and no doubt Chintan would too.
>
> But I don't think KSM presents a very good case for developing it.
> I think KSM's use of a sleep_millisecs timer is really just an apology
> for the amount of often wasted work that it does, and dates from before
> we niced it down to 5. I prefer the idea of a KSM which waits on activity
> amongst the restricted set of tasks it is tracking: as this patch tries.
>
> But my preference may be naive: doing lots of unnecessary work doesn't
> matter as much as waking cpus from deep sleep.

Ah yes, I see your point. So far the stated goal has been to not unduly
wake CPUs and waste energy, but your goal is better still. And it would
indeed not be met by a globally deferred timer.

Does it make sense to drive both KSM and khugepaged the same way we
drive the NUMA scanning? It has the benefit of getting rid of these
threads, which pushes the work into the right accountable context (the
task the scanning is done for) and makes the scanning frequency depend
on the actual task activity.
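[Editorial note: for readers who have not followed the NUMA-balancing code,
what Peter is suggesting would look very roughly like the sketch below,
modelled on the task_tick_numa() -> task_work_add() -> task_numa_work()
pattern in kernel/sched/fair.c. Every ksm_* name here is hypothetical
(there is no such field or function); it only illustrates driving the
scanning from the task's own context instead of from ksmd.]

static void task_ksm_work(struct callback_head *work);

/* Hypothetical tick hook, in the style of task_tick_numa() (sketch only). */
static void task_tick_ksm(struct task_struct *curr)
{
	struct mm_struct *mm = curr->mm;

	if (!mm || !test_bit(MMF_VM_MERGEABLE, &mm->flags))
		return;

	/* ksm_next_scan and ksm_work are hypothetical fields, assumed
	 * zero-initialized at fork; func doubles as the "armed" marker. */
	if (time_after(jiffies, mm->ksm_next_scan) && !curr->ksm_work.func) {
		init_task_work(&curr->ksm_work, task_ksm_work);
		/* Runs when this task next returns to userspace. */
		task_work_add(curr, &curr->ksm_work, true);
	}
}

/*
 * Runs in the task's own context, so the cost is accounted to the task
 * whose pages are being scanned, and no scanning happens at all while
 * none of the mergeable tasks run.
 */
static void task_ksm_work(struct callback_head *work)
{
	struct task_struct *p = container_of(work, struct task_struct, ksm_work);
	struct mm_struct *mm = p->mm;

	work->func = NULL;	/* let the tick re-arm us later */
	mm->ksm_next_scan = jiffies +
		msecs_to_jiffies(ksm_thread_sleep_millisecs);

	/*
	 * Scan a bounded chunk of this mm's mergeable VMAs. The hard part,
	 * as Hugh notes below, is that KSM's stable/unstable trees span
	 * unrelated mms, so this work cannot look at one mm in isolation
	 * without heavier locking.
	 */
	ksm_scan_mm_chunk(mm, ksm_thread_pages_to_scan);	/* hypothetical */
}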
I don't mean to divert the thread too much, but just one suggestion
offered by Harshad:

Why can't we stop invoking the KSM scanner thread once the savings have
saturated? But again, to check whether savings have saturated or not, we
may still want to rely upon timers, and we have to wake the CPUs up from
the idle state.

>> here. Can't we create a new (timer) infrastructure that does the right
>> thing? Surely this isn't the only such case.
>
> A sleep-walking timer, that goes to sleep in one bed, but may wake in
> another; and defers while beds are empty? I'd be happy to try using
> that for KSM if it already existed, and no doubt Chintan would too.

This is interesting for sure :)

> But I don't think KSM presents a very good case for developing it.
> I think KSM's use of a sleep_millisecs timer is really just an apology
> for the amount of often wasted work that it does, and dates from before
> we niced it down to 5. I prefer the idea of a KSM which waits on activity
> amongst the restricted set of tasks it is tracking: as this patch tries.
>
> But my preference may be naive: doing lots of unnecessary work doesn't
> matter as much as waking cpus from deep sleep.

This is exactly the preference we are looking for. But yes, it cannot be
generalized for all.

>> I know both RCU and some NOHZ_FULL muck already track when the system is
>> completely idle. This is yet another case of that.
>
> Hugh
On Wed, 10 Sep 2014, Peter Zijlstra wrote:
>
> Does it make sense to drive both KSM and khugepaged the same way we
> drive the NUMA scanning? It has the benefit of getting rid of these
> threads, which pushes the work into the right accountable context (the
> task the scanning is done for) and makes the scanning frequency depend
> on the actual task activity.

I expect it would be possible: but more work than I'd ever find time to
complete myself, with uncertain benefit.

khugepaged would probably be easier to convert, since it is dealing with
independent mms anyway. Whereas ksmd is establishing sharing between
unrelated mms, so cannot deal with single mms in isolation. But what's
done by a single daemon today could be passed from task to task under
mutex instead; with probably very different handling of KSM's "unstable"
tree (at present the old one is forgotten at the start of each cycle,
and the new one rebuilt from scratch: I expect that would have to
change, to removing rb entries one by one). How well it would work out,
I'm not confident to say.

And I think we shall need an answer to the power question sooner than we
can turn the design of KSM on its head. Vendors will go with what works
for them, never mind what our principles dictate.

Your suggestion of following the NUMA scanning did make me wonder if I
could use task_work: if there were already a re-arming task_work, I
could probably use that, and escape your gaze :) But I don't think it
exists at present, and I don't think it's an extension that would be
welcomed, and I don't think it would present an efficient solution.

The most satisfying solution aesthetically would be for KSM to write
protect the VM_MERGEABLE areas at some stage (when they "approach
stability", whatever I mean by that), and let itself be woken by the
faults (and if there are no write faults on any of the areas, then there
is no need for it to be awoken). But I think that all those new faults
would pose a very significant regression in performance.

I don't have a good idea of where else to hook in at present.

Hugh
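[Editorial note: as a concrete illustration of the "re-arming task_work"
Hugh is wishing for -- nothing like this exists in 3.17, and the name is
invented -- a callback that re-queues itself each time it runs would fire
once per return to userspace for as long as the task keeps running, which
is also why it would not be an efficient solution here.]

/* Hypothetical self-re-arming task_work (sketch, not an existing facility). */
static void ksm_rearming_work(struct callback_head *work)
{
	struct task_struct *p = current;

	/* ... poke KSM about p->mm here ... */

	/*
	 * Re-queue ourselves so we run again on the next return to
	 * userspace; task_work_add() fails once the task is exiting.
	 * Firing on every single return to userspace is exactly the
	 * inefficiency Hugh mentions above.
	 */
	if (!(p->flags & PF_EXITING))
		task_work_add(p, work, true);
}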
On Thu, 11 Sep 2014, Chintan Pandya wrote:
> I don't mean to divert the thread too much, but just one suggestion
> offered by Harshad:
>
> Why can't we stop invoking the KSM scanner thread once the savings have
> saturated? But again, to check whether savings have saturated or not, we
> may still want to rely upon timers, and we have to wake the CPUs up from
> the idle state.

I agree that it would make sense for KSM to slow down when it sees it's
making no progress (though that would depart from the pages_to_scan and
sleep_millisecs prescription - perhaps it could be tied to
sleep_millisecs 0). But not stop.

That's the problem we're mainly concerned with here: to save power we
need it to stop, but then how to wake up, without putting nasty hooks in
hot paths for a minority interest? I don't see an answer to that above.

Hugh
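[Editorial note: to make the "slow down, but not stop" idea concrete, a
minimal sketch follows. This is not Hugh's patch; the *_backoff names are
invented, while ksm_pages_shared and ksm_thread_sleep_millisecs are the
existing mm/ksm.c counter and knob. Note it still wakes a CPU periodically,
which is exactly the part it cannot solve.]

/* Stretch the effective sleep period while a full scan cycle adds nothing
 * to ksm_pages_shared, and snap back to the configured value once it does. */
static unsigned long ksm_backoff_shift;
static unsigned long ksm_pages_shared_last;

static unsigned int ksm_effective_sleep_millisecs(void)
{
	return ksm_thread_sleep_millisecs << ksm_backoff_shift;
}

/* Call once per completed scan cycle (e.g. where ksm_scan.seqnr advances). */
static void ksm_update_backoff(void)
{
	if (ksm_pages_shared > ksm_pages_shared_last)
		ksm_backoff_shift = 0;		/* progress: full speed */
	else if (ksm_backoff_shift < 6)
		ksm_backoff_shift++;		/* no progress: back off, capped at 64x */

	ksm_pages_shared_last = ksm_pages_shared;
}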
--- 3.17-rc4/include/linux/ksm.h	2014-03-30 20:40:15.000000000 -0700
+++ linux/include/linux/ksm.h	2014-09-07 11:54:41.528003316 -0700
@@ -12,6 +12,7 @@
 #include <linux/pagemap.h>
 #include <linux/rmap.h>
 #include <linux/sched.h>
+#include <linux/wait.h>
 
 struct stable_node;
 struct mem_cgroup;
@@ -21,6 +22,7 @@ int ksm_madvise(struct vm_area_struct *v
 		unsigned long end, int advice, unsigned long *vm_flags);
 int __ksm_enter(struct mm_struct *mm);
 void __ksm_exit(struct mm_struct *mm);
+wait_queue_head_t *__ksm_switch(struct mm_struct *mm);
 
 static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
 {
@@ -35,6 +37,13 @@ static inline void ksm_exit(struct mm_st
 		__ksm_exit(mm);
 }
 
+static inline wait_queue_head_t *ksm_switch(struct mm_struct *mm)
+{
+	if (unlikely(test_bit(MMF_SWITCH_TO_KSM, &mm->flags)))
+		return __ksm_switch(mm);
+	return NULL;
+}
+
 /*
  * A KSM page is one of those write-protected "shared pages" or "merged pages"
  * which KSM maps into multiple mms, wherever identical anonymous page content
@@ -87,6 +96,11 @@ static inline void ksm_exit(struct mm_st
 {
 }
 
+static inline wait_queue_head_t *ksm_switch(struct mm_struct *mm)
+{
+	return NULL;
+}
+
 static inline int PageKsm(struct page *page)
 {
 	return 0;
--- 3.17-rc4/include/linux/sched.h	2014-08-16 16:00:53.909189060 -0700
+++ linux/include/linux/sched.h	2014-09-07 11:54:41.528003316 -0700
@@ -453,6 +453,7 @@ static inline int get_dumpable(struct mm
 
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
+#define MMF_SWITCH_TO_KSM	21	/* notify KSM of switch to this mm */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
--- 3.17-rc4/kernel/sched/core.c	2014-08-16 16:00:54.062189063 -0700
+++ linux/kernel/sched/core.c	2014-09-07 11:54:41.528003316 -0700
@@ -61,6 +61,7 @@
 #include <linux/times.h>
 #include <linux/tsacct_kern.h>
 #include <linux/kprobes.h>
+#include <linux/ksm.h>
 #include <linux/delayacct.h>
 #include <linux/unistd.h>
 #include <linux/pagemap.h>
@@ -2304,6 +2305,7 @@ context_switch(struct rq *rq, struct tas
 	       struct task_struct *next)
 {
 	struct mm_struct *mm, *oldmm;
+	wait_queue_head_t *wake_ksm = NULL;
 
 	prepare_task_switch(rq, prev, next);
 
@@ -2320,8 +2322,10 @@ context_switch(struct rq *rq, struct tas
 		next->active_mm = oldmm;
 		atomic_inc(&oldmm->mm_count);
 		enter_lazy_tlb(oldmm, next);
-	} else
+	} else {
 		switch_mm(oldmm, mm, next);
+		wake_ksm = ksm_switch(mm);
+	}
 
 	if (!prev->mm) {
 		prev->active_mm = NULL;
@@ -2348,6 +2352,9 @@ context_switch(struct rq *rq, struct tas
 	 * frame will be invalid.
 	 */
 	finish_task_switch(this_rq(), prev);
+
+	if (wake_ksm)
+		wake_up_interruptible(wake_ksm);
 }
 
 /*
--- 3.17-rc4/mm/ksm.c	2014-08-16 16:00:54.132189065 -0700
+++ linux/mm/ksm.c	2014-09-07 11:54:41.528003316 -0700
@@ -205,6 +205,9 @@ static struct kmem_cache *rmap_item_cach
 static struct kmem_cache *stable_node_cache;
 static struct kmem_cache *mm_slot_cache;
 
+/* The number of mergeable mms which have recently run */
+static atomic_t active_mergeable_mms = ATOMIC_INIT(0);
+
 /* The number of nodes in the stable tree */
 static unsigned long ksm_pages_shared;
 
@@ -313,9 +316,13 @@ static inline struct mm_slot *alloc_mm_s
 	return kmem_cache_zalloc(mm_slot_cache, GFP_KERNEL);
 }
 
-static inline void free_mm_slot(struct mm_slot *mm_slot)
+static void free_mm_slot(struct mm_struct *mm, struct mm_slot *mm_slot)
 {
 	kmem_cache_free(mm_slot_cache, mm_slot);
+
+	clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+	if (!test_and_clear_bit(MMF_SWITCH_TO_KSM, &mm->flags))
+		atomic_dec(&active_mergeable_mms);
 }
 
 static struct mm_slot *get_mm_slot(struct mm_struct *mm)
@@ -801,8 +808,7 @@ static int unmerge_and_remove_all_rmap_i
 		list_del(&mm_slot->mm_list);
 		spin_unlock(&ksm_mmlist_lock);
 
-		free_mm_slot(mm_slot);
-		clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+		free_mm_slot(mm, mm_slot);
 		up_read(&mm->mmap_sem);
 		mmdrop(mm);
 	} else {
@@ -1668,12 +1674,20 @@ next_mm:
 		list_del(&slot->mm_list);
 		spin_unlock(&ksm_mmlist_lock);
 
-		free_mm_slot(slot);
-		clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+		free_mm_slot(mm, slot);
 		up_read(&mm->mmap_sem);
 		mmdrop(mm);
 	} else {
 		spin_unlock(&ksm_mmlist_lock);
+		/*
+		 * After completing its scan, assume this mm to be inactive,
+		 * but set a flag for context_switch() to notify us as soon
+		 * as it is used again: see ksm_switch(). If the number of
+		 * active_mergeable_mms goes down to zero, ksmd will sleep
+		 * to save power, until awoken by mergeable context_switch().
+		 */
+		if (!test_and_set_bit(MMF_SWITCH_TO_KSM, &mm->flags))
+			atomic_dec(&active_mergeable_mms);
 		up_read(&mm->mmap_sem);
 	}
 
@@ -1707,7 +1721,7 @@ static void ksm_do_scan(unsigned int sca
 
 static int ksmd_should_run(void)
 {
-	return (ksm_run & KSM_RUN_MERGE) && !list_empty(&ksm_mm_head.mm_list);
+	return (ksm_run & KSM_RUN_MERGE) && atomic_read(&active_mergeable_mms);
 }
 
 static int ksm_scan_thread(void *nothing)
@@ -1785,15 +1799,11 @@ int ksm_madvise(struct vm_area_struct *v
 int __ksm_enter(struct mm_struct *mm)
 {
 	struct mm_slot *mm_slot;
-	int needs_wakeup;
 
 	mm_slot = alloc_mm_slot();
 	if (!mm_slot)
 		return -ENOMEM;
 
-	/* Check ksm_run too?  Would need tighter locking */
-	needs_wakeup = list_empty(&ksm_mm_head.mm_list);
-
 	spin_lock(&ksm_mmlist_lock);
 	insert_to_mm_slots_hash(mm, mm_slot);
 	/*
@@ -1812,10 +1822,9 @@ int __ksm_enter(struct mm_struct *mm)
 		list_add_tail(&mm_slot->mm_list, &ksm_scan.mm_slot->mm_list);
 	spin_unlock(&ksm_mmlist_lock);
 
-	set_bit(MMF_VM_MERGEABLE, &mm->flags);
 	atomic_inc(&mm->mm_count);
-
-	if (needs_wakeup)
+	set_bit(MMF_VM_MERGEABLE, &mm->flags);
+	if (atomic_inc_return(&active_mergeable_mms) == 1)
 		wake_up_interruptible(&ksm_thread_wait);
 
 	return 0;
@@ -1850,8 +1859,7 @@ void __ksm_exit(struct mm_struct *mm)
 	spin_unlock(&ksm_mmlist_lock);
 
 	if (easy_to_free) {
-		free_mm_slot(mm_slot);
-		clear_bit(MMF_VM_MERGEABLE, &mm->flags);
+		free_mm_slot(mm, mm_slot);
 		mmdrop(mm);
 	} else if (mm_slot) {
 		down_write(&mm->mmap_sem);
@@ -1859,6 +1867,18 @@ void __ksm_exit(struct mm_struct *mm)
 	}
 }
 
+wait_queue_head_t *__ksm_switch(struct mm_struct *mm)
+{
+	/*
+	 * Called by context_switch() to a hitherto inactive mergeable mm:
+	 * scheduler locks forbid immediate wakeup so leave that to caller.
+	 */
+	if (test_and_clear_bit(MMF_SWITCH_TO_KSM, &mm->flags) &&
+	    atomic_inc_return(&active_mergeable_mms) == 1)
+		return &ksm_thread_wait;
+	return NULL;
+}
+
 struct page *ksm_might_need_to_copy(struct page *page,
 			struct vm_area_struct *vma, unsigned long address)
 {
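[Editorial note: a rough summary of how the patch's pieces cooperate with
the existing ksmd main loop, for readers not steeped in mm/ksm.c. The loop
below is simplified from 3.17's ksm_scan_thread(); freezer, nice level and
memory-hotplug details are omitted.]

/*
 * __ksm_enter()    - an mm becomes mergeable: bump active_mergeable_mms and
 *                    wake ksmd if it was the first one.
 * scan completion  - ksmd finished a pass over an mm without freeing it:
 *                    set MMF_SWITCH_TO_KSM and drop active_mergeable_mms,
 *                    so an idle mm no longer keeps ksmd awake.
 * context_switch() - switching to an mm with MMF_SWITCH_TO_KSM set calls
 *                    __ksm_switch(): clear the flag, bump the count, and
 *                    return &ksm_thread_wait so the caller can wake ksmd
 *                    once the scheduler locks have been released.
 *
 * The existing main loop in mm/ksm.c then (roughly) does:
 */
while (!kthread_should_stop()) {
	mutex_lock(&ksm_thread_mutex);
	if (ksmd_should_run())
		ksm_do_scan(ksm_thread_pages_to_scan);
	mutex_unlock(&ksm_thread_mutex);

	if (ksmd_should_run())
		schedule_timeout_interruptible(
			msecs_to_jiffies(ksm_thread_sleep_millisecs));
	else
		wait_event_freezable(ksm_thread_wait,
			ksmd_should_run() || kthread_should_stop());
}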