x86: kvm: Revert "remove sched notifier for cross-cpu migrations"

Message ID 20150323232151.GA12772@amt.cnet (mailing list archive)
State New, archived

Commit Message

Marcelo Tosatti March 23, 2015, 11:21 p.m. UTC
The following point:

    2. per-CPU pvclock time info is updated if the
       underlying CPU changes.

Is not true anymore since "KVM: x86: update pvclock area conditionally,
on cpu migration".

Add task migration notification back.

Problem noticed by Andy Lutomirski.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
CC: stable@kernel.org # 3.11+


Comments

Andy Lutomirski March 23, 2015, 11:30 p.m. UTC | #1
On Mon, Mar 23, 2015 at 4:21 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
> The following point:
>
>     2. per-CPU pvclock time info is updated if the
>        underlying CPU changes.
>
> Is not true anymore since "KVM: x86: update pvclock area conditionally,
> on cpu migration".
>
> Add task migration notification back.

IMO this is a pretty big hammer to use to work around what appears to
be a bug in the host, but I guess that's okay.

It's also unfortunate in another regard: it seems non-obvious to me
how to use this without reading the cpu number twice in the vdso.  On
the other hand, unless we have a global pvti, or at least a global
indication of TSC stability, I don't see how to do that even with the
host bug fixed.

Grumble.

On a more useful note, could you rename migrate_count to
migrate_from_count, since that's what it is?

--Andy
Radim Krčmář March 24, 2015, 3:34 p.m. UTC | #2
2015-03-23 20:21-0300, Marcelo Tosatti:
> The following point:
> 
>     2. per-CPU pvclock time info is updated if the
>        underlying CPU changes.
> 
> Is not true anymore since "KVM: x86: update pvclock area conditionally,
> on cpu migration".

I think that the revert doesn't fix point 2.:  "KVM: x86: update pvclock
[...]" changed the host to skip clock update on physical CPU change, but
guest's task migration notifier isn't tied to it at all.
(Guest can have all tasks pinned, so the revert changed nothing.)

> Add task migration notification back.
> 
> Problem noticed by Andy Lutomirski.

What is the problem?

Thanks.
Andy Lutomirski March 24, 2015, 10:33 p.m. UTC | #3
On Tue, Mar 24, 2015 at 8:34 AM, Radim Krčmář <rkrcmar@redhat.com> wrote:
> 2015-03-23 20:21-0300, Marcelo Tosatti:
>> The following point:
>>
>>     2. per-CPU pvclock time info is updated if the
>>        underlying CPU changes.
>>
>> Is not true anymore since "KVM: x86: update pvclock area conditionally,
>> on cpu migration".
>
> I think that the revert doesn't fix point 2.:  "KVM: x86: update pvclock
> [...]" changed the host to skip clock update on physical CPU change, but
> guest's task migration notifier isn't tied to it at all.
> (Guest can have all tasks pinned, so the revert changed nothing.)
>
>> Add task migration notification back.
>>
>> Problem noticed by Andy Lutomirski.
>
> What is the problem?

The kvmclock spec says that the host will increment a version field to
an odd number, then update stuff, then increment it to an even number.
The host is buggy and doesn't do this, and the result is observable
when one vcpu reads another vcpu's kvmclock data.

Since there's no good way for a guest kernel to keep its vdso from
reading a different vcpu's kvmclock data, this is a real corner-case
bug.  This patch allows the vdso to retry when this happens.  I don't
think it's a great solution, but it should mostly work.
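
For reference, the read side of the protocol described above is the
usual seqcount pattern; a minimal guest-side sketch (read_time_fields()
is an illustrative stand-in for the tsc/scale/offset math, not a real
kernel helper):

	static u64 pvclock_read(const struct pvclock_vcpu_time_info *pvti)
	{
		u32 version;
		u64 time;

		do {
			/* An odd version means the host is mid-update. */
			version = pvti->version;
			rmb();		/* read version before the data */
			time = read_time_fields(pvti);
			rmb();		/* read the data before re-checking */
		} while ((version & 1) || pvti->version != version);

		return time;
	}

The host bug is that version is not reliably odd while the fields are
being rewritten, so a loop like this can return torn data when it races
with an update of another vcpu's pvti.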

--Andy
Marcelo Tosatti March 24, 2015, 10:59 p.m. UTC | #4
On Tue, Mar 24, 2015 at 04:34:12PM +0100, Radim Krčmář wrote:
> 2015-03-23 20:21-0300, Marcelo Tosatti:
> > The following point:
> > 
> >     2. per-CPU pvclock time info is updated if the
> >        underlying CPU changes.
> > 
> > Is not true anymore since "KVM: x86: update pvclock area conditionally,
> > on cpu migration".
> 
> I think that the revert doesn't fix point 2.:  "KVM: x86: update pvclock
> [...]" changed the host to skip clock update on physical CPU change, but
> guest's task migration notifier isn't tied to it at all.

"per-CPU pvclock time info is updated if the underlying CPU changes"
is the same as
"always perform clock update on physical CPU change".

That was a requirement for the original patch, to drop migration
notifiers.

> (Guest can have all tasks pinned, so the revert changed nothing.)
> 
> > Add task migration notification back.
> > 
> > Problem noticed by Andy Lutomirski.
> 
> What is the problem?
> 
> Thanks.

The problem is this:

T1) guest thread1 on vcpu1.
T2) guest thread1 on vcpu2.
T3) guest thread1 on vcpu1.

Inside a pvclock read loop.

Since the hypervisor's writes to the pvclock area are not ordered,
you cannot rely on the version being updated _before_
the rest of the pvclock data.

(in the case above, the "has the physical cpu changed" check, inside the
guest's thread1, obviously fails).
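
Mapping that onto the vdso loop (labels match the timeline above; a
sketch of the interleaving, not new code):

	cpu = __getcpu() & VGETCPU_CPU_MASK;	/* T1: thread1 on vcpu1 */
	pvti = get_pvti(cpu);			/* vcpu1's time info */
						/* T2: migrated to vcpu2; the host
						 * may now rewrite vcpu1's pvti,
						 * and those writes are unordered */
	version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
						/* can observe an even, unchanged
						 * version around torn data */
						/* T3: migrated back to vcpu1 */
	cpu1 = __getcpu() & VGETCPU_CPU_MASK;	/* cpu == cpu1: the check passes */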




Radim Krčmář March 25, 2015, 11:09 a.m. UTC | #5
2015-03-24 19:59-0300, Marcelo Tosatti:
> On Tue, Mar 24, 2015 at 04:34:12PM +0100, Radim Krčmář wrote:
> > 2015-03-23 20:21-0300, Marcelo Tosatti:
> > > The following point:
> > > 
> > >     2. per-CPU pvclock time info is updated if the
> > >        underlying CPU changes.
> > > 
> > > Is not true anymore since "KVM: x86: update pvclock area conditionally,
> > > on cpu migration".
> > 
> > I think that the revert doesn't fix point 2.:  "KVM: x86: update pvclock
> > [...]" changed the host to skip clock update on physical CPU change, but
> > guest's task migration notifier isn't tied to it at all.
> 
> "per-CPU pvclock time info is updated if the underlying CPU changes"
> is the same as
> "always perform clock update on physical CPU change".
> 
> That was a requirement for the original patch, to drop migration
> notifiers.
> 
> > (Guest can have all tasks pinned, so the revert changed nothing.)
> > 
> > > Add task migration notification back.
> > > 
> > > Problem noticed by Andy Lutomirski.
> > 
> > What is the problem?
> > 
> > Thanks.
> 
> The problem is this:
> 
> T1) guest thread1 on vcpu1.
> T2) guest thread1 on vcpu2.
> T3) guest thread1 on vcpu1.
> 
> Inside a pvclock read loop.
> 
> Since the hypervisor's writes to the pvclock area are not ordered,
> you cannot rely on the version being updated _before_
> the rest of the pvclock data.
> 
> (in the case above, the "has the physical cpu changed" check, inside the
> guest's thread1, obviously fails).

Ah, thanks! So the "KVM: x86: update pvclock area conditionally [...]"
has nothing to do with it -- that really confused me.
Radim Krčmář March 25, 2015, 1:06 p.m. UTC | #6
2015-03-23 20:21-0300, Marcelo Tosatti:
> The following point:
> 
>     2. per-CPU pvclock time info is updated if the
>        underlying CPU changes.
> 
> Is not true anymore since "KVM: x86: update pvclock area conditionally,
> on cpu migration".
> 
> Add task migration notification back.
> 
> Problem noticed by Andy Lutomirski.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> CC: stable@kernel.org # 3.11+

Please improve the commit message.
"KVM: x86: update pvclock area conditionally [...]" was merged half a
year before the patch we are reverting and is completely unrelated to
the bug we are fixing now (the reverted patch was just wrong).

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>

> diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
> @@ -82,18 +82,15 @@ static notrace cycle_t vread_pvclock(int *mode)
>  	/*
> -	 * Note: hypervisor must guarantee that:
> -	 * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
> -	 * 2. that per-CPU pvclock time info is updated if the
> -	 *    underlying CPU changes.
> -	 * 3. that version is increased whenever underlying CPU
> -	 *    changes.
> -	 *
> +	 * When looping to get a consistent (time-info, tsc) pair, we
> +	 * also need to deal with the possibility we can switch vcpus,
> +	 * so make sure we always re-fetch time-info for the current vcpu.

(All points from the original comment need to hold -- it would be nicer
 to keep both.)
Radim Krčmář March 26, 2015, 8:59 p.m. UTC | #7
2015-03-23 20:21-0300, Marcelo Tosatti:
> 
> The following point:
> 
>     2. per-CPU pvclock time info is updated if the
>        underlying CPU changes.
> 
> Is not true anymore since "KVM: x86: update pvclock area conditionally,
> on cpu migration".
> 
> Add task migration notification back.
> 
> Problem noticed by Andy Lutomirski.
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> CC: stable@kernel.org # 3.11+

Revert contains a bug that got pointed out in the discussion:

> diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
>  	do {
>  		cpu = __getcpu() & VGETCPU_CPU_MASK;
>  
>  		pvti = get_pvti(cpu);

We can migrate to 'other cpu' here.

> +		migrate_count = pvti->migrate_count;
> +
>  		version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);

And migrate back to 'cpu' here.

rdtsc was executed on a different cpu, so pvti and tsc might not be in
sync, but migrate_count hasn't changed.

>  		cpu1 = __getcpu() & VGETCPU_CPU_MASK;

(Reading cpuid here is useless.)

>  	} while (unlikely(cpu != cpu1 ||
>  			  (pvti->pvti.version & 1) ||
> -			  pvti->pvti.version != version));
> +			  pvti->pvti.version != version ||
> +			  pvti->migrate_count != migrate_count));

We can work around the bug with:

  	cpu = __getcpu() & VGETCPU_CPU_MASK;
  	pvti = get_pvti(cpu);
  	migrate_count = pvti->migrate_count;
  	if (cpu != (__getcpu() & VGETCPU_CPU_MASK))
  		continue;
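
Folding that re-check into the loop could look like this (a sketch,
restructured as an open-coded loop so that 'continue' cannot evaluate
the exit condition with an uninitialized 'version'; per the note above,
the trailing cpu1 read is dropped as redundant -- this is not
necessarily the committed fix):

	for (;;) {
		cpu = __getcpu() & VGETCPU_CPU_MASK;
		pvti = get_pvti(cpu);
		migrate_count = pvti->migrate_count;

		/* We may no longer be on 'cpu': rdtsc below could then run
		 * on a different cpu than the pvti we hold, while 'cpu''s
		 * migrate_count would not change when we migrate back.
		 * Re-check; once this passes, any migration away from 'cpu'
		 * bumps its migrate_count and fails the final test. */
		if (cpu != (__getcpu() & VGETCPU_CPU_MASK))
			continue;

		version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);

		if (likely(!(pvti->pvti.version & 1) &&
			   pvti->pvti.version == version &&
			   pvti->migrate_count == migrate_count))
			break;
	}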
Marcelo Tosatti March 26, 2015, 10:22 p.m. UTC | #8
On Thu, Mar 26, 2015 at 09:59:24PM +0100, Radim Krčmář wrote:
> 2015-03-23 20:21-0300, Marcelo Tosatti:
> > 
> > The following point:
> > 
> >     2. per-CPU pvclock time info is updated if the
> >        underlying CPU changes.
> > 
> > Is not true anymore since "KVM: x86: update pvclock area conditionally,
> > on cpu migration".
> > 
> > Add task migration notification back.
> > 
> > Problem noticed by Andy Lutomirski.
> > 
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > CC: stable@kernel.org # 3.11+
> 
> Revert contains a bug that got pointed out in the discussion:
> 
> > diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
> >  	do {
> >  		cpu = __getcpu() & VGETCPU_CPU_MASK;
> >  
> >  		pvti = get_pvti(cpu);
> 
> We can migrate to 'other cpu' here.
> 
> > +		migrate_count = pvti->migrate_count;
> > +
> >  		version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
> 
> And migrate back to 'cpu' here.

Migrating back will increase pvti->migrate_count, right?

> rdtsc was executed on different cpu, so pvti and tsc might not be in
> sync, but migrate_count hasn't changed.
> 
> >  		cpu1 = __getcpu() & VGETCPU_CPU_MASK;
> 
> (Reading cpuid here is useless.)
> 
> >  	} while (unlikely(cpu != cpu1 ||
> >  			  (pvti->pvti.version & 1) ||
> > -			  pvti->pvti.version != version));
> > +			  pvti->pvti.version != version ||
> > +			  pvti->migrate_count != migrate_count));
> 
> We can work around the bug with:
> 
>   	cpu = __getcpu() & VGETCPU_CPU_MASK;
>   	pvti = get_pvti(cpu);
>   	migrate_count = pvti->migrate_count;
>   	if (cpu != (__getcpu() & VGETCPU_CPU_MASK))
>   		continue;
Andy Lutomirski March 26, 2015, 10:24 p.m. UTC | #9
On Thu, Mar 26, 2015 at 3:22 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> On Thu, Mar 26, 2015 at 09:59:24PM +0100, Radim Krčmář wrote:
>> 2015-03-23 20:21-0300, Marcelo Tosatti:
>> >
>> > The following point:
>> >
>> >     2. per-CPU pvclock time info is updated if the
>> >        underlying CPU changes.
>> >
>> > Is not true anymore since "KVM: x86: update pvclock area conditionally,
>> > on cpu migration".
>> >
>> > Add task migration notification back.
>> >
>> > Problem noticed by Andy Lutomirski.
>> >
>> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
>> > CC: stable@kernel.org # 3.11+
>>
>> Revert contains a bug that got pointed out in the discussion:
>>
>> > diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
>> >     do {
>> >             cpu = __getcpu() & VGETCPU_CPU_MASK;
>> >
>> >             pvti = get_pvti(cpu);
>>
>> We can migrate to 'other cpu' here.
>>
>> > +           migrate_count = pvti->migrate_count;
>> > +
>> >             version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
>>
>> And migrate back to 'cpu' here.
>
> Migrating back will increase pvti->migrate_count, right?

I thought it only increased the count when we migrated away.

--Andy

>
>> rdtsc was executed on a different cpu, so pvti and tsc might not be in
>> sync, but migrate_count hasn't changed.
>>
>> >             cpu1 = __getcpu() & VGETCPU_CPU_MASK;
>>
>> (Reading cpuid here is useless.)
>>
>> >     } while (unlikely(cpu != cpu1 ||
>> >                       (pvti->pvti.version & 1) ||
>> > -                     pvti->pvti.version != version));
>> > +                     pvti->pvti.version != version ||
>> > +                     pvti->migrate_count != migrate_count));
>>
>> We can work around the bug with:
>>
>>       cpu = __getcpu() & VGETCPU_CPU_MASK;
>>       pvti = get_pvti(cpu);
>>       migrate_count = pvti->migrate_count;
>>       if (cpu != (__getcpu() & VGETCPU_CPU_MASK))
>>               continue;
Marcelo Tosatti March 26, 2015, 10:40 p.m. UTC | #10
On Thu, Mar 26, 2015 at 03:24:10PM -0700, Andy Lutomirski wrote:
> On Thu, Mar 26, 2015 at 3:22 PM, Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > On Thu, Mar 26, 2015 at 09:59:24PM +0100, Radim Krčmář wrote:
> >> 2015-03-23 20:21-0300, Marcelo Tosatti:
> >> >
> >> > The following point:
> >> >
> >> >     2. per-CPU pvclock time info is updated if the
> >> >        underlying CPU changes.
> >> >
> >> > Is not true anymore since "KVM: x86: update pvclock area conditionally,
> >> > on cpu migration".
> >> >
> >> > Add task migration notification back.
> >> >
> >> > Problem noticed by Andy Lutomirski.
> >> >
> >> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> >> > CC: stable@kernel.org # 3.11+
> >>
> >> Revert contains a bug that got pointed out in the discussion:
> >>
> >> > diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
> >> >     do {
> >> >             cpu = __getcpu() & VGETCPU_CPU_MASK;
> >> >
> >> >             pvti = get_pvti(cpu);
> >>
> >> We can migrate to 'other cpu' here.
> >>
> >> > +           migrate_count = pvti->migrate_count;
> >> > +
> >> >             version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
> >>
> >> And migrate back to 'cpu' here.
> >
> > Migrating back will increase pvti->migrate_count, right?
> 
> I thought it only increased the count when we migrated away.

Right.
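
(The pvclock.c hunk below bears this out: the notifier indexes the
per-cpu area by the cpu the task is leaving, so only migrations away
bump the counter --

	pvti = pvclock_get_vsyscall_user_time_info(mn->from_cpu);

	/* this is NULL when pvclock vsyscall is not initialized */
	if (unlikely(pvti == NULL))
		return NOTIFY_DONE;

	pvti->migrate_count++;

hence Andy's suggestion in #1 to rename it migrate_from_count.)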

> --Andy
> 
> >
> >> rdtsc was executed on a different cpu, so pvti and tsc might not be in
> >> sync, but migrate_count hasn't changed.
> >>
> >> >             cpu1 = __getcpu() & VGETCPU_CPU_MASK;
> >>
> >> (Reading cpuid here is useless.)
> >>
> >> >     } while (unlikely(cpu != cpu1 ||
> >> >                       (pvti->pvti.version & 1) ||
> >> > -                     pvti->pvti.version != version));
> >> > +                     pvti->pvti.version != version ||
> >> > +                     pvti->migrate_count != migrate_count));
> >>
> >> We can work around the bug with:
> >>
> >>       cpu = __getcpu() & VGETCPU_CPU_MASK;
> >>       pvti = get_pvti(cpu);
> >>       migrate_count = pvti->migrate_count;
> >>       if (cpu != (__getcpu() & VGETCPU_CPU_MASK))
> >>               continue;

Looks good, please submit a fix.



Patch

diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index d6b078e..25b1cc0 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -95,6 +95,7 @@  unsigned __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src,
 
 struct pvclock_vsyscall_time_info {
 	struct pvclock_vcpu_time_info pvti;
+	u32 migrate_count;
 } __attribute__((__aligned__(SMP_CACHE_BYTES)));
 
 #define PVTI_SIZE sizeof(struct pvclock_vsyscall_time_info)
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 2f355d2..e5ecd20 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -141,7 +141,46 @@  void pvclock_read_wallclock(struct pvclock_wall_clock *wall_clock,
 	set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
 }
 
+static struct pvclock_vsyscall_time_info *pvclock_vdso_info;
+
+static struct pvclock_vsyscall_time_info *
+pvclock_get_vsyscall_user_time_info(int cpu)
+{
+	if (!pvclock_vdso_info) {
+		BUG();
+		return NULL;
+	}
+
+	return &pvclock_vdso_info[cpu];
+}
+
+struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu)
+{
+	return &pvclock_get_vsyscall_user_time_info(cpu)->pvti;
+}
+
 #ifdef CONFIG_X86_64
+static int pvclock_task_migrate(struct notifier_block *nb, unsigned long l,
+			        void *v)
+{
+	struct task_migration_notifier *mn = v;
+	struct pvclock_vsyscall_time_info *pvti;
+
+	pvti = pvclock_get_vsyscall_user_time_info(mn->from_cpu);
+
+	/* this is NULL when pvclock vsyscall is not initialized */
+	if (unlikely(pvti == NULL))
+		return NOTIFY_DONE;
+
+	pvti->migrate_count++;
+
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block pvclock_migrate = {
+	.notifier_call = pvclock_task_migrate,
+};
+
 /*
  * Initialize the generic pvclock vsyscall state.  This will allocate
  * a/some page(s) for the per-vcpu pvclock information, set up a
@@ -155,12 +194,17 @@  int __init pvclock_init_vsyscall(struct pvclock_vsyscall_time_info *i,
 
 	WARN_ON (size != PVCLOCK_VSYSCALL_NR_PAGES*PAGE_SIZE);
 
+	pvclock_vdso_info = i;
+
 	for (idx = 0; idx <= (PVCLOCK_FIXMAP_END-PVCLOCK_FIXMAP_BEGIN); idx++) {
 		__set_fixmap(PVCLOCK_FIXMAP_BEGIN + idx,
 			     __pa(i) + (idx*PAGE_SIZE),
 			     PAGE_KERNEL_VVAR);
 	}
 
+
+	register_task_migration_notifier(&pvclock_migrate);
+
 	return 0;
 }
 #endif
diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
index 9793322..3093376 100644
--- a/arch/x86/vdso/vclock_gettime.c
+++ b/arch/x86/vdso/vclock_gettime.c
@@ -82,18 +82,15 @@  static notrace cycle_t vread_pvclock(int *mode)
 	cycle_t ret;
 	u64 last;
 	u32 version;
+	u32 migrate_count;
 	u8 flags;
 	unsigned cpu, cpu1;
 
 
 	/*
-	 * Note: hypervisor must guarantee that:
-	 * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
-	 * 2. that per-CPU pvclock time info is updated if the
-	 *    underlying CPU changes.
-	 * 3. that version is increased whenever underlying CPU
-	 *    changes.
-	 *
+	 * When looping to get a consistent (time-info, tsc) pair, we
+	 * also need to deal with the possibility we can switch vcpus,
+	 * so make sure we always re-fetch time-info for the current vcpu.
 	 */
 	do {
 		cpu = __getcpu() & VGETCPU_CPU_MASK;
@@ -104,6 +101,8 @@  static notrace cycle_t vread_pvclock(int *mode)
 
 		pvti = get_pvti(cpu);
 
+		migrate_count = pvti->migrate_count;
+
 		version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
 
 		/*
@@ -115,7 +114,8 @@  static notrace cycle_t vread_pvclock(int *mode)
 		cpu1 = __getcpu() & VGETCPU_CPU_MASK;
 	} while (unlikely(cpu != cpu1 ||
 			  (pvti->pvti.version & 1) ||
-			  pvti->pvti.version != version));
+			  pvti->pvti.version != version ||
+			  pvti->migrate_count != migrate_count));
 
 	if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
 		*mode = VCLOCK_NONE;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d77432..be98910 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -176,6 +176,14 @@  extern void get_iowait_load(unsigned long *nr_waiters, unsigned long *load);
 extern void calc_global_load(unsigned long ticks);
 extern void update_cpu_load_nohz(void);
 
+/* Notifier for when a task gets migrated to a new CPU */
+struct task_migration_notifier {
+	struct task_struct *task;
+	int from_cpu;
+	int to_cpu;
+};
+extern void register_task_migration_notifier(struct notifier_block *n);
+
 extern unsigned long get_parent_ip(unsigned long addr);
 
 extern void dump_cpu_task(int cpu);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f0f831e..d0c4209 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -996,6 +996,13 @@  void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
 		rq_clock_skip_update(rq, true);
 }
 
+static ATOMIC_NOTIFIER_HEAD(task_migration_notifier);
+
+void register_task_migration_notifier(struct notifier_block *n)
+{
+	atomic_notifier_chain_register(&task_migration_notifier, n);
+}
+
 #ifdef CONFIG_SMP
 void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 {
@@ -1026,10 +1033,18 @@  void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	trace_sched_migrate_task(p, new_cpu);
 
 	if (task_cpu(p) != new_cpu) {
+		struct task_migration_notifier tmn;
+
 		if (p->sched_class->migrate_task_rq)
 			p->sched_class->migrate_task_rq(p, new_cpu);
 		p->se.nr_migrations++;
 		perf_sw_event_sched(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 0);
+
+		tmn.task = p;
+		tmn.from_cpu = task_cpu(p);
+		tmn.to_cpu = new_cpu;
+
+		atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn);
 	}
 
 	__set_task_cpu(p, new_cpu);