rowhammer protection [was Re: Getting interrupt every million cache misses]
diff mbox

Message ID 20161027212747.GA18147@amd
State New
Headers show

Commit Message

Pavel Machek Oct. 27, 2016, 9:27 p.m. UTC
Hi!

> >                 if (event)
> >                         perf_event_release_kernel(event);
> >         }
> > }
> 
> This is pretty cool. Are there workloads other than rowhammer that
> could trip this, and if so, how bad would this delay be for them?
> 
> At the very least, this could be behind a CONFIG for people that don't
> have a way to fix their RAM refresh timings, etc.

Yes, CONFIG_ is next.

Here's the patch, notice that I reversed the time handling logic -- it
should be correct now.

We can't tell cache misses on different addresses from cache misses on
same address (rowhammer), so this will have false positive. But so
far, my machine seems to work.

Unfortunately, I don't have machine suitable for testing nearby. Can
someone help with testing? [On the other hand... testing this is not
going to be easy. This will probably make problem way harder to
reproduce it in any case...]

I did run rowhammer, and yes, this did trigger and it was getting
delayed -- by factor of 2. That is slightly low -- delay should be
factor of 8 to get guarantees, if I understand things correctly.

Oh and NMI gets quite angry, but that was to be expected.

[  112.476009] perf: interrupt took too long (23660454 > 23654965),
lowering ker
nel.perf_event_max_sample_rate to 250
[  170.224007] INFO: NMI handler (perf_event_nmi_handler) took too
long to run: 55.844 msecs
[  191.872007] INFO: NMI handler (perf_event_nmi_handler) took too
long to run: 55.845 msecs

Best regards,
								Pavel

Comments

Ingo Molnar Oct. 28, 2016, 7:07 a.m. UTC | #1
* Pavel Machek <pavel@ucw.cz> wrote:

> +static void rh_overflow(struct perf_event *event, struct perf_sample_data *data, struct pt_regs *regs)
> +{
> +	u64 *ts = this_cpu_ptr(&rh_timestamp); /* this is NMI context */
> +	u64 now = ktime_get_mono_fast_ns();
> +	s64 delta = now - *ts;
> +
> +	*ts = now;
> +
> +	/* FIXME msec per usec, reverse logic? */
> +	if (delta < 64 * NSEC_PER_MSEC)
> +		mdelay(56);
> +}

I'd suggest making the absolute delay sysctl tunable, because 'wait 56 msecs' is 
very magic, and do we know it 100% that 56 msecs is what is needed everywhere?

Plus I'd also suggest exposing an 'NMI rowhammer delay count' in /proc/interrupts, 
to make it easier to debug this. (Perhaps only show the line if the count is 
nonzero.)

Finally, could we please also add a sysctl and Kconfig that allows this feature to 
be turned on/off, with the default bootup value determined by the Kconfig value 
(i.e. by the distribution)? Similar to CONFIG_SECURITY_SELINUX_BOOTPARAM_VALUE.

Thanks,

	Ingo
Pavel Machek Oct. 28, 2016, 8:50 a.m. UTC | #2
On Fri 2016-10-28 09:07:01, Ingo Molnar wrote:
> 
> * Pavel Machek <pavel@ucw.cz> wrote:
> 
> > +static void rh_overflow(struct perf_event *event, struct perf_sample_data *data, struct pt_regs *regs)
> > +{
> > +	u64 *ts = this_cpu_ptr(&rh_timestamp); /* this is NMI context */
> > +	u64 now = ktime_get_mono_fast_ns();
> > +	s64 delta = now - *ts;
> > +
> > +	*ts = now;
> > +
> > +	/* FIXME msec per usec, reverse logic? */
> > +	if (delta < 64 * NSEC_PER_MSEC)
> > +		mdelay(56);
> > +}
> 
> I'd suggest making the absolute delay sysctl tunable, because 'wait 56 msecs' is 
> very magic, and do we know it 100% that 56 msecs is what is needed
> everywhere?

I agree this needs to be tunable (and with the other suggestions). But
this is actually not the most important tunable: the detection
threshold (rh_attr.sample_period) should be way more important.

And yes, this will all need to be tunable, somehow. But lets verify
that this works, first :-).

Thanks and best regards,
								Pavel
Ingo Molnar Oct. 28, 2016, 8:59 a.m. UTC | #3
* Pavel Machek <pavel@ucw.cz> wrote:

> On Fri 2016-10-28 09:07:01, Ingo Molnar wrote:
> > 
> > * Pavel Machek <pavel@ucw.cz> wrote:
> > 
> > > +static void rh_overflow(struct perf_event *event, struct perf_sample_data *data, struct pt_regs *regs)
> > > +{
> > > +	u64 *ts = this_cpu_ptr(&rh_timestamp); /* this is NMI context */
> > > +	u64 now = ktime_get_mono_fast_ns();
> > > +	s64 delta = now - *ts;
> > > +
> > > +	*ts = now;
> > > +
> > > +	/* FIXME msec per usec, reverse logic? */
> > > +	if (delta < 64 * NSEC_PER_MSEC)
> > > +		mdelay(56);
> > > +}
> > 
> > I'd suggest making the absolute delay sysctl tunable, because 'wait 56 msecs' is 
> > very magic, and do we know it 100% that 56 msecs is what is needed
> > everywhere?
> 
> I agree this needs to be tunable (and with the other suggestions). But
> this is actually not the most important tunable: the detection
> threshold (rh_attr.sample_period) should be way more important.
> 
> And yes, this will all need to be tunable, somehow. But lets verify
> that this works, first :-).

Yeah.

Btw., a 56 NMI delay is pretty brutal in terms of latencies - it might
result in a smoother system to detect 100,000 cache misses and do a
~5.6 msecs delay instead?

(Assuming the shorter threshold does not trigger too often, of course.)

With all the tunables and statistics it would be possible to enumerate how 
frequently the protection mechanism kicks in during regular workloads.

Thanks,

	Ingo
Peter Zijlstra Oct. 28, 2016, 9:04 a.m. UTC | #4
On Fri, Oct 28, 2016 at 10:50:39AM +0200, Pavel Machek wrote:
> On Fri 2016-10-28 09:07:01, Ingo Molnar wrote:
> > 
> > * Pavel Machek <pavel@ucw.cz> wrote:
> > 
> > > +static void rh_overflow(struct perf_event *event, struct perf_sample_data *data, struct pt_regs *regs)
> > > +{
> > > +	u64 *ts = this_cpu_ptr(&rh_timestamp); /* this is NMI context */
> > > +	u64 now = ktime_get_mono_fast_ns();
> > > +	s64 delta = now - *ts;
> > > +
> > > +	*ts = now;
> > > +
> > > +	/* FIXME msec per usec, reverse logic? */
> > > +	if (delta < 64 * NSEC_PER_MSEC)
> > > +		mdelay(56);
> > > +}
> > 
> > I'd suggest making the absolute delay sysctl tunable, because 'wait 56 msecs' is 
> > very magic, and do we know it 100% that 56 msecs is what is needed
> > everywhere?
> 
> I agree this needs to be tunable (and with the other suggestions). But
> this is actually not the most important tunable: the detection
> threshold (rh_attr.sample_period) should be way more important.

So being totally ignorant of the detail of how rowhammer abuses the DDR
thing, would it make sense to trigger more often and delay shorter? Or
is there some minimal delay required for things to settle or something.
Vegard Nossum Oct. 28, 2016, 9:27 a.m. UTC | #5
On 28 October 2016 at 11:04, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Oct 28, 2016 at 10:50:39AM +0200, Pavel Machek wrote:
>> On Fri 2016-10-28 09:07:01, Ingo Molnar wrote:
>> >
>> > * Pavel Machek <pavel@ucw.cz> wrote:
>> >
>> > > +static void rh_overflow(struct perf_event *event, struct perf_sample_data *data, struct pt_regs *regs)
>> > > +{
>> > > + u64 *ts = this_cpu_ptr(&rh_timestamp); /* this is NMI context */
>> > > + u64 now = ktime_get_mono_fast_ns();
>> > > + s64 delta = now - *ts;
>> > > +
>> > > + *ts = now;
>> > > +
>> > > + /* FIXME msec per usec, reverse logic? */
>> > > + if (delta < 64 * NSEC_PER_MSEC)
>> > > +         mdelay(56);
>> > > +}
>> >
>> > I'd suggest making the absolute delay sysctl tunable, because 'wait 56 msecs' is
>> > very magic, and do we know it 100% that 56 msecs is what is needed
>> > everywhere?
>>
>> I agree this needs to be tunable (and with the other suggestions). But
>> this is actually not the most important tunable: the detection
>> threshold (rh_attr.sample_period) should be way more important.
>
> So being totally ignorant of the detail of how rowhammer abuses the DDR
> thing, would it make sense to trigger more often and delay shorter? Or
> is there some minimal delay required for things to settle or something.

Would it make sense to sample the counter on context switch, do some
accounting on a per-task cache miss counter, and slow down just the
single task(s) with a too high cache miss rate? That way there's no
global slowdown (which I assume would be the case here). The task's
slice of CPU would have to be taken into account because otherwise you
could have multiple cooperating tasks that each escape the limit but
taken together go above it.


Vegard
Ingo Molnar Oct. 28, 2016, 9:35 a.m. UTC | #6
* Vegard Nossum <vegard.nossum@gmail.com> wrote:

> Would it make sense to sample the counter on context switch, do some
> accounting on a per-task cache miss counter, and slow down just the
> single task(s) with a too high cache miss rate? That way there's no
> global slowdown (which I assume would be the case here). The task's
> slice of CPU would have to be taken into account because otherwise you
> could have multiple cooperating tasks that each escape the limit but
> taken together go above it.

Attackers could work this around by splitting the rowhammer workload between 
multiple threads/processes.

I.e. the problem is that the risk may come from any 'unprivileged user-space 
code', where the rowhammer workload might be spread over multiple threads, 
processes or even users.

Thanks,

	Ingo
Vegard Nossum Oct. 28, 2016, 9:47 a.m. UTC | #7
On 28 October 2016 at 11:35, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Vegard Nossum <vegard.nossum@gmail.com> wrote:
>
>> Would it make sense to sample the counter on context switch, do some
>> accounting on a per-task cache miss counter, and slow down just the
>> single task(s) with a too high cache miss rate? That way there's no
>> global slowdown (which I assume would be the case here). The task's
>> slice of CPU would have to be taken into account because otherwise you
>> could have multiple cooperating tasks that each escape the limit but
>> taken together go above it.
>
> Attackers could work this around by splitting the rowhammer workload between
> multiple threads/processes.
>
> I.e. the problem is that the risk may come from any 'unprivileged user-space
> code', where the rowhammer workload might be spread over multiple threads,
> processes or even users.

That's why I emphasised the number of misses per CPU slice rather than
just the total number of misses. I assumed there must be at least one
task continuously hammering memory for a successful attack, in which
case it should be observable with as little as 1 slice of CPU (however
long that is), no?


Vegard
Mark Rutland Oct. 28, 2016, 9:51 a.m. UTC | #8
Hi,

I missed the original, so I've lost some context.

Has this been tested on a system vulnerable to rowhammer, and if so, was
it reliable in mitigating the issue?

Which particular attack codebase was it tested against?

On Thu, Oct 27, 2016 at 11:27:47PM +0200, Pavel Machek wrote:
> --- /dev/null
> +++ b/kernel/events/nohammer.c
> @@ -0,0 +1,66 @@
> +/*
> + * Thanks to Peter Zijlstra <peterz@infradead.org>.
> + */
> +
> +#include <linux/perf_event.h>
> +#include <linux/module.h>
> +#include <linux/delay.h>
> +
> +struct perf_event_attr rh_attr = {
> +	.type	= PERF_TYPE_HARDWARE,
> +	.config = PERF_COUNT_HW_CACHE_MISSES,
> +	.size	= sizeof(struct perf_event_attr),
> +	.pinned	= 1,
> +	/* FIXME: it is 1000000 per cpu. */
> +	.sample_period = 500000,
> +};

I'm not sure that this is general enough to live in core code, because:

* there are existing ways around this (e.g. in the drammer case, using a
  non-cacheable mapping, which I don't believe would count as a cache
  miss).

  Given that, I'm very worried that this gives the false impression of
  protection in cases where a software workaround of this sort is
  insufficient or impossible.

* the precise semantics of performance counter events varies drastically
  across implementations. PERF_COUNT_HW_CACHE_MISSES, might only map to
  one particular level of cache, and/or may not be implemented on all
  cores.

* On some implementations, it may be that the counters are not
  interchangeable, and for those this would take away
  PERF_COUNT_HW_CACHE_MISSES from existing users.

> +static DEFINE_PER_CPU(struct perf_event *, rh_event);
> +static DEFINE_PER_CPU(u64, rh_timestamp);
> +
> +static void rh_overflow(struct perf_event *event, struct perf_sample_data *data, struct pt_regs *regs)
> +{
> +	u64 *ts = this_cpu_ptr(&rh_timestamp); /* this is NMI context */
> +	u64 now = ktime_get_mono_fast_ns();
> +	s64 delta = now - *ts;
> +
> +	*ts = now;
> +
> +	/* FIXME msec per usec, reverse logic? */
> +	if (delta < 64 * NSEC_PER_MSEC)
> +		mdelay(56);
> +}

If I round-robin my attack across CPUs, how much does this help?

Thanks,
Mark.
Mark Rutland Oct. 28, 2016, 9:53 a.m. UTC | #9
On Fri, Oct 28, 2016 at 11:35:47AM +0200, Ingo Molnar wrote:
> 
> * Vegard Nossum <vegard.nossum@gmail.com> wrote:
> 
> > Would it make sense to sample the counter on context switch, do some
> > accounting on a per-task cache miss counter, and slow down just the
> > single task(s) with a too high cache miss rate? That way there's no
> > global slowdown (which I assume would be the case here). The task's
> > slice of CPU would have to be taken into account because otherwise you
> > could have multiple cooperating tasks that each escape the limit but
> > taken together go above it.
> 
> Attackers could work this around by splitting the rowhammer workload between 
> multiple threads/processes.

With the proposed approach, they could split across multiple CPUs
instead, no?

... or was that covered in a prior thread?

Thanks,
Mark.
Pavel Machek Oct. 28, 2016, 11:27 a.m. UTC | #10
Hi!

> > I agree this needs to be tunable (and with the other suggestions). But
> > this is actually not the most important tunable: the detection
> > threshold (rh_attr.sample_period) should be way more important.
> 
> So being totally ignorant of the detail of how rowhammer abuses the DDR
> thing, would it make sense to trigger more often and delay shorter? Or
> is there some minimal delay required for things to settle or
 > something.

We can trigger more often and delay shorter, but it will mean that
protection will trigger with more false positives. I guess I'll play
with constants too see how big the effect is.

BTW...

[ 6267.180092] INFO: NMI handler (perf_event_nmi_handler) took too
long to run: 63.501 msecs

but I'm doing mdelay(64). .5 msec is not big difference, but...

Best regards,
									Pavel
Pavel Machek Oct. 28, 2016, 11:55 a.m. UTC | #11
Hi!

> > I agree this needs to be tunable (and with the other suggestions). But
> > this is actually not the most important tunable: the detection
> > threshold (rh_attr.sample_period) should be way more important.
> > 
> > And yes, this will all need to be tunable, somehow. But lets verify
> > that this works, first :-).
> 
> Yeah.
> 
> Btw., a 56 NMI delay is pretty brutal in terms of latencies - it might
> result in a smoother system to detect 100,000 cache misses and do a
> ~5.6 msecs delay instead?
> 
> (Assuming the shorter threshold does not trigger too often, of
> course.)

Yeah, it is brutal workaround for a nasty bug. Slowdown depends on maximum utilization:

+/*
+ * Maximum permitted utilization of DRAM. Setting this to f will mean that
+ * when more than 1/f of maximum cache-miss performance is used, delay will
+ * be inserted, and will have similar effect on rowhammer as refreshing memory
+ * f times more often.
+ *
+ * Setting this to 8 should prevent the rowhammer attack.
+ */
+       int dram_max_utilization_factor = 8;

|                               | no prot. | fact. 1 | fact. 2 | fact. 8 |
| linux-n900$ time ./mkit       | 1m35     | 1m47    | 2m07    | 6m37    |
| rowhammer-test (for 43200000) | 2.86     | 9.75    | 16.7307 | 59.3738 |

(With factor 1 and 2 cpu attacker, we don't guarantee any protection.)

Best regards,
									Pavel

Patch
diff mbox

diff --git a/kernel/events/Makefile b/kernel/events/Makefile
index 2925188..130a185 100644
--- a/kernel/events/Makefile
+++ b/kernel/events/Makefile
@@ -2,7 +2,7 @@  ifdef CONFIG_FUNCTION_TRACER
 CFLAGS_REMOVE_core.o = $(CC_FLAGS_FTRACE)
 endif
 
-obj-y := core.o ring_buffer.o callchain.o
+obj-y := core.o ring_buffer.o callchain.o nohammer.o
 
 obj-$(CONFIG_HAVE_HW_BREAKPOINT) += hw_breakpoint.o
 obj-$(CONFIG_UPROBES) += uprobes.o
diff --git a/kernel/events/nohammer.c b/kernel/events/nohammer.c
new file mode 100644
index 0000000..01844d2
--- /dev/null
+++ b/kernel/events/nohammer.c
@@ -0,0 +1,66 @@ 
+/*
+ * Thanks to Peter Zijlstra <peterz@infradead.org>.
+ */
+
+#include <linux/perf_event.h>
+#include <linux/module.h>
+#include <linux/delay.h>
+
+struct perf_event_attr rh_attr = {
+	.type	= PERF_TYPE_HARDWARE,
+	.config = PERF_COUNT_HW_CACHE_MISSES,
+	.size	= sizeof(struct perf_event_attr),
+	.pinned	= 1,
+	/* FIXME: it is 1000000 per cpu. */
+	.sample_period = 500000,
+};
+
+static DEFINE_PER_CPU(struct perf_event *, rh_event);
+static DEFINE_PER_CPU(u64, rh_timestamp);
+
+static void rh_overflow(struct perf_event *event, struct perf_sample_data *data, struct pt_regs *regs)
+{
+	u64 *ts = this_cpu_ptr(&rh_timestamp); /* this is NMI context */
+	u64 now = ktime_get_mono_fast_ns();
+	s64 delta = now - *ts;
+
+	*ts = now;
+
+	/* FIXME msec per usec, reverse logic? */
+	if (delta < 64 * NSEC_PER_MSEC)
+		mdelay(56);
+}
+
+static __init int my_module_init(void)
+{
+	int cpu;
+
+	/* XXX borken vs hotplug */
+
+	for_each_online_cpu(cpu) {
+		struct perf_event *event = per_cpu(rh_event, cpu);
+
+		event = perf_event_create_kernel_counter(&rh_attr, cpu, NULL, rh_overflow, NULL);
+		if (!event)
+			pr_err("Not enough resources to initialize nohammer on cpu %d\n", cpu);
+		pr_info("Nohammer initialized on cpu %d\n", cpu);
+		
+	}
+	return 0;
+}
+
+static __exit void my_module_exit(void)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		struct perf_event *event = per_cpu(rh_event, cpu);
+
+		if (event)
+			perf_event_release_kernel(event);
+	}
+	return;
+}
+
+module_init(my_module_init);
+module_exit(my_module_exit);