Message ID: 20161027212747.GA18147@amd (mailing list archive)
State:      New, archived
* Pavel Machek <pavel@ucw.cz> wrote:

> +static void rh_overflow(struct perf_event *event, struct perf_sample_data *data, struct pt_regs *regs)
> +{
> +        u64 *ts = this_cpu_ptr(&rh_timestamp); /* this is NMI context */
> +        u64 now = ktime_get_mono_fast_ns();
> +        s64 delta = now - *ts;
> +
> +        *ts = now;
> +
> +        /* FIXME msec per usec, reverse logic? */
> +        if (delta < 64 * NSEC_PER_MSEC)
> +                mdelay(56);
> +}

I'd suggest making the absolute delay sysctl tunable, because 'wait 56 msecs' is
very magic, and do we know it 100% that 56 msecs is what is needed
everywhere?

Plus I'd also suggest exposing an 'NMI rowhammer delay count' in
/proc/interrupts, to make it easier to debug this. (Perhaps only show the
line if the count is nonzero.)

Finally, could we please also add a sysctl and Kconfig that allows this
feature to be turned on/off, with the default bootup value determined by the
Kconfig value (i.e. by the distribution)? Similar to
CONFIG_SECURITY_SELINUX_BOOTPARAM_VALUE.

Thanks,

        Ingo

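[Editorial illustration, not part of the posted patch: a minimal sketch of what
the sysctl knobs Ingo suggests could look like. The names rh_enabled,
rh_delay_ms and the placement under the "kernel" sysctl directory are
assumptions made for this example; the boot default for rh_enabled would come
from the Kconfig option Ingo mentions.]

#include <linux/sysctl.h>

static int rh_enabled = 1;      /* boot default would come from the Kconfig option */
static int rh_delay_ms = 56;    /* replaces the hard-coded mdelay(56) */
static int rh_zero;
static int rh_one = 1;
static int rh_delay_max = 1000;

static struct ctl_table rh_sysctl_table[] = {
        {
                .procname       = "nohammer",
                .data           = &rh_enabled,
                .maxlen         = sizeof(int),
                .mode           = 0644,
                .proc_handler   = proc_dointvec_minmax,
                .extra1         = &rh_zero,
                .extra2         = &rh_one,
        },
        {
                .procname       = "nohammer_delay_ms",
                .data           = &rh_delay_ms,
                .maxlen         = sizeof(int),
                .mode           = 0644,
                .proc_handler   = proc_dointvec_minmax,
                .extra1         = &rh_zero,
                .extra2         = &rh_delay_max,
        },
        { }
};

static struct ctl_table_header *rh_sysctl_header;

/* called once from the module init path */
static int __init rh_sysctl_init(void)
{
        rh_sysctl_header = register_sysctl("kernel", rh_sysctl_table);
        return rh_sysctl_header ? 0 : -ENOMEM;
}

[The overflow handler would then test rh_enabled and call mdelay(rh_delay_ms)
instead of the hard-coded 56.]
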
On Fri 2016-10-28 09:07:01, Ingo Molnar wrote:
>
> * Pavel Machek <pavel@ucw.cz> wrote:
>
> > +static void rh_overflow(struct perf_event *event, struct perf_sample_data *data, struct pt_regs *regs)
> > +{
> > +        u64 *ts = this_cpu_ptr(&rh_timestamp); /* this is NMI context */
> > +        u64 now = ktime_get_mono_fast_ns();
> > +        s64 delta = now - *ts;
> > +
> > +        *ts = now;
> > +
> > +        /* FIXME msec per usec, reverse logic? */
> > +        if (delta < 64 * NSEC_PER_MSEC)
> > +                mdelay(56);
> > +}
>
> I'd suggest making the absolute delay sysctl tunable, because 'wait 56 msecs' is
> very magic, and do we know it 100% that 56 msecs is what is needed
> everywhere?

I agree this needs to be tunable (and with the other suggestions). But
this is actually not the most important tunable: the detection
threshold (rh_attr.sample_period) should be way more important.

And yes, this will all need to be tunable, somehow. But let's verify
that this works, first :-).

Thanks and best regards,

        Pavel

* Pavel Machek <pavel@ucw.cz> wrote:

> On Fri 2016-10-28 09:07:01, Ingo Molnar wrote:
> >
> > * Pavel Machek <pavel@ucw.cz> wrote:
> >
> > > +static void rh_overflow(struct perf_event *event, struct perf_sample_data *data, struct pt_regs *regs)
> > > +{
> > > +        u64 *ts = this_cpu_ptr(&rh_timestamp); /* this is NMI context */
> > > +        u64 now = ktime_get_mono_fast_ns();
> > > +        s64 delta = now - *ts;
> > > +
> > > +        *ts = now;
> > > +
> > > +        /* FIXME msec per usec, reverse logic? */
> > > +        if (delta < 64 * NSEC_PER_MSEC)
> > > +                mdelay(56);
> > > +}
> >
> > I'd suggest making the absolute delay sysctl tunable, because 'wait 56 msecs' is
> > very magic, and do we know it 100% that 56 msecs is what is needed
> > everywhere?
>
> I agree this needs to be tunable (and with the other suggestions). But
> this is actually not the most important tunable: the detection
> threshold (rh_attr.sample_period) should be way more important.
>
> And yes, this will all need to be tunable, somehow. But let's verify
> that this works, first :-).

Yeah.

Btw., a 56 msecs NMI delay is pretty brutal in terms of latencies - it might
result in a smoother system to detect 100,000 cache misses and do a
~5.6 msecs delay instead?

(Assuming the shorter threshold does not trigger too often, of course.)

With all the tunables and statistics it would be possible to enumerate how
frequently the protection mechanism kicks in during regular workloads.

Thanks,

        Ingo

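[Back-of-the-envelope arithmetic on the trade-off being discussed, ignoring the
64 ms window test in the posted handler: with the intended 1,000,000-miss
sample period and a 56 ms stall, a hammering CPU is capped at roughly
1,000,000 / 0.056 s ≈ 18 million misses/sec; detecting 100,000 misses and
stalling ~5.6 ms gives about the same long-run cap (100,000 / 0.0056 s ≈
18 million/sec), but the worst-case single NMI-induced stall drops from 56 ms
to 5.6 ms.]
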
On Fri, Oct 28, 2016 at 10:50:39AM +0200, Pavel Machek wrote:
> On Fri 2016-10-28 09:07:01, Ingo Molnar wrote:
> >
> > * Pavel Machek <pavel@ucw.cz> wrote:
> >
> > > +static void rh_overflow(struct perf_event *event, struct perf_sample_data *data, struct pt_regs *regs)
> > > +{
> > > +        u64 *ts = this_cpu_ptr(&rh_timestamp); /* this is NMI context */
> > > +        u64 now = ktime_get_mono_fast_ns();
> > > +        s64 delta = now - *ts;
> > > +
> > > +        *ts = now;
> > > +
> > > +        /* FIXME msec per usec, reverse logic? */
> > > +        if (delta < 64 * NSEC_PER_MSEC)
> > > +                mdelay(56);
> > > +}
> >
> > I'd suggest making the absolute delay sysctl tunable, because 'wait 56 msecs' is
> > very magic, and do we know it 100% that 56 msecs is what is needed
> > everywhere?
>
> I agree this needs to be tunable (and with the other suggestions). But
> this is actually not the most important tunable: the detection
> threshold (rh_attr.sample_period) should be way more important.

So being totally ignorant of the detail of how rowhammer abuses the DDR
thing, would it make sense to trigger more often and delay shorter? Or
is there some minimal delay required for things to settle or
something.

On 28 October 2016 at 11:04, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Oct 28, 2016 at 10:50:39AM +0200, Pavel Machek wrote:
>> On Fri 2016-10-28 09:07:01, Ingo Molnar wrote:
>> >
>> > * Pavel Machek <pavel@ucw.cz> wrote:
>> >
>> > > +static void rh_overflow(struct perf_event *event, struct perf_sample_data *data, struct pt_regs *regs)
>> > > +{
>> > > +        u64 *ts = this_cpu_ptr(&rh_timestamp); /* this is NMI context */
>> > > +        u64 now = ktime_get_mono_fast_ns();
>> > > +        s64 delta = now - *ts;
>> > > +
>> > > +        *ts = now;
>> > > +
>> > > +        /* FIXME msec per usec, reverse logic? */
>> > > +        if (delta < 64 * NSEC_PER_MSEC)
>> > > +                mdelay(56);
>> > > +}
>> >
>> > I'd suggest making the absolute delay sysctl tunable, because 'wait 56 msecs' is
>> > very magic, and do we know it 100% that 56 msecs is what is needed
>> > everywhere?
>>
>> I agree this needs to be tunable (and with the other suggestions). But
>> this is actually not the most important tunable: the detection
>> threshold (rh_attr.sample_period) should be way more important.
>
> So being totally ignorant of the detail of how rowhammer abuses the DDR
> thing, would it make sense to trigger more often and delay shorter? Or
> is there some minimal delay required for things to settle or
> something.

Would it make sense to sample the counter on context switch, do some
accounting on a per-task cache miss counter, and slow down just the
single task(s) with a too high cache miss rate? That way there's no
global slowdown (which I assume would be the case here). The task's
slice of CPU would have to be taken into account because otherwise you
could have multiple cooperating tasks that each escape the limit but
taken together go above it.


Vegard

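[Editorial illustration of the per-slice accounting Vegard describes, not an
implementation: the struct, the rh_max_miss_rate threshold and the hook point
are all hypothetical. A real version would need a per-task perf counter and a
scheduler hook (e.g. around context switch) feeding something like this:]

#include <linux/ktime.h>
#include <linux/math64.h>

/* hypothetical threshold: maximum tolerated misses per second of CPU time */
static u64 rh_max_miss_rate = 10 * 1000 * 1000;

/* hypothetical per-task state; a real version would hang off task_struct */
struct rh_task_state {
        u64 misses_at_slice_start;      /* per-task counter value when the slice began */
        u64 slice_start_ns;             /* timestamp when the slice began */
};

/*
 * Would be called from a scheduler hook when the task is switched out, with
 * the task's current cache-miss count and a timestamp. Returns true if the
 * task exceeded its per-slice miss-rate budget and should be slowed down
 * (e.g. delayed before being scheduled again).
 */
static bool rh_slice_over_budget(struct rh_task_state *st, u64 misses_now, u64 now_ns)
{
        u64 miss_delta = misses_now - st->misses_at_slice_start;
        u64 slice_ns = now_ns - st->slice_start_ns;
        u64 rate;

        st->misses_at_slice_start = misses_now;
        st->slice_start_ns = now_ns;

        if (!slice_ns)
                return false;

        /* misses per second of CPU time actually consumed by this task */
        rate = div64_u64(miss_delta * NSEC_PER_SEC, slice_ns);

        return rate > rh_max_miss_rate;
}
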
* Vegard Nossum <vegard.nossum@gmail.com> wrote:

> Would it make sense to sample the counter on context switch, do some
> accounting on a per-task cache miss counter, and slow down just the
> single task(s) with a too high cache miss rate? That way there's no
> global slowdown (which I assume would be the case here). The task's
> slice of CPU would have to be taken into account because otherwise you
> could have multiple cooperating tasks that each escape the limit but
> taken together go above it.

Attackers could work this around by splitting the rowhammer workload between
multiple threads/processes.

I.e. the problem is that the risk may come from any 'unprivileged user-space
code', where the rowhammer workload might be spread over multiple threads,
processes or even users.

Thanks,

        Ingo

On 28 October 2016 at 11:35, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Vegard Nossum <vegard.nossum@gmail.com> wrote:
>
>> Would it make sense to sample the counter on context switch, do some
>> accounting on a per-task cache miss counter, and slow down just the
>> single task(s) with a too high cache miss rate? That way there's no
>> global slowdown (which I assume would be the case here). The task's
>> slice of CPU would have to be taken into account because otherwise you
>> could have multiple cooperating tasks that each escape the limit but
>> taken together go above it.
>
> Attackers could work this around by splitting the rowhammer workload between
> multiple threads/processes.
>
> I.e. the problem is that the risk may come from any 'unprivileged user-space
> code', where the rowhammer workload might be spread over multiple threads,
> processes or even users.

That's why I emphasised the number of misses per CPU slice rather than just
the total number of misses. I assumed there must be at least one task
continuously hammering memory for a successful attack, in which case it
should be observable with as little as 1 slice of CPU (however long that is),
no?


Vegard

Hi,

I missed the original, so I've lost some context.

Has this been tested on a system vulnerable to rowhammer, and if so, was it
reliable in mitigating the issue?

Which particular attack codebase was it tested against?

On Thu, Oct 27, 2016 at 11:27:47PM +0200, Pavel Machek wrote:
> --- /dev/null
> +++ b/kernel/events/nohammer.c
> @@ -0,0 +1,66 @@
> +/*
> + * Thanks to Peter Zijlstra <peterz@infradead.org>.
> + */
> +
> +#include <linux/perf_event.h>
> +#include <linux/module.h>
> +#include <linux/delay.h>
> +
> +struct perf_event_attr rh_attr = {
> +        .type = PERF_TYPE_HARDWARE,
> +        .config = PERF_COUNT_HW_CACHE_MISSES,
> +        .size = sizeof(struct perf_event_attr),
> +        .pinned = 1,
> +        /* FIXME: it is 1000000 per cpu. */
> +        .sample_period = 500000,
> +};

I'm not sure that this is general enough to live in core code, because:

* there are existing ways around this (e.g. in the drammer case, using a
  non-cacheable mapping, which I don't believe would count as a cache miss).

  Given that, I'm very worried that this gives the false impression of
  protection in cases where a software workaround of this sort is
  insufficient or impossible.

* the precise semantics of performance counter events vary drastically
  across implementations. PERF_COUNT_HW_CACHE_MISSES might only map to one
  particular level of cache, and/or may not be implemented on all cores.

* on some implementations, it may be that the counters are not
  interchangeable, and for those this would take away
  PERF_COUNT_HW_CACHE_MISSES from existing users.

> +static DEFINE_PER_CPU(struct perf_event *, rh_event);
> +static DEFINE_PER_CPU(u64, rh_timestamp);
> +
> +static void rh_overflow(struct perf_event *event, struct perf_sample_data *data, struct pt_regs *regs)
> +{
> +        u64 *ts = this_cpu_ptr(&rh_timestamp); /* this is NMI context */
> +        u64 now = ktime_get_mono_fast_ns();
> +        s64 delta = now - *ts;
> +
> +        *ts = now;
> +
> +        /* FIXME msec per usec, reverse logic? */
> +        if (delta < 64 * NSEC_PER_MSEC)
> +                mdelay(56);
> +}

If I round-robin my attack across CPUs, how much does this help?

Thanks,
Mark.

On Fri, Oct 28, 2016 at 11:35:47AM +0200, Ingo Molnar wrote:
>
> * Vegard Nossum <vegard.nossum@gmail.com> wrote:
>
> > Would it make sense to sample the counter on context switch, do some
> > accounting on a per-task cache miss counter, and slow down just the
> > single task(s) with a too high cache miss rate? That way there's no
> > global slowdown (which I assume would be the case here). The task's
> > slice of CPU would have to be taken into account because otherwise you
> > could have multiple cooperating tasks that each escape the limit but
> > taken together go above it.
>
> Attackers could work this around by splitting the rowhammer workload between
> multiple threads/processes.

With the proposed approach, they could split across multiple CPUs instead,
no?

... or was that covered in a prior thread?

Thanks,
Mark.

Hi!

> > I agree this needs to be tunable (and with the other suggestions). But
> > this is actually not the most important tunable: the detection
> > threshold (rh_attr.sample_period) should be way more important.
>
> So being totally ignorant of the detail of how rowhammer abuses the DDR
> thing, would it make sense to trigger more often and delay shorter? Or
> is there some minimal delay required for things to settle or
> something.

We can trigger more often and delay shorter, but it will mean that the
protection will trigger with more false positives. I guess I'll play with
the constants to see how big the effect is.

BTW...

[ 6267.180092] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 63.501 msecs

but I'm doing mdelay(64). .5 msec is not a big difference, but...

Best regards,

        Pavel

Hi!

> > I agree this needs to be tunable (and with the other suggestions). But
> > this is actually not the most important tunable: the detection
> > threshold (rh_attr.sample_period) should be way more important.
> >
> > And yes, this will all need to be tunable, somehow. But let's verify
> > that this works, first :-).
>
> Yeah.
>
> Btw., a 56 msecs NMI delay is pretty brutal in terms of latencies - it might
> result in a smoother system to detect 100,000 cache misses and do a
> ~5.6 msecs delay instead?
>
> (Assuming the shorter threshold does not trigger too often, of
> course.)

Yeah, it is a brutal workaround for a nasty bug. The slowdown depends on the
maximum permitted utilization:

+/*
+ * Maximum permitted utilization of DRAM. Setting this to f will mean that
+ * when more than 1/f of maximum cache-miss performance is used, delay will
+ * be inserted, and will have similar effect on rowhammer as refreshing memory
+ * f times more often.
+ *
+ * Setting this to 8 should prevent the rowhammer attack.
+ */
+int dram_max_utilization_factor = 8;

|                               | no prot. | fact. 1 | fact. 2 | fact. 8 |
| linux-n900$ time ./mkit       | 1m35     | 1m47    | 2m07    | 6m37    |
| rowhammer-test (for 43200000) | 2.86     | 9.75    | 16.7307 | 59.3738 |

(With factor 1 and 2 cpu attacker, we don't guarantee any protection.)

Best regards,

        Pavel

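[Editorial guess at how the factor could plug into the posted overflow handler;
this is not Pavel's code, just one reading of the comment above, reusing
rh_timestamp and dram_max_utilization_factor from the quoted snippets. With a
64 ms window, factor 8 reproduces the original behaviour (8 ms of full-rate
misses, then a 56 ms stall), while factor 1 never stalls:]

static void rh_overflow(struct perf_event *event,
                        struct perf_sample_data *data,
                        struct pt_regs *regs)
{
        u64 *ts = this_cpu_ptr(&rh_timestamp); /* NMI context, as before */
        u64 now = ktime_get_mono_fast_ns();
        s64 delta = now - *ts;
        unsigned long window = 64 * NSEC_PER_MSEC;
        /* share of the window in which full-rate misses are tolerated */
        unsigned long budget = window / dram_max_utilization_factor;

        *ts = now;

        if (delta < window)
                mdelay((window - budget) / NSEC_PER_MSEC);
}
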
diff --git a/kernel/events/Makefile b/kernel/events/Makefile
index 2925188..130a185 100644
--- a/kernel/events/Makefile
+++ b/kernel/events/Makefile
@@ -2,7 +2,7 @@ ifdef CONFIG_FUNCTION_TRACER
 CFLAGS_REMOVE_core.o = $(CC_FLAGS_FTRACE)
 endif
 
-obj-y := core.o ring_buffer.o callchain.o
+obj-y := core.o ring_buffer.o callchain.o nohammer.o
 
 obj-$(CONFIG_HAVE_HW_BREAKPOINT) += hw_breakpoint.o
 obj-$(CONFIG_UPROBES) += uprobes.o
diff --git a/kernel/events/nohammer.c b/kernel/events/nohammer.c
new file mode 100644
index 0000000..01844d2
--- /dev/null
+++ b/kernel/events/nohammer.c
@@ -0,0 +1,66 @@
+/*
+ * Thanks to Peter Zijlstra <peterz@infradead.org>.
+ */
+
+#include <linux/perf_event.h>
+#include <linux/module.h>
+#include <linux/delay.h>
+
+struct perf_event_attr rh_attr = {
+        .type = PERF_TYPE_HARDWARE,
+        .config = PERF_COUNT_HW_CACHE_MISSES,
+        .size = sizeof(struct perf_event_attr),
+        .pinned = 1,
+        /* FIXME: it is 1000000 per cpu. */
+        .sample_period = 500000,
+};
+
+static DEFINE_PER_CPU(struct perf_event *, rh_event);
+static DEFINE_PER_CPU(u64, rh_timestamp);
+
+static void rh_overflow(struct perf_event *event, struct perf_sample_data *data, struct pt_regs *regs)
+{
+        u64 *ts = this_cpu_ptr(&rh_timestamp); /* this is NMI context */
+        u64 now = ktime_get_mono_fast_ns();
+        s64 delta = now - *ts;
+
+        *ts = now;
+
+        /* FIXME msec per usec, reverse logic? */
+        if (delta < 64 * NSEC_PER_MSEC)
+                mdelay(56);
+}
+
+static __init int my_module_init(void)
+{
+        int cpu;
+
+        /* XXX borken vs hotplug */
+
+        for_each_online_cpu(cpu) {
+                struct perf_event *event = per_cpu(rh_event, cpu);
+
+                event = perf_event_create_kernel_counter(&rh_attr, cpu, NULL, rh_overflow, NULL);
+                if (!event)
+                        pr_err("Not enough resources to initialize nohammer on cpu %d\n", cpu);
+                pr_info("Nohammer initialized on cpu %d\n", cpu);
+        }
+        return 0;
+}
+
+static __exit void my_module_exit(void)
+{
+        int cpu;
+
+        for_each_online_cpu(cpu) {
+                struct perf_event *event = per_cpu(rh_event, cpu);
+
+                if (event)
+                        perf_event_release_kernel(event);
+        }
+        return;
+}
+
+module_init(my_module_init);
+module_exit(my_module_exit);
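
[Editorial note on the init path above, offered as a sketch rather than a
patch: the event created by perf_event_create_kernel_counter() is never stored
back into the rh_event per-CPU variable, so my_module_exit() has nothing to
release, and that function reports failure via ERR_PTR(), not NULL. Something
along these lines would address both; the CPU-hotplug race flagged by the XXX
remains:]

#include <linux/err.h>

static __init int my_module_init(void)
{
        int cpu;

        for_each_online_cpu(cpu) {
                struct perf_event *event;

                event = perf_event_create_kernel_counter(&rh_attr, cpu, NULL,
                                                          rh_overflow, NULL);
                if (IS_ERR(event)) {
                        pr_err("nohammer: no counter for cpu %d: %ld\n",
                               cpu, PTR_ERR(event));
                        continue;
                }

                per_cpu(rh_event, cpu) = event; /* so the exit path can release it */
                pr_info("nohammer: initialized on cpu %d\n", cpu);
        }
        return 0;
}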