Message ID: 20210115130336.2520663-3-glider@google.com
State: New, archived
Series: Add sysfs interface to collect reports from debugging tools
Minor comments, if in the future, you really do want to mess around in sysfs: On Fri, Jan 15, 2021 at 02:03:33PM +0100, Alexander Potapenko wrote: > diff --git a/lib/error_report_notify.c b/lib/error_report_notify.c > new file mode 100644 > index 000000000000..66176cd94ba0 > --- /dev/null > +++ b/lib/error_report_notify.c > @@ -0,0 +1,278 @@ > +// SPDX-License-Identifier: GPL-2.0 > + No copyright notice for the file? While acceptable, odds are your corporate lawyers will not be happy with that :( > +/* > + * Userspace notification interface for debugging tools. > + * > + * Provide two sysfs files: > + * - /sys/kernel/error_report/last_report > + * - /sys/kernel/error_report/report_count > + * that contain the last debugging tool report (taken from dmesg, delimited by > + * the error_report_start/error_report_end tracing events) and the total report > + * count. > + */ > + > +#include <linux/atomic.h> > +#include <linux/fs.h> > +#include <linux/kobject.h> > +#include <linux/string.h> > +#include <linux/sysfs.h> > +#include <linux/tracepoint.h> > +#include <linux/workqueue.h> > +#include <trace/events/error_report.h> > +#include <trace/events/printk.h> > + > +static struct kobject *error_report_kobj; > + > +/* sysfs files are capped at PAGE_SIZE. */ > +#define BUF_SIZE PAGE_SIZE > +/* Two buffers to store the finished report and the report being recorded. */ > +static char report_buffer[2][BUF_SIZE]; > +/* > + * Total report count. Also serves as a latch for report_buffer: > + * report_buffer[num_reports % 2] is the currently available report, > + * report_buffer[(num_reports + 1) % 2] is the report being recorded. > + */ > +static atomic_t num_reports; > + > +/* > + * PID of the task currently recording the report, as returned by > + * get_encoded_pid(), or -1. Used as a writer lock for report_buffer. 
> + * A regular spinlock couldn't be used here, as probe_console() can be called > + * from any thread, and it needs to know whether that thread is holding the > + * lock. > + */ > +static atomic_t current_pid = ATOMIC_INIT(-1); how do you handle pid namespaces? > + > +static size_t current_pos; > +static bool truncated; > +static const char TRUNC_MSG[] = "<truncated>\n"; > + > +static struct delayed_work reporting_done; > + > +static void error_report_notify(struct work_struct *work) > +{ > + sysfs_notify(error_report_kobj, NULL, "last_report"); > + sysfs_notify(error_report_kobj, NULL, "report_count"); > +} > +static DECLARE_DELAYED_WORK(reporting_done, error_report_notify); > + > +/* > + * Return the current PID combined together with in_task(). This lets us > + * distinguish between normal task context and IRQ context. > + */ > +static int get_encoded_pid(void) > +{ > + return (current->pid << 1) | (!!in_task()); > +} > + > +/* > + * Trace hook for the error_report_start event. In an unlikely case of another > + * task already printing a report bail out, otherwise save the current pid > + * together with in_task() return value. > + * > + * Because reporting code can be called from low-level routines (e.g. locking > + * primitives or allocator guts), report recording is implemented using a > + * seqlock lock-free algorithm. > + */ > +static void probe_report_start(void *ignore, enum error_detector detector, > + unsigned long id) > +{ > + /* > + * Acquire the writer lock. Any racing probe_report_start will not > + * record anything. Pairs with the release in probe_report_end(). > + */ > + if (atomic_cmpxchg_acquire(&current_pid, -1, get_encoded_pid()) != -1) > + return; pid namespaces? > + current_pos = 0; > + truncated = false; > +} > + > +/* > + * Trace hook for the error_report_end event. If an event from the mismatching > + * error_report_start is received, it is ignored.
Otherwise, null-terminate the > + * buffer, increase the report count (effectively releasing the report to > + * last_report_show() and schedule a notification about a new report. > + */ > +static void probe_report_end(void *ignore, enum error_detector detector, > + unsigned long id) > +{ > + pid_t pid = atomic_read(&current_pid); pid namespaces? > + int idx; > + > + if (pid != get_encoded_pid()) > + return; > + > + idx = (atomic_read(&num_reports) + 1) % 2; You read, but it could change before: > + if (current_pos == BUF_SIZE) > + report_buffer[idx][current_pos - 1] = 0; > + else > + report_buffer[idx][current_pos] = 0; > + > + /* Pairs with acquire in last_report_show(). */ > + atomic_inc_return_release(&num_reports); Not good? > + schedule_delayed_work(&reporting_done, 0); > + /* > + * Release the writer lock. Pairs with the acquire in > + * probe_report_start(). > + */ > + atomic_set_release(&current_pid, -1); > +} > + > +/* > + * Skip one or two leading pair of brackets containing the log timestamp and > + * the task/CPU ID, plus the leading space, from the report line, e.g.: > + * [ 0.698431][ T7] BUG: KFENCE: use-after-free ... > + * becomes: > + * BUG: KFENCE: use-after-free ... > + * > + * Report size is only 4K, and this boilerplate can easily account for half of > + * that amount. > + */ > +static void skip_extra_info(const char **buf, size_t *len) > +{ > + int num_brackets = IS_ENABLED(CONFIG_PRINTK_TIME) + > + IS_ENABLED(CONFIG_PRINTK_CALLER); > + const char *found; > + > + if (!buf || !len) > + return; > + > + while (num_brackets--) { > + if (!*len || *buf[0] != '[') > + return; > + found = strnchr(*buf, *len, ']'); > + if (!found) > + return; > + *len -= found - *buf + 1; > + *buf = found + 1; > + } > + if (*len && *buf[0] == ' ') { > + ++*buf; > + --*len; > + } > +} > + > +/* > + * Trace hook for the console event. If a line comes from a task/CPU that did > + * not send the error_report_start event, that line is ignored.
Otherwise, it > + * is stored in the report_buffer[(num_reports + 1) % 2]. > + * > + * To save space, the leading timestamps and (when enabled) CPU/task info is > + * stripped away. The buffer may contain newlines, so this procedure is > + * repeated for every line. > + */ > +static void probe_console(void *ignore, const char *buf, size_t len) > +{ > + int pid = atomic_read(&current_pid); > + size_t to_copy, cur_len; > + char *newline; > + int idx; > + > + if (pid != get_encoded_pid() || truncated) > + return; > + > + idx = (atomic_read(&num_reports) + 1) % 2; > + while (len) { > + newline = strnchr(buf, len, '\n'); > + if (newline) > + cur_len = newline - buf + 1; > + else > + cur_len = len; > + /* Adjust len now, because skip_extra_info() may change cur_len. */ > + len -= cur_len; > + skip_extra_info(&buf, &cur_len); > + to_copy = min(cur_len, BUF_SIZE - current_pos); > + memcpy(report_buffer[idx] + current_pos, buf, to_copy); > + current_pos += to_copy; > + if (cur_len > to_copy) { > + truncated = true; > + memcpy(report_buffer[idx] + current_pos - sizeof(TRUNC_MSG), > + TRUNC_MSG, sizeof(TRUNC_MSG)); > + break; > + } > + buf += cur_len; > + } > +} > + > +static void register_tracepoints(void) > +{ > + register_trace_console(probe_console, NULL); > + register_trace_error_report_start(probe_report_start, NULL); > + register_trace_error_report_end(probe_report_end, NULL); > +} > + > +/* > + * read() handler for /sys/kernel/error_report/last_report. > + * Because the number of reports can change under our feet, check it again > + * after copying the report, and retry if the numbers mismatch. > + */ > +static ssize_t last_report_show(struct kobject *kobj, > + struct kobj_attribute *attr, char *buf) > +{ > + ssize_t ret; > + int index; > + > + do { > + /* Pairs with release in probe_report_end(). */ > + index = atomic_read_acquire(&num_reports); > + /* > + * If index and old_index mismatch, we might be accessing > + * report_buffer concurrently with a writer thread.
In that > + * case the read data will be discarded. > + */ > + ret = data_race(strscpy(buf, report_buffer[index % 2], BUF_SIZE)); > + /* > + * Prevent reordering between the memcpy above and the atomic > + * read below. > + * See the comments in include/linux/seqlock.h for more > + * details. > + */ > + smp_rmb(); > + } while (index != atomic_read(&num_reports)); endless loops, what could go wrong... Why are you rolling your own hacky locks in here? And again, sysfs is "one value" not "one buffer". > + return ret; > +} > + > +/* > + * read() handler for /sys/kernel/error_report/report_count. > + */ > +static ssize_t report_count_show(struct kobject *kobj, > + struct kobj_attribute *attr, char *buf) > +{ > + return scnprintf(buf, PAGE_SIZE, "%d\n", atomic_read(&num_reports)); sysfs_emit()? And you just read it, but what keeps it from changing? > +} > + > +static struct kobj_attribute last_report_attr = __ATTR_RO(last_report); > +static struct kobj_attribute report_count_attr = __ATTR_RO(report_count); > +static struct attribute *error_report_sysfs_attrs[] = { > + &last_report_attr.attr, > + &report_count_attr.attr, > + NULL, > +}; > + > +static const struct attribute_group error_report_sysfs_attr_group = { > + .attrs = error_report_sysfs_attrs, > +}; ATTRIBUTE_GROUPS()? > + > +/* > + * Set up report notification: register tracepoints and create > + * /sys/kernel/error_report/. > + */ > +static void error_report_notify_setup(void) > +{ > + int err; > + > + register_tracepoints(); > + error_report_kobj = kobject_create_and_add("error_report", kernel_kobj); > + if (!error_report_kobj) > + goto error; > + err = sysfs_create_group(error_report_kobj, > + &error_report_sysfs_attr_group); > + if (err) > + goto error; > + return; > + > +error: > + if (error_report_kobj) > + kobject_del(error_report_kobj); > +} > +late_initcall(error_report_notify_setup); You never clean up the kobject or files? Anyway, please move this to tracefs, that's where it belongs. thanks, greg k-h
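The latch scheme being questioned here can be modeled in ordinary userspace C. The sketch below is illustrative only: C11 atomics stand in for the kernel's atomic_t API, the function names are made up, and a single-threaded demo cannot exhibit the torn-read case, but it makes the publish/retry protocol concrete:

```c
#include <assert.h>
#include <stdatomic.h>
#include <string.h>

#define BUF_SIZE 4096

/* Two buffers: [count % 2] holds the published report,
 * [(count + 1) % 2] is the one being written. */
static char report_buffer[2][BUF_SIZE];
static atomic_int num_reports;

/* Writer side: fill the spare buffer, then publish it by
 * bumping the counter with release semantics. */
static void publish_report(const char *text)
{
    int idx = (atomic_load(&num_reports) + 1) % 2;

    strncpy(report_buffer[idx], text, BUF_SIZE - 1);
    report_buffer[idx][BUF_SIZE - 1] = '\0';
    atomic_fetch_add_explicit(&num_reports, 1, memory_order_release);
}

/* Reader side: seqlock-style retry loop. If the counter moved
 * while we copied, the copy may be torn, so copy again. */
static size_t read_report(char *out)
{
    int index;

    do {
        index = atomic_load_explicit(&num_reports, memory_order_acquire);
        strncpy(out, report_buffer[index % 2], BUF_SIZE);
        /* Models the smp_rmb() between the copy and the re-check. */
        atomic_thread_fence(memory_order_acquire);
    } while (index != atomic_load(&num_reports));
    return strlen(out);
}
```

Note the reader loop is unbounded, which is exactly the objection raised above: a writer that keeps publishing can starve the reader indefinitely. A bounded retry count, or a real seqcount, avoids that.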
On Fri, Jan 15, 2021 at 2:50 PM Greg KH <gregkh@linuxfoundation.org> wrote: > > Minor comments, if in the future, you really do want to mess around in sysfs: > Thanks! Guess most of these comments apply even if I choose another FS to mess around with. > No copyright notice for the file? While acceptable, odds are your > corporate lawyers will not be happy with that :( You are right, will fix. > > +/* > > + * Userspace notification interface for debugging tools. > > + * > > + * Provide two sysfs files: > > + * - /sys/kernel/error_report/last_report > > + * - /sys/kernel/error_report/report_count > > + * that contain the last debugging tool report (taken from dmesg, delimited by > > + * the error_report_start/error_report_end tracing events) and the total report > > + * count. > > + */ > > + > > +#include <linux/atomic.h> > > +#include <linux/fs.h> > > +#include <linux/kobject.h> > > +#include <linux/string.h> > > +#include <linux/sysfs.h> > > +#include <linux/tracepoint.h> > > +#include <linux/workqueue.h> > > +#include <trace/events/error_report.h> > > +#include <trace/events/printk.h> > > + > > +static struct kobject *error_report_kobj; > > + > > +/* sysfs files are capped at PAGE_SIZE. */ > > +#define BUF_SIZE PAGE_SIZE > > +/* Two buffers to store the finished report and the report being recorded. */ > > +static char report_buffer[2][BUF_SIZE]; > > +/* > > + * Total report count. Also serves as a latch for report_buffer: > > + * report_buffer[num_reports % 2] is the currently available report, > > + * report_buffer[(num_reports + 1) % 2] is the report being recorded. > > + */ > > +static atomic_t num_reports; > > + > > +/* > > + * PID of the task currently recording the report, as returned by > > + * get_encoded_pid(), or -1. Used as a writer lock for report_buffer. > > + * A regular spinlock couldn't be used here, as probe_console() can be called > > + * from any thread, and it needs to know whether that thread is holding the > > + * lock. 
> > + */ > > +static atomic_t current_pid = ATOMIC_INIT(-1); > > how do you handle pid namespaces? Doesn't current->pid hold the global PID of the task? See the description of task_pid_nr() here: https://elixir.bootlin.com/linux/latest/source/include/linux/sched.h#L1386, which is supposed to return a global task ID. > > + if (atomic_cmpxchg_acquire(&current_pid, -1, get_encoded_pid()) != -1) > > + return; > > pid namespaces? See above. > > pid namespaces? > Same. > > + int idx; > > + > > + if (pid != get_encoded_pid()) > > + return; > > + > > + idx = (atomic_read(&num_reports) + 1) % 2; > > You read, but it could change before: Not sure I follow. num_reports can only be incremented by the same task that started the report, this cannot happen concurrently. > > > + if (current_pos == BUF_SIZE) > > + report_buffer[idx][current_pos - 1] = 0; > > + else > > + report_buffer[idx][current_pos] = 0; > > + > > + /* Pairs with acquire in last_report_show(). */ > > + atomic_inc_return_release(&num_reports); > > Not good? > > +static ssize_t last_report_show(struct kobject *kobj, > > + struct kobj_attribute *attr, char *buf) > > +{ > > + ssize_t ret; > > + int index; > > + > > + do { > > + /* Pairs with release in probe_report_end(). */ > > + index = atomic_read_acquire(&num_reports); > > + /* > > + * If index and old_index mismatch, we might be accessing > > + * report_buffer concurrently with a writer thread. In that > > + * case the read data will be discarded. > > + */ > > + ret = data_race(strscpy(buf, report_buffer[index % 2], BUF_SIZE)); > > + /* > > + * Prevent reordering between the memcpy above and the atomic > > + * read below. > > + * See the comments in include/linux/seqlock.h for more > > + * details. > > + */ > > + smp_rmb(); > > + } while (index != atomic_read(&num_reports)); > > endless loops, what could go wrong... Fair enough, this needs to be fixed. > > Why are you rolling your own hacky locks in here?
We've also considered using a seqlock here, but thought that required too much boilerplate code (the current implementation reuses the report counter as a seqlock latch, whereas otherwise we'd need to introduce an extra seqcount_latch_t plus call the seqlock API functions). I think this can be reconsidered. > And again, sysfs is "one value" not "one buffer". > > > + return ret; > > +} > > + > > +/* > > + * read() handler for /sys/kernel/error_report/report_count. > > + */ > > +static ssize_t report_count_show(struct kobject *kobj, > > + struct kobj_attribute *attr, char *buf) > > +{ > > + return scnprintf(buf, PAGE_SIZE, "%d\n", atomic_read(&num_reports)); > > sysfs_emit()? Good, haven't seen that one. I think I just took Documentation/filesystems/sysfs.txt as an example. > And you just read it, but what keeps it from changing? Nothing; we can't really guarantee nobody reported another error while we were processing the previous one. Similarly, we cannot be sure that any other vfs file still has the same contents once we read it ;) > > +static const struct attribute_group error_report_sysfs_attr_group = { > > + .attrs = error_report_sysfs_attrs, > > +}; > > ATTRIBUTE_GROUPS()? Ack. > > +late_initcall(error_report_notify_setup); > > You never clean up the kobject or files? Will fix, thanks! > Anyway, please move this to tracefs, that's where it belongs. Will do in v3.
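Setting the PID-namespace question aside (current->pid is indeed the global PID), the encoding itself is just a one-bit tag: the low bit records in_task(), the upper bits the PID. A userspace model, with hypothetical helper names:

```c
#include <assert.h>

/* Model of get_encoded_pid(): the low bit distinguishes task
 * context from IRQ context; the upper bits carry current->pid
 * (a global PID, so namespaces do not affect the comparison). */
static int encode_pid(int pid, int in_task)
{
    return (pid << 1) | !!in_task;
}

static int encoded_to_pid(int encoded)
{
    return encoded >> 1;
}

static int encoded_in_task(int encoded)
{
    return encoded & 1;
}
```

This is why the same task can match itself in probe_console()/probe_report_end(): it recomputes the encoded value and compares it against the stored writer lock.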
On Fri 2021-01-15 14:03:33, Alexander Potapenko wrote: > With the introduction of various production error-detection tools, such as > MTE-based KASAN and KFENCE, the need arises to efficiently notify the > userspace OS components about kernel errors. Currently, no facility exists > to notify userspace about a kernel error from such bug-detection tools. > The problem is obviously not restricted to the above bug detection tools, > and applies to any error reporting mechanism that does not panic the > kernel; this series, however, will only add support for KASAN and KFENCE > reporting. > > +++ b/lib/error_report_notify.c > +/* > + * Trace hook for the error_report_start event. In an unlikely case of another > + * task already printing a report bail out, otherwise save the current pid > + * together with in_task() return value. This is not reliable. Some events might get lost. > + * Because reporting code can be called from low-level routines (e.g. locking > + * primitives or allocator guts), report recording is implemented using a > + * seqlock lock-free algorithm. > + */ > +static void probe_report_start(void *ignore, enum error_detector detector, > + unsigned long id) > +{ > + /* > + * Acquire the writer lock. Any racing probe_report_start will not > + * record anything. Pairs with the release in probe_report_end(). > + */ > + if (atomic_cmpxchg_acquire(&current_pid, -1, get_encoded_pid()) != -1) > + return; > + current_pos = 0; > + truncated = false; > +} > + > +/* > + * Trace hook for the error_report_end event. If an event from the mismatching > + * error_report_start is received, it is ignored. Otherwise, null-terminate the > + * buffer, increase the report count (effectively releasing the report to > + * last_report_show() and schedule a notification about a new report.
> + */ > +static void probe_report_end(void *ignore, enum error_detector detector, > + unsigned long id) > +{ > + pid_t pid = atomic_read(&current_pid); > + int idx; > + > + if (pid != get_encoded_pid()) > + return; > + > + idx = (atomic_read(&num_reports) + 1) % 2; > + if (current_pos == BUF_SIZE) > + report_buffer[idx][current_pos - 1] = 0; > + else > + report_buffer[idx][current_pos] = 0; > + > + /* Pairs with acquire in last_report_show(). */ > + atomic_inc_return_release(&num_reports); > + schedule_delayed_work(&reporting_done, 0); Why delayed work when it gets queued immediately? > + /* > + * Release the writer lock. Pairs with the acquire in > + * probe_report_start(). > + */ > + atomic_set_release(&current_pid, -1); > +} > + > +/* > + * Skip one or two leading pair of brackets containing the log timestamp and > + * the task/CPU ID, plus the leading space, from the report line, e.g.: > + * [ 0.698431][ T7] BUG: KFENCE: use-after-free ... > + * becomes: > + * BUG: KFENCE: use-after-free ... > + * > + * Report size is only 4K, and this boilerplate can easily account for half of > + * that amount. > + */ > +static void skip_extra_info(const char **buf, size_t *len) > +{ > + int num_brackets = IS_ENABLED(CONFIG_PRINTK_TIME) + > + IS_ENABLED(CONFIG_PRINTK_CALLER); The timestamp can be disabled also at runtime by /sys/module/printk/parameters/time > + const char *found; > + > + if (!buf || !len) > + return; > + > + while (num_brackets--) { > + if (!*len || *buf[0] != '[') > + return; > + found = strnchr(*buf, *len, ']'); > + if (!found) > + return; > + *len -= found - *buf + 1; > + *buf = found + 1; > + } > + if (*len && *buf[0] == ' ') { > + ++*buf; > + --*len; > + } > +} > + > +/* > + * Trace hook for the console event. If a line comes from a task/CPU that did > + * not send the error_report_start event, that line is ignored. Otherwise, it > + * is stored in the report_buffer[(num_reports + 1) % 2].
> + * > + * To save space, the leading timestamps and (when enabled) CPU/task info is > + * stripped away. The buffer may contain newlines, so this procedure is > + * repeated for every line. IMHO, removing the timestamp is a bad idea. It will complicate matching the message with other events. I am afraid that some reports would get shrunk anyway. I suggest using a lockless ring buffer to store these messages, either the ftrace or the printk one. But the biggest problem would be that the console might be called asynchronously from another process. The messages might get lost in the meantime. There are many other, more reliable solutions: Either add a hook into vprintk_store() and make another copy of the message into your buffer. But there will still be the problem that you could not store more reports in parallel. So, it won't be reliable anyway. Or add a hook into the /dev/kmsg interface. It allows reading any stored message immediately and quickly. Well, you would need to have your own reader of this interface. Or add your own reader of the main printk log. Or give up on having your own buffers. Instead, just inform userspace about a new report via the polled interface, and get the messages from the main log, e.g. via /dev/kmsg. I agree with Andrew that this might be a _very_ dangerous feature. It is another way to get messages from the kernel. It looks like a hack to get certain messages from the main kernel log into a separate log. This might end up with implementing yet another printk() or trace_printk(). I suggest integrating this better with printk(): + We already have printk_context(), which allows handling messages in a special way. + Just store an extra flag in the main log buffer for these messages. + Allow reading the messages via some interface, either /dev/kmsg or a new one. We already store PID and context info in the main log buffer. This will allow putting relevant lines together. + Anyway, I strongly suggest avoiding your own buffer. It looks like overhead.
It would get more complicated when there are more users, ... > + */ > +static void probe_console(void *ignore, const char *buf, size_t len) > +{ > + int pid = atomic_read(&current_pid); > + size_t to_copy, cur_len; > + char *newline; > + int idx; > + > + if (pid != get_encoded_pid() || truncated) > + return; > + > + idx = (atomic_read(&num_reports) + 1) % 2; > + while (len) { > + newline = strnchr(buf, len, '\n'); > + if (newline) > + cur_len = newline - buf + 1; > + else > + cur_len = len; > + /* Adjust len now, because skip_extra_info() may change cur_len. */ > + len -= cur_len; > + skip_extra_info(&buf, &cur_len); > + to_copy = min(cur_len, BUF_SIZE - current_pos); > + memcpy(report_buffer[idx] + current_pos, buf, to_copy); > + current_pos += to_copy; > + if (cur_len > to_copy) { > + truncated = true; > + memcpy(report_buffer[idx] + current_pos - sizeof(TRUNC_MSG), > + TRUNC_MSG, sizeof(TRUNC_MSG)); > + break; > + } > + buf += cur_len; > + } > +} I still have to think about it. This feature heavily depends on printk(). It makes sense to integrate it there. But we also need to keep it simple because printk() is already pretty complicated. Best Regards, Petr
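For reference, the prefix-stripping logic under discussion, re-typed as a standalone userspace function: memchr() stands in for the kernel's strnchr(), and the bracket count is passed in as a parameter rather than derived from config options, following Petr's point that the timestamp can also be toggled at runtime:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Drop up to num_brackets leading "[...]" groups and one
 * following space from the line, e.g.
 *   "[    0.698431][    T7] BUG: ..."  becomes  "BUG: ..." */
static void skip_extra_info(const char **buf, size_t *len, int num_brackets)
{
    const char *found;

    while (num_brackets--) {
        if (!*len || (*buf)[0] != '[')
            return;
        found = memchr(*buf, ']', *len);
        if (!found)
            return;
        *len -= (size_t)(found - *buf) + 1;
        *buf = found + 1;
    }
    if (*len && (*buf)[0] == ' ') {
        ++*buf;
        --*len;
    }
}
```

The function only advances the pointer and shrinks the length; the caller still owns the underlying buffer, which matches how probe_console() uses it per line.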
On Mon, Jan 18, 2021 at 12:38 PM Petr Mladek <pmladek@suse.com> wrote: Thanks for your input! Some responses below. > > On Fri 2021-01-15 14:03:33, Alexander Potapenko wrote: > > With the introduction of various production error-detection tools, such as > > MTE-based KASAN and KFENCE, the need arises to efficiently notify the > > userspace OS components about kernel errors. Currently, no facility exists > > to notify userspace about a kernel error from such bug-detection tools. > > The problem is obviously not restricted to the above bug detection tools, > > and applies to any error reporting mechanism that does not panic the > > kernel; this series, however, will only add support for KASAN and KFENCE > > reporting. > > > > +++ b/lib/error_report_notify.c > > +/* > > + * Trace hook for the error_report_start event. In an unlikely case of another > > + * task already printing a report bail out, otherwise save the current pid > > + * together with in_task() return value. > > This is not reliable. Some events might get lost. > > + > > + /* Pairs with acquire in last_report_show(). */ > > + atomic_inc_return_release(&num_reports); > > + schedule_delayed_work(&reporting_done, 0); > > Why delayed work when it gets queued immediately? Because error reports may be sent from a place where waiting is forbidden (in the case of KFENCE that is the page fault handler). A delayed work ensures the notification occurs in a wait-safe context. > > +static void skip_extra_info(const char **buf, size_t *len) > > +{ > > + int num_brackets = IS_ENABLED(CONFIG_PRINTK_TIME) + > > + IS_ENABLED(CONFIG_PRINTK_CALLER); > > The timestamp can be disabled also at runtime by > /sys/module/printk/parameters/time Guess that would result in disabling it for the whole kernel, which is too aggressive. > IMHO, removing the timestamp is a bad idea. It will complicate > matching the message with other events. 
Given that we are moving away from sysfs, the report size limit can be increased, therefore we don't have to remove the timestamp to save space anymore. > I am afraid that some reports would get shrunken anyway. Do you have any particular subsystem in mind? We could probably increase the report buffer to, say, 16K, to avoid this. > I suggest to use some lockless ringbuffer to store these > messages. Either ftrace or printk one. > > But the biggest problem would be that console might be called > asynchronously from another process. The messages might get > lost in the meantime. Do you mean that one process may call printk(), but the console probe for that message will be executed in the context of another process? If so, that would be a problem. Could it be possible to extend the console trace event to pass an extra caller id? > > There are many other more reliable solutions: > > Either add hook into vprintk_store() and make another copy of the > message into your buffer. But there will still be the problem > that you could not store more reports in parallel. So, it won't > be reliable anyway. At least for now, we do not expect many reports to come in parallel: this feature is meant for collecting the long tail of memory corruptions from production devices. > > Or add a hook into /dev/kmsg interface. It allows to read any > stored message immediately and quickly. Well, you would need > to have your own reader of this interface. This probably could be done, but isn't the console tracepoint designed exactly for this? > Or add your own reader of the main printk log. > > Or give up on having your own buffers. Instead just inform userspace > about a new report via the pooled interface. And get the messages > from the main log, e.g. via /dev/kmsg. Yes, that could be an option, although the need to search for tool-specific strings in kmsg and figure out where the report starts and ends still persists. 
At the end of the day, this sounds like the best approach for now, that we could later extend to store the logs. > I agree with Andrew that this might be _very_ dangerous feature. > It is another way how to get messages from kernel. It looks like > a hack how to get certain messages from the main kernel log > into a separate log. This might end up with implementing > yet another printk() or trace_printk(). Initially our intent was to refactor KFENCE error reporting so that the whole report is stored in a buffer and can be exposed to userspace via a VFS entry. Then it turned out we'd need something similar for KASAN, which would've also resulted in a major rewrite. So we thought a much more generic solution would be to let the tools "tee" their output to a specific file on demand. Yes, sounds like duplicating the kernel log :) On the other hand, we never really planned to extend printk or duplicate parts of it, so unless trace_console() is a good fit we'd probably refrain from storing the logs for now. > I suggest to integrate this better with printk(): > > + We already have printk_context() that allows to handle > messages a special way. > > + Just store extra flag into the main log buffer for these > messages. > > + Allow to read the messages via some interface, either > /dev/kmsg or a new one. We already store PID and context > info into the main log buffer. It will allow to put > relevant lines together. > > + Anyway, I strongly suggest to avoid your own buffer. It looks > like an overhead. It it would get more complicated when > there are more users, ... Having the special flag for the error reports is handy, but given the existence of CONFIG_PRINTK_CALLER may not be worth the effort, at least for now. > > > I still have to think about it. This feature heavily depends on > printk(). It makes sense to integrate it there. But we also > need to keep it simple because printk() is already pretty > complicated. > > Best Regards, > Petr
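The "search kmsg for where the report starts and ends" approach settled on here can be sketched as a plain substring scan. The marker strings in the test are hypothetical placeholders for whatever delimiters a tool would actually emit:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the text between the first start_marker and the next
 * end_marker in log into out (at most out_size - 1 bytes).
 * Returns the number of bytes copied, or -1 if not found. */
static long extract_report(const char *log, const char *start_marker,
                           const char *end_marker, char *out, size_t out_size)
{
    const char *start = strstr(log, start_marker);
    const char *end;
    size_t n;

    if (!start)
        return -1;
    start += strlen(start_marker);
    end = strstr(start, end_marker);
    if (!end)
        return -1;
    n = (size_t)(end - start);
    if (n >= out_size)
        n = out_size - 1;
    memcpy(out, start, n);
    out[n] = '\0';
    return (long)n;
}
```

A real consumer would run this over lines read from /dev/kmsg rather than over one in-memory string, but the delimiting logic is the same.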
> > > + > > > + /* Pairs with acquire in last_report_show(). */ > > > + atomic_inc_return_release(&num_reports); > > > + schedule_delayed_work(&reporting_done, 0); > > > > Why delayed work when it gets queued immediately? > > Because error reports may be sent from a place where waiting is > forbidden (in the case of KFENCE that is the page fault handler). > A delayed work ensures the notification occurs in a wait-safe context. > Agreed, there's actually no point in calling schedule_delayed_work() instead of schedule_work() here.
On Mon 2021-01-18 14:08:19, Alexander Potapenko wrote: > On Mon, Jan 18, 2021 at 12:38 PM Petr Mladek <pmladek@suse.com> wrote: > Thanks for your input! Some responses below. > > > > > On Fri 2021-01-15 14:03:33, Alexander Potapenko wrote: > > > With the introduction of various production error-detection tools, such as > > > MTE-based KASAN and KFENCE, the need arises to efficiently notify the > > > userspace OS components about kernel errors. Currently, no facility exists > > > to notify userspace about a kernel error from such bug-detection tools. > > > The problem is obviously not restricted to the above bug detection tools, > > > and applies to any error reporting mechanism that does not panic the > > > kernel; this series, however, will only add support for KASAN and KFENCE > > > reporting. > > > > > > +++ b/lib/error_report_notify.c > > > +/* > > > + * Trace hook for the error_report_start event. In an unlikely case of another > > > + * task already printing a report bail out, otherwise save the current pid > > > + * together with in_task() return value. > > > > This is not reliable. Some events might get lost. > > > > > + > > > + /* Pairs with acquire in last_report_show(). */ > > > + atomic_inc_return_release(&num_reports); > > > + schedule_delayed_work(&reporting_done, 0); > > > > Why delayed work when it gets queued immediately? > > Because error reports may be sent from a place where waiting is > forbidden (in the case of KFENCE that is the page fault handler). > A delayed work ensures the notification occurs in a wait-safe context. > > > > > +static void skip_extra_info(const char **buf, size_t *len) > > > +{ > > > + int num_brackets = IS_ENABLED(CONFIG_PRINTK_TIME) + > > > + IS_ENABLED(CONFIG_PRINTK_CALLER); > > > > The timestamp can be disabled also at runtime by > > /sys/module/printk/parameters/time > > Guess that would result in disabling it for the whole kernel, which is > too aggressive. 
I wanted to say that it was not enough to check IS_ENABLED(CONFIG_PRINTK_TIME). Anyway, I think that you will not use this code in the end. > > IMHO, removing the timestamp is a bad idea. It will complicate > > matching the message with other events. > > Given that we are moving away from sysfs, the report size limit can be > increased, therefore we don't have to remove the timestamp to save > space anymore. > > > I am afraid that some reports would get shrunken anyway. > Do you have any particular subsystem in mind? We could probably printk() is storing recursive messages and also messages in NMI context into temporary per-CPU buffers. They include just a plain text (no timestamps). The original size was 4kB but it caused shrinking some backtraces. They are 8kB by default now. But they still shrink, for example, lockdep report. I am not sure about the KASAN or KFENCE reports. But the experience is that amount of debug messages is just growing year over year. JFYI, these temporary buffers should get removed once the new lockless ringbuffer is fully integrated. > increase the report buffer to, say, 16K, to avoid this. > > > I suggest to use some lockless ringbuffer to store these > > messages. Either ftrace or printk one. > > > > But the biggest problem would be that console might be called > > asynchronously from another process. The messages might get > > lost in the meantime. > > Do you mean that one process may call printk(), but the console probe > for that message will be executed in the context of another process? Yes, this might happen. And it is going to be the default behavior. The current solution might cause softlockups. John Ogness is working on offloading the console handling into separate kthreads. It means that call_console_drivers() will always get called by a separate process. > If so, that would be a problem. Could it be possible to extend the > console trace event to pass an extra caller id? We should pass the messages some other way. 
The good news is that the caller id is always stored in the new
lockless ring buffer that has been integrated into 5.10. It is used
internally for handling continuation lines. The caller id is printed
on the console only when the config option is enabled.

But we should create another interface for these debugging dumps
anyway. I would implement one more dumper, like kmsg_dump(),
syslog_print_all(), console_unlock(). They cycle over all/recently
stored messages and print them. console_unlock() filters the messages
according to console_loglevel. The new function might filter them by a
flag specific to the interested tools, like KASAN.

> > Or add a hook into /dev/kmsg interface. It allows to read any
> > stored message immediately and quickly. Well, you would need
> > to have your own reader of this interface.
>
> This probably could be done, but isn't the console tracepoint designed
> exactly for this?

Hmm, the tracepoint was added in 2011 by the commit 95100358491abaa2
("printk/tracing: Add console output tracing"). It was a time when the
messages were added into the main log buffer character by character. I
guess that it was easier to add the tracing here because the console
code already found the beginning and end of each message.

The /dev/kmsg interface was added later. Also, the messages were later
stored as records (one message with its metadata) into the log buffer.
So, it might be done an easier way now.

> Initially our intent was to refactor KFENCE error reporting so that
> the whole report is stored in a buffer and can be exposed to userspace
> via a VFS entry.
> Then it turned out we'd need something similar for KASAN, which
> would've also resulted in a major rewrite.
> So we thought a much more generic solution would be to let the tools
> "tee" their output to a specific file on demand.

Thanks for the background.
> Yes, sounds like duplicating the kernel log :)
> On the other hand, we never really planned to extend printk or
> duplicate parts of it, so unless trace_console() is a good fit we'd
> probably refrain from storing the logs for now.

IMHO, using the tracepoint for your purpose looks too complicated and
less reliable. It might be interesting for 3rd-party code because the
tracepoint is rather stable. But we might do it better in upstream.

> Having the special flag for the error reports is handy, but given the
> existence of CONFIG_PRINTK_CALLER may not be worth the effort, at
> least for now.

Of course, the less we need to complicate the printk() code the better.
So if you could find the right messages using a normal interface, it
would be great.

That said, there are currently the following ways to read the messages:

  + consoles
  + syslog syscall (man 2 syslog)
  + /dev/kmsg
  + kmsg dumpers

Three of them use record_print_text() to format the message. It means
that they could show the caller_id in the brackets. The only exception
is /dev/kmsg. It uses info_print_ext_header() and msg_print_ext_body()
to format the output. It currently does not show caller_id. Nobody
wanted it yet. We could update it. We just need to be careful because
systemd journald uses this interface and relies on the format.

Best Regards,
Petr
Thank you all for the comments!

Since concerns have been raised that another error reporting system may
quickly go out of control, we've decided to not pursue this solution any
further. Instead, we will try to rely on existing ftrace mechanisms to
deliver notifications to userspace, and perform dmesg collection/parsing
once a notification is received.

As a result, we won't need the error_report_start tracepoint anytime
soon (I didn't rename error_report_end to something like error_report,
anticipating that error_report_start may be needed in the future).

I've also dropped the error_report_notify library and the corresponding
documentation from the patchset.
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 4b8f9e018f0f..b950bf21b15c 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -209,6 +209,20 @@ config DEBUG_BUGVERBOSE
 	  of the BUG call as well as the EIP and oops trace. This aids
 	  debugging but costs about 70-100K of memory.
 
+config ERROR_REPORT_NOTIFY
+	bool "Expose error reports from debugging tools to the userspace"
+	depends on TRACING
+	help
+	  When enabled, captures error reports from debugging tools (such as
+	  KFENCE or KASAN) using console tracing, and exposes reports in
+	  /sys/kernel/error_report/: the file last_report contains the last
+	  report (with maximum report length of PAGE_SIZE), and report_count,
+	  the total report count.
+
+	  Userspace programs can call poll(POLLPRI) on those files to get
+	  notified about the new reports without having to watch dmesg in a
+	  loop.
+
 endmenu # "printk and dmesg options"
 
 menu "Compile-time checks and compiler options"
diff --git a/lib/Makefile b/lib/Makefile
index afeff05fa8c5..96ed8bc0cc5b 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -240,6 +240,8 @@ obj-$(CONFIG_CPU_RMAP) += cpu_rmap.o
 
 obj-$(CONFIG_DQL) += dynamic_queue_limits.o
 
+obj-$(CONFIG_ERROR_REPORT_NOTIFY) += error_report_notify.o
+
 obj-$(CONFIG_GLOB) += glob.o
 obj-$(CONFIG_GLOB_SELFTEST) += globtest.o
 
diff --git a/lib/error_report_notify.c b/lib/error_report_notify.c
new file mode 100644
index 000000000000..66176cd94ba0
--- /dev/null
+++ b/lib/error_report_notify.c
@@ -0,0 +1,278 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Userspace notification interface for debugging tools.
+ *
+ * Provide two sysfs files:
+ *  - /sys/kernel/error_report/last_report
+ *  - /sys/kernel/error_report/report_count
+ * that contain the last debugging tool report (taken from dmesg, delimited by
+ * the error_report_start/error_report_end tracing events) and the total report
+ * count.
+ */
+
+#include <linux/atomic.h>
+#include <linux/fs.h>
+#include <linux/kobject.h>
+#include <linux/string.h>
+#include <linux/sysfs.h>
+#include <linux/tracepoint.h>
+#include <linux/workqueue.h>
+#include <trace/events/error_report.h>
+#include <trace/events/printk.h>
+
+static struct kobject *error_report_kobj;
+
+/* sysfs files are capped at PAGE_SIZE. */
+#define BUF_SIZE PAGE_SIZE
+/* Two buffers to store the finished report and the report being recorded. */
+static char report_buffer[2][BUF_SIZE];
+/*
+ * Total report count. Also serves as a latch for report_buffer:
+ * report_buffer[num_reports % 2] is the currently available report,
+ * report_buffer[(num_reports + 1) % 2] is the report being recorded.
+ */
+static atomic_t num_reports;
+
+/*
+ * PID of the task currently recording the report, as returned by
+ * get_encoded_pid(), or -1. Used as a writer lock for report_buffer.
+ * A regular spinlock couldn't be used here, as probe_console() can be called
+ * from any thread, and it needs to know whether that thread is holding the
+ * lock.
+ */
+static atomic_t current_pid = ATOMIC_INIT(-1);
+
+static size_t current_pos;
+static bool truncated;
+static const char TRUNC_MSG[] = "<truncated>\n";
+
+static struct delayed_work reporting_done;
+
+static void error_report_notify(struct work_struct *work)
+{
+	sysfs_notify(error_report_kobj, NULL, "last_report");
+	sysfs_notify(error_report_kobj, NULL, "report_count");
+}
+static DECLARE_DELAYED_WORK(reporting_done, error_report_notify);
+
+/*
+ * Return the current PID combined together with in_task(). This lets us
+ * distinguish between normal task context and IRQ context.
+ */
+static int get_encoded_pid(void)
+{
+	return (current->pid << 1) | (!!in_task());
+}
+
+/*
+ * Trace hook for the error_report_start event. In an unlikely case of another
+ * task already printing a report bail out, otherwise save the current pid
+ * together with in_task() return value.
+ *
+ * Because reporting code can be called from low-level routines (e.g. locking
+ * primitives or allocator guts), report recording is implemented using a
+ * seqlock lock-free algorithm.
+ */
+static void probe_report_start(void *ignore, enum error_detector detector,
+			       unsigned long id)
+{
+	/*
+	 * Acquire the writer lock. Any racing probe_report_start will not
+	 * record anything. Pairs with the release in probe_report_end().
+	 */
+	if (atomic_cmpxchg_acquire(&current_pid, -1, get_encoded_pid()) != -1)
+		return;
+	current_pos = 0;
+	truncated = false;
+}
+
+/*
+ * Trace hook for the error_report_end event. If an event from the mismatching
+ * error_report_start is received, it is ignored. Otherwise, null-terminate the
+ * buffer, increase the report count (effectively releasing the report to
+ * last_report_show()) and schedule a notification about a new report.
+ */
+static void probe_report_end(void *ignore, enum error_detector detector,
+			     unsigned long id)
+{
+	pid_t pid = atomic_read(&current_pid);
+	int idx;
+
+	if (pid != get_encoded_pid())
+		return;
+
+	idx = (atomic_read(&num_reports) + 1) % 2;
+	if (current_pos == BUF_SIZE)
+		report_buffer[idx][current_pos - 1] = 0;
+	else
+		report_buffer[idx][current_pos] = 0;
+
+	/* Pairs with acquire in last_report_show(). */
+	atomic_inc_return_release(&num_reports);
+	schedule_delayed_work(&reporting_done, 0);
+	/*
+	 * Release the writer lock. Pairs with the acquire in
+	 * probe_report_start().
+	 */
+	atomic_set_release(&current_pid, -1);
+}
+
+/*
+ * Skip one or two leading pairs of brackets containing the log timestamp and
+ * the task/CPU ID, plus the leading space, from the report line, e.g.:
+ *   [    0.698431][    T7] BUG: KFENCE: use-after-free ...
+ * becomes:
+ *   BUG: KFENCE: use-after-free ...
+ *
+ * Report size is only 4K, and this boilerplate can easily account for half of
+ * that amount.
+ */
+static void skip_extra_info(const char **buf, size_t *len)
+{
+	int num_brackets = IS_ENABLED(CONFIG_PRINTK_TIME) +
+			   IS_ENABLED(CONFIG_PRINTK_CALLER);
+	const char *found;
+
+	if (!buf || !len)
+		return;
+
+	while (num_brackets--) {
+		if (!*len || *buf[0] != '[')
+			return;
+		found = strnchr(*buf, *len, ']');
+		if (!found)
+			return;
+		*len -= found - *buf + 1;
+		*buf = found + 1;
+	}
+	if (*len && *buf[0] == ' ') {
+		++*buf;
+		--*len;
+	}
+}
+
+/*
+ * Trace hook for the console event. If a line comes from a task/CPU that did
+ * not send the error_report_start event, that line is ignored. Otherwise, it
+ * is stored in the report_buffer[(num_reports + 1) % 2].
+ *
+ * To save space, the leading timestamps and (when enabled) CPU/task info is
+ * stripped away. The buffer may contain newlines, so this procedure is
+ * repeated for every line.
+ */
+static void probe_console(void *ignore, const char *buf, size_t len)
+{
+	int pid = atomic_read(&current_pid);
+	size_t to_copy, cur_len;
+	char *newline;
+	int idx;
+
+	if (pid != get_encoded_pid() || truncated)
+		return;
+
+	idx = (atomic_read(&num_reports) + 1) % 2;
+	while (len) {
+		newline = strnchr(buf, len, '\n');
+		if (newline)
+			cur_len = newline - buf + 1;
+		else
+			cur_len = len;
+		/* Adjust len now, because skip_extra_info() may change cur_len. */
+		len -= cur_len;
+		skip_extra_info(&buf, &cur_len);
+		to_copy = min(cur_len, BUF_SIZE - current_pos);
+		memcpy(report_buffer[idx] + current_pos, buf, to_copy);
+		current_pos += to_copy;
+		if (cur_len > to_copy) {
+			truncated = true;
+			memcpy(report_buffer[idx] + current_pos - sizeof(TRUNC_MSG),
+			       TRUNC_MSG, sizeof(TRUNC_MSG));
+			break;
+		}
+		buf += cur_len;
+	}
+}
+
+static void register_tracepoints(void)
+{
+	register_trace_console(probe_console, NULL);
+	register_trace_error_report_start(probe_report_start, NULL);
+	register_trace_error_report_end(probe_report_end, NULL);
+}
+
+/*
+ * read() handler for /sys/kernel/error_report/last_report.
+ * Because the number of reports can change under our feet, check it again
+ * after copying the report, and retry if the numbers mismatch.
+ */
+static ssize_t last_report_show(struct kobject *kobj,
+				struct kobj_attribute *attr, char *buf)
+{
+	ssize_t ret;
+	int index;
+
+	do {
+		/* Pairs with release in probe_report_end(). */
+		index = atomic_read_acquire(&num_reports);
+		/*
+		 * If index and old_index mismatch, we might be accessing
+		 * report_buffer concurrently with a writer thread. In that
+		 * case the read data will be discarded.
+		 */
+		ret = data_race(strscpy(buf, report_buffer[index % 2], BUF_SIZE));
+		/*
+		 * Prevent reordering between the memcpy above and the atomic
+		 * read below.
+		 * See the comments in include/linux/seqlock.h for more
+		 * details.
+		 */
+		smp_rmb();
+	} while (index != atomic_read(&num_reports));
+	return ret;
+}
+
+/*
+ * read() handler for /sys/kernel/error_report/report_count.
+ */
+static ssize_t report_count_show(struct kobject *kobj,
+				 struct kobj_attribute *attr, char *buf)
+{
+	return scnprintf(buf, PAGE_SIZE, "%d\n", atomic_read(&num_reports));
+}
+
+static struct kobj_attribute last_report_attr = __ATTR_RO(last_report);
+static struct kobj_attribute report_count_attr = __ATTR_RO(report_count);
+static struct attribute *error_report_sysfs_attrs[] = {
+	&last_report_attr.attr,
+	&report_count_attr.attr,
+	NULL,
+};
+
+static const struct attribute_group error_report_sysfs_attr_group = {
+	.attrs = error_report_sysfs_attrs,
+};
+
+/*
+ * Set up report notification: register tracepoints and create
+ * /sys/kernel/error_report/.
+ */
+static void error_report_notify_setup(void)
+{
+	int err;
+
+	register_tracepoints();
+	error_report_kobj = kobject_create_and_add("error_report", kernel_kobj);
+	if (!error_report_kobj)
+		goto error;
+	err = sysfs_create_group(error_report_kobj,
+				 &error_report_sysfs_attr_group);
+	if (err)
+		goto error;
+	return;
+
+error:
+	if (error_report_kobj)
+		kobject_del(error_report_kobj);
+}
+late_initcall(error_report_notify_setup);
With the introduction of various production error-detection tools, such as
MTE-based KASAN and KFENCE, the need arises to efficiently notify userspace
OS components about kernel errors. Currently, no facility exists to notify
userspace about a kernel error from such bug-detection tools. The problem
is obviously not restricted to the above bug detection tools, and applies
to any error reporting mechanism that does not panic the kernel; this
series, however, will only add support for KASAN and KFENCE reporting.

All such error reports appear in the kernel log. But when such errors
occur, userspace would normally need to read the entire kernel log and
parse the relevant errors. This is error prone and inefficient, as
userspace needs to continuously monitor the kernel log for error messages.
On certain devices, this is unfortunately not acceptable. Therefore, we
need to revisit how reports are propagated to userspace.

The library added, error_report_notify (CONFIG_ERROR_REPORT_NOTIFY),
solves the above by using the error_report_start/error_report_end tracing
events and exposing the last report and the total report count to
userspace via /sys/kernel/error_report/last_report and
/sys/kernel/error_report/report_count.

Userspace apps can call poll(POLLPRI) on those files to get notified about
new reports without having to watch dmesg in a loop.
Suggested-by: Marco Elver <elver@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marco Elver <elver@google.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: linux-mm@kvack.org
Signed-off-by: Alexander Potapenko <glider@google.com>
---
v2:
 - s/memory error reports/error reports from debugging tools/
   (per Andrew Morton's comment)
 - change error_report_start and error_report_end prototypes to accept
   enum error_detector instead of char* (as suggested by Steven Rostedt)
---
 lib/Kconfig.debug         |  14 ++
 lib/Makefile              |   2 +
 lib/error_report_notify.c | 278 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 294 insertions(+)
 create mode 100644 lib/error_report_notify.c