Message ID | f96f8f8a-e65c-3f36-dc85-fc3f5191e8c5@linux.intel.com (mailing list archive) |
---|---|
Headers | show |
Series | Introduce CAP_PERFMON to secure system performance monitoring and observability | expand |
Em Thu, Apr 02, 2020 at 11:42:05AM +0300, Alexey Budankov escreveu: > This patch set introduces CAP_PERFMON capability designed to secure > system performance monitoring and observability operations so that > CAP_PERFMON would assist CAP_SYS_ADMIN capability in its governing role > for performance monitoring and observability subsystems of the kernel. So, what am I doing wrong? [perf@five ~]$ type perf perf is hashed (/home/perf/bin/perf) [perf@five ~]$ [perf@five ~]$ ls -lahF /home/perf/bin/perf -rwxr-x---. 1 root perf_users 24M Apr 7 10:34 /home/perf/bin/perf* [perf@five ~]$ [perf@five ~]$ getcap /home/perf/bin/perf [perf@five ~]$ perf top --stdio Error: You may not have permission to collect system-wide stats. Consider tweaking /proc/sys/kernel/perf_event_paranoid, which controls use of the performance events system by unprivileged users (without CAP_PERFMON or CAP_SYS_ADMIN). The current value is 2: -1: Allow use of (almost) all events by all users Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK >= 0: Disallow ftrace function tracepoint by users without CAP_PERFMON or CAP_SYS_ADMIN Disallow raw tracepoint access by users without CAP_SYS_PERFMON or CAP_SYS_ADMIN >= 1: Disallow CPU event access by users without CAP_PERFMON or CAP_SYS_ADMIN >= 2: Disallow kernel profiling by users without CAP_PERFMON or CAP_SYS_ADMIN To make this setting permanent, edit /etc/sysctl.conf too, e.g.: kernel.perf_event_paranoid = -1 [perf@five ~]$ Ok, the message says I need to have CAP_PERFMON, lets do it, using an unpatched libcap that doesn't know about it but we can use 38, CAP_PERFMON value instead, and I tested this with a patched libcap as well, same results: As root: [root@five bin]# setcap "38,cap_sys_ptrace,cap_syslog=ep" perf [root@five bin]# Back to the 'perf' user in the 'perf_users' group, ok, so now 'perf record -a' works for system wide sampling of cycles:u, i.e. only userspace samples, but 'perf top' is failing: [perf@five ~]$ type perf perf is hashed (/home/perf/bin/perf) [perf@five ~]$ getcap /home/perf/bin/perf /home/perf/bin/perf = cap_sys_ptrace,cap_syslog,38+ep [perf@five ~]$ groups perf perf_users [perf@five ~]$ id uid=1002(perf) gid=1002(perf) groups=1002(perf),1003(perf_users) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 [perf@five ~]$ perf top --stdio Error: Failed to mmap with 1 (Operation not permitted) [perf@five ~]$ perf record -a ^C[ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 1.177 MB perf.data (1552 samples) ] [perf@five ~]$ perf evlist cycles:u [perf@five ~]$ - Arnaldo
Em Tue, Apr 07, 2020 at 11:30:14AM -0300, Arnaldo Carvalho de Melo escreveu: > [perf@five ~]$ type perf > perf is hashed (/home/perf/bin/perf) > [perf@five ~]$ getcap /home/perf/bin/perf > /home/perf/bin/perf = cap_sys_ptrace,cap_syslog,38+ep > [perf@five ~]$ groups > perf perf_users > [perf@five ~]$ id > uid=1002(perf) gid=1002(perf) groups=1002(perf),1003(perf_users) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 > [perf@five ~]$ perf top --stdio > Error: > Failed to mmap with 1 (Operation not permitted) > [perf@five ~]$ perf record -a > ^C[ perf record: Woken up 1 times to write data ] > [ perf record: Captured and wrote 1.177 MB perf.data (1552 samples) ] > > [perf@five ~]$ perf evlist > cycles:u > [perf@five ~]$ Humm, perf record falls back to cycles:u after initially trying cycles (i.e. kernel and userspace), lemme see trying 'perf top -e cycles:u', lemme test, humm not really: [perf@five ~]$ perf top --stdio -e cycles:u Error: Failed to mmap with 1 (Operation not permitted) [perf@five ~]$ perf record -e cycles:u -a sleep 1 [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 1.123 MB perf.data (132 samples) ] [perf@five ~]$ Back to debugging this. - Arnaldo
On 07.04.2020 17:35, Arnaldo Carvalho de Melo wrote: > Em Tue, Apr 07, 2020 at 11:30:14AM -0300, Arnaldo Carvalho de Melo escreveu: >> [perf@five ~]$ type perf >> perf is hashed (/home/perf/bin/perf) >> [perf@five ~]$ getcap /home/perf/bin/perf >> /home/perf/bin/perf = cap_sys_ptrace,cap_syslog,38+ep >> [perf@five ~]$ groups >> perf perf_users >> [perf@five ~]$ id >> uid=1002(perf) gid=1002(perf) groups=1002(perf),1003(perf_users) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 >> [perf@five ~]$ perf top --stdio >> Error: >> Failed to mmap with 1 (Operation not permitted) >> [perf@five ~]$ perf record -a >> ^C[ perf record: Woken up 1 times to write data ] >> [ perf record: Captured and wrote 1.177 MB perf.data (1552 samples) ] >> >> [perf@five ~]$ perf evlist >> cycles:u >> [perf@five ~]$ > > Humm, perf record falls back to cycles:u after initially trying cycles > (i.e. kernel and userspace), lemme see trying 'perf top -e cycles:u', > lemme test, humm not really: > > [perf@five ~]$ perf top --stdio -e cycles:u > Error: > Failed to mmap with 1 (Operation not permitted) > [perf@five ~]$ perf record -e cycles:u -a sleep 1 > [ perf record: Woken up 1 times to write data ] > [ perf record: Captured and wrote 1.123 MB perf.data (132 samples) ] > [perf@five ~]$ > > Back to debugging this. Could makes sense adding cap_ipc_lock to the binary to isolate from this: kernel/events/core.c: 6101 if ((locked > lock_limit) && perf_is_paranoid() && !capable(CAP_IPC_LOCK)) { ret = -EPERM; goto unlock; } ~Alexey
Em Tue, Apr 07, 2020 at 05:54:27PM +0300, Alexey Budankov escreveu: > On 07.04.2020 17:35, Arnaldo Carvalho de Melo wrote: > > Em Tue, Apr 07, 2020 at 11:30:14AM -0300, Arnaldo Carvalho de Melo escreveu: > >> [perf@five ~]$ type perf > >> perf is hashed (/home/perf/bin/perf) > >> [perf@five ~]$ getcap /home/perf/bin/perf > >> /home/perf/bin/perf = cap_sys_ptrace,cap_syslog,38+ep > >> [perf@five ~]$ groups > >> perf perf_users > >> [perf@five ~]$ id > >> uid=1002(perf) gid=1002(perf) groups=1002(perf),1003(perf_users) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 > >> [perf@five ~]$ perf top --stdio > >> Error: > >> Failed to mmap with 1 (Operation not permitted) > >> [perf@five ~]$ perf record -a > >> ^C[ perf record: Woken up 1 times to write data ] > >> [ perf record: Captured and wrote 1.177 MB perf.data (1552 samples) ] > >> > >> [perf@five ~]$ perf evlist > >> cycles:u > >> [perf@five ~]$ > > > > Humm, perf record falls back to cycles:u after initially trying cycles > > (i.e. kernel and userspace), lemme see trying 'perf top -e cycles:u', > > lemme test, humm not really: > > > > [perf@five ~]$ perf top --stdio -e cycles:u > > Error: > > Failed to mmap with 1 (Operation not permitted) > > [perf@five ~]$ perf record -e cycles:u -a sleep 1 > > [ perf record: Woken up 1 times to write data ] > > [ perf record: Captured and wrote 1.123 MB perf.data (132 samples) ] > > [perf@five ~]$ > > > > Back to debugging this. > > Could makes sense adding cap_ipc_lock to the binary to isolate from this: > > kernel/events/core.c: 6101 > if ((locked > lock_limit) && perf_is_paranoid() && > !capable(CAP_IPC_LOCK)) { > ret = -EPERM; > goto unlock; > } That did the trick, I'll update the documentation and include in my "Committer testing" section: [perf@five ~]$ groups perf perf_users [perf@five ~]$ ls -lahF bin/perf -rwxr-x---. 1 root perf_users 24M Apr 7 10:34 bin/perf* [perf@five ~]$ getcap bin/perf bin/perf = cap_ipc_lock,cap_sys_ptrace,cap_syslog,38+ep [perf@five ~]$ [perf@five ~]$ perf top --stdio PerfTop: 652 irqs/sec kernel:73.8% exact: 99.7% lost: 0/0 drop: 0/0 [4000Hz cycles:u], (all, 12 CPUs) --------------------------------------------------------------------------------------------------------------- 13.03% [kernel] [k] module_get_kallsym 5.25% [kernel] [k] kallsyms_expand_symbol.constprop.0 5.00% libc-2.30.so [.] __GI_____strtoull_l_internal 4.41% [kernel] [k] memcpy 3.42% [kernel] [k] vsnprintf 2.98% perf [.] map__process_kallsym_symbol 2.86% [kernel] [k] format_decode 2.73% [kernel] [k] number 2.70% perf [.] rb_next 2.59% perf [.] maps__split_kallsyms 2.54% [kernel] [k] string_nocheck 1.90% libc-2.30.so [.] _IO_getdelim 1.86% [kernel] [k] __x86_indirect_thunk_rax 1.53% libc-2.30.so [.] _int_malloc 1.48% libc-2.30.so [.] __memmove_avx_unaligned_erms 1.40% [kernel] [k] clear_page_rep 1.07% perf [.] rb_insert_color 1.01% libc-2.30.so [.] _IO_feof 0.99% perf [.] __dso__load_kallsyms 0.98% [kernel] [k] s_next 0.96% perf [.] __rblist__findnew 0.95% [kernel] [k] strlen 0.95% perf [.] arch__symbols__fixup_end 0.94% libpixman-1.so.0.38.4 [.] 0x000000000006f4af 0.94% perf [.] symbol__new 0.89% libpixman-1.so.0.38.4 [.] 0x000000000006f4a0 0.86% [kernel] [k] seq_read 0.81% libpixman-1.so.0.38.4 [.] 0x000000000006f4ab 0.80% perf [.] __symbols__insert 0.73% libpixman-1.so.0.38.4 [.] 0x000000000006f4a7 0.67% [kernel] [k] s_show 0.66% libc-2.30.so [.] __libc_calloc 0.61% libpixman-1.so.0.38.4 [.] 0x000000000006f4bb 0.59% [kernel] [k] get_page_from_freelist 0.59% perf [.] memcpy@plt 0.58% perf [.] eprintf exiting. [perf@five ~]$ There is still something strange in here, the event is cycles:u (see at the PerfTop line, but it is getting kernel samples :-\ - Arnaldo
Em Tue, Apr 07, 2020 at 01:36:54PM -0300, Arnaldo Carvalho de Melo escreveu: > Em Tue, Apr 07, 2020 at 05:54:27PM +0300, Alexey Budankov escreveu: > > On 07.04.2020 17:35, Arnaldo Carvalho de Melo wrote: > > > Em Tue, Apr 07, 2020 at 11:30:14AM -0300, Arnaldo Carvalho de Melo escreveu: > > >> [perf@five ~]$ type perf > > >> perf is hashed (/home/perf/bin/perf) > > >> [perf@five ~]$ getcap /home/perf/bin/perf > > >> /home/perf/bin/perf = cap_sys_ptrace,cap_syslog,38+ep > > >> [perf@five ~]$ groups > > >> perf perf_users > > >> [perf@five ~]$ id > > >> uid=1002(perf) gid=1002(perf) groups=1002(perf),1003(perf_users) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 > > >> [perf@five ~]$ perf top --stdio > > >> Error: > > >> Failed to mmap with 1 (Operation not permitted) > > >> [perf@five ~]$ perf record -a > > >> ^C[ perf record: Woken up 1 times to write data ] > > >> [ perf record: Captured and wrote 1.177 MB perf.data (1552 samples) ] > > >> > > >> [perf@five ~]$ perf evlist > > >> cycles:u > > >> [perf@five ~]$ > > > > > > Humm, perf record falls back to cycles:u after initially trying cycles > > > (i.e. kernel and userspace), lemme see trying 'perf top -e cycles:u', > > > lemme test, humm not really: > > > > > > [perf@five ~]$ perf top --stdio -e cycles:u > > > Error: > > > Failed to mmap with 1 (Operation not permitted) > > > [perf@five ~]$ perf record -e cycles:u -a sleep 1 > > > [ perf record: Woken up 1 times to write data ] > > > [ perf record: Captured and wrote 1.123 MB perf.data (132 samples) ] > > > [perf@five ~]$ > > > > > > Back to debugging this. > > > > Could makes sense adding cap_ipc_lock to the binary to isolate from this: > > > > kernel/events/core.c: 6101 > > if ((locked > lock_limit) && perf_is_paranoid() && > > !capable(CAP_IPC_LOCK)) { > > ret = -EPERM; > > goto unlock; > > } > > > That did the trick, I'll update the documentation and include in my > "Committer testing" section: I ammended this to that patch, please check the wording: - Arnaldo diff --git a/Documentation/admin-guide/perf-security.rst b/Documentation/admin-guide/perf-security.rst index c0ca0c1a6804..ed33682e26b0 100644 --- a/Documentation/admin-guide/perf-security.rst +++ b/Documentation/admin-guide/perf-security.rst @@ -127,12 +127,19 @@ taken to create such groups of privileged Perf users. :: - # setcap "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf - # setcap -v "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf + # setcap "cap_perfmon,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf + # setcap -v "cap_perfmon,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf perf: OK # getcap perf perf = cap_sys_ptrace,cap_syslog,cap_perfmon+ep +If the libcap installed doesn't yet support "cap_perfmon", use "38" instead, +i.e.: + +:: + + # setcap "38,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf + As a result, members of perf_users group are capable of conducting performance monitoring and observability by using functionality of the configured Perf tool executable that, when executes, passes perf_events
On 07.04.2020 19:36, Arnaldo Carvalho de Melo wrote: > Em Tue, Apr 07, 2020 at 05:54:27PM +0300, Alexey Budankov escreveu: >> On 07.04.2020 17:35, Arnaldo Carvalho de Melo wrote: >>> Em Tue, Apr 07, 2020 at 11:30:14AM -0300, Arnaldo Carvalho de Melo escreveu: >>>> [perf@five ~]$ type perf <SNIP> >>>> perf is hashed (/home/perf/bin/perf) >>>> [perf@five ~]$ >>> >>> Humm, perf record falls back to cycles:u after initially trying cycles >>> (i.e. kernel and userspace), lemme see trying 'perf top -e cycles:u', >>> lemme test, humm not really: >>> >>> [perf@five ~]$ perf top --stdio -e cycles:u >>> Error: >>> Failed to mmap with 1 (Operation not permitted) >>> [perf@five ~]$ perf record -e cycles:u -a sleep 1 >>> [ perf record: Woken up 1 times to write data ] >>> [ perf record: Captured and wrote 1.123 MB perf.data (132 samples) ] >>> [perf@five ~]$ >>> >>> Back to debugging this. >> >> Could makes sense adding cap_ipc_lock to the binary to isolate from this: >> >> kernel/events/core.c: 6101 >> if ((locked > lock_limit) && perf_is_paranoid() && >> !capable(CAP_IPC_LOCK)) { >> ret = -EPERM; >> goto unlock; >> } > > > That did the trick, I'll update the documentation and include in my > "Committer testing" section: Looks like top mode somehow reaches perf mmap limit described here [1]. Using -m option solves the issue avoiding cap_ipc_lock on my 8 cores machine: perf top -e cycles -m 1 ~Alexey [1] https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html#memory-allocation
Em Tue, Apr 07, 2020 at 01:36:54PM -0300, Arnaldo Carvalho de Melo escreveu: > Em Tue, Apr 07, 2020 at 05:54:27PM +0300, Alexey Budankov escreveu: > > On 07.04.2020 17:35, Arnaldo Carvalho de Melo wrote: > > > Em Tue, Apr 07, 2020 at 11:30:14AM -0300, Arnaldo Carvalho de Melo escreveu: > > >> [perf@five ~]$ type perf > > >> perf is hashed (/home/perf/bin/perf) > > >> [perf@five ~]$ getcap /home/perf/bin/perf > > >> /home/perf/bin/perf = cap_sys_ptrace,cap_syslog,38+ep > > >> [perf@five ~]$ groups > > >> perf perf_users > > >> [perf@five ~]$ id > > >> uid=1002(perf) gid=1002(perf) groups=1002(perf),1003(perf_users) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 > > >> [perf@five ~]$ perf top --stdio > > >> Error: > > >> Failed to mmap with 1 (Operation not permitted) > > >> [perf@five ~]$ perf record -a > > >> ^C[ perf record: Woken up 1 times to write data ] > > >> [ perf record: Captured and wrote 1.177 MB perf.data (1552 samples) ] > > >> > > >> [perf@five ~]$ perf evlist > > >> cycles:u > > >> [perf@five ~]$ > > > > > > Humm, perf record falls back to cycles:u after initially trying cycles > > > (i.e. kernel and userspace), lemme see trying 'perf top -e cycles:u', > > > lemme test, humm not really: > > > > > > [perf@five ~]$ perf top --stdio -e cycles:u > > > Error: > > > Failed to mmap with 1 (Operation not permitted) > > > [perf@five ~]$ perf record -e cycles:u -a sleep 1 > > > [ perf record: Woken up 1 times to write data ] > > > [ perf record: Captured and wrote 1.123 MB perf.data (132 samples) ] > > > [perf@five ~]$ > > > > > > Back to debugging this. > > > > Could makes sense adding cap_ipc_lock to the binary to isolate from this: > > > > kernel/events/core.c: 6101 > > if ((locked > lock_limit) && perf_is_paranoid() && > > !capable(CAP_IPC_LOCK)) { > > ret = -EPERM; > > goto unlock; > > } > > > That did the trick, I'll update the documentation and include in my > "Committer testing" section: > > [perf@five ~]$ groups > perf perf_users > [perf@five ~]$ ls -lahF bin/perf > -rwxr-x---. 1 root perf_users 24M Apr 7 10:34 bin/perf* > [perf@five ~]$ getcap bin/perf > bin/perf = cap_ipc_lock,cap_sys_ptrace,cap_syslog,38+ep > [perf@five ~]$ > [perf@five ~]$ perf top --stdio > > > PerfTop: 652 irqs/sec kernel:73.8% exact: 99.7% lost: 0/0 drop: 0/0 [4000Hz cycles:u], (all, 12 CPUs) > --------------------------------------------------------------------------------------------------------------- > > 13.03% [kernel] [k] module_get_kallsym > 5.25% [kernel] [k] kallsyms_expand_symbol.constprop.0 > 5.00% libc-2.30.so [.] __GI_____strtoull_l_internal > 4.41% [kernel] [k] memcpy > 3.42% [kernel] [k] vsnprintf > 2.98% perf [.] map__process_kallsym_symbol > 2.86% [kernel] [k] format_decode > 2.73% [kernel] [k] number > 2.70% perf [.] rb_next > 2.59% perf [.] maps__split_kallsyms > 2.54% [kernel] [k] string_nocheck > 1.90% libc-2.30.so [.] _IO_getdelim > 1.86% [kernel] [k] __x86_indirect_thunk_rax > 1.53% libc-2.30.so [.] _int_malloc > 1.48% libc-2.30.so [.] __memmove_avx_unaligned_erms > 1.40% [kernel] [k] clear_page_rep > 1.07% perf [.] rb_insert_color > 1.01% libc-2.30.so [.] _IO_feof > 0.99% perf [.] __dso__load_kallsyms > 0.98% [kernel] [k] s_next > 0.96% perf [.] __rblist__findnew > 0.95% [kernel] [k] strlen > 0.95% perf [.] arch__symbols__fixup_end > 0.94% libpixman-1.so.0.38.4 [.] 0x000000000006f4af > 0.94% perf [.] symbol__new > 0.89% libpixman-1.so.0.38.4 [.] 0x000000000006f4a0 > 0.86% [kernel] [k] seq_read > 0.81% libpixman-1.so.0.38.4 [.] 0x000000000006f4ab > 0.80% perf [.] __symbols__insert > 0.73% libpixman-1.so.0.38.4 [.] 0x000000000006f4a7 > 0.67% [kernel] [k] s_show > 0.66% libc-2.30.so [.] __libc_calloc > 0.61% libpixman-1.so.0.38.4 [.] 0x000000000006f4bb > 0.59% [kernel] [k] get_page_from_freelist > 0.59% perf [.] memcpy@plt > 0.58% perf [.] eprintf > exiting. > [perf@five ~]$ > > There is still something strange in here, the event is cycles:u (see at > the PerfTop line, but it is getting kernel samples :-\ So running with 'perf top --stdio -vv 2> /tmp/output' I see we try create three events, the first is some capability querying, then we try to determine the max precision level, but continue with attr.exclude_kernel=1, which shouldn't be the case, perhaps we're seeing that it is not the root in the tooling part, and end up setting that to 1 as, previously, we knew it would fail, so we should switch to checking if we have cap_perfmon too, will check that: ------------------------------------------------------------ perf_event_attr: type 1 size 120 config 0x9 watermark 1 sample_id_all 1 bpf_event 1 { wakeup_events, wakeup_watermark } 1 ------------------------------------------------------------ ------------------------------------------------------------ perf_event_attr: size 120 { sample_period, sample_freq } 4000 sample_type IP|TID|TIME|CPU|PERIOD read_format ID disabled 1 inherit 1 exclude_kernel 1 mmap 1 comm 1 freq 1 task 1 precise_ip 3 sample_id_all 1 exclude_guest 1 mmap2 1 comm_exec 1 ksymbol 1 bpf_event 1 ------------------------------------------------------------ ------------------------------------------------------------ perf_event_attr: size 120 { sample_period, sample_freq } 4000 sample_type IP|TID|TIME|CPU|PERIOD read_format ID disabled 1 inherit 1 exclude_kernel 1 mmap 1 comm 1 freq 1 task 1 precise_ip 2 sample_id_all 1 exclude_guest 1 mmap2 1 comm_exec 1 ksymbol 1 bpf_event 1 ------------------------------------------------------------ But then, even with that attr.exclude_kernel set to 1 we _still_ get kernel samples, which looks like another bug, now trying with strace, which leads us to another rabbit hole: [perf@five ~]$ strace -e perf_event_open -o /tmp/out.put perf top --stdio Error: You may not have permission to collect system-wide stats. Consider tweaking /proc/sys/kernel/perf_event_paranoid, which controls use of the performance events system by unprivileged users (without CAP_PERFMON or CAP_SYS_ADMIN). The current value is 2: -1: Allow use of (almost) all events by all users Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK >= 0: Disallow ftrace function tracepoint by users without CAP_PERFMON or CAP_SYS_ADMIN Disallow raw tracepoint access by users without CAP_SYS_PERFMON or CAP_SYS_ADMIN >= 1: Disallow CPU event access by users without CAP_PERFMON or CAP_SYS_ADMIN >= 2: Disallow kernel profiling by users without CAP_PERFMON or CAP_SYS_ADMIN To make this setting permanent, edit /etc/sysctl.conf too, e.g.: kernel.perf_event_paranoid = -1 [perf@five ~]$ If I remove that strace -e ... from the front, 'perf top' is back working as a non-cap_sys_admin user, just with cap_perfmon. - Arnaldo
Em Tue, Apr 07, 2020 at 07:52:56PM +0300, Alexey Budankov escreveu: > > On 07.04.2020 19:36, Arnaldo Carvalho de Melo wrote: > > Em Tue, Apr 07, 2020 at 05:54:27PM +0300, Alexey Budankov escreveu: > >> Could makes sense adding cap_ipc_lock to the binary to isolate from this: > >> kernel/events/core.c: 6101 > >> if ((locked > lock_limit) && perf_is_paranoid() && > >> !capable(CAP_IPC_LOCK)) { > >> ret = -EPERM; > >> goto unlock; > >> } > > That did the trick, I'll update the documentation and include in my > > "Committer testing" section: > Looks like top mode somehow reaches perf mmap limit described here [1]. > Using -m option solves the issue avoiding cap_ipc_lock on my 8 cores machine: > perf top -e cycles -m 1 So this would read better? diff --git a/Documentation/admin-guide/perf-security.rst b/Documentation/admin-guide/perf-security.rst index ed33682e26b0..d44dd24b0244 100644 --- a/Documentation/admin-guide/perf-security.rst +++ b/Documentation/admin-guide/perf-security.rst @@ -127,8 +127,8 @@ taken to create such groups of privileged Perf users. :: - # setcap "cap_perfmon,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf - # setcap -v "cap_perfmon,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf + # setcap "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf + # setcap -v "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf perf: OK # getcap perf perf = cap_sys_ptrace,cap_syslog,cap_perfmon+ep @@ -140,6 +140,10 @@ i.e.: # setcap "38,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf +Note that you may need to have 'cap_ipc_lock' in the mix for tools such as +'perf top', alternatively use 'perf top -m N', to reduce the memory that +it uses for the perf ring buffer, see the memory allocation section below. + As a result, members of perf_users group are capable of conducting performance monitoring and observability by using functionality of the configured Perf tool executable that, when executes, passes perf_events
On 07.04.2020 19:40, Arnaldo Carvalho de Melo wrote: > Em Tue, Apr 07, 2020 at 01:36:54PM -0300, Arnaldo Carvalho de Melo escreveu: >> Em Tue, Apr 07, 2020 at 05:54:27PM +0300, Alexey Budankov escreveu: >>> On 07.04.2020 17:35, Arnaldo Carvalho de Melo wrote: >>>> Em Tue, Apr 07, 2020 at 11:30:14AM -0300, Arnaldo Carvalho de Melo escreveu: >>>>> [perf@five ~]$ type perf >>>>> perf is hashed (/home/perf/bin/perf) >>>>> [perf@five ~]$ getcap /home/perf/bin/perf >>>>> /home/perf/bin/perf = cap_sys_ptrace,cap_syslog,38+ep >>>>> [perf@five ~]$ groups >>>>> perf perf_users >>>>> [perf@five ~]$ id >>>>> uid=1002(perf) gid=1002(perf) groups=1002(perf),1003(perf_users) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 >>>>> [perf@five ~]$ perf top --stdio >>>>> Error: >>>>> Failed to mmap with 1 (Operation not permitted) >>>>> [perf@five ~]$ perf record -a >>>>> ^C[ perf record: Woken up 1 times to write data ] >>>>> [ perf record: Captured and wrote 1.177 MB perf.data (1552 samples) ] >>>>> >>>>> [perf@five ~]$ perf evlist >>>>> cycles:u >>>>> [perf@five ~]$ >>>> >>>> Humm, perf record falls back to cycles:u after initially trying cycles >>>> (i.e. kernel and userspace), lemme see trying 'perf top -e cycles:u', >>>> lemme test, humm not really: >>>> >>>> [perf@five ~]$ perf top --stdio -e cycles:u >>>> Error: >>>> Failed to mmap with 1 (Operation not permitted) >>>> [perf@five ~]$ perf record -e cycles:u -a sleep 1 >>>> [ perf record: Woken up 1 times to write data ] >>>> [ perf record: Captured and wrote 1.123 MB perf.data (132 samples) ] >>>> [perf@five ~]$ >>>> >>>> Back to debugging this. >>> >>> Could makes sense adding cap_ipc_lock to the binary to isolate from this: >>> >>> kernel/events/core.c: 6101 >>> if ((locked > lock_limit) && perf_is_paranoid() && >>> !capable(CAP_IPC_LOCK)) { >>> ret = -EPERM; >>> goto unlock; >>> } >> >> >> That did the trick, I'll update the documentation and include in my >> "Committer testing" section: > > I ammended this to that patch, please check the wording: > > - Arnaldo > > diff --git a/Documentation/admin-guide/perf-security.rst b/Documentation/admin-guide/perf-security.rst > index c0ca0c1a6804..ed33682e26b0 100644 > --- a/Documentation/admin-guide/perf-security.rst > +++ b/Documentation/admin-guide/perf-security.rst > @@ -127,12 +127,19 @@ taken to create such groups of privileged Perf users. > > :: > > - # setcap "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf > - # setcap -v "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf > + # setcap "cap_perfmon,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf > + # setcap -v "cap_perfmon,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf > perf: OK > # getcap perf > perf = cap_sys_ptrace,cap_syslog,cap_perfmon+ep > > +If the libcap installed doesn't yet support "cap_perfmon", use "38" instead, > +i.e.: > + > +:: > + > + # setcap "38,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf > + > As a result, members of perf_users group are capable of conducting > performance monitoring and observability by using functionality of the > configured Perf tool executable that, when executes, passes perf_events > Looks good to me. The paragraph just above should then also be extended to mention that perf_events subsystem memory limit is ignored due to usage of CAP_IPC_LOCK: "As a result, members of perf_users group are capable of conducting performance monitoring and observability by using functionality of the configured Perf tool executable that, when executes, passes perf_events subsystem scope and perf_event_mlock_kb locking limit checks." ~Alexey
Em Tue, Apr 07, 2020 at 01:56:43PM -0300, Arnaldo Carvalho de Melo escreveu: > > But then, even with that attr.exclude_kernel set to 1 we _still_ get > kernel samples, which looks like another bug, now trying with strace, > which leads us to another rabbit hole: > > [perf@five ~]$ strace -e perf_event_open -o /tmp/out.put perf top --stdio > Error: > You may not have permission to collect system-wide stats. > > Consider tweaking /proc/sys/kernel/perf_event_paranoid, > which controls use of the performance events system by > unprivileged users (without CAP_PERFMON or CAP_SYS_ADMIN). > > The current value is 2: > > -1: Allow use of (almost) all events by all users > Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK > >= 0: Disallow ftrace function tracepoint by users without CAP_PERFMON or CAP_SYS_ADMIN > Disallow raw tracepoint access by users without CAP_SYS_PERFMON or CAP_SYS_ADMIN > >= 1: Disallow CPU event access by users without CAP_PERFMON or CAP_SYS_ADMIN > >= 2: Disallow kernel profiling by users without CAP_PERFMON or CAP_SYS_ADMIN > > To make this setting permanent, edit /etc/sysctl.conf too, e.g.: > > kernel.perf_event_paranoid = -1 > > [perf@five ~]$ > > If I remove that strace -e ... from the front, 'perf top' is back > working as a non-cap_sys_admin user, just with cap_perfmon. > So I couldn't figure it out so far why is that exclude_kernel is being set to 1, as perf-top when no event is passed defaults to this to find out what to use as a default event: perf_evlist__add_default(top.evlist) perf_evsel__new_cycles(true); struct perf_event_attr attr = { .type = PERF_TYPE_HARDWARE, .config = PERF_COUNT_HW_CPU_CYCLES, .exclude_kernel = !perf_event_can_profile_kernel(), }; perf_event_paranoid_check(1); return perf_cap__capable(CAP_SYS_ADMIN) || perf_cap__capable(CAP_PERFMON) || perf_event_paranoid() <= max_level; And then that second condition should hold true, it returns true, and then .exclude_kernel should be set to !true -> zero.o Now the wallclock says I need to stop being a programmer and turn into a daycare provider for Pedro, cya! - Arnaldo
On 07.04.2020 20:02, Arnaldo Carvalho de Melo wrote: > Em Tue, Apr 07, 2020 at 07:52:56PM +0300, Alexey Budankov escreveu: >> >> On 07.04.2020 19:36, Arnaldo Carvalho de Melo wrote: >>> Em Tue, Apr 07, 2020 at 05:54:27PM +0300, Alexey Budankov escreveu: >>>> Could makes sense adding cap_ipc_lock to the binary to isolate from this: > >>>> kernel/events/core.c: 6101 >>>> if ((locked > lock_limit) && perf_is_paranoid() && >>>> !capable(CAP_IPC_LOCK)) { >>>> ret = -EPERM; >>>> goto unlock; >>>> } > >>> That did the trick, I'll update the documentation and include in my >>> "Committer testing" section: > >> Looks like top mode somehow reaches perf mmap limit described here [1]. >> Using -m option solves the issue avoiding cap_ipc_lock on my 8 cores machine: >> perf top -e cycles -m 1 > > So this would read better? > > diff --git a/Documentation/admin-guide/perf-security.rst b/Documentation/admin-guide/perf-security.rst > index ed33682e26b0..d44dd24b0244 100644 > --- a/Documentation/admin-guide/perf-security.rst > +++ b/Documentation/admin-guide/perf-security.rst > @@ -127,8 +127,8 @@ taken to create such groups of privileged Perf users. > > :: > > - # setcap "cap_perfmon,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf > - # setcap -v "cap_perfmon,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf > + # setcap "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf > + # setcap -v "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf > perf: OK > # getcap perf > perf = cap_sys_ptrace,cap_syslog,cap_perfmon+ep > @@ -140,6 +140,10 @@ i.e.: > > # setcap "38,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf > > +Note that you may need to have 'cap_ipc_lock' in the mix for tools such as > +'perf top', alternatively use 'perf top -m N', to reduce the memory that > +it uses for the perf ring buffer, see the memory allocation section below. > + Let's stay with the first variant of you addition to this patch and also extend the paragraph below as suggested in other mail in the thread. > As a result, members of perf_users group are capable of conducting > performance monitoring and observability by using functionality of the > configured Perf tool executable that, when executes, passes perf_events > Thanks, Alexey
Hi Alexey, > Currently access to perf_events, i915_perf and other performance > monitoring and observability subsystems of the kernel is open only for > a privileged process [1] with CAP_SYS_ADMIN capability enabled in the > process effective set [2]. > > This patch set introduces CAP_PERFMON capability designed to secure > system performance monitoring and observability operations so that > CAP_PERFMON would assist CAP_SYS_ADMIN capability in its governing role > for performance monitoring and observability subsystems of the kernel. I'm seeing an issue with CAP_PERFMON when I try to record data for a specific target. I don't know whether this is sort of a regression or an expected behavior. Without setting CAP_PERFMON: $ getcap ./perf $ ./perf stat -a ls Error: Access to performance monitoring and observability operations is limited. $ ./perf stat ls Performance counter stats for 'ls': 2.06 msec task-clock:u # 0.418 CPUs utilized 0 context-switches:u # 0.000 K/sec 0 cpu-migrations:u # 0.000 K/sec With CAP_PERFMON: $ getcap ./perf ./perf = cap_perfmon+ep $ ./perf stat -a ls Performance counter stats for 'system wide': 142.42 msec cpu-clock # 25.062 CPUs utilized 182 context-switches # 0.001 M/sec 48 cpu-migrations # 0.337 K/sec $ ./perf stat ls Error: Access to performance monitoring and observability operations is limited. Am I missing something silly? Analysis: --------- A bit more analysis lead me to below kernel code fs/exec.c: begin_new_exec() { ... if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP || !(uid_eq(current_euid(), current_uid()) && gid_eq(current_egid(), current_gid()))) set_dumpable(current->mm, suid_dumpable); else set_dumpable(current->mm, SUID_DUMP_USER); ... commit_creds(bprm->cred); } When I execute './perf stat ls', it's going into else condition and thus sets dumpable flag as SUID_DUMP_USER. Then in commit_creds(): int commit_creds(struct cred *new) { ... /* dumpability changes */ if (... !cred_cap_issubset(old, new)) { if (task->mm) set_dumpable(task->mm, suid_dumpable); } !cred_cap_issubset(old, new) fails for perf without any capability and thus it doesn't execute set_dumpable(). Whereas that condition passes for perf with CAP_PERFMON and thus it overwrites old value (SUID_DUMP_USER) with suid_dumpable in mm_flags. On an Ubuntu, suid_dumpable default value is SUID_DUMP_ROOT. On Fedora, it's SUID_DUMP_DISABLE. (/proc/sys/fs/suid_dumpable). Now while opening an event: perf_event_open() ptrace_may_access() __ptrace_may_access() { ... if (mm && ((get_dumpable(mm) != SUID_DUMP_USER) && !ptrace_has_cap(cred, mm->user_ns, mode))) return -EPERM; } This if condition passes for perf with CAP_PERFMON and thus it returns -EPERM. But it fails for perf without CAP_PERFMON and thus it goes ahead and returns success. So opening an event fails when perf has CAP_PREFMON and tries to open process specific event as normal user. Workarounds: ------------ Based on above analysis, I found couple of workarounds (examples are on Ubuntu 18.04.4 powerpc): Workaround1: Setting SUID_DUMP_USER as default (in /proc/sys/fs/suid_dumpable) solves the issue. # echo 1 > /proc/sys/fs/suid_dumpable $ getcap ./perf ./perf = cap_perfmon+ep $ ./perf stat ls Performance counter stats for 'ls': 1.47 msec task-clock # 0.806 CPUs utilized 0 context-switches # 0.000 K/sec 0 cpu-migrations # 0.000 K/sec Workaround2: Using CAP_SYS_PTRACE along with CAP_PERFMON solves the issue. $ cat /proc/sys/fs/suid_dumpable 2 # setcap "cap_perfmon,cap_sys_ptrace=ep" ./perf $ ./perf stat ls Performance counter stats for 'ls': 1.41 msec task-clock # 0.826 CPUs utilized 0 context-switches # 0.000 K/sec 0 cpu-migrations # 0.000 K/sec Workaround3: Adding CAP_PERFMON to parent of perf (/bin/bash) also solves the issue. $ cat /proc/sys/fs/suid_dumpable 2 # setcap "cap_perfmon=ep" /bin/bash # setcap "cap_perfmon=ep" ./perf $ bash $ ./perf stat ls Performance counter stats for 'ls': 1.47 msec task-clock # 0.806 CPUs utilized 0 context-switches # 0.000 K/sec 0 cpu-migrations # 0.000 K/sec - Ravi
Hi Ravi, On 10.07.2020 16:31, Ravi Bangoria wrote: > Hi Alexey, > >> Currently access to perf_events, i915_perf and other performance >> monitoring and observability subsystems of the kernel is open only for >> a privileged process [1] with CAP_SYS_ADMIN capability enabled in the >> process effective set [2]. >> >> This patch set introduces CAP_PERFMON capability designed to secure >> system performance monitoring and observability operations so that >> CAP_PERFMON would assist CAP_SYS_ADMIN capability in its governing role >> for performance monitoring and observability subsystems of the kernel. > > I'm seeing an issue with CAP_PERFMON when I try to record data for a > specific target. I don't know whether this is sort of a regression or > an expected behavior. Thanks for reporting and root causing this case. The behavior looks like kind of expected since currently CAP_PERFMON takes over the related part of CAP_SYS_ADMIN credentials only. Actually Perf security docs [1] say that access control is also subject to CAP_SYS_PTRACE credentials. CAP_PERFMON could be used to extend and substitute ptrace_may_access() check in perf_events subsystem to simplify user experience at least in this specific case. Alexei [1] https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html > > Without setting CAP_PERFMON: > > $ getcap ./perf > $ ./perf stat -a ls > Error: > Access to performance monitoring and observability operations is limited. > $ ./perf stat ls > Performance counter stats for 'ls': > 2.06 msec task-clock:u # 0.418 CPUs utilized > 0 context-switches:u # 0.000 K/sec > 0 cpu-migrations:u # 0.000 K/sec > > With CAP_PERFMON: > > $ getcap ./perf > ./perf = cap_perfmon+ep > $ ./perf stat -a ls > Performance counter stats for 'system wide': > 142.42 msec cpu-clock # 25.062 CPUs utilized > 182 context-switches # 0.001 M/sec > 48 cpu-migrations # 0.337 K/sec > $ ./perf stat ls > Error: > Access to performance monitoring and observability operations is limited. > > Am I missing something silly? > > Analysis: > --------- > A bit more analysis lead me to below kernel code fs/exec.c: > > begin_new_exec() > { > ... > if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP || > !(uid_eq(current_euid(), current_uid()) && > gid_eq(current_egid(), current_gid()))) > set_dumpable(current->mm, suid_dumpable); > else > set_dumpable(current->mm, SUID_DUMP_USER); > > ... > commit_creds(bprm->cred); > } > > When I execute './perf stat ls', it's going into else condition and thus sets > dumpable flag as SUID_DUMP_USER. Then in commit_creds(): > > int commit_creds(struct cred *new) > { > ... > /* dumpability changes */ > if (... > !cred_cap_issubset(old, new)) { > if (task->mm) > set_dumpable(task->mm, suid_dumpable); > } > > !cred_cap_issubset(old, new) fails for perf without any capability and thus > it doesn't execute set_dumpable(). Whereas that condition passes for perf > with CAP_PERFMON and thus it overwrites old value (SUID_DUMP_USER) with > suid_dumpable in mm_flags. On an Ubuntu, suid_dumpable default value is > SUID_DUMP_ROOT. On Fedora, it's SUID_DUMP_DISABLE. (/proc/sys/fs/suid_dumpable). > > Now while opening an event: > > perf_event_open() > ptrace_may_access() > __ptrace_may_access() { > ... > if (mm && > ((get_dumpable(mm) != SUID_DUMP_USER) && > !ptrace_has_cap(cred, mm->user_ns, mode))) > return -EPERM; > } > > This if condition passes for perf with CAP_PERFMON and thus it returns -EPERM. > But it fails for perf without CAP_PERFMON and thus it goes ahead and returns > success. So opening an event fails when perf has CAP_PREFMON and tries to open > process specific event as normal user. > > Workarounds: > ------------ > Based on above analysis, I found couple of workarounds (examples are on > Ubuntu 18.04.4 powerpc): > > Workaround1: > Setting SUID_DUMP_USER as default (in /proc/sys/fs/suid_dumpable) solves the > issue. > > # echo 1 > /proc/sys/fs/suid_dumpable > $ getcap ./perf > ./perf = cap_perfmon+ep > $ ./perf stat ls > Performance counter stats for 'ls': > 1.47 msec task-clock # 0.806 CPUs utilized > 0 context-switches # 0.000 K/sec > 0 cpu-migrations # 0.000 K/sec > > Workaround2: > Using CAP_SYS_PTRACE along with CAP_PERFMON solves the issue. > > $ cat /proc/sys/fs/suid_dumpable > 2 > # setcap "cap_perfmon,cap_sys_ptrace=ep" ./perf > $ ./perf stat ls > Performance counter stats for 'ls': > 1.41 msec task-clock # 0.826 CPUs utilized > 0 context-switches # 0.000 K/sec > 0 cpu-migrations # 0.000 K/sec > > Workaround3: > Adding CAP_PERFMON to parent of perf (/bin/bash) also solves the issue. > > $ cat /proc/sys/fs/suid_dumpable > 2 > # setcap "cap_perfmon=ep" /bin/bash > # setcap "cap_perfmon=ep" ./perf > $ bash > $ ./perf stat ls > Performance counter stats for 'ls': > 1.47 msec task-clock # 0.806 CPUs utilized > 0 context-switches # 0.000 K/sec > 0 cpu-migrations # 0.000 K/sec > > - Ravi
Em Fri, Jul 10, 2020 at 05:30:50PM +0300, Alexey Budankov escreveu: > On 10.07.2020 16:31, Ravi Bangoria wrote: > >> Currently access to perf_events, i915_perf and other performance > >> monitoring and observability subsystems of the kernel is open only for > >> a privileged process [1] with CAP_SYS_ADMIN capability enabled in the > >> process effective set [2]. > >> This patch set introduces CAP_PERFMON capability designed to secure > >> system performance monitoring and observability operations so that > >> CAP_PERFMON would assist CAP_SYS_ADMIN capability in its governing role > >> for performance monitoring and observability subsystems of the kernel. > > I'm seeing an issue with CAP_PERFMON when I try to record data for a > > specific target. I don't know whether this is sort of a regression or > > an expected behavior. > Thanks for reporting and root causing this case. The behavior looks like > kind of expected since currently CAP_PERFMON takes over the related part > of CAP_SYS_ADMIN credentials only. Actually Perf security docs [1] say > that access control is also subject to CAP_SYS_PTRACE credentials. I think that stating that in the error message would be helpful, after all, who reads docs? 8-) I.e., this: $ ./perf stat ls Error: Access to performance monitoring and observability operations is limited. $ Could become: $ ./perf stat ls Error: Access to performance monitoring and observability operations is limited. Right now only CAP_PERFMON is granted, you may need CAP_SYS_PTRACE. $ - Arnaldo > CAP_PERFMON could be used to extend and substitute ptrace_may_access() > check in perf_events subsystem to simplify user experience at least in > this specific case. > > Alexei > > [1] https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html > > > > > Without setting CAP_PERFMON: > > > > $ getcap ./perf > > $ ./perf stat -a ls > > Error: > > Access to performance monitoring and observability operations is limited. > > $ ./perf stat ls > > Performance counter stats for 'ls': > > 2.06 msec task-clock:u # 0.418 CPUs utilized > > 0 context-switches:u # 0.000 K/sec > > 0 cpu-migrations:u # 0.000 K/sec > > > > With CAP_PERFMON: > > > > $ getcap ./perf > > ./perf = cap_perfmon+ep > > $ ./perf stat -a ls > > Performance counter stats for 'system wide': > > 142.42 msec cpu-clock # 25.062 CPUs utilized > > 182 context-switches # 0.001 M/sec > > 48 cpu-migrations # 0.337 K/sec > > $ ./perf stat ls > > Error: > > Access to performance monitoring and observability operations is limited. > > > > Am I missing something silly? > > > > Analysis: > > --------- > > A bit more analysis lead me to below kernel code fs/exec.c: > > > > begin_new_exec() > > { > > ... > > if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP || > > !(uid_eq(current_euid(), current_uid()) && > > gid_eq(current_egid(), current_gid()))) > > set_dumpable(current->mm, suid_dumpable); > > else > > set_dumpable(current->mm, SUID_DUMP_USER); > > > > ... > > commit_creds(bprm->cred); > > } > > > > When I execute './perf stat ls', it's going into else condition and thus sets > > dumpable flag as SUID_DUMP_USER. Then in commit_creds(): > > > > int commit_creds(struct cred *new) > > { > > ... > > /* dumpability changes */ > > if (... > > !cred_cap_issubset(old, new)) { > > if (task->mm) > > set_dumpable(task->mm, suid_dumpable); > > } > > > > !cred_cap_issubset(old, new) fails for perf without any capability and thus > > it doesn't execute set_dumpable(). Whereas that condition passes for perf > > with CAP_PERFMON and thus it overwrites old value (SUID_DUMP_USER) with > > suid_dumpable in mm_flags. On an Ubuntu, suid_dumpable default value is > > SUID_DUMP_ROOT. On Fedora, it's SUID_DUMP_DISABLE. (/proc/sys/fs/suid_dumpable). > > > > Now while opening an event: > > > > perf_event_open() > > ptrace_may_access() > > __ptrace_may_access() { > > ... > > if (mm && > > ((get_dumpable(mm) != SUID_DUMP_USER) && > > !ptrace_has_cap(cred, mm->user_ns, mode))) > > return -EPERM; > > } > > > > This if condition passes for perf with CAP_PERFMON and thus it returns -EPERM. > > But it fails for perf without CAP_PERFMON and thus it goes ahead and returns > > success. So opening an event fails when perf has CAP_PREFMON and tries to open > > process specific event as normal user. > > > > Workarounds: > > ------------ > > Based on above analysis, I found couple of workarounds (examples are on > > Ubuntu 18.04.4 powerpc): > > > > Workaround1: > > Setting SUID_DUMP_USER as default (in /proc/sys/fs/suid_dumpable) solves the > > issue. > > > > # echo 1 > /proc/sys/fs/suid_dumpable > > $ getcap ./perf > > ./perf = cap_perfmon+ep > > $ ./perf stat ls > > Performance counter stats for 'ls': > > 1.47 msec task-clock # 0.806 CPUs utilized > > 0 context-switches # 0.000 K/sec > > 0 cpu-migrations # 0.000 K/sec > > > > Workaround2: > > Using CAP_SYS_PTRACE along with CAP_PERFMON solves the issue. > > > > $ cat /proc/sys/fs/suid_dumpable > > 2 > > # setcap "cap_perfmon,cap_sys_ptrace=ep" ./perf > > $ ./perf stat ls > > Performance counter stats for 'ls': > > 1.41 msec task-clock # 0.826 CPUs utilized > > 0 context-switches # 0.000 K/sec > > 0 cpu-migrations # 0.000 K/sec > > > > Workaround3: > > Adding CAP_PERFMON to parent of perf (/bin/bash) also solves the issue. > > > > $ cat /proc/sys/fs/suid_dumpable > > 2 > > # setcap "cap_perfmon=ep" /bin/bash > > # setcap "cap_perfmon=ep" ./perf > > $ bash > > $ ./perf stat ls > > Performance counter stats for 'ls': > > 1.47 msec task-clock # 0.806 CPUs utilized > > 0 context-switches # 0.000 K/sec > > 0 cpu-migrations # 0.000 K/sec > > > > - Ravi
On 10.07.2020 20:09, Arnaldo Carvalho de Melo wrote: > Em Fri, Jul 10, 2020 at 05:30:50PM +0300, Alexey Budankov escreveu: >> On 10.07.2020 16:31, Ravi Bangoria wrote: >>>> Currently access to perf_events, i915_perf and other performance >>>> monitoring and observability subsystems of the kernel is open only for >>>> a privileged process [1] with CAP_SYS_ADMIN capability enabled in the >>>> process effective set [2]. > >>>> This patch set introduces CAP_PERFMON capability designed to secure >>>> system performance monitoring and observability operations so that >>>> CAP_PERFMON would assist CAP_SYS_ADMIN capability in its governing role >>>> for performance monitoring and observability subsystems of the kernel. > >>> I'm seeing an issue with CAP_PERFMON when I try to record data for a >>> specific target. I don't know whether this is sort of a regression or >>> an expected behavior. > >> Thanks for reporting and root causing this case. The behavior looks like >> kind of expected since currently CAP_PERFMON takes over the related part >> of CAP_SYS_ADMIN credentials only. Actually Perf security docs [1] say >> that access control is also subject to CAP_SYS_PTRACE credentials. > > I think that stating that in the error message would be helpful, after > all, who reads docs? 8-) At least those who write it :D ... > > I.e., this: > > $ ./perf stat ls > Error: > Access to performance monitoring and observability operations is limited. > $ > > Could become: > > $ ./perf stat ls > Error: > Access to performance monitoring and observability operations is limited. > Right now only CAP_PERFMON is granted, you may need CAP_SYS_PTRACE. > $ It would better provide reference to perf security docs in the tool output. Looks like extending ptrace_may_access() check for perf_events with CAP_PERFMON makes monitoring simpler and even more secure to use since Perf tool need not to start/stop/single-step and read/write registers and memory and so on like a debugger or strace-like tool. What do you think? Alexei > > - Arnaldo > >> CAP_PERFMON could be used to extend and substitute ptrace_may_access() >> check in perf_events subsystem to simplify user experience at least in >> this specific case. >> >> Alexei >> >> [1] https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html >> >>> >>> Without setting CAP_PERFMON: >>> >>> $ getcap ./perf >>> $ ./perf stat -a ls >>> Error: >>> Access to performance monitoring and observability operations is limited. >>> $ ./perf stat ls >>> Performance counter stats for 'ls': >>> 2.06 msec task-clock:u # 0.418 CPUs utilized >>> 0 context-switches:u # 0.000 K/sec >>> 0 cpu-migrations:u # 0.000 K/sec >>> >>> With CAP_PERFMON: >>> >>> $ getcap ./perf >>> ./perf = cap_perfmon+ep >>> $ ./perf stat -a ls >>> Performance counter stats for 'system wide': >>> 142.42 msec cpu-clock # 25.062 CPUs utilized >>> 182 context-switches # 0.001 M/sec >>> 48 cpu-migrations # 0.337 K/sec >>> $ ./perf stat ls >>> Error: >>> Access to performance monitoring and observability operations is limited. >>> >>> Am I missing something silly? >>> >>> Analysis: >>> --------- >>> A bit more analysis lead me to below kernel code fs/exec.c: >>> >>> begin_new_exec() >>> { >>> ... >>> if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP || >>> !(uid_eq(current_euid(), current_uid()) && >>> gid_eq(current_egid(), current_gid()))) >>> set_dumpable(current->mm, suid_dumpable); >>> else >>> set_dumpable(current->mm, SUID_DUMP_USER); >>> >>> ... >>> commit_creds(bprm->cred); >>> } >>> >>> When I execute './perf stat ls', it's going into else condition and thus sets >>> dumpable flag as SUID_DUMP_USER. Then in commit_creds(): >>> >>> int commit_creds(struct cred *new) >>> { >>> ... >>> /* dumpability changes */ >>> if (... >>> !cred_cap_issubset(old, new)) { >>> if (task->mm) >>> set_dumpable(task->mm, suid_dumpable); >>> } >>> >>> !cred_cap_issubset(old, new) fails for perf without any capability and thus >>> it doesn't execute set_dumpable(). Whereas that condition passes for perf >>> with CAP_PERFMON and thus it overwrites old value (SUID_DUMP_USER) with >>> suid_dumpable in mm_flags. On an Ubuntu, suid_dumpable default value is >>> SUID_DUMP_ROOT. On Fedora, it's SUID_DUMP_DISABLE. (/proc/sys/fs/suid_dumpable). >>> >>> Now while opening an event: >>> >>> perf_event_open() >>> ptrace_may_access() >>> __ptrace_may_access() { >>> ... >>> if (mm && >>> ((get_dumpable(mm) != SUID_DUMP_USER) && >>> !ptrace_has_cap(cred, mm->user_ns, mode))) >>> return -EPERM; >>> } >>> >>> This if condition passes for perf with CAP_PERFMON and thus it returns -EPERM. >>> But it fails for perf without CAP_PERFMON and thus it goes ahead and returns >>> success. So opening an event fails when perf has CAP_PREFMON and tries to open >>> process specific event as normal user. >>> >>> Workarounds: >>> ------------ >>> Based on above analysis, I found couple of workarounds (examples are on >>> Ubuntu 18.04.4 powerpc): >>> >>> Workaround1: >>> Setting SUID_DUMP_USER as default (in /proc/sys/fs/suid_dumpable) solves the >>> issue. >>> >>> # echo 1 > /proc/sys/fs/suid_dumpable >>> $ getcap ./perf >>> ./perf = cap_perfmon+ep >>> $ ./perf stat ls >>> Performance counter stats for 'ls': >>> 1.47 msec task-clock # 0.806 CPUs utilized >>> 0 context-switches # 0.000 K/sec >>> 0 cpu-migrations # 0.000 K/sec >>> >>> Workaround2: >>> Using CAP_SYS_PTRACE along with CAP_PERFMON solves the issue. >>> >>> $ cat /proc/sys/fs/suid_dumpable >>> 2 >>> # setcap "cap_perfmon,cap_sys_ptrace=ep" ./perf >>> $ ./perf stat ls >>> Performance counter stats for 'ls': >>> 1.41 msec task-clock # 0.826 CPUs utilized >>> 0 context-switches # 0.000 K/sec >>> 0 cpu-migrations # 0.000 K/sec >>> >>> Workaround3: >>> Adding CAP_PERFMON to parent of perf (/bin/bash) also solves the issue. >>> >>> $ cat /proc/sys/fs/suid_dumpable >>> 2 >>> # setcap "cap_perfmon=ep" /bin/bash >>> # setcap "cap_perfmon=ep" ./perf >>> $ bash >>> $ ./perf stat ls >>> Performance counter stats for 'ls': >>> 1.47 msec task-clock # 0.806 CPUs utilized >>> 0 context-switches # 0.000 K/sec >>> 0 cpu-migrations # 0.000 K/sec >>> >>> - Ravi >
Em Mon, Jul 13, 2020 at 12:48:25PM +0300, Alexey Budankov escreveu: > > On 10.07.2020 20:09, Arnaldo Carvalho de Melo wrote: > > Em Fri, Jul 10, 2020 at 05:30:50PM +0300, Alexey Budankov escreveu: > >> On 10.07.2020 16:31, Ravi Bangoria wrote: > >>>> Currently access to perf_events, i915_perf and other performance > >>>> monitoring and observability subsystems of the kernel is open only for > >>>> a privileged process [1] with CAP_SYS_ADMIN capability enabled in the > >>>> process effective set [2]. > >>>> This patch set introduces CAP_PERFMON capability designed to secure > >>>> system performance monitoring and observability operations so that > >>>> CAP_PERFMON would assist CAP_SYS_ADMIN capability in its governing role > >>>> for performance monitoring and observability subsystems of the kernel. > >>> I'm seeing an issue with CAP_PERFMON when I try to record data for a > >>> specific target. I don't know whether this is sort of a regression or > >>> an expected behavior. > >> Thanks for reporting and root causing this case. The behavior looks like > >> kind of expected since currently CAP_PERFMON takes over the related part > >> of CAP_SYS_ADMIN credentials only. Actually Perf security docs [1] say > >> that access control is also subject to CAP_SYS_PTRACE credentials. > > I think that stating that in the error message would be helpful, after > > all, who reads docs? 8-) > At least those who write it :D ... Everybody should read it, sure :-) > > I.e., this: > > > > $ ./perf stat ls > > Error: > > Access to performance monitoring and observability operations is limited. > > $ > > > > Could become: > > > > $ ./perf stat ls > > Error: > > Access to performance monitoring and observability operations is limited. > > Right now only CAP_PERFMON is granted, you may need CAP_SYS_PTRACE. > > $ > > It would better provide reference to perf security docs in the tool output. So add a 3rd line: $ ./perf stat ls Error: Access to performance monitoring and observability operations is limited. Right now only CAP_PERFMON is granted, you may need CAP_SYS_PTRACE. Please read the 'Perf events and tool security' document: https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html > Looks like extending ptrace_may_access() check for perf_events with CAP_PERFMON You mean the following? diff --git a/kernel/events/core.c b/kernel/events/core.c index 856d98c36f56..a2397f724c10 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -11595,7 +11595,7 @@ SYSCALL_DEFINE5(perf_event_open, * perf_event_exit_task() that could imply). */ err = -EACCES; - if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) + if (!perfmon_capable() && !ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) goto err_cred; } > makes monitoring simpler and even more secure to use since Perf tool need > not to start/stop/single-step and read/write registers and memory and so on > like a debugger or strace-like tool. What do you think? I tend to agree, Peter? > Alexei > > > > > - Arnaldo > > > >> CAP_PERFMON could be used to extend and substitute ptrace_may_access() > >> check in perf_events subsystem to simplify user experience at least in > >> this specific case. > >> > >> Alexei > >> > >> [1] https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html > >> > >>> > >>> Without setting CAP_PERFMON: > >>> > >>> $ getcap ./perf > >>> $ ./perf stat -a ls > >>> Error: > >>> Access to performance monitoring and observability operations is limited. > >>> $ ./perf stat ls > >>> Performance counter stats for 'ls': > >>> 2.06 msec task-clock:u # 0.418 CPUs utilized > >>> 0 context-switches:u # 0.000 K/sec > >>> 0 cpu-migrations:u # 0.000 K/sec > >>> > >>> With CAP_PERFMON: > >>> > >>> $ getcap ./perf > >>> ./perf = cap_perfmon+ep > >>> $ ./perf stat -a ls > >>> Performance counter stats for 'system wide': > >>> 142.42 msec cpu-clock # 25.062 CPUs utilized > >>> 182 context-switches # 0.001 M/sec > >>> 48 cpu-migrations # 0.337 K/sec > >>> $ ./perf stat ls > >>> Error: > >>> Access to performance monitoring and observability operations is limited. > >>> > >>> Am I missing something silly? > >>> > >>> Analysis: > >>> --------- > >>> A bit more analysis lead me to below kernel code fs/exec.c: > >>> > >>> begin_new_exec() > >>> { > >>> ... > >>> if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP || > >>> !(uid_eq(current_euid(), current_uid()) && > >>> gid_eq(current_egid(), current_gid()))) > >>> set_dumpable(current->mm, suid_dumpable); > >>> else > >>> set_dumpable(current->mm, SUID_DUMP_USER); > >>> > >>> ... > >>> commit_creds(bprm->cred); > >>> } > >>> > >>> When I execute './perf stat ls', it's going into else condition and thus sets > >>> dumpable flag as SUID_DUMP_USER. Then in commit_creds(): > >>> > >>> int commit_creds(struct cred *new) > >>> { > >>> ... > >>> /* dumpability changes */ > >>> if (... > >>> !cred_cap_issubset(old, new)) { > >>> if (task->mm) > >>> set_dumpable(task->mm, suid_dumpable); > >>> } > >>> > >>> !cred_cap_issubset(old, new) fails for perf without any capability and thus > >>> it doesn't execute set_dumpable(). Whereas that condition passes for perf > >>> with CAP_PERFMON and thus it overwrites old value (SUID_DUMP_USER) with > >>> suid_dumpable in mm_flags. On an Ubuntu, suid_dumpable default value is > >>> SUID_DUMP_ROOT. On Fedora, it's SUID_DUMP_DISABLE. (/proc/sys/fs/suid_dumpable). > >>> > >>> Now while opening an event: > >>> > >>> perf_event_open() > >>> ptrace_may_access() > >>> __ptrace_may_access() { > >>> ... > >>> if (mm && > >>> ((get_dumpable(mm) != SUID_DUMP_USER) && > >>> !ptrace_has_cap(cred, mm->user_ns, mode))) > >>> return -EPERM; > >>> } > >>> > >>> This if condition passes for perf with CAP_PERFMON and thus it returns -EPERM. > >>> But it fails for perf without CAP_PERFMON and thus it goes ahead and returns > >>> success. So opening an event fails when perf has CAP_PREFMON and tries to open > >>> process specific event as normal user. > >>> > >>> Workarounds: > >>> ------------ > >>> Based on above analysis, I found couple of workarounds (examples are on > >>> Ubuntu 18.04.4 powerpc): > >>> > >>> Workaround1: > >>> Setting SUID_DUMP_USER as default (in /proc/sys/fs/suid_dumpable) solves the > >>> issue. > >>> > >>> # echo 1 > /proc/sys/fs/suid_dumpable > >>> $ getcap ./perf > >>> ./perf = cap_perfmon+ep > >>> $ ./perf stat ls > >>> Performance counter stats for 'ls': > >>> 1.47 msec task-clock # 0.806 CPUs utilized > >>> 0 context-switches # 0.000 K/sec > >>> 0 cpu-migrations # 0.000 K/sec > >>> > >>> Workaround2: > >>> Using CAP_SYS_PTRACE along with CAP_PERFMON solves the issue. > >>> > >>> $ cat /proc/sys/fs/suid_dumpable > >>> 2 > >>> # setcap "cap_perfmon,cap_sys_ptrace=ep" ./perf > >>> $ ./perf stat ls > >>> Performance counter stats for 'ls': > >>> 1.41 msec task-clock # 0.826 CPUs utilized > >>> 0 context-switches # 0.000 K/sec > >>> 0 cpu-migrations # 0.000 K/sec > >>> > >>> Workaround3: > >>> Adding CAP_PERFMON to parent of perf (/bin/bash) also solves the issue. > >>> > >>> $ cat /proc/sys/fs/suid_dumpable > >>> 2 > >>> # setcap "cap_perfmon=ep" /bin/bash > >>> # setcap "cap_perfmon=ep" ./perf > >>> $ bash > >>> $ ./perf stat ls > >>> Performance counter stats for 'ls': > >>> 1.47 msec task-clock # 0.806 CPUs utilized > >>> 0 context-switches # 0.000 K/sec > >>> 0 cpu-migrations # 0.000 K/sec > >>> > >>> - Ravi > >
On 13.07.2020 15:17, Arnaldo Carvalho de Melo wrote: > Em Mon, Jul 13, 2020 at 12:48:25PM +0300, Alexey Budankov escreveu: >> >> On 10.07.2020 20:09, Arnaldo Carvalho de Melo wrote: >>> Em Fri, Jul 10, 2020 at 05:30:50PM +0300, Alexey Budankov escreveu: >>>> On 10.07.2020 16:31, Ravi Bangoria wrote: >>>>>> Currently access to perf_events, i915_perf and other performance >>>>>> monitoring and observability subsystems of the kernel is open only for >>>>>> a privileged process [1] with CAP_SYS_ADMIN capability enabled in the >>>>>> process effective set [2]. > >>>>>> This patch set introduces CAP_PERFMON capability designed to secure >>>>>> system performance monitoring and observability operations so that >>>>>> CAP_PERFMON would assist CAP_SYS_ADMIN capability in its governing role >>>>>> for performance monitoring and observability subsystems of the kernel. > >>>>> I'm seeing an issue with CAP_PERFMON when I try to record data for a >>>>> specific target. I don't know whether this is sort of a regression or >>>>> an expected behavior. > >>>> Thanks for reporting and root causing this case. The behavior looks like >>>> kind of expected since currently CAP_PERFMON takes over the related part >>>> of CAP_SYS_ADMIN credentials only. Actually Perf security docs [1] say >>>> that access control is also subject to CAP_SYS_PTRACE credentials. > >>> I think that stating that in the error message would be helpful, after >>> all, who reads docs? 8-) > >> At least those who write it :D ... > > Everybody should read it, sure :-) > >>> I.e., this: >>> >>> $ ./perf stat ls >>> Error: >>> Access to performance monitoring and observability operations is limited. >>> $ >>> >>> Could become: >>> >>> $ ./perf stat ls >>> Error: >>> Access to performance monitoring and observability operations is limited. >>> Right now only CAP_PERFMON is granted, you may need CAP_SYS_PTRACE. >>> $ >> >> It would better provide reference to perf security docs in the tool output. > > So add a 3rd line: > > $ ./perf stat ls > Error: > Access to performance monitoring and observability operations is limited. > Right now only CAP_PERFMON is granted, you may need CAP_SYS_PTRACE. > Please read the 'Perf events and tool security' document: > https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html If it had that patch below then message change would not be required. However this two sentences in the end of whole message would still add up: "Please read the 'Perf events and tool security' document: https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html" > >> Looks like extending ptrace_may_access() check for perf_events with CAP_PERFMON > > You mean the following? Exactly that. > > diff --git a/kernel/events/core.c b/kernel/events/core.c > index 856d98c36f56..a2397f724c10 100644 > --- a/kernel/events/core.c > +++ b/kernel/events/core.c > @@ -11595,7 +11595,7 @@ SYSCALL_DEFINE5(perf_event_open, > * perf_event_exit_task() that could imply). > */ > err = -EACCES; > - if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) > + if (!perfmon_capable() && !ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) > goto err_cred; > } > >> makes monitoring simpler and even more secure to use since Perf tool need >> not to start/stop/single-step and read/write registers and memory and so on >> like a debugger or strace-like tool. What do you think? > > I tend to agree, Peter? > >> Alexei >> >>> >>> - Arnaldo Alexei
Em Mon, Jul 13, 2020 at 03:37:51PM +0300, Alexey Budankov escreveu: > > On 13.07.2020 15:17, Arnaldo Carvalho de Melo wrote: > > Em Mon, Jul 13, 2020 at 12:48:25PM +0300, Alexey Budankov escreveu: > >> > >> On 10.07.2020 20:09, Arnaldo Carvalho de Melo wrote: > >>> Em Fri, Jul 10, 2020 at 05:30:50PM +0300, Alexey Budankov escreveu: > >>>> On 10.07.2020 16:31, Ravi Bangoria wrote: > >>>>>> Currently access to perf_events, i915_perf and other performance > >>>>>> monitoring and observability subsystems of the kernel is open only for > >>>>>> a privileged process [1] with CAP_SYS_ADMIN capability enabled in the > >>>>>> process effective set [2]. > > > >>>>>> This patch set introduces CAP_PERFMON capability designed to secure > >>>>>> system performance monitoring and observability operations so that > >>>>>> CAP_PERFMON would assist CAP_SYS_ADMIN capability in its governing role > >>>>>> for performance monitoring and observability subsystems of the kernel. > > > >>>>> I'm seeing an issue with CAP_PERFMON when I try to record data for a > >>>>> specific target. I don't know whether this is sort of a regression or > >>>>> an expected behavior. > > > >>>> Thanks for reporting and root causing this case. The behavior looks like > >>>> kind of expected since currently CAP_PERFMON takes over the related part > >>>> of CAP_SYS_ADMIN credentials only. Actually Perf security docs [1] say > >>>> that access control is also subject to CAP_SYS_PTRACE credentials. > > > >>> I think that stating that in the error message would be helpful, after > >>> all, who reads docs? 8-) > > > >> At least those who write it :D ... > > > > Everybody should read it, sure :-) > > > >>> I.e., this: > >>> > >>> $ ./perf stat ls > >>> Error: > >>> Access to performance monitoring and observability operations is limited. > >>> $ > >>> > >>> Could become: > >>> > >>> $ ./perf stat ls > >>> Error: > >>> Access to performance monitoring and observability operations is limited. > >>> Right now only CAP_PERFMON is granted, you may need CAP_SYS_PTRACE. > >>> $ > >> > >> It would better provide reference to perf security docs in the tool output. > > > > So add a 3rd line: > > > > $ ./perf stat ls > > Error: > > Access to performance monitoring and observability operations is limited. > > Right now only CAP_PERFMON is granted, you may need CAP_SYS_PTRACE. > > Please read the 'Perf events and tool security' document: > > https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html > If it had that patch below then message change would not be required. Sure, but the tool should continue to work and provide useful messages when running on kernels without that change. Pointing to the document is valid and should be done, that is an agreed point. But the tool can do some checks, narrow down the possible causes for the error message and provide something that in most cases will make the user make progress. > However this two sentences in the end of whole message would still add up: > "Please read the 'Perf events and tool security' document: > https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html" We're in violent agreement here. :-) > > > >> Looks like extending ptrace_may_access() check for perf_events with CAP_PERFMON > > > > You mean the following? > > Exactly that. Sure, lets then wait for others to chime in and then you can go ahead and submit that patch. Peter? - Arnaldo > > > > diff --git a/kernel/events/core.c b/kernel/events/core.c > > index 856d98c36f56..a2397f724c10 100644 > > --- a/kernel/events/core.c > > +++ b/kernel/events/core.c > > @@ -11595,7 +11595,7 @@ SYSCALL_DEFINE5(perf_event_open, > > * perf_event_exit_task() that could imply). > > */ > > err = -EACCES; > > - if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) > > + if (!perfmon_capable() && !ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) > > goto err_cred; > > } > > > >> makes monitoring simpler and even more secure to use since Perf tool need > >> not to start/stop/single-step and read/write registers and memory and so on > >> like a debugger or strace-like tool. What do you think? > > > > I tend to agree, Peter? > > > >> Alexei > >> > >>> > >>> - Arnaldo > > Alexei
On Mon, Jul 13, 2020 at 03:51:52PM -0300, Arnaldo Carvalho de Melo wrote: > > > diff --git a/kernel/events/core.c b/kernel/events/core.c > > > index 856d98c36f56..a2397f724c10 100644 > > > --- a/kernel/events/core.c > > > +++ b/kernel/events/core.c > > > @@ -11595,7 +11595,7 @@ SYSCALL_DEFINE5(perf_event_open, > > > * perf_event_exit_task() that could imply). > > > */ > > > err = -EACCES; > > > - if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) > > > + if (!perfmon_capable() && !ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) > > > goto err_cred; > > > } > > > > > >> makes monitoring simpler and even more secure to use since Perf tool need > > >> not to start/stop/single-step and read/write registers and memory and so on > > >> like a debugger or strace-like tool. What do you think? > > > > > > I tend to agree, Peter? So this basically says that if CAP_PERFMON, we don't care about the ptrace() permissions? Just like how CAP_SYS_PTRACE would always allow the ptrace checks? I suppose that makes sense.
Em Tue, Jul 14, 2020 at 12:59:34PM +0200, Peter Zijlstra escreveu: > On Mon, Jul 13, 2020 at 03:51:52PM -0300, Arnaldo Carvalho de Melo wrote: > > > > > diff --git a/kernel/events/core.c b/kernel/events/core.c > > > > index 856d98c36f56..a2397f724c10 100644 > > > > --- a/kernel/events/core.c > > > > +++ b/kernel/events/core.c > > > > @@ -11595,7 +11595,7 @@ SYSCALL_DEFINE5(perf_event_open, > > > > * perf_event_exit_task() that could imply). > > > > */ > > > > err = -EACCES; > > > > - if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) > > > > + if (!perfmon_capable() && !ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) > > > > goto err_cred; > > > > } > > > >> makes monitoring simpler and even more secure to use since Perf tool need > > > >> not to start/stop/single-step and read/write registers and memory and so on > > > >> like a debugger or strace-like tool. What do you think? > > > > I tend to agree, Peter? > So this basically says that if CAP_PERFMON, we don't care about the > ptrace() permissions? Just like how CAP_SYS_PTRACE would always allow > the ptrace checks? > I suppose that makes sense. Yeah, it in fact addresses the comment right above it: if (task) { err = mutex_lock_interruptible(&task->signal->exec_update_mutex); if (err) goto err_task; /* * Reuse ptrace permission checks for now. * * We must hold exec_update_mutex across this and any potential * perf_install_in_context() call for this new event to * serialize against exec() altering our credentials (and the * perf_event_exit_task() that could imply). */ err = -EACCES; if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) goto err_cred; } that "for now" part :-) Idea is to not require CAP_PTRACE for that, i.e. the attack surface for the perf binary is reduced. - Arnaldo
On 13.07.2020 21:51, Arnaldo Carvalho de Melo wrote: > Em Mon, Jul 13, 2020 at 03:37:51PM +0300, Alexey Budankov escreveu: >> >> On 13.07.2020 15:17, Arnaldo Carvalho de Melo wrote: >>> Em Mon, Jul 13, 2020 at 12:48:25PM +0300, Alexey Budankov escreveu: >>>> >>>> On 10.07.2020 20:09, Arnaldo Carvalho de Melo wrote: >>>>> Em Fri, Jul 10, 2020 at 05:30:50PM +0300, Alexey Budankov escreveu: >>>>>> On 10.07.2020 16:31, Ravi Bangoria wrote: >>>>>>>> Currently access to perf_events, i915_perf and other performance >>>>>>>> monitoring and observability subsystems of the kernel is open only for >>>>>>>> a privileged process [1] with CAP_SYS_ADMIN capability enabled in the >>>>>>>> process effective set [2]. >>> >>>>>>>> This patch set introduces CAP_PERFMON capability designed to secure >>>>>>>> system performance monitoring and observability operations so that >>>>>>>> CAP_PERFMON would assist CAP_SYS_ADMIN capability in its governing role >>>>>>>> for performance monitoring and observability subsystems of the kernel. >>> >>>>>>> I'm seeing an issue with CAP_PERFMON when I try to record data for a >>>>>>> specific target. I don't know whether this is sort of a regression or >>>>>>> an expected behavior. >>> >>>>>> Thanks for reporting and root causing this case. The behavior looks like >>>>>> kind of expected since currently CAP_PERFMON takes over the related part >>>>>> of CAP_SYS_ADMIN credentials only. Actually Perf security docs [1] say >>>>>> that access control is also subject to CAP_SYS_PTRACE credentials. >>> >>>>> I think that stating that in the error message would be helpful, after >>>>> all, who reads docs? 8-) >>> >>>> At least those who write it :D ... >>> >>> Everybody should read it, sure :-) >>> >>>>> I.e., this: >>>>> >>>>> $ ./perf stat ls >>>>> Error: >>>>> Access to performance monitoring and observability operations is limited. >>>>> $ >>>>> >>>>> Could become: >>>>> >>>>> $ ./perf stat ls >>>>> Error: >>>>> Access to performance monitoring and observability operations is limited. >>>>> Right now only CAP_PERFMON is granted, you may need CAP_SYS_PTRACE. >>>>> $ >>>> >>>> It would better provide reference to perf security docs in the tool output. >>> >>> So add a 3rd line: >>> >>> $ ./perf stat ls >>> Error: >>> Access to performance monitoring and observability operations is limited. >>> Right now only CAP_PERFMON is granted, you may need CAP_SYS_PTRACE. >>> Please read the 'Perf events and tool security' document: >>> https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html > >> If it had that patch below then message change would not be required. > > Sure, but the tool should continue to work and provide useful messages > when running on kernels without that change. Pointing to the document is > valid and should be done, that is an agreed point. But the tool can do > some checks, narrow down the possible causes for the error message and > provide something that in most cases will make the user make progress. > >> However this two sentences in the end of whole message would still add up: >> "Please read the 'Perf events and tool security' document: >> https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html" > > We're in violent agreement here. :-) Here is the message draft mentioning a) CAP_SYS_PTRACE, for kernels prior v5.8, and b) Perf security document link. The plan is to send a patch extending perf_events with CAP_PERFMON check [1] for ptrace_may_access() and extending the tool with this message. "Access to performance monitoring and observability operations is limited. Enforced MAC policy settings (SELinux) can limit access to performance monitoring and observability operations. Inspect system audit records for more perf_event access control information and adjusting the policy. Consider adjusting /proc/sys/kernel/perf_event_paranoid setting to open access to performance monitoring and observability operations for processes without CAP_PERFMON, CAP_SYS_PTRACE or CAP_SYS_ADMIN Linux capability. More information can be found at 'Perf events and tool security' document: https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html perf_event_paranoid setting is -1: -1: Allow use of (almost) all events by all users Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK >= 0: Disallow raw and ftrace function tracepoint access >= 1: Disallow CPU event access >= 2: Disallow kernel profiling To make the adjusted perf_event_paranoid setting permanent preserve it in /etc/sysctl.conf (e.g. kernel.perf_event_paranoid = <setting>)" Alexei [1] https://lore.kernel.org/lkml/20200713121746.GA7029@kernel.org/ > >>> >>>> Looks like extending ptrace_may_access() check for perf_events with CAP_PERFMON >>> >>> You mean the following? >> >> Exactly that. > > Sure, lets then wait for others to chime in and then you can go ahead > and submit that patch. > > Peter? > > - Arnaldo > >>> >>> diff --git a/kernel/events/core.c b/kernel/events/core.c >>> index 856d98c36f56..a2397f724c10 100644 >>> --- a/kernel/events/core.c >>> +++ b/kernel/events/core.c >>> @@ -11595,7 +11595,7 @@ SYSCALL_DEFINE5(perf_event_open, >>> * perf_event_exit_task() that could imply). >>> */ >>> err = -EACCES; >>> - if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) >>> + if (!perfmon_capable() && !ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) >>> goto err_cred; >>> } >>> >>>> makes monitoring simpler and even more secure to use since Perf tool need >>>> not to start/stop/single-step and read/write registers and memory and so on >>>> like a debugger or strace-like tool. What do you think? >>> >>> I tend to agree, Peter? >>> >>>> Alexei >>>> >>>>> >>>>> - Arnaldo >> >> Alexei >
Em Tue, Jul 21, 2020 at 04:06:34PM +0300, Alexey Budankov escreveu: > > On 13.07.2020 21:51, Arnaldo Carvalho de Melo wrote: > > Em Mon, Jul 13, 2020 at 03:37:51PM +0300, Alexey Budankov escreveu: > >> > >> On 13.07.2020 15:17, Arnaldo Carvalho de Melo wrote: > >>> Em Mon, Jul 13, 2020 at 12:48:25PM +0300, Alexey Budankov escreveu: > >> If it had that patch below then message change would not be required. > > Sure, but the tool should continue to work and provide useful messages > > when running on kernels without that change. Pointing to the document is > > valid and should be done, that is an agreed point. But the tool can do > > some checks, narrow down the possible causes for the error message and > > provide something that in most cases will make the user make progress. > >> However this two sentences in the end of whole message would still add up: > >> "Please read the 'Perf events and tool security' document: > >> https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html" > > We're in violent agreement here. :-) > Here is the message draft mentioning a) CAP_SYS_PTRACE, for kernels prior > v5.8, and b) Perf security document link. The plan is to send a patch extending > perf_events with CAP_PERFMON check [1] for ptrace_may_access() and extending > the tool with this message. > "Access to performance monitoring and observability operations is limited. > Enforced MAC policy settings (SELinux) can limit access to performance > monitoring and observability operations. Inspect system audit records for > more perf_event access control information and adjusting the policy. > Consider adjusting /proc/sys/kernel/perf_event_paranoid setting to open > access to performance monitoring and observability operations for processes > without CAP_PERFMON, CAP_SYS_PTRACE or CAP_SYS_ADMIN Linux capability. > More information can be found at 'Perf events and tool security' document: > https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html > perf_event_paranoid setting is -1: > -1: Allow use of (almost) all events by all users > Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK > >= 0: Disallow raw and ftrace function tracepoint access > >= 1: Disallow CPU event access > >= 2: Disallow kernel profiling > To make the adjusted perf_event_paranoid setting permanent preserve it > in /etc/sysctl.conf (e.g. kernel.perf_event_paranoid = <setting>)" Looks ok! Lots of knobs to control access as one needs. - Arnaldo > Alexei > > [1] https://lore.kernel.org/lkml/20200713121746.GA7029@kernel.org/