mbox series

[v8,00/12] Introduce CAP_PERFMON to secure system performance monitoring and observability

Message ID f96f8f8a-e65c-3f36-dc85-fc3f5191e8c5@linux.intel.com (mailing list archive)
Headers show
Series Introduce CAP_PERFMON to secure system performance monitoring and observability | expand

Message

Alexey Budankov April 2, 2020, 8:42 a.m. UTC
Changes in v8:
- added Acked-by and Reviewed-by tags acquired so far
- rebased on the top of tip perf/core repository:
  git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip perf/core
  sha1: 629b3df7ecb01fddfdf71cb5d3c563d143117c33
Changes in v7:
- updated and extended kernel.rst and perf-security.rst documentation 
  files with the information about CAP_PERFMON capability and its use cases
- documented the case of double audit logging of CAP_PERFMON and CAP_SYS_ADMIN
  capabilities on a SELinux enabled system
Changes in v6:
- avoided noaudit checks in perfmon_capable() to explicitly advertise
  CAP_PERFMON usage thru audit logs to secure system performance
  monitoring and observability
Changes in v5:
- renamed CAP_SYS_PERFMON to CAP_PERFMON
- extended perfmon_capable() with noaudit checks
Changes in v4:
- converted perfmon_capable() into an inline function
- made perf_events kprobes, uprobes, hw breakpoints and namespaces data
  available to CAP_SYS_PERFMON privileged processes
- applied perfmon_capable() to drivers/perf and drivers/oprofile
- extended __cmd_ftrace() with support of CAP_SYS_PERFMON
Changes in v3:
- implemented perfmon_capable() macros aggregating required capabilities
  checks
Changes in v2:
- made perf_events trace points available to CAP_SYS_PERFMON privileged
  processes
- made perf_event_paranoid_check() treat CAP_SYS_PERFMON equally to
  CAP_SYS_ADMIN
- applied CAP_SYS_PERFMON to i915_perf, bpf_trace, powerpc and parisc
  system performance monitoring and observability related subsystems

Currently access to perf_events, i915_perf and other performance
monitoring and observability subsystems of the kernel is open only for
a privileged process [1] with CAP_SYS_ADMIN capability enabled in the
process effective set [2].

This patch set introduces CAP_PERFMON capability designed to secure
system performance monitoring and observability operations so that
CAP_PERFMON would assist CAP_SYS_ADMIN capability in its governing role
for performance monitoring and observability subsystems of the kernel.

CAP_PERFMON intends to harden system security and integrity during
performance monitoring and observability operations by decreasing attack
surface that is available to a CAP_SYS_ADMIN privileged process [2].
Providing the access to performance monitoring and observability
operations under CAP_PERFMON capability singly, without the rest of
CAP_SYS_ADMIN credentials, excludes chances to misuse the credentials
and makes the operation more secure. Thus, CAP_PERFMON implements the
principal of least privilege for performance monitoring and
observability operations (POSIX IEEE 1003.1e: 2.2.2.39 principle of
least privilege: A security design principle that states that a process
or program be granted only those privileges (e.g., capabilities)
necessary to accomplish its legitimate function, and only for the time
that such privileges are actually required)

CAP_PERFMON intends to meet the demand to secure system performance
monitoring and observability operations for adoption in security
sensitive, restricted, multiuser production environments (e.g. HPC
clusters, cloud and virtual compute environments), where root or
CAP_SYS_ADMIN credentials are not available to mass users of a system,
and securely unblock accessibility of system performance monitoring and
observability operations beyond root and CAP_SYS_ADMIN use cases.

CAP_PERFMON intends to take over CAP_SYS_ADMIN credentials related to
system performance monitoring and observability operations and balance
amount of CAP_SYS_ADMIN credentials following the recommendations in
the capabilities man page [2] for CAP_SYS_ADMIN: "Note: this capability
is overloaded; see Notes to kernel developers, below." For backward
compatibility reasons access to system performance monitoring and
observability subsystems of the kernel remains open for CAP_SYS_ADMIN
privileged processes but CAP_SYS_ADMIN capability usage for secure
system performance monitoring and observability operations is
discouraged with respect to the designed CAP_PERFMON capability.

Possible alternative solution to this system security hardening,
capabilities balancing task of making performance monitoring and
observability operations more secure and accessible could be to use
the existing CAP_SYS_PTRACE capability to govern system performance
monitoring and observability subsystems. However CAP_SYS_PTRACE
capability still provides users with more credentials than are
required for secure performance monitoring and observability
operations and this excess is avoided by the designed CAP_PERFMON.

Although software running under CAP_PERFMON can not ensure avoidance of
related hardware issues, the software can still mitigate those issues
following the official hardware issues mitigation procedure [3]. The
bugs in the software itself can be fixed following the standard kernel
development process [4] to maintain and harden security of system
performance monitoring and observability operations. Finally, the patch
set is shaped in the way that simplifies backtracking procedure of
possible induced issues [5] as much as possible.

---
Alexey Budankov (12):
  capabilities: introduce CAP_PERFMON to kernel and user space
  perf/core: open access to the core for CAP_PERFMON privileged process
  perf/core: open access to probes for CAP_PERFMON privileged process
  perf tool: extend Perf tool with CAP_PERFMON capability support
  drm/i915/perf: open access for CAP_PERFMON privileged process
  trace/bpf_trace: open access for CAP_PERFMON privileged process
  powerpc/perf: open access for CAP_PERFMON privileged process
  parisc/perf: open access for CAP_PERFMON privileged process
  drivers/perf: open access for CAP_PERFMON privileged process
  drivers/oprofile: open access for CAP_PERFMON privileged process
  doc/admin-guide: update perf-security.rst with CAP_PERFMON information
  doc/admin-guide: update kernel.rst with CAP_PERFMON information

 Documentation/admin-guide/perf-security.rst | 65 +++++++++++++--------
 Documentation/admin-guide/sysctl/kernel.rst | 16 +++--
 arch/parisc/kernel/perf.c                   |  2 +-
 arch/powerpc/perf/imc-pmu.c                 |  4 +-
 drivers/gpu/drm/i915/i915_perf.c            | 13 ++---
 drivers/oprofile/event_buffer.c             |  2 +-
 drivers/perf/arm_spe_pmu.c                  |  4 +-
 include/linux/capability.h                  |  4 ++
 include/linux/perf_event.h                  |  6 +-
 include/uapi/linux/capability.h             |  8 ++-
 kernel/events/core.c                        |  6 +-
 kernel/trace/bpf_trace.c                    |  2 +-
 security/selinux/include/classmap.h         |  4 +-
 tools/perf/builtin-ftrace.c                 |  5 +-
 tools/perf/design.txt                       |  3 +-
 tools/perf/util/cap.h                       |  4 ++
 tools/perf/util/evsel.c                     | 10 ++--
 tools/perf/util/util.c                      |  1 +
 18 files changed, 98 insertions(+), 61 deletions(-)

---
Validation (Intel Skylake, 8 cores, Fedora 29, 5.5.0-rc3+, x86_64):

libcap library [6], [7], [8] and Perf tool can be used to apply
CAP_PERFMON capability for secure system performance monitoring and
observability beyond the scope permitted by the system wide
perf_event_paranoid kernel setting [9] and below are the steps for
evaluation:

  - patch, build and boot the kernel
  - patch, build Perf tool e.g. to /home/user/perf
  ...
  # git clone git://git.kernel.org/pub/scm/libs/libcap/libcap.git libcap
  # pushd libcap
  # patch libcap/include/uapi/linux/capabilities.h with [PATCH 1]
  # make
  # pushd progs
  # ./setcap "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" /home/user/perf
  # ./setcap -v "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" /home/user/perf
  /home/user/perf: OK
  # ./getcap /home/user/perf
  /home/user/perf = cap_sys_ptrace,cap_syslog,cap_perfmon+ep
  # echo 2 > /proc/sys/kernel/perf_event_paranoid
  # cat /proc/sys/kernel/perf_event_paranoid 
  2
  ...
  $ /home/user/perf top
    ... works as expected ...
  $ cat /proc/`pidof perf`/status
  Name:	perf
  Umask:	0002
  State:	S (sleeping)
  Tgid:	2958
  Ngid:	0
  Pid:	2958
  PPid:	9847
  TracerPid:	0
  Uid:	500	500	500	500
  Gid:	500	500	500	500
  FDSize:	256
  ...
  CapInh:	0000000000000000
  CapPrm:	0000004400080000
  CapEff:	0000004400080000 => 01000100 00000000 00001000 00000000 00000000
                                     cap_perfmon,cap_sys_ptrace,cap_syslog
  CapBnd:	0000007fffffffff
  CapAmb:	0000000000000000
  NoNewPrivs:	0
  Seccomp:	0
  Speculation_Store_Bypass:	thread vulnerable
  Cpus_allowed:	ff
  Cpus_allowed_list:	0-7
  ...

Usage of cap_perfmon effectively avoids unused credentials excess:

- with cap_sys_admin:
  CapEff:	0000007fffffffff => 01111111 11111111 11111111 11111111 11111111

- with cap_perfmon:
  CapEff:	0000004400080000 => 01000100 00000000 00001000 00000000 00000000
                                    38   34               19
                               perfmon   syslog           sys_ptrace

---
[1] https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html
[2] http://man7.org/linux/man-pages/man7/capabilities.7.html
[3] https://www.kernel.org/doc/html/latest/process/embargoed-hardware-issues.html
[4] https://www.kernel.org/doc/html/latest/admin-guide/security-bugs.html
[5] https://www.kernel.org/doc/html/latest/process/management-style.html#decisions
[6] http://man7.org/linux/man-pages/man8/setcap.8.html
[7] https://git.kernel.org/pub/scm/libs/libcap/libcap.git
[8] https://sites.google.com/site/fullycapable/, posix_1003.1e-990310.pdf
[9] http://man7.org/linux/man-pages/man2/perf_event_open.2.html

Comments

Arnaldo Carvalho de Melo April 7, 2020, 2:30 p.m. UTC | #1
Em Thu, Apr 02, 2020 at 11:42:05AM +0300, Alexey Budankov escreveu:
> This patch set introduces CAP_PERFMON capability designed to secure
> system performance monitoring and observability operations so that
> CAP_PERFMON would assist CAP_SYS_ADMIN capability in its governing role
> for performance monitoring and observability subsystems of the kernel.

So, what am I doing wrong?

[perf@five ~]$ type perf
perf is hashed (/home/perf/bin/perf)
[perf@five ~]$
[perf@five ~]$ ls -lahF /home/perf/bin/perf
-rwxr-x---. 1 root perf_users 24M Apr  7 10:34 /home/perf/bin/perf*
[perf@five ~]$
[perf@five ~]$ getcap /home/perf/bin/perf
[perf@five ~]$ perf top --stdio
Error:
You may not have permission to collect system-wide stats.

Consider tweaking /proc/sys/kernel/perf_event_paranoid,
which controls use of the performance events system by
unprivileged users (without CAP_PERFMON or CAP_SYS_ADMIN).

The current value is 2:

  -1: Allow use of (almost) all events by all users
      Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
>= 0: Disallow ftrace function tracepoint by users without CAP_PERFMON or CAP_SYS_ADMIN
      Disallow raw tracepoint access by users without CAP_SYS_PERFMON or CAP_SYS_ADMIN
>= 1: Disallow CPU event access by users without CAP_PERFMON or CAP_SYS_ADMIN
>= 2: Disallow kernel profiling by users without CAP_PERFMON or CAP_SYS_ADMIN

To make this setting permanent, edit /etc/sysctl.conf too, e.g.:

	kernel.perf_event_paranoid = -1

[perf@five ~]$

Ok, the message says I  need to have CAP_PERFMON, lets do it, using an
unpatched libcap that doesn't know about it but we can use 38,
CAP_PERFMON value instead, and I tested this with a patched libcap as
well, same results:

As root:

[root@five bin]# setcap "38,cap_sys_ptrace,cap_syslog=ep" perf
[root@five bin]#

Back to the 'perf' user in the 'perf_users' group, ok, so now 'perf
record -a' works for system wide sampling of cycles:u, i.e. only
userspace samples, but 'perf top' is failing:

[perf@five ~]$ type perf
perf is hashed (/home/perf/bin/perf)
[perf@five ~]$ getcap /home/perf/bin/perf
/home/perf/bin/perf = cap_sys_ptrace,cap_syslog,38+ep
[perf@five ~]$ groups
perf perf_users
[perf@five ~]$ id
uid=1002(perf) gid=1002(perf) groups=1002(perf),1003(perf_users) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
[perf@five ~]$ perf top --stdio
Error:
Failed to mmap with 1 (Operation not permitted)
[perf@five ~]$ perf record -a
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.177 MB perf.data (1552 samples) ]

[perf@five ~]$ perf evlist
cycles:u
[perf@five ~]$

- Arnaldo
Arnaldo Carvalho de Melo April 7, 2020, 2:35 p.m. UTC | #2
Em Tue, Apr 07, 2020 at 11:30:14AM -0300, Arnaldo Carvalho de Melo escreveu:
> [perf@five ~]$ type perf
> perf is hashed (/home/perf/bin/perf)
> [perf@five ~]$ getcap /home/perf/bin/perf
> /home/perf/bin/perf = cap_sys_ptrace,cap_syslog,38+ep
> [perf@five ~]$ groups
> perf perf_users
> [perf@five ~]$ id
> uid=1002(perf) gid=1002(perf) groups=1002(perf),1003(perf_users) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
> [perf@five ~]$ perf top --stdio
> Error:
> Failed to mmap with 1 (Operation not permitted)
> [perf@five ~]$ perf record -a
> ^C[ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 1.177 MB perf.data (1552 samples) ]
> 
> [perf@five ~]$ perf evlist
> cycles:u
> [perf@five ~]$

Humm, perf record falls back to cycles:u after initially trying cycles
(i.e. kernel and userspace), lemme see trying 'perf top -e cycles:u',
lemme test, humm not really:

[perf@five ~]$ perf top --stdio -e cycles:u
Error:
Failed to mmap with 1 (Operation not permitted)
[perf@five ~]$ perf record -e cycles:u -a sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.123 MB perf.data (132 samples) ]
[perf@five ~]$

Back to debugging this.

- Arnaldo
Alexey Budankov April 7, 2020, 2:54 p.m. UTC | #3
On 07.04.2020 17:35, Arnaldo Carvalho de Melo wrote:
> Em Tue, Apr 07, 2020 at 11:30:14AM -0300, Arnaldo Carvalho de Melo escreveu:
>> [perf@five ~]$ type perf
>> perf is hashed (/home/perf/bin/perf)
>> [perf@five ~]$ getcap /home/perf/bin/perf
>> /home/perf/bin/perf = cap_sys_ptrace,cap_syslog,38+ep
>> [perf@five ~]$ groups
>> perf perf_users
>> [perf@five ~]$ id
>> uid=1002(perf) gid=1002(perf) groups=1002(perf),1003(perf_users) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
>> [perf@five ~]$ perf top --stdio
>> Error:
>> Failed to mmap with 1 (Operation not permitted)
>> [perf@five ~]$ perf record -a
>> ^C[ perf record: Woken up 1 times to write data ]
>> [ perf record: Captured and wrote 1.177 MB perf.data (1552 samples) ]
>>
>> [perf@five ~]$ perf evlist
>> cycles:u
>> [perf@five ~]$
> 
> Humm, perf record falls back to cycles:u after initially trying cycles
> (i.e. kernel and userspace), lemme see trying 'perf top -e cycles:u',
> lemme test, humm not really:
> 
> [perf@five ~]$ perf top --stdio -e cycles:u
> Error:
> Failed to mmap with 1 (Operation not permitted)
> [perf@five ~]$ perf record -e cycles:u -a sleep 1
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 1.123 MB perf.data (132 samples) ]
> [perf@five ~]$
> 
> Back to debugging this.

Could makes sense adding cap_ipc_lock to the binary to isolate from this:

kernel/events/core.c: 6101
	if ((locked > lock_limit) && perf_is_paranoid() &&
		!capable(CAP_IPC_LOCK)) {
		ret = -EPERM;
		goto unlock;
	}

~Alexey
Arnaldo Carvalho de Melo April 7, 2020, 4:36 p.m. UTC | #4
Em Tue, Apr 07, 2020 at 05:54:27PM +0300, Alexey Budankov escreveu:
> On 07.04.2020 17:35, Arnaldo Carvalho de Melo wrote:
> > Em Tue, Apr 07, 2020 at 11:30:14AM -0300, Arnaldo Carvalho de Melo escreveu:
> >> [perf@five ~]$ type perf
> >> perf is hashed (/home/perf/bin/perf)
> >> [perf@five ~]$ getcap /home/perf/bin/perf
> >> /home/perf/bin/perf = cap_sys_ptrace,cap_syslog,38+ep
> >> [perf@five ~]$ groups
> >> perf perf_users
> >> [perf@five ~]$ id
> >> uid=1002(perf) gid=1002(perf) groups=1002(perf),1003(perf_users) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
> >> [perf@five ~]$ perf top --stdio
> >> Error:
> >> Failed to mmap with 1 (Operation not permitted)
> >> [perf@five ~]$ perf record -a
> >> ^C[ perf record: Woken up 1 times to write data ]
> >> [ perf record: Captured and wrote 1.177 MB perf.data (1552 samples) ]
> >>
> >> [perf@five ~]$ perf evlist
> >> cycles:u
> >> [perf@five ~]$
> > 
> > Humm, perf record falls back to cycles:u after initially trying cycles
> > (i.e. kernel and userspace), lemme see trying 'perf top -e cycles:u',
> > lemme test, humm not really:
> > 
> > [perf@five ~]$ perf top --stdio -e cycles:u
> > Error:
> > Failed to mmap with 1 (Operation not permitted)
> > [perf@five ~]$ perf record -e cycles:u -a sleep 1
> > [ perf record: Woken up 1 times to write data ]
> > [ perf record: Captured and wrote 1.123 MB perf.data (132 samples) ]
> > [perf@five ~]$
> > 
> > Back to debugging this.
> 
> Could makes sense adding cap_ipc_lock to the binary to isolate from this:
> 
> kernel/events/core.c: 6101
> 	if ((locked > lock_limit) && perf_is_paranoid() &&
> 		!capable(CAP_IPC_LOCK)) {
> 		ret = -EPERM;
> 		goto unlock;
> 	}


That did the trick, I'll update the documentation and include in my
"Committer testing" section:

[perf@five ~]$ groups
perf perf_users
[perf@five ~]$ ls -lahF bin/perf
-rwxr-x---. 1 root perf_users 24M Apr  7 10:34 bin/perf*
[perf@five ~]$ getcap bin/perf
bin/perf = cap_ipc_lock,cap_sys_ptrace,cap_syslog,38+ep
[perf@five ~]$
[perf@five ~]$ perf top --stdio


   PerfTop:     652 irqs/sec  kernel:73.8%  exact: 99.7% lost: 0/0 drop: 0/0 [4000Hz cycles:u],  (all, 12 CPUs)
---------------------------------------------------------------------------------------------------------------

    13.03%  [kernel]               [k] module_get_kallsym
     5.25%  [kernel]               [k] kallsyms_expand_symbol.constprop.0
     5.00%  libc-2.30.so           [.] __GI_____strtoull_l_internal
     4.41%  [kernel]               [k] memcpy
     3.42%  [kernel]               [k] vsnprintf
     2.98%  perf                   [.] map__process_kallsym_symbol
     2.86%  [kernel]               [k] format_decode
     2.73%  [kernel]               [k] number
     2.70%  perf                   [.] rb_next
     2.59%  perf                   [.] maps__split_kallsyms
     2.54%  [kernel]               [k] string_nocheck
     1.90%  libc-2.30.so           [.] _IO_getdelim
     1.86%  [kernel]               [k] __x86_indirect_thunk_rax
     1.53%  libc-2.30.so           [.] _int_malloc
     1.48%  libc-2.30.so           [.] __memmove_avx_unaligned_erms
     1.40%  [kernel]               [k] clear_page_rep
     1.07%  perf                   [.] rb_insert_color
     1.01%  libc-2.30.so           [.] _IO_feof
     0.99%  perf                   [.] __dso__load_kallsyms
     0.98%  [kernel]               [k] s_next
     0.96%  perf                   [.] __rblist__findnew
     0.95%  [kernel]               [k] strlen
     0.95%  perf                   [.] arch__symbols__fixup_end
     0.94%  libpixman-1.so.0.38.4  [.] 0x000000000006f4af
     0.94%  perf                   [.] symbol__new
     0.89%  libpixman-1.so.0.38.4  [.] 0x000000000006f4a0
     0.86%  [kernel]               [k] seq_read
     0.81%  libpixman-1.so.0.38.4  [.] 0x000000000006f4ab
     0.80%  perf                   [.] __symbols__insert
     0.73%  libpixman-1.so.0.38.4  [.] 0x000000000006f4a7
     0.67%  [kernel]               [k] s_show
     0.66%  libc-2.30.so           [.] __libc_calloc
     0.61%  libpixman-1.so.0.38.4  [.] 0x000000000006f4bb
     0.59%  [kernel]               [k] get_page_from_freelist
     0.59%  perf                   [.] memcpy@plt
     0.58%  perf                   [.] eprintf
exiting.
[perf@five ~]$

There is still something strange in here, the event is cycles:u (see at
the PerfTop line, but it is getting kernel samples :-\

- Arnaldo
Arnaldo Carvalho de Melo April 7, 2020, 4:40 p.m. UTC | #5
Em Tue, Apr 07, 2020 at 01:36:54PM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Tue, Apr 07, 2020 at 05:54:27PM +0300, Alexey Budankov escreveu:
> > On 07.04.2020 17:35, Arnaldo Carvalho de Melo wrote:
> > > Em Tue, Apr 07, 2020 at 11:30:14AM -0300, Arnaldo Carvalho de Melo escreveu:
> > >> [perf@five ~]$ type perf
> > >> perf is hashed (/home/perf/bin/perf)
> > >> [perf@five ~]$ getcap /home/perf/bin/perf
> > >> /home/perf/bin/perf = cap_sys_ptrace,cap_syslog,38+ep
> > >> [perf@five ~]$ groups
> > >> perf perf_users
> > >> [perf@five ~]$ id
> > >> uid=1002(perf) gid=1002(perf) groups=1002(perf),1003(perf_users) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
> > >> [perf@five ~]$ perf top --stdio
> > >> Error:
> > >> Failed to mmap with 1 (Operation not permitted)
> > >> [perf@five ~]$ perf record -a
> > >> ^C[ perf record: Woken up 1 times to write data ]
> > >> [ perf record: Captured and wrote 1.177 MB perf.data (1552 samples) ]
> > >>
> > >> [perf@five ~]$ perf evlist
> > >> cycles:u
> > >> [perf@five ~]$
> > > 
> > > Humm, perf record falls back to cycles:u after initially trying cycles
> > > (i.e. kernel and userspace), lemme see trying 'perf top -e cycles:u',
> > > lemme test, humm not really:
> > > 
> > > [perf@five ~]$ perf top --stdio -e cycles:u
> > > Error:
> > > Failed to mmap with 1 (Operation not permitted)
> > > [perf@five ~]$ perf record -e cycles:u -a sleep 1
> > > [ perf record: Woken up 1 times to write data ]
> > > [ perf record: Captured and wrote 1.123 MB perf.data (132 samples) ]
> > > [perf@five ~]$
> > > 
> > > Back to debugging this.
> > 
> > Could makes sense adding cap_ipc_lock to the binary to isolate from this:
> > 
> > kernel/events/core.c: 6101
> > 	if ((locked > lock_limit) && perf_is_paranoid() &&
> > 		!capable(CAP_IPC_LOCK)) {
> > 		ret = -EPERM;
> > 		goto unlock;
> > 	}
> 
> 
> That did the trick, I'll update the documentation and include in my
> "Committer testing" section:

I ammended this to that patch, please check the wording:

- Arnaldo

diff --git a/Documentation/admin-guide/perf-security.rst b/Documentation/admin-guide/perf-security.rst
index c0ca0c1a6804..ed33682e26b0 100644
--- a/Documentation/admin-guide/perf-security.rst
+++ b/Documentation/admin-guide/perf-security.rst
@@ -127,12 +127,19 @@ taken to create such groups of privileged Perf users.
 
 ::
 
-   # setcap "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf
-   # setcap -v "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf
+   # setcap "cap_perfmon,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf
+   # setcap -v "cap_perfmon,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf
    perf: OK
    # getcap perf
    perf = cap_sys_ptrace,cap_syslog,cap_perfmon+ep
 
+If the libcap installed doesn't yet support "cap_perfmon", use "38" instead,
+i.e.:
+
+::
+
+   # setcap "38,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf
+
 As a result, members of perf_users group are capable of conducting
 performance monitoring and observability by using functionality of the
 configured Perf tool executable that, when executes, passes perf_events
Alexey Budankov April 7, 2020, 4:52 p.m. UTC | #6
On 07.04.2020 19:36, Arnaldo Carvalho de Melo wrote:
> Em Tue, Apr 07, 2020 at 05:54:27PM +0300, Alexey Budankov escreveu:
>> On 07.04.2020 17:35, Arnaldo Carvalho de Melo wrote:
>>> Em Tue, Apr 07, 2020 at 11:30:14AM -0300, Arnaldo Carvalho de Melo escreveu:
>>>> [perf@five ~]$ type perf
<SNIP>
>>>> perf is hashed (/home/perf/bin/perf)
>>>> [perf@five ~]$
>>>
>>> Humm, perf record falls back to cycles:u after initially trying cycles
>>> (i.e. kernel and userspace), lemme see trying 'perf top -e cycles:u',
>>> lemme test, humm not really:
>>>
>>> [perf@five ~]$ perf top --stdio -e cycles:u
>>> Error:
>>> Failed to mmap with 1 (Operation not permitted)
>>> [perf@five ~]$ perf record -e cycles:u -a sleep 1
>>> [ perf record: Woken up 1 times to write data ]
>>> [ perf record: Captured and wrote 1.123 MB perf.data (132 samples) ]
>>> [perf@five ~]$
>>>
>>> Back to debugging this.
>>
>> Could makes sense adding cap_ipc_lock to the binary to isolate from this:
>>
>> kernel/events/core.c: 6101
>> 	if ((locked > lock_limit) && perf_is_paranoid() &&
>> 		!capable(CAP_IPC_LOCK)) {
>> 		ret = -EPERM;
>> 		goto unlock;
>> 	}
> 
> 
> That did the trick, I'll update the documentation and include in my
> "Committer testing" section:

Looks like top mode somehow reaches perf mmap limit described here [1].
Using -m option solves the issue avoiding cap_ipc_lock on my 8 cores machine:
perf top -e cycles -m 1

~Alexey

[1] https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html#memory-allocation
Arnaldo Carvalho de Melo April 7, 2020, 4:56 p.m. UTC | #7
Em Tue, Apr 07, 2020 at 01:36:54PM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Tue, Apr 07, 2020 at 05:54:27PM +0300, Alexey Budankov escreveu:
> > On 07.04.2020 17:35, Arnaldo Carvalho de Melo wrote:
> > > Em Tue, Apr 07, 2020 at 11:30:14AM -0300, Arnaldo Carvalho de Melo escreveu:
> > >> [perf@five ~]$ type perf
> > >> perf is hashed (/home/perf/bin/perf)
> > >> [perf@five ~]$ getcap /home/perf/bin/perf
> > >> /home/perf/bin/perf = cap_sys_ptrace,cap_syslog,38+ep
> > >> [perf@five ~]$ groups
> > >> perf perf_users
> > >> [perf@five ~]$ id
> > >> uid=1002(perf) gid=1002(perf) groups=1002(perf),1003(perf_users) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
> > >> [perf@five ~]$ perf top --stdio
> > >> Error:
> > >> Failed to mmap with 1 (Operation not permitted)
> > >> [perf@five ~]$ perf record -a
> > >> ^C[ perf record: Woken up 1 times to write data ]
> > >> [ perf record: Captured and wrote 1.177 MB perf.data (1552 samples) ]
> > >>
> > >> [perf@five ~]$ perf evlist
> > >> cycles:u
> > >> [perf@five ~]$
> > > 
> > > Humm, perf record falls back to cycles:u after initially trying cycles
> > > (i.e. kernel and userspace), lemme see trying 'perf top -e cycles:u',
> > > lemme test, humm not really:
> > > 
> > > [perf@five ~]$ perf top --stdio -e cycles:u
> > > Error:
> > > Failed to mmap with 1 (Operation not permitted)
> > > [perf@five ~]$ perf record -e cycles:u -a sleep 1
> > > [ perf record: Woken up 1 times to write data ]
> > > [ perf record: Captured and wrote 1.123 MB perf.data (132 samples) ]
> > > [perf@five ~]$
> > > 
> > > Back to debugging this.
> > 
> > Could makes sense adding cap_ipc_lock to the binary to isolate from this:
> > 
> > kernel/events/core.c: 6101
> > 	if ((locked > lock_limit) && perf_is_paranoid() &&
> > 		!capable(CAP_IPC_LOCK)) {
> > 		ret = -EPERM;
> > 		goto unlock;
> > 	}
> 
> 
> That did the trick, I'll update the documentation and include in my
> "Committer testing" section:
> 
> [perf@five ~]$ groups
> perf perf_users
> [perf@five ~]$ ls -lahF bin/perf
> -rwxr-x---. 1 root perf_users 24M Apr  7 10:34 bin/perf*
> [perf@five ~]$ getcap bin/perf
> bin/perf = cap_ipc_lock,cap_sys_ptrace,cap_syslog,38+ep
> [perf@five ~]$
> [perf@five ~]$ perf top --stdio
> 
> 
>    PerfTop:     652 irqs/sec  kernel:73.8%  exact: 99.7% lost: 0/0 drop: 0/0 [4000Hz cycles:u],  (all, 12 CPUs)
> ---------------------------------------------------------------------------------------------------------------
> 
>     13.03%  [kernel]               [k] module_get_kallsym
>      5.25%  [kernel]               [k] kallsyms_expand_symbol.constprop.0
>      5.00%  libc-2.30.so           [.] __GI_____strtoull_l_internal
>      4.41%  [kernel]               [k] memcpy
>      3.42%  [kernel]               [k] vsnprintf
>      2.98%  perf                   [.] map__process_kallsym_symbol
>      2.86%  [kernel]               [k] format_decode
>      2.73%  [kernel]               [k] number
>      2.70%  perf                   [.] rb_next
>      2.59%  perf                   [.] maps__split_kallsyms
>      2.54%  [kernel]               [k] string_nocheck
>      1.90%  libc-2.30.so           [.] _IO_getdelim
>      1.86%  [kernel]               [k] __x86_indirect_thunk_rax
>      1.53%  libc-2.30.so           [.] _int_malloc
>      1.48%  libc-2.30.so           [.] __memmove_avx_unaligned_erms
>      1.40%  [kernel]               [k] clear_page_rep
>      1.07%  perf                   [.] rb_insert_color
>      1.01%  libc-2.30.so           [.] _IO_feof
>      0.99%  perf                   [.] __dso__load_kallsyms
>      0.98%  [kernel]               [k] s_next
>      0.96%  perf                   [.] __rblist__findnew
>      0.95%  [kernel]               [k] strlen
>      0.95%  perf                   [.] arch__symbols__fixup_end
>      0.94%  libpixman-1.so.0.38.4  [.] 0x000000000006f4af
>      0.94%  perf                   [.] symbol__new
>      0.89%  libpixman-1.so.0.38.4  [.] 0x000000000006f4a0
>      0.86%  [kernel]               [k] seq_read
>      0.81%  libpixman-1.so.0.38.4  [.] 0x000000000006f4ab
>      0.80%  perf                   [.] __symbols__insert
>      0.73%  libpixman-1.so.0.38.4  [.] 0x000000000006f4a7
>      0.67%  [kernel]               [k] s_show
>      0.66%  libc-2.30.so           [.] __libc_calloc
>      0.61%  libpixman-1.so.0.38.4  [.] 0x000000000006f4bb
>      0.59%  [kernel]               [k] get_page_from_freelist
>      0.59%  perf                   [.] memcpy@plt
>      0.58%  perf                   [.] eprintf
> exiting.
> [perf@five ~]$
> 
> There is still something strange in here, the event is cycles:u (see at
> the PerfTop line, but it is getting kernel samples :-\

So running with 'perf top --stdio -vv 2> /tmp/output' I see we try
create three events, the first is some capability querying, then we try
to determine the max precision level, but continue with
attr.exclude_kernel=1, which shouldn't be the case, perhaps we're seeing
that it is not the root in the tooling part, and end up setting that to
1 as, previously, we knew it would fail, so we should switch to checking
if we have cap_perfmon too, will check that:

------------------------------------------------------------
perf_event_attr:
  type                             1
  size                             120
  config                           0x9
  watermark                        1
  sample_id_all                    1
  bpf_event                        1
  { wakeup_events, wakeup_watermark } 1
------------------------------------------------------------
------------------------------------------------------------
perf_event_attr:
  size                             120
  { sample_period, sample_freq }   4000
  sample_type                      IP|TID|TIME|CPU|PERIOD
  read_format                      ID
  disabled                         1
  inherit                          1
  exclude_kernel                   1
  mmap                             1
  comm                             1
  freq                             1
  task                             1
  precise_ip                       3
  sample_id_all                    1
  exclude_guest                    1
  mmap2                            1
  comm_exec                        1
  ksymbol                          1
  bpf_event                        1
------------------------------------------------------------
------------------------------------------------------------
perf_event_attr:
  size                             120
  { sample_period, sample_freq }   4000
  sample_type                      IP|TID|TIME|CPU|PERIOD
  read_format                      ID
  disabled                         1
  inherit                          1
  exclude_kernel                   1
  mmap                             1
  comm                             1
  freq                             1
  task                             1
  precise_ip                       2
  sample_id_all                    1
  exclude_guest                    1
  mmap2                            1
  comm_exec                        1
  ksymbol                          1
  bpf_event                        1
------------------------------------------------------------

But then, even with that attr.exclude_kernel set to 1 we _still_ get
kernel samples, which looks like another bug, now trying with strace,
which leads us to another rabbit hole:

[perf@five ~]$ strace -e perf_event_open -o /tmp/out.put perf top --stdio
Error:
You may not have permission to collect system-wide stats.

Consider tweaking /proc/sys/kernel/perf_event_paranoid,
which controls use of the performance events system by
unprivileged users (without CAP_PERFMON or CAP_SYS_ADMIN).

The current value is 2:

  -1: Allow use of (almost) all events by all users
      Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
>= 0: Disallow ftrace function tracepoint by users without CAP_PERFMON or CAP_SYS_ADMIN
      Disallow raw tracepoint access by users without CAP_SYS_PERFMON or CAP_SYS_ADMIN
>= 1: Disallow CPU event access by users without CAP_PERFMON or CAP_SYS_ADMIN
>= 2: Disallow kernel profiling by users without CAP_PERFMON or CAP_SYS_ADMIN

To make this setting permanent, edit /etc/sysctl.conf too, e.g.:

	kernel.perf_event_paranoid = -1

[perf@five ~]$

If I remove that strace -e ... from the front, 'perf top' is back
working as a non-cap_sys_admin user, just with cap_perfmon.

- Arnaldo
Arnaldo Carvalho de Melo April 7, 2020, 5:02 p.m. UTC | #8
Em Tue, Apr 07, 2020 at 07:52:56PM +0300, Alexey Budankov escreveu:
> 
> On 07.04.2020 19:36, Arnaldo Carvalho de Melo wrote:
> > Em Tue, Apr 07, 2020 at 05:54:27PM +0300, Alexey Budankov escreveu:
> >> Could makes sense adding cap_ipc_lock to the binary to isolate from this:

> >> kernel/events/core.c: 6101
> >> 	if ((locked > lock_limit) && perf_is_paranoid() &&
> >> 		!capable(CAP_IPC_LOCK)) {
> >> 		ret = -EPERM;
> >> 		goto unlock;
> >> 	}

> > That did the trick, I'll update the documentation and include in my
> > "Committer testing" section:
 
> Looks like top mode somehow reaches perf mmap limit described here [1].
> Using -m option solves the issue avoiding cap_ipc_lock on my 8 cores machine:
> perf top -e cycles -m 1

So this would read better?

diff --git a/Documentation/admin-guide/perf-security.rst b/Documentation/admin-guide/perf-security.rst
index ed33682e26b0..d44dd24b0244 100644
--- a/Documentation/admin-guide/perf-security.rst
+++ b/Documentation/admin-guide/perf-security.rst
@@ -127,8 +127,8 @@ taken to create such groups of privileged Perf users.
 
 ::
 
-   # setcap "cap_perfmon,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf
-   # setcap -v "cap_perfmon,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf
+   # setcap "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf
+   # setcap -v "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf
    perf: OK
    # getcap perf
    perf = cap_sys_ptrace,cap_syslog,cap_perfmon+ep
@@ -140,6 +140,10 @@ i.e.:
 
    # setcap "38,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf
 
+Note that you may need to have 'cap_ipc_lock' in the mix for tools such as
+'perf top', alternatively use 'perf top -m N', to reduce the memory that
+it uses for the perf ring buffer, see the memory allocation section below.
+
 As a result, members of perf_users group are capable of conducting
 performance monitoring and observability by using functionality of the
 configured Perf tool executable that, when executes, passes perf_events
Alexey Budankov April 7, 2020, 5:17 p.m. UTC | #9
On 07.04.2020 19:40, Arnaldo Carvalho de Melo wrote:
> Em Tue, Apr 07, 2020 at 01:36:54PM -0300, Arnaldo Carvalho de Melo escreveu:
>> Em Tue, Apr 07, 2020 at 05:54:27PM +0300, Alexey Budankov escreveu:
>>> On 07.04.2020 17:35, Arnaldo Carvalho de Melo wrote:
>>>> Em Tue, Apr 07, 2020 at 11:30:14AM -0300, Arnaldo Carvalho de Melo escreveu:
>>>>> [perf@five ~]$ type perf
>>>>> perf is hashed (/home/perf/bin/perf)
>>>>> [perf@five ~]$ getcap /home/perf/bin/perf
>>>>> /home/perf/bin/perf = cap_sys_ptrace,cap_syslog,38+ep
>>>>> [perf@five ~]$ groups
>>>>> perf perf_users
>>>>> [perf@five ~]$ id
>>>>> uid=1002(perf) gid=1002(perf) groups=1002(perf),1003(perf_users) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
>>>>> [perf@five ~]$ perf top --stdio
>>>>> Error:
>>>>> Failed to mmap with 1 (Operation not permitted)
>>>>> [perf@five ~]$ perf record -a
>>>>> ^C[ perf record: Woken up 1 times to write data ]
>>>>> [ perf record: Captured and wrote 1.177 MB perf.data (1552 samples) ]
>>>>>
>>>>> [perf@five ~]$ perf evlist
>>>>> cycles:u
>>>>> [perf@five ~]$
>>>>
>>>> Humm, perf record falls back to cycles:u after initially trying cycles
>>>> (i.e. kernel and userspace), lemme see trying 'perf top -e cycles:u',
>>>> lemme test, humm not really:
>>>>
>>>> [perf@five ~]$ perf top --stdio -e cycles:u
>>>> Error:
>>>> Failed to mmap with 1 (Operation not permitted)
>>>> [perf@five ~]$ perf record -e cycles:u -a sleep 1
>>>> [ perf record: Woken up 1 times to write data ]
>>>> [ perf record: Captured and wrote 1.123 MB perf.data (132 samples) ]
>>>> [perf@five ~]$
>>>>
>>>> Back to debugging this.
>>>
>>> Could makes sense adding cap_ipc_lock to the binary to isolate from this:
>>>
>>> kernel/events/core.c: 6101
>>> 	if ((locked > lock_limit) && perf_is_paranoid() &&
>>> 		!capable(CAP_IPC_LOCK)) {
>>> 		ret = -EPERM;
>>> 		goto unlock;
>>> 	}
>>
>>
>> That did the trick, I'll update the documentation and include in my
>> "Committer testing" section:
> 
> I ammended this to that patch, please check the wording:
> 
> - Arnaldo
> 
> diff --git a/Documentation/admin-guide/perf-security.rst b/Documentation/admin-guide/perf-security.rst
> index c0ca0c1a6804..ed33682e26b0 100644
> --- a/Documentation/admin-guide/perf-security.rst
> +++ b/Documentation/admin-guide/perf-security.rst
> @@ -127,12 +127,19 @@ taken to create such groups of privileged Perf users.
>  
>  ::
>  
> -   # setcap "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf
> -   # setcap -v "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf
> +   # setcap "cap_perfmon,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf
> +   # setcap -v "cap_perfmon,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf
>     perf: OK
>     # getcap perf
>     perf = cap_sys_ptrace,cap_syslog,cap_perfmon+ep
>  
> +If the libcap installed doesn't yet support "cap_perfmon", use "38" instead,
> +i.e.:
> +
> +::
> +
> +   # setcap "38,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf
> +
>  As a result, members of perf_users group are capable of conducting
>  performance monitoring and observability by using functionality of the
>  configured Perf tool executable that, when executes, passes perf_events
> 

Looks good to me. The paragraph just above should then also be extended to
mention that perf_events subsystem memory limit is ignored due to usage of
CAP_IPC_LOCK:

"As a result, members of perf_users group are capable of conducting
 performance monitoring and observability by using functionality of the
 configured Perf tool executable that, when executes, passes perf_events
 subsystem scope and perf_event_mlock_kb locking limit checks."

~Alexey
Arnaldo Carvalho de Melo April 7, 2020, 5:23 p.m. UTC | #10
Em Tue, Apr 07, 2020 at 01:56:43PM -0300, Arnaldo Carvalho de Melo escreveu:
> 
> But then, even with that attr.exclude_kernel set to 1 we _still_ get
> kernel samples, which looks like another bug, now trying with strace,
> which leads us to another rabbit hole:
> 
> [perf@five ~]$ strace -e perf_event_open -o /tmp/out.put perf top --stdio
> Error:
> You may not have permission to collect system-wide stats.
> 
> Consider tweaking /proc/sys/kernel/perf_event_paranoid,
> which controls use of the performance events system by
> unprivileged users (without CAP_PERFMON or CAP_SYS_ADMIN).
> 
> The current value is 2:
> 
>   -1: Allow use of (almost) all events by all users
>       Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
> >= 0: Disallow ftrace function tracepoint by users without CAP_PERFMON or CAP_SYS_ADMIN
>       Disallow raw tracepoint access by users without CAP_SYS_PERFMON or CAP_SYS_ADMIN
> >= 1: Disallow CPU event access by users without CAP_PERFMON or CAP_SYS_ADMIN
> >= 2: Disallow kernel profiling by users without CAP_PERFMON or CAP_SYS_ADMIN
> 
> To make this setting permanent, edit /etc/sysctl.conf too, e.g.:
> 
> 	kernel.perf_event_paranoid = -1
> 
> [perf@five ~]$
> 
> If I remove that strace -e ... from the front, 'perf top' is back
> working as a non-cap_sys_admin user, just with cap_perfmon.
> 

So I couldn't figure it out so far why is that exclude_kernel is being
set to 1, as perf-top when no event is passed defaults to this to find
out what to use as a default event:

  perf_evlist__add_default(top.evlist)
     perf_evsel__new_cycles(true);
	struct perf_event_attr attr = {
                .type   = PERF_TYPE_HARDWARE,
                .config = PERF_COUNT_HW_CPU_CYCLES,
                .exclude_kernel = !perf_event_can_profile_kernel(),
        };

			perf_event_paranoid_check(1);
			        return perf_cap__capable(CAP_SYS_ADMIN) ||
				       perf_cap__capable(CAP_PERFMON) ||
				       perf_event_paranoid() <= max_level;


And then that second condition should hold true, it returns true, and
then .exclude_kernel should be set to !true -> zero.o

Now the wallclock says I need to stop being a programmer and turn into a
daycare provider for Pedro, cya!

- Arnaldo
Alexey Budankov April 7, 2020, 5:32 p.m. UTC | #11
On 07.04.2020 20:02, Arnaldo Carvalho de Melo wrote:
> Em Tue, Apr 07, 2020 at 07:52:56PM +0300, Alexey Budankov escreveu:
>>
>> On 07.04.2020 19:36, Arnaldo Carvalho de Melo wrote:
>>> Em Tue, Apr 07, 2020 at 05:54:27PM +0300, Alexey Budankov escreveu:
>>>> Could makes sense adding cap_ipc_lock to the binary to isolate from this:
> 
>>>> kernel/events/core.c: 6101
>>>> 	if ((locked > lock_limit) && perf_is_paranoid() &&
>>>> 		!capable(CAP_IPC_LOCK)) {
>>>> 		ret = -EPERM;
>>>> 		goto unlock;
>>>> 	}
> 
>>> That did the trick, I'll update the documentation and include in my
>>> "Committer testing" section:
>  
>> Looks like top mode somehow reaches perf mmap limit described here [1].
>> Using -m option solves the issue avoiding cap_ipc_lock on my 8 cores machine:
>> perf top -e cycles -m 1
> 
> So this would read better?
> 
> diff --git a/Documentation/admin-guide/perf-security.rst b/Documentation/admin-guide/perf-security.rst
> index ed33682e26b0..d44dd24b0244 100644
> --- a/Documentation/admin-guide/perf-security.rst
> +++ b/Documentation/admin-guide/perf-security.rst
> @@ -127,8 +127,8 @@ taken to create such groups of privileged Perf users.
>  
>  ::
>  
> -   # setcap "cap_perfmon,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf
> -   # setcap -v "cap_perfmon,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf
> +   # setcap "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf
> +   # setcap -v "cap_perfmon,cap_sys_ptrace,cap_syslog=ep" perf
>     perf: OK
>     # getcap perf
>     perf = cap_sys_ptrace,cap_syslog,cap_perfmon+ep
> @@ -140,6 +140,10 @@ i.e.:
>  
>     # setcap "38,cap_ipc_lock,cap_sys_ptrace,cap_syslog=ep" perf
>  
> +Note that you may need to have 'cap_ipc_lock' in the mix for tools such as
> +'perf top', alternatively use 'perf top -m N', to reduce the memory that
> +it uses for the perf ring buffer, see the memory allocation section below.
> +

Let's stay with the first variant of you addition to this patch and also 
extend the paragraph below as suggested in other mail in the thread.

>  As a result, members of perf_users group are capable of conducting
>  performance monitoring and observability by using functionality of the
>  configured Perf tool executable that, when executes, passes perf_events
> 

Thanks,
Alexey
Ravi Bangoria July 10, 2020, 1:31 p.m. UTC | #12
Hi Alexey,

> Currently access to perf_events, i915_perf and other performance
> monitoring and observability subsystems of the kernel is open only for
> a privileged process [1] with CAP_SYS_ADMIN capability enabled in the
> process effective set [2].
> 
> This patch set introduces CAP_PERFMON capability designed to secure
> system performance monitoring and observability operations so that
> CAP_PERFMON would assist CAP_SYS_ADMIN capability in its governing role
> for performance monitoring and observability subsystems of the kernel.

I'm seeing an issue with CAP_PERFMON when I try to record data for a
specific target. I don't know whether this is sort of a regression or
an expected behavior.

Without setting CAP_PERFMON:

   $ getcap ./perf
   $ ./perf stat -a ls
     Error:
     Access to performance monitoring and observability operations is limited.
   $ ./perf stat ls
     Performance counter stats for 'ls':
    
                  2.06 msec task-clock:u              #    0.418 CPUs utilized
                     0      context-switches:u        #    0.000 K/sec
                     0      cpu-migrations:u          #    0.000 K/sec

With CAP_PERFMON:

   $ getcap ./perf
     ./perf = cap_perfmon+ep
   $ ./perf stat -a ls
     Performance counter stats for 'system wide':
    
                142.42 msec cpu-clock                 #   25.062 CPUs utilized
                   182      context-switches          #    0.001 M/sec
                    48      cpu-migrations            #    0.337 K/sec
   $ ./perf stat ls
     Error:
     Access to performance monitoring and observability operations is limited.

Am I missing something silly?

Analysis:
---------
A bit more analysis lead me to below kernel code fs/exec.c:

   begin_new_exec()
   {
         ...
         if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
             !(uid_eq(current_euid(), current_uid()) &&
               gid_eq(current_egid(), current_gid())))
                 set_dumpable(current->mm, suid_dumpable);
         else
                 set_dumpable(current->mm, SUID_DUMP_USER);

         ...
         commit_creds(bprm->cred);
   }

When I execute './perf stat ls', it's going into else condition and thus sets
dumpable flag as SUID_DUMP_USER. Then in commit_creds():

   int commit_creds(struct cred *new)
   {
         ...
         /* dumpability changes */
         if (...
             !cred_cap_issubset(old, new)) {
                 if (task->mm)
                         set_dumpable(task->mm, suid_dumpable);
   }

!cred_cap_issubset(old, new) fails for perf without any capability and thus
it doesn't execute set_dumpable(). Whereas that condition passes for perf
with CAP_PERFMON and thus it overwrites old value (SUID_DUMP_USER) with
suid_dumpable in mm_flags. On an Ubuntu, suid_dumpable default value is
SUID_DUMP_ROOT. On Fedora, it's SUID_DUMP_DISABLE. (/proc/sys/fs/suid_dumpable).

Now while opening an event:

   perf_event_open()
     ptrace_may_access()
       __ptrace_may_access() {
                 ...
                 if (mm &&
                     ((get_dumpable(mm) != SUID_DUMP_USER) &&
                      !ptrace_has_cap(cred, mm->user_ns, mode)))
                     return -EPERM;
       }

This if condition passes for perf with CAP_PERFMON and thus it returns -EPERM.
But it fails for perf without CAP_PERFMON and thus it goes ahead and returns
success. So opening an event fails when perf has CAP_PREFMON and tries to open
process specific event as normal user.

Workarounds:
------------
Based on above analysis, I found couple of workarounds (examples are on
Ubuntu 18.04.4 powerpc):

Workaround1:
Setting SUID_DUMP_USER as default (in /proc/sys/fs/suid_dumpable) solves the
issue.

   # echo 1 > /proc/sys/fs/suid_dumpable
   $ getcap ./perf
     ./perf = cap_perfmon+ep
   $ ./perf stat ls
     Performance counter stats for 'ls':
    
                  1.47 msec task-clock                #    0.806 CPUs utilized
                     0      context-switches          #    0.000 K/sec
                     0      cpu-migrations            #    0.000 K/sec

Workaround2:
Using CAP_SYS_PTRACE along with CAP_PERFMON solves the issue.

   $ cat /proc/sys/fs/suid_dumpable
     2
   # setcap "cap_perfmon,cap_sys_ptrace=ep" ./perf
   $ ./perf stat ls
     Performance counter stats for 'ls':
    
                  1.41 msec task-clock                #    0.826 CPUs utilized
                     0      context-switches          #    0.000 K/sec
                     0      cpu-migrations            #    0.000 K/sec

Workaround3:
Adding CAP_PERFMON to parent of perf (/bin/bash) also solves the issue.

   $ cat /proc/sys/fs/suid_dumpable
     2
   # setcap "cap_perfmon=ep" /bin/bash
   # setcap "cap_perfmon=ep" ./perf
   $ bash
   $ ./perf stat ls
     Performance counter stats for 'ls':
    
                  1.47 msec task-clock                #    0.806 CPUs utilized
                     0      context-switches          #    0.000 K/sec
                     0      cpu-migrations            #    0.000 K/sec

- Ravi
Alexey Budankov July 10, 2020, 2:30 p.m. UTC | #13
Hi Ravi,

On 10.07.2020 16:31, Ravi Bangoria wrote:
> Hi Alexey,
> 
>> Currently access to perf_events, i915_perf and other performance
>> monitoring and observability subsystems of the kernel is open only for
>> a privileged process [1] with CAP_SYS_ADMIN capability enabled in the
>> process effective set [2].
>>
>> This patch set introduces CAP_PERFMON capability designed to secure
>> system performance monitoring and observability operations so that
>> CAP_PERFMON would assist CAP_SYS_ADMIN capability in its governing role
>> for performance monitoring and observability subsystems of the kernel.
> 
> I'm seeing an issue with CAP_PERFMON when I try to record data for a
> specific target. I don't know whether this is sort of a regression or
> an expected behavior.

Thanks for reporting and root causing this case. The behavior looks like
kind of expected since currently CAP_PERFMON takes over the related part
of CAP_SYS_ADMIN credentials only. Actually Perf security docs [1] say
that access control is also subject to CAP_SYS_PTRACE credentials.

CAP_PERFMON could be used to extend and substitute ptrace_may_access()
check in perf_events subsystem to simplify user experience at least in
this specific case.

Alexei

[1] https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html

> 
> Without setting CAP_PERFMON:
> 
>   $ getcap ./perf
>   $ ./perf stat -a ls
>     Error:
>     Access to performance monitoring and observability operations is limited.
>   $ ./perf stat ls
>     Performance counter stats for 'ls':
>                     2.06 msec task-clock:u              #    0.418 CPUs utilized
>                     0      context-switches:u        #    0.000 K/sec
>                     0      cpu-migrations:u          #    0.000 K/sec
> 
> With CAP_PERFMON:
> 
>   $ getcap ./perf
>     ./perf = cap_perfmon+ep
>   $ ./perf stat -a ls
>     Performance counter stats for 'system wide':
>                   142.42 msec cpu-clock                 #   25.062 CPUs utilized
>                   182      context-switches          #    0.001 M/sec
>                    48      cpu-migrations            #    0.337 K/sec
>   $ ./perf stat ls
>     Error:
>     Access to performance monitoring and observability operations is limited.
> 
> Am I missing something silly?
> 
> Analysis:
> ---------
> A bit more analysis lead me to below kernel code fs/exec.c:
> 
>   begin_new_exec()
>   {
>         ...
>         if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
>             !(uid_eq(current_euid(), current_uid()) &&
>               gid_eq(current_egid(), current_gid())))
>                 set_dumpable(current->mm, suid_dumpable);
>         else
>                 set_dumpable(current->mm, SUID_DUMP_USER);
> 
>         ...
>         commit_creds(bprm->cred);
>   }
> 
> When I execute './perf stat ls', it's going into else condition and thus sets
> dumpable flag as SUID_DUMP_USER. Then in commit_creds():
> 
>   int commit_creds(struct cred *new)
>   {
>         ...
>         /* dumpability changes */
>         if (...
>             !cred_cap_issubset(old, new)) {
>                 if (task->mm)
>                         set_dumpable(task->mm, suid_dumpable);
>   }
> 
> !cred_cap_issubset(old, new) fails for perf without any capability and thus
> it doesn't execute set_dumpable(). Whereas that condition passes for perf
> with CAP_PERFMON and thus it overwrites old value (SUID_DUMP_USER) with
> suid_dumpable in mm_flags. On an Ubuntu, suid_dumpable default value is
> SUID_DUMP_ROOT. On Fedora, it's SUID_DUMP_DISABLE. (/proc/sys/fs/suid_dumpable).
> 
> Now while opening an event:
> 
>   perf_event_open()
>     ptrace_may_access()
>       __ptrace_may_access() {
>                 ...
>                 if (mm &&
>                     ((get_dumpable(mm) != SUID_DUMP_USER) &&
>                      !ptrace_has_cap(cred, mm->user_ns, mode)))
>                     return -EPERM;
>       }
> 
> This if condition passes for perf with CAP_PERFMON and thus it returns -EPERM.
> But it fails for perf without CAP_PERFMON and thus it goes ahead and returns
> success. So opening an event fails when perf has CAP_PREFMON and tries to open
> process specific event as normal user.
> 
> Workarounds:
> ------------
> Based on above analysis, I found couple of workarounds (examples are on
> Ubuntu 18.04.4 powerpc):
> 
> Workaround1:
> Setting SUID_DUMP_USER as default (in /proc/sys/fs/suid_dumpable) solves the
> issue.
> 
>   # echo 1 > /proc/sys/fs/suid_dumpable
>   $ getcap ./perf
>     ./perf = cap_perfmon+ep
>   $ ./perf stat ls
>     Performance counter stats for 'ls':
>                     1.47 msec task-clock                #    0.806 CPUs utilized
>                     0      context-switches          #    0.000 K/sec
>                     0      cpu-migrations            #    0.000 K/sec
> 
> Workaround2:
> Using CAP_SYS_PTRACE along with CAP_PERFMON solves the issue.
> 
>   $ cat /proc/sys/fs/suid_dumpable
>     2
>   # setcap "cap_perfmon,cap_sys_ptrace=ep" ./perf
>   $ ./perf stat ls
>     Performance counter stats for 'ls':
>                     1.41 msec task-clock                #    0.826 CPUs utilized
>                     0      context-switches          #    0.000 K/sec
>                     0      cpu-migrations            #    0.000 K/sec
> 
> Workaround3:
> Adding CAP_PERFMON to parent of perf (/bin/bash) also solves the issue.
> 
>   $ cat /proc/sys/fs/suid_dumpable
>     2
>   # setcap "cap_perfmon=ep" /bin/bash
>   # setcap "cap_perfmon=ep" ./perf
>   $ bash
>   $ ./perf stat ls
>     Performance counter stats for 'ls':
>                     1.47 msec task-clock                #    0.806 CPUs utilized
>                     0      context-switches          #    0.000 K/sec
>                     0      cpu-migrations            #    0.000 K/sec
> 
> - Ravi
Arnaldo Carvalho de Melo July 10, 2020, 5:09 p.m. UTC | #14
Em Fri, Jul 10, 2020 at 05:30:50PM +0300, Alexey Budankov escreveu:
> On 10.07.2020 16:31, Ravi Bangoria wrote:
> >> Currently access to perf_events, i915_perf and other performance
> >> monitoring and observability subsystems of the kernel is open only for
> >> a privileged process [1] with CAP_SYS_ADMIN capability enabled in the
> >> process effective set [2].

> >> This patch set introduces CAP_PERFMON capability designed to secure
> >> system performance monitoring and observability operations so that
> >> CAP_PERFMON would assist CAP_SYS_ADMIN capability in its governing role
> >> for performance monitoring and observability subsystems of the kernel.
 
> > I'm seeing an issue with CAP_PERFMON when I try to record data for a
> > specific target. I don't know whether this is sort of a regression or
> > an expected behavior.
 
> Thanks for reporting and root causing this case. The behavior looks like
> kind of expected since currently CAP_PERFMON takes over the related part
> of CAP_SYS_ADMIN credentials only. Actually Perf security docs [1] say
> that access control is also subject to CAP_SYS_PTRACE credentials.

I think that stating that in the error message would be helpful, after
all, who reads docs? 8-)

I.e., this:

$ ./perf stat ls
  Error:
  Access to performance monitoring and observability operations is limited.
$

Could become:

$ ./perf stat ls
  Error:
  Access to performance monitoring and observability operations is limited.
  Right now only CAP_PERFMON is granted, you may need CAP_SYS_PTRACE.
$

- Arnaldo
 
> CAP_PERFMON could be used to extend and substitute ptrace_may_access()
> check in perf_events subsystem to simplify user experience at least in
> this specific case.
> 
> Alexei
> 
> [1] https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html
> 
> > 
> > Without setting CAP_PERFMON:
> > 
> >   $ getcap ./perf
> >   $ ./perf stat -a ls
> >     Error:
> >     Access to performance monitoring and observability operations is limited.
> >   $ ./perf stat ls
> >     Performance counter stats for 'ls':
> >                     2.06 msec task-clock:u              #    0.418 CPUs utilized
> >                     0      context-switches:u        #    0.000 K/sec
> >                     0      cpu-migrations:u          #    0.000 K/sec
> > 
> > With CAP_PERFMON:
> > 
> >   $ getcap ./perf
> >     ./perf = cap_perfmon+ep
> >   $ ./perf stat -a ls
> >     Performance counter stats for 'system wide':
> >                   142.42 msec cpu-clock                 #   25.062 CPUs utilized
> >                   182      context-switches          #    0.001 M/sec
> >                    48      cpu-migrations            #    0.337 K/sec
> >   $ ./perf stat ls
> >     Error:
> >     Access to performance monitoring and observability operations is limited.
> > 
> > Am I missing something silly?
> > 
> > Analysis:
> > ---------
> > A bit more analysis lead me to below kernel code fs/exec.c:
> > 
> >   begin_new_exec()
> >   {
> >         ...
> >         if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
> >             !(uid_eq(current_euid(), current_uid()) &&
> >               gid_eq(current_egid(), current_gid())))
> >                 set_dumpable(current->mm, suid_dumpable);
> >         else
> >                 set_dumpable(current->mm, SUID_DUMP_USER);
> > 
> >         ...
> >         commit_creds(bprm->cred);
> >   }
> > 
> > When I execute './perf stat ls', it's going into else condition and thus sets
> > dumpable flag as SUID_DUMP_USER. Then in commit_creds():
> > 
> >   int commit_creds(struct cred *new)
> >   {
> >         ...
> >         /* dumpability changes */
> >         if (...
> >             !cred_cap_issubset(old, new)) {
> >                 if (task->mm)
> >                         set_dumpable(task->mm, suid_dumpable);
> >   }
> > 
> > !cred_cap_issubset(old, new) fails for perf without any capability and thus
> > it doesn't execute set_dumpable(). Whereas that condition passes for perf
> > with CAP_PERFMON and thus it overwrites old value (SUID_DUMP_USER) with
> > suid_dumpable in mm_flags. On an Ubuntu, suid_dumpable default value is
> > SUID_DUMP_ROOT. On Fedora, it's SUID_DUMP_DISABLE. (/proc/sys/fs/suid_dumpable).
> > 
> > Now while opening an event:
> > 
> >   perf_event_open()
> >     ptrace_may_access()
> >       __ptrace_may_access() {
> >                 ...
> >                 if (mm &&
> >                     ((get_dumpable(mm) != SUID_DUMP_USER) &&
> >                      !ptrace_has_cap(cred, mm->user_ns, mode)))
> >                     return -EPERM;
> >       }
> > 
> > This if condition passes for perf with CAP_PERFMON and thus it returns -EPERM.
> > But it fails for perf without CAP_PERFMON and thus it goes ahead and returns
> > success. So opening an event fails when perf has CAP_PREFMON and tries to open
> > process specific event as normal user.
> > 
> > Workarounds:
> > ------------
> > Based on above analysis, I found couple of workarounds (examples are on
> > Ubuntu 18.04.4 powerpc):
> > 
> > Workaround1:
> > Setting SUID_DUMP_USER as default (in /proc/sys/fs/suid_dumpable) solves the
> > issue.
> > 
> >   # echo 1 > /proc/sys/fs/suid_dumpable
> >   $ getcap ./perf
> >     ./perf = cap_perfmon+ep
> >   $ ./perf stat ls
> >     Performance counter stats for 'ls':
> >                     1.47 msec task-clock                #    0.806 CPUs utilized
> >                     0      context-switches          #    0.000 K/sec
> >                     0      cpu-migrations            #    0.000 K/sec
> > 
> > Workaround2:
> > Using CAP_SYS_PTRACE along with CAP_PERFMON solves the issue.
> > 
> >   $ cat /proc/sys/fs/suid_dumpable
> >     2
> >   # setcap "cap_perfmon,cap_sys_ptrace=ep" ./perf
> >   $ ./perf stat ls
> >     Performance counter stats for 'ls':
> >                     1.41 msec task-clock                #    0.826 CPUs utilized
> >                     0      context-switches          #    0.000 K/sec
> >                     0      cpu-migrations            #    0.000 K/sec
> > 
> > Workaround3:
> > Adding CAP_PERFMON to parent of perf (/bin/bash) also solves the issue.
> > 
> >   $ cat /proc/sys/fs/suid_dumpable
> >     2
> >   # setcap "cap_perfmon=ep" /bin/bash
> >   # setcap "cap_perfmon=ep" ./perf
> >   $ bash
> >   $ ./perf stat ls
> >     Performance counter stats for 'ls':
> >                     1.47 msec task-clock                #    0.806 CPUs utilized
> >                     0      context-switches          #    0.000 K/sec
> >                     0      cpu-migrations            #    0.000 K/sec
> > 
> > - Ravi
Alexey Budankov July 13, 2020, 9:48 a.m. UTC | #15
On 10.07.2020 20:09, Arnaldo Carvalho de Melo wrote:
> Em Fri, Jul 10, 2020 at 05:30:50PM +0300, Alexey Budankov escreveu:
>> On 10.07.2020 16:31, Ravi Bangoria wrote:
>>>> Currently access to perf_events, i915_perf and other performance
>>>> monitoring and observability subsystems of the kernel is open only for
>>>> a privileged process [1] with CAP_SYS_ADMIN capability enabled in the
>>>> process effective set [2].
> 
>>>> This patch set introduces CAP_PERFMON capability designed to secure
>>>> system performance monitoring and observability operations so that
>>>> CAP_PERFMON would assist CAP_SYS_ADMIN capability in its governing role
>>>> for performance monitoring and observability subsystems of the kernel.
>  
>>> I'm seeing an issue with CAP_PERFMON when I try to record data for a
>>> specific target. I don't know whether this is sort of a regression or
>>> an expected behavior.
>  
>> Thanks for reporting and root causing this case. The behavior looks like
>> kind of expected since currently CAP_PERFMON takes over the related part
>> of CAP_SYS_ADMIN credentials only. Actually Perf security docs [1] say
>> that access control is also subject to CAP_SYS_PTRACE credentials.
> 
> I think that stating that in the error message would be helpful, after
> all, who reads docs? 8-)

At least those who write it :D ...

> 
> I.e., this:
> 
> $ ./perf stat ls
>   Error:
>   Access to performance monitoring and observability operations is limited.
> $
> 
> Could become:
> 
> $ ./perf stat ls
>   Error:
>   Access to performance monitoring and observability operations is limited.
>   Right now only CAP_PERFMON is granted, you may need CAP_SYS_PTRACE.
> $

It would better provide reference to perf security docs in the tool output.
Looks like extending ptrace_may_access() check for perf_events with CAP_PERFMON
makes monitoring simpler and even more secure to use since Perf tool need
not to start/stop/single-step and read/write registers and memory and so on
like a debugger or strace-like tool. What do you think?

Alexei

> 
> - Arnaldo
>  
>> CAP_PERFMON could be used to extend and substitute ptrace_may_access()
>> check in perf_events subsystem to simplify user experience at least in
>> this specific case.
>>
>> Alexei
>>
>> [1] https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html
>>
>>>
>>> Without setting CAP_PERFMON:
>>>
>>>   $ getcap ./perf
>>>   $ ./perf stat -a ls
>>>     Error:
>>>     Access to performance monitoring and observability operations is limited.
>>>   $ ./perf stat ls
>>>     Performance counter stats for 'ls':
>>>                     2.06 msec task-clock:u              #    0.418 CPUs utilized
>>>                     0      context-switches:u        #    0.000 K/sec
>>>                     0      cpu-migrations:u          #    0.000 K/sec
>>>
>>> With CAP_PERFMON:
>>>
>>>   $ getcap ./perf
>>>     ./perf = cap_perfmon+ep
>>>   $ ./perf stat -a ls
>>>     Performance counter stats for 'system wide':
>>>                   142.42 msec cpu-clock                 #   25.062 CPUs utilized
>>>                   182      context-switches          #    0.001 M/sec
>>>                    48      cpu-migrations            #    0.337 K/sec
>>>   $ ./perf stat ls
>>>     Error:
>>>     Access to performance monitoring and observability operations is limited.
>>>
>>> Am I missing something silly?
>>>
>>> Analysis:
>>> ---------
>>> A bit more analysis lead me to below kernel code fs/exec.c:
>>>
>>>   begin_new_exec()
>>>   {
>>>         ...
>>>         if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
>>>             !(uid_eq(current_euid(), current_uid()) &&
>>>               gid_eq(current_egid(), current_gid())))
>>>                 set_dumpable(current->mm, suid_dumpable);
>>>         else
>>>                 set_dumpable(current->mm, SUID_DUMP_USER);
>>>
>>>         ...
>>>         commit_creds(bprm->cred);
>>>   }
>>>
>>> When I execute './perf stat ls', it's going into else condition and thus sets
>>> dumpable flag as SUID_DUMP_USER. Then in commit_creds():
>>>
>>>   int commit_creds(struct cred *new)
>>>   {
>>>         ...
>>>         /* dumpability changes */
>>>         if (...
>>>             !cred_cap_issubset(old, new)) {
>>>                 if (task->mm)
>>>                         set_dumpable(task->mm, suid_dumpable);
>>>   }
>>>
>>> !cred_cap_issubset(old, new) fails for perf without any capability and thus
>>> it doesn't execute set_dumpable(). Whereas that condition passes for perf
>>> with CAP_PERFMON and thus it overwrites old value (SUID_DUMP_USER) with
>>> suid_dumpable in mm_flags. On an Ubuntu, suid_dumpable default value is
>>> SUID_DUMP_ROOT. On Fedora, it's SUID_DUMP_DISABLE. (/proc/sys/fs/suid_dumpable).
>>>
>>> Now while opening an event:
>>>
>>>   perf_event_open()
>>>     ptrace_may_access()
>>>       __ptrace_may_access() {
>>>                 ...
>>>                 if (mm &&
>>>                     ((get_dumpable(mm) != SUID_DUMP_USER) &&
>>>                      !ptrace_has_cap(cred, mm->user_ns, mode)))
>>>                     return -EPERM;
>>>       }
>>>
>>> This if condition passes for perf with CAP_PERFMON and thus it returns -EPERM.
>>> But it fails for perf without CAP_PERFMON and thus it goes ahead and returns
>>> success. So opening an event fails when perf has CAP_PREFMON and tries to open
>>> process specific event as normal user.
>>>
>>> Workarounds:
>>> ------------
>>> Based on above analysis, I found couple of workarounds (examples are on
>>> Ubuntu 18.04.4 powerpc):
>>>
>>> Workaround1:
>>> Setting SUID_DUMP_USER as default (in /proc/sys/fs/suid_dumpable) solves the
>>> issue.
>>>
>>>   # echo 1 > /proc/sys/fs/suid_dumpable
>>>   $ getcap ./perf
>>>     ./perf = cap_perfmon+ep
>>>   $ ./perf stat ls
>>>     Performance counter stats for 'ls':
>>>                     1.47 msec task-clock                #    0.806 CPUs utilized
>>>                     0      context-switches          #    0.000 K/sec
>>>                     0      cpu-migrations            #    0.000 K/sec
>>>
>>> Workaround2:
>>> Using CAP_SYS_PTRACE along with CAP_PERFMON solves the issue.
>>>
>>>   $ cat /proc/sys/fs/suid_dumpable
>>>     2
>>>   # setcap "cap_perfmon,cap_sys_ptrace=ep" ./perf
>>>   $ ./perf stat ls
>>>     Performance counter stats for 'ls':
>>>                     1.41 msec task-clock                #    0.826 CPUs utilized
>>>                     0      context-switches          #    0.000 K/sec
>>>                     0      cpu-migrations            #    0.000 K/sec
>>>
>>> Workaround3:
>>> Adding CAP_PERFMON to parent of perf (/bin/bash) also solves the issue.
>>>
>>>   $ cat /proc/sys/fs/suid_dumpable
>>>     2
>>>   # setcap "cap_perfmon=ep" /bin/bash
>>>   # setcap "cap_perfmon=ep" ./perf
>>>   $ bash
>>>   $ ./perf stat ls
>>>     Performance counter stats for 'ls':
>>>                     1.47 msec task-clock                #    0.806 CPUs utilized
>>>                     0      context-switches          #    0.000 K/sec
>>>                     0      cpu-migrations            #    0.000 K/sec
>>>
>>> - Ravi
>
Arnaldo Carvalho de Melo July 13, 2020, 12:17 p.m. UTC | #16
Em Mon, Jul 13, 2020 at 12:48:25PM +0300, Alexey Budankov escreveu:
> 
> On 10.07.2020 20:09, Arnaldo Carvalho de Melo wrote:
> > Em Fri, Jul 10, 2020 at 05:30:50PM +0300, Alexey Budankov escreveu:
> >> On 10.07.2020 16:31, Ravi Bangoria wrote:
> >>>> Currently access to perf_events, i915_perf and other performance
> >>>> monitoring and observability subsystems of the kernel is open only for
> >>>> a privileged process [1] with CAP_SYS_ADMIN capability enabled in the
> >>>> process effective set [2].

> >>>> This patch set introduces CAP_PERFMON capability designed to secure
> >>>> system performance monitoring and observability operations so that
> >>>> CAP_PERFMON would assist CAP_SYS_ADMIN capability in its governing role
> >>>> for performance monitoring and observability subsystems of the kernel.

> >>> I'm seeing an issue with CAP_PERFMON when I try to record data for a
> >>> specific target. I don't know whether this is sort of a regression or
> >>> an expected behavior.

> >> Thanks for reporting and root causing this case. The behavior looks like
> >> kind of expected since currently CAP_PERFMON takes over the related part
> >> of CAP_SYS_ADMIN credentials only. Actually Perf security docs [1] say
> >> that access control is also subject to CAP_SYS_PTRACE credentials.

> > I think that stating that in the error message would be helpful, after
> > all, who reads docs? 8-)

> At least those who write it :D ...

Everybody should read it, sure :-)
 
> > I.e., this:
> > 
> > $ ./perf stat ls
> >   Error:
> >   Access to performance monitoring and observability operations is limited.
> > $
> > 
> > Could become:
> > 
> > $ ./perf stat ls
> >   Error:
> >   Access to performance monitoring and observability operations is limited.
> >   Right now only CAP_PERFMON is granted, you may need CAP_SYS_PTRACE.
> > $
> 
> It would better provide reference to perf security docs in the tool output.

So add a 3rd line:

$ ./perf stat ls
  Error:
  Access to performance monitoring and observability operations is limited.
  Right now only CAP_PERFMON is granted, you may need CAP_SYS_PTRACE.
  Please read the 'Perf events and tool security' document:
  https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html

> Looks like extending ptrace_may_access() check for perf_events with CAP_PERFMON

You mean the following?

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 856d98c36f56..a2397f724c10 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -11595,7 +11595,7 @@ SYSCALL_DEFINE5(perf_event_open,
 		 * perf_event_exit_task() that could imply).
 		 */
 		err = -EACCES;
-		if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS))
+		if (!perfmon_capable() && !ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS))
 			goto err_cred;
 	}

> makes monitoring simpler and even more secure to use since Perf tool need
> not to start/stop/single-step and read/write registers and memory and so on
> like a debugger or strace-like tool. What do you think?

I tend to agree, Peter?
 
> Alexei
> 
> > 
> > - Arnaldo
> >  
> >> CAP_PERFMON could be used to extend and substitute ptrace_may_access()
> >> check in perf_events subsystem to simplify user experience at least in
> >> this specific case.
> >>
> >> Alexei
> >>
> >> [1] https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html
> >>
> >>>
> >>> Without setting CAP_PERFMON:
> >>>
> >>>   $ getcap ./perf
> >>>   $ ./perf stat -a ls
> >>>     Error:
> >>>     Access to performance monitoring and observability operations is limited.
> >>>   $ ./perf stat ls
> >>>     Performance counter stats for 'ls':
> >>>                     2.06 msec task-clock:u              #    0.418 CPUs utilized
> >>>                     0      context-switches:u        #    0.000 K/sec
> >>>                     0      cpu-migrations:u          #    0.000 K/sec
> >>>
> >>> With CAP_PERFMON:
> >>>
> >>>   $ getcap ./perf
> >>>     ./perf = cap_perfmon+ep
> >>>   $ ./perf stat -a ls
> >>>     Performance counter stats for 'system wide':
> >>>                   142.42 msec cpu-clock                 #   25.062 CPUs utilized
> >>>                   182      context-switches          #    0.001 M/sec
> >>>                    48      cpu-migrations            #    0.337 K/sec
> >>>   $ ./perf stat ls
> >>>     Error:
> >>>     Access to performance monitoring and observability operations is limited.
> >>>
> >>> Am I missing something silly?
> >>>
> >>> Analysis:
> >>> ---------
> >>> A bit more analysis lead me to below kernel code fs/exec.c:
> >>>
> >>>   begin_new_exec()
> >>>   {
> >>>         ...
> >>>         if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP ||
> >>>             !(uid_eq(current_euid(), current_uid()) &&
> >>>               gid_eq(current_egid(), current_gid())))
> >>>                 set_dumpable(current->mm, suid_dumpable);
> >>>         else
> >>>                 set_dumpable(current->mm, SUID_DUMP_USER);
> >>>
> >>>         ...
> >>>         commit_creds(bprm->cred);
> >>>   }
> >>>
> >>> When I execute './perf stat ls', it's going into else condition and thus sets
> >>> dumpable flag as SUID_DUMP_USER. Then in commit_creds():
> >>>
> >>>   int commit_creds(struct cred *new)
> >>>   {
> >>>         ...
> >>>         /* dumpability changes */
> >>>         if (...
> >>>             !cred_cap_issubset(old, new)) {
> >>>                 if (task->mm)
> >>>                         set_dumpable(task->mm, suid_dumpable);
> >>>   }
> >>>
> >>> !cred_cap_issubset(old, new) fails for perf without any capability and thus
> >>> it doesn't execute set_dumpable(). Whereas that condition passes for perf
> >>> with CAP_PERFMON and thus it overwrites old value (SUID_DUMP_USER) with
> >>> suid_dumpable in mm_flags. On an Ubuntu, suid_dumpable default value is
> >>> SUID_DUMP_ROOT. On Fedora, it's SUID_DUMP_DISABLE. (/proc/sys/fs/suid_dumpable).
> >>>
> >>> Now while opening an event:
> >>>
> >>>   perf_event_open()
> >>>     ptrace_may_access()
> >>>       __ptrace_may_access() {
> >>>                 ...
> >>>                 if (mm &&
> >>>                     ((get_dumpable(mm) != SUID_DUMP_USER) &&
> >>>                      !ptrace_has_cap(cred, mm->user_ns, mode)))
> >>>                     return -EPERM;
> >>>       }
> >>>
> >>> This if condition passes for perf with CAP_PERFMON and thus it returns -EPERM.
> >>> But it fails for perf without CAP_PERFMON and thus it goes ahead and returns
> >>> success. So opening an event fails when perf has CAP_PREFMON and tries to open
> >>> process specific event as normal user.
> >>>
> >>> Workarounds:
> >>> ------------
> >>> Based on above analysis, I found couple of workarounds (examples are on
> >>> Ubuntu 18.04.4 powerpc):
> >>>
> >>> Workaround1:
> >>> Setting SUID_DUMP_USER as default (in /proc/sys/fs/suid_dumpable) solves the
> >>> issue.
> >>>
> >>>   # echo 1 > /proc/sys/fs/suid_dumpable
> >>>   $ getcap ./perf
> >>>     ./perf = cap_perfmon+ep
> >>>   $ ./perf stat ls
> >>>     Performance counter stats for 'ls':
> >>>                     1.47 msec task-clock                #    0.806 CPUs utilized
> >>>                     0      context-switches          #    0.000 K/sec
> >>>                     0      cpu-migrations            #    0.000 K/sec
> >>>
> >>> Workaround2:
> >>> Using CAP_SYS_PTRACE along with CAP_PERFMON solves the issue.
> >>>
> >>>   $ cat /proc/sys/fs/suid_dumpable
> >>>     2
> >>>   # setcap "cap_perfmon,cap_sys_ptrace=ep" ./perf
> >>>   $ ./perf stat ls
> >>>     Performance counter stats for 'ls':
> >>>                     1.41 msec task-clock                #    0.826 CPUs utilized
> >>>                     0      context-switches          #    0.000 K/sec
> >>>                     0      cpu-migrations            #    0.000 K/sec
> >>>
> >>> Workaround3:
> >>> Adding CAP_PERFMON to parent of perf (/bin/bash) also solves the issue.
> >>>
> >>>   $ cat /proc/sys/fs/suid_dumpable
> >>>     2
> >>>   # setcap "cap_perfmon=ep" /bin/bash
> >>>   # setcap "cap_perfmon=ep" ./perf
> >>>   $ bash
> >>>   $ ./perf stat ls
> >>>     Performance counter stats for 'ls':
> >>>                     1.47 msec task-clock                #    0.806 CPUs utilized
> >>>                     0      context-switches          #    0.000 K/sec
> >>>                     0      cpu-migrations            #    0.000 K/sec
> >>>
> >>> - Ravi
> >
Alexey Budankov July 13, 2020, 12:37 p.m. UTC | #17
On 13.07.2020 15:17, Arnaldo Carvalho de Melo wrote:
> Em Mon, Jul 13, 2020 at 12:48:25PM +0300, Alexey Budankov escreveu:
>>
>> On 10.07.2020 20:09, Arnaldo Carvalho de Melo wrote:
>>> Em Fri, Jul 10, 2020 at 05:30:50PM +0300, Alexey Budankov escreveu:
>>>> On 10.07.2020 16:31, Ravi Bangoria wrote:
>>>>>> Currently access to perf_events, i915_perf and other performance
>>>>>> monitoring and observability subsystems of the kernel is open only for
>>>>>> a privileged process [1] with CAP_SYS_ADMIN capability enabled in the
>>>>>> process effective set [2].
> 
>>>>>> This patch set introduces CAP_PERFMON capability designed to secure
>>>>>> system performance monitoring and observability operations so that
>>>>>> CAP_PERFMON would assist CAP_SYS_ADMIN capability in its governing role
>>>>>> for performance monitoring and observability subsystems of the kernel.
> 
>>>>> I'm seeing an issue with CAP_PERFMON when I try to record data for a
>>>>> specific target. I don't know whether this is sort of a regression or
>>>>> an expected behavior.
> 
>>>> Thanks for reporting and root causing this case. The behavior looks like
>>>> kind of expected since currently CAP_PERFMON takes over the related part
>>>> of CAP_SYS_ADMIN credentials only. Actually Perf security docs [1] say
>>>> that access control is also subject to CAP_SYS_PTRACE credentials.
> 
>>> I think that stating that in the error message would be helpful, after
>>> all, who reads docs? 8-)
> 
>> At least those who write it :D ...
> 
> Everybody should read it, sure :-)
>  
>>> I.e., this:
>>>
>>> $ ./perf stat ls
>>>   Error:
>>>   Access to performance monitoring and observability operations is limited.
>>> $
>>>
>>> Could become:
>>>
>>> $ ./perf stat ls
>>>   Error:
>>>   Access to performance monitoring and observability operations is limited.
>>>   Right now only CAP_PERFMON is granted, you may need CAP_SYS_PTRACE.
>>> $
>>
>> It would better provide reference to perf security docs in the tool output.
> 
> So add a 3rd line:
> 
> $ ./perf stat ls
>   Error:
>   Access to performance monitoring and observability operations is limited.
>   Right now only CAP_PERFMON is granted, you may need CAP_SYS_PTRACE.
>   Please read the 'Perf events and tool security' document:
>   https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html
If it had that patch below then message change would not be required.
However this two sentences in the end of whole message would still add up:
"Please read the 'Perf events and tool security' document:
 https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html"

> 
>> Looks like extending ptrace_may_access() check for perf_events with CAP_PERFMON
> 
> You mean the following?

Exactly that.

> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 856d98c36f56..a2397f724c10 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -11595,7 +11595,7 @@ SYSCALL_DEFINE5(perf_event_open,
>  		 * perf_event_exit_task() that could imply).
>  		 */
>  		err = -EACCES;
> -		if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS))
> +		if (!perfmon_capable() && !ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS))
>  			goto err_cred;
>  	}
> 
>> makes monitoring simpler and even more secure to use since Perf tool need
>> not to start/stop/single-step and read/write registers and memory and so on
>> like a debugger or strace-like tool. What do you think?
> 
> I tend to agree, Peter?
>  
>> Alexei
>>
>>>
>>> - Arnaldo

Alexei
Arnaldo Carvalho de Melo July 13, 2020, 6:51 p.m. UTC | #18
Em Mon, Jul 13, 2020 at 03:37:51PM +0300, Alexey Budankov escreveu:
> 
> On 13.07.2020 15:17, Arnaldo Carvalho de Melo wrote:
> > Em Mon, Jul 13, 2020 at 12:48:25PM +0300, Alexey Budankov escreveu:
> >>
> >> On 10.07.2020 20:09, Arnaldo Carvalho de Melo wrote:
> >>> Em Fri, Jul 10, 2020 at 05:30:50PM +0300, Alexey Budankov escreveu:
> >>>> On 10.07.2020 16:31, Ravi Bangoria wrote:
> >>>>>> Currently access to perf_events, i915_perf and other performance
> >>>>>> monitoring and observability subsystems of the kernel is open only for
> >>>>>> a privileged process [1] with CAP_SYS_ADMIN capability enabled in the
> >>>>>> process effective set [2].
> > 
> >>>>>> This patch set introduces CAP_PERFMON capability designed to secure
> >>>>>> system performance monitoring and observability operations so that
> >>>>>> CAP_PERFMON would assist CAP_SYS_ADMIN capability in its governing role
> >>>>>> for performance monitoring and observability subsystems of the kernel.
> > 
> >>>>> I'm seeing an issue with CAP_PERFMON when I try to record data for a
> >>>>> specific target. I don't know whether this is sort of a regression or
> >>>>> an expected behavior.
> > 
> >>>> Thanks for reporting and root causing this case. The behavior looks like
> >>>> kind of expected since currently CAP_PERFMON takes over the related part
> >>>> of CAP_SYS_ADMIN credentials only. Actually Perf security docs [1] say
> >>>> that access control is also subject to CAP_SYS_PTRACE credentials.
> > 
> >>> I think that stating that in the error message would be helpful, after
> >>> all, who reads docs? 8-)
> > 
> >> At least those who write it :D ...
> > 
> > Everybody should read it, sure :-)
> >  
> >>> I.e., this:
> >>>
> >>> $ ./perf stat ls
> >>>   Error:
> >>>   Access to performance monitoring and observability operations is limited.
> >>> $
> >>>
> >>> Could become:
> >>>
> >>> $ ./perf stat ls
> >>>   Error:
> >>>   Access to performance monitoring and observability operations is limited.
> >>>   Right now only CAP_PERFMON is granted, you may need CAP_SYS_PTRACE.
> >>> $
> >>
> >> It would better provide reference to perf security docs in the tool output.
> > 
> > So add a 3rd line:
> > 
> > $ ./perf stat ls
> >   Error:
> >   Access to performance monitoring and observability operations is limited.
> >   Right now only CAP_PERFMON is granted, you may need CAP_SYS_PTRACE.
> >   Please read the 'Perf events and tool security' document:
> >   https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html

> If it had that patch below then message change would not be required.

Sure, but the tool should continue to work and provide useful messages
when running on kernels without that change. Pointing to the document is
valid and should be done, that is an agreed point. But the tool can do
some checks, narrow down the possible causes for the error message and
provide something that in most cases will make the user make progress.

> However this two sentences in the end of whole message would still add up:
> "Please read the 'Perf events and tool security' document:
>  https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html"

We're in violent agreement here. :-)
 
> > 
> >> Looks like extending ptrace_may_access() check for perf_events with CAP_PERFMON
> > 
> > You mean the following?
> 
> Exactly that.

Sure, lets then wait for others to chime in and then you can go ahead
and submit that patch.

Peter?

- Arnaldo
 
> > 
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index 856d98c36f56..a2397f724c10 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -11595,7 +11595,7 @@ SYSCALL_DEFINE5(perf_event_open,
> >  		 * perf_event_exit_task() that could imply).
> >  		 */
> >  		err = -EACCES;
> > -		if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS))
> > +		if (!perfmon_capable() && !ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS))
> >  			goto err_cred;
> >  	}
> > 
> >> makes monitoring simpler and even more secure to use since Perf tool need
> >> not to start/stop/single-step and read/write registers and memory and so on
> >> like a debugger or strace-like tool. What do you think?
> > 
> > I tend to agree, Peter?
> >  
> >> Alexei
> >>
> >>>
> >>> - Arnaldo
> 
> Alexei
Peter Zijlstra July 14, 2020, 10:59 a.m. UTC | #19
On Mon, Jul 13, 2020 at 03:51:52PM -0300, Arnaldo Carvalho de Melo wrote:

> > > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > > index 856d98c36f56..a2397f724c10 100644
> > > --- a/kernel/events/core.c
> > > +++ b/kernel/events/core.c
> > > @@ -11595,7 +11595,7 @@ SYSCALL_DEFINE5(perf_event_open,
> > >  		 * perf_event_exit_task() that could imply).
> > >  		 */
> > >  		err = -EACCES;
> > > -		if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS))
> > > +		if (!perfmon_capable() && !ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS))
> > >  			goto err_cred;
> > >  	}
> > > 
> > >> makes monitoring simpler and even more secure to use since Perf tool need
> > >> not to start/stop/single-step and read/write registers and memory and so on
> > >> like a debugger or strace-like tool. What do you think?
> > > 
> > > I tend to agree, Peter?

So this basically says that if CAP_PERFMON, we don't care about the
ptrace() permissions? Just like how CAP_SYS_PTRACE would always allow
the ptrace checks?

I suppose that makes sense.
Arnaldo Carvalho de Melo July 14, 2020, 3:27 p.m. UTC | #20
Em Tue, Jul 14, 2020 at 12:59:34PM +0200, Peter Zijlstra escreveu:
> On Mon, Jul 13, 2020 at 03:51:52PM -0300, Arnaldo Carvalho de Melo wrote:
> 
> > > > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > > > index 856d98c36f56..a2397f724c10 100644
> > > > --- a/kernel/events/core.c
> > > > +++ b/kernel/events/core.c
> > > > @@ -11595,7 +11595,7 @@ SYSCALL_DEFINE5(perf_event_open,
> > > >  		 * perf_event_exit_task() that could imply).
> > > >  		 */
> > > >  		err = -EACCES;
> > > > -		if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS))
> > > > +		if (!perfmon_capable() && !ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS))
> > > >  			goto err_cred;
> > > >  	}

> > > >> makes monitoring simpler and even more secure to use since Perf tool need
> > > >> not to start/stop/single-step and read/write registers and memory and so on
> > > >> like a debugger or strace-like tool. What do you think?

> > > > I tend to agree, Peter?

> So this basically says that if CAP_PERFMON, we don't care about the
> ptrace() permissions? Just like how CAP_SYS_PTRACE would always allow
> the ptrace checks?

> I suppose that makes sense.

Yeah, it in fact addresses the comment right above it:

        if (task) {
                err = mutex_lock_interruptible(&task->signal->exec_update_mutex);
                if (err)
                        goto err_task;

                /*
                 * Reuse ptrace permission checks for now.
                 *
                 * We must hold exec_update_mutex across this and any potential
                 * perf_install_in_context() call for this new event to
                 * serialize against exec() altering our credentials (and the
                 * perf_event_exit_task() that could imply).
                 */
                err = -EACCES;
                if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS))
                        goto err_cred;
        }


that "for now" part :-)

Idea is to not require CAP_PTRACE for that, i.e. the attack surface for the
perf binary is reduced.

- Arnaldo
Alexey Budankov July 21, 2020, 1:06 p.m. UTC | #21
On 13.07.2020 21:51, Arnaldo Carvalho de Melo wrote:
> Em Mon, Jul 13, 2020 at 03:37:51PM +0300, Alexey Budankov escreveu:
>>
>> On 13.07.2020 15:17, Arnaldo Carvalho de Melo wrote:
>>> Em Mon, Jul 13, 2020 at 12:48:25PM +0300, Alexey Budankov escreveu:
>>>>
>>>> On 10.07.2020 20:09, Arnaldo Carvalho de Melo wrote:
>>>>> Em Fri, Jul 10, 2020 at 05:30:50PM +0300, Alexey Budankov escreveu:
>>>>>> On 10.07.2020 16:31, Ravi Bangoria wrote:
>>>>>>>> Currently access to perf_events, i915_perf and other performance
>>>>>>>> monitoring and observability subsystems of the kernel is open only for
>>>>>>>> a privileged process [1] with CAP_SYS_ADMIN capability enabled in the
>>>>>>>> process effective set [2].
>>>
>>>>>>>> This patch set introduces CAP_PERFMON capability designed to secure
>>>>>>>> system performance monitoring and observability operations so that
>>>>>>>> CAP_PERFMON would assist CAP_SYS_ADMIN capability in its governing role
>>>>>>>> for performance monitoring and observability subsystems of the kernel.
>>>
>>>>>>> I'm seeing an issue with CAP_PERFMON when I try to record data for a
>>>>>>> specific target. I don't know whether this is sort of a regression or
>>>>>>> an expected behavior.
>>>
>>>>>> Thanks for reporting and root causing this case. The behavior looks like
>>>>>> kind of expected since currently CAP_PERFMON takes over the related part
>>>>>> of CAP_SYS_ADMIN credentials only. Actually Perf security docs [1] say
>>>>>> that access control is also subject to CAP_SYS_PTRACE credentials.
>>>
>>>>> I think that stating that in the error message would be helpful, after
>>>>> all, who reads docs? 8-)
>>>
>>>> At least those who write it :D ...
>>>
>>> Everybody should read it, sure :-)
>>>  
>>>>> I.e., this:
>>>>>
>>>>> $ ./perf stat ls
>>>>>   Error:
>>>>>   Access to performance monitoring and observability operations is limited.
>>>>> $
>>>>>
>>>>> Could become:
>>>>>
>>>>> $ ./perf stat ls
>>>>>   Error:
>>>>>   Access to performance monitoring and observability operations is limited.
>>>>>   Right now only CAP_PERFMON is granted, you may need CAP_SYS_PTRACE.
>>>>> $
>>>>
>>>> It would better provide reference to perf security docs in the tool output.
>>>
>>> So add a 3rd line:
>>>
>>> $ ./perf stat ls
>>>   Error:
>>>   Access to performance monitoring and observability operations is limited.
>>>   Right now only CAP_PERFMON is granted, you may need CAP_SYS_PTRACE.
>>>   Please read the 'Perf events and tool security' document:
>>>   https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html
> 
>> If it had that patch below then message change would not be required.
> 
> Sure, but the tool should continue to work and provide useful messages
> when running on kernels without that change. Pointing to the document is
> valid and should be done, that is an agreed point. But the tool can do
> some checks, narrow down the possible causes for the error message and
> provide something that in most cases will make the user make progress.
> 
>> However this two sentences in the end of whole message would still add up:
>> "Please read the 'Perf events and tool security' document:
>>  https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html"
> 
> We're in violent agreement here. :-)

Here is the message draft mentioning a) CAP_SYS_PTRACE, for kernels prior
v5.8, and b) Perf security document link. The plan is to send a patch extending
perf_events with CAP_PERFMON check [1] for ptrace_may_access() and extending
the tool with this message.

"Access to performance monitoring and observability operations is limited.
 Enforced MAC policy settings (SELinux) can limit access to performance
 monitoring and observability operations. Inspect system audit records for
 more perf_event access control information and adjusting the policy.
 Consider adjusting /proc/sys/kernel/perf_event_paranoid setting to open
 access to performance monitoring and observability operations for processes
 without CAP_PERFMON, CAP_SYS_PTRACE or CAP_SYS_ADMIN Linux capability.
 More information can be found at 'Perf events and tool security' document:
 https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html
 perf_event_paranoid setting is -1:
     -1: Allow use of (almost) all events by all users
           Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
 >= 0: Disallow raw and ftrace function tracepoint access
 >= 1: Disallow CPU event access
 >= 2: Disallow kernel profiling
 To make the adjusted perf_event_paranoid setting permanent preserve it
 in /etc/sysctl.conf (e.g. kernel.perf_event_paranoid = <setting>)"

Alexei

[1] https://lore.kernel.org/lkml/20200713121746.GA7029@kernel.org/

>  
>>>
>>>> Looks like extending ptrace_may_access() check for perf_events with CAP_PERFMON
>>>
>>> You mean the following?
>>
>> Exactly that.
> 
> Sure, lets then wait for others to chime in and then you can go ahead
> and submit that patch.
> 
> Peter?
> 
> - Arnaldo
>  
>>>
>>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>>> index 856d98c36f56..a2397f724c10 100644
>>> --- a/kernel/events/core.c
>>> +++ b/kernel/events/core.c
>>> @@ -11595,7 +11595,7 @@ SYSCALL_DEFINE5(perf_event_open,
>>>  		 * perf_event_exit_task() that could imply).
>>>  		 */
>>>  		err = -EACCES;
>>> -		if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS))
>>> +		if (!perfmon_capable() && !ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS))
>>>  			goto err_cred;
>>>  	}
>>>
>>>> makes monitoring simpler and even more secure to use since Perf tool need
>>>> not to start/stop/single-step and read/write registers and memory and so on
>>>> like a debugger or strace-like tool. What do you think?
>>>
>>> I tend to agree, Peter?
>>>  
>>>> Alexei
>>>>
>>>>>
>>>>> - Arnaldo
>>
>> Alexei
>
Arnaldo Carvalho de Melo July 22, 2020, 11:30 a.m. UTC | #22
Em Tue, Jul 21, 2020 at 04:06:34PM +0300, Alexey Budankov escreveu:
> 
> On 13.07.2020 21:51, Arnaldo Carvalho de Melo wrote:
> > Em Mon, Jul 13, 2020 at 03:37:51PM +0300, Alexey Budankov escreveu:
> >>
> >> On 13.07.2020 15:17, Arnaldo Carvalho de Melo wrote:
> >>> Em Mon, Jul 13, 2020 at 12:48:25PM +0300, Alexey Budankov escreveu:
> >> If it had that patch below then message change would not be required.

> > Sure, but the tool should continue to work and provide useful messages
> > when running on kernels without that change. Pointing to the document is
> > valid and should be done, that is an agreed point. But the tool can do
> > some checks, narrow down the possible causes for the error message and
> > provide something that in most cases will make the user make progress.

> >> However this two sentences in the end of whole message would still add up:
> >> "Please read the 'Perf events and tool security' document:
> >>  https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html"

> > We're in violent agreement here. :-)
 
> Here is the message draft mentioning a) CAP_SYS_PTRACE, for kernels prior
> v5.8, and b) Perf security document link. The plan is to send a patch extending
> perf_events with CAP_PERFMON check [1] for ptrace_may_access() and extending
> the tool with this message.
 
> "Access to performance monitoring and observability operations is limited.
>  Enforced MAC policy settings (SELinux) can limit access to performance
>  monitoring and observability operations. Inspect system audit records for
>  more perf_event access control information and adjusting the policy.
>  Consider adjusting /proc/sys/kernel/perf_event_paranoid setting to open
>  access to performance monitoring and observability operations for processes
>  without CAP_PERFMON, CAP_SYS_PTRACE or CAP_SYS_ADMIN Linux capability.
>  More information can be found at 'Perf events and tool security' document:
>  https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html
>  perf_event_paranoid setting is -1:
>      -1: Allow use of (almost) all events by all users
>            Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
>  >= 0: Disallow raw and ftrace function tracepoint access
>  >= 1: Disallow CPU event access
>  >= 2: Disallow kernel profiling
>  To make the adjusted perf_event_paranoid setting permanent preserve it
>  in /etc/sysctl.conf (e.g. kernel.perf_event_paranoid = <setting>)"

Looks ok! Lots of knobs to control access as one needs.

- Arnaldo
 
> Alexei
> 
> [1] https://lore.kernel.org/lkml/20200713121746.GA7029@kernel.org/