procfs/dmabuf: Add /proc/<pid>/task/<tid>/dmabuf_fds

Message ID 20210126225138.1823266-1-kaleshsingh@google.com (mailing list archive)
State New, archived
Series procfs/dmabuf: Add /proc/<pid>/task/<tid>/dmabuf_fds

Commit Message

Kalesh Singh Jan. 26, 2021, 10:51 p.m. UTC
In order to measure how much memory a process actually consumes, it is
necessary to include the DMA buffer sizes for that process in the memory
accounting. Since the handles to DMA buffers are raw FDs, it is important
to be able to identify which processes have FD references to a DMA buffer.

Currently, DMA buffer FDs can be accounted using /proc/<pid>/fd/* and
/proc/<pid>/fdinfo -- both of which are only root readable, as follows:
  1. Do a readlink on each FD.
  2. If the target path begins with "/dmabuf", then the FD is a dmabuf FD.
  3. stat the file to get the dmabuf inode number.
  4. Read /proc/<pid>/fdinfo/<fd> to get the DMA buffer size.
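
Concretely, that walk looks roughly like this in userspace (a sketch with
minimal error handling; it relies on the "/dmabuf" link target and the
"size:" fdinfo field described above):

#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static void scan_pid_dmabufs(pid_t pid)
{
	char path[128], target[128], line[128];
	struct dirent *d;
	DIR *dir;

	snprintf(path, sizeof(path), "/proc/%d/fd", (int)pid);
	dir = opendir(path);
	if (!dir)
		return;

	while ((d = readdir(dir)) != NULL) {
		struct stat st;
		ssize_t n;
		FILE *f;

		/* 1. readlink each fd symlink ("." and ".." fail here) */
		snprintf(path, sizeof(path), "/proc/%d/fd/%s",
			 (int)pid, d->d_name);
		n = readlink(path, target, sizeof(target) - 1);
		if (n <= 0)
			continue;
		target[n] = '\0';

		/* 2. dmabuf fds resolve to a path beginning with "/dmabuf" */
		if (strncmp(target, "/dmabuf", 7) != 0)
			continue;

		/* 3. stat the fd to get the dmabuf inode number */
		if (stat(path, &st) != 0)
			continue;

		/* 4. parse the "size:" field out of fdinfo */
		snprintf(path, sizeof(path), "/proc/%d/fdinfo/%s",
			 (int)pid, d->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;
		while (fgets(line, sizeof(line), f))
			if (strncmp(line, "size:", 5) == 0)
				printf("ino %llu %s",
				       (unsigned long long)st.st_ino, line);
		fclose(f);
	}
	closedir(dir);
}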

Android captures per-process system memory state when certain low memory
events (e.g. a foreground app kill) occur, to identify potential memory
hoggers. To include a process’s dmabuf usage as part of its memory state,
the data collection needs to be fast enough to reflect the memory state at
the time of such events.

Since reading /proc/<pid>/fd/ and /proc/<pid>/fdinfo/ requires root
privileges, this approach is not suitable for production builds. Granting
root privileges even to a system process increases the attack surface and
is highly undesirable. Additionally, this is slow as it requires many
context switches for searching and getting the dma-buf info.

With the addition of per-buffer dmabuf stats in sysfs [1], the DMA buffer
details can be queried using their unique inode numbers.

This patch proposes adding a /proc/<pid>/task/<tid>/dmabuf_fds interface.

/proc/<pid>/task/<tid>/dmabuf_fds contains a list of inode numbers for
every DMA buffer FD that the task has. The same inode number can appear
more than once, indicating the total number of FD references for the
associated DMA buffer.

If a thread shares the same files as the group leader then its
dmabuf_fds file will be empty, as these dmabufs are reported by the
group leader.

The interface requires PTRACE_MODE_READ_FSCREDS (same as /proc/<pid>/maps)
and allows the efficient accounting of per-process DMA buffer usage without
requiring root privileges. (See data below)
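
In userspace the proposed flow then reduces to one read of dmabuf_fds per
task plus one small sysfs read per inode. A sketch (the
/sys/kernel/dmabuf/buffers/<inode>/size attribute is the one proposed in
[1]; error handling trimmed):

#include <stdio.h>
#include <sys/types.h>

static unsigned long long dmabuf_bytes(pid_t pid, pid_t tid)
{
	unsigned long long total = 0, size;
	unsigned long ino;
	char path[96];
	FILE *f, *s;

	snprintf(path, sizeof(path), "/proc/%d/task/%d/dmabuf_fds",
		 (int)pid, (int)tid);
	f = fopen(path, "r");
	if (!f)
		return 0;

	/* one whitespace-separated inode number per dmabuf fd */
	while (fscanf(f, "%lu", &ino) == 1) {
		snprintf(path, sizeof(path),
			 "/sys/kernel/dmabuf/buffers/%lu/size", ino);
		s = fopen(path, "r");
		if (s) {
			if (fscanf(s, "%llu", &size) == 1)
				total += size;	/* once per fd reference */
			fclose(s);
		}
	}
	fclose(f);
	return total;
}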

Performance Comparison:
-----------------------

The following data compares the time to capture the sizes of all DMA
buffers referenced by FDs for all processes on an arm64 Android device.

-------------------------------------------------------
                   |  Core 0 (Little)  |  Core 7 (Big) |
-------------------------------------------------------
From <pid>/fdinfo  |      318 ms       |     145 ms    |
-------------------------------------------------------
Inodes from        |      114 ms       |      27 ms    |
dmabuf_fds;        |    (2.8x  ^)      |   (5.4x  ^)   |
data from sysfs    |                   |               |
-------------------------------------------------------

It can be inferred that even in the worst case there is a 2.8x speedup
for determining per-process DMA buffer FD sizes when using the proposed
interfaces.

[1] https://lore.kernel.org/dri-devel/20210119225723.388883-1-hridya@google.com/

Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
---
 Documentation/filesystems/proc.rst |  30 ++++++
 drivers/dma-buf/dma-buf.c          |   7 +-
 fs/proc/Makefile                   |   1 +
 fs/proc/base.c                     |   1 +
 fs/proc/dma_bufs.c                 | 159 +++++++++++++++++++++++++++++
 fs/proc/internal.h                 |   1 +
 include/linux/dma-buf.h            |   5 +
 7 files changed, 198 insertions(+), 6 deletions(-)
 create mode 100644 fs/proc/dma_bufs.c

Comments

Michal Hocko Jan. 27, 2021, 9:05 a.m. UTC | #1
[Cc linux-api as this is a new user interface]

On Tue 26-01-21 22:51:28, Kalesh Singh wrote:
> In order to measure how much memory a process actually consumes, it is
> necessary to include the DMA buffer sizes for that process in the memory
> accounting. Since the handles to DMA buffers are raw FDs, it is important
> to be able to identify which processes have FD references to a DMA buffer.
> 
> Currently, DMA buffer FDs can be accounted using /proc/<pid>/fd/* and
> /proc/<pid>/fdinfo -- both of which are only root readable, as follows:
>   1. Do a readlink on each FD.
>   2. If the target path begins with "/dmabuf", then the FD is a dmabuf FD.
>   3. stat the file to get the dmabuf inode number.
>   4. Read /proc/<pid>/fdinfo/<fd> to get the DMA buffer size.
> 
> Android captures per-process system memory state when certain low memory
> events (e.g. a foreground app kill) occur, to identify potential memory
> hoggers. To include a process’s dmabuf usage as part of its memory state,
> the data collection needs to be fast enough to reflect the memory state at
> the time of such events.
> 
> Since reading /proc/<pid>/fd/ and /proc/<pid>/fdinfo/ requires root
> privileges, this approach is not suitable for production builds. Granting
> root privileges even to a system process increases the attack surface and
> is highly undesirable. Additionally this is slow as it requires many
> context switches for searching and getting the dma-buf info.
> 
> With the addition of per-buffer dmabuf stats in sysfs [1], the DMA buffer
> details can be queried using their unique inode numbers.
> 
> This patch proposes adding a /proc/<pid>/task/<tid>/dmabuf_fds interface.
> 
> /proc/<pid>/task/<tid>/dmabuf_fds contains a list of inode numbers for
> every DMA buffer FD that the task has. Entries with the same inode
> number can appear more than once, indicating the total FD references
> for the associated DMA buffer.
> 
> If a thread shares the same files as the group leader then its
> dmabuf_fds file will be empty, as these dmabufs are reported by the
> group leader.
> 
> The interface requires PTRACE_MODE_READ_FSCREDS (same as /proc/<pid>/maps)
> and allows the efficient accounting of per-process DMA buffer usage without
> requiring root privileges. (See data below)
> 
> Performance Comparison:
> -----------------------
> 
> The following data compares the time to capture the sizes of all DMA
> buffers referenced by FDs for all processes on an arm64 Android device.
> 
> -------------------------------------------------------
>                    |  Core 0 (Little)  |  Core 7 (Big) |
> -------------------------------------------------------
> From <pid>/fdinfo  |      318 ms       |     145 ms    |
> -------------------------------------------------------
> Inodes from        |      114 ms       |      27 ms    |
> dmabuf_fds;        |    (2.8x  ^)      |   (5.4x  ^)   |
> data from sysfs    |                   |               |
> -------------------------------------------------------
> 
> It can be inferred that in the worst case there is a 2.8x speedup for
> determining per-process DMA buffer FD sizes, when using the proposed
> interfaces.
> 
> [1] https://lore.kernel.org/dri-devel/20210119225723.388883-1-hridya@google.com/
> 
> Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
> ---
>  Documentation/filesystems/proc.rst |  30 ++++++
>  drivers/dma-buf/dma-buf.c          |   7 +-
>  fs/proc/Makefile                   |   1 +
>  fs/proc/base.c                     |   1 +
>  fs/proc/dma_bufs.c                 | 159 +++++++++++++++++++++++++++++
>  fs/proc/internal.h                 |   1 +
>  include/linux/dma-buf.h            |   5 +
>  7 files changed, 198 insertions(+), 6 deletions(-)
>  create mode 100644 fs/proc/dma_bufs.c
> 
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index 2fa69f710e2a..757dd47ab679 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -47,6 +47,7 @@ fixes/update part 1.1  Stefani Seibold <stefani@seibold.net>    June 9 2009
>    3.10  /proc/<pid>/timerslack_ns - Task timerslack value
>    3.11	/proc/<pid>/patch_state - Livepatch patch operation state
>    3.12	/proc/<pid>/arch_status - Task architecture specific information
> +  3.13	/proc/<pid>/task/<tid>/dmabuf_fds - DMA buffers referenced by an FD
>  
>    4	Configuring procfs
>    4.1	Mount options
> @@ -2131,6 +2132,35 @@ AVX512_elapsed_ms
>    the task is unlikely an AVX512 user, but depends on the workload and the
>    scheduling scenario, it also could be a false negative mentioned above.
>  
> +3.13 /proc/<pid>/task/<tid>/dmabuf_fds - DMA buffers referenced by an FD
> +-------------------------------------------------------------------------
> +This file  exposes a list of the inode numbers for every DMA buffer
> +FD that the task has.
> +
> +The same inode number can appear more than once, indicating the total
> +FD references for the associated DMA buffer.
> +
> +The inode number can be used to lookup the DMA buffer information in
> +the sysfs interface /sys/kernel/dmabuf/buffers/<inode-no>/.
> +
> +Example Output
> +~~~~~~~~~~~~~~
> +$ cat /proc/612/task/612/dmabuf_fds
> +30972 30973 45678 49326
> +
> +Permission to access this file is governed by a ptrace access mode
> +PTRACE_MODE_READ_FSCREDS.
> +
> +Threads can have different files when created without specifying
> +the CLONE_FILES flag. For this reason the interface is presented as
> +/proc/<pid>/task/<tid>/dmabuf_fds and not /proc/<pid>/dmabuf_fds.
> +This simplifies kernel code and aggregation can be handled in
> +userspace.
> +
> +If a thread has the same files as its group leader, then its dmabuf_fds
> +file will be empty as these dmabufs are already reported by the
> +group leader.
> +
>  Chapter 4: Configuring procfs
>  =============================
>  
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 9ad6397aaa97..0660c06be4c6 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -29,8 +29,6 @@
>  #include <uapi/linux/dma-buf.h>
>  #include <uapi/linux/magic.h>
>  
> -static inline int is_dma_buf_file(struct file *);
> -
>  struct dma_buf_list {
>  	struct list_head head;
>  	struct mutex lock;
> @@ -434,10 +432,7 @@ static const struct file_operations dma_buf_fops = {
>  	.show_fdinfo	= dma_buf_show_fdinfo,
>  };
>  
> -/*
> - * is_dma_buf_file - Check if struct file* is associated with dma_buf
> - */
> -static inline int is_dma_buf_file(struct file *file)
> +int is_dma_buf_file(struct file *file)
>  {
>  	return file->f_op == &dma_buf_fops;
>  }
> diff --git a/fs/proc/Makefile b/fs/proc/Makefile
> index bd08616ed8ba..91a67f43ddf4 100644
> --- a/fs/proc/Makefile
> +++ b/fs/proc/Makefile
> @@ -16,6 +16,7 @@ proc-y	+= cmdline.o
>  proc-y	+= consoles.o
>  proc-y	+= cpuinfo.o
>  proc-y	+= devices.o
> +proc-y	+= dma_bufs.o
>  proc-y	+= interrupts.o
>  proc-y	+= loadavg.o
>  proc-y	+= meminfo.o
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index b3422cda2a91..af15a60b9831 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -3598,6 +3598,7 @@ static const struct pid_entry tid_base_stuff[] = {
>  #ifdef CONFIG_SECCOMP_CACHE_DEBUG
>  	ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache),
>  #endif
> +	REG("dmabuf_fds", 0444, proc_tid_dmabuf_fds_operations),
>  };
>  
>  static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx)
> diff --git a/fs/proc/dma_bufs.c b/fs/proc/dma_bufs.c
> new file mode 100644
> index 000000000000..46ea9cf968ed
> --- /dev/null
> +++ b/fs/proc/dma_bufs.c
> @@ -0,0 +1,159 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Per-process DMA-BUF Stats
> + *
> + * Copyright (C) 2021 Google LLC.
> + */
> +
> +#include <linux/dma-buf.h>
> +#include <linux/fdtable.h>
> +#include <linux/ptrace.h>
> +#include <linux/seq_file.h>
> +
> +#include "internal.h"
> +
> +struct dmabuf_fds_private {
> +	struct inode *inode;
> +	struct task_struct *task;
> +	struct file *dmabuf_file;
> +};
> +
> +static loff_t *next_dmabuf(struct dmabuf_fds_private *priv,
> +		loff_t *pos)
> +{
> +	struct fdtable *fdt;
> +	struct file *file;
> +
> +	rcu_read_lock();
> +	fdt = files_fdtable(priv->task->files);
> +	for (; *pos < fdt->max_fds; ++*pos) {
> +		file = files_lookup_fd_rcu(priv->task->files, (unsigned int) *pos);
> +		if (file && is_dma_buf_file(file) && get_file_rcu(file)) {
> +			priv->dmabuf_file = file;
> +			break;
> +		}
> +	}
> +	if (*pos >= fdt->max_fds)
> +		pos = NULL;
> +	rcu_read_unlock();
> +
> +	return pos;
> +}
> +
> +static void *dmabuf_fds_seq_start(struct seq_file *s, loff_t *pos)
> +{
> +	struct dmabuf_fds_private *priv = s->private;
> +	struct files_struct *group_leader_files;
> +
> +	priv->task = get_proc_task(priv->inode);
> +
> +	if (!priv->task)
> +		return ERR_PTR(-ESRCH);
> +
> +	/* Hold task lock for duration that files need to be stable */
> +	task_lock(priv->task);
> +
> +	/*
> +	 * If this task is not the group leader but shares the same files, leave file empty.
> +	 * These dmabufs are already reported in the group leader's dmabuf_fds.
> +	 */
> +	group_leader_files = priv->task->group_leader->files;
> +	if (priv->task != priv->task->group_leader && priv->task->files == group_leader_files) {
> +		task_unlock(priv->task);
> +		put_task_struct(priv->task);
> +		priv->task = NULL;
> +		return NULL;
> +	}
> +
> +	return next_dmabuf(priv, pos);
> +}
> +
> +static void *dmabuf_fds_seq_next(struct seq_file *s, void *v, loff_t *pos)
> +{
> +	++*pos;
> +	return next_dmabuf(s->private, pos);
> +}
> +
> +static void dmabuf_fds_seq_stop(struct seq_file *s, void *v)
> +{
> +	struct dmabuf_fds_private *priv = s->private;
> +
> +	if (priv->task) {
> +		task_unlock(priv->task);
> +		put_task_struct(priv->task);
> +
> +	}
> +	if (priv->dmabuf_file)
> +		fput(priv->dmabuf_file);
> +}
> +
> +static int dmabuf_fds_seq_show(struct seq_file *s, void *v)
> +{
> +	struct dmabuf_fds_private *priv = s->private;
> +	struct file *file = priv->dmabuf_file;
> +	struct dma_buf *dmabuf = file->private_data;
> +
> +	if (!dmabuf)
> +		return -ESRCH;
> +
> +	seq_printf(s, "%8lu ", file_inode(file)->i_ino);
> +
> +	fput(priv->dmabuf_file);
> +	priv->dmabuf_file = NULL;
> +
> +	return 0;
> +}
> +
> +static const struct seq_operations proc_tid_dmabuf_fds_seq_ops = {
> +	.start = dmabuf_fds_seq_start,
> +	.next  = dmabuf_fds_seq_next,
> +	.stop  = dmabuf_fds_seq_stop,
> +	.show  = dmabuf_fds_seq_show
> +};
> +
> +static int proc_dmabuf_fds_open(struct inode *inode, struct file *file,
> +		     const struct seq_operations *ops)
> +{
> +	struct dmabuf_fds_private *priv;
> +	struct task_struct *task;
> +	bool allowed = false;
> +
> +	task = get_proc_task(inode);
> +	if (!task)
> +		return -ESRCH;
> +
> +	allowed = ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
> +	put_task_struct(task);
> +
> +	if (!allowed)
> +		return -EACCES;
> +
> +	priv = __seq_open_private(file, ops, sizeof(*priv));
> +	if (!priv)
> +		return -ENOMEM;
> +
> +	priv->inode = inode;
> +	priv->task = NULL;
> +	priv->dmabuf_file = NULL;
> +
> +	return 0;
> +}
> +
> +static int proc_dmabuf_fds_release(struct inode *inode, struct file *file)
> +{
> +	return seq_release_private(inode, file);
> +}
> +
> +static int tid_dmabuf_fds_open(struct inode *inode, struct file *file)
> +{
> +	return proc_dmabuf_fds_open(inode, file,
> +			&proc_tid_dmabuf_fds_seq_ops);
> +}
> +
> +const struct file_operations proc_tid_dmabuf_fds_operations = {
> +	.open		= tid_dmabuf_fds_open,
> +	.read		= seq_read,
> +	.llseek		= seq_lseek,
> +	.release	= proc_dmabuf_fds_release,
> +};
> +
> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> index f60b379dcdc7..4ca74220db9c 100644
> --- a/fs/proc/internal.h
> +++ b/fs/proc/internal.h
> @@ -303,6 +303,7 @@ extern const struct file_operations proc_pid_smaps_operations;
>  extern const struct file_operations proc_pid_smaps_rollup_operations;
>  extern const struct file_operations proc_clear_refs_operations;
>  extern const struct file_operations proc_pagemap_operations;
> +extern const struct file_operations proc_tid_dmabuf_fds_operations;
>  
>  extern unsigned long task_vsize(struct mm_struct *);
>  extern unsigned long task_statm(struct mm_struct *,
> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> index cf72699cb2bc..087e11f7f193 100644
> --- a/include/linux/dma-buf.h
> +++ b/include/linux/dma-buf.h
> @@ -27,6 +27,11 @@ struct device;
>  struct dma_buf;
>  struct dma_buf_attachment;
>  
> +/**
> + * Check if struct file* is associated with dma_buf.
> + */
> +int is_dma_buf_file(struct file *file);
> +
>  /**
>   * struct dma_buf_ops - operations possible on struct dma_buf
>   * @vmap: [optional] creates a virtual mapping for the buffer into kernel
> -- 
> 2.30.0.280.ga3ce27912f-goog
Jann Horn Jan. 27, 2021, 10:47 a.m. UTC | #2
+jeffv from Android

On Tue, Jan 26, 2021 at 11:51 PM Kalesh Singh <kaleshsingh@google.com> wrote:
> In order to measure how much memory a process actually consumes, it is
> necessary to include the DMA buffer sizes for that process in the memory
> accounting. Since the handles to DMA buffers are raw FDs, it is important
> to be able to identify which processes have FD references to a DMA buffer.

Or you could try to let the DMA buffer take a reference on the
mm_struct and account its size into the mm_struct? That would probably
be nicer to work with than having to poke around in procfs separately
for DMA buffers.
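
A minimal sketch of that idea (entirely hypothetical: dma_buf has no
owner_mm field today, and charging into pinned_vm is just one
illustrative choice of counter):

/* Hypothetical sketch; owner_mm and the pinned_vm charge are illustrative. */
static void dma_buf_account_mm(struct dma_buf *dmabuf)
{
	struct mm_struct *mm = current->mm;

	if (!mm)
		return;
	mmgrab(mm);			/* pins the mm_struct, not the address space */
	dmabuf->owner_mm = mm;		/* hypothetical new field */
	atomic64_add(dmabuf->size >> PAGE_SHIFT, &mm->pinned_vm);
}

static void dma_buf_unaccount_mm(struct dma_buf *dmabuf)
{
	struct mm_struct *mm = dmabuf->owner_mm;

	if (!mm)
		return;
	atomic64_sub(dmabuf->size >> PAGE_SHIFT, &mm->pinned_vm);
	mmdrop(mm);
}

The export path would charge the buffer once and the release path would
undo it, so the usage becomes visible in the owner's per-mm counters.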

> Currently, DMA buffer FDs can be accounted using /proc/<pid>/fd/* and
> /proc/<pid>/fdinfo -- both of which are only root readable, as follows:

That's not quite right. They can both also be accessed by the user
owning the process. Also, fdinfo is a standard interface for
inspecting process state that doesn't permit reading process memory or
manipulating process state - so I think it would be fine to permit
access to fdinfo under a PTRACE_MODE_READ_FSCREDS check, just like the
interface you're suggesting.

>   1. Do a readlink on each FD.
>   2. If the target path begins with "/dmabuf", then the FD is a dmabuf FD.
>   3. stat the file to get the dmabuf inode number.
>   4. Read /proc/<pid>/fdinfo/<fd> to get the DMA buffer size.
>
> Android captures per-process system memory state when certain low memory
> events (e.g. a foreground app kill) occur, to identify potential memory
> hoggers. To include a process’s dmabuf usage as part of its memory state,
> the data collection needs to be fast enough to reflect the memory state at
> the time of such events.
>
> Since reading /proc/<pid>/fd/ and /proc/<pid>/fdinfo/ requires root
> privileges, this approach is not suitable for production builds.

It should be easy to add enough information to /proc/<pid>/fdinfo/ so
that you don't need to look at /proc/<pid>/fd/ anymore.

> Granting
> root privileges even to a system process increases the attack surface and
> is highly undesirable. Additionally this is slow as it requires many
> context switches for searching and getting the dma-buf info.

What do you mean by "context switches"? Task switches or kernel/user
transitions (e.g. via syscall)?

> With the addition of per-buffer dmabuf stats in sysfs [1], the DMA buffer
> details can be queried using their unique inode numbers.
>
> This patch proposes adding a /proc/<pid>/task/<tid>/dmabuf_fds interface.
>
> /proc/<pid>/task/<tid>/dmabuf_fds contains a list of inode numbers for
> every DMA buffer FD that the task has. Entries with the same inode
> number can appear more than once, indicating the total FD references
> for the associated DMA buffer.
>
> If a thread shares the same files as the group leader then its
> dmabuf_fds file will be empty, as these dmabufs are reported by the
> group leader.
>
> The interface requires PTRACE_MODE_READ_FSCREDS (same as /proc/<pid>/maps)
> and allows the efficient accounting of per-process DMA buffer usage without
> requiring root privileges. (See data below)

I'm not convinced that introducing a new procfs file for this is the
right way to go. And the idea of having to poke into multiple
different files in procfs and in sysfs just to be able to compute a
proper memory usage score for a process seems weird to me. "How much
memory is this process using" seems like the kind of question the
kernel ought to be able to answer (and the kernel needs to be able to
answer somewhat accurately so that its own OOM killer can do its job
properly)?
Christian König Jan. 27, 2021, 10:53 a.m. UTC | #3
Am 27.01.21 um 10:05 schrieb Michal Hocko:
> [Cc linux-api as this is a new user interface]
>
> On Tue 26-01-21 22:51:28, Kalesh Singh wrote:
>> In order to measure how much memory a process actually consumes, it is
>> necessary to include the DMA buffer sizes for that process in the memory
>> accounting. Since the handles to DMA buffers are raw FDs, it is important
>> to be able to identify which processes have FD references to a DMA buffer.
>>
>> Currently, DMA buffer FDs can be accounted using /proc/<pid>/fd/* and
>> /proc/<pid>/fdinfo -- both of which are only root readable, as follows:
>>    1. Do a readlink on each FD.
>>    2. If the target path begins with "/dmabuf", then the FD is a dmabuf FD.
>>    3. stat the file to get the dmabuf inode number.
>>    4. Read /proc/<pid>/fdinfo/<fd> to get the DMA buffer size.
>>
>> Android captures per-process system memory state when certain low memory
>> events (e.g a foreground app kill) occur, to identify potential memory
>> hoggers. To include a process’s dmabuf usage as part of its memory state,
>> the data collection needs to be fast enough to reflect the memory state at
>> the time of such events.
>>
>> Since reading /proc/<pid>/fd/ and /proc/<pid>/fdinfo/ requires root
>> privileges, this approach is not suitable for production builds. Granting
>> root privileges even to a system process increases the attack surface and
>> is highly undesirable. Additionally this is slow as it requires many
>> context switches for searching and getting the dma-buf info.
>>
>> With the addition of per-buffer dmabuf stats in sysfs [1], the DMA buffer
>> details can be queried using their unique inode numbers.

While this looks technically clean I have to agree with Daniel that this
approach doesn't sound like the right thing to do. The fundamental
problem goes deeper than what's proposed here.

In general processes are currently not held accountable for memory they 
reference through their file descriptors. DMA-buf is just one special case.

In other words you can currently do something like this

fd = memfd_create("test", 0);
while (1)
     write(fd, buf, 1024);

and the OOM killer will terminate random processes, but never the one 
holding the memfd reference.
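
(A self-contained version of that fragment, assuming a libc that exposes
memfd_create:)

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	char buf[1024];
	int fd = memfd_create("test", 0);

	if (fd < 0) {
		perror("memfd_create");
		return 1;
	}
	memset(buf, 0xa5, sizeof(buf));
	for (;;)
		if (write(fd, buf, sizeof(buf)) < 0)
			sleep(1);	/* out of memory; keep the fd alive */
	return 0;
}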

It even becomes worse with GPU and acceleration drivers where easily all
of the system memory is allocated for games or scientific
applications without being able to see that in proc or sysfs.

Some years ago I've proposed a way to at least improve the OOM killer's
decision about which process to terminate, but that never went anywhere.

Regards,
Christian.

Michal Hocko Jan. 27, 2021, 10:57 a.m. UTC | #4
On Wed 27-01-21 11:47:29, Jann Horn wrote:
> +jeffv from Android
> 
> On Tue, Jan 26, 2021 at 11:51 PM Kalesh Singh <kaleshsingh@google.com> wrote:
> > In order to measure how much memory a process actually consumes, it is
> > necessary to include the DMA buffer sizes for that process in the memory
> > accounting. Since the handles to DMA buffers are raw FDs, it is important
> > to be able to identify which processes have FD references to a DMA buffer.
> 
> Or you could try to let the DMA buffer take a reference on the
> mm_struct and account its size into the mm_struct? That would probably
> be nicer to work with than having to poke around in procfs separately
> for DMA buffers.

Yes, that would make some sense to me as well, but how do you know that
the process actually uses a buffer? If it mmaps it then you already have
that information via /proc/<pid>/maps. My understanding of dma-buf is
really coarse but my impression is that you can consume the memory via
the standard read syscall as well. How would you account for that?

[...]
Skipping over a large part of your response, but I do agree that the
interface is a really elaborate way to drill down to the information.

> I'm not convinced that introducing a new procfs file for this is the
> right way to go. And the idea of having to poke into multiple
> different files in procfs and in sysfs just to be able to compute a
> proper memory usage score for a process seems weird to me. "How much
> memory is this process using" seems like the kind of question the
> kernel ought to be able to answer (and the kernel needs to be able to
> answer somewhat accurately so that its own OOM killer can do its job
> properly)?

Well, shared buffers are tricky but it is true that we already consider
shmem in badness so this wouldn't go out of line. The kernel oom killer
could be more clever with these special fds, though, and query for the
buffer size directly.
Christian König Jan. 27, 2021, 11:01 a.m. UTC | #5
Am 27.01.21 um 11:57 schrieb Michal Hocko:
> On Wed 27-01-21 11:47:29, Jann Horn wrote:
>> +jeffv from Android
>>
>> On Tue, Jan 26, 2021 at 11:51 PM Kalesh Singh <kaleshsingh@google.com> wrote:
>>> In order to measure how much memory a process actually consumes, it is
>>> necessary to include the DMA buffer sizes for that process in the memory
>>> accounting. Since the handles to DMA buffers are raw FDs, it is important
>>> to be able to identify which processes have FD references to a DMA buffer.
>> Or you could try to let the DMA buffer take a reference on the
>> mm_struct and account its size into the mm_struct? That would probably
>> be nicer to work with than having to poke around in procfs separately
>> for DMA buffers.
> Yes that would make some sense to me as well but how do you know that
> the process actually uses a buffer? If it mmaps it then you already have
> that information via /proc/<pid>/maps. My understanding of dma-buf is
> really coarse but my impression is that you can consume the memory via
> the standard read syscall as well. How would you account for that?

Somewhat correct. This interface here really doesn't make sense since 
the file descriptor representation of DMA-buf is only meant to be used 
for short term usage.

E.g. the idea is that you can export a DMA-buf fd from your device 
driver, transfer that to another process and then import it again into a 
device driver.

Keeping a long term reference to a DMA-buf fd sounds like a design bug 
in userspace to me.

> [...]
> Skipping over a large part of your response but I do agree that the
> interface is really elaborate to drill down to the information.
>
>> I'm not convinced that introducing a new procfs file for this is the
>> right way to go. And the idea of having to poke into multiple
>> different files in procfs and in sysfs just to be able to compute a
>> proper memory usage score for a process seems weird to me. "How much
>> memory is this process using" seems like the kind of question the
>> kernel ought to be able to answer (and the kernel needs to be able to
>> answer somewhat accurately so that its own OOM killer can do its job
>> properly)?
> Well, shared buffers are tricky but it is true that we already consider
> shmem in badness so this wouldn't go out of line. Kernel oom killer
> could be more clever with these special fds though and query for buffer
> size directly.

Some years ago I've proposed a change to improve the OOM killer to take
into account how much memory is referenced through special file
descriptors like device drivers or DMA-buf.

But that never went anywhere because of concerns that this would add too
much overhead to the OOM killer.

Regards,
Christian.
Michal Hocko Jan. 27, 2021, 11:02 a.m. UTC | #6
On Wed 27-01-21 11:53:55, Christian König wrote:
[...]
> In general processes are currently not held accountable for memory they
> reference through their file descriptors. DMA-buf is just one special case.

True

> In other words you can currently do something like this
> 
> fd = memfd_create("test", 0);
> while (1)
>     write(fd, buf, 1024);
> 
> and the OOM killer will terminate random processes, but never the one
> holding the memfd reference.

memfd is just shmem under cover, no? And that means that the memory gets
accounted to MM_SHMEMPAGES. But you are right that this in its own
doesn't help much if the fd is shared and the memory stays behind a
killed victim.

But I do agree with you that there are resources which are bound to a
process life time but the oom killer has no idea about those as they are
not accounted on a per process level and/or oom_badness doesn't take
them into consideration.
Christian König Jan. 27, 2021, 11:08 a.m. UTC | #7
Am 27.01.21 um 12:02 schrieb Michal Hocko:
> On Wed 27-01-21 11:53:55, Christian König wrote:
> [...]
>> In general processes are currently not held accountable for memory they
>> reference through their file descriptors. DMA-buf is just one special case.
> True
>
>> In other words you can currently do something like this
>>
>> fd = memfd_create("test", 0);
>> while (1)
>>      write(fd, buf, 1024);
>>
>> and the OOM killer will terminate random processes, but never the one
>> holding the memfd reference.
> memfd is just shmem under cover, no? And that means that the memory gets
> accounted to MM_SHMEMPAGES. But you are right that this on its own
> doesn't help much if the fd is shared and the memory stays behind a
> killed victim.

I think so, yes. But I just tested this and it doesn't seem to work 
correctly.

When I run the few lines above, the OOM killer starts to terminate
processes, but since my small test program uses very, very little memory,
basically everything else gets terminated (including X, desktop, sshd,
etc.) before it is terminated as well.

Regards,
Christian.

> But I do agree with you that there are resources which are bound to a
> process life time but the oom killer has no idea about those as they are
> not accounted on a per process level and/or oom_badness doesn't take
> them into consideration.
Michal Hocko Jan. 27, 2021, 11:23 a.m. UTC | #8
On Wed 27-01-21 12:08:50, Christian König wrote:
> Am 27.01.21 um 12:02 schrieb Michal Hocko:
> > On Wed 27-01-21 11:53:55, Christian König wrote:
> > [...]
> > > In general processes are currently not held accountable for memory they
> > > reference through their file descriptors. DMA-buf is just one special case.
> > True
> > 
> > > In other words you can currently do something like this
> > > 
> > > fd = memfd_create("test", 0);
> > > while (1)
> > >      write(fd, buf, 1024);
> > > 
> > > and the OOM killer will terminate random processes, but never the one
> > > holding the memfd reference.
> > memfd is just shmem under cover, no? And that means that the memory gets
> > accounted to MM_SHMEMPAGES. But you are right that this on its own
> > doesn't help much if the fd is shared and the memory stays behind a
> > killed victim.
> 
> I think so, yes. But I just tested this and it doesn't seem to work
> correctly.
> 
> When I run the few lines above the OOM killer starts to terminate processes,
> but since my small test program uses very very little memory basically
> everything else gets terminated (including X, desktop, sshd etc.) before it
> is terminated as well.

Something worth looking into. Maybe those pages are not really accounted
properly after all. Can you send a separate email with the details and
the oom reports, please?
Michal Hocko Jan. 27, 2021, 11:27 a.m. UTC | #9
On Wed 27-01-21 12:01:55, Christian König wrote:
[...]
> Some years ago I've proposed a change to improve the OOM killer to take
> into account how much memory is referenced through special file descriptors
> like device drivers or DMA-buf.
> 
> But that never went anywhere because of concerns that this would add too much
> overhead to the OOM killer.

I have to admit I do not remember such a discussion. There were many on
the related topic; concerns were mostly about a proper accounting of
shared resources or the locking required to gain the information. I do not
remember overhead as being the main concern, as oom is a really cold path.

Anyway, if you want to revamp those patches then feel free to CC me and
we can discuss further. I do not want to hijack this thread by an
unrelated topic.
Kalesh Singh Jan. 27, 2021, 5:16 p.m. UTC | #10
On Wed, Jan 27, 2021 at 5:47 AM Jann Horn <jannh@google.com> wrote:
>
> +jeffv from Android
>
> On Tue, Jan 26, 2021 at 11:51 PM Kalesh Singh <kaleshsingh@google.com> wrote:
> > In order to measure how much memory a process actually consumes, it is
> > necessary to include the DMA buffer sizes for that process in the memory
> > accounting. Since the handles to DMA buffers are raw FDs, it is important
> > to be able to identify which processes have FD references to a DMA buffer.
>
> Or you could try to let the DMA buffer take a reference on the
> mm_struct and account its size into the mm_struct? That would probably
> be nicer to work with than having to poke around in procfs separately
> for DMA buffers.
>
> > Currently, DMA buffer FDs can be accounted using /proc/<pid>/fd/* and
> > /proc/<pid>/fdinfo -- both of which are only root readable, as follows:
>
> That's not quite right. They can both also be accessed by the user
> owning the process. Also, fdinfo is a standard interface for
> inspecting process state that doesn't permit reading process memory or
> manipulating process state - so I think it would be fine to permit
> access to fdinfo under a PTRACE_MODE_READ_FSCREDS check, just like the
> interface you're suggesting.


Hi everyone. Thank you for the feedback.

I understand there is a deeper problem of accounting for shared memory in
the kernel that’s not specific only to DMA buffers. In the case of DMA
buffers, I think Jann’s proposal is the cleanest way to attribute
the shared buffers to processes. I can respin a patch modifying fdinfo
as suggested, if this is not an issue from a security perspective.

Thanks,
Kalesh

Pekka Paalanen Jan. 28, 2021, 10:01 a.m. UTC | #11
On Wed, 27 Jan 2021 12:01:55 +0100
Christian König <christian.koenig@amd.com> wrote:

> Somewhat correct. This interface here really doesn't make sense since 
> the file descriptor representation of DMA-buf is only meant to be used 
> for short term usage.
> 
> E.g. the idea is that you can export a DMA-buf fd from your device 
> driver, transfer that to another process and then import it again into a 
> device driver.
> 
> Keeping a long term reference to a DMA-buf fd sounds like a design bug 
> in userspace to me.

Except keeping the fd is exactly what userspace must do if it wishes to
re-use the buffer without passing a new fd over IPC again. In particular,
Wayland compositors need to keep the client buffer dmabuf fd open after
receiving it, so that they can re-import it to EGL to ensure updated
contents are correctly flushed as EGL has no other API for it.

That is my vague understanding, and what Weston implements. You can say
it's a bad userspace API design in EGL, but what else can we do?

However, in the particular case of Wayland, the shared dmabufs should
be accounted to the Wayland client process. OOM-killing the client
process will eventually free the dmabuf, also the Wayland server
references to it. Killing the Wayland server (compositor, display
server) OTOH is something that should not be done as long as there are
e.g. Wayland clients to be killed.

Unfortunately(?), Wayland clients do not have a reason to keep the
dmabuf fd open themselves, so they probably close it as soon as it has
been sent to a display server. So the process that should be OOM-killed
does not have an open fd for the dmabuf (it probably has something
else, but not an mmap for CPU). Buffer re-use in Wayland does not
require re-sending the dmabuf fd over IPC.

(In general, dmabufs are never mmapped for CPU. They are accessed by
devices.)


Thanks,
pq
Christian König Jan. 29, 2021, 12:18 p.m. UTC | #12
Am 28.01.21 um 11:01 schrieb Pekka Paalanen:
> On Wed, 27 Jan 2021 12:01:55 +0100
> Christian König <christian.koenig@amd.com> wrote:
>
>> Somewhat correct. This interface here really doesn't make sense since
>> the file descriptor representation of DMA-buf is only meant to be used
>> for short term usage.
>>
>> E.g. the idea is that you can export a DMA-buf fd from your device
>> driver, transfer that to another process and then import it again into a
>> device driver.
>>
>> Keeping a long term reference to a DMA-buf fd sounds like a design bug
>> in userspace to me.
> Except keeping the fd is exactly what userspace must do if it wishes to
> re-use the buffer without passing a new fd over IPC again. Particularly
> Wayland compositors need to keep the client buffer dmabuf fd open after
> receiving it, so that they can re-import it to EGL to ensure updated
> contents are correctly flushed as EGL has no other API for it.

Hui what??? I'm not that deep into the EGL specification, but from the
kernel side that is utter nonsense.

It could be that re-importing triggers something in userspace, but this
sounds strongly like a workaround to me, one that shouldn't be necessary.

> That is my vague understanding, and what Weston implements. You can say
> it's a bad userspace API design in EGL, but what else can we do?

Please open up a bug report with your EGL driver if that is really 
necessary.

DMA-bufs shared using a file descriptor should be coherent when 
written/read from a GPU or other hardware device. What is possible is 
that you need to do something special for CPU access.

In other words, once a DMA-buf is imported and available as
handle/texture/image inside EGL it doesn't need to be flushed/synced
explicitly again.

Re-importing it adds quite a huge CPU overhead to both userspace as well 
as the kernel.

> However, in the particular case of Wayland, the shared dmabufs should
> be accounted to the Wayland client process. OOM-killing the client
> process will eventually free the dmabuf, also the Wayland server
> references to it. Killing the Wayland server (compositor, display
> server) OTOH is something that should not be done as long as there are
> e.g. Wayland clients to be killed.
>
> Unfortunately(?), Wayland clients do not have a reason to keep the
> dmabuf fd open themselves, so they probably close it as soon as it has
> been sent to a display server. So the process that should be OOM-killed
> does not have an open fd for the dmabuf (but probably has something
> else, but not an mmap for CPU). Buffer re-use in Wayland does not
> require re-sending the dmabuf fd over IPC.

Correct. The reference to your memory is kept inside the driver's GEM handle.

That's one reason why we said that this approach here doesn't take the
whole picture into account.

Regards,
Christian.

>
> (In general, dmabufs are never mmapped for CPU. They are accessed by
> devices.)
>
>
> Thanks,
> pq
Pekka Paalanen Jan. 29, 2021, 2:13 p.m. UTC | #13
On Fri, 29 Jan 2021 13:18:01 +0100
Christian König <christian.koenig@amd.com> wrote:

> Am 28.01.21 um 11:01 schrieb Pekka Paalanen:
> > On Wed, 27 Jan 2021 12:01:55 +0100
> > Christian König <christian.koenig@amd.com> wrote:
> >  
> >> Somewhat correct. This interface here really doesn't make sense since
> >> the file descriptor representation of DMA-buf is only meant to be used
> >> for short term usage.
> >>
> >> E.g. the idea is that you can export a DMA-buf fd from your device
> >> driver, transfer that to another process and then import it again into a
> >> device driver.
> >>
> >> Keeping a long term reference to a DMA-buf fd sounds like a design bug
> >> in userspace to me.  
> > Except keeping the fd is exactly what userspace must do if it wishes to
> > re-use the buffer without passing a new fd over IPC again. Particularly
> > Wayland compositors need to keep the client buffer dmabuf fd open after
> > receiving it, so that they can re-import it to EGL to ensure updated
> > contents are correctly flushed as EGL has no other API for it.  
> 
> Hui what??? I'm not that deep into the EGL specification, but from the 
> kernel side that is utterly nonsense.
> 
> Could be that re-importing triggers something in userspace, but this 
> sounds strongly like a workaround to me which shouldn't be necessary.

Hi,

there was a pretty long discussion about exactly this on #dri-devel
today, starting at 12:30:
https://people.freedesktop.org/~cbrill/dri-log/index.php?channel=dri-devel&highlight_names=&date=2021-01-29

The conclusion is that indeed, it is no longer necessary to re-import
to EGL all the time. It should be enough (at least with all Mesa
drivers) to call glEGLImageTargetTexture2DOES() every time you want to
ensure you get the updated buffer contents.

So now people across various projects are thinking how to keep the
EGLImage and not re-import on every update.
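
That pattern would look roughly like this (a sketch for a single-plane
buffer; eglCreateImageKHR and glEGLImageTargetTexture2DOES are assumed to
have been resolved via eglGetProcAddress, and dpy, tex, width, height,
fourcc, stride and dmabuf_fd come from the surrounding compositor code):

#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <GLES2/gl2.h>
#include <GLES2/gl2ext.h>

/* Import the client's dmabuf once, when the buffer is attached. */
static EGLImageKHR import_dmabuf(EGLDisplay dpy, int dmabuf_fd, EGLint width,
				 EGLint height, EGLint fourcc, EGLint stride)
{
	const EGLint attribs[] = {
		EGL_WIDTH, width,
		EGL_HEIGHT, height,
		EGL_LINUX_DRM_FOURCC_EXT, fourcc,
		EGL_DMA_BUF_PLANE0_FD_EXT, dmabuf_fd,
		EGL_DMA_BUF_PLANE0_OFFSET_EXT, 0,
		EGL_DMA_BUF_PLANE0_PITCH_EXT, stride,
		EGL_NONE
	};

	return eglCreateImageKHR(dpy, EGL_NO_CONTEXT,
				 EGL_LINUX_DMA_BUF_EXT, NULL, attribs);
}

/* Per frame: rebind the existing image instead of re-importing it. */
static void refresh_texture(GLuint tex, EGLImageKHR image)
{
	glBindTexture(GL_TEXTURE_2D, tex);
	glEGLImageTargetTexture2DOES(GL_TEXTURE_2D, (GLeglImageOES)image);
}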

> > That is my vague understanding, and what Weston implements. You can say
> > it's a bad userspace API design in EGL, but what else can we do?  
> 
> Please open up a bug report with your EGL driver if that is really 
> necessary.

Sure, I would hope there is no such case anymore.

> DMA-bufs shared using a file descriptor should be coherent when 
> written/read from a GPU or other hardware device. What is possible is 
> that you need to do something special for CPU access.
> 
> In other words once a DMA-buf is imported and available as 
> handle/texture/image inside EGL it doesn't need to be flushed/synced
> explicitly again.

There is still the case where some GL drivers sometimes need
to make a copy when turning the dmabuf into a GL texture [IRC log]. But
indeed, that seems not to be EGL's concern.

> 
> Re-importing it adds quite a huge CPU overhead to both userspace as well 
> as the kernel.

Perhaps, but so far it seems no-one has noticed the overhead, with Mesa
at least.

I happily stand corrected.


Thanks,
pq
Simon Ser Jan. 29, 2021, 2:17 p.m. UTC | #14
On Friday, January 29th, 2021 at 3:13 PM, Pekka Paalanen <ppaalanen@gmail.com> wrote:

> > Re-importing it adds quite a huge CPU overhead to both userspace as well
> > as the kernel.
>
> Perhaps, but so far it seems no-one has noticed the overhead, with Mesa
> at least.
>
> I happily stand corrected.

Note, all of this doesn't mean that compositors will stop keeping
DMA-BUF FDs around. They may want to keep them open for other purposes
like importing them into KMS or other EGL displays as needed.
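
A minimal sketch of the KMS case with libdrm, assuming a single-plane
buffer; drm_fd, dmabuf_fd, width, height, fourcc and stride are
placeholders:

#include <stdint.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

/* Wrap a kept dmabuf fd in a KMS framebuffer for scanout. */
static int dmabuf_to_kms_fb(int drm_fd, int dmabuf_fd, uint32_t width,
			    uint32_t height, uint32_t fourcc,
			    uint32_t stride, uint32_t *fb_id)
{
	uint32_t handles[4] = { 0 }, pitches[4] = { 0 }, offsets[4] = { 0 };
	int ret;

	/* Turn the dmabuf fd into a driver-local GEM handle. */
	ret = drmPrimeFDToHandle(drm_fd, dmabuf_fd, &handles[0]);
	if (ret)
		return ret;

	pitches[0] = stride;
	return drmModeAddFB2(drm_fd, width, height, fourcc,
			     handles, pitches, offsets, fb_id, 0);
}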
Christian König Jan. 29, 2021, 2:22 p.m. UTC | #15
Am 29.01.21 um 15:17 schrieb Simon Ser:
> On Friday, January 29th, 2021 at 3:13 PM, Pekka Paalanen <ppaalanen@gmail.com> wrote:
>
>>> Re-importing it adds quite a huge CPU overhead to both userspace as well
>>> as the kernel.
>> Perhaps, but so far it seems no-one has noticed the overhead, with Mesa
>> at least.
>>
>> I happily stand corrected.
> Note, all of this doesn't mean that compositors will stop keeping
> DMA-BUF FDs around. They may want to keep them open for other purposes
> like importing them into KMS or other EGL displays as needed.

Correct and that's a perfectly valid use case. Just re-importing it on 
every frame is something we should really try to avoid.

At least with debugging enabled it's a massive overhead, and maybe even 
a performance penalty when we have to re-create device page tables all 
the time.

But thinking more about it, it is possible that we short-cut this step 
as long as the original import is still referenced. Otherwise we 
probably would have noticed this much earlier.

Regards,
Christian.
Daniel Vetter Feb. 3, 2021, 10:23 a.m. UTC | #16
On Fri, Jan 29, 2021 at 03:22:06PM +0100, Christian König wrote:
> Am 29.01.21 um 15:17 schrieb Simon Ser:
> > On Friday, January 29th, 2021 at 3:13 PM, Pekka Paalanen <ppaalanen@gmail.com> wrote:
> > 
> > > > Re-importing it adds quite a huge CPU overhead to both userspace as well
> > > > as the kernel.
> > > Perhaps, but so far it seems no-one has noticed the overhead, with Mesa
> > > at least.
> > > 
> > > I happily stand corrected.
> > Note, all of this doesn't mean that compositors will stop keeping
> > DMA-BUF FDs around. They may want to keep them open for other purposes
> > like importing them into KMS or other EGL displays as needed.
> 
> Correct and that's a perfectly valid use case. Just re-importing it on every
> frame is something we should really try to avoid.
> 
> At least with debugging enabled it's a massive overhead, and maybe even a
> performance penalty when we have to re-create device page tables all the
> time.
> 
> But thinking more about it, it is possible that we short-cut this step as
> long as the original import is still referenced. Otherwise we probably
> would have noticed this much earlier.

Yeah, the kernel keeps lots of caches around and just gives you back the
previous buffer if it's still around. Still, probably not the smartest
idea.
-Daniel
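
For completeness, a sketch of the userspace accounting flow this series
enables: read the inode list from the new procfs file, then the
per-buffer size from the sysfs stats interface [1]. The "size" attribute
name is an assumption based on that series, and since repeated inodes
mean multiple FD references to the same buffer, a real tool would
de-duplicate before summing:

#include <stdio.h>

/* Sum the sizes of all dmabufs a thread holds FDs to. Duplicate inodes
 * are summed here for brevity; they represent extra FD references, not
 * extra memory, so a real tool would de-duplicate them first. */
static unsigned long dmabuf_bytes(int pid, int tid)
{
	char path[96];
	unsigned long ino, size, total = 0;
	FILE *f, *s;

	snprintf(path, sizeof(path),
		 "/proc/%d/task/%d/dmabuf_fds", pid, tid);
	f = fopen(path, "r");
	if (!f)
		return 0;

	while (fscanf(f, "%lu", &ino) == 1) {
		snprintf(path, sizeof(path),
			 "/sys/kernel/dmabuf/buffers/%lu/size", ino);
		s = fopen(path, "r");
		if (!s)
			continue;
		if (fscanf(s, "%lu", &size) == 1)
			total += size;
		fclose(s);
	}
	fclose(f);

	return total;
}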
diff mbox series

Patch

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 2fa69f710e2a..757dd47ab679 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -47,6 +47,7 @@  fixes/update part 1.1  Stefani Seibold <stefani@seibold.net>    June 9 2009
   3.10  /proc/<pid>/timerslack_ns - Task timerslack value
   3.11	/proc/<pid>/patch_state - Livepatch patch operation state
   3.12	/proc/<pid>/arch_status - Task architecture specific information
+  3.13	/proc/<pid>/task/<tid>/dmabuf_fds - DMA buffers referenced by an FD
 
   4	Configuring procfs
   4.1	Mount options
@@ -2131,6 +2132,35 @@  AVX512_elapsed_ms
   the task is unlikely an AVX512 user, but depends on the workload and the
   scheduling scenario, it also could be a false negative mentioned above.
 
+3.13 /proc/<pid>/task/<tid>/dmabuf_fds - DMA buffers referenced by an FD
+-------------------------------------------------------------------------
+This file exposes a list of the inode numbers for every DMA buffer
+FD that the task has.
+
+The same inode number can appear more than once, indicating the total
+FD references for the associated DMA buffer.
+
+The inode number can be used to look up the DMA buffer information in
+the sysfs interface /sys/kernel/dmabuf/buffers/<inode-no>/.
+
+Example Output
+~~~~~~~~~~~~~~
+$ cat /proc/612/task/612/dmabuf_fds
+30972 30973 45678 49326
+
+Permission to access this file is governed by a ptrace access mode
+PTRACE_MODE_READ_FSCREDS check.
+
+Threads can have distinct file descriptor tables when created without
+the CLONE_FILES flag. For this reason the interface is presented as
+/proc/<pid>/task/<tid>/dmabuf_fds and not /proc/<pid>/dmabuf_fds.
+This simplifies the kernel code; aggregation can be handled in
+userspace.
+
+If a thread shares its file descriptor table with its group leader,
+then its dmabuf_fds file will be empty, as these dmabufs are already
+reported by the group leader.
+
 Chapter 4: Configuring procfs
 =============================
 
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 9ad6397aaa97..0660c06be4c6 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -29,8 +29,6 @@ 
 #include <uapi/linux/dma-buf.h>
 #include <uapi/linux/magic.h>
 
-static inline int is_dma_buf_file(struct file *);
-
 struct dma_buf_list {
 	struct list_head head;
 	struct mutex lock;
@@ -434,10 +432,7 @@  static const struct file_operations dma_buf_fops = {
 	.show_fdinfo	= dma_buf_show_fdinfo,
 };
 
-/*
- * is_dma_buf_file - Check if struct file* is associated with dma_buf
- */
-static inline int is_dma_buf_file(struct file *file)
+int is_dma_buf_file(struct file *file)
 {
 	return file->f_op == &dma_buf_fops;
 }
diff --git a/fs/proc/Makefile b/fs/proc/Makefile
index bd08616ed8ba..91a67f43ddf4 100644
--- a/fs/proc/Makefile
+++ b/fs/proc/Makefile
@@ -16,6 +16,7 @@  proc-y	+= cmdline.o
 proc-y	+= consoles.o
 proc-y	+= cpuinfo.o
 proc-y	+= devices.o
+proc-y	+= dma_bufs.o
 proc-y	+= interrupts.o
 proc-y	+= loadavg.o
 proc-y	+= meminfo.o
diff --git a/fs/proc/base.c b/fs/proc/base.c
index b3422cda2a91..af15a60b9831 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3598,6 +3598,7 @@  static const struct pid_entry tid_base_stuff[] = {
 #ifdef CONFIG_SECCOMP_CACHE_DEBUG
 	ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache),
 #endif
+	REG("dmabuf_fds", 0444, proc_tid_dmabuf_fds_operations),
 };
 
 static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx)
diff --git a/fs/proc/dma_bufs.c b/fs/proc/dma_bufs.c
new file mode 100644
index 000000000000..46ea9cf968ed
--- /dev/null
+++ b/fs/proc/dma_bufs.c
@@ -0,0 +1,159 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Per-process DMA-BUF Stats
+ *
+ * Copyright (C) 2021 Google LLC.
+ */
+
+#include <linux/dma-buf.h>
+#include <linux/fdtable.h>
+#include <linux/ptrace.h>
+#include <linux/seq_file.h>
+
+#include "internal.h"
+
+struct dmabuf_fds_private {
+	struct inode *inode;
+	struct task_struct *task;
+	struct file *dmabuf_file;
+};
+
+static loff_t *next_dmabuf(struct dmabuf_fds_private *priv,
+		loff_t *pos)
+{
+	struct fdtable *fdt;
+	struct file *file;
+
+	rcu_read_lock();
+	fdt = files_fdtable(priv->task->files);
+	for (; *pos < fdt->max_fds; ++*pos) {
+		file = files_lookup_fd_rcu(priv->task->files, (unsigned int) *pos);
+		if (file && is_dma_buf_file(file) && get_file_rcu(file)) {
+			priv->dmabuf_file = file;
+			break;
+		}
+	}
+	if (*pos >= fdt->max_fds)
+		pos = NULL;
+	rcu_read_unlock();
+
+	return pos;
+}
+
+static void *dmabuf_fds_seq_start(struct seq_file *s, loff_t *pos)
+{
+	struct dmabuf_fds_private *priv = s->private;
+	struct files_struct *group_leader_files;
+
+	priv->task = get_proc_task(priv->inode);
+
+	if (!priv->task)
+		return ERR_PTR(-ESRCH);
+
+	/* Hold task lock for duration that files need to be stable */
+	task_lock(priv->task);
+
+	/*
+	 * If this task is not the group leader but shares its files, leave the file empty.
+	 * These dmabufs are already reported in the group leader's dmabuf_fds.
+	 */
+	group_leader_files = priv->task->group_leader->files;
+	if (priv->task != priv->task->group_leader && priv->task->files == group_leader_files) {
+		task_unlock(priv->task);
+		put_task_struct(priv->task);
+		priv->task = NULL;
+		return NULL;
+	}
+
+	return next_dmabuf(priv, pos);
+}
+
+static void *dmabuf_fds_seq_next(struct seq_file *s, void *v, loff_t *pos)
+{
+	++*pos;
+	return next_dmabuf(s->private, pos);
+}
+
+static void dmabuf_fds_seq_stop(struct seq_file *s, void *v)
+{
+	struct dmabuf_fds_private *priv = s->private;
+
+	if (priv->task) {
+		task_unlock(priv->task);
+		put_task_struct(priv->task);
+	}
+
+	if (priv->dmabuf_file)
+		fput(priv->dmabuf_file);
+}
+
+static int dmabuf_fds_seq_show(struct seq_file *s, void *v)
+{
+	struct dmabuf_fds_private *priv = s->private;
+	struct file *file = priv->dmabuf_file;
+	struct dma_buf *dmabuf = file->private_data;
+
+	if (!dmabuf)
+		return -ESRCH;
+
+	seq_printf(s, "%8lu ", file_inode(file)->i_ino);
+
+	fput(priv->dmabuf_file);
+	priv->dmabuf_file = NULL;
+
+	return 0;
+}
+
+static const struct seq_operations proc_tid_dmabuf_fds_seq_ops = {
+	.start = dmabuf_fds_seq_start,
+	.next  = dmabuf_fds_seq_next,
+	.stop  = dmabuf_fds_seq_stop,
+	.show  = dmabuf_fds_seq_show
+};
+
+static int proc_dmabuf_fds_open(struct inode *inode, struct file *file,
+		     const struct seq_operations *ops)
+{
+	struct dmabuf_fds_private *priv;
+	struct task_struct *task;
+	bool allowed = false;
+
+	task = get_proc_task(inode);
+	if (!task)
+		return -ESRCH;
+
+	allowed = ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
+	put_task_struct(task);
+
+	if (!allowed)
+		return -EACCES;
+
+	priv = __seq_open_private(file, ops, sizeof(*priv));
+	if (!priv)
+		return -ENOMEM;
+
+	priv->inode = inode;
+	priv->task = NULL;
+	priv->dmabuf_file = NULL;
+
+	return 0;
+}
+
+static int proc_dmabuf_fds_release(struct inode *inode, struct file *file)
+{
+	return seq_release_private(inode, file);
+}
+
+static int tid_dmabuf_fds_open(struct inode *inode, struct file *file)
+{
+	return proc_dmabuf_fds_open(inode, file,
+			&proc_tid_dmabuf_fds_seq_ops);
+}
+
+const struct file_operations proc_tid_dmabuf_fds_operations = {
+	.open		= tid_dmabuf_fds_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= proc_dmabuf_fds_release,
+};
+
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index f60b379dcdc7..4ca74220db9c 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -303,6 +303,7 @@  extern const struct file_operations proc_pid_smaps_operations;
 extern const struct file_operations proc_pid_smaps_rollup_operations;
 extern const struct file_operations proc_clear_refs_operations;
 extern const struct file_operations proc_pagemap_operations;
+extern const struct file_operations proc_tid_dmabuf_fds_operations;
 
 extern unsigned long task_vsize(struct mm_struct *);
 extern unsigned long task_statm(struct mm_struct *,
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index cf72699cb2bc..087e11f7f193 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -27,6 +27,11 @@  struct device;
 struct dma_buf;
 struct dma_buf_attachment;
 
+/**
+ * is_dma_buf_file - check if a struct file is associated with a dma_buf
+ */
+int is_dma_buf_file(struct file *file);
+
 /**
  * struct dma_buf_ops - operations possible on struct dma_buf
  * @vmap: [optional] creates a virtual mapping for the buffer into kernel