From patchwork Wed Apr 20 23:52:19 2022
From: "T.J. Mercier" <tjmercier@google.com>
Date: Wed, 20 Apr 2022 23:52:19 +0000
Message-Id: <20220420235228.2767816-2-tjmercier@google.com>
In-Reply-To: <20220420235228.2767816-1-tjmercier@google.com>
References: <20220420235228.2767816-1-tjmercier@google.com>
Subject: [RFC v5 1/6] gpu: rfc: Proposal for a GPU cgroup controller
Mercier" To: tjmercier@google.com, daniel@ffwll.ch, tj@kernel.org, Maarten Lankhorst , Maxime Ripard , Thomas Zimmermann , David Airlie , Jonathan Corbet X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: kernel-team@android.com, tkjos@android.com, linux-doc@vger.kernel.org, Kenny.Ho@amd.com, skhan@linuxfoundation.org, cmllamas@google.com, dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org, jstultz@google.com, kaleshsingh@google.com, hridya@google.com, mkoutny@suse.com, surenb@google.com, christian.koenig@amd.com Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: Hridya Valsaraju This patch adds a proposal for a new GPU cgroup controller for accounting/limiting GPU and GPU-related memory allocations. The proposed controller is based on the DRM cgroup controller[1] and follows the design of the RDMA cgroup controller. The new cgroup controller would: * Allow setting per-device limits on the total size of buffers allocated by device within a cgroup. * Expose a per-device/allocator breakdown of the buffers charged to a cgroup. The prototype in the following patches is only for memory accounting using the GPU cgroup controller and does not implement limit setting. [1]: https://lore.kernel.org/amd-gfx/20210126214626.16260-1-brian.welty@intel.com/ Signed-off-by: Hridya Valsaraju Signed-off-by: T.J. Mercier --- v5 changes Drop the global GPU cgroup "total" (sum of all device totals) portion of the design since there is no currently known use for this per Tejun Heo. Update for renamed functions/variables. v3 changes Remove Upstreaming Plan from gpu-cgroup.rst per John Stultz. Use more common dual author commit message format per John Stultz. --- Documentation/gpu/rfc/gpu-cgroup.rst | 190 +++++++++++++++++++++++++++ Documentation/gpu/rfc/index.rst | 4 + 2 files changed, 194 insertions(+) create mode 100644 Documentation/gpu/rfc/gpu-cgroup.rst diff --git a/Documentation/gpu/rfc/gpu-cgroup.rst b/Documentation/gpu/rfc/gpu-cgroup.rst new file mode 100644 index 000000000000..0be2a3a9f641 --- /dev/null +++ b/Documentation/gpu/rfc/gpu-cgroup.rst @@ -0,0 +1,190 @@ +=================================== +GPU cgroup controller +=================================== + +Goals +===== +This document intends to outline a plan to create a cgroup v2 controller subsystem +for the per-cgroup accounting of device and system memory allocated by the GPU +and related subsystems. + +The new cgroup controller would: + +* Allow setting per-device limits on the total size of buffers allocated by a + device/allocator within a cgroup. + +* Expose a per-device/allocator breakdown of the buffers charged to a cgroup. + +Alternatives Considered +======================= + +The following alternatives were considered: + +The memory cgroup controller +____________________________ + +1. As was noted in [1], memory accounting provided by the GPU cgroup +controller is not a good fit for integration into memcg due to the +differences in how accounting is performed. It implements a mechanism +for the allocator attribution of GPU and GPU-related memory by +charging each buffer to the cgroup of the process on behalf of which +the memory was allocated. The buffer stays charged to the cgroup until +it is freed regardless of whether the process retains any references +to it. 
+On the other hand, the memory cgroup controller offers a more
+fine-grained charging and uncharging behavior depending on the kind of
+page being accounted.
+
+2. Memcg performs accounting in units of pages. In the DMA-BUF buffer
+sharing model, a process takes a reference to the entire buffer (hence
+keeping it alive) even if it is only accessing parts of it. Therefore,
+per-page memory tracking for DMA-BUF memory accounting would only
+introduce additional overhead without any benefits.
+
+[1]: https://patchwork.kernel.org/project/dri-devel/cover/20190501140438.9506-1-brian.welty@intel.com/#22624705
+
+Userspace service to keep track of buffer allocations and releases
+__________________________________________________________________
+
+1. There is no way for a userspace service to intercept all allocations
+and releases.
+
+2. In case the process gets killed or restarted, all accounting so far
+is lost.
+
+UAPI
+====
+When enabled, the new cgroup controller would create the following files
+in every cgroup.
+
+::
+
+    gpu.memory.current (R)
+    gpu.memory.max (R/W)
+
+gpu.memory.current is a read-only file that would contain per-device memory
+allocations in a key-value format, where the key is a string representing the
+device name and the value is the size of memory charged to the device in the
+cgroup, in bytes. The device name should be globally unique.
+
+For example:
+
+::
+
+    cat /sys/fs/cgroup/cgroup1/gpu.memory.current
+    dev1 4194304
+    dev2 4194304
+
+The string key for each device is set by the device driver when the device
+registers with the GPU cgroup controller to participate in resource
+accounting (see section 'Design and Implementation' for more details).
+
+gpu.memory.max is a read/write file. It would show the current size limits on
+memory usage for each allocator/device.
+
+Setting a limit for a particular device/allocator can be done as follows:
+
+::
+
+    echo "dev1 4194304" > /sys/fs/cgroup/cgroup1/gpu.memory.max
+
+In this example, 'dev1' is the string key set by the device driver during
+registration.
+
+Design and Implementation
+=========================
+
+The cgroup controller would closely follow the design of the RDMA cgroup
+controller subsystem, where each cgroup maintains a list of resource pools.
+Each resource pool is associated with a device name via a pointer to a
+struct gpucg_bucket, and contains a counter to track the current total
+charged to the device and the maximum limit set for it.
+
+The code block below is a preliminary estimation of what the core kernel
+data structures and APIs could look like.
+
+.. code-block:: c
+
+    /* The GPU cgroup controller data structure */
+    struct gpucg {
+        struct cgroup_subsys_state css;
+
+        /* list of all resource pools that belong to this cgroup */
+        struct list_head rpools;
+    };
+
+    /* A named entity representing a bucket of tracked memory. */
+    struct gpucg_bucket {
+        /* list of various resource pools in various cgroups that the bucket is part of */
+        struct list_head rpools;
+
+        /* list of all buckets registered for GPU cgroup accounting */
+        struct list_head bucket_node;
+
+        /* string to be used as identifier for accounting and limit setting */
+        const char *name;
+    };
+
+    struct gpucg_resource_pool {
+        /* The bucket whose resource usage is tracked by this resource pool */
+        struct gpucg_bucket *bucket;
+
+        /* list of all resource pools for the cgroup */
+        struct list_head cg_node;
+
+        /* list maintained by the gpucg_bucket to keep track of its resource pools */
+        struct list_head bucket_node;
+
+        /* tracks memory usage of the resource pool */
+        struct page_counter total;
+    };
+
+    /**
+     * gpucg_register_bucket - Registers a bucket for memory accounting using
+     * the GPU cgroup controller.
+     *
+     * @bucket: The bucket to register for memory accounting.
+     * @name: Pointer to a null-terminated string to denote the name of the
+     * bucket. This name should be globally unique, and should not exceed
+     * @GPUCG_BUCKET_NAME_MAX_LEN bytes.
+     *
+     * @bucket must remain valid. @name will be copied.
+     *
+     * Returns 0 on success, or a negative errno code otherwise.
+     */
+    int gpucg_register_bucket(struct gpucg_bucket *bucket, const char *name);
+
+    /**
+     * gpucg_charge - charge memory to the specified gpucg and gpucg_bucket.
+     *
+     * @gpucg: The gpu cgroup to charge the memory to.
+     * @bucket: The bucket to charge the memory to.
+     * @size: The size of memory to charge in bytes.
+     *        This size will be rounded up to the nearest page size.
+     *
+     * Return: returns 0 if the charging is successful and otherwise returns
+     * an error code.
+     */
+    int gpucg_charge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size);
+
+    /**
+     * gpucg_uncharge - uncharge memory from the specified gpucg and
+     * gpucg_bucket. The caller must hold a reference to @gpucg obtained
+     * through gpucg_get().
+     *
+     * @gpucg: The gpu cgroup to uncharge the memory from.
+     * @bucket: The bucket to uncharge the memory from.
+     * @size: The size of memory to uncharge in bytes.
+     *        This size will be rounded up to the nearest page size.
+     */
+    void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size);
+
+    /**
+     * gpucg_transfer_charge - Transfer a GPU charge from one cgroup to
+     * another.
+     *
+     * @source: [in] The GPU cgroup the charge will be transferred from.
+     * @dest:   [in] The GPU cgroup the charge will be transferred to.
+     * @bucket: [in] The GPU cgroup bucket corresponding to the charge.
+     * @size:   [in] The size of the memory in bytes.
+     *          This size will be rounded up to the nearest page size.
+     *
+     * Returns 0 on success, or a negative errno code otherwise.
+     */
+    int gpucg_transfer_charge(struct gpucg *source,
+                              struct gpucg *dest,
+                              struct gpucg_bucket *bucket,
+                              u64 size);
+
+Future Work
+===========
+Additional GPU resources can be supported by adding new controller files.
diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
index 91e93a705230..0a9bcd94e95d 100644
--- a/Documentation/gpu/rfc/index.rst
+++ b/Documentation/gpu/rfc/index.rst
@@ -23,3 +23,7 @@ host such documentation:
 .. toctree::
 
     i915_scheduler.rst
+
+.. toctree::
+
+    gpu-cgroup.rst
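To make the proposed API concrete, here is a minimal sketch (not part of the
series) of how an exporting driver might use it. The device name, callbacks,
and structure names are illustrative only, and gpucg_get()/gpucg_put() are
assumed to behave as they do in the system_heap patch later in this series:

  /*
   * Hypothetical exporter sketch, assuming the gpucg_* signatures proposed
   * in gpu-cgroup.rst. "example-device" and mydev_* are illustrative.
   */
  #include <linux/cgroup_gpu.h>

  static struct gpucg_bucket mydev_bucket;

  static int mydev_init(void)
  {
          /* Register once at init; the name is copied by the controller. */
          return gpucg_register_bucket(&mydev_bucket, "example-device");
  }

  static int mydev_alloc_buffer(struct gpucg **out_cg, size_t len)
  {
          /* Charge the allocating process's cgroup before allocating. */
          struct gpucg *cg = gpucg_get(current);
          int ret = gpucg_charge(cg, &mydev_bucket, len);

          if (ret) {
                  gpucg_put(cg);
                  return ret;
          }

          /* ... allocate the backing memory; uncharge on failure ... */
          *out_cg = cg;
          return 0;
  }

  static void mydev_free_buffer(struct gpucg *cg, size_t len)
  {
          /* Uncharge on free, then drop the reference taken at charge time. */
          gpucg_uncharge(cg, &mydev_bucket, len);
          gpucg_put(cg);
  }

Charging against the allocating task's cgroup at allocation time matches the
charge-on-allocation choice described in patch 3/6 below.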
Mercier" X-Patchwork-Id: 12820940 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 613A4C433F5 for ; Wed, 20 Apr 2022 23:52:46 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id B04F010ED91; Wed, 20 Apr 2022 23:52:45 +0000 (UTC) Received: from mail-yw1-x1149.google.com (mail-yw1-x1149.google.com [IPv6:2607:f8b0:4864:20::1149]) by gabe.freedesktop.org (Postfix) with ESMTPS id CB74410ED91 for ; Wed, 20 Apr 2022 23:52:44 +0000 (UTC) Received: by mail-yw1-x1149.google.com with SMTP id 00721157ae682-2dc7bdd666fso29288217b3.7 for ; Wed, 20 Apr 2022 16:52:44 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc:content-transfer-encoding; bh=5E9Rbs7lnkN3+W98JnkIUAPTQ49TUyPznJB6J7YNkeo=; b=SWzxIRgdRQUtqcnezZ1qAW7U1kX7+57LPNNbfRy88rcYumm4IA2OJuYgyyg7UDgn4q MCDVqH2XccdjAMgVUESHKUP693m4h0vVR930REBPdZ6wch5vagUCuVlCO3pq1G/WttSm AKvzWT9Nlv5PQyonszFAMeFGYDERNSYU+Ih9zKKj6eBos5pKLMLliVS1pcZcxSRG7Zrn Oqa7NQcw9kxxBDKs5sWyq9NwhYkTbsb6IZoi2/ioWxpbEzVBB21ZKhOvAmxpG8APKTEc XJCoqM2RQuc7J0nxY6KCv4Zd+OgEpsgHqAse8y3xvopI5/qJepc4wtE8OIQeY+rpy9yg tZlw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc:content-transfer-encoding; bh=5E9Rbs7lnkN3+W98JnkIUAPTQ49TUyPznJB6J7YNkeo=; b=Yy00J5DQ24ZPb/XqfFW2NZU7F/rOUSjcbblInqkqxtfkCDNTeuFZ6Zuo4tDVMiO4KG NEHHxmyXT2S+BveYjw68/DDo0BriVG4tr/DUbXoDQQt/7soq1l8uCMU6USGwJANdr6kp GLXLLUJdq6bqC8hGMMDPYzO/FUDv+qpCZ8DQiuvYaxl337uXUFQWddsCQKZE1U8A9//v FIjrIrFqESJJszOELZ/OmC240PMi5uuels7dXg95xKB4ccTHwSVN9WyugaE0nn9S8Uda 5lTjxJ+gOScTLqxKMCA1UB6vlQ2cEJpBJXyxijHy1qGB/RBBJ0Bu6YK3A5ZthGIOsVPW Qe3w== X-Gm-Message-State: AOAM532r/ADoUmNbR2aK5sRF/l8Ev5WCiTa3EKpCnjg03IukUvbWTHMi uzQi8qokBraLbkd6K1Bi8mQrAwCGkmQ7JzE= X-Google-Smtp-Source: ABdhPJw5EwHTC7Q1N41+mc2GzL5YAYNhrKpjL8DBLEBrLVaVg9GO5bS/iutUfeBK1JN69nQVWaRhMs4VIs/z4os= X-Received: from tj.c.googlers.com ([fda3:e722:ac3:cc00:20:ed76:c0a8:53a]) (user=tjmercier job=sendgmr) by 2002:a25:7795:0:b0:645:682a:d56e with SMTP id s143-20020a257795000000b00645682ad56emr3285177ybc.403.1650498763943; Wed, 20 Apr 2022 16:52:43 -0700 (PDT) Date: Wed, 20 Apr 2022 23:52:21 +0000 In-Reply-To: <20220420235228.2767816-1-tjmercier@google.com> Message-Id: <20220420235228.2767816-4-tjmercier@google.com> Mime-Version: 1.0 References: <20220420235228.2767816-1-tjmercier@google.com> X-Mailer: git-send-email 2.36.0.rc0.470.gd361397f0d-goog Subject: [RFC v5 3/6] dmabuf: heaps: export system_heap buffers with GPU cgroup charging From: "T.J. 
Mercier" To: tjmercier@google.com, daniel@ffwll.ch, tj@kernel.org, Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , Benjamin Gaignard , Liam Mark , Laura Abbott , Brian Starkey , John Stultz X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: kernel-team@android.com, tkjos@android.com, Kenny.Ho@amd.com, skhan@linuxfoundation.org, cmllamas@google.com, dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org, linaro-mm-sig@lists.linaro.org, jstultz@google.com, kaleshsingh@google.com, hridya@google.com, mkoutny@suse.com, surenb@google.com, linux-media@vger.kernel.org Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" All DMA heaps now register a new GPU cgroup bucket upon creation, and the system_heap now exports buffers associated with its GPU cgroup bucket for tracking purposes. In order to support GPU cgroup charge transfer on a dma-buf, the current GPU cgroup information must be stored inside the dma-buf struct. For tracked buffers, exporters include the struct gpucg and struct gpucg_bucket pointers in the export info which can later be modified if the charge is migrated to another cgroup. Signed-off-by: Hridya Valsaraju Signed-off-by: T.J. Mercier --- v5 changes Merge dmabuf: Use the GPU cgroup charge/uncharge APIs into this patch. Remove all GPU cgroup code from dma-buf except what's necessary to support charge transfer. Previously charging was done in export, but for non-Android graphics use-cases this is not ideal since there may be a dealy between allocation and export, during which time there is no accounting. Append "-heap" to gpucg_bucket names. Charge on allocation instead of export. This should more closely mirror non-Android use-cases where there is potentially a delay between allocation and export. Put the charge and uncharge code in the same file (system_heap_allocate, system_heap_dma_buf_release) instead of splitting them between the heap and the dma_buf_release. Move no-op code to header file to match other files in the series. v3 changes Use more common dual author commit message format per John Stultz. v2 changes Move dma-buf cgroup charge transfer from a dma_buf_op defined by every heap to a single dma-buf function for all heaps per Daniel Vetter and Christian König. --- drivers/dma-buf/dma-buf.c | 19 +++++++++++++ drivers/dma-buf/dma-heap.c | 39 +++++++++++++++++++++++++++ drivers/dma-buf/heaps/system_heap.c | 28 +++++++++++++++++--- include/linux/dma-buf.h | 41 +++++++++++++++++++++++------ include/linux/dma-heap.h | 15 +++++++++++ 5 files changed, 130 insertions(+), 12 deletions(-) diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index df23239b04fc..bc89c44bd9b9 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -462,6 +462,24 @@ static struct file *dma_buf_getfile(struct dma_buf *dmabuf, int flags) * &dma_buf_ops. 
  */
 
+#ifdef CONFIG_CGROUP_GPU
+static void dma_buf_set_gpucg(struct dma_buf *dmabuf, const struct dma_buf_export_info *exp)
+{
+        dmabuf->gpucg = exp->gpucg;
+        dmabuf->gpucg_bucket = exp->gpucg_bucket;
+}
+
+void dma_buf_exp_info_set_gpucg(struct dma_buf_export_info *exp_info,
+                                struct gpucg *gpucg,
+                                struct gpucg_bucket *gpucg_bucket)
+{
+        exp_info->gpucg = gpucg;
+        exp_info->gpucg_bucket = gpucg_bucket;
+}
+#else
+static void dma_buf_set_gpucg(struct dma_buf *dmabuf, const struct dma_buf_export_info *exp) {}
+#endif
+
 /**
  * dma_buf_export - Creates a new dma_buf, and associates an anon file
  * with this buffer, so it can be exported.
@@ -527,6 +545,7 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
         init_waitqueue_head(&dmabuf->poll);
         dmabuf->cb_in.poll = dmabuf->cb_out.poll = &dmabuf->poll;
         dmabuf->cb_in.active = dmabuf->cb_out.active = 0;
+        dma_buf_set_gpucg(dmabuf, exp_info);
 
         if (!resv) {
                 resv = (struct dma_resv *)&dmabuf[1];
diff --git a/drivers/dma-buf/dma-heap.c b/drivers/dma-buf/dma-heap.c
index 8f5848aa144f..b81015548314 100644
--- a/drivers/dma-buf/dma-heap.c
+++ b/drivers/dma-buf/dma-heap.c
@@ -7,10 +7,12 @@
  */
 
 #include <linux/cdev.h>
+#include <linux/cgroup_gpu.h>
 #include <linux/debugfs.h>
 #include <linux/device.h>
 #include <linux/dma-buf.h>
 #include <linux/err.h>
+#include <linux/kernel.h>
 #include <linux/xarray.h>
 #include <linux/list.h>
 #include <linux/slab.h>
@@ -21,6 +23,7 @@
 #include <uapi/linux/dma-heap.h>
 
 #define DEVNAME "dma_heap"
+#define HEAP_NAME_SUFFIX "-heap"
 
 #define NUM_HEAP_MINORS 128
 
@@ -31,6 +34,7 @@
  * @heap_devt		heap device node
  * @list		list head connecting to list of heaps
  * @heap_cdev		heap char device
+ * @gpucg_bucket	gpu cgroup bucket for memory accounting
  *
  * Represents a heap of memory from which buffers can be made.
  */
@@ -41,6 +45,9 @@ struct dma_heap {
         dev_t heap_devt;
         struct list_head list;
         struct cdev heap_cdev;
+#ifdef CONFIG_CGROUP_GPU
+        struct gpucg_bucket gpucg_bucket;
+#endif
 };
 
 static LIST_HEAD(heap_list);
@@ -216,6 +223,19 @@ const char *dma_heap_get_name(struct dma_heap *heap)
         return heap->name;
 }
 
+/**
+ * dma_heap_get_gpucg_bucket() - get struct gpucg_bucket for the heap.
+ * @heap: DMA-Heap to get the gpucg_bucket struct for.
+ *
+ * Returns:
+ * The gpucg_bucket struct for the heap. NULL if the GPU cgroup controller is
+ * not enabled.
+ */
+struct gpucg_bucket *dma_heap_get_gpucg_bucket(struct dma_heap *heap)
+{
+        return &heap->gpucg_bucket;
+}
+
 struct dma_heap *dma_heap_add(const struct dma_heap_export_info *exp_info)
 {
         struct dma_heap *heap, *h, *err_ret;
@@ -228,6 +248,12 @@ struct dma_heap *dma_heap_add(const struct dma_heap_export_info *exp_info)
                 return ERR_PTR(-EINVAL);
         }
 
+        if (IS_ENABLED(CONFIG_CGROUP_GPU) &&
+            strlen(exp_info->name) + strlen(HEAP_NAME_SUFFIX) >= GPUCG_BUCKET_NAME_MAX_LEN) {
+                pr_err("dma_heap: Name is too long for GPU cgroup\n");
+                return ERR_PTR(-ENAMETOOLONG);
+        }
+
         if (!exp_info->ops || !exp_info->ops->allocate) {
                 pr_err("dma_heap: Cannot add heap with invalid ops struct\n");
                 return ERR_PTR(-EINVAL);
@@ -253,6 +279,19 @@ struct dma_heap *dma_heap_add(const struct dma_heap_export_info *exp_info)
         heap->ops = exp_info->ops;
         heap->priv = exp_info->priv;
 
+        if (IS_ENABLED(CONFIG_CGROUP_GPU)) {
+                char gpucg_bucket_name[GPUCG_BUCKET_NAME_MAX_LEN];
+
+                snprintf(gpucg_bucket_name, sizeof(gpucg_bucket_name), "%s%s",
+                         exp_info->name, HEAP_NAME_SUFFIX);
+
+                ret = gpucg_register_bucket(dma_heap_get_gpucg_bucket(heap), gpucg_bucket_name);
+                if (ret < 0) {
+                        err_ret = ERR_PTR(ret);
+                        goto err0;
+                }
+        }
+
         /* Find unused minor number */
         ret = xa_alloc(&dma_heap_minors, &minor, heap,
                        XA_LIMIT(0, NUM_HEAP_MINORS - 1), GFP_KERNEL);
diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
index fcf836ba9c1f..27f686faef00 100644
--- a/drivers/dma-buf/heaps/system_heap.c
+++ b/drivers/dma-buf/heaps/system_heap.c
@@ -297,6 +297,11 @@ static void system_heap_dma_buf_release(struct dma_buf *dmabuf)
         }
         sg_free_table(table);
         kfree(buffer);
+
+        if (dmabuf->gpucg && dmabuf->gpucg_bucket) {
+                gpucg_uncharge(dmabuf->gpucg, dmabuf->gpucg_bucket, dmabuf->size);
+                gpucg_put(dmabuf->gpucg);
+        }
 }
 
 static const struct dma_buf_ops system_heap_buf_ops = {
@@ -346,11 +351,21 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
         struct scatterlist *sg;
         struct list_head pages;
         struct page *page, *tmp_page;
-        int i, ret = -ENOMEM;
+        struct gpucg *gpucg;
+        struct gpucg_bucket *gpucg_bucket;
+        int i, ret;
+
+        gpucg = gpucg_get(current);
+        gpucg_bucket = dma_heap_get_gpucg_bucket(heap);
+        ret = gpucg_charge(gpucg, gpucg_bucket, len);
+        if (ret)
+                goto put_gpucg;
 
         buffer = kzalloc(sizeof(*buffer), GFP_KERNEL);
-        if (!buffer)
-                return ERR_PTR(-ENOMEM);
+        if (!buffer) {
+                ret = -ENOMEM;
+                goto uncharge_gpucg;
+        }
 
         INIT_LIST_HEAD(&buffer->attachments);
         mutex_init(&buffer->lock);
@@ -396,6 +411,8 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
         exp_info.size = buffer->len;
         exp_info.flags = fd_flags;
         exp_info.priv = buffer;
+        dma_buf_exp_info_set_gpucg(&exp_info, gpucg, gpucg_bucket);
+
         dmabuf = dma_buf_export(&exp_info);
         if (IS_ERR(dmabuf)) {
                 ret = PTR_ERR(dmabuf);
@@ -414,7 +431,10 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
         list_for_each_entry_safe(page, tmp_page, &pages, lru)
                 __free_pages(page, compound_order(page));
         kfree(buffer);
-
+uncharge_gpucg:
+        gpucg_uncharge(gpucg, gpucg_bucket, len);
+put_gpucg:
+        gpucg_put(gpucg);
         return ERR_PTR(ret);
 }
 
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 2097760e8e95..8e7c55c830b3 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -13,6 +13,7 @@
 #ifndef __DMA_BUF_H__
 #define __DMA_BUF_H__
 
+#include <linux/cgroup_gpu.h>
 #include <linux/iosys-map.h>
 #include <linux/file.h>
 #include <linux/err.h>
@@ -303,7 +304,7 @@ struct dma_buf {
         /**
          * @size:
          *
-         * Size of the buffer; invariant over the lifetime of the buffer.
+         * Size of the buffer in bytes; invariant over the lifetime of the buffer.
          */
         size_t size;
 
@@ -453,6 +454,14 @@ struct dma_buf {
                 struct dma_buf *dmabuf;
         } *sysfs_entry;
 #endif
+
+#ifdef CONFIG_CGROUP_GPU
+        /** @gpucg: Pointer to the GPU cgroup this buffer currently belongs to. */
+        struct gpucg *gpucg;
+
+        /** @gpucg_bucket: Pointer to the GPU cgroup bucket this buffer originates from. */
+        struct gpucg_bucket *gpucg_bucket;
+#endif
 };
 
 /**
@@ -526,13 +535,15 @@ struct dma_buf_attachment {
 
 /**
  * struct dma_buf_export_info - holds information needed to export a dma_buf
- * @exp_name:	name of the exporter - useful for debugging.
- * @owner:	pointer to exporter module - used for refcounting kernel module
- * @ops:	Attach allocator-defined dma buf ops to the new buffer
- * @size:	Size of the buffer - invariant over the lifetime of the buffer
- * @flags:	mode flags for the file
- * @resv:	reservation-object, NULL to allocate default one
- * @priv:	Attach private data of allocator to this buffer
+ * @exp_name:	name of the exporter - useful for debugging.
+ * @owner:	pointer to exporter module - used for refcounting kernel module
+ * @ops:	Attach allocator-defined dma buf ops to the new buffer
+ * @size:	Size of the buffer in bytes - invariant over the lifetime of the buffer
+ * @flags:	mode flags for the file
+ * @resv:	reservation-object, NULL to allocate default one
+ * @priv:	Attach private data of allocator to this buffer
+ * @gpucg:	Pointer to GPU cgroup this buffer is charged to, or NULL if not charged
+ * @gpucg_bucket: Pointer to GPU cgroup bucket this buffer comes from, or NULL if not charged
  *
  * This structure holds the information required to export the buffer. Used
  * with dma_buf_export() only.
@@ -545,6 +556,10 @@ struct dma_buf_export_info {
         int flags;
         struct dma_resv *resv;
         void *priv;
+#ifdef CONFIG_CGROUP_GPU
+        struct gpucg *gpucg;
+        struct gpucg_bucket *gpucg_bucket;
+#endif
 };
 
 /**
@@ -630,4 +645,14 @@ int dma_buf_mmap(struct dma_buf *, struct vm_area_struct *,
                  unsigned long);
 int dma_buf_vmap(struct dma_buf *dmabuf, struct iosys_map *map);
 void dma_buf_vunmap(struct dma_buf *dmabuf, struct iosys_map *map);
+
+#ifdef CONFIG_CGROUP_GPU
+void dma_buf_exp_info_set_gpucg(struct dma_buf_export_info *exp_info,
+                                struct gpucg *gpucg,
+                                struct gpucg_bucket *gpucg_bucket);
+#else /* CONFIG_CGROUP_GPU */
+static inline void dma_buf_exp_info_set_gpucg(struct dma_buf_export_info *exp_info,
+                                              struct gpucg *gpucg,
+                                              struct gpucg_bucket *gpucg_bucket) {}
+#endif /* CONFIG_CGROUP_GPU */
 #endif /* __DMA_BUF_H__ */
diff --git a/include/linux/dma-heap.h b/include/linux/dma-heap.h
index 0c05561cad6e..6321e7636538 100644
--- a/include/linux/dma-heap.h
+++ b/include/linux/dma-heap.h
@@ -10,6 +10,7 @@
 #define _DMA_HEAPS_H
 
 #include <linux/cdev.h>
+#include <linux/cgroup_gpu.h>
 #include <linux/types.h>
 
 struct dma_heap;
@@ -59,6 +60,20 @@ void *dma_heap_get_drvdata(struct dma_heap *heap);
  */
 const char *dma_heap_get_name(struct dma_heap *heap);
 
+#ifdef CONFIG_CGROUP_GPU
+/**
+ * dma_heap_get_gpucg_bucket() - get a pointer to the struct gpucg_bucket for the heap.
+ * @heap: DMA-Heap to retrieve gpucg_bucket for
+ *
+ * Returns:
+ * The gpucg_bucket struct for the heap.
+ */
+struct gpucg_bucket *dma_heap_get_gpucg_bucket(struct dma_heap *heap);
+#else /* CONFIG_CGROUP_GPU */
+static inline struct gpucg_bucket *dma_heap_get_gpucg_bucket(struct dma_heap *heap)
+{
+        return NULL;
+}
+#endif /* CONFIG_CGROUP_GPU */
+
 /**
  * dma_heap_add - adds a heap to dmabuf heaps
  * @exp_info:		information needed to register this heap
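For reference, here is a userspace-side view of the accounting this patch
adds; a sketch only, not part of the series. The DMA-BUF heap ioctl is
existing upstream UAPI, while the cgroup mount point and the "system-heap"
key assume the naming scheme proposed above:

  /*
   * Allocate 4 MiB from the system heap. With CONFIG_CGROUP_GPU, the charge
   * should land on the allocating process's cgroup under the "system-heap"
   * bucket (the "-heap" suffix added in this patch).
   */
  #include <fcntl.h>
  #include <sys/ioctl.h>
  #include <unistd.h>
  #include <linux/dma-heap.h>

  int main(void)
  {
          struct dma_heap_allocation_data data = {
                  .len = 4 << 20,
                  .fd_flags = O_RDWR | O_CLOEXEC,
          };
          int heap_fd = open("/dev/dma_heap/system", O_RDONLY | O_CLOEXEC);

          if (heap_fd < 0 || ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &data) < 0)
                  return 1;

          /*
           * While data.fd is held open, something like the following is
           * expected (path and key are assumptions from this series):
           *   $ cat /sys/fs/cgroup/<cgroup>/gpu.memory.current
           *   system-heap 4194304
           */
          close(data.fd);
          close(heap_fd);
          return 0;
  }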
Mercier" To: tjmercier@google.com, daniel@ffwll.ch, tj@kernel.org, Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , Zefan Li , Johannes Weiner X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: kernel-team@android.com, tkjos@android.com, Kenny.Ho@amd.com, cgroups@vger.kernel.org, skhan@linuxfoundation.org, cmllamas@google.com, dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org, linaro-mm-sig@lists.linaro.org, jstultz@google.com, kaleshsingh@google.com, hridya@google.com, mkoutny@suse.com, surenb@google.com, linux-media@vger.kernel.org Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" The dma_buf_transfer_charge function provides a way for processes to transfer charge of a buffer to a different process. This is essential for the cases where a central allocator process does allocations for various subsystems, hands over the fd to the client who requested the memory and drops all references to the allocated memory. Originally-by: Hridya Valsaraju Signed-off-by: T.J. Mercier --- v5 changes Fix commit message which still contained the old name for dma_buf_transfer_charge per Michal Koutný. Modify the dma_buf_transfer_charge API to accept a task_struct instead of a gpucg. This avoids requiring the caller to manage the refcount of the gpucg upon failure and confusing ownership transfer logic. v4 changes Adjust ordering of charge/uncharge during transfer to avoid potentially hitting cgroup limit per Michal Koutný. v3 changes Use more common dual author commit message format per John Stultz. v2 changes Move dma-buf cgroup charge transfer from a dma_buf_op defined by every heap to a single dma-buf function for all heaps per Daniel Vetter and Christian König. --- drivers/dma-buf/dma-buf.c | 57 +++++++++++++++++++++++++++++++++++ include/linux/cgroup_gpu.h | 14 +++++++++ include/linux/dma-buf.h | 6 ++++ kernel/cgroup/gpu.c | 62 ++++++++++++++++++++++++++++++++++++++ 4 files changed, 139 insertions(+) diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index bc89c44bd9b9..f3fb844925e2 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -1341,6 +1341,63 @@ void dma_buf_vunmap(struct dma_buf *dmabuf, struct iosys_map *map) } EXPORT_SYMBOL_NS_GPL(dma_buf_vunmap, DMA_BUF); +/** + * dma_buf_transfer_charge - Change the GPU cgroup to which the provided dma_buf is charged. + * @dmabuf: [in] buffer whose charge will be migrated to a different GPU cgroup + * @target: [in] the task_struct of the destination process for the GPU cgroup charge + * + * Only tasks that belong to the same cgroup the buffer is currently charged to + * may call this function, otherwise it will return -EPERM. + * + * Returns 0 on success, or a negative errno code otherwise. + */ +int dma_buf_transfer_charge(struct dma_buf *dmabuf, struct task_struct *target) +{ + struct gpucg *current_gpucg, *target_gpucg, *to_release; + int ret; + + if (!dmabuf->gpucg || !dmabuf->gpucg_bucket) { + /* This dmabuf is not tracked under GPU cgroup accounting */ + return 0; + } + + current_gpucg = gpucg_get(current); + target_gpucg = gpucg_get(target); + to_release = target_gpucg; + + /* If the source and destination cgroups are the same, don't do anything. 
+        if (current_gpucg == target_gpucg) {
+                ret = 0;
+                goto skip_transfer;
+        }
+
+        /*
+         * Verify that the cgroup of the process requesting the transfer
+         * is the same as the one the buffer is currently charged to.
+         */
+        mutex_lock(&dmabuf->lock);
+        if (current_gpucg != dmabuf->gpucg) {
+                ret = -EPERM;
+                goto err;
+        }
+
+        ret = gpucg_transfer_charge(
+                dmabuf->gpucg, target_gpucg, dmabuf->gpucg_bucket, dmabuf->size);
+        if (ret)
+                goto err;
+
+        to_release = dmabuf->gpucg;
+        dmabuf->gpucg = target_gpucg;
+
+err:
+        mutex_unlock(&dmabuf->lock);
+skip_transfer:
+        gpucg_put(current_gpucg);
+        gpucg_put(to_release);
+        return ret;
+}
+EXPORT_SYMBOL_NS_GPL(dma_buf_transfer_charge, DMA_BUF);
+
 #ifdef CONFIG_DEBUG_FS
 static int dma_buf_debug_show(struct seq_file *s, void *unused)
 {
diff --git a/include/linux/cgroup_gpu.h b/include/linux/cgroup_gpu.h
index 4dfe633d6ec7..f5973ef9f926 100644
--- a/include/linux/cgroup_gpu.h
+++ b/include/linux/cgroup_gpu.h
@@ -83,7 +83,13 @@ static inline struct gpucg *gpucg_parent(struct gpucg *cg)
 }
 
 int gpucg_charge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size);
+
 void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size);
+
+int gpucg_transfer_charge(struct gpucg *source,
+                          struct gpucg *dest,
+                          struct gpucg_bucket *bucket,
+                          u64 size);
 int gpucg_register_bucket(struct gpucg_bucket *bucket, const char *name);
 #else /* CONFIG_CGROUP_GPU */
@@ -118,6 +124,14 @@ static inline void gpucg_uncharge(struct gpucg *gpucg,
                                   struct gpucg_bucket *bucket,
                                   u64 size) {}
 
+static inline int gpucg_transfer_charge(struct gpucg *source,
+                                        struct gpucg *dest,
+                                        struct gpucg_bucket *bucket,
+                                        u64 size)
+{
+        return 0;
+}
+
 static inline int gpucg_register_bucket(struct gpucg_bucket *bucket, const char *name)
 { return 0; }
 #endif /* CONFIG_CGROUP_GPU */
 #endif /* _CGROUP_GPU_H */
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 8e7c55c830b3..438ad8577b76 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -18,6 +18,7 @@
 #include <linux/iosys-map.h>
 #include <linux/file.h>
 #include <linux/err.h>
+#include <linux/sched.h>
 #include <linux/scatterlist.h>
 #include <linux/list.h>
 #include <linux/dma-mapping.h>
@@ -650,9 +651,14 @@ void dma_buf_vunmap(struct dma_buf *dmabuf, struct iosys_map *map);
 void dma_buf_exp_info_set_gpucg(struct dma_buf_export_info *exp_info,
                                 struct gpucg *gpucg,
                                 struct gpucg_bucket *gpucg_bucket);
+
+int dma_buf_transfer_charge(struct dma_buf *dmabuf, struct task_struct *target);
 #else /* CONFIG_CGROUP_GPU */
 static inline void dma_buf_exp_info_set_gpucg(struct dma_buf_export_info *exp_info,
                                               struct gpucg *gpucg,
                                               struct gpucg_bucket *gpucg_bucket) {}
+
+static inline int dma_buf_transfer_charge(struct dma_buf *dmabuf, struct task_struct *target)
+{
+        return 0;
+}
 #endif /* CONFIG_CGROUP_GPU */
 #endif /* __DMA_BUF_H__ */
diff --git a/kernel/cgroup/gpu.c b/kernel/cgroup/gpu.c
index 34d0a5b85834..7dfbe0fd7e45 100644
--- a/kernel/cgroup/gpu.c
+++ b/kernel/cgroup/gpu.c
@@ -252,6 +252,68 @@ void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size)
         css_put(&gpucg->css);
 }
 
+/**
+ * gpucg_transfer_charge - Transfer a GPU charge from one cgroup to another.
+ *
+ * @source:	[in]	The GPU cgroup the charge will be transferred from.
+ * @dest:	[in]	The GPU cgroup the charge will be transferred to.
+ * @bucket:	[in]	The GPU cgroup bucket corresponding to the charge.
+ * @size:	[in]	The size of the memory in bytes.
+ *			This size will be rounded up to the nearest page size.
+ *
+ * Returns 0 on success, or a negative errno code otherwise.
+ */
+int gpucg_transfer_charge(struct gpucg *source,
+                          struct gpucg *dest,
+                          struct gpucg_bucket *bucket,
+                          u64 size)
+{
+        struct page_counter *counter;
+        u64 nr_pages;
+        struct gpucg_resource_pool *rp_source, *rp_dest;
+        int ret = 0;
+
+        nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
+
+        mutex_lock(&gpucg_mutex);
+        rp_source = cg_rpool_find_locked(source, bucket);
+        if (unlikely(!rp_source)) {
+                ret = -ENOENT;
+                goto exit_early;
+        }
+
+        rp_dest = cg_rpool_get_locked(dest, bucket);
+        if (IS_ERR(rp_dest)) {
+                ret = PTR_ERR(rp_dest);
+                goto exit_early;
+        }
+
+        /*
+         * First uncharge from the pool it's currently charged to. This ordering avoids double
+         * charging while the transfer is in progress, which could cause us to hit a limit.
+         * If the try_charge fails for this transfer, we need to be able to reverse this uncharge,
+         * so we continue to hold the gpucg_mutex here.
+         */
+        page_counter_uncharge(&rp_source->total, nr_pages);
+        css_put(&source->css);
+
+        /* Now attempt the new charge */
+        if (page_counter_try_charge(&rp_dest->total, nr_pages, &counter)) {
+                css_get(&dest->css);
+        } else {
+                /*
+                 * The new charge failed, so reverse the uncharge from above. This should always
+                 * succeed since charges on source are blocked by gpucg_mutex.
+                 */
+                WARN_ON(!page_counter_try_charge(&rp_source->total, nr_pages, &counter));
+                css_get(&source->css);
+                ret = -ENOMEM;
+        }
+exit_early:
+        mutex_unlock(&gpucg_mutex);
+        return ret;
+}
+
 /**
  * gpucg_register_bucket - Registers a bucket for memory accounting using the
  * GPU cgroup controller.
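A sketch of how an in-kernel user other than binder (patch 5/6 below) might
consume this API; the surrounding IPC driver and its fd-passing path are
hypothetical, while dma_buf_get()/dma_buf_put() are existing dma-buf APIs:

  /*
   * Hypothetical in-kernel IPC path: move a DMA-BUF's cgroup charge to the
   * task that is receiving the buffer. Not part of this series.
   */
  static int ipc_pass_dmabuf_fd(int fd, struct task_struct *recipient)
  {
          struct dma_buf *dmabuf = dma_buf_get(fd);  /* takes a file reference */
          int ret;

          if (IS_ERR(dmabuf))
                  return PTR_ERR(dmabuf);

          /*
           * Move the cgroup charge along with ownership of the buffer.
           * Fails with -EPERM if the caller's cgroup does not currently own
           * the charge, or -ENOMEM if the recipient's cgroup is at its limit.
           */
          ret = dma_buf_transfer_charge(dmabuf, recipient);

          dma_buf_put(dmabuf);
          return ret;
  }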
From patchwork Wed Apr 20 23:52:23 2022
From: "T.J. Mercier" <tjmercier@google.com>
Date: Wed, 20 Apr 2022 23:52:23 +0000
Message-Id: <20220420235228.2767816-6-tjmercier@google.com>
In-Reply-To: <20220420235228.2767816-1-tjmercier@google.com>
References: <20220420235228.2767816-1-tjmercier@google.com>
Subject: [RFC v5 5/6] binder: Add flags to relinquish ownership of fds
To: tjmercier@google.com, daniel@ffwll.ch, tj@kernel.org, Greg Kroah-Hartman,
 Arve Hjønnevåg, Todd Kjos, Martijn Coenen, Joel Fernandes,
 Christian Brauner, Hridya Valsaraju, Suren Baghdasaryan, Sumit Semwal,
 Christian König
Cc: Kenny.Ho@amd.com, kaleshsingh@google.com, cmllamas@google.com,
 dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org,
 linaro-mm-sig@lists.linaro.org, mkoutny@suse.com,
 skhan@linuxfoundation.org, jstultz@google.com, kernel-team@android.com,
 linux-media@vger.kernel.org

From: Hridya Valsaraju

This patch introduces the flags BINDER_FD_FLAG_SENDER_NO_NEED and
BINDER_FDA_FLAG_SENDER_NO_NEED, which a process sending an individual fd
or an fd array to another process over binder IPC can set to relinquish
ownership of the fds being sent, for memory accounting purposes. If the
flag is found to be set during the fd or fd array translation and the fd
is for a DMA-BUF, the buffer is uncharged from the sender's cgroup and
charged to the receiving process's cgroup instead.

It is up to the sending process to ensure that it closes the fds
regardless of whether the transfer failed or succeeded.

Most graphics shared memory allocations in Android are done by the
graphics allocator HAL process. On requests from clients, the HAL process
allocates memory and sends the fds to the clients over binder IPC. The
graphics allocator HAL will not retain any references to the buffers.
When the HAL sets *_FLAG_SENDER_NO_NEED for fd arrays holding DMA-BUF
fds, or for individual fd objects, the GPU cgroup controller will be
able to correctly charge the buffers to the client processes instead of
to the graphics allocator HAL.

Since this is a new feature exposed to userspace, the kernel and
userspace must be compatible for the accounting to work for transfers.
In all cases the allocation and transport of DMA buffers via binder will
succeed, but the transfer accounting works only when the kernel supports
this feature and userspace depends on it.
The possible scenarios are detailed below:

1. new kernel + old userspace
The kernel supports the feature but userspace does not use it. The old
userspace won't mount the new cgroup controller, accounting is not
performed, charge is not transferred.

2. old kernel + new userspace
The new cgroup controller is not supported by the kernel, accounting is
not performed, charge is not transferred.

3. old kernel + old userspace
Same as #2

4. new kernel + new userspace
Cgroup is mounted, feature is supported and used.

Signed-off-by: Hridya Valsaraju
Signed-off-by: T.J. Mercier
Reviewed-by: Carlos Llamas
---
v5 changes
Support both binder_fd_array_object and binder_fd_object. This is
necessary because new versions of Android will use binder_fd_object
instead of binder_fd_array_object, and we need to support both.

Use the new, simpler dma_buf_transfer_charge API.

v3 changes
Remove android from title per Todd Kjos.

Use more common dual author commit message format per John Stultz.

Include details on behavior for all combinations of kernel/userspace
versions in changelog (thanks Suren Baghdasaryan) per Greg Kroah-Hartman.

v2 changes
Move dma-buf cgroup charge transfer from a dma_buf_op defined by every
heap to a single dma-buf function for all heaps per Daniel Vetter and
Christian König.
---
 drivers/android/binder.c            | 27 +++++++++++++++++++++++----
 drivers/dma-buf/dma-buf.c           |  4 ++--
 include/linux/dma-buf.h             |  2 +-
 include/uapi/linux/android/binder.h | 23 +++++++++++++++++++----
 4 files changed, 45 insertions(+), 11 deletions(-)

diff --git a/drivers/android/binder.c b/drivers/android/binder.c
index 8351c5638880..b07d50fe1c80 100644
--- a/drivers/android/binder.c
+++ b/drivers/android/binder.c
@@ -42,6 +42,7 @@
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
+#include <linux/dma-buf.h>
 #include <linux/fdtable.h>
 #include <linux/file.h>
 #include <linux/freezer.h>
@@ -2170,7 +2171,7 @@ static int binder_translate_handle(struct flat_binder_object *fp,
         return ret;
 }
 
-static int binder_translate_fd(u32 fd, binder_size_t fd_offset,
+static int binder_translate_fd(u32 fd, binder_size_t fd_offset, __u32 flags,
                                struct binder_transaction *t,
                                struct binder_thread *thread,
                                struct binder_transaction *in_reply_to)
@@ -2208,6 +2209,23 @@ static int binder_translate_fd(u32 fd, binder_size_t fd_offset,
                 goto err_security;
         }
 
+        if (IS_ENABLED(CONFIG_CGROUP_GPU) && (flags & BINDER_FD_FLAG_SENDER_NO_NEED)) {
+                if (is_dma_buf_file(file)) {
+                        struct dma_buf *dmabuf = file->private_data;
+
+                        ret = dma_buf_transfer_charge(dmabuf, target_proc->tsk);
+                        if (ret)
+                                pr_warn("%d:%d Unable to transfer DMA-BUF fd charge to %d\n",
+                                        proc->pid, thread->pid, target_proc->pid);
+                } else {
+                        binder_user_error(
+                                "%d:%d got transaction with SENDER_NO_NEED for non-dmabuf fd, %d\n",
+                                proc->pid, thread->pid, fd);
+                        ret = -EINVAL;
+                        goto err_noneed;
+                }
+        }
+
         /*
          * Add fixup record for this transaction. The allocation
          * of the fd in the target needs to be done from a
@@ -2226,6 +2244,7 @@ static int binder_translate_fd(u32 fd, binder_size_t fd_offset,
         return ret;
 
 err_alloc:
+err_noneed:
 err_security:
         fput(file);
 err_fget:
@@ -2528,7 +2547,7 @@ static int binder_translate_fd_array(struct list_head *pf_head,
                 ret = copy_from_user(&fd, sender_ufda_base + sender_uoffset, sizeof(fd));
                 if (!ret)
-                        ret = binder_translate_fd(fd, offset, t, thread,
+                        ret = binder_translate_fd(fd, offset, fda->flags, t, thread,
                                                   in_reply_to);
                 if (ret)
                         return ret > 0 ? -EINVAL : ret;
@@ -3179,8 +3198,8 @@ static void binder_transaction(struct binder_proc *proc,
                         struct binder_fd_object *fp = to_binder_fd_object(hdr);
                         binder_size_t fd_offset = object_offset +
                                 (uintptr_t)&fp->fd - (uintptr_t)fp;
-                        int ret = binder_translate_fd(fp->fd, fd_offset, t,
-                                                      thread, in_reply_to);
+                        int ret = binder_translate_fd(fp->fd, fd_offset, fp->flags,
+                                                      t, thread, in_reply_to);
 
                         fp->pad_binder = 0;
                         if (ret < 0 ||
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index f3fb844925e2..36ed6cd4ddcc 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -31,7 +31,6 @@
 
 #include "dma-buf-sysfs-stats.h"
 
-static inline int is_dma_buf_file(struct file *);
 
 struct dma_buf_list {
         struct list_head head;
@@ -400,10 +399,11 @@ static const struct file_operations dma_buf_fops = {
 /*
  * is_dma_buf_file - Check if struct file* is associated with dma_buf
  */
-static inline int is_dma_buf_file(struct file *file)
+int is_dma_buf_file(struct file *file)
 {
         return file->f_op == &dma_buf_fops;
 }
+EXPORT_SYMBOL_NS_GPL(is_dma_buf_file, DMA_BUF);
 
 static struct file *dma_buf_getfile(struct dma_buf *dmabuf, int flags)
 {
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 438ad8577b76..2b9812758fee 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -614,7 +614,7 @@ dma_buf_attachment_is_dynamic(struct dma_buf_attachment *attach)
 {
         return !!attach->importer_ops;
 }
-
+int is_dma_buf_file(struct file *file);
 struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf,
                                           struct device *dev);
 struct dma_buf_attachment *
diff --git a/include/uapi/linux/android/binder.h b/include/uapi/linux/android/binder.h
index 11157fae8a8e..b263cbb603ea 100644
--- a/include/uapi/linux/android/binder.h
+++ b/include/uapi/linux/android/binder.h
@@ -91,14 +91,14 @@ struct flat_binder_object {
 /**
  * struct binder_fd_object - describes a filedescriptor to be fixed up.
  * @hdr:		common header structure
- * @pad_flags:		padding to remain compatible with old userspace code
+ * @flags:		One or more BINDER_FD_FLAG_* flags
  * @pad_binder:		padding to remain compatible with old userspace code
  * @fd:			file descriptor
  * @cookie:		opaque data, used by user-space
  */
 struct binder_fd_object {
         struct binder_object_header	hdr;
-        __u32				pad_flags;
+        __u32				flags;
         union {
                 binder_uintptr_t	pad_binder;
                 __u32			fd;
@@ -107,6 +107,17 @@ struct binder_fd_object {
         binder_uintptr_t		cookie;
 };
 
+enum {
+        /**
+         * @BINDER_FD_FLAG_SENDER_NO_NEED
+         *
+         * When set, the sender of a binder_fd_object wishes to relinquish
+         * ownership of the fd for memory accounting purposes. If the fd is
+         * for a DMA-BUF, the buffer is uncharged from the sender's cgroup
+         * and charged to the receiving process's cgroup instead.
+         */
+        BINDER_FD_FLAG_SENDER_NO_NEED = 0x2000,
+};
+
 /* struct binder_buffer_object - object describing a userspace buffer
  * @hdr:		common header structure
  * @flags:		one or more BINDER_BUFFER_* flags
@@ -141,7 +152,7 @@ enum {
 
 /* struct binder_fd_array_object - object describing an array of fds in a buffer
  * @hdr:		common header structure
- * @pad:		padding to ensure correct alignment
+ * @flags:		One or more BINDER_FDA_FLAG_* flags
  * @num_fds:		number of file descriptors in the buffer
  * @parent:		index in offset array to buffer holding the fd array
  * @parent_offset:	start offset of fd array in the buffer
@@ -162,12 +173,16 @@ enum {
  */
 struct binder_fd_array_object {
         struct binder_object_header	hdr;
-        __u32				pad;
+        __u32				flags;
         binder_size_t			num_fds;
         binder_size_t			parent;
         binder_size_t			parent_offset;
 };
 
+enum {
+        BINDER_FDA_FLAG_SENDER_NO_NEED = BINDER_FD_FLAG_SENDER_NO_NEED,
+};
+
 /*
  * On 64-bit platforms where user code may run in 32-bits the driver must
  * translate the buffer (and local binder) addresses appropriately.
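A userspace-side sketch of how an allocator-style sender might use the new
flag; the helper below is hypothetical and the surrounding transaction
assembly (binder_transaction_data setup, BC_TRANSACTION ioctl) is omitted:

  /*
   * Mark a DMA-BUF fd being passed away with BINDER_FD_FLAG_SENDER_NO_NEED,
   * so the receiving process's cgroup picks up the memory charge. The sender
   * must still close the fd itself whether or not the transfer succeeds.
   */
  #include <linux/android/binder.h>

  static void fill_fd_object(struct binder_fd_object *obj, int dmabuf_fd)
  {
          obj->hdr.type = BINDER_TYPE_FD;
          obj->flags = BINDER_FD_FLAG_SENDER_NO_NEED; /* relinquish the charge */
          obj->fd = dmabuf_fd;
          obj->cookie = 0;
  }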