
[09/19] io_uring: convert to use vm_account

Message ID 44e6ead48bc53789191b22b0e140aeb82459e75f.1675669136.git-series.apopple@nvidia.com
State New
Series mm: Introduce a cgroup to limit the amount of locked and pinned memory

Commit Message

Alistair Popple Feb. 6, 2023, 7:47 a.m. UTC
Convert io_uring to use vm_account instead of directly charging pages
against the user/mm. Rather than charge pages to both user->locked_vm
and mm->pinned_vm this will only charge pages to user->locked_vm.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: io-uring@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 include/linux/io_uring_types.h |  4 ++--
 io_uring/io_uring.c            | 20 +++---------------
 io_uring/notif.c               |  4 ++--
 io_uring/notif.h               | 10 +++------
 io_uring/rsrc.c                | 38 +++--------------------------------
 io_uring/rsrc.h                |  9 +--------
 6 files changed, 17 insertions(+), 68 deletions(-)
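
For reference, since only this patch was posted to the io_uring list, the
vm_account interface it relies on (introduced earlier in the series) is used
roughly as sketched below. The layout and signatures here are inferred from
the call sites in the patch and may differ from the actual earlier patch:

struct vm_account {
	struct task_struct *task;
	struct mm_struct *mm;
	struct user_struct *user;
	enum vm_account_flags flags;	/* e.g. VM_ACCOUNT_USER, VM_ACCOUNT_BYPASS */
};

/* Take references on the task/mm/user that pages will be charged against. */
void vm_account_init(struct vm_account *vm_account, struct task_struct *task,
		     struct user_struct *user, enum vm_account_flags flags);

/* Charge nr_pages against the pinned/locked limit; 0 on success, negative errno on failure. */
int vm_account_pinned(struct vm_account *vm_account, unsigned long nr_pages);

/* Undo a previous charge of nr_pages. */
void vm_unaccount_pinned(struct vm_account *vm_account, unsigned long nr_pages);

/* Drop the references taken in vm_account_init(). */
void vm_account_release(struct vm_account *vm_account);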

Comments

Jens Axboe Feb. 6, 2023, 3:29 p.m. UTC | #1
On 2/6/23 12:47 AM, Alistair Popple wrote:
> Convert io_uring to use vm_account instead of directly charging pages
> against the user/mm. Rather than charge pages to both user->locked_vm
> and mm->pinned_vm this will only charge pages to user->locked_vm.

Not sure how we're supposed to review this, when you just sent us 9/19
and vm_account_release() is supposedly an earlier patch in this series.

Either CC the whole series, or at least the cover letter, core parts,
and the per-subsystem parts.
Alistair Popple Feb. 7, 2023, 1:03 a.m. UTC | #2
Jens Axboe <axboe@kernel.dk> writes:

> On 2/6/23 12:47 AM, Alistair Popple wrote:
>> Convert io_uring to use vm_account instead of directly charging pages
>> against the user/mm. Rather than charge pages to both user->locked_vm
>> and mm->pinned_vm this will only charge pages to user->locked_vm.
>
> Not sure how we're supposed to review this, when you just sent us 9/19
> and vm_account_release() is supposedly an earlier patch in this series.
>
> Either CC the whole series, or at least the cover letter, core parts,
> and the per-subsystem parts.

Ok, thanks. Will be sure to add everyone to the cover letter and patch
01 when I send the next version.

For reference the cover letter is here:

https://lore.kernel.org/linux-mm/cover.c238416f0e82377b449846dbb2459ae9d7030c8e.1675669136.git-series.apopple@nvidia.com/

And the core patch that introduces vm_account is here:

https://lore.kernel.org/linux-mm/e80b61561f97296a6c08faeebe281cb949333d1d.1675669136.git-series.apopple@nvidia.com/

No problem if you want to wait for the resend/next version before
taking another look though.
Jens Axboe Feb. 7, 2023, 2:28 p.m. UTC | #3
On 2/6/23 6:03 PM, Alistair Popple wrote:
> 
> Jens Axboe <axboe@kernel.dk> writes:
> 
>> On 2/6/23 12:47 AM, Alistair Popple wrote:
>>> Convert io_uring to use vm_account instead of directly charging pages
>>> against the user/mm. Rather than charge pages to both user->locked_vm
>>> and mm->pinned_vm this will only charge pages to user->locked_vm.
>>
>> Not sure how we're supposed to review this, when you just sent us 9/19
>> and vm_account_release() is supposedly an earlier patch in this series.
>>
>> Either CC the whole series, or at least the cover letter, core parts,
>> and the per-subsystem parts.
> 
> Ok, thanks. Will be sure to add everyone to the cover letter and patch
> 01 when I send the next version.
> 
> For reference the cover letter is here:
> 
> https://lore.kernel.org/linux-mm/cover.c238416f0e82377b449846dbb2459ae9d7030c8e.1675669136.git-series.apopple@nvidia.com/
> 
> And the core patch that introduces vm_account is here:
> 
> https://lore.kernel.org/linux-mm/e80b61561f97296a6c08faeebe281cb949333d1d.1675669136.git-series.apopple@nvidia.com/
> 
> No problem if you want to wait for the resend/next version before
> taking another look though.

Thanks, that helps. As noted in the cover letter, I also have to
agree that this is badly named. It's way too generic; it needs a name
that tells you what it does. There's tons of accounting, so you need
to be more specific.

Outside of that, we're now doubling the amount of memory associated with
tracking this. That isn't necessarily a showstopper, but it is not
ideal. I didn't take a look at the other conversions (again, because
they were not sent to me), but seems like the task_struct and flags
could just be passed in as they may very well be known to many/most
callers?
Jason Gunthorpe Feb. 7, 2023, 2:55 p.m. UTC | #4
On Tue, Feb 07, 2023 at 07:28:56AM -0700, Jens Axboe wrote:

> Outside of that, we're now doubling the amount of memory associated with
> tracking this. That isn't necessarily a showstopper, but it is not
> ideal. I didn't take a look at the other conversions (again, because
> they were not sent to me), but seems like the task_struct and flags
> could just be passed in as they may very well be known to many/most
> callers?

For places doing the mm accounting type it cannot use the task struct
as the underlying mm can be replaced and keep the task, IIRC.

We just had a bug in VFIO related to this..

If we could go back from the mm to the task (even from a destroyed
mm, though) that might work to reduce storage?

Jason
Jens Axboe Feb. 7, 2023, 5:05 p.m. UTC | #5
On 2/7/23 7:55 AM, Jason Gunthorpe wrote:
> On Tue, Feb 07, 2023 at 07:28:56AM -0700, Jens Axboe wrote:
> 
>> Outside of that, we're now doubling the amount of memory associated with
>> tracking this. That isn't necessarily a showstopper, but it is not
>> ideal. I didn't take a look at the other conversions (again, because
>> they were not sent to me), but seems like the task_struct and flags
>> could just be passed in as they may very well be known to many/most
>> callers?
> 
> For places doing the mm accounting type it cannot use the task struct
> as the underlying mm can be replaced and keep the task, IIRC.
> 
> We just had a bug in VFIO related to this..
> 
> If we could go back from the mm to the task (even from a destroyed
> mm, though) that might work to reduce storage?

Then maybe just nest them:

struct small_one {
	struct mm_struct *mm;
	struct user_struct *user;
};

struct big_one {
	struct small_one foo;
	struct task_struct *task;
	enum vm_account_flags flags;
};

and have the real helpers deal with small_one, with wrappers around that
taking big_one that just pass in the missing bits. Then users that
don't need the extra bits can just use the right API.
Alistair Popple Feb. 13, 2023, 11:30 a.m. UTC | #6
Jens Axboe <axboe@kernel.dk> writes:

> On 2/7/23 7:55 AM, Jason Gunthorpe wrote:
>> On Tue, Feb 07, 2023 at 07:28:56AM -0700, Jens Axboe wrote:
>> 
>>> Outside of that, we're now doubling the amount of memory associated with
>>> tracking this. That isn't necessarily a showstopper, but it is not
>>> ideal. I didn't take a look at the other conversions (again, because
>>> they were not sent to me), but seems like the task_struct and flags
>>> could just be passed in as they may very well be known to many/most
>>> callers?
>> 
>> For places doing the mm accounting type it cannot use the task struct
>> as the underlying mm can be replaced and keep the task, IIRC.
>> 
>> We just had a bug in VFIO related to this..
>> 
>> If we could go back from the mm to the task (even from a destroyed
>> mm, though) that might work to reduce storage?

Yes, it's the going back from a destroyed mm that gets a bit murky. I
don't think it's a showstopper, as we could probably keep track of that
when we destroy the mm, but it seems like a fair amount of complexity to
save a smallish amount of memory.

However, if we end up tacking this onto memcg instead then we could use
that to go back to the task and move any charges when the mm moves.

> Then maybe just nest them:
>
> struct small_one {
> 	struct mm_struct *mm;
> 	struct user_struct *user;
> };
>
> struct big_one {
> 	struct small_one foo;
> 	struct task_struct *task;
> 	enum vm_account_flags flags;
> };
>
> and have the real helpers deal with small_one, with wrappers around that
> taking big_one that just pass in the missing bits. Then users that
> don't need the extra bits can just use the right API.

Thanks for the suggestion; it should work, noting that we will have to
add a struct pins_cgroup pointer to struct small_one. It may also help
with a similar problem I was having with one of the networking
conversions.
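
A minimal sketch of how that nesting could look, with the struct pins_cgroup
pointer mentioned above folded into the inner struct. The names and signatures
below come from the discussion and are illustrative only, not code from the
series:

struct small_one {
	struct mm_struct *mm;
	struct user_struct *user;
	struct pins_cgroup *pins_cg;	/* hypothetical cgroup charge target */
};

struct big_one {
	struct small_one foo;
	struct task_struct *task;
	enum vm_account_flags flags;
};

/* The real helpers operate on the small struct and take the rest as arguments... */
int __vm_account_pinned(struct small_one *acct, unsigned long nr_pages,
			enum vm_account_flags flags);

/* ...while users that keep the big struct go through thin wrappers. */
static inline int vm_account_pinned(struct big_one *acct, unsigned long nr_pages)
{
	return __vm_account_pinned(&acct->foo, nr_pages, acct->flags);
}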

Patch

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 128a67a..45ac75d 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -5,6 +5,7 @@ 
 #include <linux/task_work.h>
 #include <linux/bitmap.h>
 #include <linux/llist.h>
+#include <linux/vm_account.h>
 #include <uapi/linux/io_uring.h>
 
 struct io_wq_work_node {
@@ -343,8 +344,7 @@  struct io_ring_ctx {
 	struct io_wq_hash		*hash_map;
 
 	/* Only used for accounting purposes */
-	struct user_struct		*user;
-	struct mm_struct		*mm_account;
+	struct vm_account               vm_account;
 
 	/* ctx exit and cancelation */
 	struct llist_head		fallback_llist;
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 0a4efad..912da4f 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2744,15 +2744,11 @@  static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
 #endif
 	WARN_ON_ONCE(!list_empty(&ctx->ltimeout_list));
 
-	if (ctx->mm_account) {
-		mmdrop(ctx->mm_account);
-		ctx->mm_account = NULL;
-	}
+	vm_account_release(&ctx->vm_account);
 	io_mem_free(ctx->rings);
 	io_mem_free(ctx->sq_sqes);
 
 	percpu_ref_exit(&ctx->refs);
-	free_uid(ctx->user);
 	io_req_caches_free(ctx);
 	if (ctx->hash_map)
 		io_wq_put_hash(ctx->hash_map);
@@ -3585,8 +3581,9 @@  static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
 		ctx->syscall_iopoll = 1;
 
 	ctx->compat = in_compat_syscall();
-	if (!capable(CAP_IPC_LOCK))
-		ctx->user = get_uid(current_user());
+	vm_account_init(&ctx->vm_account, current, current_user(),
+			VM_ACCOUNT_USER |
+			(capable(CAP_IPC_LOCK) ? VM_ACCOUNT_BYPASS : 0));
 
 	/*
 	 * For SQPOLL, we just need a wakeup, always. For !SQPOLL, if
@@ -3619,15 +3616,6 @@  static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
 		goto err;
 	}
 
-	/*
-	 * This is just grabbed for accounting purposes. When a process exits,
-	 * the mm is exited and dropped before the files, hence we need to hang
-	 * on to this mm purely for the purposes of being able to unaccount
-	 * memory (locked/pinned vm). It's not used for anything else.
-	 */
-	mmgrab(current->mm);
-	ctx->mm_account = current->mm;
-
 	ret = io_allocate_scq_urings(ctx, p);
 	if (ret)
 		goto err;
diff --git a/io_uring/notif.c b/io_uring/notif.c
index c4bb793..0f589fa 100644
--- a/io_uring/notif.c
+++ b/io_uring/notif.c
@@ -17,8 +17,8 @@  static void io_notif_complete_tw_ext(struct io_kiocb *notif, bool *locked)
 	if (nd->zc_report && (nd->zc_copied || !nd->zc_used))
 		notif->cqe.res |= IORING_NOTIF_USAGE_ZC_COPIED;
 
-	if (nd->account_pages && ctx->user) {
-		__io_unaccount_mem(ctx->user, nd->account_pages);
+	if (nd->account_pages) {
+		vm_unaccount_pinned(&ctx->vm_account, nd->account_pages);
 		nd->account_pages = 0;
 	}
 	io_req_task_complete(notif, locked);
diff --git a/io_uring/notif.h b/io_uring/notif.h
index c88c800..e2cb44a 100644
--- a/io_uring/notif.h
+++ b/io_uring/notif.h
@@ -43,11 +43,9 @@  static inline int io_notif_account_mem(struct io_kiocb *notif, unsigned len)
 	unsigned nr_pages = (len >> PAGE_SHIFT) + 2;
 	int ret;
 
-	if (ctx->user) {
-		ret = __io_account_mem(ctx->user, nr_pages);
-		if (ret)
-			return ret;
-		nd->account_pages += nr_pages;
-	}
+	ret = __io_account_mem(&ctx->vm_account, nr_pages);
+	if (ret)
+		return ret;
+	nd->account_pages += nr_pages;
 	return 0;
 }
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 18de10c..aa44528 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -42,49 +42,19 @@  void io_rsrc_refs_drop(struct io_ring_ctx *ctx)
 	}
 }
 
-int __io_account_mem(struct user_struct *user, unsigned long nr_pages)
+int __io_account_mem(struct vm_account *vm_account, unsigned long nr_pages)
 {
-	unsigned long page_limit, cur_pages, new_pages;
-
-	if (!nr_pages)
-		return 0;
-
-	/* Don't allow more pages than we can safely lock */
-	page_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-
-	cur_pages = atomic_long_read(&user->locked_vm);
-	do {
-		new_pages = cur_pages + nr_pages;
-		if (new_pages > page_limit)
-			return -ENOMEM;
-	} while (!atomic_long_try_cmpxchg(&user->locked_vm,
-					  &cur_pages, new_pages));
-	return 0;
+	return vm_account_pinned(vm_account, nr_pages);
 }
 
 static void io_unaccount_mem(struct io_ring_ctx *ctx, unsigned long nr_pages)
 {
-	if (ctx->user)
-		__io_unaccount_mem(ctx->user, nr_pages);
-
-	if (ctx->mm_account)
-		atomic64_sub(nr_pages, &ctx->mm_account->pinned_vm);
+	vm_unaccount_pinned(&ctx->vm_account, nr_pages);
 }
 
 static int io_account_mem(struct io_ring_ctx *ctx, unsigned long nr_pages)
 {
-	int ret;
-
-	if (ctx->user) {
-		ret = __io_account_mem(ctx->user, nr_pages);
-		if (ret)
-			return ret;
-	}
-
-	if (ctx->mm_account)
-		atomic64_add(nr_pages, &ctx->mm_account->pinned_vm);
-
-	return 0;
+	return vm_account_pinned(&ctx->vm_account, nr_pages);
 }
 
 static int io_copy_iov(struct io_ring_ctx *ctx, struct iovec *dst,
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 2b87436..d8833d0 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -167,12 +167,5 @@  static inline u64 *io_get_tag_slot(struct io_rsrc_data *data, unsigned int idx)
 int io_files_update(struct io_kiocb *req, unsigned int issue_flags);
 int io_files_update_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 
-int __io_account_mem(struct user_struct *user, unsigned long nr_pages);
-
-static inline void __io_unaccount_mem(struct user_struct *user,
-				      unsigned long nr_pages)
-{
-	atomic_long_sub(nr_pages, &user->locked_vm);
-}
-
+int __io_account_mem(struct vm_account *vm_account, unsigned long nr_pages);
 #endif