
aio: Add memcg accounting of user used data

Message ID 151246799787.29619.13497559380263553588.stgit@localhost.localdomain (mailing list archive)
State New, archived

Commit Message

Kirill Tkhai Dec. 5, 2017, 10 a.m. UTC
Currently, the number of available aio requests can be limited only
globally. There are two sysctl variables, aio_max_nr and aio_nr, which
implement the limit and the request accounting. They help to avoid the
situation where all memory is eaten by in-flight requests that are
being written to a slow block device and cannot be reclaimed by a
shrinker.

This becomes a problem when many containers are run on
the same hardware node. Since aio_max_nr is a global
limit, a single container may occupy all the available
aio requests and deprive the others of the ability to
use aio at all. This may happen either through malicious
intent of a container's user or by accident, due to a
program error.

The patch addresses the problem by adding memcg accounting of the aio
data allocated on behalf of the user (the biggest consumer is the
bunch of aio_kiocb requests; the ring buffer is the second biggest).
A user within a given memcg thus cannot allocate more memory for aio
requests than the cgroup allows and will run into the memcg limit
instead.

This may be useful for LXC and for protecting critical microservices.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
---
 fs/aio.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

Comments

Michal Hocko Dec. 5, 2017, 3:15 p.m. UTC | #1
On Tue 05-12-17 13:00:54, Kirill Tkhai wrote:
> [...]

So what happens when we hit the hard limit and oom kill somebody?
Are those charged objects somehow bound to a process context?
Kirill Tkhai Dec. 5, 2017, 3:34 p.m. UTC | #2
On 05.12.2017 18:15, Michal Hocko wrote:
> On Tue 05-12-17 13:00:54, Kirill Tkhai wrote:
>> [...]
> 
> So what happens when we hit the hard limit and oom kill somebody?
> Are those charged objects somehow bound to a process context?

There is exit_aio(), called from __mmput(), which waits until the
charged objects complete and decrement the reference counter.

If there were a problem with oom in a memcg, there would be the same
problem on global oom; as can be seen, there is no __GFP_NOFAIL flag
anywhere in the aio code.

But it seems everything is safe.

Kirill
Michal Hocko Dec. 5, 2017, 3:43 p.m. UTC | #3
On Tue 05-12-17 18:34:59, Kirill Tkhai wrote:
> On 05.12.2017 18:15, Michal Hocko wrote:
> > On Tue 05-12-17 13:00:54, Kirill Tkhai wrote:
> >> [...]
> > 
> > So what happens when we hit the hard limit and oom kill somebody?
> > Are those charged objects somehow bound to a process context?
> 
> There is exit_aio(), called from __mmput(), which waits until the
> charged objects complete and decrement the reference counter.

OK, so it is bound to _a_ process context. The oom killer will not know
which process has consumed those objects, but the effect will at least
be confined to the memcg.

> If there were a problem with oom in a memcg, there would be the same
> problem on global oom; as can be seen, there is no __GFP_NOFAIL flag
> anywhere in the aio code.
> 
> But it seems everything is safe.

Could you share your testing scenario and describe how the system
behaved under heavy aio load?

I am not saying the patch is wrong, I am just trying to understand all
the consequences.
Kirill Tkhai Dec. 5, 2017, 4:02 p.m. UTC | #4
On 05.12.2017 18:43, Michal Hocko wrote:
> [...]
> 
> Could you share your testing scenario and describe how the system
> behaved under heavy aio load?
> 
> I am not saying the patch is wrong, I am just trying to understand all
> the consequences.

My test is a simple program which creates an aio context and then runs
an infinite io_submit() cycle. I've tested the cases where particular
stages fail: io_setup() hits oom, io_submit() hits oom, io_getevents()
hits oom. This was done simply by inserting a sleep() before the stage
in question and then moving the task into a cgroup with a low memory
limit. In most cases I got bash killed (I had moved it into the cgroup
too). I also ran several instances of the test in parallel. The core of
the test looks roughly like the sketch below.

If you want I can send you the source code, but I don't think it will be
easy to use it if you are not the author.

Kirill
Michal Hocko Dec. 6, 2017, 9:05 a.m. UTC | #5
On Tue 05-12-17 19:02:00, Kirill Tkhai wrote:
> [...]
> 
> My test is a simple program which creates an aio context and then runs
> an infinite io_submit() cycle. I've tested the cases where particular
> stages fail: io_setup() hits oom, io_submit() hits oom, io_getevents()
> hits oom. This was done simply by inserting a sleep() before the stage
> in question and then moving the task into a cgroup with a low memory
> limit. In most cases I got bash killed (I had moved it into the cgroup
> too). I also ran several instances of the test in parallel.
> 
> If you want I can send you the source code, but I don't think it will be
> easy to use it if you are not the author.

Well, not really. I was merely interested in the testing scenario,
mainly to see how the system behaved, because a memcg hitting its hard
limit will OOM kill something only if the failing charge comes from the
page fault path; all kernel allocations simply return ENOMEM. The fact
is that we do not track charged kernel memory per task, so a small task
constantly allocating kernel memory can put the whole cgroup down. As
I've said, this is something that _should_ be OK because the bad
behavior is isolated within the cgroup.

If that is the expected behavior for your usecase, then OK.
Kirill Tkhai Dec. 6, 2017, 10 a.m. UTC | #6
On 06.12.2017 12:05, Michal Hocko wrote:
> [...]
> 
> Well, not really. I was merely interested in the testing scenario,
> mainly to see how the system behaved, because a memcg hitting its hard
> limit will OOM kill something only if the failing charge comes from the
> page fault path; all kernel allocations simply return ENOMEM. The fact
> is that we do not track charged kernel memory per task, so a small task
> constantly allocating kernel memory can put the whole cgroup down. As
> I've said, this is something that _should_ be OK because the bad
> behavior is isolated within the cgroup.

It seems aio charging does not differ from pipe charging: in both cases
you have the same unreclaimable memory pinned in the kernel. Pipe
buffers are allocated with __GFP_ACCOUNT, and the oom behavior is the
same.

The only difference is that pipe size is additionally limited per-user.
In a next iteration we may do the same for aio.
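Just to illustrate the analogy, a userspace toy like the one below pins
memory in the kernel through a pipe in the same unreclaimable way an
in-flight aio request pins its aio_kiocb (an illustrative sketch only;
the cgroup file names in the comments assume the v1/v2 memory
controllers):

/* pipe-pin.c: fill a pipe that nobody reads; the buffered pages stay
 * pinned in the kernel and, being allocated with __GFP_ACCOUNT, are
 * charged to the writer's memcg -- the same class of memory as the
 * aio objects above. Build: gcc -O2 -o pipe-pin pipe-pin.c
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
	static char buf[65536];
	int fds[2];

	memset(buf, 0xab, sizeof(buf));
	if (pipe(fds) < 0)
		return 1;

	/* ask for a 1 MB pipe buffer (subject to
	 * /proc/sys/fs/pipe-max-size and the per-user limit
	 * mentioned above) */
	fcntl(fds[1], F_SETPIPE_SZ, 1 << 20);

	/* fill the pipe; the write blocks once it is full, and until
	 * then memory.kmem.usage_in_bytes (v1) / memory.current (v2)
	 * of the cgroup keeps growing */
	for (;;) {
		ssize_t n = write(fds[1], buf, sizeof(buf));
		if (n < 0) {
			perror("write");
			return 1;
		}
		printf("queued %zd bytes\n", n);
	}
}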
 
> If that is the expected behavior for your usecase, then OK.
Kirill
Michal Hocko Dec. 6, 2017, 12:23 p.m. UTC | #7
On Tue 05-12-17 13:00:54, Kirill Tkhai wrote:
[...]
> This becomes a problem when many containers are run on
> the same hardware node. Since aio_max_nr is a global
> limit, a single container may occupy all the available
> aio requests and deprive the others of the ability to
> use aio at all. This may happen either through malicious
> intent of a container's user or by accident, due to a
> program error.

I am sorry to keep beating on this, but I finally got around to
http://lkml.kernel.org/r/17b22d53-ad3d-1ba8-854f-fc2a43d86c44@virtuozzo.com
and read the above paragraph once again. I can see how accounting to a
memcg helps to reduce the memory footprint, but I fail to see how it
helps in the above scenario. Could you clarify how you set up a limit
to prevent anybody from DoSing other containers by depleting aio_max_nr?
Kirill Tkhai Dec. 6, 2017, 12:36 p.m. UTC | #8
On 06.12.2017 15:23, Michal Hocko wrote:
> On Tue 05-12-17 13:00:54, Kirill Tkhai wrote:
> [...]
>> This becomes a problem when many containers are run on
>> the same hardware node. Since aio_max_nr is a global
>> limit, a single container may occupy all the available
>> aio requests and deprive the others of the ability to
>> use aio at all. This may happen either through malicious
>> intent of a container's user or by accident, due to a
>> program error.
> 
> I am sorry to keep beating on this, but I finally got around to
> http://lkml.kernel.org/r/17b22d53-ad3d-1ba8-854f-fc2a43d86c44@virtuozzo.com
> and read the above paragraph once again. I can see how accounting to a
> memcg helps to reduce the memory footprint, but I fail to see how it
> helps in the above scenario. Could you clarify how you set up a limit
> to prevent anybody from DoSing other containers by depleting aio_max_nr?

The memcg limit allows aio_max_nr to be increased, and the accounting
guarantees a container can't exceed its own limit. You may configure the
memcg limits and aio_max_nr in such a way that the containers' requests
in sum never reach aio_max_nr, roughly along the lines of the sketch
below.
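Back-of-the-envelope, the sizing looks like this (illustrative numbers
only; the per-request footprint is an assumed value, and the memcg
limit of course also covers all the other kernel memory of the
container):

/* sizing.c: illustrative arithmetic only. The numbers below are made
 * up, and the real per-request footprint also includes the ring
 * buffer pages. Build: gcc -O2 -o sizing sizing.c
 */
#include <stdio.h>

int main(void)
{
	unsigned long containers = 100;         /* assumed container count */
	unsigned long kmem_limit = 64UL << 20;  /* assumed per-memcg limit, 64 MB */
	unsigned long per_req    = 192;         /* assumed bytes pinned per request */

	unsigned long per_container_cap = kmem_limit / per_req;

	printf("per-container in-flight cap: ~%lu requests\n",
	       per_container_cap);
	printf("with aio_max_nr >= %lu the containers' requests in sum never reach it\n",
	       containers * per_container_cap);
	return 0;
}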

Kirill
Michal Hocko Dec. 6, 2017, 12:43 p.m. UTC | #9
On Wed 06-12-17 15:36:56, Kirill Tkhai wrote:
> On 06.12.2017 15:23, Michal Hocko wrote:
> > On Tue 05-12-17 13:00:54, Kirill Tkhai wrote:
> > [...]
> >> This becomes a problem when many containers are run on
> >> the same hardware node. Since aio_max_nr is a global
> >> limit, a single container may occupy all the available
> >> aio requests and deprive the others of the ability to
> >> use aio at all. This may happen either through malicious
> >> intent of a container's user or by accident, due to a
> >> program error.
> > 
> > I am sorry to keep beating on this, but I finally got around to
> > http://lkml.kernel.org/r/17b22d53-ad3d-1ba8-854f-fc2a43d86c44@virtuozzo.com
> > and read the above paragraph once again. I can see how accounting to a
> > memcg helps to reduce the memory footprint, but I fail to see how it
> > helps in the above scenario. Could you clarify how you set up a limit
> > to prevent anybody from DoSing other containers by depleting aio_max_nr?
> The memcg limit allows aio_max_nr to be increased, and the accounting
> guarantees a container can't exceed its own limit. You may configure the
> memcg limits and aio_max_nr in such a way that the containers' requests
> in sum never reach aio_max_nr.

So you are essentially saying that you make aio_max_nr unlimited and
rely on memcg's memory consumption tracking instead, right? Are there
any downsides (e.g., clogging the AIO subsystem)?

Please make sure all of this is in the changelog.
Kirill Tkhai Dec. 6, 2017, 1:13 p.m. UTC | #10
On 06.12.2017 15:43, Michal Hocko wrote:
> On Wed 06-12-17 15:36:56, Kirill Tkhai wrote:
>> On 06.12.2017 15:23, Michal Hocko wrote:
>>> On Tue 05-12-17 13:00:54, Kirill Tkhai wrote:
>>> [...]
>>>> This becomes a problem when many containers are run on
>>>> the same hardware node. Since aio_max_nr is a global
>>>> limit, a single container may occupy all the available
>>>> aio requests and deprive the others of the ability to
>>>> use aio at all. This may happen either through malicious
>>>> intent of a container's user or by accident, due to a
>>>> program error.
>>>
>>> I am sorry to keep beating on this, but I finally got around to
>>> http://lkml.kernel.org/r/17b22d53-ad3d-1ba8-854f-fc2a43d86c44@virtuozzo.com
>>> and read the above paragraph once again. I can see how accounting to a
>>> memcg helps to reduce the memory footprint, but I fail to see how it
>>> helps in the above scenario. Could you clarify how you set up a limit
>>> to prevent anybody from DoSing other containers by depleting aio_max_nr?
>> The memcg limit allows aio_max_nr to be increased, and the accounting
>> guarantees a container can't exceed its own limit. You may configure the
>> memcg limits and aio_max_nr in such a way that the containers' requests
>> in sum never reach aio_max_nr.
> 
> So you are essentially saying that you make aio_max_nr unlimited and
> rely on memcg's memory consumption tracking instead, right?

Yes, and this is just the way other similarly allocated memory, such as
pipes, dcache or page cache, is accounted in the kernel.

If there are worries about the cost of accounting every submitted
request, it should not be a problem to charge only in page-sized
batches (i.e., once per PAGE_SIZE/sizeof(struct aio_kiocb) requests),
roughly as in the sketch below. But I'm not sure primitives like
memcg_kmem_charge() are allowed outside the mm subsystem.
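Very roughly, something like this (a hand-waving sketch, not against
any particular tree; charge_memcg_page()/uncharge_memcg_page() and the
nr_charged_reqs counter are placeholders, not existing kernel
interfaces):

/* Sketch only: charge the memcg once per page worth of aio_kiocb
 * objects instead of once per object. The helpers below are
 * placeholders for whatever real memcg primitive would be used.
 */
#define KIOCBS_PER_PAGE	(PAGE_SIZE / sizeof(struct aio_kiocb))

static int aio_charge_req(struct kioctx *ctx)
{
	/* nr_charged_reqs would be a new counter serialized by ctx->ctx_lock */
	if (ctx->nr_charged_reqs % KIOCBS_PER_PAGE == 0) {
		if (charge_memcg_page(GFP_KERNEL))	/* placeholder */
			return -ENOMEM;
	}
	ctx->nr_charged_reqs++;
	return 0;
}

static void aio_uncharge_req(struct kioctx *ctx)
{
	if (--ctx->nr_charged_reqs % KIOCBS_PER_PAGE == 0)
		uncharge_memcg_page();			/* placeholder */
}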

> Are there any downsides (e.g., clogging the AIO subsystem)?

Any new functionality affects the subsystem where it's used :) if only
because of the extra processor instructions, though not only that. AIO
is nothing special in this regard. But people who are not interested in
the accounting may simply turn it off, and the performance will remain
the same.

If there is a specific benchmark you would like me to run, please point
me at it and I'll publish the results.
 
> Please make sure all of this is in the changelog.

OK

Patch

diff --git a/fs/aio.c b/fs/aio.c
index e6de7715228c..1431d0867a7e 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -481,7 +481,7 @@  static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)
 	ctx->ring_pages = ctx->internal_pages;
 	if (nr_pages > AIO_RING_PAGES) {
 		ctx->ring_pages = kcalloc(nr_pages, sizeof(struct page *),
-					  GFP_KERNEL);
+					  GFP_KERNEL_ACCOUNT);
 		if (!ctx->ring_pages) {
 			put_aio_ring_file(ctx);
 			return -ENOMEM;
@@ -490,8 +490,8 @@  static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)
 
 	for (i = 0; i < nr_pages; i++) {
 		struct page *page;
-		page = find_or_create_page(file->f_mapping,
-					   i, GFP_HIGHUSER | __GFP_ZERO);
+		page = find_or_create_page(file->f_mapping, i,
+				GFP_HIGHUSER | __GFP_ZERO | __GFP_ACCOUNT);
 		if (!page)
 			break;
 		pr_debug("pid(%d) page[%d]->count=%d\n",
@@ -670,7 +670,7 @@  static int ioctx_add_table(struct kioctx *ctx, struct mm_struct *mm)
 		spin_unlock(&mm->ioctx_lock);
 
 		table = kzalloc(sizeof(*table) + sizeof(struct kioctx *) *
-				new_nr, GFP_KERNEL);
+				new_nr, GFP_KERNEL_ACCOUNT);
 		if (!table)
 			return -ENOMEM;
 
@@ -740,7 +740,7 @@  static struct kioctx *ioctx_alloc(unsigned nr_events)
 	if (!nr_events || (unsigned long)max_reqs > aio_max_nr)
 		return ERR_PTR(-EAGAIN);
 
-	ctx = kmem_cache_zalloc(kioctx_cachep, GFP_KERNEL);
+	ctx = kmem_cache_zalloc(kioctx_cachep, GFP_KERNEL_ACCOUNT);
 	if (!ctx)
 		return ERR_PTR(-ENOMEM);
 
@@ -1030,7 +1030,7 @@  static inline struct aio_kiocb *aio_get_req(struct kioctx *ctx)
 			return NULL;
 	}
 
-	req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO);
+	req = kmem_cache_zalloc(kiocb_cachep, GFP_KERNEL_ACCOUNT);
 	if (unlikely(!req))
 		goto out_put;