
[RFC,v2,0/5] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

Message ID 20190808231340.53601-1-almasrymina@google.com (mailing list archive)

Message

Mina Almasry Aug. 8, 2019, 11:13 p.m. UTC
Problem:
Currently, tasks attempting to allocate more hugetlb memory than is available get
a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1].
However, if a task attempts to allocate more hugetlb memory than its
hugetlb_cgroup limit allows (but within what is globally available), the kernel
will allow the mmap/shmget call, but will SIGBUS the task when it attempts to
fault the memory in.

We have developers interested in using hugetlb_cgroups, and they have expressed
dissatisfaction with this behavior. We'd like to improve it such that tasks
violating the hugetlb_cgroup limits get an error at mmap/shmget time, rather
than getting SIGBUS'd when they try to fault the excess memory in.

The underlying problem is that today's hugetlb_cgroup accounting happens
at hugetlb memory *fault* time, rather than at *reservation* time.
Thus, enforcing the hugetlb_cgroup limit only happens at fault time, and
the offending task gets SIGBUS'd.
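
To make the failure mode concrete, here is a minimal sketch (not part of the
series; the hugetlbfs mount point /mnt/huge and the sizes are assumptions) of a
task whose mmap succeeds but which is SIGBUS'd at fault time once it exceeds
its hugetlb_cgroup fault limit:

/* Minimal sketch: the mmap is allowed, the fault is what trips the cgroup
 * limit today. Assumes a hugetlbfs mount at /mnt/huge and that the calling
 * task sits in a hugetlb_cgroup whose fault limit is smaller than 'len'.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 4UL * 1024 * 1024 * 1024;	/* over the cgroup limit */
	int fd = open("/mnt/huge/file", O_CREAT | O_RDWR, 0644);
	char *p;

	if (fd < 0)
		return 1;
	/* Allowed today: hugetlb_cgroup is not consulted at mmap time. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;
	printf("mmap succeeded\n");
	memset(p, 0, len);	/* faults pages in; SIGBUS once over the limit */
	return 0;
}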

Proposed Solution:
A new page counter named hugetlb.xMB.reservation_[limit|usage]_in_bytes. This
counter has slightly different semantics than
hugetlb.xMB.[limit|usage]_in_bytes:

- While usage_in_bytes tracks all *faulted* hugetlb memory,
reservation_usage_in_bytes tracks all *reserved* hugetlb memory.

- If a task attempts to reserve more memory than limit_in_bytes allows,
the kernel will allow it to do so. But if a task attempts to reserve
more memory than reservation_limit_in_bytes allows, the kernel will fail
this reservation.

This proposal is implemented in this patch series, with tests to verify
functionality and show the usage. An illustration of the intended usage
follows below.
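
As an illustration of the intended usage (a sketch only: the cgroup-v1 mount
path, the 2MB page size and the helper name below are my assumptions; the file
name follows the interface proposed above), a controlling process could cap a
cgroup's reservations so that an over-limit mmap fails up front rather than
SIGBUS'ing later:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Sketch: write a reservation limit for a cgroup, assuming the cgroup-v1
 * hugetlb controller is mounted at /sys/fs/cgroup/hugetlb and 2MB pages are
 * in use. The file name follows the proposed
 * hugetlb.<size>.reservation_limit_in_bytes interface.
 */
static int set_reservation_limit(const char *cgroup, const char *bytes)
{
	char path[256];
	int fd, ret;

	snprintf(path, sizeof(path),
		 "/sys/fs/cgroup/hugetlb/%s/hugetlb.2MB.reservation_limit_in_bytes",
		 cgroup);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	ret = write(fd, bytes, strlen(bytes)) < 0 ? -1 : 0;
	close(fd);
	return ret;
}

For example, set_reservation_limit("task_a", "20971520") would cap the cgroup
at 10 reservations of 2MB pages; an mmap that pushes
reservation_usage_in_bytes past that limit then fails at mmap time.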

Alternatives considered:
1. A new cgroup, instead of only a new page_counter attached to
   the existing hugetlb_cgroup. Adding a new cgroup seemed like a lot of code
   duplication with hugetlb_cgroup. Keeping hugetlb related page counters under
   hugetlb_cgroup seemed cleaner as well.

2. Instead of adding a new counter, we considered adding a sysctl that modifies
   the behavior of hugetlb.xMB.[limit|usage]_in_bytes, to do accounting at
   reservation time rather than fault time. Adding a new page_counter seems
   better as userspace could, if it wants, choose to enforce different cgroups
   differently: one via limit_in_bytes, and another via
   reservation_limit_in_bytes. This could be very useful if you're
   transitioning how hugetlb memory is partitioned on your system one
   cgroup at a time, for example. Also, someone may find a use for both
   limit_in_bytes and reservation_limit_in_bytes concurrently, and this
   approach gives them the option to do so.

Caveats:
1. This support is implemented for cgroups-v1. I have not tried
   hugetlb_cgroups with cgroups-v2, and AFAICT it's not supported yet.
   This is largely because we use cgroups-v1 for now. If required, I
   can add hugetlb_cgroup support to cgroups-v2 in this series or in a
   follow-up.
2. The most complicated bit of this patch, I believe, is deciding where to
   store the pointer to the hugetlb_cgroup to uncharge at unreservation time.
   Normally the cgroup pointers hang off the struct page. But, with
   hugetlb_cgroup reservations, one task can reserve a specific page and another
   task may fault it in (I believe), so storing the pointer in struct
   page is not appropriate. The proposed approach here is to store the pointer
   in the resv_map; a rough sketch of the idea follows below. See the patches
   for details.
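
A rough sketch of the idea (hypothetical field name; the actual layout is in
the patches) is that the structure describing each reserved range also
remembers which cgroup to uncharge:

/* Sketch only: illustrates hanging the cgroup to uncharge off the
 * reservation tracking structures instead of off struct page.
 */
struct file_region {
	struct list_head link;
	long from;
	long to;
#ifdef CONFIG_CGROUP_HUGETLB
	/* cgroup charged for this range; uncharged when the range is region_del'd */
	struct hugetlb_cgroup *reservation_counter;
#endif
};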

Signed-off-by: Mina Almasry <almasrymina@google.com>

[1]: https://www.kernel.org/doc/html/latest/vm/hugetlbfs_reserv.html

Changes in v2:
- Split the patch into a 5 patch series.
- Fixed patch subject.

Mina Almasry (5):
  hugetlb_cgroup: Add hugetlb_cgroup reservation counter
  hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations
  hugetlb_cgroup: add reservation accounting for private mappings
  hugetlb_cgroup: add accounting for shared mappings
  hugetlb_cgroup: Add hugetlb_cgroup reservation tests

 include/linux/hugetlb.h                       |  10 +-
 include/linux/hugetlb_cgroup.h                |  19 +-
 mm/hugetlb.c                                  | 256 ++++++++--
 mm/hugetlb_cgroup.c                           | 153 +++++-
 tools/testing/selftests/vm/.gitignore         |   1 +
 tools/testing/selftests/vm/Makefile           |   4 +
 .../selftests/vm/charge_reserved_hugetlb.sh   | 438 ++++++++++++++++++
 .../selftests/vm/write_hugetlb_memory.sh      |  22 +
 .../testing/selftests/vm/write_to_hugetlbfs.c | 252 ++++++++++
 9 files changed, 1087 insertions(+), 68 deletions(-)
 create mode 100755 tools/testing/selftests/vm/charge_reserved_hugetlb.sh
 create mode 100644 tools/testing/selftests/vm/write_hugetlb_memory.sh
 create mode 100644 tools/testing/selftests/vm/write_to_hugetlbfs.c

--
2.23.0.rc1.153.gdeed80330f-goog

Comments

Mike Kravetz Aug. 9, 2019, 5:54 p.m. UTC | #1
(+CC  Michal Koutný, cgroups@vger.kernel.org, Aneesh Kumar)

On 8/8/19 4:13 PM, Mina Almasry wrote:
> Problem:
> Currently tasks attempting to allocate more hugetlb memory than is available get
> a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1].
> However, if a task attempts to allocate hugetlb memory only more than its
> hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call,
> but will SIGBUS the task when it attempts to fault the memory in.
> 
> We have developers interested in using hugetlb_cgroups, and they have expressed
> dissatisfaction regarding this behavior. We'd like to improve this
> behavior such that tasks violating the hugetlb_cgroup limits get an error on
> mmap/shmget time, rather than getting SIGBUS'd when they try to fault
> the excess memory in.
> 
> The underlying problem is that today's hugetlb_cgroup accounting happens
> at hugetlb memory *fault* time, rather than at *reservation* time.
> Thus, enforcing the hugetlb_cgroup limit only happens at fault time, and
> the offending task gets SIGBUS'd.
> 
> Proposed Solution:
> A new page counter named hugetlb.xMB.reservation_[limit|usage]_in_bytes. This
> counter has slightly different semantics than
> hugetlb.xMB.[limit|usage]_in_bytes:
> 
> - While usage_in_bytes tracks all *faulted* hugetlb memory,
> reservation_usage_in_bytes tracks all *reserved* hugetlb memory.
> 
> - If a task attempts to reserve more memory than limit_in_bytes allows,
> the kernel will allow it to do so. But if a task attempts to reserve
> more memory than reservation_limit_in_bytes, the kernel will fail this
> reservation.
> 
> This proposal is implemented in this patch, with tests to verify
> functionality and show the usage.

Thanks for taking on this effort Mina.

Before looking at the details of the code, it might be helpful to discuss
the expected semantics of the proposed reservation limits.

I see you took into account the differences between private and shared
mappings.  This is good, as the reservation behavior is different for each
of these cases.  First let's look at private mappings.

For private mappings, the reservation usage will be the size of the mapping.
This should be fairly simple.  As reservations are consumed in the hugetlbfs
code, reservations in the resv_map are removed.  I see you have a hook into
region_del.  So, the expectation is that as reservations are consumed the
reservation usage will drop for the cgroup.  Correct?
The only tricky thing about private mappings is COW because of fork.  Current
reservation semantics specify that all reservations stay with the parent.
If child faults and can not get page, SIGBUS.  I assume the new reservation
limits will work the same.

I believe tracking reservations for shared mappings can get quite complicated.
The hugetlbfs reservation code around shared mappings 'works' on the basis
that shared mapping reservations are global.  As a result, reservations are
more associated with the inode than with the task making the reservation.
For example, consider a file of size 4 hugetlb pages.
Task A maps the first 2 pages, and 2 reservations are taken.  Task B maps
all 4 pages, and 2 additional reservations are taken.  I am not really sure
of the desired semantics here for reservation limits if A and B are in separate
cgroups.  Should B be charged for 4 or 2 reservations?
Also in the example above, after both tasks create their mappings suppose
Task B faults in the first page.  Does the reservation usage of Task A go
down as it originally had the reservation?

It should also be noted that when hugetlbfs reservations are 'consumed' for
shared mappings there are no changes to the resv_map.  Rather the unmap code
compares the contents of the page cache to the resv_map to determine how
many reservations were actually consumed.  I did not look closely enough to
determine whether the code drops reservation usage counts as pages are added to
shared mappings.
Mina Almasry Aug. 9, 2019, 7:42 p.m. UTC | #2
On Fri, Aug 9, 2019 at 10:54 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> (+CC  Michal Koutný, cgroups@vger.kernel.org, Aneesh Kumar)
>
> On 8/8/19 4:13 PM, Mina Almasry wrote:
> > Problem:
> > Currently tasks attempting to allocate more hugetlb memory than is available get
> > a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1].
> > However, if a task attempts to allocate hugetlb memory only more than its
> > hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call,
> > but will SIGBUS the task when it attempts to fault the memory in.
> >
> > We have developers interested in using hugetlb_cgroups, and they have expressed
> > dissatisfaction regarding this behavior. We'd like to improve this
> > behavior such that tasks violating the hugetlb_cgroup limits get an error on
> > mmap/shmget time, rather than getting SIGBUS'd when they try to fault
> > the excess memory in.
> >
> > The underlying problem is that today's hugetlb_cgroup accounting happens
> > at hugetlb memory *fault* time, rather than at *reservation* time.
> > Thus, enforcing the hugetlb_cgroup limit only happens at fault time, and
> > the offending task gets SIGBUS'd.
> >
> > Proposed Solution:
> > A new page counter named hugetlb.xMB.reservation_[limit|usage]_in_bytes. This
> > counter has slightly different semantics than
> > hugetlb.xMB.[limit|usage]_in_bytes:
> >
> > - While usage_in_bytes tracks all *faulted* hugetlb memory,
> > reservation_usage_in_bytes tracks all *reserved* hugetlb memory.
> >
> > - If a task attempts to reserve more memory than limit_in_bytes allows,
> > the kernel will allow it to do so. But if a task attempts to reserve
> > more memory than reservation_limit_in_bytes, the kernel will fail this
> > reservation.
> >
> > This proposal is implemented in this patch, with tests to verify
> > functionality and show the usage.
>
> Thanks for taking on this effort Mina.
>
No problem! Thanks for reviewing!

> Before looking at the details of the code, it might be helpful to discuss
> the expected semantics of the proposed reservation limits.
>
> I see you took into account the differences between private and shared
> mappings.  This is good, as the reservation behavior is different for each
> of these cases.  First let's look at private mappings.
>
> For private mappings, the reservation usage will be the size of the mapping.
> This should be fairly simple.  As reservations are consumed in the hugetlbfs
> code, reservations in the resv_map are removed.  I see you have a hook into
> region_del.  So, the expectation is that as reservations are consumed the
> reservation usage will drop for the cgroup.  Correct?

I assume by 'reservations are consumed' you mean when a reservation
goes from just 'reserved' to actually in use (as in the task is
writing to the hugetlb page or something). If so, then the answer is
no, that is not correct. When reservations are consumed, the
reservation usage stays the same. I.e. the reservation usage tracks
hugetlb memory (reserved + used), you could say. This is 100% the
intention, as we want to know at mmap time if there is enough 'free'
(that is, unreserved and unused) memory left over in the cgroup to
satisfy the mmap call.
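
In other words, the charge is taken when the reservation is created, not when
the page is faulted in. A minimal sketch of what that check could look like
(the helper name is hypothetical; page_counter_try_charge() and the
reserved_hugepage[] counters from patch 1/5 are what it would lean on):

/* Sketch: charge the cgroup's reservation counter for nr_pages at
 * reservation (i.e. mmap) time; a failure here propagates back so the mmap
 * fails instead of the task being SIGBUS'd later.
 */
static int hugetlb_cgroup_charge_reservation(int idx, unsigned long nr_pages,
					     struct hugetlb_cgroup *h_cg)
{
	struct page_counter *counter = &h_cg->reserved_hugepage[idx];
	struct page_counter *fail;

	if (!page_counter_try_charge(counter, nr_pages, &fail))
		return -ENOMEM;	/* over reservation_limit_in_bytes */
	return 0;
}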

The hooks in region_add and region_del are to account shared mappings
only. There is a check in those code blocks that ensures the code
is only engaged for shared mappings. The commit messages of patches 3/5
and 4/5 go into more details regarding this.

> The only tricky thing about private mappings is COW because of fork.  Current
> reservation semantics specify that all reservations stay with the parent.
> If child faults and can not get page, SIGBUS.  I assume the new reservation
> limits will work the same.
>

Although I did not explicitly try it, yes. It should work the same.
The additional reservation due to the COW will get charged to whatever
cgroup the fork is in. If the task can't get a page it gets SIGBUS'd.
If there is not enough room to charge the cgroup it's in, then the
charge will fail, which I assume will trigger the error path that also
leads to SIGBUS.

> I believe tracking reservations for shared mappings can get quite complicated.
> The hugetlbfs reservation code around shared mappings 'works' on the basis
> that shared mapping reservations are global.  As a result, reservations are
> more associated with the inode than with the task making the reservation.

FWIW, I found it not too bad. And my tests at least don't detect an
anomaly around shared mappings. The key, I think, is that I'm tracking
the cgroup to uncharge on each file_region entry inside the resv_map, so we
know exactly who allocated each file_region entry and we can uncharge
them when the entry is region_del'd (rough sketch below).
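
A rough sketch of that uncharge path (illustrative names only;
reservation_counter stands for the hypothetical cgroup pointer stored on the
file_region entry):

/* Sketch: when region_del() drops a file_region, uncharge the pages it
 * covered from the cgroup recorded on that entry.
 */
static void uncharge_file_region(struct hstate *h, struct file_region *rg)
{
	unsigned long nr_pages = rg->to - rg->from;

	if (rg->reservation_counter) {
		struct page_counter *counter =
			&rg->reservation_counter->reserved_hugepage[hstate_index(h)];

		page_counter_uncharge(counter, nr_pages);
	}
}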

> For example, consider a file of size 4 hugetlb pages.
> Task A maps the first 2 pages, and 2 reservations are taken.  Task B maps
> all 4 pages, and 2 additional reservations are taken.  I am not really sure
> of the desired semantics here for reservation limits if A and B are in separate
> cgroups.  Should B be charged for 4 or 2 reservations?

Task A's cgroup is charged 2 pages to its reservation usage.
Task B's cgroup is charged 2 pages to its reservation usage.

This is analogous to how shared memory accounting is done for things
like tmpfs, and I see no strong reason right now to deviate. I.e. the
task that made the reservation is charged with it, and others use it
without getting charged.

> Also in the example above, after both tasks create their mappings suppose
> Task B faults in the first page.  Does the reservation usage of Task A go
> down as it originally had the reservation?
>

Reservation usage never goes down when pages are consumed. Yes, I
would have this problem if I was planning to decrement reservation
usage when pages are put into use, but, the goal is to find out if
there is 'free' memory (unreserved + unused) in the cgroup at mmap
time, so we want a counter that tracks (reserved + used).

> It should also be noted that when hugetlbfs reservations are 'consumed' for
> shared mappings there are no changes to the resv_map.  Rather the unmap code
> compares the contents of the page cache to the resv_map to determine how
> many reservations were actually consumed.  I did not look close enough to
> determine the code drops reservation usage counts as pages are added to shared
> mappings.

I think this concern also goes away if reservation usage doesn't go
down when pages are consumed, but let me know if you still have
doubts.

Thanks for taking a look so far!
>
> --
> Mike Kravetz
Mike Kravetz Aug. 10, 2019, 6:58 p.m. UTC | #3
On 8/9/19 12:42 PM, Mina Almasry wrote:
> On Fri, Aug 9, 2019 at 10:54 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>> On 8/8/19 4:13 PM, Mina Almasry wrote:
>>> Problem:
>>> Currently tasks attempting to allocate more hugetlb memory than is available get
>>> a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1].
>>> However, if a task attempts to allocate hugetlb memory only more than its
>>> hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call,
>>> but will SIGBUS the task when it attempts to fault the memory in.
<snip>
>> I believe tracking reservations for shared mappings can get quite complicated.
>> The hugetlbfs reservation code around shared mappings 'works' on the basis
>> that shared mapping reservations are global.  As a result, reservations are
>> more associated with the inode than with the task making the reservation.
> 
> FWIW, I found it not too bad. And my tests at least don't detect an
> anomaly around shared mappings. The key I think is that I'm tracking
> cgroup to uncharge on the file_region entry inside the resv_map, so we
> know who allocated each file_region entry exactly and we can uncharge
> them when the entry is region_del'd.
> 
>> For example, consider a file of size 4 hugetlb pages.
>> Task A maps the first 2 pages, and 2 reservations are taken.  Task B maps
>> all 4 pages, and 2 additional reservations are taken.  I am not really sure
>> of the desired semantics here for reservation limits if A and B are in separate
>> cgroups.  Should B be charged for 4 or 2 reservations?
> 
> Task A's cgroup is charged 2 pages to its reservation usage.
> Task B's cgroup is charged 2 pages to its reservation usage.

OK,
Suppose Task B's cgroup allowed 2 huge pages reservation and 2 huge pages
allocation.  The mmap would succeed, but Task B could potentially need to
allocate more than 2 huge pages.  So, when faulting in more than 2 huge
pages B would get a SIGBUS.  Correct?  Or, am I missing something?

Perhaps reservation charge should always be the same as map size/maximum
allocation size?
Mina Almasry Aug. 10, 2019, 10:01 p.m. UTC | #4
On Sat, Aug 10, 2019 at 11:58 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 8/9/19 12:42 PM, Mina Almasry wrote:
> > On Fri, Aug 9, 2019 at 10:54 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >> On 8/8/19 4:13 PM, Mina Almasry wrote:
> >>> Problem:
> >>> Currently tasks attempting to allocate more hugetlb memory than is available get
> >>> a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1].
> >>> However, if a task attempts to allocate hugetlb memory only more than its
> >>> hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call,
> >>> but will SIGBUS the task when it attempts to fault the memory in.
> <snip>
> >> I believe tracking reservations for shared mappings can get quite complicated.
> >> The hugetlbfs reservation code around shared mappings 'works' on the basis
> >> that shared mapping reservations are global.  As a result, reservations are
> >> more associated with the inode than with the task making the reservation.
> >
> > FWIW, I found it not too bad. And my tests at least don't detect an
> > anomaly around shared mappings. The key I think is that I'm tracking
> > cgroup to uncharge on the file_region entry inside the resv_map, so we
> > know who allocated each file_region entry exactly and we can uncharge
> > them when the entry is region_del'd.
> >
> >> For example, consider a file of size 4 hugetlb pages.
> >> Task A maps the first 2 pages, and 2 reservations are taken.  Task B maps
> >> all 4 pages, and 2 additional reservations are taken.  I am not really sure
> >> of the desired semantics here for reservation limits if A and B are in separate
> >> cgroups.  Should B be charged for 4 or 2 reservations?
> >
> > Task A's cgroup is charged 2 pages to its reservation usage.
> > Task B's cgroup is charged 2 pages to its reservation usage.
>
> OK,
> Suppose Task B's cgroup allowed 2 huge pages reservation and 2 huge pages
> allocation.  The mmap would succeed, but Task B could potentially need to
> allocate more than 2 huge pages.  So, when faulting in more than 2 huge
> pages B would get a SIGBUS.  Correct?  Or, am I missing something?
>
> Perhaps reservation charge should always be the same as map size/maximum
> allocation size?

I'm thinking this would work similar to how other shared memory like
tmpfs is accounted for right now. I.e. if a task conducts an operation
that causes memory to be allocated then that task is charged for that
memory, and if another task uses memory that has already been
allocated and charged by another task, then it can use the memory
without being charged.

So in the case of hugetlb memory, if a task mmaps memory that causes
a new reservation to be made, and new entries to be created in the
resv_map for the shared mapping, then that task gets charged. If the
task mmaps memory that is already reserved or faulted, then it
reserves or faults it without getting charged (sketch below).
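
A sketch of that distinction in the shared-mapping path (illustrative only;
region_chg()/region_add() are the existing resv_map primitives, and
hugetlb_cgroup_charge_reservation() is the hypothetical charge helper
sketched earlier in this thread):

/* Sketch: only pages that would create new file_region entries in the
 * resv_map get charged. region_chg() already reports how many pages in
 * [from, to) are not yet reserved, so a task mapping an already-reserved
 * range is not charged again.
 */
static long charge_shared_reservation(struct hstate *h, struct resv_map *resv,
				      struct hugetlb_cgroup *h_cg,
				      long from, long to)
{
	long chg, ret;

	chg = region_chg(resv, from, to);	/* newly reserved pages only */
	if (chg < 0)
		return chg;
	ret = hugetlb_cgroup_charge_reservation(hstate_index(h), chg, h_cg);
	if (ret)
		return ret;	/* mmap fails here, instead of SIGBUS later */
	return region_add(resv, from, to);
}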

In the example above, in chronological order:
- Task A mmaps 2 hugetlb pages, gets charged 2 hugetlb reservations.
- Task B mmaps 4 hugetlb pages, gets charged only 2 hugetlb
reservations because the first 2 are charged already and can be used
without incurring a charge.
- Task B accesses 4 hugetlb pages, gets charged *4* hugetlb faults,
since none of the 4 pages are faulted in yet. If the task is only
allowed 2 hugetlb page faults then it will actually get a SIGBUS.
- Task A accesses 4 hugetlb pages, gets charged no faults, since all
the hugetlb faults are already charged to Task B.

So, yes, I can see a scenario where userspace still gets SIGBUS'd, but
I think that's fine because:
1. Notice that the SIGBUS is due to the faulting limit, and not the
reservation limit, so we're not regressing the status quo per se.
Folks using the fault limit today understand the SIGBUS risk.
2. The way I expect folks to use this is to use 'reservation limits'
to partition the available hugetlb memory on the machine and to
forgo using the existing fault limits. Using both at the same time, I
think, would be a superuser feature for folks that really know what
they are doing, and understand the risk of SIGBUS that comes with
using the existing fault limits.
3. I expect userspace to in general handle this correctly because
there are similar challenges with all shared memory and accounting of
it, even in tmpfs, I think.

I would not like to charge the full reservation to every process that
does the mmap. Consider this much more common scenario: Tasks A and B
are supposed to collaborate on 10 hugetlb pages of data. Task B
should not access any hugetlb memory other than the memory it is
working on with Task A, so:

1. Task A is put in a cgroup with 10 hugetlb pages reservation limit.
2. Task B is put in a cgroup with 0 hugetlb pages of reservation limit.
3. Task A mmaps 10 pages of hugetlb memory, and notifies Task
B that it is done.
4. Task B, due to programmer error, tries to mmap hugetlb memory
beyond what Task A set up for it, and gets denied at mmap time by the
cgroup reservation limit.
5. Task B mmaps the same 10 hugetlb pages of memory and starts working
on them. The mmap succeeds because Task B is not charged anything.

If we were charging the full reservation to both Tasks A and B, then
both A and B would have to be in cgroups that allow 10 pages of hugetlb
reservations, the mmap in step 4 would succeed, and we would
accidentally overcommit the amount of hugetlb memory available.

> --
> Mike Kravetz
Mike Kravetz Aug. 13, 2019, 11:40 p.m. UTC | #5
On 8/10/19 3:01 PM, Mina Almasry wrote:
> On Sat, Aug 10, 2019 at 11:58 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>
>> On 8/9/19 12:42 PM, Mina Almasry wrote:
>>> On Fri, Aug 9, 2019 at 10:54 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>>> On 8/8/19 4:13 PM, Mina Almasry wrote:
>>>>> Problem:
>>>>> Currently tasks attempting to allocate more hugetlb memory than is available get
>>>>> a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1].
>>>>> However, if a task attempts to allocate hugetlb memory only more than its
>>>>> hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call,
>>>>> but will SIGBUS the task when it attempts to fault the memory in.
>> <snip>
>>>> I believe tracking reservations for shared mappings can get quite complicated.
>>>> The hugetlbfs reservation code around shared mappings 'works' on the basis
>>>> that shared mapping reservations are global.  As a result, reservations are
>>>> more associated with the inode than with the task making the reservation.
>>>
>>> FWIW, I found it not too bad. And my tests at least don't detect an
>>> anomaly around shared mappings. The key I think is that I'm tracking
>>> cgroup to uncharge on the file_region entry inside the resv_map, so we
>>> know who allocated each file_region entry exactly and we can uncharge
>>> them when the entry is region_del'd.
>>>
>>>> For example, consider a file of size 4 hugetlb pages.
>>>> Task A maps the first 2 pages, and 2 reservations are taken.  Task B maps
>>>> all 4 pages, and 2 additional reservations are taken.  I am not really sure
>>>> of the desired semantics here for reservation limits if A and B are in separate
>>>> cgroups.  Should B be charged for 4 or 2 reservations?
>>>
>>> Task A's cgroup is charged 2 pages to its reservation usage.
>>> Task B's cgroup is charged 2 pages to its reservation usage.
>>
>> OK,
>> Suppose Task B's cgroup allowed 2 huge pages reservation and 2 huge pages
>> allocation.  The mmap would succeed, but Task B could potentially need to
>> allocate more than 2 huge pages.  So, when faulting in more than 2 huge
>> pages B would get a SIGBUS.  Correct?  Or, am I missing something?
>>
>> Perhaps reservation charge should always be the same as map size/maximum
>> allocation size?
> 
> I'm thinking this would work similar to how other shared memory like
> tmpfs is accounted for right now. I.e. if a task conducts an operation
> that causes memory to be allocated then that task is charged for that
> memory, and if another task uses memory that has already been
> allocated and charged by another task, then it can use the memory
> without being charged.
> 
> So in case of hugetlb memory, if a task is mmaping memory that causes
> a new reservation to be made, and new entries to be created in the
> resv_map for the shared mapping, then that task gets charged. If the
> task is mmaping memory that is already reserved or faulted, then it
> reserves or faults it without getting charged.
> 
> In the example above, in chronological order:
> - Task A mmaps 2 hugetlb pages, gets charged 2 hugetlb reservations.
> - Task B mmaps 4 hugetlb pages, gets charged only 2 hugetlb
> reservations because the first 2 are charged already and can be used
> without incurring a charge.
> - Task B accesses 4 hugetlb pages, gets charged *4* hugetlb faults,
> since none of the 4 pages are faulted in yet. If the task is only
> allowed 2 hugetlb page faults then it will actually get a SIGBUS.
> - Task A accesses 4 hugetlb pages, gets charged no faults, since all
> the hugetlb faults is charged to Task B.
> 
> So, yes, I can see a scenario where userspace still gets SIGBUS'd, but
> I think that's fine because:
> 1. Notice that the SIGBUS is due to the faulting limit, and not the
> reservation limit, so we're not regressing the status quo per say.
> Folks using the fault limit today understand the SIGBUS risk.
> 2. the way I expect folks to use this is to use 'reservation limits'
> to partition the available hugetlb memory on the machine using it and
> forgo using the existing fault limits. Using both at the same time I
> think would be a superuser feature for folks that really know what
> they are doing, and understand the risk of SIGBUS that comes with
> using the existing fault limits.
> 3. I expect userspace to in general handle this correctly because
> there are similar challenges with all shared memory and accounting of
> it, even in tmpfs, I think.

Ok, that helps explain your use case.  I agree that it would be difficult
to use both fault and reservation limits together.  Especially in the case
of shared mappings.
Mina Almasry Aug. 15, 2019, 11:21 p.m. UTC | #6
On Wed, Aug 14, 2019 at 8:54 PM Hillf Danton <hdanton@sina.com> wrote:
>
>
> On Thu,  8 Aug 2019 16:13:36 -0700 Mina Almasry wrote:
> >
> > These counters will track hugetlb reservations rather than hugetlb
> > memory faulted in. This patch only adds the counter, following patches
> > add the charging and uncharging of the counter.
> > ---
>
>   !?!
>

Thanks for reviewing. I'm not sure what you're referring to though.
What's wrong here?

> >  include/linux/hugetlb.h |  2 +-
> >  mm/hugetlb_cgroup.c     | 86 +++++++++++++++++++++++++++++++++++++----
> >  2 files changed, 80 insertions(+), 8 deletions(-)
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index edfca42783192..6777b3013345d 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -340,7 +340,7 @@ struct hstate {
> >       unsigned int surplus_huge_pages_node[MAX_NUMNODES];
> >  #ifdef CONFIG_CGROUP_HUGETLB
> >       /* cgroup control files */
> > -     struct cftype cgroup_files[5];
> > +     struct cftype cgroup_files[9];
>
> Move that enum into this header file and replace the raw array indices with
> named constants to ease both reading and maintaining.
> >  #endif
> >       char name[HSTATE_NAME_LEN];
> >  };
> > diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> > index 68c2f2f3c05b7..708103663988a 100644
> > --- a/mm/hugetlb_cgroup.c
> > +++ b/mm/hugetlb_cgroup.c
> > @@ -25,6 +25,10 @@ struct hugetlb_cgroup {
> >        * the counter to account for hugepages from hugetlb.
> >        */
> >       struct page_counter hugepage[HUGE_MAX_HSTATE];
> > +     /*
> > +      * the counter to account for hugepage reservations from hugetlb.
> > +      */
> > +     struct page_counter reserved_hugepage[HUGE_MAX_HSTATE];
> >  };
> >
> >  #define MEMFILE_PRIVATE(x, val)      (((x) << 16) | (val))
> > @@ -33,6 +37,15 @@ struct hugetlb_cgroup {
> >
> >  static struct hugetlb_cgroup *root_h_cgroup __read_mostly;
> >
> > +static inline
> > +struct page_counter *get_counter(struct hugetlb_cgroup *h_cg, int idx,
> > +                              bool reserved)
>
> s/get_/hugetlb_cgroup_get_/ to make it not too generic.
> > +{
> > +     if (reserved)
> > +             return  &h_cg->reserved_hugepage[idx];
> > +     return &h_cg->hugepage[idx];
> > +}
> > +
> >  static inline
> >  struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s)
> >  {
> > @@ -256,28 +269,42 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
> >
> >  enum {
> >       RES_USAGE,
> > +     RES_RESERVATION_USAGE,
> >       RES_LIMIT,
> > +     RES_RESERVATION_LIMIT,
> >       RES_MAX_USAGE,
> > +     RES_RESERVATION_MAX_USAGE,
> >       RES_FAILCNT,
> > +     RES_RESERVATION_FAILCNT,
> >  };
> >
> >  static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
> >                                  struct cftype *cft)
> >  {
> >       struct page_counter *counter;
> > +     struct page_counter *reserved_counter;
> >       struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
> >
> >       counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)];
> > +     reserved_counter = &h_cg->reserved_hugepage[MEMFILE_IDX(cft->private)];
> >
> >       switch (MEMFILE_ATTR(cft->private)) {
> >       case RES_USAGE:
> >               return (u64)page_counter_read(counter) * PAGE_SIZE;
> > +     case RES_RESERVATION_USAGE:
> > +             return (u64)page_counter_read(reserved_counter) * PAGE_SIZE;
> >       case RES_LIMIT:
> >               return (u64)counter->max * PAGE_SIZE;
> > +     case RES_RESERVATION_LIMIT:
> > +             return (u64)reserved_counter->max * PAGE_SIZE;
> >       case RES_MAX_USAGE:
> >               return (u64)counter->watermark * PAGE_SIZE;
> > +     case RES_RESERVATION_MAX_USAGE:
> > +             return (u64)reserved_counter->watermark * PAGE_SIZE;
> >       case RES_FAILCNT:
> >               return counter->failcnt;
> > +     case RES_RESERVATION_FAILCNT:
> > +             return reserved_counter->failcnt;
> >       default:
> >               BUG();
> >       }
> > @@ -291,6 +318,7 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
> >       int ret, idx;
> >       unsigned long nr_pages;
> >       struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
> > +     bool reserved = false;
> >
> >       if (hugetlb_cgroup_is_root(h_cg)) /* Can't set limit on root */
> >               return -EINVAL;
> > @@ -303,10 +331,16 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
> >       idx = MEMFILE_IDX(of_cft(of)->private);
> >       nr_pages = round_down(nr_pages, 1 << huge_page_order(&hstates[idx]));
> >
> > +     if (MEMFILE_ATTR(of_cft(of)->private) == RES_RESERVATION_LIMIT) {
> > +             reserved = true;
> > +     }
> > +
> >       switch (MEMFILE_ATTR(of_cft(of)->private)) {
> > +     case RES_RESERVATION_LIMIT:
>                 reserved = true;
>                 /* fall thru */
>
> >       case RES_LIMIT:
> >               mutex_lock(&hugetlb_limit_mutex);
> > -             ret = page_counter_set_max(&h_cg->hugepage[idx], nr_pages);
> > +             ret = page_counter_set_max(get_counter(h_cg, idx, reserved),
> > +                                        nr_pages);
> >               mutex_unlock(&hugetlb_limit_mutex);
> >               break;
> >       default:
> > @@ -320,18 +354,26 @@ static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of,
> >                                   char *buf, size_t nbytes, loff_t off)
> >  {
> >       int ret = 0;
> > -     struct page_counter *counter;
> > +     struct page_counter *counter, *reserved_counter;
> >       struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
> >
> >       counter = &h_cg->hugepage[MEMFILE_IDX(of_cft(of)->private)];
> > +     reserved_counter = &h_cg->reserved_hugepage[
> > +             MEMFILE_IDX(of_cft(of)->private)];
> >
> >       switch (MEMFILE_ATTR(of_cft(of)->private)) {
> >       case RES_MAX_USAGE:
> >               page_counter_reset_watermark(counter);
> >               break;
> > +     case RES_RESERVATION_MAX_USAGE:
> > +             page_counter_reset_watermark(reserved_counter);
> > +             break;
> >       case RES_FAILCNT:
> >               counter->failcnt = 0;
> >               break;
> > +     case RES_RESERVATION_FAILCNT:
> > +             reserved_counter->failcnt = 0;
> > +             break;
> >       default:
> >               ret = -EINVAL;
> >               break;
> > @@ -357,7 +399,7 @@ static void __init __hugetlb_cgroup_file_init(int idx)
> >       struct hstate *h = &hstates[idx];
> >
> >       /* format the size */
> > -     mem_fmt(buf, 32, huge_page_size(h));
> > +     mem_fmt(buf, sizeof(buf), huge_page_size(h));
> >
> >       /* Add the limit file */
> >       cft = &h->cgroup_files[0];
> > @@ -366,28 +408,58 @@ static void __init __hugetlb_cgroup_file_init(int idx)
> >       cft->read_u64 = hugetlb_cgroup_read_u64;
> >       cft->write = hugetlb_cgroup_write;
> >
> > -     /* Add the usage file */
> > +     /* Add the reservation limit file */
> >       cft = &h->cgroup_files[1];
> > +     snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_limit_in_bytes",
> > +              buf);
> > +     cft->private = MEMFILE_PRIVATE(idx, RES_RESERVATION_LIMIT);
> > +     cft->read_u64 = hugetlb_cgroup_read_u64;
> > +     cft->write = hugetlb_cgroup_write;
> > +
> > +     /* Add the usage file */
> > +     cft = &h->cgroup_files[2];
> >       snprintf(cft->name, MAX_CFTYPE_NAME, "%s.usage_in_bytes", buf);
> >       cft->private = MEMFILE_PRIVATE(idx, RES_USAGE);
> >       cft->read_u64 = hugetlb_cgroup_read_u64;
> >
> > +     /* Add the reservation usage file */
> > +     cft = &h->cgroup_files[3];
> > +     snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_usage_in_bytes",
> > +                     buf);
> > +     cft->private = MEMFILE_PRIVATE(idx, RES_RESERVATION_USAGE);
> > +     cft->read_u64 = hugetlb_cgroup_read_u64;
> > +
> >       /* Add the MAX usage file */
> > -     cft = &h->cgroup_files[2];
> > +     cft = &h->cgroup_files[4];
> >       snprintf(cft->name, MAX_CFTYPE_NAME, "%s.max_usage_in_bytes", buf);
> >       cft->private = MEMFILE_PRIVATE(idx, RES_MAX_USAGE);
> >       cft->write = hugetlb_cgroup_reset;
> >       cft->read_u64 = hugetlb_cgroup_read_u64;
> >
> > +     /* Add the MAX reservation usage file */
> > +     cft = &h->cgroup_files[5];
> > +     snprintf(cft->name, MAX_CFTYPE_NAME,
> > +                     "%s.reservation_max_usage_in_bytes", buf);
> > +     cft->private = MEMFILE_PRIVATE(idx, RES_RESERVATION_MAX_USAGE);
> > +     cft->write = hugetlb_cgroup_reset;
> > +     cft->read_u64 = hugetlb_cgroup_read_u64;
> > +
> >       /* Add the failcntfile */
> > -     cft = &h->cgroup_files[3];
> > +     cft = &h->cgroup_files[6];
> >       snprintf(cft->name, MAX_CFTYPE_NAME, "%s.failcnt", buf);
> >       cft->private  = MEMFILE_PRIVATE(idx, RES_FAILCNT);
> >       cft->write = hugetlb_cgroup_reset;
> >       cft->read_u64 = hugetlb_cgroup_read_u64;
> >
> > +     /* Add the reservation failcntfile */
> > +     cft = &h->cgroup_files[7];
> > +     snprintf(cft->name, MAX_CFTYPE_NAME, "%s.reservation_failcnt", buf);
> > +     cft->private  = MEMFILE_PRIVATE(idx, RES_RESERVATION_FAILCNT);
> > +     cft->write = hugetlb_cgroup_reset;
> > +     cft->read_u64 = hugetlb_cgroup_read_u64;
> > +
> >       /* NULL terminate the last cft */
> > -     cft = &h->cgroup_files[4];
> > +     cft = &h->cgroup_files[8];
> >       memset(cft, 0, sizeof(*cft));
>
> Replace the raw array indices with named constants.
> >
> >       WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys,
> > --
> > 2.23.0.rc1.153.gdeed80330f-goog
>