diff mbox series

[v12,1/9] hugetlb_cgroup: Add hugetlb_cgroup reservation counter

Message ID 20200211213128.73302-1-almasrymina@google.com (mailing list archive)
State New
Headers show
Series [v12,1/9] hugetlb_cgroup: Add hugetlb_cgroup reservation counter | expand

Commit Message

Mina Almasry Feb. 11, 2020, 9:31 p.m. UTC
These counters will track hugetlb reservations rather than hugetlb
memory faulted in. This patch only adds the counter, following patches
add the charging and uncharging of the counter.

This is patch 1 of an 9 patch series.

Problem:
Currently tasks attempting to reserve more hugetlb memory than is
available get a failure at mmap/shmget time. This is thanks to
Hugetlbfs Reservations [1].  However, if a task attempts to reserve
more hugetlb memory than its hugetlb_cgroup limit allows, the kernel
will allow the mmap/shmget call, but will SIGBUS the task when it
attempts to fault in the excess memory.

We have users hitting their hugetlb_cgroup limits and thus we've been
looking at this failure mode. We'd like to improve this behavior such
that users violating the hugetlb_cgroup limits get an error on
mmap/shmget time, rather than getting SIGBUS'd when they try to fault
the excess memory in. This gives the user an opportunity to fallback
more gracefully to non-hugetlbfs memory for example.

The underlying problem is that today's hugetlb_cgroup accounting happens
at hugetlb memory *fault* time, rather than at *reservation* time.
Thus, enforcing the hugetlb_cgroup limit only happens at fault time, and
the offending task gets SIGBUS'd.

Proposed Solution:
A new page counter named
'hugetlb.xMB.rsvd.[limit|usage|max_usage]_in_bytes'. This counter has
slightly different semantics than
'hugetlb.xMB.[limit|usage|max_usage]_in_bytes':

- While usage_in_bytes tracks all *faulted* hugetlb memory,
rsvd.usage_in_bytes tracks all *reserved* hugetlb memory and
hugetlb memory faulted in without a prior reservation.

- If a task attempts to reserve more memory than limit_in_bytes allows,
the kernel will allow it to do so. But if a task attempts to reserve
more memory than rsvd.limit_in_bytes, the kernel will fail this
reservation.

This proposal is implemented in this patch series, with tests to verify
functionality and show the usage.

Alternatives considered:
1. A new cgroup, instead of only a new page_counter attached to
   the existing hugetlb_cgroup. Adding a new cgroup seemed like a lot
   of code duplication with hugetlb_cgroup. Keeping hugetlb related
   page counters under hugetlb_cgroup seemed cleaner as well.

2. Instead of adding a new counter, we considered adding a sysctl that
   modifies the behavior of hugetlb.xMB.[limit|usage]_in_bytes, to do
   accounting at reservation time rather than fault time. Adding a new
   page_counter seems better as userspace could, if it wants, choose to
   enforce different cgroups differently: one via limit_in_bytes, and
   another via rsvd.limit_in_bytes.  This could be very useful if
   you're transitioning how hugetlb memory is partitioned on your system
   one cgroup at a time, for example. Also, someone may find usage for
   both limit_in_bytes and rsvd.limit_in_bytes concurrently, and this
   approach gives them the option to do so.

Testing:
- Added tests passing.
- Used libhugetlbfs for regression testing.

[1]: https://www.kernel.org/doc/html/latest/vm/hugetlbfs_reserv.html

Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

---
Changes in v12:
- Minor code formatting.

Changes in v11:
- Renamed resv.* or 'reservation' or 'reserved' to rsvd.*
- Renamed hugetlb_cgroup_get_counter() to
hugetlb_cgroup_counter_from_cgroup().

Changes in v10:
- Renamed reservation_* to resv.*

---
 include/linux/hugetlb.h |   4 +-
 mm/hugetlb_cgroup.c     | 115 +++++++++++++++++++++++++++++++++++-----
 2 files changed, 104 insertions(+), 15 deletions(-)

--
2.25.0.225.g125e21ebc7-goog

Comments

Andrew Morton Feb. 11, 2020, 11:19 p.m. UTC | #1
On Tue, 11 Feb 2020 13:31:20 -0800 Mina Almasry <almasrymina@google.com> wrote:

> These counters will track hugetlb reservations rather than hugetlb
> memory faulted in. This patch only adds the counter, following patches
> add the charging and uncharging of the counter.

We're still pretty thin on review here, but as it's v12 and Mike
appears to be signed up to look at this work, I'll add them to -next to
help move things forward.
Qian Cai Feb. 18, 2020, 2:21 p.m. UTC | #2
On Tue, 2020-02-11 at 15:19 -0800, Andrew Morton wrote:
> On Tue, 11 Feb 2020 13:31:20 -0800 Mina Almasry <almasrymina@google.com> wrote:
> 
> > These counters will track hugetlb reservations rather than hugetlb
> > memory faulted in. This patch only adds the counter, following patches
> > add the charging and uncharging of the counter.
> 
> We're still pretty thin on review here, but as it's v12 and Mike
> appears to be signed up to look at this work, I'll add them to -next to
> help move things forward.
> 

Reverted the whole series on the top of next-20200217 fixed a crash below (I
don't see anything in next-20200218 would make any differences).

[ 7933.691114][T35046] LTP: starting hugemmap06
[ 7933.806377][T14355] ------------[ cut here ]------------
[ 7933.806541][T14355] kernel BUG at mm/hugetlb.c:490!
VM_BUG_ON(t - f <= 1);
[ 7933.806562][T14355] Oops: Exception in kernel mode, sig: 5 [#1]
[ 7933.806573][T14355] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=256
DEBUG_PAGEALLOC NUMA PowerNV
[ 7933.806594][T14355] Modules linked in: kvm_hv kvm brd ext4 crc16 mbcache jbd2
loop ip_tables x_tables xfs sd_mod bnx2x ahci mdio libahci tg3 libata libphy
firmware_class dm_mirror dm_region_hash dm_log dm_mod [last unloaded:
binfmt_misc]
[ 7933.806651][T14355] CPU: 54 PID: 14355 Comm: hugemmap06 Tainted:
G           O      5.6.0-rc2-next-20200217 #1
[ 7933.806674][T14355] NIP:  c00000000040d22c LR: c00000000040d210 CTR:
0000000000000000
[ 7933.806696][T14355] REGS: c0000014b71ef660 TRAP: 0700   Tainted:
G           O       (5.6.0-rc2-next-20200217)
[ 7933.806727][T14355] MSR:  900000000282b033
<SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 22022228  XER: 00000000
[ 7933.806772][T14355] CFAR: c00000000040cbec IRQMASK: 0 
[ 7933.806772][T14355] GPR00: c00000000040d210 c0000014b71ef8f0 c000000001657000
0000000000000001 
[ 7933.806772][T14355] GPR04: 0000000000000012 0000000000000013 0000000000000000
0000000000000000 
[ 7933.806772][T14355] GPR08: 0000000000000002 0000000000000002 0000000000000001
0000000000000036 
[ 7933.806772][T14355] GPR12: 0000000022022222 c000001ffffd3d00 00007fffad670000
00007fffa4bc0000 
[ 7933.806772][T14355] GPR16: 0000000000000000 c000000001567178 c0000014b71efa50
0000000000000000 
[ 7933.806772][T14355] GPR20: 0000000000000000 0000000000000013 0000000000000012
0000000000000001 
[ 7933.806772][T14355] GPR24: c0000019f74cd270 5deadbeef0000100 5deadbeef0000122
c0000019f74cd2c0 
[ 7933.806772][T14355] GPR28: 0000000000000001 c0000019f74cd268 c0000014b71ef918
0000000000000001 
[ 7933.806961][T14355] NIP [c00000000040d22c] region_add+0x11c/0x3a0
[ 7933.806980][T14355] LR [c00000000040d210] region_add+0x100/0x3a0
[ 7933.807008][T14355] Call Trace:
[ 7933.807024][T14355] [c0000014b71ef8f0] [c00000000040d210]
region_add+0x100/0x3a0 (unreliable)
[ 7933.807056][T14355] [c0000014b71ef9b0] [c00000000040e0c8]
__vma_reservation_common+0x148/0x210
__vma_reservation_common at mm/hugetlb.c:2150
[ 7933.807087][T14355] [c0000014b71efa20] [c0000000004132a0]
alloc_huge_page+0x350/0x830
alloc_huge_page at mm/hugetlb.c:2359
[ 7933.807100][T14355] [c0000014b71efad0] [c0000000004168f8]
hugetlb_no_page+0x158/0xcb0
[ 7933.807113][T14355] [c0000014b71efc20] [c000000000417bc8]
hugetlb_fault+0x678/0xb30
[ 7933.807136][T14355] [c0000014b71efcd0] [c0000000003b1de4]
handle_mm_fault+0x444/0x450
[ 7933.807158][T14355] [c0000014b71efd20] [c000000000070b1c]
__do_page_fault+0x2bc/0xfd0
[ 7933.807181][T14355] [c0000014b71efe20] [c00000000000aa88]
handle_page_fault+0x10/0x30
[ 7933.807201][T14355] Instruction dump:
[ 7933.807209][T14355] 38c00000 7ea5ab78 7ec4b378 7fa3eb78 4bfff80d e9210020
e91d0050 e95d0068 
[ 7933.807232][T14355] 7d3c4850 7d294214 7faa4800 409c0238 <0b170000> 7f03c378
4858c005 60000000 
[ 7933.807267][T14355] ---[ end trace 7560275de5f409f8 ]---
[ 7933.905258][T14355] 
[ 7934.905339][T14355] Kernel panic - not syncing: Fatal exception
Mina Almasry Feb. 18, 2020, 6:35 p.m. UTC | #3
On Tue, Feb 18, 2020 at 6:21 AM Qian Cai <cai@lca.pw> wrote:
>
> On Tue, 2020-02-11 at 15:19 -0800, Andrew Morton wrote:
> > On Tue, 11 Feb 2020 13:31:20 -0800 Mina Almasry <almasrymina@google.com> wrote:
> >
> > > These counters will track hugetlb reservations rather than hugetlb
> > > memory faulted in. This patch only adds the counter, following patches
> > > add the charging and uncharging of the counter.
> >
> > We're still pretty thin on review here, but as it's v12 and Mike
> > appears to be signed up to look at this work, I'll add them to -next to
> > help move things forward.
> >
>
> Reverted the whole series on the top of next-20200217 fixed a crash below (I
> don't see anything in next-20200218 would make any differences).
>
> [ 7933.691114][T35046] LTP: starting hugemmap06
> [ 7933.806377][T14355] ------------[ cut here ]------------
> [ 7933.806541][T14355] kernel BUG at mm/hugetlb.c:490!
> VM_BUG_ON(t - f <= 1);
> [ 7933.806562][T14355] Oops: Exception in kernel mode, sig: 5 [#1]
> [ 7933.806573][T14355] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=256
> DEBUG_PAGEALLOC NUMA PowerNV
> [ 7933.806594][T14355] Modules linked in: kvm_hv kvm brd ext4 crc16 mbcache jbd2
> loop ip_tables x_tables xfs sd_mod bnx2x ahci mdio libahci tg3 libata libphy
> firmware_class dm_mirror dm_region_hash dm_log dm_mod [last unloaded:
> binfmt_misc]
> [ 7933.806651][T14355] CPU: 54 PID: 14355 Comm: hugemmap06 Tainted:
> G           O      5.6.0-rc2-next-20200217 #1
> [ 7933.806674][T14355] NIP:  c00000000040d22c LR: c00000000040d210 CTR:
> 0000000000000000
> [ 7933.806696][T14355] REGS: c0000014b71ef660 TRAP: 0700   Tainted:
> G           O       (5.6.0-rc2-next-20200217)
> [ 7933.806727][T14355] MSR:  900000000282b033
> <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 22022228  XER: 00000000
> [ 7933.806772][T14355] CFAR: c00000000040cbec IRQMASK: 0
> [ 7933.806772][T14355] GPR00: c00000000040d210 c0000014b71ef8f0 c000000001657000
> 0000000000000001
> [ 7933.806772][T14355] GPR04: 0000000000000012 0000000000000013 0000000000000000
> 0000000000000000
> [ 7933.806772][T14355] GPR08: 0000000000000002 0000000000000002 0000000000000001
> 0000000000000036
> [ 7933.806772][T14355] GPR12: 0000000022022222 c000001ffffd3d00 00007fffad670000
> 00007fffa4bc0000
> [ 7933.806772][T14355] GPR16: 0000000000000000 c000000001567178 c0000014b71efa50
> 0000000000000000
> [ 7933.806772][T14355] GPR20: 0000000000000000 0000000000000013 0000000000000012
> 0000000000000001
> [ 7933.806772][T14355] GPR24: c0000019f74cd270 5deadbeef0000100 5deadbeef0000122
> c0000019f74cd2c0
> [ 7933.806772][T14355] GPR28: 0000000000000001 c0000019f74cd268 c0000014b71ef918
> 0000000000000001
> [ 7933.806961][T14355] NIP [c00000000040d22c] region_add+0x11c/0x3a0
> [ 7933.806980][T14355] LR [c00000000040d210] region_add+0x100/0x3a0
> [ 7933.807008][T14355] Call Trace:
> [ 7933.807024][T14355] [c0000014b71ef8f0] [c00000000040d210]
> region_add+0x100/0x3a0 (unreliable)
> [ 7933.807056][T14355] [c0000014b71ef9b0] [c00000000040e0c8]
> __vma_reservation_common+0x148/0x210
> __vma_reservation_common at mm/hugetlb.c:2150
> [ 7933.807087][T14355] [c0000014b71efa20] [c0000000004132a0]
> alloc_huge_page+0x350/0x830
> alloc_huge_page at mm/hugetlb.c:2359
> [ 7933.807100][T14355] [c0000014b71efad0] [c0000000004168f8]
> hugetlb_no_page+0x158/0xcb0
> [ 7933.807113][T14355] [c0000014b71efc20] [c000000000417bc8]
> hugetlb_fault+0x678/0xb30
> [ 7933.807136][T14355] [c0000014b71efcd0] [c0000000003b1de4]
> handle_mm_fault+0x444/0x450
> [ 7933.807158][T14355] [c0000014b71efd20] [c000000000070b1c]
> __do_page_fault+0x2bc/0xfd0
> [ 7933.807181][T14355] [c0000014b71efe20] [c00000000000aa88]
> handle_page_fault+0x10/0x30
> [ 7933.807201][T14355] Instruction dump:
> [ 7933.807209][T14355] 38c00000 7ea5ab78 7ec4b378 7fa3eb78 4bfff80d e9210020
> e91d0050 e95d0068
> [ 7933.807232][T14355] 7d3c4850 7d294214 7faa4800 409c0238 <0b170000> 7f03c378
> 4858c005 60000000
> [ 7933.807267][T14355] ---[ end trace 7560275de5f409f8 ]---
> [ 7933.905258][T14355]
> [ 7934.905339][T14355] Kernel panic - not syncing: Fatal exception

Hi Qian,

Yes this VM_BUG_ON was added by a patch in the series ("hugetlb:
disable region_add file_region coalescing") so it's definitely related
to the series. I'm taking a look at why this VM_BUG_ON fires. Can you
confirm you reproduce this by running hugemmap06 from the ltp on a
powerpc machine? Can I maybe have your config?

Thanks!
Qian Cai Feb. 18, 2020, 6:41 p.m. UTC | #4
On Tue, 2020-02-18 at 10:35 -0800, Mina Almasry wrote:
> On Tue, Feb 18, 2020 at 6:21 AM Qian Cai <cai@lca.pw> wrote:
> > 
> > On Tue, 2020-02-11 at 15:19 -0800, Andrew Morton wrote:
> > > On Tue, 11 Feb 2020 13:31:20 -0800 Mina Almasry <almasrymina@google.com> wrote:
> > > 
> > > > These counters will track hugetlb reservations rather than hugetlb
> > > > memory faulted in. This patch only adds the counter, following patches
> > > > add the charging and uncharging of the counter.
> > > 
> > > We're still pretty thin on review here, but as it's v12 and Mike
> > > appears to be signed up to look at this work, I'll add them to -next to
> > > help move things forward.
> > > 
> > 
> > Reverted the whole series on the top of next-20200217 fixed a crash below (I
> > don't see anything in next-20200218 would make any differences).
> > 
> > [ 7933.691114][T35046] LTP: starting hugemmap06
> > [ 7933.806377][T14355] ------------[ cut here ]------------
> > [ 7933.806541][T14355] kernel BUG at mm/hugetlb.c:490!
> > VM_BUG_ON(t - f <= 1);
> > [ 7933.806562][T14355] Oops: Exception in kernel mode, sig: 5 [#1]
> > [ 7933.806573][T14355] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=256
> > DEBUG_PAGEALLOC NUMA PowerNV
> > [ 7933.806594][T14355] Modules linked in: kvm_hv kvm brd ext4 crc16 mbcache jbd2
> > loop ip_tables x_tables xfs sd_mod bnx2x ahci mdio libahci tg3 libata libphy
> > firmware_class dm_mirror dm_region_hash dm_log dm_mod [last unloaded:
> > binfmt_misc]
> > [ 7933.806651][T14355] CPU: 54 PID: 14355 Comm: hugemmap06 Tainted:
> > G           O      5.6.0-rc2-next-20200217 #1
> > [ 7933.806674][T14355] NIP:  c00000000040d22c LR: c00000000040d210 CTR:
> > 0000000000000000
> > [ 7933.806696][T14355] REGS: c0000014b71ef660 TRAP: 0700   Tainted:
> > G           O       (5.6.0-rc2-next-20200217)
> > [ 7933.806727][T14355] MSR:  900000000282b033
> > <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 22022228  XER: 00000000
> > [ 7933.806772][T14355] CFAR: c00000000040cbec IRQMASK: 0
> > [ 7933.806772][T14355] GPR00: c00000000040d210 c0000014b71ef8f0 c000000001657000
> > 0000000000000001
> > [ 7933.806772][T14355] GPR04: 0000000000000012 0000000000000013 0000000000000000
> > 0000000000000000
> > [ 7933.806772][T14355] GPR08: 0000000000000002 0000000000000002 0000000000000001
> > 0000000000000036
> > [ 7933.806772][T14355] GPR12: 0000000022022222 c000001ffffd3d00 00007fffad670000
> > 00007fffa4bc0000
> > [ 7933.806772][T14355] GPR16: 0000000000000000 c000000001567178 c0000014b71efa50
> > 0000000000000000
> > [ 7933.806772][T14355] GPR20: 0000000000000000 0000000000000013 0000000000000012
> > 0000000000000001
> > [ 7933.806772][T14355] GPR24: c0000019f74cd270 5deadbeef0000100 5deadbeef0000122
> > c0000019f74cd2c0
> > [ 7933.806772][T14355] GPR28: 0000000000000001 c0000019f74cd268 c0000014b71ef918
> > 0000000000000001
> > [ 7933.806961][T14355] NIP [c00000000040d22c] region_add+0x11c/0x3a0
> > [ 7933.806980][T14355] LR [c00000000040d210] region_add+0x100/0x3a0
> > [ 7933.807008][T14355] Call Trace:
> > [ 7933.807024][T14355] [c0000014b71ef8f0] [c00000000040d210]
> > region_add+0x100/0x3a0 (unreliable)
> > [ 7933.807056][T14355] [c0000014b71ef9b0] [c00000000040e0c8]
> > __vma_reservation_common+0x148/0x210
> > __vma_reservation_common at mm/hugetlb.c:2150
> > [ 7933.807087][T14355] [c0000014b71efa20] [c0000000004132a0]
> > alloc_huge_page+0x350/0x830
> > alloc_huge_page at mm/hugetlb.c:2359
> > [ 7933.807100][T14355] [c0000014b71efad0] [c0000000004168f8]
> > hugetlb_no_page+0x158/0xcb0
> > [ 7933.807113][T14355] [c0000014b71efc20] [c000000000417bc8]
> > hugetlb_fault+0x678/0xb30
> > [ 7933.807136][T14355] [c0000014b71efcd0] [c0000000003b1de4]
> > handle_mm_fault+0x444/0x450
> > [ 7933.807158][T14355] [c0000014b71efd20] [c000000000070b1c]
> > __do_page_fault+0x2bc/0xfd0
> > [ 7933.807181][T14355] [c0000014b71efe20] [c00000000000aa88]
> > handle_page_fault+0x10/0x30
> > [ 7933.807201][T14355] Instruction dump:
> > [ 7933.807209][T14355] 38c00000 7ea5ab78 7ec4b378 7fa3eb78 4bfff80d e9210020
> > e91d0050 e95d0068
> > [ 7933.807232][T14355] 7d3c4850 7d294214 7faa4800 409c0238 <0b170000> 7f03c378
> > 4858c005 60000000
> > [ 7933.807267][T14355] ---[ end trace 7560275de5f409f8 ]---
> > [ 7933.905258][T14355]
> > [ 7934.905339][T14355] Kernel panic - not syncing: Fatal exception
> 
> Hi Qian,
> 
> Yes this VM_BUG_ON was added by a patch in the series ("hugetlb:
> disable region_add file_region coalescing") so it's definitely related
> to the series. I'm taking a look at why this VM_BUG_ON fires. Can you
> confirm you reproduce this by running hugemmap06 from the ltp on a
> powerpc machine? Can I maybe have your config?

Yes, reproduced on both powerpc and x86. Configs are in,

https://github.com/cailca/linux-mm
Mike Kravetz Feb. 18, 2020, 7:14 p.m. UTC | #5
On 2/18/20 10:35 AM, Mina Almasry wrote:
> On Tue, Feb 18, 2020 at 6:21 AM Qian Cai <cai@lca.pw> wrote:
>>
>> On Tue, 2020-02-11 at 15:19 -0800, Andrew Morton wrote:
>>> On Tue, 11 Feb 2020 13:31:20 -0800 Mina Almasry <almasrymina@google.com> wrote:
>>>
>> [ 7933.806377][T14355] ------------[ cut here ]------------
>> [ 7933.806541][T14355] kernel BUG at mm/hugetlb.c:490!
>> VM_BUG_ON(t - f <= 1);
>> [ 7933.806562][T14355] Oops: Exception in kernel mode, sig: 5 [#1]
<snip>
> Hi Qian,
> 
> Yes this VM_BUG_ON was added by a patch in the series ("hugetlb:
> disable region_add file_region coalescing") so it's definitely related
> to the series. I'm taking a look at why this VM_BUG_ON fires. Can you
> confirm you reproduce this by running hugemmap06 from the ltp on a
> powerpc machine? Can I maybe have your config?
> 
> Thanks!

Hi Mina,

Looking at the region_chg code again, we do a

	resv->adds_in_progress += *out_regions_needed;

and then potentially drop the lock to allocate the needed entries.  Could
anopther thread (only adding reservation for a single page) then come in
and notice that there are not enough entries in the cache and hit the
VM_BUG_ON()?
Mina Almasry Feb. 18, 2020, 7:25 p.m. UTC | #6
On Tue, Feb 18, 2020 at 11:14 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 2/18/20 10:35 AM, Mina Almasry wrote:
> > On Tue, Feb 18, 2020 at 6:21 AM Qian Cai <cai@lca.pw> wrote:
> >>
> >> On Tue, 2020-02-11 at 15:19 -0800, Andrew Morton wrote:
> >>> On Tue, 11 Feb 2020 13:31:20 -0800 Mina Almasry <almasrymina@google.com> wrote:
> >>>
> >> [ 7933.806377][T14355] ------------[ cut here ]------------
> >> [ 7933.806541][T14355] kernel BUG at mm/hugetlb.c:490!
> >> VM_BUG_ON(t - f <= 1);
> >> [ 7933.806562][T14355] Oops: Exception in kernel mode, sig: 5 [#1]
> <snip>
> > Hi Qian,
> >
> > Yes this VM_BUG_ON was added by a patch in the series ("hugetlb:
> > disable region_add file_region coalescing") so it's definitely related
> > to the series. I'm taking a look at why this VM_BUG_ON fires. Can you
> > confirm you reproduce this by running hugemmap06 from the ltp on a
> > powerpc machine? Can I maybe have your config?
> >
> > Thanks!
>
> Hi Mina,
>
> Looking at the region_chg code again, we do a
>
>         resv->adds_in_progress += *out_regions_needed;
>
> and then potentially drop the lock to allocate the needed entries.  Could
> anopther thread (only adding reservation for a single page) then come in
> and notice that there are not enough entries in the cache and hit the
> VM_BUG_ON()?

Maybe. Also I'm thinking the code thinks actual_regions_needed >=
in_regions_needed, but that doesn't seem like a guarantee. I think
this call sequence with the same t->f range would violate that:

region_chg (regions_needed=1)
region_chg (regions_needed=1)
region_add (fills in the range)
region_add (in_regions_needed = 1, actual_regions_needed = 0, so
assumptions in the code break).

Luckily it seems the ltp readily reproduces this, so I'm working on
reproducing it. I should have a fix soon, at least if I can reproduce
it as well.
Mina Almasry Feb. 18, 2020, 9:36 p.m. UTC | #7
On Tue, Feb 18, 2020 at 11:25 AM Mina Almasry <almasrymina@google.com> wrote:
>
> On Tue, Feb 18, 2020 at 11:14 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >
> > On 2/18/20 10:35 AM, Mina Almasry wrote:
> > > On Tue, Feb 18, 2020 at 6:21 AM Qian Cai <cai@lca.pw> wrote:
> > >>
> > >> On Tue, 2020-02-11 at 15:19 -0800, Andrew Morton wrote:
> > >>> On Tue, 11 Feb 2020 13:31:20 -0800 Mina Almasry <almasrymina@google.com> wrote:
> > >>>
> > >> [ 7933.806377][T14355] ------------[ cut here ]------------
> > >> [ 7933.806541][T14355] kernel BUG at mm/hugetlb.c:490!
> > >> VM_BUG_ON(t - f <= 1);
> > >> [ 7933.806562][T14355] Oops: Exception in kernel mode, sig: 5 [#1]
> > <snip>
> > > Hi Qian,
> > >
> > > Yes this VM_BUG_ON was added by a patch in the series ("hugetlb:
> > > disable region_add file_region coalescing") so it's definitely related
> > > to the series. I'm taking a look at why this VM_BUG_ON fires. Can you
> > > confirm you reproduce this by running hugemmap06 from the ltp on a
> > > powerpc machine? Can I maybe have your config?
> > >
> > > Thanks!
> >
> > Hi Mina,
> >
> > Looking at the region_chg code again, we do a
> >
> >         resv->adds_in_progress += *out_regions_needed;
> >
> > and then potentially drop the lock to allocate the needed entries.  Could
> > anopther thread (only adding reservation for a single page) then come in
> > and notice that there are not enough entries in the cache and hit the
> > VM_BUG_ON()?
>
> Maybe. Also I'm thinking the code thinks actual_regions_needed >=
> in_regions_needed, but that doesn't seem like a guarantee. I think
> this call sequence with the same t->f range would violate that:
>
> region_chg (regions_needed=1)
> region_chg (regions_needed=1)
> region_add (fills in the range)
> region_add (in_regions_needed = 1, actual_regions_needed = 0, so
> assumptions in the code break).
>
> Luckily it seems the ltp readily reproduces this, so I'm working on
> reproducing it. I should have a fix soon, at least if I can reproduce
> it as well.

I had a bit of trouble reproducing this but I got it just now.

Makes sense I've never run into this even though others can readily
reproduce it. I happen to run my kernels on a pretty beefy 36 core
machine and in that setup things seem to execute fast and there is
never a queue of pending file_region inserts into the resv_map. Once I
limited qemu to only use 2 cores I ran into the issue right away.
Looking into a fix now.
Mike Kravetz Feb. 18, 2020, 9:41 p.m. UTC | #8
On 2/18/20 1:36 PM, Mina Almasry wrote:
> On Tue, Feb 18, 2020 at 11:25 AM Mina Almasry <almasrymina@google.com> wrote:
>>
>> On Tue, Feb 18, 2020 at 11:14 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>>
>>> On 2/18/20 10:35 AM, Mina Almasry wrote:
>>>> On Tue, Feb 18, 2020 at 6:21 AM Qian Cai <cai@lca.pw> wrote:
>>>>>
>>>>> On Tue, 2020-02-11 at 15:19 -0800, Andrew Morton wrote:
>>>>>> On Tue, 11 Feb 2020 13:31:20 -0800 Mina Almasry <almasrymina@google.com> wrote:
>>>>>>
>>>>> [ 7933.806377][T14355] ------------[ cut here ]------------
>>>>> [ 7933.806541][T14355] kernel BUG at mm/hugetlb.c:490!
>>>>> VM_BUG_ON(t - f <= 1);
>>>>> [ 7933.806562][T14355] Oops: Exception in kernel mode, sig: 5 [#1]
>>> <snip>
>>>> Hi Qian,
>>>>
>>>> Yes this VM_BUG_ON was added by a patch in the series ("hugetlb:
>>>> disable region_add file_region coalescing") so it's definitely related
>>>> to the series. I'm taking a look at why this VM_BUG_ON fires. Can you
>>>> confirm you reproduce this by running hugemmap06 from the ltp on a
>>>> powerpc machine? Can I maybe have your config?
>>>>
>>>> Thanks!
>>>
>>> Hi Mina,
>>>
>>> Looking at the region_chg code again, we do a
>>>
>>>         resv->adds_in_progress += *out_regions_needed;
>>>
>>> and then potentially drop the lock to allocate the needed entries.  Could
>>> anopther thread (only adding reservation for a single page) then come in
>>> and notice that there are not enough entries in the cache and hit the
>>> VM_BUG_ON()?
>>
>> Maybe. Also I'm thinking the code thinks actual_regions_needed >=
>> in_regions_needed, but that doesn't seem like a guarantee. I think
>> this call sequence with the same t->f range would violate that:
>>
>> region_chg (regions_needed=1)
>> region_chg (regions_needed=1)
>> region_add (fills in the range)
>> region_add (in_regions_needed = 1, actual_regions_needed = 0, so
>> assumptions in the code break).
>>
>> Luckily it seems the ltp readily reproduces this, so I'm working on
>> reproducing it. I should have a fix soon, at least if I can reproduce
>> it as well.
> 
> I had a bit of trouble reproducing this but I got it just now.
> 
> Makes sense I've never run into this even though others can readily
> reproduce it. I happen to run my kernels on a pretty beefy 36 core
> machine and in that setup things seem to execute fast and there is
> never a queue of pending file_region inserts into the resv_map. Once I
> limited qemu to only use 2 cores I ran into the issue right away.
> Looking into a fix now.

This may not be optimal, but it resolves the issue for me.  I just put it
together to test the theory that the region_chg code was at fault.
Mina Almasry Feb. 18, 2020, 10:27 p.m. UTC | #9
On Tue, Feb 18, 2020 at 1:41 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 2/18/20 1:36 PM, Mina Almasry wrote:
> > On Tue, Feb 18, 2020 at 11:25 AM Mina Almasry <almasrymina@google.com> wrote:
> >>
> >> On Tue, Feb 18, 2020 at 11:14 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >>>
> >>> On 2/18/20 10:35 AM, Mina Almasry wrote:
> >>>> On Tue, Feb 18, 2020 at 6:21 AM Qian Cai <cai@lca.pw> wrote:
> >>>>>
> >>>>> On Tue, 2020-02-11 at 15:19 -0800, Andrew Morton wrote:
> >>>>>> On Tue, 11 Feb 2020 13:31:20 -0800 Mina Almasry <almasrymina@google.com> wrote:
> >>>>>>
> >>>>> [ 7933.806377][T14355] ------------[ cut here ]------------
> >>>>> [ 7933.806541][T14355] kernel BUG at mm/hugetlb.c:490!
> >>>>> VM_BUG_ON(t - f <= 1);
> >>>>> [ 7933.806562][T14355] Oops: Exception in kernel mode, sig: 5 [#1]
> >>> <snip>
> >>>> Hi Qian,
> >>>>
> >>>> Yes this VM_BUG_ON was added by a patch in the series ("hugetlb:
> >>>> disable region_add file_region coalescing") so it's definitely related
> >>>> to the series. I'm taking a look at why this VM_BUG_ON fires. Can you
> >>>> confirm you reproduce this by running hugemmap06 from the ltp on a
> >>>> powerpc machine? Can I maybe have your config?
> >>>>
> >>>> Thanks!
> >>>
> >>> Hi Mina,
> >>>
> >>> Looking at the region_chg code again, we do a
> >>>
> >>>         resv->adds_in_progress += *out_regions_needed;
> >>>
> >>> and then potentially drop the lock to allocate the needed entries.  Could
> >>> anopther thread (only adding reservation for a single page) then come in
> >>> and notice that there are not enough entries in the cache and hit the
> >>> VM_BUG_ON()?
> >>
> >> Maybe. Also I'm thinking the code thinks actual_regions_needed >=
> >> in_regions_needed, but that doesn't seem like a guarantee. I think
> >> this call sequence with the same t->f range would violate that:
> >>
> >> region_chg (regions_needed=1)
> >> region_chg (regions_needed=1)
> >> region_add (fills in the range)
> >> region_add (in_regions_needed = 1, actual_regions_needed = 0, so
> >> assumptions in the code break).
> >>
> >> Luckily it seems the ltp readily reproduces this, so I'm working on
> >> reproducing it. I should have a fix soon, at least if I can reproduce
> >> it as well.
> >
> > I had a bit of trouble reproducing this but I got it just now.
> >
> > Makes sense I've never run into this even though others can readily
> > reproduce it. I happen to run my kernels on a pretty beefy 36 core
> > machine and in that setup things seem to execute fast and there is
> > never a queue of pending file_region inserts into the resv_map. Once I
> > limited qemu to only use 2 cores I ran into the issue right away.
> > Looking into a fix now.
>
> This may not be optimal, but it resolves the issue for me.  I just put it
> together to test the theory that the region_chg code was at fault.

Thanks! Just sent out a similar patch "[PATCH -next] mm/hugetlb: Fix
file_region entry allocations"
Mina Almasry Feb. 19, 2020, 7:05 p.m. UTC | #10
On Tue, Feb 11, 2020 at 3:19 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue, 11 Feb 2020 13:31:20 -0800 Mina Almasry <almasrymina@google.com> wrote:
>
> > These counters will track hugetlb reservations rather than hugetlb
> > memory faulted in. This patch only adds the counter, following patches
> > add the charging and uncharging of the counter.
>
> We're still pretty thin on review here, but as it's v12 and Mike
> appears to be signed up to look at this work, I'll add them to -next to
> help move things forward.
>

Hi Andrew,

Since the patches were merged into -next there have been build fixes
and test fixes and some review comments. Would you like me to submit
*new* patches to address these, or would you like me to squash the
fixes into my existing patch series and submit another iteration of
the patch series?
Andrew Morton Feb. 19, 2020, 9:06 p.m. UTC | #11
On Wed, 19 Feb 2020 11:05:41 -0800 Mina Almasry <almasrymina@google.com> wrote:

> On Tue, Feb 11, 2020 at 3:19 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Tue, 11 Feb 2020 13:31:20 -0800 Mina Almasry <almasrymina@google.com> wrote:
> >
> > > These counters will track hugetlb reservations rather than hugetlb
> > > memory faulted in. This patch only adds the counter, following patches
> > > add the charging and uncharging of the counter.
> >
> > We're still pretty thin on review here, but as it's v12 and Mike
> > appears to be signed up to look at this work, I'll add them to -next to
> > help move things forward.
> >
> 
> Hi Andrew,
> 
> Since the patches were merged into -next there have been build fixes
> and test fixes and some review comments. Would you like me to submit
> *new* patches to address these, or would you like me to squash the
> fixes into my existing patch series and submit another iteration of
> the patch series?

What you did worked OK ;)

Please check the end result next time I release a kernel.
Mina Almasry Feb. 20, 2020, 7:22 p.m. UTC | #12
On Wed, Feb 19, 2020 at 1:06 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed, 19 Feb 2020 11:05:41 -0800 Mina Almasry <almasrymina@google.com> wrote:
>
> > On Tue, Feb 11, 2020 at 3:19 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > On Tue, 11 Feb 2020 13:31:20 -0800 Mina Almasry <almasrymina@google.com> wrote:
> > >
> > > > These counters will track hugetlb reservations rather than hugetlb
> > > > memory faulted in. This patch only adds the counter, following patches
> > > > add the charging and uncharging of the counter.
> > >
> > > We're still pretty thin on review here, but as it's v12 and Mike
> > > appears to be signed up to look at this work, I'll add them to -next to
> > > help move things forward.
> > >
> >
> > Hi Andrew,
> >
> > Since the patches were merged into -next there have been build fixes
> > and test fixes and some review comments. Would you like me to submit
> > *new* patches to address these, or would you like me to squash the
> > fixes into my existing patch series and submit another iteration of
> > the patch series?
>
> What you did worked OK ;)
>
> Please check the end result next time I release a kernel.

Thanks Andrew! Things definitely moved along after the patchseries got
into -next :D

By my count I think all my patches outside of the tests patch have
been acked or reviewed. When you have a chance I have a couple of
questions:

1. For the non-tests patch, anything pending on those preventing
eventual submission to linus's tree?
2. For the tests patch, I only have a Tested-by from Sandipan. Is that
good enough? If the worst comes to worst and I don't get a review on
that patch I would rather (if possible) that 'tests' patch can be
dropped while I nag folks for a review, rather than block submission
of the entire patch series. I ask because it's been out for review for
some time and it's the one I got least discussion on so I'm not sure
I'll have a review by the time it's needed.

Thanks again!
Andrew Morton Feb. 21, 2020, 12:28 a.m. UTC | #13
On Thu, 20 Feb 2020 11:22:58 -0800 Mina Almasry <almasrymina@google.com> wrote:

> On Wed, Feb 19, 2020 at 1:06 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Wed, 19 Feb 2020 11:05:41 -0800 Mina Almasry <almasrymina@google.com> wrote:
> >
> > > On Tue, Feb 11, 2020 at 3:19 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > > >
> > > > On Tue, 11 Feb 2020 13:31:20 -0800 Mina Almasry <almasrymina@google.com> wrote:
> > > >
> > > > > These counters will track hugetlb reservations rather than hugetlb
> > > > > memory faulted in. This patch only adds the counter, following patches
> > > > > add the charging and uncharging of the counter.
> > > >
> > > > We're still pretty thin on review here, but as it's v12 and Mike
> > > > appears to be signed up to look at this work, I'll add them to -next to
> > > > help move things forward.
> > > >
> > >
> > > Hi Andrew,
> > >
> > > Since the patches were merged into -next there have been build fixes
> > > and test fixes and some review comments. Would you like me to submit
> > > *new* patches to address these, or would you like me to squash the
> > > fixes into my existing patch series and submit another iteration of
> > > the patch series?
> >
> > What you did worked OK ;)
> >
> > Please check the end result next time I release a kernel.
> 
> Thanks Andrew! Things definitely moved along after the patchseries got
> into -next :D
> 
> By my count I think all my patches outside of the tests patch have
> been acked or reviewed. When you have a chance I have a couple of
> questions:
> 
> 1. For the non-tests patch, anything pending on those preventing
> eventual submission to linus's tree?
> 2. For the tests patch, I only have a Tested-by from Sandipan. Is that
> good enough? If the worst comes to worst and I don't get a review on
> that patch I would rather (if possible) that 'tests' patch can be
> dropped while I nag folks for a review, rather than block submission
> of the entire patch series. I ask because it's been out for review for
> some time and it's the one I got least discussion on so I'm not sure
> I'll have a review by the time it's needed.
> 

It all looks pretty good and I expect we can get everything into
5.7-rc1, unless some issues pop up.

It's unclear to me whether
http://lkml.kernel.org/r/CAHS8izOTipknnYaKz=FdzL-7yW-Z61ck1yPnYWixyMSJuTUYLQ@mail.gmail.com
was going to result in an update?
Mike Kravetz Feb. 21, 2020, 12:41 a.m. UTC | #14
On 2/20/20 4:28 PM, Andrew Morton wrote:
> It all looks pretty good and I expect we can get everything into
> 5.7-rc1, unless some issues pop up.
> 
> It's unclear to me whether
> http://lkml.kernel.org/r/CAHS8izOTipknnYaKz=FdzL-7yW-Z61ck1yPnYWixyMSJuTUYLQ@mail.gmail.com
> was going to result in an update?

Yes there was,

https://lore.kernel.org/linux-mm/20200219233610.13808-1-almasrymina@google.com/

BTW, I've been through the selftest code at a high level.  Not at the level
of detail Shuah may provide.  I am just happy that Mina provided a relatively
comprehensive test for this new functionality.  Will give it an ACK shortly.
Mina Almasry Feb. 21, 2020, 1:52 a.m. UTC | #15
On Thu, Feb 20, 2020 at 4:41 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 2/20/20 4:28 PM, Andrew Morton wrote:
> > It all looks pretty good and I expect we can get everything into
> > 5.7-rc1, unless some issues pop up.
> >

Awesome! I'll be on the lookout for issues.

Folks that reviewed (and especially Mike), thank you so much for the
patient reviews!

> > It's unclear to me whether
> > http://lkml.kernel.org/r/CAHS8izOTipknnYaKz=FdzL-7yW-Z61ck1yPnYWixyMSJuTUYLQ@mail.gmail.com
> > was going to result in an update?
>
> Yes there was,
>
> https://lore.kernel.org/linux-mm/20200219233610.13808-1-almasrymina@google.com/
>
> BTW, I've been through the selftest code at a high level.  Not at the level
> of detail Shuah may provide.  I am just happy that Mina provided a relatively
> comprehensive test for this new functionality.  Will give it an ACK shortly.
> --
> Mike Kravetz
Mina Almasry Feb. 21, 2020, 8:19 p.m. UTC | #16
On Wed, Feb 19, 2020 at 1:06 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed, 19 Feb 2020 11:05:41 -0800 Mina Almasry <almasrymina@google.com> wrote:
>
> > On Tue, Feb 11, 2020 at 3:19 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > On Tue, 11 Feb 2020 13:31:20 -0800 Mina Almasry <almasrymina@google.com> wrote:
> > >
> > > > These counters will track hugetlb reservations rather than hugetlb
> > > > memory faulted in. This patch only adds the counter, following patches
> > > > add the charging and uncharging of the counter.
> > >
> > > We're still pretty thin on review here, but as it's v12 and Mike
> > > appears to be signed up to look at this work, I'll add them to -next to
> > > help move things forward.
> > >
> >
> > Hi Andrew,
> >
> > Since the patches were merged into -next there have been build fixes
> > and test fixes and some review comments. Would you like me to submit
> > *new* patches to address these, or would you like me to squash the
> > fixes into my existing patch series and submit another iteration of
> > the patch series?
>
> What you did worked OK ;)
>
> Please check the end result next time I release a kernel.

Hey Andrew,

Thanks for taking in the patset and fixes. Only pending change in the
latest -next tree is this one:
https://lore.kernel.org/linux-mm/20200219233610.13808-1-almasrymina@google.com/

It's reviewed by Mike here:
https://lore.kernel.org/linux-mm/a0d7b8e1-cb43-3b43-68c3-55631f2ce199@oracle.com/
diff mbox series

Patch

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 1e897e4168ac1..dea6143aa0685 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -432,8 +432,8 @@  struct hstate {
 	unsigned int surplus_huge_pages_node[MAX_NUMNODES];
 #ifdef CONFIG_CGROUP_HUGETLB
 	/* cgroup control files */
-	struct cftype cgroup_files_dfl[5];
-	struct cftype cgroup_files_legacy[5];
+	struct cftype cgroup_files_dfl[7];
+	struct cftype cgroup_files_legacy[9];
 #endif
 	char name[HSTATE_NAME_LEN];
 };
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index e434b05416c68..08b2adcdb5c1c 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -36,6 +36,11 @@  struct hugetlb_cgroup {
 	 */
 	struct page_counter hugepage[HUGE_MAX_HSTATE];

+	/*
+	 * the counter to account for hugepage reservations from hugetlb.
+	 */
+	struct page_counter rsvd_hugepage[HUGE_MAX_HSTATE];
+
 	atomic_long_t events[HUGE_MAX_HSTATE][HUGETLB_NR_MEMORY_EVENTS];
 	atomic_long_t events_local[HUGE_MAX_HSTATE][HUGETLB_NR_MEMORY_EVENTS];

@@ -55,6 +60,15 @@  struct hugetlb_cgroup {

 static struct hugetlb_cgroup *root_h_cgroup __read_mostly;

+static inline struct page_counter *
+hugetlb_cgroup_counter_from_cgroup(struct hugetlb_cgroup *h_cg, int idx,
+				   bool rsvd)
+{
+	if (rsvd)
+		return &h_cg->rsvd_hugepage[idx];
+	return &h_cg->hugepage[idx];
+}
+
 static inline
 struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s)
 {
@@ -295,28 +309,42 @@  void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,

 enum {
 	RES_USAGE,
+	RES_RSVD_USAGE,
 	RES_LIMIT,
+	RES_RSVD_LIMIT,
 	RES_MAX_USAGE,
+	RES_RSVD_MAX_USAGE,
 	RES_FAILCNT,
+	RES_RSVD_FAILCNT,
 };

 static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
 				   struct cftype *cft)
 {
 	struct page_counter *counter;
+	struct page_counter *rsvd_counter;
 	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);

 	counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)];
+	rsvd_counter = &h_cg->rsvd_hugepage[MEMFILE_IDX(cft->private)];

 	switch (MEMFILE_ATTR(cft->private)) {
 	case RES_USAGE:
 		return (u64)page_counter_read(counter) * PAGE_SIZE;
+	case RES_RSVD_USAGE:
+		return (u64)page_counter_read(rsvd_counter) * PAGE_SIZE;
 	case RES_LIMIT:
 		return (u64)counter->max * PAGE_SIZE;
+	case RES_RSVD_LIMIT:
+		return (u64)rsvd_counter->max * PAGE_SIZE;
 	case RES_MAX_USAGE:
 		return (u64)counter->watermark * PAGE_SIZE;
+	case RES_RSVD_MAX_USAGE:
+		return (u64)rsvd_counter->watermark * PAGE_SIZE;
 	case RES_FAILCNT:
 		return counter->failcnt;
+	case RES_RSVD_FAILCNT:
+		return rsvd_counter->failcnt;
 	default:
 		BUG();
 	}
@@ -338,10 +366,16 @@  static int hugetlb_cgroup_read_u64_max(struct seq_file *seq, void *v)
 			   1 << huge_page_order(&hstates[idx]));

 	switch (MEMFILE_ATTR(cft->private)) {
+	case RES_RSVD_USAGE:
+		counter = &h_cg->rsvd_hugepage[idx];
+		/* Fall through. */
 	case RES_USAGE:
 		val = (u64)page_counter_read(counter);
 		seq_printf(seq, "%llu\n", val * PAGE_SIZE);
 		break;
+	case RES_RSVD_LIMIT:
+		counter = &h_cg->rsvd_hugepage[idx];
+		/* Fall through. */
 	case RES_LIMIT:
 		val = (u64)counter->max;
 		if (val == limit)
@@ -365,6 +399,7 @@  static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
 	int ret, idx;
 	unsigned long nr_pages;
 	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
+	bool rsvd = false;

 	if (hugetlb_cgroup_is_root(h_cg)) /* Can't set limit on root */
 		return -EINVAL;
@@ -378,9 +413,14 @@  static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
 	nr_pages = round_down(nr_pages, 1 << huge_page_order(&hstates[idx]));

 	switch (MEMFILE_ATTR(of_cft(of)->private)) {
+	case RES_RSVD_LIMIT:
+		rsvd = true;
+		/* Fall through. */
 	case RES_LIMIT:
 		mutex_lock(&hugetlb_limit_mutex);
-		ret = page_counter_set_max(&h_cg->hugepage[idx], nr_pages);
+		ret = page_counter_set_max(
+			hugetlb_cgroup_counter_from_cgroup(h_cg, idx, rsvd),
+			nr_pages);
 		mutex_unlock(&hugetlb_limit_mutex);
 		break;
 	default:
@@ -406,18 +446,25 @@  static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of,
 				    char *buf, size_t nbytes, loff_t off)
 {
 	int ret = 0;
-	struct page_counter *counter;
+	struct page_counter *counter, *rsvd_counter;
 	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));

 	counter = &h_cg->hugepage[MEMFILE_IDX(of_cft(of)->private)];
+	rsvd_counter = &h_cg->rsvd_hugepage[MEMFILE_IDX(of_cft(of)->private)];

 	switch (MEMFILE_ATTR(of_cft(of)->private)) {
 	case RES_MAX_USAGE:
 		page_counter_reset_watermark(counter);
 		break;
+	case RES_RSVD_MAX_USAGE:
+		page_counter_reset_watermark(rsvd_counter);
+		break;
 	case RES_FAILCNT:
 		counter->failcnt = 0;
 		break;
+	case RES_RSVD_FAILCNT:
+		rsvd_counter->failcnt = 0;
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -472,7 +519,7 @@  static void __init __hugetlb_cgroup_file_dfl_init(int idx)
 	struct hstate *h = &hstates[idx];

 	/* format the size */
-	mem_fmt(buf, 32, huge_page_size(h));
+	mem_fmt(buf, sizeof(buf), huge_page_size(h));

 	/* Add the limit file */
 	cft = &h->cgroup_files_dfl[0];
@@ -482,15 +529,30 @@  static void __init __hugetlb_cgroup_file_dfl_init(int idx)
 	cft->write = hugetlb_cgroup_write_dfl;
 	cft->flags = CFTYPE_NOT_ON_ROOT;

-	/* Add the current usage file */
+	/* Add the reservation limit file */
 	cft = &h->cgroup_files_dfl[1];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.rsvd.max", buf);
+	cft->private = MEMFILE_PRIVATE(idx, RES_RSVD_LIMIT);
+	cft->seq_show = hugetlb_cgroup_read_u64_max;
+	cft->write = hugetlb_cgroup_write_dfl;
+	cft->flags = CFTYPE_NOT_ON_ROOT;
+
+	/* Add the current usage file */
+	cft = &h->cgroup_files_dfl[2];
 	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.current", buf);
 	cft->private = MEMFILE_PRIVATE(idx, RES_USAGE);
 	cft->seq_show = hugetlb_cgroup_read_u64_max;
 	cft->flags = CFTYPE_NOT_ON_ROOT;

+	/* Add the current reservation usage file */
+	cft = &h->cgroup_files_dfl[3];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.rsvd.current", buf);
+	cft->private = MEMFILE_PRIVATE(idx, RES_RSVD_USAGE);
+	cft->seq_show = hugetlb_cgroup_read_u64_max;
+	cft->flags = CFTYPE_NOT_ON_ROOT;
+
 	/* Add the events file */
-	cft = &h->cgroup_files_dfl[2];
+	cft = &h->cgroup_files_dfl[4];
 	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.events", buf);
 	cft->private = MEMFILE_PRIVATE(idx, 0);
 	cft->seq_show = hugetlb_events_show;
@@ -498,7 +560,7 @@  static void __init __hugetlb_cgroup_file_dfl_init(int idx)
 	cft->flags = CFTYPE_NOT_ON_ROOT;

 	/* Add the events.local file */
-	cft = &h->cgroup_files_dfl[3];
+	cft = &h->cgroup_files_dfl[5];
 	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.events.local", buf);
 	cft->private = MEMFILE_PRIVATE(idx, 0);
 	cft->seq_show = hugetlb_events_local_show;
@@ -507,7 +569,7 @@  static void __init __hugetlb_cgroup_file_dfl_init(int idx)
 	cft->flags = CFTYPE_NOT_ON_ROOT;

 	/* NULL terminate the last cft */
-	cft = &h->cgroup_files_dfl[4];
+	cft = &h->cgroup_files_dfl[6];
 	memset(cft, 0, sizeof(*cft));

 	WARN_ON(cgroup_add_dfl_cftypes(&hugetlb_cgrp_subsys,
@@ -521,7 +583,7 @@  static void __init __hugetlb_cgroup_file_legacy_init(int idx)
 	struct hstate *h = &hstates[idx];

 	/* format the size */
-	mem_fmt(buf, 32, huge_page_size(h));
+	mem_fmt(buf, sizeof(buf), huge_page_size(h));

 	/* Add the limit file */
 	cft = &h->cgroup_files_legacy[0];
@@ -530,28 +592,55 @@  static void __init __hugetlb_cgroup_file_legacy_init(int idx)
 	cft->read_u64 = hugetlb_cgroup_read_u64;
 	cft->write = hugetlb_cgroup_write_legacy;

-	/* Add the usage file */
+	/* Add the reservation limit file */
 	cft = &h->cgroup_files_legacy[1];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.rsvd.limit_in_bytes", buf);
+	cft->private = MEMFILE_PRIVATE(idx, RES_RSVD_LIMIT);
+	cft->read_u64 = hugetlb_cgroup_read_u64;
+	cft->write = hugetlb_cgroup_write_legacy;
+
+	/* Add the usage file */
+	cft = &h->cgroup_files_legacy[2];
 	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.usage_in_bytes", buf);
 	cft->private = MEMFILE_PRIVATE(idx, RES_USAGE);
 	cft->read_u64 = hugetlb_cgroup_read_u64;

+	/* Add the reservation usage file */
+	cft = &h->cgroup_files_legacy[3];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.rsvd.usage_in_bytes", buf);
+	cft->private = MEMFILE_PRIVATE(idx, RES_RSVD_USAGE);
+	cft->read_u64 = hugetlb_cgroup_read_u64;
+
 	/* Add the MAX usage file */
-	cft = &h->cgroup_files_legacy[2];
+	cft = &h->cgroup_files_legacy[4];
 	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.max_usage_in_bytes", buf);
 	cft->private = MEMFILE_PRIVATE(idx, RES_MAX_USAGE);
 	cft->write = hugetlb_cgroup_reset;
 	cft->read_u64 = hugetlb_cgroup_read_u64;

+	/* Add the MAX reservation usage file */
+	cft = &h->cgroup_files_legacy[5];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.rsvd.max_usage_in_bytes", buf);
+	cft->private = MEMFILE_PRIVATE(idx, RES_RSVD_MAX_USAGE);
+	cft->write = hugetlb_cgroup_reset;
+	cft->read_u64 = hugetlb_cgroup_read_u64;
+
 	/* Add the failcntfile */
-	cft = &h->cgroup_files_legacy[3];
+	cft = &h->cgroup_files_legacy[6];
 	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.failcnt", buf);
-	cft->private  = MEMFILE_PRIVATE(idx, RES_FAILCNT);
+	cft->private = MEMFILE_PRIVATE(idx, RES_FAILCNT);
+	cft->write = hugetlb_cgroup_reset;
+	cft->read_u64 = hugetlb_cgroup_read_u64;
+
+	/* Add the reservation failcntfile */
+	cft = &h->cgroup_files_legacy[7];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.rsvd.failcnt", buf);
+	cft->private = MEMFILE_PRIVATE(idx, RES_RSVD_FAILCNT);
 	cft->write = hugetlb_cgroup_reset;
 	cft->read_u64 = hugetlb_cgroup_read_u64;

 	/* NULL terminate the last cft */
-	cft = &h->cgroup_files_legacy[4];
+	cft = &h->cgroup_files_legacy[8];
 	memset(cft, 0, sizeof(*cft));

 	WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys,