[00/19,V2] mm: memcontrol: charge swapin pages on instantiation

Message ID 20200508183105.225460-1-hannes@cmpxchg.org

Message

Johannes Weiner May 8, 2020, 6:30 p.m. UTC
This patch series reworks memcg to charge swapin pages directly at
swapin time, rather than at fault time, which may be much later, or
not happen at all.

Changes in version 2:
- prevent double charges on pre-allocated hugepages in khugepaged
- leave shmem swapcache when charging fails to avoid double IO (Joonsoo)
- fix temporary accounting bug by switching rmap<->commit (Joonsoo)
- fix double swap charge bug in cgroup1/cgroup2 code gating
- simplify swapin error checking (Joonsoo)
- mm: memcontrol: document the new swap control behavior (Alex)
- review tags

The delayed swapin charging scheme we have right now causes problems:

- Alex's per-cgroup lru_lock patches rely on pages that have been
  isolated from the LRU to have a stable page->mem_cgroup; otherwise
  the lock may change underneath him. Swapcache pages are charged only
  after they are added to the LRU, and charging doesn't follow the LRU
  isolation protocol.

- Joonsoo's anon workingset patches need a suitable LRU at the time
  the page enters the swap cache and displaces the non-resident
  info. But the correct LRU is only available after charging.

- It's a containment hole / DoS vector. Users can trigger arbitrarily
  large swap readahead using MADV_WILLNEED. The memory is never
  charged unless somebody actually touches it.

- It complicates the page->mem_cgroup stabilization rules.

In order to charge pages directly at swapin time, the memcg code base
needs to be prepared, and several overdue cleanups become a necessity:

To charge pages at swapin time, we need to always have cgroup
ownership tracking of swap records. We also cannot rely on
page->mapping to tell apart page types at charge time, because that's
only set up during a page fault.
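
As a sketch of what that ownership tracking buys the charge path
(illustrative only; lookup_swap_cgroup_id() and mem_cgroup_from_id()
are existing helpers, but the wrapper function and its fallback policy
are made up for this example):

  /* Sketch: find the memcg that owned a swap entry, so the page can
   * be charged back to it at swapin time. */
  static struct mem_cgroup *swapin_charge_target(swp_entry_t entry)
  {
          struct mem_cgroup *memcg;
          unsigned short id;

          /* per-entry record written when the page went to swap */
          id = lookup_swap_cgroup_id(entry);

          rcu_read_lock();
          memcg = mem_cgroup_from_id(id);
          if (memcg && !css_tryget_online(&memcg->css))
                  memcg = NULL;   /* owning group went away */
          rcu_read_unlock();

          /* NULL: fall back to charging the faulting task's memcg */
          return memcg;
  }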

To eliminate the page->mapping dependency, memcg needs to ditch its
private page type counters (MEMCG_CACHE, MEMCG_RSS, NR_SHMEM) in favor
of the generic vmstat counters and accounting sites, such as
NR_FILE_PAGES, NR_ANON_MAPPED etc.

To switch to generic vmstat counters, the charge sequence must be
adjusted such that page->mem_cgroup is set up by the time these
counters are modified.
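
For a newly faulted anonymous page, the reordering looks roughly like
this (sketch only, not the actual diff; the commit/rmap helpers on the
"old" side are the current API, while the single charge call on the
"new" side is an assumed name for the entry point this series adds):

  /* old: rmap bumps counters before the charge is committed, so the
   * accounting site cannot see the page's memcg yet
   * (mem_cgroup_try_charge() earlier only reserved the charge) */
  page_add_new_anon_rmap(page, vma, addr, false);
  mem_cgroup_commit_charge(page, memcg, false, false);
  lru_cache_add_active_or_unevictable(page, vma);

  /* new: charge first, so page->mem_cgroup is valid by the time the
   * generic accounting site in rmap bumps per-memcg NR_ANON_MAPPED */
  mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL);  /* assumed name */
  page_add_new_anon_rmap(page, vma, addr, false);
  lru_cache_add_active_or_unevictable(page, vma);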

The series is structured as follows:

1. Bug fixes
2. Decoupling charging from rmap
3. Swap controller integration into memcg
4. Direct swapin charging

Based on v5.7-rc3-mmots-2020-05-01-20-14.

 Documentation/admin-guide/cgroup-v1/memory.rst |  19 +-
 include/linux/memcontrol.h                     |  53 +--
 include/linux/mm.h                             |   4 +-
 include/linux/swap.h                           |   6 +-
 init/Kconfig                                   |  17 +-
 kernel/events/uprobes.c                        |  10 +-
 mm/filemap.c                                   |  43 +--
 mm/huge_memory.c                               |  13 +-
 mm/khugepaged.c                                |  29 +-
 mm/memcontrol.c                                | 449 +++++++----------------
 mm/memory.c                                    |  51 +--
 mm/migrate.c                                   |  20 +-
 mm/rmap.c                                      |  53 +--
 mm/shmem.c                                     | 108 +++---
 mm/swap_cgroup.c                               |   6 -
 mm/swap_state.c                                |  89 ++---
 mm/swapfile.c                                  |  25 +-
 mm/userfaultfd.c                               |   5 +-
 18 files changed, 367 insertions(+), 633 deletions(-)

Comments

Education Directorate May 13, 2020, 11:30 a.m. UTC | #1
On Fri, May 08, 2020 at 02:30:47PM -0400, Johannes Weiner wrote:
> This patch series reworks memcg to charge swapin pages directly at
> swapin time, rather than at fault time, which may be much later, or
> not happen at all.
> 
> Changes in version 2:
> - prevent double charges on pre-allocated hugepages in khugepaged
> - leave shmem swapcache when charging fails to avoid double IO (Joonsoo)
> - fix temporary accounting bug by switching rmap<->commit (Joonsoo)
> - fix double swap charge bug in cgroup1/cgroup2 code gating
> - simplify swapin error checking (Joonsoo)
> - mm: memcontrol: document the new swap control behavior (Alex)
> - review tags
> 
> The delayed swapin charging scheme we have right now causes problems:
> 
> - Alex's per-cgroup lru_lock patches rely on pages that have been
>   isolated from the LRU to have a stable page->mem_cgroup; otherwise
>   the lock may change underneath him. Swapcache pages are charged only
>   after they are added to the LRU, and charging doesn't follow the LRU
>   isolation protocol.
> 
> - Joonsoo's anon workingset patches need a suitable LRU at the time
>   the page enters the swap cache and displaces the non-resident
>   info. But the correct LRU is only available after charging.
> 
> - It's a containment hole / DoS vector. Users can trigger arbitrarily
>   large swap readahead using MADV_WILLNEED. The memory is never
>   charged unless somebody actually touches it.
> 
> - It complicates the page->mem_cgroup stabilization rules.
> 
> In order to charge pages directly at swapin time, the memcg code base
> needs to be prepared, and several overdue cleanups become a necessity:
> 
> To charge pages at swapin time, we need to always have cgroup
> ownership tracking of swap records. We also cannot rely on
> page->mapping to tell apart page types at charge time, because that's
> only set up during a page fault.
> 
> To eliminate the page->mapping dependency, memcg needs to ditch its
> private page type counters (MEMCG_CACHE, MEMCG_RSS, NR_SHMEM) in favor
> of the generic vmstat counters and accounting sites, such as
> NR_FILE_PAGES, NR_ANON_MAPPED etc.

Could you elaborate on what this means and the implications of this for
user space programs?

> 
> To switch to generic vmstat counters, the charge sequence must be
> adjusted such that page->mem_cgroup is set up by the time these
> counters are modified.
> 
> The series is structured as follows:
> 
> 1. Bug fixes
> 2. Decoupling charging from rmap
> 3. Swap controller integration into memcg
> 4. Direct swapin charging
>

Thanks,
Balbir Singh.
Johannes Weiner May 13, 2020, 12:35 p.m. UTC | #2
Hello Balbir!

On Wed, May 13, 2020 at 11:30:32AM +0000, Balbir Singh wrote:
> On Fri, May 08, 2020 at 02:30:47PM -0400, Johannes Weiner wrote:
> > To eliminate the page->mapping dependency, memcg needs to ditch its
> > private page type counters (MEMCG_CACHE, MEMCG_RSS, NR_SHMEM) in favor
> > of the generic vmstat counters and accounting sites, such as
> > NR_FILE_PAGES, NR_ANON_MAPPED etc.
> 
> Could you elaborate on what this means and the implications of this for
> user space programs?

This has no bearing on userspace. It's just simplifying how
memory.stat is implemented. The output is the same.

For the full story:

In the past, memcg has done its own accounting to produce a breakdown
of consumers in memory.stat. When a page was charged, we relied on
knowing whether it's a file, anon or shmem page, and had our own
MEMCG_RSS, MEMCG_CACHE and NR_SHMEM counters.

As the general VM code already does this type of classification to
produce /proc/vmstat, this meant unnecessary duplication: more places
to bump counters, more places that have to make sure the page state is
stable in all the right ways, more dependencies on when it's safe to
call the charge and the uncharge callbacks.

A while ago we added per-cgroup arrays of the vmstat counters and a
cgroup-aware accounting callback (mod_lruvec_state) that can be a
drop-in replacement for the generic VM code (mod_node_state and
friends). We already had some counters converted over to that.

These patches just do more of that conversion from private memcg
accounting to having callbacks into generic VM accounting sites.

Instead of testing PageAnon() and accounting MEMCG_CACHE/MEMCG_RSS in
the charge code, we switch __add_to_page_cache_locked() and
page_add_new_anon_rmap() to the cgroup-aware mod_lruvec_page_state()
to bump our per-cgroup NR_FILE_PAGES and NR_ANON_MAPPED counters along
with the node and global counters.
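
Schematically, the conversion at such an accounting site looks like
this (simplified sketch rather than an actual hunk;
mod_node_page_state() stands in for the node-level "mod_node_state and
friends" mentioned above):

  /* before: the generic site updates only node/global counters, and
   * the memcg charge path separately bumps its private MEMCG_RSS */
  mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, nr);

  /* after: one cgroup-aware call from the same site updates the
   * per-memcg, per-node and global NR_ANON_MAPPED counters together,
   * using the page->mem_cgroup set up by the earlier charge */
  mod_lruvec_page_state(page, NR_ANON_MAPPED, nr);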

As a result, the memcg gets a breakdown for memory.stat without having
to have private knowledge of what a page cache page is - how to test
it, when it's safe to test it, whether there can be huge pages in the
page cache, etc. pp. Memcg can focus on counting bytes, and the VM
code that is specialized in dealing with the page cache (or anon
pages, or shmem pages) can fill in those kinds of details for us.

Fewer dependencies, less duplication, simpler API rules.

The memory.stat output is the same, it's just much simpler code.
Education Directorate May 14, 2020, 11:04 a.m. UTC | #3
On 13/5/20 10:35 pm, Johannes Weiner wrote:
> As a result, the memcg gets a breakdown for memory.stat without having
> to have private knowledge of what a page cache page is - how to test
> it, when it's safe to test it, whether there can be huge pages in the
> page cache, etc. pp. Memcg can focus on counting bytes, and the VM
> code that is specialized in dealing with the page cache (or anon
> pages, or shmem pages) can fill in those kinds of details for us.
> 
> Fewer dependencies, less duplication, simpler API rules.
> 
> The memory.stat output is the same, it's just much simpler code.

Makes sense! Thanks, I should spend some time to re-read all of the memcontrol.c code :)

Balbir Singh.