[RFC,0/2] memcg: add nomlock to avoid folios being mlocked in a memcg

Message ID

20241215073415.88961-1-laoar.shao@gmail.com (mailing list archive)

Headers

From: Yafang Shao <laoar.shao@gmail.com>
To: hannes@cmpxchg.org,
	mhocko@kernel.org,
	roman.gushchin@linux.dev,
	shakeel.butt@linux.dev,
	muchun.song@linux.dev,
	akpm@linux-foundation.org
Cc: linux-mm@kvack.org,
	Yafang Shao <laoar.shao@gmail.com>
Subject: [RFC PATCH 0/2] memcg: add nomlock to avoid folios beling mlocked in
 a memcg
Date: Sun, 15 Dec 2024 15:34:13 +0800
Message-Id: <20241215073415.88961-1-laoar.shao@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Series

memcg: add nomlock to avoid folios being mlocked in a memcg

Message

Yafang Shao Dec. 15, 2024, 7:34 a.m. UTC

The Use Case
============

We have a scenario where multiple services (cgroups) may share the same
file cache, as illustrated below:

    download-proxy       application
                   \    /
         /shared_path/shared_files

When the application needs specific types of files, it sends an RPC request
to the download-proxy. The download-proxy then downloads the files to
shared paths, after which the application reads these shared files. All
disk I/O operations are performed using buffered I/O.

The reason for using buffered I/O, rather than direct I/O, is that the
download-proxy itself may also read these shared files. This is because it
serves as a peer-to-peer (P2P) service:

   download-proxy of server1    <- P2P ->    download-proxy of server2

   /shared_path/shared_files                 /shared_path/shared_files

The Problem
===========

Applications reading these shared files may use mlock to pin the files in
memory for performance reasons. However, the shared file cache is charged
to the memory cgroup of the download-proxy during the download or P2P
process. Consequently, the page cache pages of the shared files might be
mlocked within the download-proxy's memcg, as shown:

    download-proxy     application
          |            /
        (charged)    (mlocked)
          |         /
    pagecache pages
           \
            \
          /shared_path/shared_files

Because mlocked pages are unevictable, reclaim within the download-proxy's
memcg cannot free them. As a result, that memcg frequently reaches its
memory limit, which can trigger OOM events. This behavior is undesirable.
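
For readers less familiar with the application side, the following is a
minimal userspace sketch of how a shared file typically ends up mlocked.
The helper name map_and_lock_file() is hypothetical and is not taken from
our services. It only uses the standard open/fstat/mmap/mlock calls. Pages
that are already in the page cache keep their existing memcg charge, so in
the scenario above they stay charged to the download-proxy even though the
application is the one locking them.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: map a file read-only and mlock the whole mapping.
 * Returns the mapping address (length stored in *len_out), or NULL on
 * any failure, including an empty file.
 */
void *map_and_lock_file(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    size_t len = (size_t)st.st_size;
    void *addr = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd); /* the mapping keeps its own reference to the file */
    if (addr == MAP_FAILED)
        return NULL;

    /* mlock() faults the range in and marks the VMA VM_LOCKED. */
    if (mlock(addr, len) != 0) {
        munmap(addr, len);
        return NULL;
    }

    *len_out = len;
    return addr;
}
```

A caller would release the range with munlock() followed by munmap() once
the file is no longer needed.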

The Solution
============

To address this, we propose introducing a new cgroup file, memory.nomlock,
which prevents page cache pages from being mlocked in a specific memcg when
set to 1.
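
As a rough illustration of how the knob would be used, the sketch below
writes a value to a cgroup control file. The helper cgroup_write() and the
cgroup path in the usage note are assumptions for illustration only, and
memory.nomlock exists only on a kernel with this RFC applied.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: write a string to an existing control file inside
 * a cgroup directory. Returns 0 on success and -1 on any failure.
 */
int cgroup_write(const char *cgroup_dir, const char *file, const char *value)
{
    char path[4096];
    int n = snprintf(path, sizeof(path), "%s/%s", cgroup_dir, file);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    /* Control files already exist, so no O_CREAT here. */
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    size_t len = strlen(value);
    ssize_t written = write(fd, value, len);
    close(fd);

    return written == (ssize_t)len ? 0 : -1;
}
```

For example, assuming the download-proxy runs in a cgroup mounted at
/sys/fs/cgroup/download-proxy (a made-up path), the setting would be
enabled with cgroup_write("/sys/fs/cgroup/download-proxy",
"memory.nomlock", "1").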

Implementation Options
----------------------

- Solution A: Allow file caches on the unevictable list to become
  reclaimable. 
  This approach would require significant refactoring of the page reclaim
  logic.

- Solution B: Prevent file caches from being moved to the unevictable list
  during mlock and ignore the VM_LOCKED flag during page reclaim.
  This is a more straightforward solution and is the one we have chosen.
  If the file caches are reclaimed from the download-proxy's memcg and
  subsequently accessed by tasks in the application’s memcg, a filemap
  fault will occur. A new file cache will be faulted in, charged to the
  application’s memcg, and locked there.
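
The two decisions behind Solution B can be summarized with the toy model
below. It is plain userspace C, not the code from this series: the struct
and function names are invented for illustration, and limiting the effect
to file-backed folios is an assumption based on the description above.

```c
#include <stdbool.h>

/* Toy stand-in for the state that matters to Solution B. */
struct folio_model {
    bool file_backed;   /* page cache folio rather than anonymous memory */
    bool memcg_nomlock; /* memory.nomlock of the memcg the folio is charged to */
};

/* mlock path: would the folio be moved to the unevictable list? */
bool model_mlock_moves_to_unevictable(const struct folio_model *f)
{
    return !(f->file_backed && f->memcg_nomlock);
}

/* reclaim path: does a VM_LOCKED mapping still protect the folio? */
bool model_reclaim_honors_vm_locked(const struct folio_model *f,
                                    bool vma_locked)
{
    if (f->file_backed && f->memcg_nomlock)
        return false; /* VM_LOCKED is ignored, the folio stays reclaimable */
    return vma_locked;
}
```

In this model a folio charged to the download-proxy's memcg (nomlock set)
is never protected by the application's mlock, while the copy that is later
faulted into the application's memcg (nomlock clear) is locked as usual.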

Current Limitations
===================

This solution is in its early stages and has the following limitations:

- Timing Dependency:
  memory.nomlock must be set before file caches are moved to the
  unevictable list. File caches that are already on the unevictable list
  when it is set cannot be reclaimed.

- Metrics Inaccuracy:
  The "unevictable" metric in memory.stat and the "Mlocked" metric in
  /proc/meminfo may not be reliable. However, these metrics are already
  affected by the use of large folios.
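
To watch these two counters while testing, something like the sketch below
could be used. The helper read_stat_value() is hypothetical. It relies only
on the line formats of the two files: "unevictable <bytes>" in memory.stat
and "Mlocked: <n> kB" in /proc/meminfo.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: find the line starting with `key` in a "key value"
 * or "key: value" style file and store the first number after it in *out.
 * Returns 0 on success and -1 if the file or the key is missing.
 */
int read_stat_value(const char *path, const char *key,
                    unsigned long long *out)
{
    FILE *fp = fopen(path, "r");
    if (!fp)
        return -1;

    char line[256];
    size_t klen = strlen(key);
    int ret = -1;

    while (fgets(line, sizeof(line), fp)) {
        if (strncmp(line, key, klen) != 0)
            continue;
        /* Require a separator so "unevictable" does not match a longer key. */
        char sep = line[klen];
        if (sep != ' ' && sep != ':' && sep != '\t')
            continue;
        if (sscanf(line + klen + 1, "%llu", out) == 1) {
            ret = 0;
            break;
        }
    }

    fclose(fp);
    return ret;
}
```

For example, read_stat_value("/proc/meminfo", "Mlocked", &kb) yields the
system-wide value in kB, and the same call on a memcg's memory.stat with
the key "unevictable" yields that memcg's value in bytes.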

If this solution is deemed acceptable, I will proceed with refining the
implementation and addressing these limitations.

Yafang Shao (2):
  mm/memcontrol: add a new cgroup file memory.nomlock
  mm: Add support for nomlock to avoid folios beling mlocked in a memcg

 include/linux/memcontrol.h |  3 +++
 mm/memcontrol.c            | 35 +++++++++++++++++++++++++++++++++++
 mm/mlock.c                 |  9 +++++++++
 mm/rmap.c                  |  8 +++++++-
 mm/vmscan.c                |  5 +++++
 5 files changed, 59 insertions(+), 1 deletion(-)

Comments

Michal Hocko Dec. 20, 2024, 10:23 a.m. UTC | #1

On Sun 15-12-24 15:34:13, Yafang Shao wrote:
> Implementation Options
> ----------------------
> 
> - Solution A: Allow file caches on the unevictable list to become
>   reclaimable. 
>   This approach would require significant refactoring of the page reclaim
>   logic.
> 
> - Solution B: Prevent file caches from being moved to the unevictable list
>   during mlock and ignore the VM_LOCKED flag during page reclaim.
>   This is a more straightforward solution and is the one we have chosen.
>   If the file caches are reclaimed from the download-proxy's memcg and
>   subsequently accessed by tasks in the application’s memcg, a filemap
>   fault will occur. A new file cache will be faulted in, charged to the
>   application’s memcg, and locked there.

Both options are silently breaking userspace because a non failing mlock
doesn't give guarantees it is supposed to AFAICS. So unless I am missing
something really important I do not think this is an acceptable memcg
extension.

Yafang Shao Dec. 20, 2024, 11:52 a.m. UTC | #2

On Fri, Dec 20, 2024 at 6:23 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Sun 15-12-24 15:34:13, Yafang Shao wrote:
> > Implementation Options
> > ----------------------
> >
> > - Solution A: Allow file caches on the unevictable list to become
> >   reclaimable.
> >   This approach would require significant refactoring of the page reclaim
> >   logic.
> >
> > - Solution B: Prevent file caches from being moved to the unevictable list
> >   during mlock and ignore the VM_LOCKED flag during page reclaim.
> >   This is a more straightforward solution and is the one we have chosen.
> >   If the file caches are reclaimed from the download-proxy's memcg and
> >   subsequently accessed by tasks in the application’s memcg, a filemap
> >   fault will occur. A new file cache will be faulted in, charged to the
> >   application’s memcg, and locked there.
>
> Both options are silently breaking userspace because a non failing mlock
> doesn't give guarantees it is supposed to AFAICS.

It does not bypass the mlock mechanism; rather, it defers the actual
locking operation to the page fault path. Could you clarify what you
mean by "a non-failing mlock"? From what I can see, mlock can indeed
fail if there isn’t sufficient memory available. With this change, we
are simply shifting the potential failure point to the page fault path
instead.

Michal Hocko Dec. 21, 2024, 7:21 a.m. UTC | #3

On Fri 20-12-24 19:52:16, Yafang Shao wrote:
> On Fri, Dec 20, 2024 at 6:23 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Sun 15-12-24 15:34:13, Yafang Shao wrote:
> > > Implementation Options
> > > ----------------------
> > >
> > > - Solution A: Allow file caches on the unevictable list to become
> > >   reclaimable.
> > >   This approach would require significant refactoring of the page reclaim
> > >   logic.
> > >
> > > - Solution B: Prevent file caches from being moved to the unevictable list
> > >   during mlock and ignore the VM_LOCKED flag during page reclaim.
> > >   This is a more straightforward solution and is the one we have chosen.
> > >   If the file caches are reclaimed from the download-proxy's memcg and
> > >   subsequently accessed by tasks in the application’s memcg, a filemap
> > >   fault will occur. A new file cache will be faulted in, charged to the
> > >   application’s memcg, and locked there.
> >
> > Both options are silently breaking userspace because a non failing mlock
> > doesn't give guarantees it is supposed to AFAICS.
> 
> It does not bypass the mlock mechanism; rather, it defers the actual
> locking operation to the page fault path. Could you clarify what you
> mean by "a non-failing mlock"? From what I can see, mlock can indeed
> fail if there isn’t sufficient memory available. With this change, we
> are simply shifting the potential failure point to the page fault path
> instead.

Your change will cause mlocked pages (as mlock syscall returns success)
to be reclaimable later on. That breaks the basic mlock contract.

Yafang Shao Dec. 22, 2024, 2:34 a.m. UTC | #4

On Sat, Dec 21, 2024 at 3:21 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 20-12-24 19:52:16, Yafang Shao wrote:
> > On Fri, Dec 20, 2024 at 6:23 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Sun 15-12-24 15:34:13, Yafang Shao wrote:
> > > > Implementation Options
> > > > ----------------------
> > > >
> > > > - Solution A: Allow file caches on the unevictable list to become
> > > >   reclaimable.
> > > >   This approach would require significant refactoring of the page reclaim
> > > >   logic.
> > > >
> > > > - Solution B: Prevent file caches from being moved to the unevictable list
> > > >   during mlock and ignore the VM_LOCKED flag during page reclaim.
> > > >   This is a more straightforward solution and is the one we have chosen.
> > > >   If the file caches are reclaimed from the download-proxy's memcg and
> > > >   subsequently accessed by tasks in the application’s memcg, a filemap
> > > >   fault will occur. A new file cache will be faulted in, charged to the
> > > >   application’s memcg, and locked there.
> > >
> > > Both options are silently breaking userspace because a non failing mlock
> > > doesn't give guarantees it is supposed to AFAICS.
> >
> > It does not bypass the mlock mechanism; rather, it defers the actual
> > locking operation to the page fault path. Could you clarify what you
> > mean by "a non-failing mlock"? From what I can see, mlock can indeed
> > fail if there isn’t sufficient memory available. With this change, we
> > are simply shifting the potential failure point to the page fault path
> > instead.
>
> Your change will cause mlocked pages (as mlock syscall returns success)
> to be reclaimable later on. That breaks the basic mlock contract.

AFAICS, the mlock() behavior was originally designed with only a
single root memory cgroup in mind. In other words, when mlock() was
introduced, all locked pages were confined to the same memcg.

However, this changed with the introduction of memcg support. Now,
mlock() can lock pages that belong to a different memcg than the
current task. This behavior is not explicitly defined in the mlock()
documentation, which could lead to confusion.

To clarify, I propose updating the mlock() documentation as follows:

When memcg is enabled, the page being locked might reside in a
different memcg than the current task. In such cases, the page might
be reclaimed if mlock() is not permitted in its original memcg. If the
locked page is reclaimed, it could be faulted back into the current
task's memcg and then locked again.

Additionally, encountering a single page fault during this process
should be acceptable to most users. If your application cannot
tolerate even a single page fault, you likely wouldn’t enable memcg in
the first place.