
[v3,0/3] shmem: Allow userspace monitoring of tmpfs for lack of space.

Message ID 20220418213713.273050-1-krisman@collabora.com (mailing list archive)

Message

Gabriel Krisman Bertazi April 18, 2022, 9:37 p.m. UTC
The only difference from v2 is applying Viro's comment on how the life of
the sbinfo should now be tied to the kobject.  I hope it is correct the
way I did it.  Tested by mount/umount while holding a reference.

* v2 cover:

The only difference from v1 is addressing Amir's comment about
generating the directory in sysfs using the minor number.

* Original cover letter

When provisioning containerized applications, multiple very small tmpfs
are used, for which one cannot always predict the proper file system
size ahead of time.  We want to be able to reliably monitor filesystems
for ENOSPC errors, without depending on the application being executed
reporting the ENOSPC after a failure.  It is also not enough to watch
statfs since that information might be ephemeral (say the application
recovers by deleting data, the issue can get lost).  For this use case,
it is also interesting to differentiate IO errors caused by lack of
virtual memory from lack of FS space.

This patch exposes two counters on sysfs that log the two conditions
that are interesting to observe for container provisioning.  They are
recorded per tmpfs superblock, and can be polled by a monitoring
application.
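
For illustration, a rough sketch of the monitoring side is below.  The
/sys/fs/tmpfs/<minor>/ layout follows the cover letters, but the minor
number and attribute file names used here are made up for the example;
the real names are in the sysfs-fs-tmpfs ABI document added by the series.

/* Hypothetical monitor: read the per-superblock error counters proposed
 * by this series.  The minor number ("22") and the attribute names below
 * are assumptions for illustration only. */
#include <stdio.h>

static unsigned long read_counter(const char *path)
{
	unsigned long val = 0;
	FILE *f = fopen(path, "r");

	if (f) {
		if (fscanf(f, "%lu", &val) != 1)
			val = 0;
		fclose(f);
	}
	return val;
}

int main(void)
{
	unsigned long nospace = read_counter("/sys/fs/tmpfs/22/space_errors");
	unsigned long noacct = read_counter("/sys/fs/tmpfs/22/accounting_errors");

	printf("space errors: %lu, accounting (memory) errors: %lu\n",
	       nospace, noacct);
	return 0;
}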

I proposed a more general approach [1] using fsnotify, but considering
the specificity of this use case, people agreed that a simpler solution
in sysfs is more than enough.

[1] https://lore.kernel.org/linux-mm/20211116220742.584975-3-krisman@collabora.com/T/#mee338d25b0e1e07cbe0861f9a5ca8cc439b3edb8

To: Hugh Dickins <hughd@google.com>
To: Andrew Morton <akpm@linux-foundation.org>
To: Amir Goldstein <amir73il@gmail.com>
Cc: Khazhismel Kumykov <khazhy@google.com>
Cc: Linux MM <linux-mm@kvack.org>
Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>

Gabriel Krisman Bertazi (3):
  shmem: Keep track of out-of-memory and out-of-space errors
  shmem: Introduce /sys/fs/tmpfs support
  shmem: Expose space and accounting error count

 Documentation/ABI/testing/sysfs-fs-tmpfs | 13 ++++
 include/linux/shmem_fs.h                 |  5 ++
 mm/shmem.c                               | 76 ++++++++++++++++++++++--
 3 files changed, 90 insertions(+), 4 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-fs-tmpfs

Comments

Andrew Morton April 19, 2022, 3:42 a.m. UTC | #1
On Mon, 18 Apr 2022 17:37:10 -0400 Gabriel Krisman Bertazi <krisman@collabora.com> wrote:

> When provisioning containerized applications, multiple very small tmpfs

"files"?

> are used, for which one cannot always predict the proper file system
> size ahead of time.  We want to be able to reliably monitor filesystems
> for ENOSPC errors, without depending on the application being executed
> reporting the ENOSPC after a failure.

Well that sucks.  We need a kernel-side workaround for applications
that fail to check and report storage errors?

We could do this for every syscall in the kernel.  What's special about
tmpfs in this regard?  

Please provide additional justification and usage examples for such an
extraordinary thing.

>  It is also not enough to watch
> statfs since that information might be ephemeral (say the application
> recovers by deleting data, the issue can get lost).

We could fix the apps?  Heck, you could patch libc's write() to the same
effect.

>  For this use case,
> it is also interesting to differentiate IO errors caused by lack of
> virtual memory from lack of FS space.

More details, please.  Why interesting?  What actions can the system
operator take based upon this information?

Whatever that action is, I see no user-facing documentation which
guides the user in how to take advantage of this?
Gabriel Krisman Bertazi April 19, 2022, 3:28 p.m. UTC | #2
Andrew Morton <akpm@linux-foundation.org> writes:

Hi Andrew,

> On Mon, 18 Apr 2022 17:37:10 -0400 Gabriel Krisman Bertazi <krisman@collabora.com> wrote:
>
>> When provisioning containerized applications, multiple very small tmpfs
>
> "files"?

Actually, filesystems.  In cloud environments, we have several small
tmpfs associated with containerized tasks.

>> are used, for which one cannot always predict the proper file system
>> size ahead of time.  We want to be able to reliably monitor filesystems
>> for ENOSPC errors, without depending on the application being executed
>> reporting the ENOSPC after a failure.
>
> Well that sucks.  We need a kernel-side workaround for applications
> that fail to check and report storage errors?
>
> We could do this for every syscall in the kernel.  What's special about
> tmpfs in this regard?
>
> Please provide additional justification and usage examples for such an
> extraordinary thing.

For a cloud provider deploying containerized applications, they might
not control the application, so patching userspace wouldn't be a
solution.  More importantly - and why this is shmem specific -
they want to differentiate between a user getting ENOSPC due to
insufficiently provisioned fs size, vs. due to running out of memory in
a container, both of which return ENOSPC to the process.

A system administrator can then use this feature to monitor a fleet of
containerized applications in a uniform way, detect provisioning issues
caused by different reasons and address the deployment.

I originally submitted this as a new fanotify event, but given the
specificity of shmem, Amir suggested the interface I'm implementing
here.  We've raised this discussion originally here:

https://lore.kernel.org/linux-mm/CACGdZYLLCqzS4VLUHvzYG=rX3SEJaG7Vbs8_Wb_iUVSvXsqkxA@mail.gmail.com/

> Whatever that action is, I see no user-facing documentation which
> guides the user in how to take advantage of this?

I can follow up with a new version with documentation, if we agree this
feature makes sense.

Thanks,
Amir Goldstein April 21, 2022, 5:33 a.m. UTC | #3
On Tue, Apr 19, 2022 at 6:29 PM Gabriel Krisman Bertazi
<krisman@collabora.com> wrote:
>
> Andrew Morton <akpm@linux-foundation.org> writes:
>
> Hi Andrew,
>
> > On Mon, 18 Apr 2022 17:37:10 -0400 Gabriel Krisman Bertazi <krisman@collabora.com> wrote:
> >
> >> When provisioning containerized applications, multiple very small tmpfs
> >
> > "files"?
>
> Actually, filesystems.  In cloud environments, we have several small
> tmpfs associated with containerized tasks.
>
> >> are used, for which one cannot always predict the proper file system
> >> size ahead of time.  We want to be able to reliably monitor filesystems
> >> for ENOSPC errors, without depending on the application being executed
> >> reporting the ENOSPC after a failure.
> >
> > Well that sucks.  We need a kernel-side workaround for applications
> > that fail to check and report storage errors?
> >
> > We could do this for every syscall in the kernel.  What's special about
> > tmpfs in this regard?
> >
> > Please provide additional justification and usage examples for such an
> > extraordinary thing.
>
> For a cloud provider deploying containerized applications, they might
> not control the application, so patching userspace wouldn't be a
> solution.  More importantly - and why this is shmem specific -
> they want to differentiate between a user getting ENOSPC due to
> insufficiently provisioned fs size, vs. due to running out of memory in
> a container, both of which return ENOSPC to the process.
>

Isn't there already a per memcg OOM handler that could be used by
orchestrator to detect the latter?
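
For reference, a rough sketch of what such memcg-based detection could
look like, assuming cgroup v2 and that the orchestrator already knows the
container's cgroup path (both are assumptions for the example):

/* Sketch: check whether a container's memcg has recorded OOM kills by
 * parsing the cgroup v2 memory.events file.  Error handling is minimal
 * and the cgroup path is illustrative only. */
#include <stdio.h>
#include <string.h>

static int memcg_saw_oom(const char *cgroup_dir)
{
	char path[512], key[64];
	unsigned long long val;
	FILE *f;
	int hit = 0;

	snprintf(path, sizeof(path), "%s/memory.events", cgroup_dir);
	f = fopen(path, "r");
	if (!f)
		return -1;
	while (fscanf(f, "%63s %llu", key, &val) == 2)
		if (!strcmp(key, "oom_kill") && val > 0)
			hit = 1;
	fclose(f);
	return hit;
}

int main(void)
{
	printf("oom seen: %d\n", memcg_saw_oom("/sys/fs/cgroup/mycontainer"));
	return 0;
}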

> A system administrator can then use this feature to monitor a fleet of
> containerized applications in a uniform way, detect provisioning issues
> caused by different reasons and address the deployment.
>
> I originally submitted this as a new fanotify event, but given the
> specificity of shmem, Amir suggested the interface I'm implementing
> here.  We've raised this discussion originally here:
>
> https://lore.kernel.org/linux-mm/CACGdZYLLCqzS4VLUHvzYG=rX3SEJaG7Vbs8_Wb_iUVSvXsqkxA@mail.gmail.com/
>

To put things in context, the points I was trying to make in this
discussion are:

1. Why isn't monitoring with statfs() a sufficient solution?  And more
    specifically, the shared disk space provisioning problem does not sound
    very tmpfs specific to me.  It is a well known issue for thin
    provisioned storage in environments with shared resources such as the
    ones you describe.
2. OTOH, exporting internal fs stats via /sys/fs for debugging, health
    monitoring or whatever seems legit to me and is widely practiced by
    other fs, so exposing those tmpfs stats as this patch set is doing
    seems fine to me.

Another point worth considering in favor of /sys/fs/tmpfs -
since tmpfs is FS_USERNS_MOUNT, the ability of a sysadmin to monitor all
tmpfs mounts in the system and their usage is limited.

Therefore, having a central way to enumerate all tmpfs instances in the
system, like blockdev fs instances and fuse fs instances, does not sound
like a terrible idea in general.
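
As a rough illustration of that enumeration (assuming the /sys/fs/tmpfs
layout proposed here, with one subdirectory per superblock named after
its minor number):

/* Sketch: list tmpfs instances through the proposed /sys/fs/tmpfs
 * hierarchy, one subdirectory per superblock.  Illustration only. */
#include <stdio.h>
#include <dirent.h>

int main(void)
{
	DIR *d = opendir("/sys/fs/tmpfs");
	struct dirent *de;

	if (!d) {
		perror("/sys/fs/tmpfs");
		return 1;
	}
	while ((de = readdir(d)))
		if (de->d_name[0] != '.')
			printf("tmpfs instance: %s\n", de->d_name);
	closedir(d);
	return 0;
}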

> > Whatever that action is, I see no user-facing documentation which
> > guides the user in how to take advantage of this?
>
> I can follow up with a new version with documentation, if we agree this
> feature makes sense.
>

Given the time of year and participants involved, shall we continue this
discussion in LSFMM?

I am not sure if this even requires a shared FS/MM session, but I don't
mind trying to allocate a shared FS/MM slot if Andrew and the MM guys are
interested in taking part in the discussion.

As long as memcg is able to report OOM to the orchestrator, the problem does not
sound very tmpfs specific to me.

As Ted explained, cloud providers (for some reason) charge by disk size and not
by disk usage, so also for non-tmpfs, online growing the fs on demand could
prove to be a rewarding practice for cloud applications.

Thanks,
Amir.
Gabriel Krisman Bertazi April 21, 2022, 10:37 p.m. UTC | #4
Amir Goldstein <amir73il@gmail.com> writes:

> On Tue, Apr 19, 2022 at 6:29 PM Gabriel Krisman Bertazi
> <krisman@collabora.com> wrote:
>> > Well that sucks.  We need a kernel-side workaround for applications
>> > that fail to check and report storage errors?
>> >
>> > We could do this for every syscall in the kernel.  What's special about
>> > tmpfs in this regard?
>> >
>> > Please provide additional justification and usage examples for such an
>> > extraordinary thing.
>>
>> For a cloud provider deploying containerized applications, they might
>> not control the application, so patching userspace wouldn't be a
>> solution.  More importantly - and why this is shmem specific -
>> they want to differentiate between a user getting ENOSPC due to
>> insufficiently provisioned fs size, vs. due to running out of memory in
>> a container, both of which return ENOSPC to the process.
>>
>
> Isn't there already a per memcg OOM handler that could be used by
> orchestrator to detect the latter?

Hi Amir,

Thanks for the added context.  I'm actually not sure if an OOM handler
completely solves the latter case.  If shmem_inode_acct_block fails, it
happens before the allocation, so the OOM won't trigger and we won't know
about it, as far as I understand.  I'm not sure it's a real problem for
Google's use case.  Khazhy is the expert on their implementation and
might be able to better discuss it.

I want to mention that, for the insufficiently-provisioned-fs-size case,
we still can't rely just on statfs.  We need a polling interface -
generic or tmpfs specific - to make sure we don't miss these events, I
think.

Thanks,
Khazhy Kumykov April 21, 2022, 11:19 p.m. UTC | #5
On Wed, Apr 20, 2022 at 10:34 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Tue, Apr 19, 2022 at 6:29 PM Gabriel Krisman Bertazi
> <krisman@collabora.com> wrote:
> >
> > Andrew Morton <akpm@linux-foundation.org> writes:
> >
> > Hi Andrew,
> >
> > > On Mon, 18 Apr 2022 17:37:10 -0400 Gabriel Krisman Bertazi <krisman@collabora.com> wrote:
> > >
> > >> When provisioning containerized applications, multiple very small tmpfs
> > >
> > > "files"?
> >
> > Actually, filesystems.  In cloud environments, we have several small
> > tmpfs associated with containerized tasks.
> >
> > >> are used, for which one cannot always predict the proper file system
> > >> size ahead of time.  We want to be able to reliably monitor filesystems
> > >> for ENOSPC errors, without depending on the application being executed
> > >> reporting the ENOSPC after a failure.
> > >
> > > Well that sucks.  We need a kernel-side workaround for applications
> > > that fail to check and report storage errors?
> > >
> > > We could do this for every syscall in the kernel.  What's special about
> > > tmpfs in this regard?
> > >
> > > Please provide additional justification and usage examples for such an
> > > extraordinary thing.
> >
> > For a cloud provider deploying containerized applications, they might
> > not control the application, so patching userspace wouldn't be a
> > solution.  More importantly - and why this is shmem specific -
> > they want to differentiate between a user getting ENOSPC due to
> > insufficiently provisioned fs size, vs. due to running out of memory in
> > a container, both of which return ENOSPC to the process.
> >
>
> Isn't there already a per memcg OOM handler that could be used by
> orchestrator to detect the latter?
>
> > A system administrator can then use this feature to monitor a fleet of
> > containerized applications in a uniform way, detect provisioning issues
> > caused by different reasons and address the deployment.
> >
> > I originally submitted this as a new fanotify event, but given the
> > specificity of shmem, Amir suggested the interface I'm implementing
> > here.  We've raised this discussion originally here:
> >
> > https://lore.kernel.org/linux-mm/CACGdZYLLCqzS4VLUHvzYG=rX3SEJaG7Vbs8_Wb_iUVSvXsqkxA@mail.gmail.com/
> >
>
> To put things in context, the points I was trying to make in this
> discussion are:
>
> 1. Why isn't monitoring with statfs() a sufficient solution? and more
>     specifically, the shared disk space provisioning problem does not sound
>     very tmpfs specific to me.
>     It is a well known issue for thin provisioned storage in environments
>     with shared resources as the ones that you describe

I think this solves a different problem: to my understanding statfs
polling is useful for determining if a long lived, slowly growing FS
is approaching its limits - the tmpfs here are generally short lived,
and may be intentionally running close to limits (e.g. if they "know"
exactly how much they need, and don't expect to write any more than
that). In this case, the limits are there to guard against runaway
(and assist with scheduling), so "monitor and increase limits
periodically" isn't appropriate.

It's meant just to make it easier to distinguish between "tmpfs write
failed due to OOM" and "tmpfs write failed because you exceeded tmpfs'
max size" (what makes tmpfs "special" is that tmpfs, for good reason,
returns ENOSPC for both of these situations to the user). For a small
task a user could easily go from 0% to full, or OOM, rather quickly,
so statfs polling would likely miss the event. The orchestrator can,
when the task fails, easily (and reliably) look at this statistic to
determine if a user exceeded the tmpfs limit.

(I do see the parallel here to thin provisioned storage - "exceeded
your individual budget" vs. "underlying overcommitted system ran out
of bytes")

> 2. OTOH, exporting internal fs stats via /sys/fs for debugging, health
> monitoring
>     or whatever seems legit to me and is widely practiced by other fs, so
>     exposing those tmpfs stats as this patch set is doing seems fine to me.
>
> Another point worth considering in favor of /sys/fs/tmpfs -
> since tmpfs is FS_USERNS_MOUNT, the ability of sysadmin to monitor all
> tmpfs mounts in the system and their usage is limited.
>
> Therefore, having a central way to enumerate all tmpfs instances in the system
> like blockdev fs instances and like fuse fs instances, does not sound
> like a terrible
> idea in general.
>
> > > Whatever that action is, I see no user-facing documentation which
> > > guides the user in how to take advantage of this?
> >
> > I can follow up with a new version with documentation, if we agree this
> > feature makes sense.
> >
>
> Given the time of year and participants involved, shall we continue
> this discussion
> in LSFMM?
>
> I am not sure if this even requires a shared FS/MM session, but I
> don't mind trying
> to allocate a shared FS/MM slot if Andrew and MM guys are interested
> to take part
> in the discussion.
>
> As long as memcg is able to report OOM to the orchestrator, the problem does not
> sound very tmpfs specific to me.
>
> As Ted explained, cloud providers (for some reason) charge by disk size and not
> by disk usage, so also for non-tmpfs, online growing the fs on demand could
> prove to be a rewarding practice for cloud applications.
>
> Thanks,
> Amir.
Amir Goldstein April 22, 2022, 9:02 a.m. UTC | #6
On Fri, Apr 22, 2022 at 2:19 AM Khazhy Kumykov <khazhy@google.com> wrote:
>
> On Wed, Apr 20, 2022 at 10:34 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Tue, Apr 19, 2022 at 6:29 PM Gabriel Krisman Bertazi
> > <krisman@collabora.com> wrote:
> > >
> > > Andrew Morton <akpm@linux-foundation.org> writes:
> > >
> > > Hi Andrew,
> > >
> > > > On Mon, 18 Apr 2022 17:37:10 -0400 Gabriel Krisman Bertazi <krisman@collabora.com> wrote:
> > > >
> > > >> When provisioning containerized applications, multiple very small tmpfs
> > > >
> > > > "files"?
> > >
> > > Actually, filesystems.  In cloud environments, we have several small
> > > tmpfs associated with containerized tasks.
> > >
> > > >> are used, for which one cannot always predict the proper file system
> > > >> size ahead of time.  We want to be able to reliably monitor filesystems
> > > >> for ENOSPC errors, without depending on the application being executed
> > > >> reporting the ENOSPC after a failure.
> > > >
> > > > Well that sucks.  We need a kernel-side workaround for applications
> > > > that fail to check and report storage errors?
> > > >
> > > > We could do this for every syscall in the kernel.  What's special about
> > > > tmpfs in this regard?
> > > >
> > > > Please provide additional justification and usage examples for such an
> > > > extraordinary thing.
> > >
> > > For a cloud provider deploying containerized applications, they might
> > > not control the application, so patching userspace wouldn't be a
> > > solution.  More importantly - and why this is shmem specific -
> > > they want to differentiate between a user getting ENOSPC due to
> > > insufficiently provisioned fs size, vs. due to running out of memory in
> > > a container, both of which return ENOSPC to the process.
> > >
> >
> > Isn't there already a per memcg OOM handler that could be used by
> > orchestrator to detect the latter?
> >
> > > A system administrator can then use this feature to monitor a fleet of
> > > containerized applications in a uniform way, detect provisioning issues
> > > caused by different reasons and address the deployment.
> > >
> > > I originally submitted this as a new fanotify event, but given the
> > > specificity of shmem, Amir suggested the interface I'm implementing
> > > here.  We've raised this discussion originally here:
> > >
> > > https://lore.kernel.org/linux-mm/CACGdZYLLCqzS4VLUHvzYG=rX3SEJaG7Vbs8_Wb_iUVSvXsqkxA@mail.gmail.com/
> > >
> >
> > To put things in context, the points I was trying to make in this
> > discussion are:
> >
> > 1. Why isn't monitoring with statfs() a sufficient solution? and more
> >     specifically, the shared disk space provisioning problem does not sound
> >     very tmpfs specific to me.
> >     It is a well known issue for thin provisioned storage in environments
> >     with shared resources as the ones that you describe
>
> I think this solves a different problem: to my understanding statfs
> polling is useful for determining if a long lived, slowly growing FS
> is approaching its limits - the tmpfs here are generally short lived,
> and may be intentionally running close to limits (e.g. if they "know"
> exactly how much they need, and don't expect to write any more than
> that). In this case, the limits are there to guard against runaway
> (and assist with scheduling), so "monitor and increase limits
> periodically" isn't appropriate.
>
> It's meant just to make it easier to distinguish between "tmpfs write
> failed due to OOM" and "tmpfs write failed because you exceeded tmpfs'
> max size" (what makes tmpfs "special" is that tmpfs, for good reason,
> returns ENOSPC for both of these situations to the user). For a small

Maybe it's for a good reason, but it clearly is not the desired behavior
in your use case. Perhaps what is needed here is a way for the user to
opt in to a different OOM behavior from shmem using a mount option?
Would that be enough to cover your use case?

> task a user could easily go from 0% to full, or OOM, rather quickly,
> so statfs polling would likely miss the event. The orchestrator can,
> when the task fails, easily (and reliably) look at this statistic to
> determine if a user exceeded the tmpfs limit.
>
> (I do see the parallel here to thin provisioned storage - "exceeded
> your individual budget" vs. "underlying overcommitted system ran out
> of bytes")

Right, and in this case, the application gets a different error in case
of "underlying space overcommitted", usually EIO, that's why I think that
opting-in for this same behavior could make sense for tmpfs.

We can even consider shutdown behavior for shmem in that case, but
that is up to whoever may be interested in that kind of behavior.

Thanks,
Amir.
Gabriel Krisman Bertazi May 5, 2022, 9:16 p.m. UTC | #7
Amir Goldstein <amir73il@gmail.com> writes:

>> task a user could easily go from 0% to full, or OOM, rather quickly,
>> so statfs polling would likely miss the event. The orchestrator can,
>> when the task fails, easily (and reliably) look at this statistic to
>> determine if a user exceeded the tmpfs limit.
>>
>> (I do see the parallel here to thin provisioned storage - "exceeded
>> your individual budget" vs. "underlying overcommitted system ran out
>> of bytes")
>
> Right, and in this case, the application gets a different error in case
> of "underlying space overcommitted", usually EIO, that's why I think that
> opting-in for this same behavior could make sense for tmpfs.

Amir,

If I understand correctly, that would allow the application to catch the
lack of memory vs. lack of fs space, but it wouldn't make life easier for
an orchestrator trying to detect the condition.  Still, it seems like a
step in the right direction.  For the orchestrator, it seems necessary
that we expose this via some out-of-band mechanism, a WB_ERROR
notification or sysfs.

As a first step:

>8
Subject: [PATCH] shmem: Differentiate overcommit failure from lack of fs space

When provisioning user applications in cloud environments, it is common
to allocate containers with very small tmpfs and little available
memory.  In such scenarios, it is hard for an application to
differentiate whether its tmpfs IO failed due do insufficient
provisioned filesystem space, or due to running out of memory in the
container, because both situations will return ENOSPC in shmem.

This patch modifies the behavior of shmem failure due to overcommit to
return EIO instead of ENOSPC in this scenario.  In order to preserve the
existing interface, this feature must be enabled through a new
shmem-specific mount option.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
---
 Documentation/filesystems/tmpfs.rst | 16 +++++++++++++++
 include/linux/shmem_fs.h            |  3 +++
 mm/shmem.c                          | 30 ++++++++++++++++++++---------
 3 files changed, 40 insertions(+), 9 deletions(-)

diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst
index 0408c245785e..83278d2b15a3 100644
--- a/Documentation/filesystems/tmpfs.rst
+++ b/Documentation/filesystems/tmpfs.rst
@@ -171,6 +171,22 @@ will give you tmpfs instance on /mytmpfs which can allocate 10GB
 RAM/SWAP in 10240 inodes and it is only accessible by root.
 
 
+When provisioning containerized applications, it is common to allocate
+the system with a very small tmpfs and little total memory.  In such
+scenarios, it is sometimes useful for an application to differentiate
+whether an IO operation failed due to insufficient provisioned
+filesystem space or due to running out of container memory.  tmpfs
+includes a mount parameter to treat a memory overcommit limit error
+differently from a lack of filesystem space error, allowing the
+application to differentiate these two scenarios.  If the following
+mount option is specified, surpassing memory overcommit limits on a
+tmpfs will return EIO.  ENOSPC is then only used to report lack of
+filesystem space.
+
+=================   ===================================================
+report_overcommit   Report overcommit issues with EIO instead of ENOSPC
+=================   ===================================================
+
 :Author:
    Christoph Rohland <cr@sap.com>, 1.12.01
 :Updated:
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index e65b80ed09e7..1be57531b257 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -44,6 +44,9 @@ struct shmem_sb_info {
 	spinlock_t shrinklist_lock;   /* Protects shrinklist */
 	struct list_head shrinklist;  /* List of shinkable inodes */
 	unsigned long shrinklist_len; /* Length of shrinklist */
+
+	/* Assist userspace with detecting overcommit errors */
+	bool report_overcommit;
 };
 
 static inline struct shmem_inode_info *SHMEM_I(struct inode *inode)
diff --git a/mm/shmem.c b/mm/shmem.c
index a09b29ec2b45..23f2780678df 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -112,6 +112,7 @@ struct shmem_options {
 	kgid_t gid;
 	umode_t mode;
 	bool full_inums;
+	bool report_overcommit;
 	int huge;
 	int seen;
 #define SHMEM_SEEN_BLOCKS 1
@@ -207,13 +208,16 @@ static inline void shmem_unacct_blocks(unsigned long flags, long pages)
 		vm_unacct_memory(pages * VM_ACCT(PAGE_SIZE));
 }
 
-static inline bool shmem_inode_acct_block(struct inode *inode, long pages)
+static inline int shmem_inode_acct_block(struct inode *inode, long pages)
 {
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
 
-	if (shmem_acct_block(info->flags, pages))
-		return false;
+	if (shmem_acct_block(info->flags, pages)) {
+		if (sbinfo->report_overcommit)
+			return -EIO;
+		return -ENOSPC;
+	}
 
 	if (sbinfo->max_blocks) {
 		if (percpu_counter_compare(&sbinfo->used_blocks,
@@ -222,11 +226,11 @@ static inline bool shmem_inode_acct_block(struct inode *inode, long pages)
 		percpu_counter_add(&sbinfo->used_blocks, pages);
 	}
 
-	return true;
+	return 0;
 
 unacct:
 	shmem_unacct_blocks(info->flags, pages);
-	return false;
+	return -ENOSPC;
 }
 
 static inline void shmem_inode_unacct_blocks(struct inode *inode, long pages)
@@ -372,7 +376,7 @@ bool shmem_charge(struct inode *inode, long pages)
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	unsigned long flags;
 
-	if (!shmem_inode_acct_block(inode, pages))
+	if (shmem_inode_acct_block(inode, pages))
 		return false;
 
 	/* nrpages adjustment first, then shmem_recalc_inode() when balanced */
@@ -1555,13 +1559,14 @@ static struct page *shmem_alloc_and_acct_page(gfp_t gfp,
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	struct page *page;
 	int nr;
-	int err = -ENOSPC;
+	int err;
 
 	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
 		huge = false;
 	nr = huge ? HPAGE_PMD_NR : 1;
 
-	if (!shmem_inode_acct_block(inode, nr))
+	err = shmem_inode_acct_block(inode, nr);
+	if (err)
 		goto failed;
 
 	if (huge)
@@ -2324,7 +2329,7 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 	int ret;
 	pgoff_t max_off;
 
-	if (!shmem_inode_acct_block(inode, 1)) {
+	if (shmem_inode_acct_block(inode, 1)) {
 		/*
 		 * We may have got a page, returned -ENOENT triggering a retry,
 		 * and now we find ourselves with -ENOMEM. Release the page, to
@@ -3301,6 +3306,7 @@ enum shmem_param {
 	Opt_uid,
 	Opt_inode32,
 	Opt_inode64,
+	Opt_report_overcommit,
 };
 
 static const struct constant_table shmem_param_enums_huge[] = {
@@ -3322,6 +3328,7 @@ const struct fs_parameter_spec shmem_fs_parameters[] = {
 	fsparam_u32   ("uid",		Opt_uid),
 	fsparam_flag  ("inode32",	Opt_inode32),
 	fsparam_flag  ("inode64",	Opt_inode64),
+	fsparam_flag  ("report_overcommit", Opt_report_overcommit),
 	{}
 };
 
@@ -3405,6 +3412,9 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
 		ctx->full_inums = true;
 		ctx->seen |= SHMEM_SEEN_INUMS;
 		break;
+	case Opt_report_overcommit:
+		ctx->report_overcommit = true;
+		break;
 	}
 	return 0;
 
@@ -3513,6 +3523,7 @@ static int shmem_reconfigure(struct fs_context *fc)
 		sbinfo->max_inodes  = ctx->inodes;
 		sbinfo->free_inodes = ctx->inodes - inodes;
 	}
+	sbinfo->report_overcommit = ctx->report_overcommit;
 
 	/*
 	 * Preserve previous mempolicy unless mpol remount option was specified.
@@ -3640,6 +3651,7 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
 	sbinfo->mode = ctx->mode;
 	sbinfo->huge = ctx->huge;
 	sbinfo->mpol = ctx->mpol;
+	sbinfo->report_overcommit = ctx->report_overcommit;
 	ctx->mpol = NULL;
 
 	raw_spin_lock_init(&sbinfo->stat_lock);
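
For completeness, a rough usage sketch for the option above (not part of
the patch; the mount point, size and file name are arbitrary, and the
program needs privileges to mount):

/* Sketch: mount a small tmpfs with the proposed report_overcommit option
 * and tell the two failure modes apart by errno.  Illustration only. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mount.h>

int main(void)
{
	int fd;

	if (mount("tmpfs", "/mnt/t", "tmpfs", 0, "size=1m,report_overcommit")) {
		perror("mount");
		return 1;
	}

	fd = open("/mnt/t/file", O_CREAT | O_WRONLY, 0600);
	if (fd < 0)
		return 1;

	while (write(fd, "x", 1) == 1)
		;	/* fill the fs (or hit the overcommit limit) */

	if (errno == ENOSPC)
		printf("out of provisioned tmpfs space\n");
	else if (errno == EIO)
		printf("memory overcommit limit hit (report_overcommit)\n");

	close(fd);
	umount("/mnt/t");
	return 0;
}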
Gabriel Krisman Bertazi May 12, 2022, 8 p.m. UTC | #8
Gabriel Krisman Bertazi <krisman@collabora.com> writes:

> Amir Goldstein <amir73il@gmail.com> writes:
>
>>> task a user could easily go from 0% to full, or OOM, rather quickly,
>>> so statfs polling would likely miss the event. The orchestrator can,
>>> when the task fails, easily (and reliably) look at this statistic to
>>> determine if a user exceeded the tmpfs limit.
>>>
>>> (I do see the parallel here to thin provisioned storage - "exceeded
>>> your individual budget" vs. "underlying overcommitted system ran out
>>> of bytes")
>>
>> Right, and in this case, the application gets a different error in case
>> of "underlying space overcommitted", usually EIO, that's why I think that
>> opting-in for this same behavior could make sense for tmpfs.
>
> Amir,
>
> If I understand correctly, that would allow the application to catch the
> lack of memory vs. lack of fs space, but it wouldn't make life easier for
> an orchestrator trying to detect the condition.  Still, it seems like a
> step in the right direction.  For the orchestrator, it seems necessary
> that we expose this via some out-of-band mechanism, a WB_ERROR
> notification or sysfs.

Amir,

Regarding allowing an orchestrator to catch this situation, I'd like to
go back to the original proposal and create a new tmpfs
"thin-provisioned" option that will return a different error code (as in
the patch below, which I sent last week) and also issue a special
FAN_FS_ERROR/WB_ERROR to notify the orchestrator when it happens.
This would completely solve the use case, I believe.  Since this is
quite specific to tmpfs, it is reasonable to implement the notification
at the FS level, similar to how other FS_ERRORs are implemented.
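
For the orchestrator side, a rough sketch of a FAN_FS_ERROR listener is
below.  The fanotify API shown already exists (since 5.16); whether shmem
would actually report the overcommit condition through it is only the
proposal above:

/* Sketch: listen for FAN_FS_ERROR on a mounted filesystem and print the
 * reported errno and count.  Requires CAP_SYS_ADMIN; the mount point is
 * arbitrary.  Whether tmpfs would emit these events is the proposal above. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/fanotify.h>
#include <linux/fanotify.h>

int main(void)
{
	char buf[4096];
	int fd = fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_FID, O_RDONLY);

	if (fd < 0 || fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
				    FAN_FS_ERROR, AT_FDCWD, "/mnt/t") < 0) {
		perror("fanotify");
		return 1;
	}

	for (;;) {
		ssize_t len = read(fd, buf, sizeof(buf));
		struct fanotify_event_metadata *m = (void *)buf;

		if (len <= 0)
			break;
		for (; FAN_EVENT_OK(m, len); m = FAN_EVENT_NEXT(m, len)) {
			struct fanotify_event_info_error *e =
				(struct fanotify_event_info_error *)(m + 1);

			if ((m->mask & FAN_FS_ERROR) &&
			    e->hdr.info_type == FAN_EVENT_INFO_TYPE_ERROR)
				printf("fs error %d seen %u times\n",
				       e->error, e->error_count);
		}
	}
	return 0;
}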

> As a first step:
>
>>8
> Subject: [PATCH] shmem: Differentiate overcommit failure from lack of fs space
>
> When provisioning user applications in cloud environments, it is common
> to allocate containers with very small tmpfs and little available
> memory.  In such scenarios, it is hard for an application to
> differentiate whether its tmpfs IO failed due to insufficient
> provisioned filesystem space, or due to running out of memory in the
> container, because both situations will return ENOSPC in shmem.
>
> This patch modifies the behavior of shmem failure due to overcommit to
> return EIO instead of ENOSPC in this scenario.  In order to preserve the
> existing interface, this feature must be enabled through a new
> shmem-specific mount option.
>
> Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
> ---
>  Documentation/filesystems/tmpfs.rst | 16 +++++++++++++++
>  include/linux/shmem_fs.h            |  3 +++
>  mm/shmem.c                          | 30 ++++++++++++++++++++---------
>  3 files changed, 40 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst
> index 0408c245785e..83278d2b15a3 100644
> --- a/Documentation/filesystems/tmpfs.rst
> +++ b/Documentation/filesystems/tmpfs.rst
> @@ -171,6 +171,22 @@ will give you tmpfs instance on /mytmpfs which can allocate 10GB
>  RAM/SWAP in 10240 inodes and it is only accessible by root.
>  
>  
> +When provisioning containerized applications, it is common to allocate
> +the system with a very small tmpfs and little total memory.  In such
> +scenarios, it is sometimes useful for an application to differentiate
> +whether an IO operation failed due to insufficient provisioned
> +filesystem space or due to running out of container memory.  tmpfs
> +includes a mount parameter to treat a memory overcommit limit error
> +differently from a lack of filesystem space error, allowing the
> +application to differentiate these two scenarios.  If the following
> +mount option is specified, surpassing memory overcommit limits on a
> +tmpfs will return EIO.  ENOSPC is then only used to report lack of
> +filesystem space.
> +
> +=================   ===================================================
> +report_overcommit   Report overcommit issues with EIO instead of ENOSPC
> +=================   ===================================================
> +
>  :Author:
>     Christoph Rohland <cr@sap.com>, 1.12.01
>  :Updated:
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index e65b80ed09e7..1be57531b257 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -44,6 +44,9 @@ struct shmem_sb_info {
>  	spinlock_t shrinklist_lock;   /* Protects shrinklist */
>  	struct list_head shrinklist;  /* List of shinkable inodes */
>  	unsigned long shrinklist_len; /* Length of shrinklist */
> +
> +	/* Assist userspace with detecting overcommit errors */
> +	bool report_overcommit;
>  };
>  
>  static inline struct shmem_inode_info *SHMEM_I(struct inode *inode)
> diff --git a/mm/shmem.c b/mm/shmem.c
> index a09b29ec2b45..23f2780678df 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -112,6 +112,7 @@ struct shmem_options {
>  	kgid_t gid;
>  	umode_t mode;
>  	bool full_inums;
> +	bool report_overcommit;
>  	int huge;
>  	int seen;
>  #define SHMEM_SEEN_BLOCKS 1
> @@ -207,13 +208,16 @@ static inline void shmem_unacct_blocks(unsigned long flags, long pages)
>  		vm_unacct_memory(pages * VM_ACCT(PAGE_SIZE));
>  }
>  
> -static inline bool shmem_inode_acct_block(struct inode *inode, long pages)
> +static inline int shmem_inode_acct_block(struct inode *inode, long pages)
>  {
>  	struct shmem_inode_info *info = SHMEM_I(inode);
>  	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
>  
> -	if (shmem_acct_block(info->flags, pages))
> -		return false;
> +	if (shmem_acct_block(info->flags, pages)) {
> +		if (sbinfo->report_overcommit)
> +			return -EIO;
> +		return -ENOSPC;
> +	}
>  
>  	if (sbinfo->max_blocks) {
>  		if (percpu_counter_compare(&sbinfo->used_blocks,
> @@ -222,11 +226,11 @@ static inline bool shmem_inode_acct_block(struct inode *inode, long pages)
>  		percpu_counter_add(&sbinfo->used_blocks, pages);
>  	}
>  
> -	return true;
> +	return 0;
>  
>  unacct:
>  	shmem_unacct_blocks(info->flags, pages);
> -	return false;
> +	return -ENOSPC;
>  }
>  
>  static inline void shmem_inode_unacct_blocks(struct inode *inode, long pages)
> @@ -372,7 +376,7 @@ bool shmem_charge(struct inode *inode, long pages)
>  	struct shmem_inode_info *info = SHMEM_I(inode);
>  	unsigned long flags;
>  
> -	if (!shmem_inode_acct_block(inode, pages))
> +	if (shmem_inode_acct_block(inode, pages))
>  		return false;
>  
>  	/* nrpages adjustment first, then shmem_recalc_inode() when balanced */
> @@ -1555,13 +1559,14 @@ static struct page *shmem_alloc_and_acct_page(gfp_t gfp,
>  	struct shmem_inode_info *info = SHMEM_I(inode);
>  	struct page *page;
>  	int nr;
> -	int err = -ENOSPC;
> +	int err;
>  
>  	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
>  		huge = false;
>  	nr = huge ? HPAGE_PMD_NR : 1;
>  
> -	if (!shmem_inode_acct_block(inode, nr))
> +	err = shmem_inode_acct_block(inode, nr);
> +	if (err)
>  		goto failed;
>  
>  	if (huge)
> @@ -2324,7 +2329,7 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
>  	int ret;
>  	pgoff_t max_off;
>  
> -	if (!shmem_inode_acct_block(inode, 1)) {
> +	if (shmem_inode_acct_block(inode, 1)) {
>  		/*
>  		 * We may have got a page, returned -ENOENT triggering a retry,
>  		 * and now we find ourselves with -ENOMEM. Release the page, to
> @@ -3301,6 +3306,7 @@ enum shmem_param {
>  	Opt_uid,
>  	Opt_inode32,
>  	Opt_inode64,
> +	Opt_report_overcommit,
>  };
>  
>  static const struct constant_table shmem_param_enums_huge[] = {
> @@ -3322,6 +3328,7 @@ const struct fs_parameter_spec shmem_fs_parameters[] = {
>  	fsparam_u32   ("uid",		Opt_uid),
>  	fsparam_flag  ("inode32",	Opt_inode32),
>  	fsparam_flag  ("inode64",	Opt_inode64),
> +	fsparam_flag  ("report_overcommit", Opt_report_overcommit),
>  	{}
>  };
>  
> @@ -3405,6 +3412,9 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
>  		ctx->full_inums = true;
>  		ctx->seen |= SHMEM_SEEN_INUMS;
>  		break;
> +	case Opt_report_overcommit:
> +		ctx->report_overcommit = true;
> +		break;
>  	}
>  	return 0;
>  
> @@ -3513,6 +3523,7 @@ static int shmem_reconfigure(struct fs_context *fc)
>  		sbinfo->max_inodes  = ctx->inodes;
>  		sbinfo->free_inodes = ctx->inodes - inodes;
>  	}
> +	sbinfo->report_overcommit = ctx->report_overcommit;
>  
>  	/*
>  	 * Preserve previous mempolicy unless mpol remount option was specified.
> @@ -3640,6 +3651,7 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
>  	sbinfo->mode = ctx->mode;
>  	sbinfo->huge = ctx->huge;
>  	sbinfo->mpol = ctx->mpol;
> +	sbinfo->report_overcommit = ctx->report_overcommit;
>  	ctx->mpol = NULL;
>  
>  	raw_spin_lock_init(&sbinfo->stat_lock);