Message ID: 20220418213713.273050-1-krisman@collabora.com
Series: shmem: Allow userspace monitoring of tmpfs for lack of space.
On Mon, 18 Apr 2022 17:37:10 -0400 Gabriel Krisman Bertazi <krisman@collabora.com> wrote:

> When provisioning containerized applications, multiple very small tmpfs

"files"?

> are used, for which one cannot always predict the proper file system
> size ahead of time. We want to be able to reliably monitor filesystems
> for ENOSPC errors, without depending on the application being executed
> reporting the ENOSPC after a failure.

Well that sucks. We need a kernel-side workaround for applications
that fail to check and report storage errors?

We could do this for every syscall in the kernel. What's special about
tmpfs in this regard?

Please provide additional justification and usage examples for such an
extraordinary thing.

> It is also not enough to watch
> statfs since that information might be ephemeral (say the application
> recovers by deleting data, the issue can get lost).

We could fix the apps? Heck, you could patch libc's write() to the
same effect.

> For this use case,
> it is also interesting to differentiate IO errors caused by lack of
> virtual memory from lack of FS space.

More details, please. Why interesting? What actions can the system
operator take based upon this information?

Whatever that action is, I see no user-facing documentation which
guides the user into how to take advantage of this?
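For reference, the statfs()-style watching mentioned in the cover letter amounts to periodically sampling the filesystem counters, roughly like the sketch below (the mount point and polling interval are illustrative); as the cover letter notes, a sample taken after the application has already freed space can miss the ENOSPC episode entirely.

#include <stdio.h>
#include <unistd.h>
#include <sys/statvfs.h>

int main(void)
{
	for (;;) {
		struct statvfs st;

		if (statvfs("/mnt/container-tmp", &st) != 0) {
			perror("statvfs");
			return 1;
		}
		/* f_bavail counts blocks still available to unprivileged users. */
		printf("free: %llu of %llu bytes\n",
		       (unsigned long long)st.f_bavail * st.f_frsize,
		       (unsigned long long)st.f_blocks * st.f_frsize);
		sleep(1);	/* a transient full condition between samples is lost */
	}
}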
Andrew Morton <akpm@linux-foundation.org> writes:

Hi Andrew,

> On Mon, 18 Apr 2022 17:37:10 -0400 Gabriel Krisman Bertazi <krisman@collabora.com> wrote:
>
>> When provisioning containerized applications, multiple very small tmpfs
>
> "files"?

Actually, filesystems. In cloud environments, we have several small
tmpfs associated with containerized tasks.

>> are used, for which one cannot always predict the proper file system
>> size ahead of time. We want to be able to reliably monitor filesystems
>> for ENOSPC errors, without depending on the application being executed
>> reporting the ENOSPC after a failure.
>
> Well that sucks. We need a kernel-side workaround for applications
> that fail to check and report storage errors?
>
> We could do this for every syscall in the kernel. What's special about
> tmpfs in this regard?
>
> Please provide additional justification and usage examples for such an
> extraordinary thing.

For a cloud provider deploying containerized applications, they might
not control the application, so patching userspace wouldn't be a
solution. More importantly - and why this is shmem specific -
they want to differentiate between a user getting ENOSPC due to
insufficiently provisioned fs size, vs. due to running out of memory in
a container, both of which return ENOSPC to the process.

A system administrator can then use this feature to monitor a fleet of
containerized applications in a uniform way, detect provisioning issues
caused by different reasons and address the deployment.

I originally submitted this as a new fanotify event, but given the
specificity of shmem, Amir suggested the interface I'm implementing
here. We've raised this discussion originally here:

https://lore.kernel.org/linux-mm/CACGdZYLLCqzS4VLUHvzYG=rX3SEJaG7Vbs8_Wb_iUVSvXsqkxA@mail.gmail.com/

> Whatever that action is, I see no user-facing documentation which
> guides the user into how to take advantage of this?

I can follow up with a new version with documentation, if we agree this
feature makes sense.

Thanks,
On Tue, Apr 19, 2022 at 6:29 PM Gabriel Krisman Bertazi <krisman@collabora.com> wrote:
>
> Andrew Morton <akpm@linux-foundation.org> writes:
>
> Hi Andrew,
>
> > On Mon, 18 Apr 2022 17:37:10 -0400 Gabriel Krisman Bertazi <krisman@collabora.com> wrote:
> >
> >> When provisioning containerized applications, multiple very small tmpfs
> >
> > "files"?
>
> Actually, filesystems. In cloud environments, we have several small
> tmpfs associated with containerized tasks.
>
> >> are used, for which one cannot always predict the proper file system
> >> size ahead of time. We want to be able to reliably monitor filesystems
> >> for ENOSPC errors, without depending on the application being executed
> >> reporting the ENOSPC after a failure.
> >
> > Well that sucks. We need a kernel-side workaround for applications
> > that fail to check and report storage errors?
> >
> > We could do this for every syscall in the kernel. What's special about
> > tmpfs in this regard?
> >
> > Please provide additional justification and usage examples for such an
> > extraordinary thing.
>
> For a cloud provider deploying containerized applications, they might
> not control the application, so patching userspace wouldn't be a
> solution. More importantly - and why this is shmem specific -
> they want to differentiate between a user getting ENOSPC due to
> insufficiently provisioned fs size, vs. due to running out of memory in
> a container, both of which return ENOSPC to the process.
>

Isn't there already a per memcg OOM handler that could be used by
orchestrator to detect the latter?

> A system administrator can then use this feature to monitor a fleet of
> containerized applications in a uniform way, detect provisioning issues
> caused by different reasons and address the deployment.
>
> I originally submitted this as a new fanotify event, but given the
> specificity of shmem, Amir suggested the interface I'm implementing
> here. We've raised this discussion originally here:
>
> https://lore.kernel.org/linux-mm/CACGdZYLLCqzS4VLUHvzYG=rX3SEJaG7Vbs8_Wb_iUVSvXsqkxA@mail.gmail.com/
>

To put things in context, the points I was trying to make in this
discussion are:

1. Why isn't monitoring with statfs() a sufficient solution? and more
specifically, the shared disk space provisioning problem does not sound
very tmpfs specific to me.
It is a well known issue for thin provisioned storage in environments
with shared resources as the ones that you describe.

2. OTOH, exporting internal fs stats via /sys/fs for debugging, health
monitoring or whatever seems legit to me and is widely practiced by
other fs, so exposing those tmpfs stats as this patch set is doing
seems fine to me.

Another point worth considering in favor of /sys/fs/tmpfs -
since tmpfs is FS_USERNS_MOUNT, the ability of sysadmin to monitor all
tmpfs mounts in the system and their usage is limited.

Therefore, having a central way to enumerate all tmpfs instances in the
system, like blockdev fs instances and like fuse fs instances, does not
sound like a terrible idea in general.

> > Whatever that action is, I see no user-facing documentation which
> > guides the user into how to take advantage of this?
>
> I can follow up with a new version with documentation, if we agree this
> feature makes sense.
>

Given the time of year and participants involved, shall we continue
this discussion in LSFMM?

I am not sure if this even requires a shared FS/MM session, but I don't
mind trying to allocate a shared FS/MM slot if Andrew and MM guys are
interested to take part in the discussion.

As long as memcg is able to report OOM to the orchestrator, the problem
does not sound very tmpfs specific to me.

As Ted explained, cloud providers (for some reason) charge by disk size
and not by disk usage, so also for non-tmpfs, online growing the fs on
demand could prove to be a rewarding practice for cloud applications.

Thanks,
Amir.
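The per-memcg OOM reporting Amir refers to is exposed to an orchestrator today through cgroup v2's memory.events file; a minimal sketch of checking it (the cgroup path is illustrative):

#include <stdio.h>
#include <string.h>

/* Return the "oom_kill" counter for a cgroup v2 memory controller,
 * or -1 on error.  The cgroup path below is illustrative. */
static long oom_kill_count(const char *cgroup)
{
	char path[256], key[64];
	long val = -1, n;
	FILE *f;

	snprintf(path, sizeof(path), "/sys/fs/cgroup/%s/memory.events", cgroup);
	f = fopen(path, "r");
	if (!f)
		return -1;
	while (fscanf(f, "%63s %ld", key, &n) == 2) {
		if (strcmp(key, "oom_kill") == 0)
			val = n;
	}
	fclose(f);
	return val;
}

As Gabriel notes in the next message, this only covers failures that actually reach the OOM path.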
Amir Goldstein <amir73il@gmail.com> writes:

> On Tue, Apr 19, 2022 at 6:29 PM Gabriel Krisman Bertazi
> <krisman@collabora.com> wrote:
>> > Well that sucks. We need a kernel-side workaround for applications
>> > that fail to check and report storage errors?
>> >
>> > We could do this for every syscall in the kernel. What's special about
>> > tmpfs in this regard?
>> >
>> > Please provide additional justification and usage examples for such an
>> > extraordinary thing.
>>
>> For a cloud provider deploying containerized applications, they might
>> not control the application, so patching userspace wouldn't be a
>> solution. More importantly - and why this is shmem specific -
>> they want to differentiate between a user getting ENOSPC due to
>> insufficiently provisioned fs size, vs. due to running out of memory in
>> a container, both of which return ENOSPC to the process.
>>
>
> Isn't there already a per memcg OOM handler that could be used by
> orchestrator to detect the latter?

Hi Amir,

Thanks for the added context. I'm actually not sure if an OOM handler
completely solves the latter case. If shmem_inode_acct_block fails, it
happens before the allocation. The OOM won't trigger and we won't know
about it, as far as I understand. I'm not sure it's a real problem for
Google's use case. Khazhy is the expert on their implementation and
might be able to better discuss it.

I want to mention that, for the insufficiently-provisioned-fs-size case,
we still can't rely just on statfs. We need a polling interface -
generic or tmpfs specific - to make sure we don't miss these events, I
think.

Thanks,
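For context, the pre-patch helper Gabriel refers to looks roughly like this (reconstructed from the diff context quoted later in the thread); both the overcommit accounting check and the max_blocks check fail before any page is allocated or charged, so no memcg OOM event is generated:

/* mm/shmem.c before the patch below (reconstructed from the diff context) */
static inline bool shmem_inode_acct_block(struct inode *inode, long pages)
{
	struct shmem_inode_info *info = SHMEM_I(inode);
	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);

	if (shmem_acct_block(info->flags, pages))	/* memory overcommit accounting */
		return false;

	if (sbinfo->max_blocks) {			/* tmpfs size= limit */
		if (percpu_counter_compare(&sbinfo->used_blocks,
					   sbinfo->max_blocks - pages) > 0)
			goto unacct;
		percpu_counter_add(&sbinfo->used_blocks, pages);
	}

	return true;

unacct:
	shmem_unacct_blocks(info->flags, pages);
	return false;
}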
On Wed, Apr 20, 2022 at 10:34 PM Amir Goldstein <amir73il@gmail.com> wrote: > > On Tue, Apr 19, 2022 at 6:29 PM Gabriel Krisman Bertazi > <krisman@collabora.com> wrote: > > > > Andrew Morton <akpm@linux-foundation.org> writes: > > > > Hi Andrew, > > > > > On Mon, 18 Apr 2022 17:37:10 -0400 Gabriel Krisman Bertazi <krisman@collabora.com> wrote: > > > > > >> When provisioning containerized applications, multiple very small tmpfs > > > > > > "files"? > > > > Actually, filesystems. In cloud environments, we have several small > > tmpfs associated with containerized tasks. > > > > >> are used, for which one cannot always predict the proper file system > > >> size ahead of time. We want to be able to reliably monitor filesystems > > >> for ENOSPC errors, without depending on the application being executed > > >> reporting the ENOSPC after a failure. > > > > > > Well that sucks. We need a kernel-side workaround for applications > > > that fail to check and report storage errors? > > > > > > We could do this for every syscall in the kernel. What's special about > > > tmpfs in this regard? > > > > > > Please provide additional justification and usage examples for such an > > > extraordinary thing. > > > > For a cloud provider deploying containerized applications, they might > > not control the application, so patching userspace wouldn't be a > > solution. More importantly - and why this is shmem specific - > > they want to differentiate between a user getting ENOSPC due to > > insufficiently provisioned fs size, vs. due to running out of memory in > > a container, both of which return ENOSPC to the process. > > > > Isn't there already a per memcg OOM handler that could be used by > orchestrator to detect the latter? > > > A system administrator can then use this feature to monitor a fleet of > > containerized applications in a uniform way, detect provisioning issues > > caused by different reasons and address the deployment. > > > > I originally submitted this as a new fanotify event, but given the > > specificity of shmem, Amir suggested the interface I'm implementing > > here. We've raised this discussion originally here: > > > > https://lore.kernel.org/linux-mm/CACGdZYLLCqzS4VLUHvzYG=rX3SEJaG7Vbs8_Wb_iUVSvXsqkxA@mail.gmail.com/ > > > > To put things in context, the points I was trying to make in this > discussion are: > > 1. Why isn't monitoring with statfs() a sufficient solution? and more > specifically, the shared disk space provisioning problem does not sound > very tmpfs specific to me. > It is a well known issue for thin provisioned storage in environments > with shared resources as the ones that you describe I think this solves a different problem: to my understanding statfs polling is useful for determining if a long lived, slowly growing FS is approaching its limits - the tmpfs here are generally short lived, and may be intentionally running close to limits (e.g. if they "know" exactly how much they need, and don't expect to write any more than that). In this case, the limits are there to guard against runaway (and assist with scheduling), so "monitor and increase limits periodically" isn't appropriate. It's meant just to make it easier to distinguish between "tmpfs write failed due to OOM" and "tmpfs write failed because you exceeded tmpfs' max size" (what makes tmpfs "special" is that tmpfs, for good reason, returns ENOSPC for both of these situations to the user). 
For a small task a user could easily go from 0% to full, or OOM, rather
quickly, so statfs polling would likely miss the event. The orchestrator
can, when the task fails, easily (and reliably) look at this statistic to
determine if a user exceeded the tmpfs limit.

(I do see the parallel here to thin provisioned storage - "exceeded
your individual budget" vs. "underlying overcommitted system ran out
of bytes")

> 2. OTOH, exporting internal fs stats via /sys/fs for debugging, health
> monitoring
> or whatever seems legit to me and is widely practiced by other fs, so
> exposing those tmpfs stats as this patch set is doing seems fine to me.
>
> Another point worth considering in favor of /sys/fs/tmpfs -
> since tmpfs is FS_USERNS_MOUNT, the ability of sysadmin to monitor all
> tmpfs mounts in the system and their usage is limited.
>
> Therefore, having a central way to enumerate all tmpfs instances in the system
> like blockdev fs instances and like fuse fs instances, does not sound
> like a terrible idea in general.
>
> > > Whatever that action is, I see no user-facing documentation which
> > > guides the user into how to take advantage of this?
> >
> > I can follow up with a new version with documentation, if we agree this
> > feature makes sense.
> >
>
> Given the time of year and participants involved, shall we continue
> this discussion
> in LSFMM?
>
> I am not sure if this even requires a shared FS/MM session, but I
> don't mind trying to allocate a shared FS/MM slot if Andrew and MM guys
> are interested to take part in the discussion.
>
> As long as memcg is able to report OOM to the orchestrator, the problem does not
> sound very tmpfs specific to me.
>
> As Ted explained, cloud providers (for some reason) charge by disk size and not
> by disk usage, so also for non-tmpfs, online growing the fs on demand could
> prove to be a rewarding practice for cloud applications.
>
> Thanks,
> Amir.
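The "look at this statistic" step could then be a simple sysfs read. The exact layout is whatever the series under review defines; the directory and file name below are purely illustrative placeholders:

#include <stdio.h>

/* Hypothetical sketch: read a per-filesystem "ran out of space" counter.
 * Both the /sys/fs/tmpfs/<instance>/ directory and the enospc_count file
 * name are illustrative; the real interface is defined by the series. */
static long read_enospc_counter(const char *instance)
{
	char path[256];
	long val = -1;
	FILE *f;

	snprintf(path, sizeof(path), "/sys/fs/tmpfs/%s/enospc_count", instance);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}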
On Fri, Apr 22, 2022 at 2:19 AM Khazhy Kumykov <khazhy@google.com> wrote: > > On Wed, Apr 20, 2022 at 10:34 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > On Tue, Apr 19, 2022 at 6:29 PM Gabriel Krisman Bertazi > > <krisman@collabora.com> wrote: > > > > > > Andrew Morton <akpm@linux-foundation.org> writes: > > > > > > Hi Andrew, > > > > > > > On Mon, 18 Apr 2022 17:37:10 -0400 Gabriel Krisman Bertazi <krisman@collabora.com> wrote: > > > > > > > >> When provisioning containerized applications, multiple very small tmpfs > > > > > > > > "files"? > > > > > > Actually, filesystems. In cloud environments, we have several small > > > tmpfs associated with containerized tasks. > > > > > > >> are used, for which one cannot always predict the proper file system > > > >> size ahead of time. We want to be able to reliably monitor filesystems > > > >> for ENOSPC errors, without depending on the application being executed > > > >> reporting the ENOSPC after a failure. > > > > > > > > Well that sucks. We need a kernel-side workaround for applications > > > > that fail to check and report storage errors? > > > > > > > > We could do this for every syscall in the kernel. What's special about > > > > tmpfs in this regard? > > > > > > > > Please provide additional justification and usage examples for such an > > > > extraordinary thing. > > > > > > For a cloud provider deploying containerized applications, they might > > > not control the application, so patching userspace wouldn't be a > > > solution. More importantly - and why this is shmem specific - > > > they want to differentiate between a user getting ENOSPC due to > > > insufficiently provisioned fs size, vs. due to running out of memory in > > > a container, both of which return ENOSPC to the process. > > > > > > > Isn't there already a per memcg OOM handler that could be used by > > orchestrator to detect the latter? > > > > > A system administrator can then use this feature to monitor a fleet of > > > containerized applications in a uniform way, detect provisioning issues > > > caused by different reasons and address the deployment. > > > > > > I originally submitted this as a new fanotify event, but given the > > > specificity of shmem, Amir suggested the interface I'm implementing > > > here. We've raised this discussion originally here: > > > > > > https://lore.kernel.org/linux-mm/CACGdZYLLCqzS4VLUHvzYG=rX3SEJaG7Vbs8_Wb_iUVSvXsqkxA@mail.gmail.com/ > > > > > > > To put things in context, the points I was trying to make in this > > discussion are: > > > > 1. Why isn't monitoring with statfs() a sufficient solution? and more > > specifically, the shared disk space provisioning problem does not sound > > very tmpfs specific to me. > > It is a well known issue for thin provisioned storage in environments > > with shared resources as the ones that you describe > > I think this solves a different problem: to my understanding statfs > polling is useful for determining if a long lived, slowly growing FS > is approaching its limits - the tmpfs here are generally short lived, > and may be intentionally running close to limits (e.g. if they "know" > exactly how much they need, and don't expect to write any more than > that). In this case, the limits are there to guard against runaway > (and assist with scheduling), so "monitor and increase limits > periodically" isn't appropriate. 
>
> It's meant just to make it easier to distinguish between "tmpfs write
> failed due to OOM" and "tmpfs write failed because you exceeded tmpfs'
> max size" (what makes tmpfs "special" is that tmpfs, for good reason,
> returns ENOSPC for both of these situations to the user). For a small

Maybe it's for a good reason, but it clearly is not the desired behavior
in your use case.
Perhaps what is needed here is a way for the user to opt in to a
different OOM behavior from shmem using a mount option?
Would that be enough to cover your use case?

> task a user could easily go from 0% to full, or OOM, rather quickly,
> so statfs polling would likely miss the event. The orchestrator can,
> when the task fails, easily (and reliably) look at this statistic to
> determine if a user exceeded the tmpfs limit.
>
> (I do see the parallel here to thin provisioned storage - "exceeded
> your individual budget" vs. "underlying overcommitted system ran out
> of bytes")

Right, and in this case, the application gets a different error in case
of "underlying space overcommitted", usually EIO, that's why I think that
opting-in for this same behavior could make sense for tmpfs.

We can even consider shutdown behavior for shmem in that case, but that
is up to whoever may be interested in that kind of behavior.

Thanks,
Amir.
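With the opt-in behavior Amir is suggesting (implemented as the report_overcommit mount option in the patch below), an application or supervisor that does see the failing syscall could classify it by errno; a minimal sketch:

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Classify a failed tmpfs write under the proposed opt-in semantics:
 * ENOSPC means the tmpfs size= limit was hit, EIO means the memory
 * overcommit/accounting check failed. */
static int write_or_classify(int fd, const void *buf, size_t len)
{
	if (write(fd, buf, len) >= 0)
		return 0;

	if (errno == ENOSPC)
		fprintf(stderr, "tmpfs: filesystem size limit exceeded\n");
	else if (errno == EIO)
		fprintf(stderr, "tmpfs: memory overcommit limit hit\n");
	else
		perror("write");
	return -1;
}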
Amir Goldstein <amir73il@gmail.com> writes:

>> task a user could easily go from 0% to full, or OOM, rather quickly,
>> so statfs polling would likely miss the event. The orchestrator can,
>> when the task fails, easily (and reliably) look at this statistic to
>> determine if a user exceeded the tmpfs limit.
>>
>> (I do see the parallel here to thin provisioned storage - "exceeded
>> your individual budget" vs. "underlying overcommitted system ran out
>> of bytes")
>
> Right, and in this case, the application gets a different error in case
> of "underlying space overcommitted", usually EIO, that's why I think that
> opting-in for this same behavior could make sense for tmpfs.

Amir,

If I understand correctly, that would allow the application to catch the
lack of memory vs. lack of fs space, but it wouldn't facilitate life for
an orchestrator trying to detect the condition. Still it seems like a
step in the right direction. For the orchestrator, it seems necessary
that we expose this in some out-of-band mechanism, a WB_ERROR
notification or sysfs.

As a first step:

>8

Subject: [PATCH] shmem: Differentiate overcommit failure from lack of fs space

When provisioning user applications in cloud environments, it is common
to allocate containers with very small tmpfs and little available
memory. In such scenarios, it is hard for an application to
differentiate whether its tmpfs IO failed due to insufficient
provisioned filesystem space, or due to running out of memory in the
container, because both situations will return ENOSPC in shmem.

This patch modifies the behavior of shmem failure due to overcommit to
return EIO instead of ENOSPC in this scenario. In order to preserve the
existing interface, this feature must be enabled through a new
shmem-specific mount option.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
---
 Documentation/filesystems/tmpfs.rst | 16 +++++++++++++++
 include/linux/shmem_fs.h            |  3 +++
 mm/shmem.c                          | 30 ++++++++++++++++++++---------
 3 files changed, 40 insertions(+), 9 deletions(-)

diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst
index 0408c245785e..83278d2b15a3 100644
--- a/Documentation/filesystems/tmpfs.rst
+++ b/Documentation/filesystems/tmpfs.rst
@@ -171,6 +171,22 @@ will give you tmpfs instance on /mytmpfs which can allocate 10GB
 RAM/SWAP in 10240 inodes and it is only accessible by root.
 
 
+When provisioning containerized applications, it is common to allocate
+the system with a very small tmpfs and little total memory. In such
+scenarios, it is sometimes useful for an application to differentiate
+whether an IO operation failed due to insufficient provisioned
+filesystem space or due to running out of container memory. tmpfs
+includes a mount parameter to treat a memory overcommit limit error
+differently from a lack of filesystem space error, allowing the
+application to differentiate these two scenarios. If the following
+mount option is specified, surpassing memory overcommit limits on a
+tmpfs will return EIO. ENOSPC is then only used to report lack of
+filesystem space.
+
+================= ===================================================
+report_overcommit Report overcommit issues with EIO instead of ENOSPC
+================= ===================================================
+
 :Author:
    Christoph Rohland <cr@sap.com>, 1.12.01
 :Updated:
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index e65b80ed09e7..1be57531b257 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -44,6 +44,9 @@ struct shmem_sb_info {
 	spinlock_t shrinklist_lock;   /* Protects shrinklist */
 	struct list_head shrinklist;  /* List of shinkable inodes */
 	unsigned long shrinklist_len; /* Length of shrinklist */
+
+	/* Assist userspace with detecting overcommit errors */
+	bool report_overcommit;
 };
 
 static inline struct shmem_inode_info *SHMEM_I(struct inode *inode)
diff --git a/mm/shmem.c b/mm/shmem.c
index a09b29ec2b45..23f2780678df 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -112,6 +112,7 @@ struct shmem_options {
 	kgid_t gid;
 	umode_t mode;
 	bool full_inums;
+	bool report_overcommit;
 	int huge;
 	int seen;
 #define SHMEM_SEEN_BLOCKS 1
@@ -207,13 +208,16 @@ static inline void shmem_unacct_blocks(unsigned long flags, long pages)
 		vm_unacct_memory(pages * VM_ACCT(PAGE_SIZE));
 }
 
-static inline bool shmem_inode_acct_block(struct inode *inode, long pages)
+static inline int shmem_inode_acct_block(struct inode *inode, long pages)
 {
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
 
-	if (shmem_acct_block(info->flags, pages))
-		return false;
+	if (shmem_acct_block(info->flags, pages)) {
+		if (sbinfo->report_overcommit)
+			return -EIO;
+		return -ENOSPC;
+	}
 
 	if (sbinfo->max_blocks) {
 		if (percpu_counter_compare(&sbinfo->used_blocks,
@@ -222,11 +226,11 @@ static inline bool shmem_inode_acct_block(struct inode *inode, long pages)
 		percpu_counter_add(&sbinfo->used_blocks, pages);
 	}
 
-	return true;
+	return 0;
 
 unacct:
 	shmem_unacct_blocks(info->flags, pages);
-	return false;
+	return -ENOSPC;
 }
 
 static inline void shmem_inode_unacct_blocks(struct inode *inode, long pages)
@@ -372,7 +376,7 @@ bool shmem_charge(struct inode *inode, long pages)
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	unsigned long flags;
 
-	if (!shmem_inode_acct_block(inode, pages))
+	if (shmem_inode_acct_block(inode, pages))
 		return false;
 
 	/* nrpages adjustment first, then shmem_recalc_inode() when balanced */
@@ -1555,13 +1559,14 @@ static struct page *shmem_alloc_and_acct_page(gfp_t gfp,
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	struct page *page;
 	int nr;
-	int err = -ENOSPC;
+	int err;
 
 	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
 		huge = false;
 	nr = huge ? HPAGE_PMD_NR : 1;
 
-	if (!shmem_inode_acct_block(inode, nr))
+	err = shmem_inode_acct_block(inode, nr);
+	if (err)
 		goto failed;
 
 	if (huge)
@@ -2324,7 +2329,7 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 	int ret;
 	pgoff_t max_off;
 
-	if (!shmem_inode_acct_block(inode, 1)) {
+	if (shmem_inode_acct_block(inode, 1)) {
 		/*
 		 * We may have got a page, returned -ENOENT triggering a retry,
 		 * and now we find ourselves with -ENOMEM. Release the page, to
@@ -3301,6 +3306,7 @@ enum shmem_param {
 	Opt_uid,
 	Opt_inode32,
 	Opt_inode64,
+	Opt_report_overcommit,
 };
 
 static const struct constant_table shmem_param_enums_huge[] = {
@@ -3322,6 +3328,7 @@ const struct fs_parameter_spec shmem_fs_parameters[] = {
 	fsparam_u32   ("uid",		Opt_uid),
 	fsparam_flag  ("inode32",	Opt_inode32),
 	fsparam_flag  ("inode64",	Opt_inode64),
+	fsparam_flag  ("report_overcommit", Opt_report_overcommit),
 	{}
 };
 
@@ -3405,6 +3412,9 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
 		ctx->full_inums = true;
 		ctx->seen |= SHMEM_SEEN_INUMS;
 		break;
+	case Opt_report_overcommit:
+		ctx->report_overcommit = true;
+		break;
 	}
 	return 0;
 
@@ -3513,6 +3523,7 @@ static int shmem_reconfigure(struct fs_context *fc)
 		sbinfo->max_inodes  = ctx->inodes;
 		sbinfo->free_inodes = ctx->inodes - inodes;
 	}
+	sbinfo->report_overcommit = ctx->report_overcommit;
 
 	/*
 	 * Preserve previous mempolicy unless mpol remount option was specified.
@@ -3640,6 +3651,7 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
 	sbinfo->mode = ctx->mode;
 	sbinfo->huge = ctx->huge;
 	sbinfo->mpol = ctx->mpol;
+	sbinfo->report_overcommit = ctx->report_overcommit;
 	ctx->mpol = NULL;
 
 	raw_spin_lock_init(&sbinfo->stat_lock);
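Assuming the patch above is applied, opting in would just be a matter of passing the new flag at mount time; a sketch using mount(2) (the mount point and size= value are illustrative):

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* "report_overcommit" only exists with the patch above applied;
	 * the mount point and size= value are illustrative. */
	if (mount("tmpfs", "/mnt/container-tmp", "tmpfs", 0,
		  "size=64m,report_overcommit") != 0) {
		perror("mount");
		return 1;
	}
	return 0;
}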
Gabriel Krisman Bertazi <krisman@collabora.com> writes: > Amir Goldstein <amir73il@gmail.com> writes: > >>> task a user could easily go from 0% to full, or OOM, rather quickly, >>> so statfs polling would likely miss the event. The orchestrator can, >>> when the task fails, easily (and reliably) look at this statistic to >>> determine if a user exceeded the tmpfs limit. >>> >>> (I do see the parallel here to thin provisioned storage - "exceeded >>> your individual budget" vs. "underlying overcommitted system ran out >>> of bytes") >> >> Right, and in this case, the application gets a different error in case >> of "underlying space overcommitted", usually EIO, that's why I think that >> opting-in for this same behavior could make sense for tmpfs. > > Amir, > > If I understand correctly, that would allow the application to catch the > lack of memory vs. lack of fs space, but it wouldn't facilitate life for > an orchestrator trying to detect the condition. Still it seems like a > step in the right direction. For the orchestrator, it seems necessary > that we expose this is some out-of-band mechanism, a WB_ERROR > notification or sysfs. Amir, Regarding allowing an orchestrator to catch this situation, I'd like to go back to the original proposal and create a new tmpfs "thin-provisioned" option that will return a different error code (as the patch below, that I sent last week) and also issue a special FAN_FS_ERROR/WB_ERROR to notify the orchestrator of this situation. This would completely solve the use case, I believe. Since this is quite specific to tmpfs, it is reasonable to implement the notification at FS level, similar to how other FS_ERRORs are implemented. > As a first step: > >>8 > Subject: [PATCH] shmem: Differentiate overcommit failure from lack of fs space > > When provisioning user applications in cloud environments, it is common > to allocate containers with very small tmpfs and little available > memory. In such scenarios, it is hard for an application to > differentiate whether its tmpfs IO failed due do insufficient > provisioned filesystem space, or due to running out of memory in the > container, because both situations will return ENOSPC in shmem. > > This patch modifies the behavior of shmem failure due to overcommit to > return EIO instead of ENOSPC in this scenario. In order to preserve the > existing interface, this feature must be enabled through a new > shmem-specific mount option. > > Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> > --- > Documentation/filesystems/tmpfs.rst | 16 +++++++++++++++ > include/linux/shmem_fs.h | 3 +++ > mm/shmem.c | 30 ++++++++++++++++++++--------- > 3 files changed, 40 insertions(+), 9 deletions(-) > > diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst > index 0408c245785e..83278d2b15a3 100644 > --- a/Documentation/filesystems/tmpfs.rst > +++ b/Documentation/filesystems/tmpfs.rst > @@ -171,6 +171,22 @@ will give you tmpfs instance on /mytmpfs which can allocate 10GB > RAM/SWAP in 10240 inodes and it is only accessible by root. > > > +When provisioning containerized applications, it is common to allocate > +the system with a very small tmpfs and little total memory. In such > +scenarios, it is sometimes useful for an application to differentiate > +whether an IO operation failed due to insufficient provisioned > +filesystem space or due to running out of container memory. 
tmpfs > +includes a mount parameter to treat a memory overcommit limit error > +differently from a lack of filesystem space error, allowing the > +application to differentiate these two scenarios. If the following > +mount option is specified, surpassing memory overcommit limits on a > +tmpfs will return EIO. ENOSPC is then only used to report lack of > +filesystem space. > + > +================= =================================================== > +report_overcommit Report overcommit issues with EIO instead of ENOSPC > +================= =================================================== > + > :Author: > Christoph Rohland <cr@sap.com>, 1.12.01 > :Updated: > diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h > index e65b80ed09e7..1be57531b257 100644 > --- a/include/linux/shmem_fs.h > +++ b/include/linux/shmem_fs.h > @@ -44,6 +44,9 @@ struct shmem_sb_info { > spinlock_t shrinklist_lock; /* Protects shrinklist */ > struct list_head shrinklist; /* List of shinkable inodes */ > unsigned long shrinklist_len; /* Length of shrinklist */ > + > + /* Assist userspace with detecting overcommit errors */ > + bool report_overcommit; > }; > > static inline struct shmem_inode_info *SHMEM_I(struct inode *inode) > diff --git a/mm/shmem.c b/mm/shmem.c > index a09b29ec2b45..23f2780678df 100644 > --- a/mm/shmem.c > +++ b/mm/shmem.c > @@ -112,6 +112,7 @@ struct shmem_options { > kgid_t gid; > umode_t mode; > bool full_inums; > + bool report_overcommit; > int huge; > int seen; > #define SHMEM_SEEN_BLOCKS 1 > @@ -207,13 +208,16 @@ static inline void shmem_unacct_blocks(unsigned long flags, long pages) > vm_unacct_memory(pages * VM_ACCT(PAGE_SIZE)); > } > > -static inline bool shmem_inode_acct_block(struct inode *inode, long pages) > +static inline int shmem_inode_acct_block(struct inode *inode, long pages) > { > struct shmem_inode_info *info = SHMEM_I(inode); > struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); > > - if (shmem_acct_block(info->flags, pages)) > - return false; > + if (shmem_acct_block(info->flags, pages)) { > + if (sbinfo->report_overcommit) > + return -EIO; > + return -ENOSPC; > + } > > if (sbinfo->max_blocks) { > if (percpu_counter_compare(&sbinfo->used_blocks, > @@ -222,11 +226,11 @@ static inline bool shmem_inode_acct_block(struct inode *inode, long pages) > percpu_counter_add(&sbinfo->used_blocks, pages); > } > > - return true; > + return 0; > > unacct: > shmem_unacct_blocks(info->flags, pages); > - return false; > + return -ENOSPC; > } > > static inline void shmem_inode_unacct_blocks(struct inode *inode, long pages) > @@ -372,7 +376,7 @@ bool shmem_charge(struct inode *inode, long pages) > struct shmem_inode_info *info = SHMEM_I(inode); > unsigned long flags; > > - if (!shmem_inode_acct_block(inode, pages)) > + if (shmem_inode_acct_block(inode, pages)) > return false; > > /* nrpages adjustment first, then shmem_recalc_inode() when balanced */ > @@ -1555,13 +1559,14 @@ static struct page *shmem_alloc_and_acct_page(gfp_t gfp, > struct shmem_inode_info *info = SHMEM_I(inode); > struct page *page; > int nr; > - int err = -ENOSPC; > + int err; > > if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) > huge = false; > nr = huge ? 
HPAGE_PMD_NR : 1; > > - if (!shmem_inode_acct_block(inode, nr)) > + err = shmem_inode_acct_block(inode, nr); > + if (err) > goto failed; > > if (huge) > @@ -2324,7 +2329,7 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm, > int ret; > pgoff_t max_off; > > - if (!shmem_inode_acct_block(inode, 1)) { > + if (shmem_inode_acct_block(inode, 1)) { > /* > * We may have got a page, returned -ENOENT triggering a retry, > * and now we find ourselves with -ENOMEM. Release the page, to > @@ -3301,6 +3306,7 @@ enum shmem_param { > Opt_uid, > Opt_inode32, > Opt_inode64, > + Opt_report_overcommit, > }; > > static const struct constant_table shmem_param_enums_huge[] = { > @@ -3322,6 +3328,7 @@ const struct fs_parameter_spec shmem_fs_parameters[] = { > fsparam_u32 ("uid", Opt_uid), > fsparam_flag ("inode32", Opt_inode32), > fsparam_flag ("inode64", Opt_inode64), > + fsparam_flag ("report_overcommit", Opt_report_overcommit), > {} > }; > > @@ -3405,6 +3412,9 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param) > ctx->full_inums = true; > ctx->seen |= SHMEM_SEEN_INUMS; > break; > + case Opt_report_overcommit: > + ctx->report_overcommit = true; > + break; > } > return 0; > > @@ -3513,6 +3523,7 @@ static int shmem_reconfigure(struct fs_context *fc) > sbinfo->max_inodes = ctx->inodes; > sbinfo->free_inodes = ctx->inodes - inodes; > } > + sbinfo->report_overcommit = ctx->report_overcommit; > > /* > * Preserve previous mempolicy unless mpol remount option was specified. > @@ -3640,6 +3651,7 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc) > sbinfo->mode = ctx->mode; > sbinfo->huge = ctx->huge; > sbinfo->mpol = ctx->mpol; > + sbinfo->report_overcommit = ctx->report_overcommit; > ctx->mpol = NULL; > > raw_spin_lock_init(&sbinfo->stat_lock);
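For the notification half of Gabriel's proposal, the listener side would use the existing fanotify FAN_FS_ERROR interface (merged in v5.16); whether and when shmem would emit such events is exactly what is being proposed here. A rough sketch of the superblock watch (the mount point is illustrative; CAP_SYS_ADMIN is required):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/fanotify.h>

int main(void)
{
	char buf[4096];

	/* FAN_FS_ERROR requires an FID-reporting notification group. */
	int fd = fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_FID, O_RDONLY);
	if (fd < 0) {
		perror("fanotify_init");
		return 1;
	}

	/* Watch the whole filesystem backing the (illustrative) mount point. */
	if (fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
			  FAN_FS_ERROR, AT_FDCWD, "/mnt/container-tmp") < 0) {
		perror("fanotify_mark");
		return 1;
	}

	for (;;) {
		ssize_t len = read(fd, buf, sizeof(buf));

		if (len <= 0)
			break;
		/* Each event carries a FAN_EVENT_INFO_TYPE_ERROR record
		 * (struct fanotify_event_info_error) with the errno value
		 * and an error count; parsing is omitted in this sketch. */
		printf("got %zd bytes of fs-error events\n", len);
	}
	return 0;
}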