Message ID | 20240926221103.24423-1-linmaxi@gmail.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | vfs/ext4: Fixed a potential problem related to an infinite loop | expand |
On Fri, Sep 27, 2024 at 01:11:03AM +0300, Max Brener wrote: > Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219306 > > This patch fixes a potential infinite journal-truncate-print > problem. When systemd's journald is called, ftruncate syscall is > called. If anywhere down the call stack of ftruncate a printk of > some sort happens, it triggers journald again therefore an infinite > loop is established. This isn't a good justification for this change; in general, whenever you have code paths which get triggered when a logging daemon is triggered, whether it's systemd-journald, or syslog, and this can cause this kind of infinite loop. For example, suppose you are using remote logging (where a log message gets sent over the network via the remote syslog facility), and anything in the networking stack triggers a printk, that will also trigger an "infinite loop". This falls in the "Doctor, doctor, it hurts when I do that --- so don't do that!" In this particular situation, journald is doing something silly/stupid which is whenver a message is logged, it is issuing a no-op ftruncate to the journald log file. It's also worth noting that ext4's truncate path does *not* trigger a printk unless something really haw gone wrong (e.g., a WARN_ON when a kernel bug has happened and flags in the in-memory get erronously set, or the file system gets corrupted and this gets reported via ext4_error()). The reporter discovered this by explicitly adding a printk in their privatea kernel sources, and in general, when you add random changes to the kernel, any unfortunate consequences are not something that upstream code can be expected to defend against. For context, see: https://bugzilla.kernel.org/show_bug.cgi?id=219306 We can justify an optimization here so that in the case of silly/stupid userspace programs which are constnatly calling truncate(2) which are no-ops, we can optimize ext4's handling of these silly/stupid programs. The ext4_truncate() code path causes starting a journal handle, adding the inode to the orphan list, and then removing it at the end of the truncate. In the case where sopme program calls truncate() in a tight loop, we can optimize the behaviour. It's not a high priority optimization, but if given that we can't necessarily change silly/stupid userspace programmers, it can be something that we can do if the patch is too invasive. HOWEVER.... > To fix this issue: > Add a new inode flag S_TRUNCATED which helps in stopping such an infinite loop by marking an in-memory inode as already truncated. Adding a generic VFS-level flag is not something that we can justify here. The VFS maintainers would NACK such a change, and deservedly so. What I had in mind was to define a new EXT4 state flag, say, EXT4_STATE_TRUNCATED, and then test, set, and clear it using ext4_{test,set,clear}_inode_state(). Cheers, - Ted
On Fri 27-09-24 08:50:19, Theodore Ts'o wrote: > On Fri, Sep 27, 2024 at 01:11:03AM +0300, Max Brener wrote: > > Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219306 > > > > This patch fixes a potential infinite journal-truncate-print > > problem. When systemd's journald is called, ftruncate syscall is > > called. If anywhere down the call stack of ftruncate a printk of > > some sort happens, it triggers journald again therefore an infinite > > loop is established. > > This isn't a good justification for this change; in general, whenever > you have code paths which get triggered when a logging daemon is > triggered, whether it's systemd-journald, or syslog, and this can > cause this kind of infinite loop. For example, suppose you are using > remote logging (where a log message gets sent over the network via the > remote syslog facility), and anything in the networking stack triggers > a printk, that will also trigger an "infinite loop". This falls in > the "Doctor, doctor, it hurts when I do that --- so don't do that!" > > In this particular situation, journald is doing something silly/stupid > which is whenver a message is logged, it is issuing a no-op ftruncate > to the journald log file. It's also worth noting that ext4's truncate > path does *not* trigger a printk unless something really haw gone > wrong (e.g., a WARN_ON when a kernel bug has happened and flags in the > in-memory get erronously set, or the file system gets corrupted and > this gets reported via ext4_error()). The reporter discovered this by > explicitly adding a printk in their privatea kernel sources, and in > general, when you add random changes to the kernel, any unfortunate > consequences are not something that upstream code can be expected to > defend against. > > For context, see: https://bugzilla.kernel.org/show_bug.cgi?id=219306 Right, loops like these are not something we should be fixing in the kernel. > We can justify an optimization here so that in the case of > silly/stupid userspace programs which are constnatly calling > truncate(2) which are no-ops, we can optimize ext4's handling of these > silly/stupid programs. The ext4_truncate() code path causes starting > a journal handle, adding the inode to the orphan list, and then > removing it at the end of the truncate. In the case where sopme > program calls truncate() in a tight loop, we can optimize the > behaviour. It's not a high priority optimization, but if given that > we can't necessarily change silly/stupid userspace programmers, it can > be something that we can do if the patch is too invasive. > > HOWEVER.... > > > > To fix this issue: > > Add a new inode flag S_TRUNCATED which helps in stopping such an infinite loop by marking an in-memory inode as already truncated. > > Adding a generic VFS-level flag is not something that we can justify > here. The VFS maintainers would NACK such a change, and deservedly > so. > > What I had in mind was to define a new EXT4 state flag, say, > EXT4_STATE_TRUNCATED, and then test, set, and clear it using > ext4_{test,set,clear}_inode_state(). Agreed as well. I'll also note that keeping such flag uptodate is not as simple as it seems because there are various places that may be allocating blocks beyond EOF (for example extending writes) and that rely on ext4_truncate() removing them so one needs to be careful to capture all the places where the "truncated" state needs to be cleared. Honza
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 54bdd4884fe6..c2a9e9be23e2 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4193,6 +4193,8 @@ int ext4_truncate(struct inode *inode) if (IS_SYNC(inode)) ext4_handle_sync(handle); + inode->i_flags |= S_TRUNCATED; + out_stop: /* * If this was a simple ftruncate() and the file will remain alive, @@ -5492,7 +5494,8 @@ int ext4_setattr(struct mnt_idmap *idmap, struct dentry *dentry, * Call ext4_truncate() even if i_size didn't change to * truncate possible preallocated blocks. */ - if (attr->ia_size <= oldsize) { + if (attr->ia_size < oldsize || + (attr->ia_size == oldsize && !IS_TRUNCATED(inode))) { rc = ext4_truncate(inode); if (rc) error = rc; diff --git a/fs/open.c b/fs/open.c index daf1b55ca818..3f34dda79479 100644 --- a/fs/open.c +++ b/fs/open.c @@ -329,6 +329,9 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len) if (!file->f_op->fallocate) return -EOPNOTSUPP; + if (mode & FALLOC_FL_MODE_MASK & FALLOC_FL_KEEP_SIZE) + inode->i_flags &= (~S_TRUNCATED); + file_start_write(file); ret = file->f_op->fallocate(file, mode, offset, len); diff --git a/include/linux/fs.h b/include/linux/fs.h index 6b8df574729c..ae613481dc29 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2272,6 +2272,7 @@ struct super_operations { #define S_CASEFOLD (1 << 15) /* Casefolded file */ #define S_VERITY (1 << 16) /* Verity file (using fs/verity/) */ #define S_KERNEL_FILE (1 << 17) /* File is in use by the kernel (eg. fs/cachefiles) */ +#define S_TRUNCATED (1 << 18) /* Is the file truncated */ /* * Note that nosuid etc flags are inode-specific: setting some file-system @@ -2328,6 +2329,7 @@ static inline bool sb_rdonly(const struct super_block *sb) { return sb->s_flags #define IS_WHITEOUT(inode) (S_ISCHR(inode->i_mode) && \ (inode)->i_rdev == WHITEOUT_DEV) +#define IS_TRUNCATED(inode) ((inode)->i_flags & S_TRUNCATED) static inline bool HAS_UNMAPPED_ID(struct mnt_idmap *idmap, struct inode *inode)
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219306 This patch fixes a potential infinite journal-truncate-print problem. When systemd's journald is called, ftruncate syscall is called. If anywhere down the call stack of ftruncate a printk of some sort happens, it triggers journald again therefore an infinite loop is established. Some example of such situation (aside from the example given in the added link): ext4_es_print_tree(struct inode*) is one of the functions being called as a part of the call stack of ftruncate (System.journal is an extent file). In this function, some debug prints occur. There can be more printk statements down the call stack (and if not, then there might be in the future). This patch does two things: prevents an infinite loop, and optimizes truncate operation by avoiding doing it if it is a no-op call. To fix this issue: Add a new inode flag S_TRUNCATED which helps in stopping such an infinite loop by marking an in-memory inode as already truncated. When ext4_setattr() is triggered, it won't call ext4_truncate() if the size of the file is unchanged and the file was truncated at least once (which might be necessary for removing preallocated blocks). Calling truncate for the second time with no size change, is a no-op call, therefore the call is avoided. Next, turn on the flag of the inode if it is truncated in ext4_truncate(). In ext4_setattr, avoid calling ext4_truncate if the inode is already truncated and the size of the file is not meant to change in the current call. Finally, zero S_TRUNCATED flag of the inode at fallocate() when it is being called with FALLOC_FL_KEEP_SIZE flag. Signed-off-by: Max Brener <linmaxi@gmail.com> --- fs/ext4/inode.c | 5 ++++- fs/open.c | 3 +++ include/linux/fs.h | 2 ++ 3 files changed, 9 insertions(+), 1 deletion(-)