Message ID: cover.1673623253.git.alexl@redhat.com (mailing list archive)
Series: Composefs: an opportunistically sharing verified image filesystem
Hi Alexander and folks,

On 2023/1/13 23:33, Alexander Larsson wrote:
> Giuseppe Scrivano and I have recently been working on a new project we
> call composefs. This is the first time we propose this publicly and we
> would like some feedback on it.
>
> At its core, composefs is a way to construct and use read-only images
> that are used similarly to how you would use e.g. loop-back mounted
> squashfs images. On top of this composefs has two fundamental
> features. First, it allows sharing of file data (both on disk and in
> page cache) between images, and secondly it has dm-verity-like
> validation on read.
>
> Let me first start with a minimal example of how this can be used,
> before going into the details:
>
> Suppose we have this source for an image:
>
> rootfs/
> ├── dir
> │   └── another_a
> ├── file_a
> └── file_b
>
> We can then use this to generate an image file and a set of
> content-addressed backing files:
>
> # mkcomposefs --digest-store=objects rootfs/ rootfs.img
> # ls -l rootfs.img objects/*/*
> -rw-------. 1 root root   10 Nov 18 13:20 objects/02/927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4
> -rw-------. 1 root root   10 Nov 18 13:20 objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
> -rw-r--r--. 1 root root 4228 Nov 18 13:20 rootfs.img
>
> The rootfs.img file contains all information about directory and file
> metadata plus references to the backing files by name. We can now
> mount this and look at the result:
>
> # mount -t composefs rootfs.img -o basedir=objects /mnt
> # ls /mnt/
> dir  file_a  file_b
> # cat /mnt/file_a
> content_a
>
> When reading this file the kernel is actually reading the backing
> file, in a fashion similar to overlayfs. Since the backing file is
> content-addressed, the objects directory can be shared for multiple
> images, and any files that happen to have the same content are shared.
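For illustration, the content-addressed objects/ layout in the example above can be sketched in a few lines of Python. Note this is a simplified stand-in: mkcomposefs names backing files by their fs-verity digest, while this sketch uses a plain SHA-256 of the contents; only the fan-out layout and the sharing behavior are the point here.

```python
import hashlib
from pathlib import Path

def object_path(store: Path, data: bytes) -> Path:
    """Return the content-addressed path for `data` inside `store`.

    Simplified stand-in: real mkcomposefs uses the fs-verity digest
    (a Merkle tree measurement), not a flat SHA-256 of the contents.
    """
    digest = hashlib.sha256(data).hexdigest()
    # Fan out on the first two hex characters, matching the
    # objects/02/9278... layout shown in the example above.
    return store / digest[:2] / digest[2:]

def store_object(store: Path, data: bytes) -> Path:
    """Write `data` into the store; identical contents land at the
    identical path, which is what makes the sharing opportunistic."""
    path = object_path(store, data)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return path
```

Two files with the same content map to the same backing object, so the store (and the page cache for it) is shared across any number of images.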
> I refer to this as opportunistic sharing, as it is different than the
> more coarse-grained explicit sharing used by e.g. container base
> images.

I'd like to say sorry about my comments in the LWN.net article. If it
helps the community, my own concern about this new overlay model (which
is different from overlayfs, since overlayfs doesn't give the original
files different permissions) was a security issue, as I told Giuseppe
Scrivano when he initially found me on Slack.

As the composefs on-disk format shows:

struct cfs_inode_s {
	...
	u32 st_mode; /* File type and mode. */
	u32 st_nlink; /* Number of hard links, only for regular files. */
	u32 st_uid; /* User ID of owner. */
	u32 st_gid; /* Group ID of owner. */
	...
};

it seems Composefs can override the uid/gid and mode bits of the
original file. Consider a rootfs image:

├── /bin
│   └── su

/bin/su has the SUID bit set in the Composefs inode metadata, but I
didn't find any clue whether the ostree object "objects/abc" could
actually be replaced with the data of /bin/sh if the composefs
fs-verity feature is disabled (it doesn't seem that composefs
unconditionally enforces fs-verity, according to the documentation).

I think that could enable a _privilege escalation attack_ if these SUID
files are replaced with some root shell. Administrators cannot watch
these SUID files all the time, because such files can also be replaced
at runtime.

Composefs may assume that ostree always manages such a content-addressed
directory. But considering it could later become an upstream filesystem,
I think we cannot always tell people "no, don't use it this way, it
doesn't work" if they use Composefs with an untrusted repo (maybe even
without ostree).

That was my own concern at the time when Giuseppe Scrivano asked me to
enhance EROFS in this way, and I requested that he discuss it on the
fsdevel mailing list first in order to resolve it, but that didn't
happen. Otherwise, EROFS could face the same issue, which is why I
think it needs to be discussed first.

> The next step is the validation.
> Note how the object files have fs-verity enabled. In fact, they are
> named by their fs-verity digest:
>
> # fsverity digest objects/*/*
> sha256:02927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4 objects/02/927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4
> sha256:cc3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
>
> The generated filesystem image may contain the expected digest for the
> backing files. When the backing file digest is incorrect, the open
> will fail, and if the open succeeds, any other on-disk file changes
> will be detected by fs-verity:
>
> # cat objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
> content_a
> # rm -f objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
> # echo modified > objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
> # cat /mnt/file_a
> WARNING: composefs backing file '3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f' unexpectedly had no fs-verity digest
> cat: /mnt/file_a: Input/output error
>
> This re-uses the existing fs-verity functionality to protect against
> changes in file contents, while adding on top of it protection against
> changes in filesystem metadata and structure, i.e. protecting against
> replacing a fs-verity enabled file or modifying file permissions or
> xattrs.
>
> To be fully verified we need another step: we use fs-verity on the
> image itself. Then we pass the expected digest on the mount command
> line (which will be verified at mount time):
>
> # fsverity enable rootfs.img
> # fsverity digest rootfs.img
> sha256:da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 rootfs.img
> # mount -t composefs rootfs.img -o basedir=objects,digest=da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 /mnt

It seems that Composefs uses fsverity_get_digest() to do the fs-verity
check.
If Composefs uses a symlink-like payload to redirect a file to another
underlying file, such a file can live on any other filesystem. I can
see that Composefs could work with ext4, btrfs, f2fs, and later XFS,
but I'm not sure how it could work with overlayfs, FUSE, or other
network filesystems. That could limit the use cases as well.

Except for the above, I think EROFS could implement this in about
300~500 new lines of code, as I told Giuseppe, and so could squashfs or
overlayfs. I'm very happy to implement such a model if it can be proved
safe (I'd also like to say here that by no means do I dislike ostree),
and I'm also glad if folks feel like introducing a new filesystem for
this, as long as this overlay model is proved safe.

Hopefully it helps.

Thanks,
Gao Xiang
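The fs-verity digest discussed in this thread is a Merkle-tree measurement of the file, not a flat hash, which is why a file cannot be modified after the digest is taken without detection. A toy sketch of that construction, and of the open-time check a composefs-like filesystem performs against it, is below. This is heavily simplified: real fs-verity zero-pads blocks, fixes the tree arity by its block and digest sizes, and finally hashes a descriptor structure (root hash, block size, salt, file size).

```python
import hashlib

BLOCK_SIZE = 4096  # fs-verity's default Merkle tree block size

def merkle_root(data: bytes) -> bytes:
    """Toy Merkle-tree file digest: hash each block, then repeatedly
    hash pairs of digests until a single root remains. Conveys the
    shape of fs-verity's construction, not its exact on-disk format."""
    level = [hashlib.sha256(data[i:i + BLOCK_SIZE]).digest()
             for i in range(0, max(len(data), 1), BLOCK_SIZE)]
    while len(level) > 1:
        level = [hashlib.sha256(b"".join(level[i:i + 2])).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

def verify_backing_file(data: bytes, expected_hex: str) -> bool:
    """What the open-time check amounts to: compare the file's measured
    digest against the expected digest recorded in the image."""
    return merkle_root(data).hex() == expected_hex
```

Because the expected digest is stored in the (itself verified) image, replacing or tampering with a backing file changes its measured root and the open or subsequent reads fail, as the `cat /mnt/file_a` error in the example shows.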
On Mon, 2023-01-16 at 12:44 +0800, Gao Xiang wrote:
> Hi Alexander and folks,
>
> I'd like to say sorry about comments in LWN.net article. If it helps
> to the community, my own concern about this new overlay model was
> (which is different from overlayfs since overlayfs doesn't have
> different permission of original files) somewhat a security issue (as
> I told Giuseppe Scrivano before when he initially found me on slack):
>
> As composefs on-disk shown:
>
> struct cfs_inode_s {
>	...
>	u32 st_mode; /* File type and mode. */
>	u32 st_nlink; /* Number of hard links, only for regular files. */
>	u32 st_uid; /* User ID of owner. */
>	u32 st_gid; /* Group ID of owner. */
>	...
> };
>
> It seems Composefs can override uid / gid and mode bits of the
> original file, considering a rootfs image:
>
> ├── /bin
> │   └── su
>
> /bin/su has SUID bit set in the Composefs inode metadata, but I didn't
> find some clues if ostree "objects/abc" could be actually replaced
> with data of /bin/sh if composefs fsverity feature is disabled (it
> doesn't seem composefs enforcely enables fsverity according to
> documentation).
>
> I think that could cause _privilege escalation attack_ of these SUID
> files is replaced with some root shell. Administrators cannot keep
> all the time of these SUID files because such files can also be
> replaced at runtime.
>
> Composefs may assume that ostree is always for such content-addressed
> directory. But if considering it could laterly be an upstream fs, I
> think we cannot always tell people "no, don't use this way, it
> doesn't work" if people use Composefs under an untrusted repo (maybe
> even without ostree).
>
> That was my own concern at that time when Giuseppe Scrivano told me
> to enhance EROFS as this way, and I requested him to discuss this in
> the fsdevel mailing list in order to resolve this, but it doesn't
> happen.
>
> Otherwise, EROFS could face such issue as well, that is why I think
> it needs to be discussed first.
I mean, you're not wrong about this being possible. But I don't see
that this is necessarily a new problem. For example, consider the case
of loopback mounting an ext4 filesystem containing a setuid /bin/su
file. If you have the right permissions, nothing prohibits you from
modifying the loopback-mounted file and replacing the content of the su
file with a copy of bash.

In both these cases, the security of the system is fully defined by the
filesystem permissions of the backing file data. I think viewing
composefs as a "new type" of overlayfs gets the wrong idea across. It's
more similar to a "new type" of loopback mount. In particular, the
backing file metadata is completely unrelated to the metadata exposed
by the filesystem, which means that you can choose to protect the
backing files (and directories) in ways which protect against changes
from non-privileged users.

Note: The above assumes that mounting either a loopback mount or a
composefs image is a privileged operation. Allowing unprivileged mounts
is a very different thing.

>> To be fully verified we need another step: we use fs-verity on the
>> image itself. Then we pass the expected digest on the mount command
>> line (which will be verified at mount time):
>>
>> # fsverity enable rootfs.img
>> # fsverity digest rootfs.img
>> sha256:da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 rootfs.img
>> # mount -t composefs rootfs.img -o basedir=objects,digest=da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 /mnt
>
> It seems that Composefs uses fsverity_get_digest() to do fsverity
> check. If Composefs uses symlink-like payload to redirect a file to
> another underlayfs file, such underlayfs file can exist in any other
> fses.
>
> I can see Composefs could work with ext4, btrfs, f2fs, and later XFS
> but I'm not sure how it could work with overlayfs, FUSE, or other
> network fses. That could limit the use cases as well.
Yes, if you choose to store backing files on a non-fs-verity-enabled
filesystem you cannot use the fs-verity feature. But this is just a
decision users of composefs have to take if they wish to use this
particular feature. I think re-using fs-verity like this is a better
approach than re-implementing verity.

> Except for the above, I think EROFS could implement this in about
> 300~500 new lines of code as Giuseppe found me, or squashfs or
> overlayfs.
>
> I'm very happy to implement such model if it can be proved as safe
> (I'd also like to say here by no means I dislike ostree) and I'm also
> glad if folks feel like to introduce a new file system for this as
> long as this overlay model is proved as safe.

My personal target usecase is that of the ostree trusted root
filesystem, and it has a lot of specific requirements that lead to
choices in the design of composefs. I took a look at EROFS a while ago,
and I think that even with some verity-like feature it would not fit
this usecase.

EROFS does indeed do some of the file-sharing aspects of composefs with
its use of fs-cache (although the current n_chunk limit would need to
be raised). However, I think there are two problems with this.

First of all is the complexity of having to involve a userspace for the
cache. For trusted boot to work we have to have all the cachefs
userspace machinery on the (signed) initrd, and then have to properly
transition this across the pivot-root into the full OS boot. I'm sure
it is technically *possible*, but it is very complex and a pain to set
up and maintain.

Secondly, the use of fs-cache doesn't stack, as there can only be one
cachefs agent. For example, mixing an ostree EROFS boot with a
container backend using EROFS isn't possible (at least without deep
integration between the two userspaces).
Also, if we ignore the file-sharing aspects, there is the question of
how to actually integrate a new digest-based image format with the
pre-existing ostree formats and distribution mechanisms. If we just
replace everything with distributing a signed image file then we can
easily use existing technology (say dm-verity + squashfs + loopback).
However, this would essentially be A/B booting and we would lose all
the advantages of ostree.

Instead, what we have done with composefs is to make filesystem image
generation from the ostree repository 100% reproducible. Then we can
keep the entire pre-existing ostree distribution mechanism and on-disk
repo format, adding just a single piece of metadata to the ostree
commit, containing the composefs toplevel digest. Then the client can
easily and efficiently re-generate the composefs image locally, and
boot into it specifying the trusted, not-locally-generated digest. A
filesystem that doesn't have this reproducibility feature isn't going
to be possible to integrate with ostree without enormous changes to
ostree, and a filesystem more complex than composefs will have a hard
time giving such guarantees.
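The regenerate-and-verify flow described here can be sketched in a few lines. The serialization below is hypothetical (not the real composefs on-disk format): the point is only that a canonical encoding (sorted paths, fixed field layout) makes the locally generated image bit-identical on every client, so a single trusted digest carried in the signed ostree commit suffices to verify it.

```python
import hashlib

def build_image(entries: dict) -> bytes:
    """Canonically serialize {path: (mode, object_digest)} metadata.

    Toy stand-in for composefs image generation: sorted paths and a
    fixed field encoding make the output reproducible across machines
    given the same input metadata.
    """
    out = []
    for path in sorted(entries):
        mode, digest = entries[path]
        out.append(f"{path}\0{mode:o}\0{digest}\n".encode())
    return b"".join(out)

def boot_check(entries: dict, trusted_digest: str) -> bool:
    """What the client does at boot: regenerate the image locally and
    compare it against the trusted digest from the commit metadata."""
    return hashlib.sha256(build_image(entries)).hexdigest() == trusted_digest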
Hi Alexander,

On 2023/1/16 17:30, Alexander Larsson wrote:
> On Mon, 2023-01-16 at 12:44 +0800, Gao Xiang wrote:
>> Hi Alexander and folks,
>>
>> I'd like to say sorry about comments in LWN.net article. If it helps
>> to the community, my own concern about this new overlay model was
>> (which is different from overlayfs since overlayfs doesn't have
>> different permission of original files) somewhat a security issue (as
>> I told Giuseppe Scrivano before when he initially found me on slack):
>>
>> As composefs on-disk shown:
>>
>> struct cfs_inode_s {
>>	...
>>	u32 st_mode; /* File type and mode. */
>>	u32 st_nlink; /* Number of hard links, only for regular files. */
>>	u32 st_uid; /* User ID of owner. */
>>	u32 st_gid; /* Group ID of owner. */
>>	...
>> };
>>
>> It seems Composefs can override uid / gid and mode bits of the
>> original file, considering a rootfs image:
>>
>> ├── /bin
>> │   └── su
>>
>> /bin/su has SUID bit set in the Composefs inode metadata, but I
>> didn't find some clues if ostree "objects/abc" could be actually
>> replaced with data of /bin/sh if composefs fsverity feature is
>> disabled (it doesn't seem composefs enforcely enables fsverity
>> according to documentation).
>>
>> I think that could cause _privilege escalation attack_ of these SUID
>> files is replaced with some root shell. Administrators cannot keep
>> all the time of these SUID files because such files can also be
>> replaced at runtime.
>>
>> Composefs may assume that ostree is always for such content-addressed
>> directory. But if considering it could laterly be an upstream fs, I
>> think we cannot always tell people "no, don't use this way, it
>> doesn't work" if people use Composefs under an untrusted repo (maybe
>> even without ostree).
>>
>> That was my own concern at that time when Giuseppe Scrivano told me
>> to enhance EROFS as this way, and I requested him to discuss this in
>> the fsdevel mailing list in order to resolve this, but it doesn't
>> happen.
>>
>> Otherwise, EROFS could face such issue as well, that is why I think
>> it needs to be discussed first.
>
> I mean, you're not wrong about this being possible. But I don't see
> that this is necessarily a new problem. For example, consider the case
> of loopback mounting an ext4 filesystem containing a setuid /bin/su
> file. If you have the right permissions, nothing prohibits you from
> modifying the loopback mounted file and replacing the content of the
> su file with a copy of bash.
>
> In both these cases, the security of the system is fully defined by
> the filesystem permissions of the backing file data. I think viewing
> composefs as a "new type" of overlayfs gets the wrong idea across. Its
> more similar to a "new type" of loopback mount. In particular, the
> backing file metadata is completely unrelated to the metadata exposed
> by the filesystem, which means that you can chose to protect the
> backing files (and directories) in ways which protect against changes
> from non-privileged users.
>
> Note: The above assumes that mounting either a loopback mount or a
> composefs image is a privileged operation. Allowing unprivileged
> mounts is a very different thing.

Thanks for the reply. I think if I understand correctly, I can answer
some of your questions. Hopefully it helps everyone interested.

Let's avoid thinking about unprivileged mounts first, although Giuseppe
told me earlier that is also a future step for Composefs. But I don't
know how that could work reliably if a fs has some on-disk format; we
could discuss it later.
I think that as a loopback mount, such loopback files are quite under
the control of admins (take an ext4 loopback mount as an example: each
ext4 filesystem has only one file to access when setting up the
loopback device, and that loopback file is also held open when setting
up the loopback mount, so it cannot be replaced; if you enabled
fs-verity on it beforehand, it cannot be modified either).

But IMHO, here composefs shows a new model in which a stackable
filesystem can point to massive numbers of files under a random
directory, as ostree does (files in such a directory can even be
bind-mounted later, in principle). The original userspace ostree
strictly follows the underlying filesystem's permission checks, but
Composefs can override uid/gid/permissions instead.

That is also why we selected fscache in the first place to manage all
local cache data for EROFS: such a content-defined directory is quite
under the control of in-kernel fscache, instead of being a random
directory created and handed over by some userspace program.

If you are interested in looking into the current in-kernel fscache
behavior, I think it is much the same as what ostree does now. It just
needs new features like:
 - multiple directories;
 - daemonless
to match.

>>> To be fully verified we need another step: we use fs-verity on the
>>> image itself. Then we pass the expected digest on the mount command
>>> line (which will be verified at mount time):
>>>
>>> # fsverity enable rootfs.img
>>> # fsverity digest rootfs.img
>>> sha256:da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 rootfs.img
>>> # mount -t composefs rootfs.img -o basedir=objects,digest=da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 /mnt
>
>> It seems that Composefs uses fsverity_get_digest() to do fsverity
>> check. If Composefs uses symlink-like payload to redirect a file to
>> another underlayfs file, such underlayfs file can exist in any other
>> fses.
>> I can see Composefs could work with ext4, btrfs, f2fs, and later XFS
>> but I'm not sure how it could work with overlayfs, FUSE, or other
>> network fses. That could limit the use cases as well.
>
> Yes, if you chose to store backing files on a non-fs-verity enabled
> filesystem you cannot use the fs-verity feature. But this is just a
> decision users of composefs have to take if they wish to use this
> particular feature. I think re-using fs-verity like this is a better
> approach than re-implementing verity.
>
>> Except for the above, I think EROFS could implement this in about
>> 300~500 new lines of code as Giuseppe found me, or squashfs or
>> overlayfs.
>>
>> I'm very happy to implement such model if it can be proved as safe
>> (I'd also like to say here by no means I dislike ostree) and I'm
>> also glad if folks feel like to introduce a new file system for
>> this as long as this overlay model is proved as safe.
>
> My personal target usecase is that of the ostree trusted root
> filesystem, and it has a lot of specific requirements that lead to
> choices in the design of composefs. I took a look at EROFS a while
> ago, and I think that even with some verity-like feature it would not
> fit this usecase.
>
> EROFS does indeed do some of the file-sharing aspects of composefs
> with its use of fs-cache (although the current n_chunk limit would
> need to be raised). However, I think there are two problems with this.
>
> First of all is the complexity of having to involve a userspace for
> the cache. For trusted boot to work we have to have all the cachefs
> userspace machinery on the (signed) initrd, and then have to properly
> transition this across the pivot-root into the full os boot. I'm sure
> it is technically *possible*, but it is very complex and a pain to set
> up and maintain.
>
> Secondly, the use of fs-cache doesn't stack, as there can only be one
> cachefs agent. For example, mixing an ostree EROFS boot with a
> container backend using EROFS isn't possible (at least without deep
> integration between the two userspaces).

The reasons above are all limitations of the current fscache
implementation:

- First, if such an overlay model really works, EROFS can do it without
  the fscache feature as well to integrate with userspace ostree. Even
  so, I hope this new feature can land in overlayfs rather than
  somewhere else, since overlayfs has a native writable layer, so we
  wouldn't need another overlayfs mount on top just for writing;

- Second, as I mentioned above, the limitations are how fscache behaves
  now, not how it will behave. I did discuss with David Howells, and he
  would also like to develop multiple-directory and daemonless features
  for network fses.

> Also, if we ignore the file sharing aspects there is the question of
> how to actually integrate a new digest-based image format with the
> pre-existing ostree formats and distribution mechanisms. If we just
> replace everything with distributing a signed image file then we can
> easily use existing technology (say dm-verity + squashfs + loopback).
> However, this would be essentially A/B booting and we would lose all
> the advantages of ostree.

EROFS can now do data deduplication, and later page cache sharing as
well.

> Instead what we have done with composefs is to make filesystem image
> generation from the ostree repository 100% reproducible. Then we can

EROFS is 100% reproducible as well.

> keep the entire pre-existing ostree distribution mechanism and on-disk
> repo format, adding just a single piece of metadata to the ostree
> commit, containing the composefs toplevel digest. Then the client can
> easily and efficiently re-generate the composefs image locally, and
> boot into it specifying the trusted not-locally-generated digest.
> A filesystem that doesn't have this reproducibility feature isn't
> going to be possible to integrate with ostree without enormous changes
> to ostree, and a filesystem more complex than composefs will have a
> hard time giving such guarantees.

I'm not sure why EROFS wouldn't be good at this; I could also make an
EROFS version of what Composefs does, with some symlink path attached
to each regular file, and ostree could make use of it as well.

But really, personally I think the issue above is different from
loopback devices and may need to be resolved first. And if possible, I
hope it could be a new overlayfs feature for everyone.

Thanks,
Gao Xiang
On Mon, 2023-01-16 at 18:19 +0800, Gao Xiang wrote:
> Hi Alexander,
>
> On 2023/1/16 17:30, Alexander Larsson wrote:
>>
>> I mean, you're not wrong about this being possible. But I don't see
>> that this is necessarily a new problem. For example, consider the
>> case of loopback mounting an ext4 filesystem containing a setuid
>> /bin/su file. If you have the right permissions, nothing prohibits
>> you from modifying the loopback mounted file and replacing the
>> content of the su file with a copy of bash.
>>
>> In both these cases, the security of the system is fully defined by
>> the filesystem permissions of the backing file data. I think viewing
>> composefs as a "new type" of overlayfs gets the wrong idea across.
>> Its more similar to a "new type" of loopback mount. In particular,
>> the backing file metadata is completely unrelated to the metadata
>> exposed by the filesystem, which means that you can chose to protect
>> the backing files (and directories) in ways which protect against
>> changes from non-privileged users.
>>
>> Note: The above assumes that mounting either a loopback mount or a
>> composefs image is a privileged operation. Allowing unprivileged
>> mounts is a very different thing.
>
> Thanks for the reply. I think if I understand correctly, I could
> answer some of your questions. Hopefully help to everyone interested.
>
> Let's avoid thinking unprivileged mounts first, although Giuseppe told
> me earilier that is also a future step of Composefs. But I don't know
> how it could work reliably if a fs has some on-disk format, we could
> discuss it later.
>
> I think as a loopback mount, such loopback files are quite under
> control (take ext4 loopback mount as an example, each ext4 has the
> only one file to access when setting up loopback devices and such
> loopback file was also opened when setting up loopback mount so it
> cannot be replaced.
> If you enables fsverity for such loopback mount before, it cannot be
> modified as well) by admins.
>
> But IMHO, here composefs shows a new model that some stackable
> filesystem can point to massive files under a random directory as
> what ostree does (even files in such directory can be bind-mounted
> later in principle). But the original userspace ostree strictly
> follows underlayfs permission check but Composefs can override
> uid/gid/permission instead.

Suppose you have:

-rw-r--r-- root root image.ext4
-rw-r--r-- root root image.composefs
drwxr--r-- root root objects/
-rw-r--r-- root root objects/backing.file

Are you saying it is easier for someone to modify backing.file than
image.ext4?

I argue it is not, but composefs takes some steps to avoid issues here.
At mount time, when the basedir ("objects/" above) argument is parsed,
we resolve that path and then create a private vfsmount for it:

resolve_basedir(path) {
        ...
        mnt = clone_private_mount(&path);
        ...
}

fsi->bases[i] = resolve_basedir(path);

Then we open backing files with this mount as root:

real_file = file_open_root_mnt(fsi->bases[i], real_path,
                               file->f_flags, 0);

This will never resolve outside the initially specified basedir, even
with symlinks or whatever. It will also not be affected by later mount
changes in the original mount namespace, as this is a private mount.

This is the same mechanism that overlayfs uses for its upper dirs.

I would argue that anyone who has rights to modify the contents of
files in "objects" (supposing they were created with sane permissions)
would also have rights to modify "image.ext4".

> That is also why we selected fscache at the first time to manage all
> local cache data for EROFS, since such content-defined directory is
> quite under control by in-kernel fscache instead of selecting a
> random directory created and given by some userspace program.
> If you are interested in looking into the current in-kernel fscache
> behavior, I think that is much similar as what ostree does now.
>
> It just needs new features like
> - multiple directories;
> - daemonless
> to match.

Obviously everything can be extended to support everything. But
composefs is very small and simple (2128 lines of code), while at the
same time being easy to use (just mount it with one syscall) and
needing no complex userspace machinery and configuration. Even without
the above feature additions, fscache + cachefiles is 7982 lines, plus
erofs is 9075 lines, and then on top of that you need userspace
integration to even use the thing.

Don't get me wrong, EROFS is great for its usecases, but I don't really
think it is the right choice for my usecase.

>> Secondly, the use of fs-cache doesn't stack, as there can only be one
>> cachefs agent. For example, mixing an ostree EROFS boot with a
>> container backend using EROFS isn't possible (at least without deep
>> integration between the two userspaces).
>
> The reasons above are all current fscache implementation limitation:
>
> - First, if such overlay model really works, EROFS can do it without
>   fscache feature as well to integrate userspace ostree. But even that
>   I hope this new feature can be landed in overlayfs rather than some
>   other ways since it has native writable layer so we don't need
>   another overlayfs mount at all for writing;

I don't think it is the right approach for overlayfs to integrate
something like image support. Merging the two codebases would
complicate both while adding costs to users who need support for only
one of the features. I think reusing and stacking separate features is
a better idea than combining them.

>> Instead what we have done with composefs is to make filesystem image
>> generation from the ostree repository 100% reproducible. Then we can
>
> EROFS is all 100% reproduciable as well.
Really, so if today, on Fedora 36, I run:

# tar xvf oci-image.tar
# mkfs.erofs oci-dir/ oci.erofs

and then in 5 years someone on Debian 13 runs the same, with the same
tar file, will both oci.erofs files have the same sha256 checksum?

How do you handle things like different versions or builds of
compression libraries creating different results? Do you guarantee not
to add any new backwards-compat changes by default, or change any
default options? Do you guarantee that the files are read from
"oci-dir" in the same order each time? It doesn't look like it.

> But really, personally I think the issue above is different from
> loopback devices and may need to be resolved first. And if possible,
> I hope it could be an new overlayfs feature for everyone.

Yeah. Independent of composefs, I think EROFS would be better if you
could just point it at a chunk directory at mount time rather than
having to route everything through a system-wide global cachefs
singleton. I understand that cachefs helps with the on-demand download
aspect, but when you don't need that it is just in the way.
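The readdir-order question raised here is concrete: a mkfs tool that enumerates files in raw directory order inherits the filesystem's internal (often hash-based) ordering, so two runs over identical content can produce different images. The usual fix is to sort every directory listing before hashing or serializing, as in this sketch (a generic illustration, not the behavior of any particular mkfs tool):

```python
import hashlib
import os
from pathlib import Path

def tree_digest(root: Path) -> str:
    """Digest a directory tree in a canonical (sorted) order, so the
    result depends only on content, never on readdir order."""
    h = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # fix the traversal order of subdirectories
        for name in sorted(filenames):
            p = Path(dirpath) / name
            rel = p.relative_to(root).as_posix()
            h.update(rel.encode() + b"\0")  # bind content to its path
            h.update(p.read_bytes())
    return h.hexdigest()
```

With sorting in place, two trees with the same files produce the same digest regardless of the order in which the files were created on disk.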
On 2023/1/16 20:33, Alexander Larsson wrote:
> On Mon, 2023-01-16 at 18:19 +0800, Gao Xiang wrote:
>> Hi Alexander,
>>
>> On 2023/1/16 17:30, Alexander Larsson wrote:
>>>
>>> I mean, you're not wrong about this being possible. But I don't see
>>> that this is necessarily a new problem. For example, consider the
>>> case of loopback mounting an ext4 filesystem containing a setuid
>>> /bin/su file. If you have the right permissions, nothing prohibits
>>> you from modifying the loopback mounted file and replacing the
>>> content of the su file with a copy of bash.
>>>
>>> In both these cases, the security of the system is fully defined by
>>> the filesystem permissions of the backing file data. I think viewing
>>> composefs as a "new type" of overlayfs gets the wrong idea across.
>>> Its more similar to a "new type" of loopback mount. In particular,
>>> the backing file metadata is completely unrelated to the metadata
>>> exposed by the filesystem, which means that you can chose to protect
>>> the backing files (and directories) in ways which protect against
>>> changes from non-privileged users.
>>>
>>> Note: The above assumes that mounting either a loopback mount or a
>>> composefs image is a privileged operation. Allowing unprivileged
>>> mounts is a very different thing.
>>
>> Thanks for the reply. I think if I understand correctly, I could
>> answer some of your questions. Hopefully help to everyone interested.
>>
>> Let's avoid thinking unprivileged mounts first, although Giuseppe
>> told me earilier that is also a future step of Composefs. But I don't
>> know how it could work reliably if a fs has some on-disk format, we
>> could discuss it later.
>> I think as a loopback mount, such loopback files are quite under control by admins (take an ext4 loopback mount as an example: each ext4 filesystem has only one file to access when setting up the loopback device, and that file is also held open when setting up the loopback mount, so it cannot be replaced; if you enable fs-verity on such a loopback file beforehand, it cannot be modified either).
>>
>> But IMHO, here composefs shows a new model where some stackable filesystem can point to massive numbers of files under a random directory, as ostree does (even files in such a directory can be bind-mounted later in principle). But the original userspace ostree strictly follows the underlying fs permission check, while composefs can override uid/gid/permissions instead.
>
> Suppose you have:
>
> -rw-r--r-- root root image.ext4
> -rw-r--r-- root root image.composefs
> drwxr--r-- root root objects/
> -rw-r--r-- root root objects/backing.file
>
> Are you saying it is easier for someone to modify backing.file than image.ext4?
>
> I argue it is not, but composefs takes some steps to avoid issues here. At mount time, when the basedir ("objects/" above) argument is parsed, we resolve that path and then create a private vfsmount for it:
>
> resolve_basedir(path) {
>         ...
>         mnt = clone_private_mount(&path);
>         ...
> }
>
> fsi->bases[i] = resolve_basedir(path);
>
> Then we open backing files with this mount as root:
>
> real_file = file_open_root_mnt(fsi->bases[i], real_path,
>                                file->f_flags, 0);
>
> This will never resolve outside the initially specified basedir, even with symlinks or whatever. It will also not be affected by later mount changes in the original mount namespace, as this is a private mount.
>
> This is the same mechanism that overlayfs uses for its upper dirs.

Ok. I have no problem with this part.
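The confinement property described above (backing-file lookups can never escape the basedir, regardless of symlinks) can be illustrated in user space. This is only an analogue, assuming a realpath-based check; the kernel gets the property structurally from the private vfsmount rather than from path-string comparison:

```python
import os
import tempfile

def open_in_basedir(basedir, rel_path):
    """User-space illustration of the confinement composefs gets from its
    private vfsmount: whatever the image-supplied path looks like
    (symlinks, ".."), the open must resolve inside basedir or be refused.
    The kernel achieves this via clone_private_mount() +
    file_open_root_mnt(); this realpath check is only an analogue."""
    basedir = os.path.realpath(basedir)
    target = os.path.realpath(os.path.join(basedir, rel_path))
    if os.path.commonpath([basedir, target]) != basedir:
        raise PermissionError(f"{rel_path!r} escapes the basedir")
    return open(target, "rb")

root = tempfile.mkdtemp()
objects = os.path.join(root, "objects")
os.mkdir(objects)
with open(os.path.join(objects, "backing.file"), "wb") as f:
    f.write(b"data")
with open(os.path.join(root, "secret"), "wb") as f:
    f.write(b"outside the basedir")
# A hostile entry inside objects/ pointing outside of it:
os.symlink(os.path.join(root, "secret"), os.path.join(objects, "link"))

ok = open_in_basedir(objects, "backing.file").read()
try:
    open_in_basedir(objects, "link")
    escaped = True
except PermissionError:
    escaped = False
```

The symlink that points outside `objects/` is rejected, while the ordinary backing file opens normally.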
> I would argue that anyone who has rights to modify the contents of files in "objects" (supposing they were created with sane permissions) would also have rights to modify "image.ext4".

But you don't have any permission check for files in such an "objects/" directory in the composefs source code, do you?

As I said in my original reply, don't assume that random users or malicious people will only pass things in or behave the way you expect. Sometimes they won't, and I think in-kernel fses should handle such cases by design. Obviously, any system written by humans can have unexpected bugs, but that is another story. I think in general it needs such a design at least.

>> That is also why we selected fscache in the first place to manage all local cache data for EROFS, since such a content-defined directory is quite under the control of in-kernel fscache, instead of being a random directory created and given by some userspace program.
>>
>> If you are interested in looking into the current in-kernel fscache behavior, I think it is much the same as what ostree does now.
>>
>> It just needs new features like
>>  - multiple directories;
>>  - daemonless
>> to match.
>
> Obviously everything can be extended to support everything. But composefs is very small and simple (2128 lines of code), while at the same time being easy to use (just mount it with one syscall) and needs no complex userspace machinery and configuration. But even without the above feature additions fscache + cachefiles is 7982 lines, plus erofs is 9075 lines, and then on top of that you need userspace integration to even use the thing.

I've already replied to this in a comment on LWN.net. EROFS can handle both device-based and file-based images. It can handle FSDAX, compression, data deduplication, rolling-hash finer-grained compressed data deduplication, etc. Of course, for your use cases, you can just turn them off in Kconfig; I think such code is useless for your use cases as well.
And as a team effort over these years, EROFS has always accepted useful features from other people. And I've always been working on cleaning up EROFS, but as long as it gains more features, the code will of course expand.

Also take your project -- flatpak -- for example: I don't think the total line count of the current version is the same as the original version's. Will you always keep the composefs source code below 2.5k LoC?

> Don't take me wrong, EROFS is great for its usecases, but I don't really think it is the right choice for my usecase.
>
>>> Secondly, the use of fs-cache doesn't stack, as there can only be one cachefs agent. For example, mixing an ostree EROFS boot with a container backend using EROFS isn't possible (at least without deep integration between the two userspaces).
>>
>> The reasons above are all current fscache implementation limitations:
>>
>> - First, if such an overlay model really works, EROFS can do it without the fscache feature as well to integrate userspace ostree. But even then I hope this new feature can be landed in overlayfs rather than some other way, since it has a native writable layer so we don't need another overlayfs mount at all for writing;
>
> I don't think it is the right approach for overlayfs to integrate something like image support. Merging the two codebases would complicate both while adding costs to users who need only support for one of the features. I think reusing and stacking separate features is a better idea than combining them.

Why? overlayfs could have metadata support as well, if they'd like to support advanced features like partial copy-up without fscache support.

>>> Instead what we have done with composefs is to make filesystem image generation from the ostree repository 100% reproducible. Then we can
>>
>> EROFS is all 100% reproducible as well.
> Really, so if I today, on fedora 36 run:
> # tar xvf oci-image.tar
> # mkfs.erofs oci-dir/ oci.erofs
>
> And then in 5 years, if someone on debian 13 runs the same, with the same tar file, then both oci.erofs files will have the same sha256 checksum?

Why wouldn't they? Reproducible builds are a MUST for Android use cases as well.

Yes, it may break between versions by mistake, but I think reproducible builds are basic functionality for all image use cases.

> How do you handle things like different versions or builds of compression libraries creating different results? Do you guarantee to not add any new backwards compat changes by default, or change any default options? Do you guarantee that the files are read from "oci-dir" in the same order each time? It doesn't look like it.

If you argue like that, why wouldn't mkcomposefs have the same issue, in that it may be broken by some bug?

>> But really, personally I think the issue above is different from loopback devices and may need to be resolved first. And if possible, I hope it could be a new overlayfs feature for everyone.
>
> Yeah. Independent of composefs, I think EROFS would be better if you could just point it to a chunk directory at mount time rather than having to route everything through a system-wide global cachefs singleton. I understand that cachefs does help with the on-demand download aspect, but when you don't need that it is just in the way.

Just check your reply to Dave's review; it seems that the composefs dir on-disk format works much the same as EROFS's as well, see:

https://docs.kernel.org/filesystems/erofs.html -- Directories

a block vs a chunk = dirent + names

cfs_dir_lookup -> erofs_namei + find_target_block_classic;
cfs_dir_lookup_in_chunk -> find_target_dirent.
Yes, great projects can be very similar to each other occasionally, not to mention open source projects ;)

Anyway, I'm not opposed to Composefs if folks really like a new read-only filesystem for this. That is almost all I'd like to say about Composefs formally, have fun!

Thanks,
Gao Xiang
Gao Xiang <hsiangkao@linux.alibaba.com> writes: > On 2023/1/16 20:33, Alexander Larsson wrote: >> On Mon, 2023-01-16 at 18:19 +0800, Gao Xiang wrote: >>> Hi Alexander, >>> >>> On 2023/1/16 17:30, Alexander Larsson wrote: >>>> >>>> I mean, you're not wrong about this being possible. But I don't see >>>> that this is necessarily a new problem. For example, consider the >>>> case >>>> of loopback mounting an ext4 filesystem containing a setuid /bin/su >>>> file. If you have the right permissions, nothing prohibits you from >>>> modifying the loopback mounted file and replacing the content of >>>> the su >>>> file with a copy of bash. >>>> >>>> In both these cases, the security of the system is fully defined by >>>> the >>>> filesystem permissions of the backing file data. I think viewing >>>> composefs as a "new type" of overlayfs gets the wrong idea across. >>>> Its >>>> more similar to a "new type" of loopback mount. In particular, the >>>> backing file metadata is completely unrelated to the metadata >>>> exposed >>>> by the filesystem, which means that you can chose to protect the >>>> backing files (and directories) in ways which protect against >>>> changes >>>> from non-privileged users. >>>> >>>> Note: The above assumes that mounting either a loopback mount or a >>>> composefs image is a privileged operation. Allowing unprivileged >>>> mounts >>>> is a very different thing. >>> >>> Thanks for the reply. I think if I understand correctly, I could >>> answer some of your questions. Hopefully help to everyone >>> interested. >>> >>> Let's avoid thinking unprivileged mounts first, although Giuseppe >>> told >>> me earilier that is also a future step of Composefs. But I don't know >>> how it could work reliably if a fs has some on-disk format, we could >>> discuss it later. 
>>> >>> I think as a loopback mount, such loopback files are quite under >>> control >>> (take ext4 loopback mount as an example, each ext4 has the only one >>> file >>> to access when setting up loopback devices and such loopback file >>> was >>> also opened when setting up loopback mount so it cannot be >>> replaced. >>> >>> If you enables fsverity for such loopback mount before, it cannot >>> be >>> modified as well) by admins. >>> >>> >>> But IMHO, here composefs shows a new model that some stackable >>> filesystem can point to massive files under a random directory as >>> what >>> ostree does (even files in such directory can be bind-mounted later >>> in >>> principle). But the original userspace ostree strictly follows >>> underlayfs permission check but Composefs can override >>> uid/gid/permission instead. >> Suppose you have: >> -rw-r--r-- root root image.ext4 >> -rw-r--r-- root root image.composefs >> drwxr--r-- root root objects/ >> -rw-r--r-- root root objects/backing.file >> Are you saying it is easier for someone to modify backing.file than >> image.ext4? >> I argue it is not, but composefs takes some steps to avoid issues >> here. >> At mount time, when the basedir ("objects/" above) argument is parsed, >> we resolve that path and then create a private vfsmount for it: >> resolve_basedir(path) { >> ... >> mnt = clone_private_mount(&path); >> ... >> } >> fsi->bases[i] = resolve_basedir(path); >> Then we open backing files with this mount as root: >> real_file = file_open_root_mnt(fsi->bases[i], real_path, >> file->f_flags, 0); >> This will never resolve outside the initially specified basedir, >> even >> with symlinks or whatever. It will also not be affected by later mount >> changes in the original mount namespace, as this is a private mount. >> This is the same mechanism that overlayfs uses for its upper dirs. > > Ok. I have no problem of this part. 
>> I would argue that anyone who has rights to modify the contents of files in "objects" (supposing they were created with sane permissions) would also have rights to modify "image.ext4".
>
> But you don't have any permission check for files in such an "objects/" directory in the composefs source code, do you?
>
> As I said in my original reply, don't assume that random users or malicious people will only pass things in or behave the way you expect. Sometimes they won't, and I think in-kernel fses should handle such cases by design. Obviously, any system written by humans can have unexpected bugs, but that is another story. I think in general it needs such a design at least.

What malicious people are you worried about? Composefs is usable only in the initial user namespace for now, so only root can use it, and root has the responsibility to use trusted files.

>>> That is also why we selected fscache in the first place to manage all local cache data for EROFS, since such a content-defined directory is quite under the control of in-kernel fscache, instead of being a random directory created and given by some userspace program.
>>>
>>> If you are interested in looking into the current in-kernel fscache behavior, I think it is much the same as what ostree does now.
>>>
>>> It just needs new features like
>>>  - multiple directories;
>>>  - daemonless
>>> to match.
>>
>> Obviously everything can be extended to support everything. But composefs is very small and simple (2128 lines of code), while at the same time being easy to use (just mount it with one syscall) and needs no complex userspace machinery and configuration. But even without the above feature additions fscache + cachefiles is 7982 lines, plus erofs is 9075 lines, and then on top of that you need userspace integration to even use the thing.
>
> I've already replied to this in a comment on LWN.net. EROFS can handle both device-based and file-based images.
It can handle FSDAX, compression, > data deduplication, rolling-hash finer compressed data duplication, > etc. Of course, for your use cases, you can just turn them off by > Kconfig, I think such code is useless to your use cases as well. > > And as a team work these years, EROFS always accept useful features > from other people. And I've been always working on cleaning up > EROFS, but as long as it gains more features, the code can expand > of course. > > Also take your project -- flatpak for example, I don't think the > total line of current version is as same as the original version. > > Also you will always maintain Composefs source code below 2.5k Loc? > >> Don't take me wrong, EROFS is great for its usecases, but I don't >> really think it is the right choice for my usecase. >> >>>>> >>>> Secondly, the use of fs-cache doesn't stack, as there can only be >>>> one >>>> cachefs agent. For example, mixing an ostree EROFS boot with a >>>> container backend using EROFS isn't possible (at least without deep >>>> integration between the two userspaces). >>> >>> The reasons above are all current fscache implementation limitation: >>> >>> - First, if such overlay model really works, EROFS can do it >>> without >>> fscache feature as well to integrate userspace ostree. But even that >>> I hope this new feature can be landed in overlayfs rather than some >>> other ways since it has native writable layer so we don't need >>> another >>> overlayfs mount at all for writing; >> I don't think it is the right approach for overlayfs to integrate >> something like image support. Merging the two codebases would >> complicate both while adding costs to users who need only support for >> one of the features. I think reusing and stacking separate features is >> a better idea than combining them. > > Why? overlayfs could have metadata support as well, if they'd like > to support advanced features like partial copy-up without fscache > support. 
> >> >>> >>>> >>>> Instead what we have done with composefs is to make filesystem >>>> image >>>> generation from the ostree repository 100% reproducible. Then we >>>> can >>> >>> EROFS is all 100% reproduciable as well. >>> >> Really, so if I today, on fedora 36 run: >> # tar xvf oci-image.tar >> # mkfs.erofs oci-dir/ oci.erofs >> And then in 5 years, if someone on debian 13 runs the same, with the >> same tar file, then both oci.erofs files will have the same sha256 >> checksum? > > Why it doesn't? Reproducable builds is a MUST for Android use cases > as well. > > Yes, it may break between versions by mistake, but I think > reproducable builds is a basic functionalaity for all image > use cases. > >> How do you handle things like different versions or builds of >> compression libraries creating different results? Do you guarantee to >> not add any new backwards compat changes by default, or change any >> default options? Do you guarantee that the files are read from "oci- >> dir" in the same order each time? It doesn't look like it. > > If you'd like to say like that, why mkcomposefs doesn't have the > same issue that it may be broken by some bug. > >> >>> >>> But really, personally I think the issue above is different from >>> loopback devices and may need to be resolved first. And if possible, >>> I hope it could be an new overlayfs feature for everyone. >> Yeah. Independent of composefs, I think EROFS would be better if you >> could just point it to a chunk directory at mount time rather than >> having to route everything through a system-wide global cachefs >> singleton. I understand that cachefs does help with the on-demand >> download aspect, but when you don't need that it is just in the way. 
> > Just check your reply to Dave's review, it seems that how > composefs dir on-disk format works is also much similar to > EROFS as well, see: > > https://docs.kernel.org/filesystems/erofs.html -- Directories > > a block vs a chunk = dirent + names > > cfs_dir_lookup -> erofs_namei + find_target_block_classic; > cfs_dir_lookup_in_chunk -> find_target_dirent. > > Yes, great projects could be much similar to each other > occasionally, not to mention opensource projects ;) > > Anyway, I'm not opposed to Composefs if folks really like a > new read-only filesystem for this. That is almost all I'd like > to say about Composefs formally, have fun! > > Thanks, > Gao Xiang > >>
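For readers comparing the two formats discussed above: both lay out a directory block as a sorted array of fixed-size dirents followed by the packed names, so a lookup can binary-search by name. A toy Python model of that shared layout (hypothetical helpers, not the actual on-disk structures of either filesystem):

```python
def build_dirent_block(entries):
    """Toy model of the layout both drivers use: fixed-size dirents
    (name offset, name length, inode number) sorted by name, followed
    by the packed names themselves."""
    names = bytearray()
    dirents = []
    for name, ino in sorted(entries.items()):
        encoded = name.encode()
        dirents.append((len(names), len(encoded), ino))
        names += encoded
    return dirents, bytes(names)

def lookup(dirents, names, wanted):
    """Binary search over the sorted dirents, comparing the name each
    dirent points at (in the spirit of cfs_dir_lookup_in_chunk /
    find_target_dirent)."""
    key = wanted.encode()
    lo, hi = 0, len(dirents)
    while lo < hi:
        mid = (lo + hi) // 2
        off, ln, ino = dirents[mid]
        name = names[off:off + ln]
        if name == key:
            return ino
        if name < key:
            lo = mid + 1
        else:
            hi = mid
    return None

# Directory entries from the example at the top of the thread:
dirents, names = build_dirent_block({"dir": 10, "file_a": 11, "file_b": 12})
print(lookup(dirents, names, "file_a"))  # prints 11
```

Sorting the dirents at image-build time is what makes both the O(log n) lookup and the bit-reproducible directory blocks possible.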
On Mon, 2023-01-16 at 21:26 +0800, Gao Xiang wrote:
> On 2023/1/16 20:33, Alexander Larsson wrote:
> >
> > Suppose you have:
> >
> > -rw-r--r-- root root image.ext4
> > -rw-r--r-- root root image.composefs
> > drwxr--r-- root root objects/
> > -rw-r--r-- root root objects/backing.file
> >
> > Are you saying it is easier for someone to modify backing.file than image.ext4?
> >
> > I argue it is not, but composefs takes some steps to avoid issues here. At mount time, when the basedir ("objects/" above) argument is parsed, we resolve that path and then create a private vfsmount for it:
> >
> > resolve_basedir(path) {
> >         ...
> >         mnt = clone_private_mount(&path);
> >         ...
> > }
> >
> > fsi->bases[i] = resolve_basedir(path);
> >
> > Then we open backing files with this mount as root:
> >
> > real_file = file_open_root_mnt(fsi->bases[i], real_path,
> >                                file->f_flags, 0);
> >
> > This will never resolve outside the initially specified basedir, even with symlinks or whatever. It will also not be affected by later mount changes in the original mount namespace, as this is a private mount.
> >
> > This is the same mechanism that overlayfs uses for its upper dirs.
>
> Ok. I have no problem with this part.
>
> > I would argue that anyone who has rights to modify the contents of files in "objects" (supposing they were created with sane permissions) would also have rights to modify "image.ext4".
>
> But you don't have any permission check for files in such an "objects/" directory in the composefs source code, do you?

I don't see how permission checks would make any difference to anyone's ability to modify the image. Do you mean the kernel should validate the basedir so that it has sane permissions, rather than trusting the user? That seems weird to me.
Or do you mean that someone would create a composefs image that references a file they could not otherwise read, and then use it as a basedir in a composefs mount to read the file? Such a mount can only happen if you are root, and it can only read files inside that particular directory. However, maybe we should use the callers credentials to ensure that they are allowed to read the backing file, just in case. That can't hurt. > As I said in my original reply, don't assume random users or > malicious people just passing in or behaving like your expected > way. Sometimes they're not but I think in-kernel fses should > handle such cases by design. Obviously, any system written by > human can cause unexpected bugs, but that is another story. > I think in general it needs to have such design at least. You need to be root to mount a fs, an operation which is generally unsafe (because few filesystems are completely resistant to hostile filesystem data). Therefore I think we can expect a certain amount of sanity in its use, such as "don't pass in directories that are world writable". > > > > > That is also why we selected fscache at the first time to manage > > > all > > > local cache data for EROFS, since such content-defined directory > > > is > > > quite under control by in-kernel fscache instead of selecting a > > > random directory created and given by some userspace program. > > > > > > If you are interested in looking info the current in-kernel > > > fscache > > > behavior, I think that is much similar as what ostree does now. > > > > > > It just needs new features like > > > - multiple directories; > > > - daemonless > > > to match. > > > > > > > Obviously everything can be extended to support everything. But > > composefs is very small and simple (2128 lines of code), while at > > the > > same time being easy to use (just mount it with one syscall) and > > needs > > no complex userspace machinery and configuration. 
But even without > > the > > above feature additions fscache + cachefiles is 7982 lines, plus > > erofs > > is 9075 lines, and then on top of that you need userspace > > integration > > to even use the thing. > > I've replied this in the comment of LWN.net. EROFS can handle both > device-based or file-based images. It can handle FSDAX, compression, > data deduplication, rolling-hash finer compressed data duplication, > etc. Of course, for your use cases, you can just turn them off by > Kconfig, I think such code is useless to your use cases as well. > > And as a team work these years, EROFS always accept useful features > from other people. And I've been always working on cleaning up > EROFS, but as long as it gains more features, the code can expand > of course. > > Also take your project -- flatpak for example, I don't think the > total line of current version is as same as the original version. > > Also you will always maintain Composefs source code below 2.5k Loc? > > > > > Don't take me wrong, EROFS is great for its usecases, but I don't > > really think it is the right choice for my usecase. > > > > > > > > > > > Secondly, the use of fs-cache doesn't stack, as there can only > > > > be > > > > one > > > > cachefs agent. For example, mixing an ostree EROFS boot with a > > > > container backend using EROFS isn't possible (at least without > > > > deep > > > > integration between the two userspaces). > > > > > > The reasons above are all current fscache implementation > > > limitation: > > > > > > - First, if such overlay model really works, EROFS can do it > > > without > > > fscache feature as well to integrate userspace ostree. But even > > > that > > > I hope this new feature can be landed in overlayfs rather than > > > some > > > other ways since it has native writable layer so we don't need > > > another > > > overlayfs mount at all for writing; > > > > I don't think it is the right approach for overlayfs to integrate > > something like image support. 
> > Merging the two codebases would complicate both while adding costs to users who need only support for one of the features. I think reusing and stacking separate features is a better idea than combining them.
>
> Why? overlayfs could have metadata support as well, if they'd like to support advanced features like partial copy-up without fscache support.
>
> > > > Instead what we have done with composefs is to make filesystem image generation from the ostree repository 100% reproducible. Then we can
> > >
> > > EROFS is all 100% reproducible as well.
> >
> > Really, so if I today, on fedora 36 run:
> > # tar xvf oci-image.tar
> > # mkfs.erofs oci-dir/ oci.erofs
> >
> > And then in 5 years, if someone on debian 13 runs the same, with the same tar file, then both oci.erofs files will have the same sha256 checksum?
>
> Why wouldn't they? Reproducible builds are a MUST for Android use cases as well.

Those are not quite the same requirements. A reproducible build in the traditional sense is limited to a particular build configuration. You define a set of tools for the build, and use the same ones for each build, and get a fixed output. You don't expect to be able to change e.g. the compiler and get the same result. Similarly, it is often the case that different builds or versions of compression libraries give different results, so you can't expect to use e.g. a different libz and get identical images.

> Yes, it may break between versions by mistake, but I think reproducible builds are basic functionality for all image use cases.
>
> > How do you handle things like different versions or builds of compression libraries creating different results? Do you guarantee to not add any new backwards compat changes by default, or change any default options? Do you guarantee that the files are read from "oci-dir" in the same order each time?
It doesn't look like it. > > If you'd like to say like that, why mkcomposefs doesn't have the > same issue that it may be broken by some bug. > libcomposefs defines a normalized form for everything like file order, xattr orders, etc, and carefully normalizes everything such that we can guarantee these properties. It is possible that some detail was missed, because we're humans. But it was a very conscious and deliberate design choice that is deeply encoded in the code and format. For example, this is why we don't use compression but try to minimize size in other ways. > > > > > > But really, personally I think the issue above is different from > > > loopback devices and may need to be resolved first. And if > > > possible, > > > I hope it could be an new overlayfs feature for everyone. > > > > Yeah. Independent of composefs, I think EROFS would be better if > > you > > could just point it to a chunk directory at mount time rather than > > having to route everything through a system-wide global cachefs > > singleton. I understand that cachefs does help with the on-demand > > download aspect, but when you don't need that it is just in the > > way. > > Just check your reply to Dave's review, it seems that how > composefs dir on-disk format works is also much similar to > EROFS as well, see: > > https://docs.kernel.org/filesystems/erofs.html -- Directories > > a block vs a chunk = dirent + names > > cfs_dir_lookup -> erofs_namei + find_target_block_classic; > cfs_dir_lookup_in_chunk -> find_target_dirent. Yeah, the dirent layout looks very similar. I guess great minds think alike! My approach was simpler initially, but it kinda converged on this when I started optimizing the kernel lookup code with binary search. > Yes, great projects could be much similar to each other > occasionally, not to mention opensource projects ;) > > Anyway, I'm not opposed to Composefs if folks really like a > new read-only filesystem for this. 
> That is almost all I'd like to say about Composefs formally, have fun!
>
> Thanks,
> Gao Xiang

Cool, thanks for the feedback.
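The normalization argument Alexander makes here — a fully normalized serialization (sorted file order, sorted xattrs, no compression) makes the image a pure function of the input tree — can be sketched as follows. This illustrates the property with a made-up JSON serialization, not the libcomposefs format:

```python
import hashlib
import json

def image_digest(tree):
    """Serialize a file tree in one fully normalized form: paths sorted,
    xattrs sorted, content referenced by its sha256, no compression.
    The digest then depends only on the tree's contents, never on the
    order in which the build tool happened to enumerate files."""
    normalized = []
    for path in sorted(tree):
        mode, xattrs, content = tree[path]
        normalized.append({
            "path": path,
            "mode": mode,
            "xattrs": sorted(xattrs.items()),
            "sha256": hashlib.sha256(content).hexdigest(),
        })
    blob = json.dumps(normalized, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

# Two "builds" enumerating the same tree in different orders:
build1 = {"file_b": (0o644, {"user.b": "2", "user.a": "1"}, b"content_b\n"),
          "file_a": (0o644, {}, b"content_a\n")}
build2 = dict(reversed(list(build1.items())))  # same tree, other order
```

Both builds produce the same digest; avoiding compression sidesteps the libz-version problem entirely, at the cost of larger images.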
On 2023/1/16 23:27, Alexander Larsson wrote:
> On Mon, 2023-01-16 at 21:26 +0800, Gao Xiang wrote:

I will stop talking about the overlay permission model now, since there are more experienced folks working on this, although the SUID stuff still looks dangerous to me as an end-user: IMHO, it's hard for me to identify the proper sub-sub-subdir UID/GID in "objects" at runtime, and they can be nested much deeper, which is different from a local fs with loopback devices or overlayfs. I don't know what improper sub-sub-subdir UID/GID in "objects" could cause. It seems that ostree currently uses "root" all the time for such "objects" subdirs, but I don't know.

>>>>> Instead what we have done with composefs is to make filesystem image generation from the ostree repository 100% reproducible. Then we can
>>>>
>>>> EROFS is all 100% reproducible as well.
>>>
>>> Really, so if I today, on fedora 36 run:
>>> # tar xvf oci-image.tar
>>> # mkfs.erofs oci-dir/ oci.erofs
>>>
>>> And then in 5 years, if someone on debian 13 runs the same, with the same tar file, then both oci.erofs files will have the same sha256 checksum?
>>
>> Why wouldn't they? Reproducible builds are a MUST for Android use cases as well.
>
> Those are not quite the same requirements. A reproducible build in the traditional sense is limited to a particular build configuration. You define a set of tools for the build, and use the same ones for each build, and get a fixed output. You don't expect to be able to change e.g. the compiler and get the same result. Similarly, it is often the case that different builds or versions of compression libraries give different results, so you can't expect to use e.g. a different libz and get identical images.
>
>> Yes, it may break between versions by mistake, but I think reproducible builds are basic functionality for all image use cases.
>>> How do you handle things like different versions or builds of compression libraries creating different results? Do you guarantee to not add any new backwards compat changes by default, or change any default options? Do you guarantee that the files are read from "oci-dir" in the same order each time? It doesn't look like it.
>>
>> If you argue like that, why wouldn't mkcomposefs have the same issue, in that it may be broken by some bug?
>
> libcomposefs defines a normalized form for everything like file order, xattr orders, etc, and carefully normalizes everything such that we can guarantee these properties. It is possible that some detail was missed, because we're humans. But it was a very conscious and deliberate design choice that is deeply encoded in the code and format. For example, this is why we don't use compression but try to minimize size in other ways.

EROFS is reproducible since its dirents are all sorted by its on-disk definition, and its xattrs are sorted as well if images need to be reproducible. I don't see the difference between these two kinds of reproducible builds.

EROFS is designed for golden images: if you pass in the same set of configuration options to mkfs.erofs, it should produce the same output; otherwise those are real bugs and need to be fixed.

Compression algorithms can generate different outputs between versions, and generally compressed data is stable for most compression algorithms within a specific version, but that is another story. EROFS can live without compression.

>>>> But really, personally I think the issue above is different from loopback devices and may need to be resolved first. And if possible, I hope it could be a new overlayfs feature for everyone.
>>>
>>> Yeah.
>>> Independent of composefs, I think EROFS would be better if you could just point it to a chunk directory at mount time rather than having to route everything through a system-wide global cachefs singleton. I understand that cachefs does help with the on-demand download aspect, but when you don't need that it is just in the way.
>>
>> Just check your reply to Dave's review; it seems that the composefs dir on-disk format works much the same as EROFS's as well, see:
>>
>> https://docs.kernel.org/filesystems/erofs.html -- Directories
>>
>> a block vs a chunk = dirent + names
>>
>> cfs_dir_lookup -> erofs_namei + find_target_block_classic;
>> cfs_dir_lookup_in_chunk -> find_target_dirent.
>
> Yeah, the dirent layout looks very similar. I guess great minds think alike! My approach was simpler initially, but it kinda converged on this when I started optimizing the kernel lookup code with binary search.
>
>> Yes, great projects can be very similar to each other occasionally, not to mention open source projects ;)
>>
>> Anyway, I'm not opposed to Composefs if folks really like a new read-only filesystem for this. That is almost all I'd like to say about Composefs formally, have fun!

Because, anyway, open source projects can also fork, so (maybe) such is life. It seems rather like another, incomplete EROFS from several points of view. Also see:
https://lore.kernel.org/all/1b192a85-e1da-0925-ef26-178b93d0aa45@plexistor.com/T/#u

I will go on making EROFS better, as I initially promised the community.

Thanks,
Gao Xiang

>> Thanks,
>> Gao Xiang
>
> Cool, thanks for the feedback.
> It seems rather another an incomplete EROFS from several points > of view. Also see: > https://lore.kernel.org/all/1b192a85-e1da-0925-ef26-178b93d0aa45@plexistor.com/T/#u > Ironically, ZUFS is one of two new filesystems that were discussed in LSFMM19, where the community reactions rhyme with the reactions to composefs. The discussion on Incremental FS resembles the composefs case even more [1]. AFAIK, Android is still maintaining Incremental FS out-of-tree. Alexander and Giuseppe, I'd like to join Gao in saying that I think it is in the best interest of everyone, composefs developers and prospective users included, if the composefs requirements would drive improvements to existing kernel subsystems rather than adding a custom filesystem driver that partly duplicates other subsystems. Especially so, when the modifications to existing components (erofs and overlayfs) appear to be relatively minor and the maintainer of erofs is receptive to new features and happy to collaborate with you. w.r.t. overlayfs, I am not even sure that anything needs to be modified in the driver. overlayfs already supports the "metacopy" feature, which means that an upper layer could be composed in a way that the file content would be read from an arbitrary path in the lower fs, e.g. objects/cc/XXX. I gave a talk at LPC a few years back about overlayfs and container images [2]. The emphasis was that the overlayfs driver supports many new features, but userland tools for building advanced overlayfs images based on those new features are nowhere to be found. I may be wrong, but it looks to me like composefs could potentially fill this void, without having to modify the overlayfs driver at all, or maybe just a little bit. Please start a discussion with overlayfs developers about missing driver features if you have any. Overall, this sounds like a fun discussion to have at LSFMMBPF23 [3], so you are most welcome to submit a topic proposal for "opportunistically sharing verified image filesystem". Thanks, Amir. 
[1] https://lore.kernel.org/linux-fsdevel/CAK8JDrGRzA+yphpuX+GQ0syRwF_p2Fora+roGCnYqB5E1eOmXA@mail.gmail.com/ [2] https://lpc.events/event/7/contributions/639/attachments/501/969/Overlayfs-containers-lpc-2020.pdf [3] https://lore.kernel.org/linux-fsdevel/Y7hDVliKq+PzY1yY@localhost.localdomain/
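The metacopy behaviour described above — an upper layer carrying only metadata, with file content redirected to an arbitrary lower path such as objects/cc/XXX — can be modelled in a few lines. This is a toy sketch of the concept only; the dict-based "layers" and field names are illustrative assumptions, not the overlayfs on-disk representation:

```python
# Toy model of overlayfs metacopy: the upper layer stores only file
# metadata plus a redirect to where the data really lives in the lower
# layer (e.g. a shared content-addressed objects/ directory).
upper = {
    "/file_a": {"mode": 0o644, "redirect": "objects/cc/3da5b149"},
}
lower = {
    "objects/cc/3da5b149": b"content_a",
}

def read(path):
    meta = upper[path]
    # metadata (mode, ownership, xattrs) comes from the upper layer;
    # file data comes from the redirect target in the lower layer
    return meta["mode"], lower[meta["redirect"]]

print(read("/file_a"))
```

Because many upper entries can redirect to the same lower object, data is shared across images while each image keeps its own metadata, which is the property composefs also aims for.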
On Tue, Jan 17, 2023 at 09:05:53AM +0200, Amir Goldstein wrote: > > It seems rather another an incomplete EROFS from several points > > of view. Also see: > > https://lore.kernel.org/all/1b192a85-e1da-0925-ef26-178b93d0aa45@plexistor.com/T/#u > > > > Ironically, ZUFS is one of two new filesystems that were discussed in LSFMM19, > where the community reactions rhyme with the reactions to composefs. > The discussion on Incremental FS resembles composefs case even more [1]. > AFAIK, Android is still maintaining Incremental FS out-of-tree. > > Alexander and Giuseppe, > > I'd like to join Gao is saying that I think it is in the best interest > of everyone, > composefs developers and prospect users included, > if the composefs requirements would drive improvement to existing > kernel subsystems rather than adding a custom filesystem driver > that partly duplicates other subsystems. > > Especially so, when the modifications to existing components > (erofs and overlayfs) appear to be relatively minor and the maintainer > of erofs is receptive to new features and happy to collaborate with you. > > w.r.t overlayfs, I am not even sure that anything needs to be modified > in the driver. > overlayfs already supports "metacopy" feature which means that an upper layer > could be composed in a way that the file content would be read from an arbitrary > path in lower fs, e.g. objects/cc/XXX. > > I gave a talk on LPC a few years back about overlayfs and container images [2]. > The emphasis was that overlayfs driver supports many new features, but userland > tools for building advanced overlayfs images based on those new features are > nowhere to be found. > > I may be wrong, but it looks to me like composefs could potentially > fill this void, > without having to modify the overlayfs driver at all, or maybe just a > little bit. > Please start a discussion with overlayfs developers about missing driver > features if you have any. 
Surprising that I and others weren't Cced on this given that we had a meeting with the main developers and a few others where we had said the same thing. I hadn't followed this. We have at least 58 filesystems currently in the kernel (and that's a conservative count just based on going by obvious directories and ignoring most virtual filesystems). A non-insignificant portion is probably slowly rotting away with few fixes coming in, with few users, and not much attention being paid to syzkaller reports for them if they show up. I haven't quantified this of course. Taking a new filesystem into the kernel in the worst case means that it's dumped there once and will slowly become unmaintained. Then we'll have a few users for the next 20 years and we can't reasonably deprecate it (Maybe that's another good topic: how should we fade out filesystems?). Of course, for most fs developers it probably doesn't matter how many other filesystems there are in the kernel (aside from maybe competing for the same users). But for developers who touch the vfs, every new filesystem may increase the cost of maintaining and reworking existing functionality, or adding new functionality, making it more likely to accumulate hacks, add workarounds, or be flat-out unable to kill off infrastructure that should reasonably go away. Maybe this is an unfair complaint, but just from experience a new filesystem potentially means one or two weeks more to make a larger vfs change. I want to stress that I'm not at all saying "no more new fs", but we should be hesitant before we merge new filesystems into the kernel. Especially filesystems that are tailored to special use-cases. Every few years another filesystem tailored to container use-cases shows up. And frankly, a good portion of the issues that they are trying to solve are caused by design choices in userspace. 
And I have to say I'm especially NAK-friendly about anything that comes even close to yet another stacking filesystem or anything that layers on top of a lower filesystem/mount, such as ecryptfs, ksmbd, and overlayfs. They are hard to get right, with lots of corner cases, and they cause the most headaches when making vfs changes.
Hi Amir and Christian, On 2023/1/17 18:12, Christian Brauner wrote: > On Tue, Jan 17, 2023 at 09:05:53AM +0200, Amir Goldstein wrote: >>> It seems rather another an incomplete EROFS from several points >>> of view. Also see: >>> https://lore.kernel.org/all/1b192a85-e1da-0925-ef26-178b93d0aa45@plexistor.com/T/#u >>> >> >> Ironically, ZUFS is one of two new filesystems that were discussed in LSFMM19, >> where the community reactions rhyme with the reactions to composefs. >> The discussion on Incremental FS resembles composefs case even more [1]. >> AFAIK, Android is still maintaining Incremental FS out-of-tree. >> >> Alexander and Giuseppe, >> >> I'd like to join Gao is saying that I think it is in the best interest >> of everyone, >> composefs developers and prospect users included, >> if the composefs requirements would drive improvement to existing >> kernel subsystems rather than adding a custom filesystem driver >> that partly duplicates other subsystems. >> >> Especially so, when the modifications to existing components >> (erofs and overlayfs) appear to be relatively minor and the maintainer >> of erofs is receptive to new features and happy to collaborate with you. >> >> w.r.t overlayfs, I am not even sure that anything needs to be modified >> in the driver. >> overlayfs already supports "metacopy" feature which means that an upper layer >> could be composed in a way that the file content would be read from an arbitrary >> path in lower fs, e.g. objects/cc/XXX. >> >> I gave a talk on LPC a few years back about overlayfs and container images [2]. >> The emphasis was that overlayfs driver supports many new features, but userland >> tools for building advanced overlayfs images based on those new features are >> nowhere to be found. >> >> I may be wrong, but it looks to me like composefs could potentially >> fill this void, >> without having to modify the overlayfs driver at all, or maybe just a >> little bit. 
>> Please start a discussion with overlayfs developers about missing driver >> features if you have any. > ... > > I want to stress that I'm not at all saying "no more new fs" but we > should be hesitant before we merge new filesystems into the kernel. > > Especially for filesystems that are tailored to special use-cases. > Every few years another filesystem tailored to container use-cases shows > up. And frankly, a good portion of the issues that they are trying to > solve are caused by design choices in userspace. > > And I have to say I'm especially NAK-friendly about anything that comes > even close to yet another stacking filesystems or anything that layers > on top of a lower filesystem/mount such as ecryptfs, ksmbd, and > overlayfs. They are hard to get right, with lots of corner cases and > they cause the most headaches when making vfs changes. That is also my original (small) request, if such an overlay model is correct... In principle, it's not hard for EROFS, since EROFS already has a symlink on-disk layout; the difference is just applying it to all regular files (even without on-disk changes, though maybe we need to optimize it if there are other special requirements for specific use cases like ostree), which would make EROFS work in a stackable way... That is honestly not hard (and on-disk compatible)... But I'm not sure whether it's wise for EROFS to support this now without a proper overlay model settled by careful discussion. So if there could be some discussion of this overlay model at LSF/MM/BPF, I'd like to attend (thanks!) And I support doing it in overlayfs (if possible), but it seems EROFS could do it as well, as long as there are enough constraints to conclude on... Thanks, Gao Xiang >
Christian Brauner <brauner@kernel.org> writes: > On Tue, Jan 17, 2023 at 09:05:53AM +0200, Amir Goldstein wrote: >> > It seems rather another an incomplete EROFS from several points >> > of view. Also see: >> > https://lore.kernel.org/all/1b192a85-e1da-0925-ef26-178b93d0aa45@plexistor.com/T/#u >> > >> >> Ironically, ZUFS is one of two new filesystems that were discussed in LSFMM19, >> where the community reactions rhyme with the reactions to composefs. >> The discussion on Incremental FS resembles composefs case even more [1]. >> AFAIK, Android is still maintaining Incremental FS out-of-tree. >> >> Alexander and Giuseppe, >> >> I'd like to join Gao is saying that I think it is in the best interest >> of everyone, >> composefs developers and prospect users included, >> if the composefs requirements would drive improvement to existing >> kernel subsystems rather than adding a custom filesystem driver >> that partly duplicates other subsystems. >> >> Especially so, when the modifications to existing components >> (erofs and overlayfs) appear to be relatively minor and the maintainer >> of erofs is receptive to new features and happy to collaborate with you. >> >> w.r.t overlayfs, I am not even sure that anything needs to be modified >> in the driver. >> overlayfs already supports "metacopy" feature which means that an upper layer >> could be composed in a way that the file content would be read from an arbitrary >> path in lower fs, e.g. objects/cc/XXX. >> >> I gave a talk on LPC a few years back about overlayfs and container images [2]. >> The emphasis was that overlayfs driver supports many new features, but userland >> tools for building advanced overlayfs images based on those new features are >> nowhere to be found. >> >> I may be wrong, but it looks to me like composefs could potentially >> fill this void, >> without having to modify the overlayfs driver at all, or maybe just a >> little bit. 
>> Please start a discussion with overlayfs developers about missing driver >> features if you have any. > > Surprising that I and others weren't Cced on this given that we had a > meeting with the main developers and a few others where we had said the > same thing. I hadn't followed this. well that wasn't done on purpose, sorry for that. After our meeting, I thought it was clear that we have different needs for our use cases and that we were going to submit composefs upstream, as we did, to gather some feedback from the wider community. Of course we looked at overlay before we decided to upstream composefs. Some of the use cases we have in mind are not easily doable, some others are not possible at all. metacopy is a good starting point, but from user space it works quite differently than what we can do with composefs. Let's assume we have a git-like repository with a bunch of files stored by their checksum and that they can be shared among different containers. Using the overlayfs model: 1) We need to create the final image layout, either using reflinks or hardlinks: - reflinks: we can reflect a correct st_nlink value for the inode but we lose page cache sharing. - hardlinks: they make st_nlink bogus. Another problem is that overlay expects the lower layer to never change, and now st_nlink can change for files in other lower layers. These operations have a cost. Even if all the files are already available locally, we still need at least one operation per file to create it, and more than one if we start tweaking the inode metadata. 2) no multi repo support: Both reflinks and hardlinks do not work across mount points, so we cannot have images that span multiple file systems; one common use case is to have a network file system to share some images/files and be able to use files from there when they are available. 
At the moment we deduplicate entire layers, and with overlay we can do something like the following without problems: # mount overlay -t overlay -olowerdir=/first/disk/layer1:/second/disk/layer2 but this won't work with the file granularity we are looking at. So in this case we need to do a full copy of the files that are not on the same file system. 3) no support for fs-verity. I have no idea how overlay could ever support it; it doesn't fit there. If we want this feature we need to look at another RO file system. We looked at EROFS since it is already upstream but it is quite different than what we are doing as Alex already pointed out. Sure we could bloat EROFS and add all the new features there, after all composefs is quite simple, but I don't see how this is any cleaner than having a simple file system that does just one thing. On top of what was already said: I wish at some point we can do all of this from a user namespace. That is the main reason for having an easy on-disk format for composefs. This seems much more difficult to achieve with EROFS given its complexity. > We have at least 58 filesystems currently in the kernel (and that's a > conservative count just based on going by obvious directories and > ignoring most virtual filesystems). > > A non-insignificant portion is probably slowly rotting away with little > fixes coming in, with few users, and not much attention is being paid to > syzkaller reports for them if they show up. I haven't quantified this of > course. > > Taking in a new filesystems into kernel in the worst case means that > it's being dumped there once and will slowly become unmaintained. Then > we'll have a few users for the next 20 years and we can't reasonably > deprecate it (Maybe that's another good topic: How should we fade out > filesystems.). > > Of course, for most fs developers it probably doesn't matter how many > other filesystems there are in the kernel (aside from maybe competing > for the same users). 
> > But for developers who touch the vfs every new filesystems may increase > the cost of maintaining and reworking existing functionality, or adding > new functionality. Making it more likely to accumulate hacks, adding > workarounds, or flatout being unable to kill off infrastructure that > should reasonably go away. Maybe this is an unfair complaint but just > from experience a new filesystem potentially means one or two weeks to > make a larger vfs change. > > I want to stress that I'm not at all saying "no more new fs" but we > should be hesitant before we merge new filesystems into the kernel. > > Especially for filesystems that are tailored to special use-cases. > Every few years another filesystem tailored to container use-cases shows > up. And frankly, a good portion of the issues that they are trying to > solve are caused by design choices in userspace. Having a way to deprecate file systems seems like a good idea in general, and IMHO makes more sense than blocking new components that can be useful to some users. We are aware the bar for a new file system is high, and we were expecting criticism and push back, but so far it doesn't seem there is another way to achieve what we are trying to do. > And I have to say I'm especially NAK-friendly about anything that comes > even close to yet another stacking filesystems or anything that layers > on top of a lower filesystem/mount such as ecryptfs, ksmbd, and > overlayfs. They are hard to get right, with lots of corner cases and > they cause the most headaches when making vfs changes.
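The per-file cost argument above follows from the model itself: with a git-like store keyed by checksum, identical content is stored once, but materializing an image view still takes at least one link/reflink operation per manifest entry. A minimal sketch of such a content-addressed store (the class and method names are hypothetical, purely for illustration):

```python
import hashlib

class ObjectStore:
    """Toy content-addressed store: objects keyed by content digest,
    so identical content across images is stored exactly once."""
    def __init__(self):
        self.objects = {}

    def add(self, data):
        digest = hashlib.sha256(data).hexdigest()
        self.objects[digest] = data  # idempotent if already present
        return digest

store = ObjectStore()
a = store.add(b"content_a")
b = store.add(b"content_a")  # same content -> same object, no new storage
assert a == b and len(store.objects) == 1

# Materializing an image view still costs one operation per file:
manifest = {"/file_a": a, "/file_b": store.add(b"content_b")}
ops = len(manifest)  # one link/reflink/copy per entry, however cheap each is
print(ops)  # -> 2
```

The storage dedup is free, but setup and teardown of the per-image view scale with file count, which is the overhead composefs tries to avoid by mounting a manifest directly.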
On 2023/1/17 21:56, Giuseppe Scrivano wrote: > Christian Brauner <brauner@kernel.org> writes: > ... > > We looked at EROFS since it is already upstream but it is quite > different than what we are doing as Alex already pointed out. > Sigh.. please kindly help me find out what the difference is if EROFS uses some symlink layout for each regular inode? Some question for me to ask about this new overlay permission model once again: What's the difference between a symlink (maybe with some limitations) and this new overlay model? I'm not sure why symlink permission bits are ignored (AFAIK)? I haven't thought it through much further, since I'm not an experienced one in the unionfs field, but if possible I'm quite happy to learn new stuff as a newbie filesystem developer to gain more knowledge if it could be some topic at LSF/MM/BPF 2023. > Sure we could bloat EROFS and add all the new features there, after all > composefs is quite simple, but I don't see how this is any cleaner than > having a simple file system that does just one thing. Also, if I have time, I could do a code-truncated EROFS without any useless features specifically for ostree use cases. Or I could just separate out all of the code useless for ostree-specific use cases by using Kconfig. If you don't want to use EROFS for whatever reason, I'm not opposed to it (you also could use another in-kernel local filesystem for this as well). Except for this new overlay model, I just tried to say how similarly it works to EROFS. > > On top of what was already said: I wish at some point we can do all of > this from a user namespace. That is the main reason for having an easy > on-disk format for composefs. This seems much more difficult to achieve > with EROFS given its complexity. Why? [ Gao Xiang: this time I will try my best to stop talking about EROFS under the Composefs patchset, because I'd like to avoid appearing here in the first place (unless such a permission model is never discussed until now)... 
In any case, the cover letter never mentioned EROFS at all. ] Thanks, Gao Xiang
On Tue, Jan 17, 2023 at 02:56:56PM +0100, Giuseppe Scrivano wrote: > Christian Brauner <brauner@kernel.org> writes: > > > On Tue, Jan 17, 2023 at 09:05:53AM +0200, Amir Goldstein wrote: > >> > It seems rather another an incomplete EROFS from several points > >> > of view. Also see: > >> > https://lore.kernel.org/all/1b192a85-e1da-0925-ef26-178b93d0aa45@plexistor.com/T/#u > >> > > >> > >> Ironically, ZUFS is one of two new filesystems that were discussed in LSFMM19, > >> where the community reactions rhyme with the reactions to composefs. > >> The discussion on Incremental FS resembles composefs case even more [1]. > >> AFAIK, Android is still maintaining Incremental FS out-of-tree. > >> > >> Alexander and Giuseppe, > >> > >> I'd like to join Gao is saying that I think it is in the best interest > >> of everyone, > >> composefs developers and prospect users included, > >> if the composefs requirements would drive improvement to existing > >> kernel subsystems rather than adding a custom filesystem driver > >> that partly duplicates other subsystems. > >> > >> Especially so, when the modifications to existing components > >> (erofs and overlayfs) appear to be relatively minor and the maintainer > >> of erofs is receptive to new features and happy to collaborate with you. > >> > >> w.r.t overlayfs, I am not even sure that anything needs to be modified > >> in the driver. > >> overlayfs already supports "metacopy" feature which means that an upper layer > >> could be composed in a way that the file content would be read from an arbitrary > >> path in lower fs, e.g. objects/cc/XXX. > >> > >> I gave a talk on LPC a few years back about overlayfs and container images [2]. > >> The emphasis was that overlayfs driver supports many new features, but userland > >> tools for building advanced overlayfs images based on those new features are > >> nowhere to be found. 
> >> > >> I may be wrong, but it looks to me like composefs could potentially > >> fill this void, > >> without having to modify the overlayfs driver at all, or maybe just a > >> little bit. > >> Please start a discussion with overlayfs developers about missing driver > >> features if you have any. > > > > Surprising that I and others weren't Cced on this given that we had a > > meeting with the main developers and a few others where we had said the > > same thing. I hadn't followed this. > > well that wasn't done on purpose, sorry for that. I understand. I was just surprised given that I very much work on the vfs on a day to day basis. > > After our meeting, I thought it was clear that we have different needs > for our use cases and that we were going to submit composefs upstream, > as we did, to gather some feedbacks from the wider community. > > Of course we looked at overlay before we decided to upstream composefs. > > Some of the use cases we have in mind are not easily doable, some others > are not possible at all. metacopy is a good starting point, but from > user space it works quite differently than what we can do with > composefs. > > Let's assume we have a git like repository with a bunch of files stored > by their checksum and that they can be shared among different containers. > > Using the overlayfs model: > > 1) We need to create the final image layout, either using reflinks or > hardlinks: > > - reflinks: we can reflect a correct st_nlink value for the inode but we > lose page cache sharing. > > - hardlinks: make the st_nlink bogus. Another problem is that overlay > expects the lower layer to never change and now st_nlink can change > for files in other lower layers. > > These operations have a cost. Even if we all the files are already > available locally, we still need at least one operation per file to > create it, and more than one if we start tweaking the inode metadata. 
Which you now encode in a manifest file that changes properties on a per-file basis without any vfs involvement, which makes me pretty uneasy. If you combine overlayfs with idmapped mounts you can already change ownership on a fairly granular basis. If you need additional per-file ownership, use overlayfs, which gives you the ability to change file attributes on a per-file, per-container basis. > > 2) no multi repo support: > > Both reflinks and hardlinks do not work across mount points, so we Just fwiw, afaict reflinks work across mount points since at least 5.18. > cannot have images that span multiple file systems; one common use case > is to have a network file system to share some images/files and be able > to use files from there when they are available. > > At the moment we deduplicate entire layers, and with overlay we can do > something like the following without problems: > > # mount overlay -t overlay -olowerdir=/first/disk/layer1:/second/disk/layer2 > > but this won't work with the files granularity we are looking at. So in > this case we need to do a full copy of the files that are not on the > same file system. > > 3) no support for fs-verity. No idea how overlay could ever support it, > it doesn't fit there. If we want this feature we need to look at > another RO file system. > > We looked at EROFS since it is already upstream but it is quite > different than what we are doing as Alex already pointed out. > > Sure we could bloat EROFS and add all the new features there, after all > composefs is quite simple, but I don't see how this is any cleaner than > having a simple file system that does just one thing. > > On top of what was already said: I wish at some point we can do all of > this from a user namespace. That is the main reason for having an easy > on-disk format for composefs. This seems much more difficult to achieve I'm pretty skeptical of this plan; I'm not sure we should add more filesystems that are mountable by unprivileged users. 
FUSE and Overlayfs are adventurous enough and they don't have their own on-disk format. The track record of bugs exploitable due to userns isn't making this very attractive.
On Tue, Jan 17, 2023 at 04:27:56PM +0100, Christian Brauner wrote: > On Tue, Jan 17, 2023 at 02:56:56PM +0100, Giuseppe Scrivano wrote: > > Christian Brauner <brauner@kernel.org> writes: > > 2) no multi repo support: > > > > Both reflinks and hardlinks do not work across mount points, so we > > Just fwiw, afaict reflinks work across mount points since at least 5.18. They might work for NFS server *file clones* across different exports within the same NFS server (or server cluster), but they most certainly don't work across mountpoints for local filesystems, or across different types of filesystems. I'm not here to advocate that composefs is the right solution, I'm just pointing out that the proposed alternatives do not, in any way, have the same critical behavioural characteristics as composefs provides container orchestration systems and hence do not solve the problems that composefs is attempting to solve. In short: any solution that requires userspace to create a new filesystem hierarchy one file at a time via standard syscall mechanisms is not going to perform acceptably at scale - that's a major problem that composefs addresses. The whole problem with file copying to create images - even with reflink or hardlinks avoiding data copying - is the overhead of creating and destroying those copies in the first place. A reflink copy of tens of thousands of files in a complex directory structure is not free - each individual reflink has a time, CPU, memory and IO cost to it. The teardown cost is similar - the only way to remove the "container image" built with reflinks is "rm -rf", and that has significant time, CPU, memory and IO costs associated with it as well. Further, you can't ship container images to remote hosts using reflink copies - they can only be created at runtime on the host that the container will be instantiated on. IOWs, the entire cost of reflink copies for container instances must be taken at container instantiation and destruction time. 
When you have container instances that might only be needed for a few seconds, taking half a minute to set up the container instance and then another half a minute to tear it down just isn't viable - we need instantiation and teardown times in the order of a second or two. From my reading of the code, composefs is based around the concept of a verifiable "shipping manifest", where the filesystem namespace presented to users by the kernel is derived from the manifest rather than from some other filesystem namespace. Overlay, reflinks, etc. all use some other filesystem namespace to generate the container namespace that links to the common data, whilst composefs uses the manifest for that. The use of a manifest file means there is almost zero container setup overhead - ship the manifest file, mount it, all done - and zero teardown overhead as unmounting the filesystem is all that is needed to remove all traces of the container instance from the system. In having a custom manifest format, the manifest can easily contain verification information alongside the pointer to the content the namespace should expose. i.e. the manifest references a secure content addressed repository that is protected by fsverity and contains the fsverity digests itself. Hence it doesn't rely on the repository to self-verify, it actually ensures that the repository files actually contain the data the manifest expects them to contain. Hence if the composefs kernel module is provided with a mechanism for validating the chain of trust for the manifest file that a user is trying to mount, then we just don't care who the mounting user is. This architecture is a viable path to rootless mounting of pre-built third party container images. Also, with the host's content addressed repository being managed separately by the trusted host and distro package management, the manifest need not be unique to a single container host. 
The distro can build manifests so that containers are running known, signed and verified container images built by the distro. The container orchestration software or admin could also build manifests on demand and sign them. If the manifest is not signed, not signed with a key loaded into the kernel keyring, or does not pass verification, then we simply fall back to root-in-the-init-ns permissions being required to mount the manifest. This fallback is exactly the same security model we have for every other type of filesystem image that the linux kernel can mount - we trust root not to be mounting malicious images. Essentially, I don't think any of the filesystems in the linux kernel currently provide a viable solution to the problem that composefs is trying to solve. We need a different way of solving the ephemeral container namespace creation and destruction overhead problem. Composefs provides a mechanism that not only solves this problem and potentially several others, whilst also being easy to retrofit into existing production container stacks. As such, I think composefs is definitely worth further time and investment as a unique line of filesystem development for Linux. Solve the chain of trust problem (i.e. crypto signing for the manifest files) and we potentially have game changing container infrastructure in a couple of thousand lines of code... Cheers, Dave.
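The "verifiable shipping manifest" sketched above — a manifest that records, for each path, both the backing object and the digest its content must have, so the repository cannot merely self-certify — can be modelled roughly as follows (the digest scheme and field layout are illustrative assumptions, not the composefs on-disk format):

```python
import hashlib

# Toy manifest: path -> (backing object name, expected content digest).
# The manifest itself would be the thing that gets signed and verified.
manifest = {
    "/file_a": ("objects/cc/3da5b149",
                hashlib.sha256(b"content_a").hexdigest()),
}
repository = {"objects/cc/3da5b149": b"content_a"}

def verified_read(path):
    obj, expected = manifest[path]
    data = repository[obj]
    # check that the backing file actually holds what the manifest
    # claims, analogous to checking an fs-verity digest on read
    if hashlib.sha256(data).hexdigest() != expected:
        raise ValueError("verification failed for %s" % path)
    return data

print(verified_read("/file_a"))
```

If the repository object is tampered with, the digest comparison fails and the read is refused, which is why the manifest does not need to trust the repository to verify itself.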
On 2023/1/18 08:22, Dave Chinner wrote: > On Tue, Jan 17, 2023 at 04:27:56PM +0100, Christian Brauner wrote: >> On Tue, Jan 17, 2023 at 02:56:56PM +0100, Giuseppe Scrivano wrote: >>> Christian Brauner <brauner@kernel.org> writes: >>> 2) no multi repo support: >>> >>> Both reflinks and hardlinks do not work across mount points, so we >> >> Just fwiw, afaict reflinks work across mount points since at least 5.18. > ... > > As such, I think composefs is definitely worth further time and > investment as a unique line of filesystem development for Linux. > Solve the chain of trust problem (i.e. crypto signing for the > manifest files) and we potentially have game changing container > infrastructure in a couple of thousand lines of code... I think this is the last time I write some words in this v2 patchset. At a quick glance of the current v2 patchset: 1) struct cfs_buf { -> struct erofs_buf; 2) cfs_buf_put -> erofs_put_metabuf; 3) cfs_get_buf -> erofs_bread -> (but erofs_read_metabuf() in v5.17 is much closer); https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/fs/erofs/data.c?h=linux-5.17.y 4) cfs_dentry_s -> erofs_dirent; ... Also it drops EROFS __lexx and uses buggy uxx instead. It drops the iomap/fscache interface in favor of a stackable file interface, it doesn't have ACL support, and I don't have time to look into more. That is my current point of view of Composefs. Yes, you could use/fork any code in open-source projects, but it currently seems like an immature EROFS-truncated copy and its cover letter never mentioned EROFS at all. I'd suggest you guys refactor the similar code (if you claim that is not another EROFS) before it really gets upstreamed, otherwise I would feel uneasy as well. Apart from that, again I have no objection if folks would like a new read-only stackable filesystem like this. 
Apart from the codebase, I do hope there could be some discussion of this topic at LSF/MM/BPF 2023, as Amir suggested, because I don't think this overlay model is really safe without fs-verity enforcement.

Thanks all for your time. I'm done.

Thanks,
Gao Xiang

>
> Cheers,
>
> Dave.
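To make the safety concern concrete: in a content-addressed store, each backing file is named by the digest of its content, so a digest check at read time catches a backing object that was modified after the image was built. The following is a toy Python sketch of that idea, not how fs-verity actually works (fs-verity verifies Merkle-tree blocks incrementally rather than hashing the whole file on every read), and the dict-based "object store" is a stand-in for the objects/ directory:

```python
import hashlib

# Toy model: backing objects keyed by the SHA-256 hex digest of their
# content, mimicking an objects/xx/yyyy... store. Verifying the digest
# on read detects offline tampering with a backing object; skipping the
# check means tampered data is served silently.

def read_verified(objects, digest):
    data = objects[digest]  # stand-in for opening objects/xx/yyyy...
    if hashlib.sha256(data).hexdigest() != digest:
        raise IOError("backing object corrupted or tampered with")
    return data

content = b"content_a\n"
digest = hashlib.sha256(content).hexdigest()
objects = {digest: content}

assert read_verified(objects, digest) == content  # clean read passes

objects[digest] = b"evil\n"  # simulate offline tampering
try:
    read_verified(objects, digest)
except IOError:
    pass  # without this check, the tampered data would be returned
```

This is why the overlay-of-content-addressed-objects model is only as trustworthy as the read-time verification backing it.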
On Tue, 2023-01-17 at 11:12 +0100, Christian Brauner wrote:
> On Tue, Jan 17, 2023 at 09:05:53AM +0200, Amir Goldstein wrote:
> > > It seems rather another incomplete EROFS from several points
> > > of view. Also see:
> > > https://lore.kernel.org/all/1b192a85-e1da-0925-ef26-178b93d0aa45@plexistor.com/T/#u
> >
> > Ironically, ZUFS is one of two new filesystems that were discussed
> > at LSFMM19, where the community reactions rhyme with the reactions
> > to composefs. The discussion on Incremental FS resembles the
> > composefs case even more [1]. AFAIK, Android is still maintaining
> > Incremental FS out-of-tree.
> >
> > Alexander and Giuseppe,
> >
> > I'd like to join Gao in saying that I think it is in the best
> > interest of everyone, composefs developers and prospective users
> > included, if the composefs requirements would drive improvements to
> > existing kernel subsystems rather than adding a custom filesystem
> > driver that partly duplicates other subsystems.
> >
> > Especially so, when the modifications to existing components
> > (erofs and overlayfs) appear to be relatively minor and the
> > maintainer of erofs is receptive to new features and happy to
> > collaborate with you.
> >
> > w.r.t. overlayfs, I am not even sure that anything needs to be
> > modified in the driver. overlayfs already supports the "metacopy"
> > feature, which means that an upper layer could be composed in a way
> > that the file content would be read from an arbitrary path in the
> > lower fs, e.g. objects/cc/XXX.
> >
> > I gave a talk at LPC a few years back about overlayfs and container
> > images [2]. The emphasis was that the overlayfs driver supports
> > many new features, but userland tools for building advanced
> > overlayfs images based on those new features are nowhere to be
> > found.
> >
> > I may be wrong, but it looks to me like composefs could potentially
> > fill this void, without having to modify the overlayfs driver at
> > all, or maybe just a little bit. Please start a discussion with
> > overlayfs developers about missing driver features if you have any.
>
> Surprising that I and others weren't Cced on this given that we had a
> meeting with the main developers and a few others where we had said
> the same thing.

I hadn't followed this. Sorry about that, I'm just not very used to the kernel submission process. I'll CC you on the next version.

> We have at least 58 filesystems currently in the kernel (and that's a
> conservative count just based on going by obvious directories and
> ignoring most virtual filesystems).
>
> A non-insignificant portion is probably slowly rotting away with few
> fixes coming in, with few users, and not much attention being paid to
> syzkaller reports for them if they show up. I haven't quantified
> this, of course.
>
> Taking a new filesystem into the kernel in the worst case means that
> it's dumped there once and slowly becomes unmaintained. Then we'll
> have a few users for the next 20 years and we can't reasonably
> deprecate it (maybe that's another good topic: how should we fade out
> filesystems?).
>
> Of course, for most fs developers it probably doesn't matter how many
> other filesystems there are in the kernel (aside from maybe competing
> for the same users).
>
> But for developers who touch the vfs, every new filesystem may
> increase the cost of maintaining and reworking existing
> functionality, or adding new functionality. That makes it more likely
> to accumulate hacks, add workarounds, or be flat-out unable to kill
> off infrastructure that should reasonably go away. Maybe this is an
> unfair complaint, but just from experience a new filesystem
> potentially means one or two extra weeks for a larger vfs change.
>
> I want to stress that I'm not at all saying "no more new fs" but we
> should be hesitant before we merge new filesystems into the kernel.

Well, it sure reads as "no more new fs" to me. But I understand that there is hesitation towards this. The new version will be even simpler (based on feedback from Dave), weighing in at < 2000 lines. Hopefully this will make it easier to review and maintain, somewhat countering the cost of yet another filesystem.

> Especially for filesystems that are tailored to special use-cases.
> Every few years another filesystem tailored to container use-cases
> shows up. And frankly, a good portion of the issues that they are
> trying to solve are caused by design choices in userspace.

Well, we have at least two use cases, but sure, it is not a general-purpose filesystem.

> And I have to say I'm especially NAK-friendly about anything that
> comes even close to yet another stacking filesystem or anything that
> layers on top of a lower filesystem/mount such as ecryptfs, ksmbd,
> and overlayfs. They are hard to get right, with lots of corner cases,
> and they cause the most headaches when making vfs changes.

I can't disagree here, because I'm not a vfs maintainer, but I will say that composefs is fundamentally much simpler than these examples: first because it is completely read-only, and secondly because it doesn't rely on the lower filesystem for anything but file content (i.e. lower fs metadata or directory structure doesn't affect the upper fs).
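That last split - metadata from the image, content from the lower filesystem - can be illustrated with a toy model. This is a deliberately simplified sketch (dicts standing in for the rootfs.img manifest and the objects/ basedir; `cfs_open` is an invented name, not a real composefs API), showing that directory structure never touches the lower fs and the lower fs is only ever consulted by content digest:

```python
# Toy model of the composefs split described above: the read-only
# manifest owns all names and metadata; the lower fs ("basedir") is
# only used to fetch content by digest. Digests are truncated and
# invented for illustration.

manifest = {
    # path -> digest of the backing object (from rootfs.img)
    "file_a": "cc3da5b1",
    "file_b": "02927862",
}

basedir = {
    # digest -> content (the objects/ directory on the lower fs)
    "cc3da5b1": b"content_a\n",
    "02927862": b"content_b\n",
}

def cfs_open(path):
    digest = manifest[path]  # metadata lookup: image only
    return basedir[digest]   # content lookup: lower fs only

# Listing the directory never consults the lower fs at all:
assert sorted(manifest) == ["file_a", "file_b"]
assert cfs_open("file_a") == b"content_a\n"
```

Because identical content hashes to the same digest, any number of manifests can point into one shared basedir, which is where the on-disk and page-cache sharing between images comes from.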