Message ID: cover.1673623253.git.alexl@redhat.com (mailing list archive)
Series: Composefs: an opportunistically sharing verified image filesystem
Hi Alexander and folks,

On 2023/1/13 23:33, Alexander Larsson wrote:
> Giuseppe Scrivano and I have recently been working on a new project we
> call composefs. This is the first time we propose this publicly and we
> would like some feedback on it.
>
> At its core, composefs is a way to construct and use read-only images
> that are used similarly to how you would use e.g. loop-back mounted
> squashfs images. On top of this composefs has two fundamental
> features. First, it allows sharing of file data (both on disk and in
> page cache) between images, and secondly it has dm-verity-like
> validation on read.
>
> Let me first start with a minimal example of how this can be used,
> before going into the details:
>
> Suppose we have this source for an image:
>
> rootfs/
> ├── dir
> │   └── another_a
> ├── file_a
> └── file_b
>
> We can then use this to generate an image file and a set of
> content-addressed backing files:
>
> # mkcomposefs --digest-store=objects rootfs/ rootfs.img
> # ls -l rootfs.img objects/*/*
> -rw-------. 1 root root   10 Nov 18 13:20 objects/02/927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4
> -rw-------. 1 root root   10 Nov 18 13:20 objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
> -rw-r--r--. 1 root root 4228 Nov 18 13:20 rootfs.img
>
> The rootfs.img file contains all information about directory and file
> metadata plus references to the backing files by name. We can now
> mount this and look at the result:
>
> # mount -t composefs rootfs.img -o basedir=objects /mnt
> # ls /mnt/
> dir  file_a  file_b
> # cat /mnt/file_a
> content_a
>
> When reading this file the kernel is actually reading the backing
> file, in a fashion similar to overlayfs. Since the backing file is
> content-addressed, the objects directory can be shared for multiple
> images, and any files that happen to have the same content are shared.
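For illustration, the content-addressed objects/ layout in the example above can be sketched in a few lines of Python. Note this is a simplified stand-in: mkcomposefs names backing files by their fs-verity digest, while this sketch uses a plain SHA-256 of the contents; only the fan-out layout and the sharing behavior are the point here.

```python
import hashlib
from pathlib import Path

def object_path(store: Path, data: bytes) -> Path:
    """Return the content-addressed path for `data` inside `store`.

    Simplified stand-in: real mkcomposefs uses the fs-verity digest
    (a Merkle tree measurement), not a flat SHA-256 of the contents.
    """
    digest = hashlib.sha256(data).hexdigest()
    # Fan out on the first two hex characters, matching the
    # objects/02/9278... layout shown in the example above.
    return store / digest[:2] / digest[2:]

def store_object(store: Path, data: bytes) -> Path:
    """Write `data` into the store; identical contents land at the
    identical path, which is what makes the sharing opportunistic."""
    path = object_path(store, data)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return path
```

Two files with the same content map to the same backing object, so the store (and the page cache for it) is shared across any number of images.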
> I refer to this as opportunistic sharing, as it is different than the
> more coarse-grained explicit sharing used by e.g. container base
> images.

I'd like to say sorry about my comments in the LWN.net article. If it
helps the community, my own concern about this new overlay model (which
is different from overlayfs, since overlayfs doesn't give the original
files different permissions) was a security issue, as I told Giuseppe
Scrivano when he initially found me on Slack.

As the composefs on-disk format shows:

struct cfs_inode_s {
	...
	u32 st_mode; /* File type and mode. */
	u32 st_nlink; /* Number of hard links, only for regular files. */
	u32 st_uid; /* User ID of owner. */
	u32 st_gid; /* Group ID of owner. */
	...
};

it seems Composefs can override the uid/gid and mode bits of the
original file. Consider a rootfs image:

├── /bin
│   └── su

/bin/su has the SUID bit set in the Composefs inode metadata, but I
didn't find any clue whether the ostree object "objects/abc" could
actually be replaced with the data of /bin/sh if the composefs
fs-verity feature is disabled (it doesn't seem that composefs
unconditionally enforces fs-verity, according to the documentation).

I think that could enable a _privilege escalation attack_ if these SUID
files are replaced with some root shell. Administrators cannot watch
these SUID files all the time, because such files can also be replaced
at runtime.

Composefs may assume that ostree always manages such a content-addressed
directory. But considering it could later become an upstream filesystem,
I think we cannot always tell people "no, don't use it this way, it
doesn't work" if they use Composefs with an untrusted repo (maybe even
without ostree).

That was my own concern at the time when Giuseppe Scrivano asked me to
enhance EROFS in this way, and I requested that he discuss it on the
fsdevel mailing list first in order to resolve it, but that didn't
happen. Otherwise, EROFS could face the same issue, which is why I
think it needs to be discussed first.

> The next step is the validation.
> Note how the object files have fs-verity enabled. In fact, they are
> named by their fs-verity digest:
>
> # fsverity digest objects/*/*
> sha256:02927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4 objects/02/927862b4ab9fb69919187bb78d394e235ce444eeb0a890d37e955827fe4bf4
> sha256:cc3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
>
> The generated filesystem image may contain the expected digest for the
> backing files. When the backing file digest is incorrect, the open
> will fail, and if the open succeeds, any other on-disk file changes
> will be detected by fs-verity:
>
> # cat objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
> content_a
> # rm -f objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
> # echo modified > objects/cc/3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f
> # cat /mnt/file_a
> WARNING: composefs backing file '3da5b14909626fc99443f580e4d8c9b990e85e0a1d18883dc89b23d43e173f' unexpectedly had no fs-verity digest
> cat: /mnt/file_a: Input/output error
>
> This re-uses the existing fs-verity functionality to protect against
> changes in file contents, while adding on top of it protection against
> changes in filesystem metadata and structure, i.e. protecting against
> replacing a fs-verity enabled file or modifying file permissions or
> xattrs.
>
> To be fully verified we need another step: we use fs-verity on the
> image itself. Then we pass the expected digest on the mount command
> line (which will be verified at mount time):
>
> # fsverity enable rootfs.img
> # fsverity digest rootfs.img
> sha256:da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 rootfs.img
> # mount -t composefs rootfs.img -o basedir=objects,digest=da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 /mnt

It seems that Composefs uses fsverity_get_digest() to do the fs-verity
check.
If Composefs uses a symlink-like payload to redirect a file to another
underlying file, such a file can live on any other filesystem. I can
see that Composefs could work with ext4, btrfs, f2fs, and later XFS,
but I'm not sure how it could work with overlayfs, FUSE, or other
network filesystems. That could limit the use cases as well.

Except for the above, I think EROFS could implement this in about
300~500 new lines of code, as I told Giuseppe, and so could squashfs or
overlayfs. I'm very happy to implement such a model if it can be proved
safe (I'd also like to say here that by no means do I dislike ostree),
and I'm also glad if folks feel like introducing a new filesystem for
this, as long as this overlay model is proved safe.

Hopefully it helps.

Thanks,
Gao Xiang
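The fs-verity digest discussed in this thread is a Merkle-tree measurement of the file, not a flat hash, which is why a file cannot be modified after the digest is taken without detection. A toy sketch of that construction, and of the open-time check a composefs-like filesystem performs against it, is below. This is heavily simplified: real fs-verity zero-pads blocks, fixes the tree arity by its block and digest sizes, and finally hashes a descriptor structure (root hash, block size, salt, file size).

```python
import hashlib

BLOCK_SIZE = 4096  # fs-verity's default Merkle tree block size

def merkle_root(data: bytes) -> bytes:
    """Toy Merkle-tree file digest: hash each block, then repeatedly
    hash pairs of digests until a single root remains. Conveys the
    shape of fs-verity's construction, not its exact on-disk format."""
    level = [hashlib.sha256(data[i:i + BLOCK_SIZE]).digest()
             for i in range(0, max(len(data), 1), BLOCK_SIZE)]
    while len(level) > 1:
        level = [hashlib.sha256(b"".join(level[i:i + 2])).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

def verify_backing_file(data: bytes, expected_hex: str) -> bool:
    """What the open-time check amounts to: compare the file's measured
    digest against the expected digest recorded in the image."""
    return merkle_root(data).hex() == expected_hex
```

Because the expected digest is stored in the (itself verified) image, replacing or tampering with a backing file changes its measured root and the open or subsequent reads fail, as the `cat /mnt/file_a` error in the example shows.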
On Mon, 2023-01-16 at 12:44 +0800, Gao Xiang wrote:
> Hi Alexander and folks,
>
> I'd like to say sorry about comments in LWN.net article. If it helps
> to the community, my own concern about this new overlay model was
> (which is different from overlayfs since overlayfs doesn't have
> different permission of original files) somewhat a security issue (as
> I told Giuseppe Scrivano before when he initially found me on slack):
>
> As composefs on-disk shown:
>
> struct cfs_inode_s {
>	...
>	u32 st_mode; /* File type and mode. */
>	u32 st_nlink; /* Number of hard links, only for regular files. */
>	u32 st_uid; /* User ID of owner. */
>	u32 st_gid; /* Group ID of owner. */
>	...
> };
>
> It seems Composefs can override uid / gid and mode bits of the
> original file, considering a rootfs image:
>
> ├── /bin
> │   └── su
>
> /bin/su has SUID bit set in the Composefs inode metadata, but I didn't
> find some clues if ostree "objects/abc" could be actually replaced
> with data of /bin/sh if composefs fsverity feature is disabled (it
> doesn't seem composefs enforcely enables fsverity according to
> documentation).
>
> I think that could cause _privilege escalation attack_ of these SUID
> files is replaced with some root shell. Administrators cannot keep
> all the time of these SUID files because such files can also be
> replaced at runtime.
>
> Composefs may assume that ostree is always for such content-addressed
> directory. But if considering it could laterly be an upstream fs, I
> think we cannot always tell people "no, don't use this way, it
> doesn't work" if people use Composefs under an untrusted repo (maybe
> even without ostree).
>
> That was my own concern at that time when Giuseppe Scrivano told me
> to enhance EROFS as this way, and I requested him to discuss this in
> the fsdevel mailing list in order to resolve this, but it doesn't
> happen.
>
> Otherwise, EROFS could face such issue as well, that is why I think
> it needs to be discussed first.
I mean, you're not wrong about this being possible. But I don't see
that this is necessarily a new problem. For example, consider the case
of loopback mounting an ext4 filesystem containing a setuid /bin/su
file. If you have the right permissions, nothing prohibits you from
modifying the loopback-mounted file and replacing the content of the su
file with a copy of bash.

In both these cases, the security of the system is fully defined by the
filesystem permissions of the backing file data. I think viewing
composefs as a "new type" of overlayfs gets the wrong idea across. It's
more similar to a "new type" of loopback mount. In particular, the
backing file metadata is completely unrelated to the metadata exposed
by the filesystem, which means that you can choose to protect the
backing files (and directories) in ways which protect against changes
from non-privileged users.

Note: The above assumes that mounting either a loopback mount or a
composefs image is a privileged operation. Allowing unprivileged mounts
is a very different thing.

>> To be fully verified we need another step: we use fs-verity on the
>> image itself. Then we pass the expected digest on the mount command
>> line (which will be verified at mount time):
>>
>> # fsverity enable rootfs.img
>> # fsverity digest rootfs.img
>> sha256:da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 rootfs.img
>> # mount -t composefs rootfs.img -o basedir=objects,digest=da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 /mnt
>
> It seems that Composefs uses fsverity_get_digest() to do fsverity
> check. If Composefs uses symlink-like payload to redirect a file to
> another underlayfs file, such underlayfs file can exist in any other
> fses.
>
> I can see Composefs could work with ext4, btrfs, f2fs, and later XFS
> but I'm not sure how it could work with overlayfs, FUSE, or other
> network fses. That could limit the use cases as well.
Yes, if you choose to store backing files on a non-fs-verity-enabled
filesystem you cannot use the fs-verity feature. But this is just a
decision users of composefs have to take if they wish to use this
particular feature. I think re-using fs-verity like this is a better
approach than re-implementing verity.

> Except for the above, I think EROFS could implement this in about
> 300~500 new lines of code as Giuseppe found me, or squashfs or
> overlayfs.
>
> I'm very happy to implement such model if it can be proved as safe
> (I'd also like to say here by no means I dislike ostree) and I'm also
> glad if folks feel like to introduce a new file system for this as
> long as this overlay model is proved as safe.

My personal target usecase is that of the ostree trusted root
filesystem, and it has a lot of specific requirements that lead to
choices in the design of composefs. I took a look at EROFS a while ago,
and I think that even with some verity-like feature it would not fit
this usecase.

EROFS does indeed do some of the file-sharing aspects of composefs with
its use of fs-cache (although the current n_chunk limit would need to
be raised). However, I think there are two problems with this.

First of all is the complexity of having to involve a userspace for the
cache. For trusted boot to work we have to have all the cachefs
userspace machinery on the (signed) initrd, and then have to properly
transition this across the pivot-root into the full OS boot. I'm sure
it is technically *possible*, but it is very complex and a pain to set
up and maintain.

Secondly, the use of fs-cache doesn't stack, as there can only be one
cachefs agent. For example, mixing an ostree EROFS boot with a
container backend using EROFS isn't possible (at least without deep
integration between the two userspaces).
Also, if we ignore the file-sharing aspects, there is the question of
how to actually integrate a new digest-based image format with the
pre-existing ostree formats and distribution mechanisms. If we just
replace everything with distributing a signed image file then we can
easily use existing technology (say dm-verity + squashfs + loopback).
However, this would essentially be A/B booting and we would lose all
the advantages of ostree.

Instead, what we have done with composefs is to make filesystem image
generation from the ostree repository 100% reproducible. Then we can
keep the entire pre-existing ostree distribution mechanism and on-disk
repo format, adding just a single piece of metadata to the ostree
commit, containing the composefs toplevel digest. Then the client can
easily and efficiently re-generate the composefs image locally, and
boot into it specifying the trusted, not-locally-generated digest. A
filesystem that doesn't have this reproducibility feature isn't going
to be possible to integrate with ostree without enormous changes to
ostree, and a filesystem more complex than composefs will have a hard
time giving such guarantees.
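The regenerate-and-verify flow described here can be sketched in a few lines. The serialization below is hypothetical (not the real composefs on-disk format): the point is only that a canonical encoding (sorted paths, fixed field layout) makes the locally generated image bit-identical on every client, so a single trusted digest carried in the signed ostree commit suffices to verify it.

```python
import hashlib

def build_image(entries: dict) -> bytes:
    """Canonically serialize {path: (mode, object_digest)} metadata.

    Toy stand-in for composefs image generation: sorted paths and a
    fixed field encoding make the output reproducible across machines
    given the same input metadata.
    """
    out = []
    for path in sorted(entries):
        mode, digest = entries[path]
        out.append(f"{path}\0{mode:o}\0{digest}\n".encode())
    return b"".join(out)

def boot_check(entries: dict, trusted_digest: str) -> bool:
    """What the client does at boot: regenerate the image locally and
    compare it against the trusted digest from the commit metadata."""
    return hashlib.sha256(build_image(entries)).hexdigest() == trusted_digest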
Hi Alexander,

On 2023/1/16 17:30, Alexander Larsson wrote:
> On Mon, 2023-01-16 at 12:44 +0800, Gao Xiang wrote:
>> Hi Alexander and folks,
>>
>> I'd like to say sorry about comments in LWN.net article. If it helps
>> to the community, my own concern about this new overlay model was
>> (which is different from overlayfs since overlayfs doesn't have
>> different permission of original files) somewhat a security issue (as
>> I told Giuseppe Scrivano before when he initially found me on slack):
>>
>> As composefs on-disk shown:
>>
>> struct cfs_inode_s {
>>	...
>>	u32 st_mode; /* File type and mode. */
>>	u32 st_nlink; /* Number of hard links, only for regular files. */
>>	u32 st_uid; /* User ID of owner. */
>>	u32 st_gid; /* Group ID of owner. */
>>	...
>> };
>>
>> It seems Composefs can override uid / gid and mode bits of the
>> original file, considering a rootfs image:
>>
>> ├── /bin
>> │   └── su
>>
>> /bin/su has SUID bit set in the Composefs inode metadata, but I
>> didn't find some clues if ostree "objects/abc" could be actually
>> replaced with data of /bin/sh if composefs fsverity feature is
>> disabled (it doesn't seem composefs enforcely enables fsverity
>> according to documentation).
>>
>> I think that could cause _privilege escalation attack_ of these SUID
>> files is replaced with some root shell. Administrators cannot keep
>> all the time of these SUID files because such files can also be
>> replaced at runtime.
>>
>> Composefs may assume that ostree is always for such content-addressed
>> directory. But if considering it could laterly be an upstream fs, I
>> think we cannot always tell people "no, don't use this way, it
>> doesn't work" if people use Composefs under an untrusted repo (maybe
>> even without ostree).
>>
>> That was my own concern at that time when Giuseppe Scrivano told me
>> to enhance EROFS as this way, and I requested him to discuss this in
>> the fsdevel mailing list in order to resolve this, but it doesn't
>> happen.
>>
>> Otherwise, EROFS could face such issue as well, that is why I think
>> it needs to be discussed first.
>
> I mean, you're not wrong about this being possible. But I don't see
> that this is necessarily a new problem. For example, consider the case
> of loopback mounting an ext4 filesystem containing a setuid /bin/su
> file. If you have the right permissions, nothing prohibits you from
> modifying the loopback mounted file and replacing the content of the
> su file with a copy of bash.
>
> In both these cases, the security of the system is fully defined by
> the filesystem permissions of the backing file data. I think viewing
> composefs as a "new type" of overlayfs gets the wrong idea across. Its
> more similar to a "new type" of loopback mount. In particular, the
> backing file metadata is completely unrelated to the metadata exposed
> by the filesystem, which means that you can chose to protect the
> backing files (and directories) in ways which protect against changes
> from non-privileged users.
>
> Note: The above assumes that mounting either a loopback mount or a
> composefs image is a privileged operation. Allowing unprivileged
> mounts is a very different thing.

Thanks for the reply. I think if I understand correctly, I can answer
some of your questions. Hopefully it helps everyone interested.

Let's avoid thinking about unprivileged mounts first, although Giuseppe
told me earlier that is also a future step for Composefs. But I don't
know how that could work reliably if a fs has some on-disk format; we
could discuss it later.
I think that as a loopback mount, such loopback files are quite under
the control of admins (take an ext4 loopback mount as an example: each
ext4 filesystem has only one file to access when setting up the
loopback device, and that loopback file is also held open when setting
up the loopback mount, so it cannot be replaced; if you enabled
fs-verity on it beforehand, it cannot be modified either).

But IMHO, here composefs shows a new model in which a stackable
filesystem can point to massive numbers of files under a random
directory, as ostree does (files in such a directory can even be
bind-mounted later, in principle). The original userspace ostree
strictly follows the underlying filesystem's permission checks, but
Composefs can override uid/gid/permissions instead.

That is also why we selected fscache in the first place to manage all
local cache data for EROFS: such a content-defined directory is quite
under the control of in-kernel fscache, instead of being a random
directory created and handed over by some userspace program.

If you are interested in looking into the current in-kernel fscache
behavior, I think it is much the same as what ostree does now. It just
needs new features like:
 - multiple directories;
 - daemonless
to match.

>>> To be fully verified we need another step: we use fs-verity on the
>>> image itself. Then we pass the expected digest on the mount command
>>> line (which will be verified at mount time):
>>>
>>> # fsverity enable rootfs.img
>>> # fsverity digest rootfs.img
>>> sha256:da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 rootfs.img
>>> # mount -t composefs rootfs.img -o basedir=objects,digest=da42003782992856240a3e25264b19601016114775debd80c01620260af86a76 /mnt
>
>> It seems that Composefs uses fsverity_get_digest() to do fsverity
>> check. If Composefs uses symlink-like payload to redirect a file to
>> another underlayfs file, such underlayfs file can exist in any other
>> fses.
>> I can see Composefs could work with ext4, btrfs, f2fs, and later XFS
>> but I'm not sure how it could work with overlayfs, FUSE, or other
>> network fses. That could limit the use cases as well.
>
> Yes, if you chose to store backing files on a non-fs-verity enabled
> filesystem you cannot use the fs-verity feature. But this is just a
> decision users of composefs have to take if they wish to use this
> particular feature. I think re-using fs-verity like this is a better
> approach than re-implementing verity.
>
>> Except for the above, I think EROFS could implement this in about
>> 300~500 new lines of code as Giuseppe found me, or squashfs or
>> overlayfs.
>>
>> I'm very happy to implement such model if it can be proved as safe
>> (I'd also like to say here by no means I dislike ostree) and I'm
>> also glad if folks feel like to introduce a new file system for
>> this as long as this overlay model is proved as safe.
>
> My personal target usecase is that of the ostree trusted root
> filesystem, and it has a lot of specific requirements that lead to
> choices in the design of composefs. I took a look at EROFS a while
> ago, and I think that even with some verity-like feature it would not
> fit this usecase.
>
> EROFS does indeed do some of the file-sharing aspects of composefs
> with its use of fs-cache (although the current n_chunk limit would
> need to be raised). However, I think there are two problems with this.
>
> First of all is the complexity of having to involve a userspace for
> the cache. For trusted boot to work we have to have all the cachefs
> userspace machinery on the (signed) initrd, and then have to properly
> transition this across the pivot-root into the full os boot. I'm sure
> it is technically *possible*, but it is very complex and a pain to set
> up and maintain.
>
> Secondly, the use of fs-cache doesn't stack, as there can only be one
> cachefs agent. For example, mixing an ostree EROFS boot with a
> container backend using EROFS isn't possible (at least without deep
> integration between the two userspaces).

The reasons above are all limitations of the current fscache
implementation:

- First, if such an overlay model really works, EROFS can do it without
  the fscache feature as well to integrate with userspace ostree. Even
  so, I hope this new feature can land in overlayfs rather than
  somewhere else, since overlayfs has a native writable layer, so we
  wouldn't need another overlayfs mount on top just for writing;

- Second, as I mentioned above, the limitations are how fscache behaves
  now, not how it will behave. I did discuss with David Howells, and he
  would also like to develop multiple-directory and daemonless features
  for network fses.

> Also, if we ignore the file sharing aspects there is the question of
> how to actually integrate a new digest-based image format with the
> pre-existing ostree formats and distribution mechanisms. If we just
> replace everything with distributing a signed image file then we can
> easily use existing technology (say dm-verity + squashfs + loopback).
> However, this would be essentially A/B booting and we would lose all
> the advantages of ostree.

EROFS can now do data deduplication, and later page cache sharing as
well.

> Instead what we have done with composefs is to make filesystem image
> generation from the ostree repository 100% reproducible. Then we can

EROFS is 100% reproducible as well.

> keep the entire pre-existing ostree distribution mechanism and on-disk
> repo format, adding just a single piece of metadata to the ostree
> commit, containing the composefs toplevel digest. Then the client can
> easily and efficiently re-generate the composefs image locally, and
> boot into it specifying the trusted not-locally-generated digest.
> A filesystem that doesn't have this reproducibility feature isn't
> going to be possible to integrate with ostree without enormous changes
> to ostree, and a filesystem more complex than composefs will have a
> hard time giving such guarantees.

I'm not sure why EROFS wouldn't be good at this; I could also make an
EROFS version of what Composefs does, with some symlink path attached
to each regular file, and ostree could make use of it as well.

But really, personally I think the issue above is different from
loopback devices and may need to be resolved first. And if possible, I
hope it could be a new overlayfs feature for everyone.

Thanks,
Gao Xiang
On Mon, 2023-01-16 at 18:19 +0800, Gao Xiang wrote:
> Hi Alexander,
>
> On 2023/1/16 17:30, Alexander Larsson wrote:
>>
>> I mean, you're not wrong about this being possible. But I don't see
>> that this is necessarily a new problem. For example, consider the
>> case of loopback mounting an ext4 filesystem containing a setuid
>> /bin/su file. If you have the right permissions, nothing prohibits
>> you from modifying the loopback mounted file and replacing the
>> content of the su file with a copy of bash.
>>
>> In both these cases, the security of the system is fully defined by
>> the filesystem permissions of the backing file data. I think viewing
>> composefs as a "new type" of overlayfs gets the wrong idea across.
>> Its more similar to a "new type" of loopback mount. In particular,
>> the backing file metadata is completely unrelated to the metadata
>> exposed by the filesystem, which means that you can chose to protect
>> the backing files (and directories) in ways which protect against
>> changes from non-privileged users.
>>
>> Note: The above assumes that mounting either a loopback mount or a
>> composefs image is a privileged operation. Allowing unprivileged
>> mounts is a very different thing.
>
> Thanks for the reply. I think if I understand correctly, I could
> answer some of your questions. Hopefully help to everyone interested.
>
> Let's avoid thinking unprivileged mounts first, although Giuseppe told
> me earilier that is also a future step of Composefs. But I don't know
> how it could work reliably if a fs has some on-disk format, we could
> discuss it later.
>
> I think as a loopback mount, such loopback files are quite under
> control (take ext4 loopback mount as an example, each ext4 has the
> only one file to access when setting up loopback devices and such
> loopback file was also opened when setting up loopback mount so it
> cannot be replaced.
> If you enables fsverity for such loopback mount before, it cannot be
> modified as well) by admins.
>
> But IMHO, here composefs shows a new model that some stackable
> filesystem can point to massive files under a random directory as
> what ostree does (even files in such directory can be bind-mounted
> later in principle). But the original userspace ostree strictly
> follows underlayfs permission check but Composefs can override
> uid/gid/permission instead.

Suppose you have:

-rw-r--r-- root root image.ext4
-rw-r--r-- root root image.composefs
drwxr--r-- root root objects/
-rw-r--r-- root root objects/backing.file

Are you saying it is easier for someone to modify backing.file than
image.ext4?

I argue it is not, but composefs takes some steps to avoid issues here.
At mount time, when the basedir ("objects/" above) argument is parsed,
we resolve that path and then create a private vfsmount for it:

resolve_basedir(path) {
        ...
        mnt = clone_private_mount(&path);
        ...
}

fsi->bases[i] = resolve_basedir(path);

Then we open backing files with this mount as root:

real_file = file_open_root_mnt(fsi->bases[i], real_path,
                               file->f_flags, 0);

This will never resolve outside the initially specified basedir, even
with symlinks or whatever. It will also not be affected by later mount
changes in the original mount namespace, as this is a private mount.

This is the same mechanism that overlayfs uses for its upper dirs.

I would argue that anyone who has rights to modify the contents of
files in "objects" (supposing they were created with sane permissions)
would also have rights to modify "image.ext4".

> That is also why we selected fscache at the first time to manage all
> local cache data for EROFS, since such content-defined directory is
> quite under control by in-kernel fscache instead of selecting a
> random directory created and given by some userspace program.
> If you are interested in looking into the current in-kernel fscache
> behavior, I think that is much similar as what ostree does now.
>
> It just needs new features like
> - multiple directories;
> - daemonless
> to match.

Obviously everything can be extended to support everything. But
composefs is very small and simple (2128 lines of code), while at the
same time being easy to use (just mount it with one syscall) and
needing no complex userspace machinery and configuration. Even without
the above feature additions, fscache + cachefiles is 7982 lines, plus
erofs is 9075 lines, and then on top of that you need userspace
integration to even use the thing.

Don't get me wrong, EROFS is great for its usecases, but I don't really
think it is the right choice for my usecase.

>> Secondly, the use of fs-cache doesn't stack, as there can only be one
>> cachefs agent. For example, mixing an ostree EROFS boot with a
>> container backend using EROFS isn't possible (at least without deep
>> integration between the two userspaces).
>
> The reasons above are all current fscache implementation limitation:
>
> - First, if such overlay model really works, EROFS can do it without
>   fscache feature as well to integrate userspace ostree. But even that
>   I hope this new feature can be landed in overlayfs rather than some
>   other ways since it has native writable layer so we don't need
>   another overlayfs mount at all for writing;

I don't think it is the right approach for overlayfs to integrate
something like image support. Merging the two codebases would
complicate both while adding costs to users who need support for only
one of the features. I think reusing and stacking separate features is
a better idea than combining them.

>> Instead what we have done with composefs is to make filesystem image
>> generation from the ostree repository 100% reproducible. Then we can
>
> EROFS is all 100% reproduciable as well.
Really, so if today, on Fedora 36, I run:

# tar xvf oci-image.tar
# mkfs.erofs oci-dir/ oci.erofs

and then in 5 years someone on Debian 13 runs the same, with the same
tar file, will both oci.erofs files have the same sha256 checksum?

How do you handle things like different versions or builds of
compression libraries creating different results? Do you guarantee not
to add any new backwards-compat changes by default, or change any
default options? Do you guarantee that the files are read from
"oci-dir" in the same order each time? It doesn't look like it.

> But really, personally I think the issue above is different from
> loopback devices and may need to be resolved first. And if possible,
> I hope it could be an new overlayfs feature for everyone.

Yeah. Independent of composefs, I think EROFS would be better if you
could just point it at a chunk directory at mount time rather than
having to route everything through a system-wide global cachefs
singleton. I understand that cachefs helps with the on-demand download
aspect, but when you don't need that it is just in the way.
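The readdir-order question raised here is concrete: a mkfs tool that enumerates files in raw directory order inherits the filesystem's internal (often hash-based) ordering, so two runs over identical content can produce different images. The usual fix is to sort every directory listing before hashing or serializing, as in this sketch (a generic illustration, not the behavior of any particular mkfs tool):

```python
import hashlib
import os
from pathlib import Path

def tree_digest(root: Path) -> str:
    """Digest a directory tree in a canonical (sorted) order, so the
    result depends only on content, never on readdir order."""
    h = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # fix the traversal order of subdirectories
        for name in sorted(filenames):
            p = Path(dirpath) / name
            rel = p.relative_to(root).as_posix()
            h.update(rel.encode() + b"\0")  # bind content to its path
            h.update(p.read_bytes())
    return h.hexdigest()
```

With sorting in place, two trees with the same files produce the same digest regardless of the order in which the files were created on disk.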
On 2023/1/16 20:33, Alexander Larsson wrote:
> On Mon, 2023-01-16 at 18:19 +0800, Gao Xiang wrote:
>> Hi Alexander,
>>
>> On 2023/1/16 17:30, Alexander Larsson wrote:
>>>
>>> I mean, you're not wrong about this being possible. But I don't see
>>> that this is necessarily a new problem. For example, consider the
>>> case of loopback mounting an ext4 filesystem containing a setuid
>>> /bin/su file. If you have the right permissions, nothing prohibits
>>> you from modifying the loopback mounted file and replacing the
>>> content of the su file with a copy of bash.
>>>
>>> In both these cases, the security of the system is fully defined by
>>> the filesystem permissions of the backing file data. I think viewing
>>> composefs as a "new type" of overlayfs gets the wrong idea across.
>>> Its more similar to a "new type" of loopback mount. In particular,
>>> the backing file metadata is completely unrelated to the metadata
>>> exposed by the filesystem, which means that you can chose to protect
>>> the backing files (and directories) in ways which protect against
>>> changes from non-privileged users.
>>>
>>> Note: The above assumes that mounting either a loopback mount or a
>>> composefs image is a privileged operation. Allowing unprivileged
>>> mounts is a very different thing.
>>
>> Thanks for the reply. I think if I understand correctly, I could
>> answer some of your questions. Hopefully help to everyone interested.
>>
>> Let's avoid thinking unprivileged mounts first, although Giuseppe
>> told me earilier that is also a future step of Composefs. But I don't
>> know how it could work reliably if a fs has some on-disk format, we
>> could discuss it later.
>> I think as a loopback mount, such loopback files are quite under control by admins (take an ext4 loopback mount as an example: each ext4 filesystem has only one file to access when setting up the loopback device, and that file is also held open when setting up the loopback mount, so it cannot be replaced; if you enable fs-verity on such a loopback file beforehand, it cannot be modified either).
>>
>> But IMHO, here composefs shows a new model where some stackable filesystem can point to massive numbers of files under a random directory, as ostree does (even files in such a directory can be bind-mounted later in principle). But the original userspace ostree strictly follows the underlying fs permission check, while composefs can override uid/gid/permissions instead.
>
> Suppose you have:
>
> -rw-r--r-- root root image.ext4
> -rw-r--r-- root root image.composefs
> drwxr--r-- root root objects/
> -rw-r--r-- root root objects/backing.file
>
> Are you saying it is easier for someone to modify backing.file than image.ext4?
>
> I argue it is not, but composefs takes some steps to avoid issues here. At mount time, when the basedir ("objects/" above) argument is parsed, we resolve that path and then create a private vfsmount for it:
>
> resolve_basedir(path) {
>         ...
>         mnt = clone_private_mount(&path);
>         ...
> }
>
> fsi->bases[i] = resolve_basedir(path);
>
> Then we open backing files with this mount as root:
>
> real_file = file_open_root_mnt(fsi->bases[i], real_path,
>                                file->f_flags, 0);
>
> This will never resolve outside the initially specified basedir, even with symlinks or whatever. It will also not be affected by later mount changes in the original mount namespace, as this is a private mount.
>
> This is the same mechanism that overlayfs uses for its upper dirs.

Ok. I have no problem with this part.
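The confinement property described above (backing-file lookups can never escape the basedir, regardless of symlinks) can be illustrated in user space. This is only an analogue, assuming a realpath-based check; the kernel gets the property structurally from the private vfsmount rather than from path-string comparison:

```python
import os
import tempfile

def open_in_basedir(basedir, rel_path):
    """User-space illustration of the confinement composefs gets from its
    private vfsmount: whatever the image-supplied path looks like
    (symlinks, ".."), the open must resolve inside basedir or be refused.
    The kernel achieves this via clone_private_mount() +
    file_open_root_mnt(); this realpath check is only an analogue."""
    basedir = os.path.realpath(basedir)
    target = os.path.realpath(os.path.join(basedir, rel_path))
    if os.path.commonpath([basedir, target]) != basedir:
        raise PermissionError(f"{rel_path!r} escapes the basedir")
    return open(target, "rb")

root = tempfile.mkdtemp()
objects = os.path.join(root, "objects")
os.mkdir(objects)
with open(os.path.join(objects, "backing.file"), "wb") as f:
    f.write(b"data")
with open(os.path.join(root, "secret"), "wb") as f:
    f.write(b"outside the basedir")
# A hostile entry inside objects/ pointing outside of it:
os.symlink(os.path.join(root, "secret"), os.path.join(objects, "link"))

ok = open_in_basedir(objects, "backing.file").read()
try:
    open_in_basedir(objects, "link")
    escaped = True
except PermissionError:
    escaped = False
```

The symlink that points outside `objects/` is rejected, while the ordinary backing file opens normally.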
> I would argue that anyone who has rights to modify the contents of files in "objects" (supposing they were created with sane permissions) would also have rights to modify "image.ext4".

But you don't have any permission check for files in such an "objects/" directory in the composefs source code, do you?

As I said in my original reply, don't assume that random users or malicious people will only pass things in or behave the way you expect. Sometimes they won't, and I think in-kernel fses should handle such cases by design. Obviously, any system written by humans can have unexpected bugs, but that is another story. I think in general it needs such a design at least.

>> That is also why we selected fscache in the first place to manage all local cache data for EROFS, since such a content-defined directory is quite under the control of in-kernel fscache, instead of being a random directory created and given by some userspace program.
>>
>> If you are interested in looking into the current in-kernel fscache behavior, I think it is much the same as what ostree does now.
>>
>> It just needs new features like
>>  - multiple directories;
>>  - daemonless
>> to match.
>
> Obviously everything can be extended to support everything. But composefs is very small and simple (2128 lines of code), while at the same time being easy to use (just mount it with one syscall) and needs no complex userspace machinery and configuration. But even without the above feature additions fscache + cachefiles is 7982 lines, plus erofs is 9075 lines, and then on top of that you need userspace integration to even use the thing.

I've already replied to this in a comment on LWN.net. EROFS can handle both device-based and file-based images. It can handle FSDAX, compression, data deduplication, rolling-hash finer-grained compressed data deduplication, etc. Of course, for your use cases, you can just turn them off in Kconfig; I think such code is useless for your use cases as well.
And as a team effort over these years, EROFS has always accepted useful features from other people. And I've always been working on cleaning up EROFS, but as long as it gains more features, the code will of course expand.

Also take your project -- flatpak -- for example: I don't think the total line count of the current version is the same as the original version's. Will you always keep the composefs source code below 2.5k LoC?

> Don't take me wrong, EROFS is great for its usecases, but I don't really think it is the right choice for my usecase.
>
>>> Secondly, the use of fs-cache doesn't stack, as there can only be one cachefs agent. For example, mixing an ostree EROFS boot with a container backend using EROFS isn't possible (at least without deep integration between the two userspaces).
>>
>> The reasons above are all current fscache implementation limitations:
>>
>> - First, if such an overlay model really works, EROFS can do it without the fscache feature as well to integrate userspace ostree. But even then I hope this new feature can be landed in overlayfs rather than some other way, since it has a native writable layer so we don't need another overlayfs mount at all for writing;
>
> I don't think it is the right approach for overlayfs to integrate something like image support. Merging the two codebases would complicate both while adding costs to users who need only support for one of the features. I think reusing and stacking separate features is a better idea than combining them.

Why? overlayfs could have metadata support as well, if they'd like to support advanced features like partial copy-up without fscache support.

>>> Instead what we have done with composefs is to make filesystem image generation from the ostree repository 100% reproducible. Then we can
>>
>> EROFS is all 100% reproducible as well.
> Really, so if I today, on fedora 36 run:
> # tar xvf oci-image.tar
> # mkfs.erofs oci-dir/ oci.erofs
>
> And then in 5 years, if someone on debian 13 runs the same, with the same tar file, then both oci.erofs files will have the same sha256 checksum?

Why wouldn't they? Reproducible builds are a MUST for Android use cases as well.

Yes, it may break between versions by mistake, but I think reproducible builds are basic functionality for all image use cases.

> How do you handle things like different versions or builds of compression libraries creating different results? Do you guarantee to not add any new backwards compat changes by default, or change any default options? Do you guarantee that the files are read from "oci-dir" in the same order each time? It doesn't look like it.

If you argue like that, why wouldn't mkcomposefs have the same issue, in that it may be broken by some bug?

>> But really, personally I think the issue above is different from loopback devices and may need to be resolved first. And if possible, I hope it could be a new overlayfs feature for everyone.
>
> Yeah. Independent of composefs, I think EROFS would be better if you could just point it to a chunk directory at mount time rather than having to route everything through a system-wide global cachefs singleton. I understand that cachefs does help with the on-demand download aspect, but when you don't need that it is just in the way.

Just check your reply to Dave's review; it seems that the composefs dir on-disk format works much the same as EROFS's as well, see:

https://docs.kernel.org/filesystems/erofs.html -- Directories

a block vs a chunk = dirent + names

cfs_dir_lookup -> erofs_namei + find_target_block_classic;
cfs_dir_lookup_in_chunk -> find_target_dirent.
Yes, great projects can be very similar to each other occasionally, not to mention open source projects ;)

Anyway, I'm not opposed to Composefs if folks really like a new read-only filesystem for this. That is almost all I'd like to say about Composefs formally, have fun!

Thanks,
Gao Xiang
Gao Xiang <hsiangkao@linux.alibaba.com> writes: > On 2023/1/16 20:33, Alexander Larsson wrote: >> On Mon, 2023-01-16 at 18:19 +0800, Gao Xiang wrote: >>> Hi Alexander, >>> >>> On 2023/1/16 17:30, Alexander Larsson wrote: >>>> >>>> I mean, you're not wrong about this being possible. But I don't see >>>> that this is necessarily a new problem. For example, consider the >>>> case >>>> of loopback mounting an ext4 filesystem containing a setuid /bin/su >>>> file. If you have the right permissions, nothing prohibits you from >>>> modifying the loopback mounted file and replacing the content of >>>> the su >>>> file with a copy of bash. >>>> >>>> In both these cases, the security of the system is fully defined by >>>> the >>>> filesystem permissions of the backing file data. I think viewing >>>> composefs as a "new type" of overlayfs gets the wrong idea across. >>>> Its >>>> more similar to a "new type" of loopback mount. In particular, the >>>> backing file metadata is completely unrelated to the metadata >>>> exposed >>>> by the filesystem, which means that you can chose to protect the >>>> backing files (and directories) in ways which protect against >>>> changes >>>> from non-privileged users. >>>> >>>> Note: The above assumes that mounting either a loopback mount or a >>>> composefs image is a privileged operation. Allowing unprivileged >>>> mounts >>>> is a very different thing. >>> >>> Thanks for the reply. I think if I understand correctly, I could >>> answer some of your questions. Hopefully help to everyone >>> interested. >>> >>> Let's avoid thinking unprivileged mounts first, although Giuseppe >>> told >>> me earilier that is also a future step of Composefs. But I don't know >>> how it could work reliably if a fs has some on-disk format, we could >>> discuss it later. 
>>> >>> I think as a loopback mount, such loopback files are quite under >>> control >>> (take ext4 loopback mount as an example, each ext4 has the only one >>> file >>> to access when setting up loopback devices and such loopback file >>> was >>> also opened when setting up loopback mount so it cannot be >>> replaced. >>> >>> If you enables fsverity for such loopback mount before, it cannot >>> be >>> modified as well) by admins. >>> >>> >>> But IMHO, here composefs shows a new model that some stackable >>> filesystem can point to massive files under a random directory as >>> what >>> ostree does (even files in such directory can be bind-mounted later >>> in >>> principle). But the original userspace ostree strictly follows >>> underlayfs permission check but Composefs can override >>> uid/gid/permission instead. >> Suppose you have: >> -rw-r--r-- root root image.ext4 >> -rw-r--r-- root root image.composefs >> drwxr--r-- root root objects/ >> -rw-r--r-- root root objects/backing.file >> Are you saying it is easier for someone to modify backing.file than >> image.ext4? >> I argue it is not, but composefs takes some steps to avoid issues >> here. >> At mount time, when the basedir ("objects/" above) argument is parsed, >> we resolve that path and then create a private vfsmount for it: >> resolve_basedir(path) { >> ... >> mnt = clone_private_mount(&path); >> ... >> } >> fsi->bases[i] = resolve_basedir(path); >> Then we open backing files with this mount as root: >> real_file = file_open_root_mnt(fsi->bases[i], real_path, >> file->f_flags, 0); >> This will never resolve outside the initially specified basedir, >> even >> with symlinks or whatever. It will also not be affected by later mount >> changes in the original mount namespace, as this is a private mount. >> This is the same mechanism that overlayfs uses for its upper dirs. > > Ok. I have no problem of this part. 
>> I would argue that anyone who has rights to modify the contents of files in "objects" (supposing they were created with sane permissions) would also have rights to modify "image.ext4".
>
> But you don't have any permission check for files in such an "objects/" directory in the composefs source code, do you?
>
> As I said in my original reply, don't assume that random users or malicious people will only pass things in or behave the way you expect. Sometimes they won't, and I think in-kernel fses should handle such cases by design. Obviously, any system written by humans can have unexpected bugs, but that is another story. I think in general it needs such a design at least.

What malicious people are you worried about? Composefs is usable only in the initial user namespace for now, so only root can use it, and root has the responsibility to use trusted files.

>>> That is also why we selected fscache in the first place to manage all local cache data for EROFS, since such a content-defined directory is quite under the control of in-kernel fscache, instead of being a random directory created and given by some userspace program.
>>>
>>> If you are interested in looking into the current in-kernel fscache behavior, I think it is much the same as what ostree does now.
>>>
>>> It just needs new features like
>>>  - multiple directories;
>>>  - daemonless
>>> to match.
>>
>> Obviously everything can be extended to support everything. But composefs is very small and simple (2128 lines of code), while at the same time being easy to use (just mount it with one syscall) and needs no complex userspace machinery and configuration. But even without the above feature additions fscache + cachefiles is 7982 lines, plus erofs is 9075 lines, and then on top of that you need userspace integration to even use the thing.
>
> I've already replied to this in a comment on LWN.net. EROFS can handle both device-based and file-based images.
It can handle FSDAX, compression, > data deduplication, rolling-hash finer compressed data duplication, > etc. Of course, for your use cases, you can just turn them off by > Kconfig, I think such code is useless to your use cases as well. > > And as a team work these years, EROFS always accept useful features > from other people. And I've been always working on cleaning up > EROFS, but as long as it gains more features, the code can expand > of course. > > Also take your project -- flatpak for example, I don't think the > total line of current version is as same as the original version. > > Also you will always maintain Composefs source code below 2.5k Loc? > >> Don't take me wrong, EROFS is great for its usecases, but I don't >> really think it is the right choice for my usecase. >> >>>>> >>>> Secondly, the use of fs-cache doesn't stack, as there can only be >>>> one >>>> cachefs agent. For example, mixing an ostree EROFS boot with a >>>> container backend using EROFS isn't possible (at least without deep >>>> integration between the two userspaces). >>> >>> The reasons above are all current fscache implementation limitation: >>> >>> - First, if such overlay model really works, EROFS can do it >>> without >>> fscache feature as well to integrate userspace ostree. But even that >>> I hope this new feature can be landed in overlayfs rather than some >>> other ways since it has native writable layer so we don't need >>> another >>> overlayfs mount at all for writing; >> I don't think it is the right approach for overlayfs to integrate >> something like image support. Merging the two codebases would >> complicate both while adding costs to users who need only support for >> one of the features. I think reusing and stacking separate features is >> a better idea than combining them. > > Why? overlayfs could have metadata support as well, if they'd like > to support advanced features like partial copy-up without fscache > support. 
> >> >>> >>>> >>>> Instead what we have done with composefs is to make filesystem >>>> image >>>> generation from the ostree repository 100% reproducible. Then we >>>> can >>> >>> EROFS is all 100% reproduciable as well. >>> >> Really, so if I today, on fedora 36 run: >> # tar xvf oci-image.tar >> # mkfs.erofs oci-dir/ oci.erofs >> And then in 5 years, if someone on debian 13 runs the same, with the >> same tar file, then both oci.erofs files will have the same sha256 >> checksum? > > Why it doesn't? Reproducable builds is a MUST for Android use cases > as well. > > Yes, it may break between versions by mistake, but I think > reproducable builds is a basic functionalaity for all image > use cases. > >> How do you handle things like different versions or builds of >> compression libraries creating different results? Do you guarantee to >> not add any new backwards compat changes by default, or change any >> default options? Do you guarantee that the files are read from "oci- >> dir" in the same order each time? It doesn't look like it. > > If you'd like to say like that, why mkcomposefs doesn't have the > same issue that it may be broken by some bug. > >> >>> >>> But really, personally I think the issue above is different from >>> loopback devices and may need to be resolved first. And if possible, >>> I hope it could be an new overlayfs feature for everyone. >> Yeah. Independent of composefs, I think EROFS would be better if you >> could just point it to a chunk directory at mount time rather than >> having to route everything through a system-wide global cachefs >> singleton. I understand that cachefs does help with the on-demand >> download aspect, but when you don't need that it is just in the way. 
> > Just check your reply to Dave's review, it seems that how > composefs dir on-disk format works is also much similar to > EROFS as well, see: > > https://docs.kernel.org/filesystems/erofs.html -- Directories > > a block vs a chunk = dirent + names > > cfs_dir_lookup -> erofs_namei + find_target_block_classic; > cfs_dir_lookup_in_chunk -> find_target_dirent. > > Yes, great projects could be much similar to each other > occasionally, not to mention opensource projects ;) > > Anyway, I'm not opposed to Composefs if folks really like a > new read-only filesystem for this. That is almost all I'd like > to say about Composefs formally, have fun! > > Thanks, > Gao Xiang > >>
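For readers comparing the two formats discussed above: both lay out a directory block as a sorted array of fixed-size dirents followed by the packed names, so a lookup can binary-search by name. A toy Python model of that shared layout (hypothetical helpers, not the actual on-disk structures of either filesystem):

```python
def build_dirent_block(entries):
    """Toy model of the layout both drivers use: fixed-size dirents
    (name offset, name length, inode number) sorted by name, followed
    by the packed names themselves."""
    names = bytearray()
    dirents = []
    for name, ino in sorted(entries.items()):
        encoded = name.encode()
        dirents.append((len(names), len(encoded), ino))
        names += encoded
    return dirents, bytes(names)

def lookup(dirents, names, wanted):
    """Binary search over the sorted dirents, comparing the name each
    dirent points at (in the spirit of cfs_dir_lookup_in_chunk /
    find_target_dirent)."""
    key = wanted.encode()
    lo, hi = 0, len(dirents)
    while lo < hi:
        mid = (lo + hi) // 2
        off, ln, ino = dirents[mid]
        name = names[off:off + ln]
        if name == key:
            return ino
        if name < key:
            lo = mid + 1
        else:
            hi = mid
    return None

# Directory entries from the example at the top of the thread:
dirents, names = build_dirent_block({"dir": 10, "file_a": 11, "file_b": 12})
print(lookup(dirents, names, "file_a"))  # prints 11
```

Sorting the dirents at image-build time is what makes both the O(log n) lookup and the bit-reproducible directory blocks possible.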
On Mon, 2023-01-16 at 21:26 +0800, Gao Xiang wrote:
> On 2023/1/16 20:33, Alexander Larsson wrote:
> >
> > Suppose you have:
> >
> > -rw-r--r-- root root image.ext4
> > -rw-r--r-- root root image.composefs
> > drwxr--r-- root root objects/
> > -rw-r--r-- root root objects/backing.file
> >
> > Are you saying it is easier for someone to modify backing.file than image.ext4?
> >
> > I argue it is not, but composefs takes some steps to avoid issues here. At mount time, when the basedir ("objects/" above) argument is parsed, we resolve that path and then create a private vfsmount for it:
> >
> > resolve_basedir(path) {
> >         ...
> >         mnt = clone_private_mount(&path);
> >         ...
> > }
> >
> > fsi->bases[i] = resolve_basedir(path);
> >
> > Then we open backing files with this mount as root:
> >
> > real_file = file_open_root_mnt(fsi->bases[i], real_path,
> >                                file->f_flags, 0);
> >
> > This will never resolve outside the initially specified basedir, even with symlinks or whatever. It will also not be affected by later mount changes in the original mount namespace, as this is a private mount.
> >
> > This is the same mechanism that overlayfs uses for its upper dirs.
>
> Ok. I have no problem with this part.
>
> > I would argue that anyone who has rights to modify the contents of files in "objects" (supposing they were created with sane permissions) would also have rights to modify "image.ext4".
>
> But you don't have any permission check for files in such an "objects/" directory in the composefs source code, do you?

I don't see how permission checks would make any difference to anyone's ability to modify the image. Do you mean the kernel should validate the basedir so that it has sane permissions, rather than trusting the user? That seems weird to me.
Or do you mean that someone would create a composefs image that references a file they could not otherwise read, and then use it as a basedir in a composefs mount to read the file? Such a mount can only happen if you are root, and it can only read files inside that particular directory. However, maybe we should use the callers credentials to ensure that they are allowed to read the backing file, just in case. That can't hurt. > As I said in my original reply, don't assume random users or > malicious people just passing in or behaving like your expected > way. Sometimes they're not but I think in-kernel fses should > handle such cases by design. Obviously, any system written by > human can cause unexpected bugs, but that is another story. > I think in general it needs to have such design at least. You need to be root to mount a fs, an operation which is generally unsafe (because few filesystems are completely resistant to hostile filesystem data). Therefore I think we can expect a certain amount of sanity in its use, such as "don't pass in directories that are world writable". > > > > > That is also why we selected fscache at the first time to manage > > > all > > > local cache data for EROFS, since such content-defined directory > > > is > > > quite under control by in-kernel fscache instead of selecting a > > > random directory created and given by some userspace program. > > > > > > If you are interested in looking info the current in-kernel > > > fscache > > > behavior, I think that is much similar as what ostree does now. > > > > > > It just needs new features like > > > - multiple directories; > > > - daemonless > > > to match. > > > > > > > Obviously everything can be extended to support everything. But > > composefs is very small and simple (2128 lines of code), while at > > the > > same time being easy to use (just mount it with one syscall) and > > needs > > no complex userspace machinery and configuration. 
But even without > > the > > above feature additions fscache + cachefiles is 7982 lines, plus > > erofs > > is 9075 lines, and then on top of that you need userspace > > integration > > to even use the thing. > > I've replied this in the comment of LWN.net. EROFS can handle both > device-based or file-based images. It can handle FSDAX, compression, > data deduplication, rolling-hash finer compressed data duplication, > etc. Of course, for your use cases, you can just turn them off by > Kconfig, I think such code is useless to your use cases as well. > > And as a team work these years, EROFS always accept useful features > from other people. And I've been always working on cleaning up > EROFS, but as long as it gains more features, the code can expand > of course. > > Also take your project -- flatpak for example, I don't think the > total line of current version is as same as the original version. > > Also you will always maintain Composefs source code below 2.5k Loc? > > > > > Don't take me wrong, EROFS is great for its usecases, but I don't > > really think it is the right choice for my usecase. > > > > > > > > > > > Secondly, the use of fs-cache doesn't stack, as there can only > > > > be > > > > one > > > > cachefs agent. For example, mixing an ostree EROFS boot with a > > > > container backend using EROFS isn't possible (at least without > > > > deep > > > > integration between the two userspaces). > > > > > > The reasons above are all current fscache implementation > > > limitation: > > > > > > - First, if such overlay model really works, EROFS can do it > > > without > > > fscache feature as well to integrate userspace ostree. But even > > > that > > > I hope this new feature can be landed in overlayfs rather than > > > some > > > other ways since it has native writable layer so we don't need > > > another > > > overlayfs mount at all for writing; > > > > I don't think it is the right approach for overlayfs to integrate > > something like image support. 
> > Merging the two codebases would complicate both while adding costs to users who need only support for one of the features. I think reusing and stacking separate features is a better idea than combining them.
>
> Why? overlayfs could have metadata support as well, if they'd like to support advanced features like partial copy-up without fscache support.
>
> > > > Instead what we have done with composefs is to make filesystem image generation from the ostree repository 100% reproducible. Then we can
> > >
> > > EROFS is all 100% reproducible as well.
> >
> > Really, so if I today, on fedora 36 run:
> > # tar xvf oci-image.tar
> > # mkfs.erofs oci-dir/ oci.erofs
> >
> > And then in 5 years, if someone on debian 13 runs the same, with the same tar file, then both oci.erofs files will have the same sha256 checksum?
>
> Why wouldn't they? Reproducible builds are a MUST for Android use cases as well.

Those are not quite the same requirements. A reproducible build in the traditional sense is limited to a particular build configuration. You define a set of tools for the build, and use the same ones for each build, and get a fixed output. You don't expect to be able to change e.g. the compiler and get the same result. Similarly, it is often the case that different builds or versions of compression libraries give different results, so you can't expect to use e.g. a different libz and get identical images.

> Yes, it may break between versions by mistake, but I think reproducible builds are basic functionality for all image use cases.
>
> > How do you handle things like different versions or builds of compression libraries creating different results? Do you guarantee to not add any new backwards compat changes by default, or change any default options? Do you guarantee that the files are read from "oci-dir" in the same order each time?
It doesn't look like it. > > If you'd like to say like that, why mkcomposefs doesn't have the > same issue that it may be broken by some bug. > libcomposefs defines a normalized form for everything like file order, xattr orders, etc, and carefully normalizes everything such that we can guarantee these properties. It is possible that some detail was missed, because we're humans. But it was a very conscious and deliberate design choice that is deeply encoded in the code and format. For example, this is why we don't use compression but try to minimize size in other ways. > > > > > > But really, personally I think the issue above is different from > > > loopback devices and may need to be resolved first. And if > > > possible, > > > I hope it could be an new overlayfs feature for everyone. > > > > Yeah. Independent of composefs, I think EROFS would be better if > > you > > could just point it to a chunk directory at mount time rather than > > having to route everything through a system-wide global cachefs > > singleton. I understand that cachefs does help with the on-demand > > download aspect, but when you don't need that it is just in the > > way. > > Just check your reply to Dave's review, it seems that how > composefs dir on-disk format works is also much similar to > EROFS as well, see: > > https://docs.kernel.org/filesystems/erofs.html -- Directories > > a block vs a chunk = dirent + names > > cfs_dir_lookup -> erofs_namei + find_target_block_classic; > cfs_dir_lookup_in_chunk -> find_target_dirent. Yeah, the dirent layout looks very similar. I guess great minds think alike! My approach was simpler initially, but it kinda converged on this when I started optimizing the kernel lookup code with binary search. > Yes, great projects could be much similar to each other > occasionally, not to mention opensource projects ;) > > Anyway, I'm not opposed to Composefs if folks really like a > new read-only filesystem for this. 
> That is almost all I'd like to say about Composefs formally, have fun!
>
> Thanks,
> Gao Xiang

Cool, thanks for the feedback.
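The normalization argument Alexander makes here — a fully normalized serialization (sorted file order, sorted xattrs, no compression) makes the image a pure function of the input tree — can be sketched as follows. This illustrates the property with a made-up JSON serialization, not the libcomposefs format:

```python
import hashlib
import json

def image_digest(tree):
    """Serialize a file tree in one fully normalized form: paths sorted,
    xattrs sorted, content referenced by its sha256, no compression.
    The digest then depends only on the tree's contents, never on the
    order in which the build tool happened to enumerate files."""
    normalized = []
    for path in sorted(tree):
        mode, xattrs, content = tree[path]
        normalized.append({
            "path": path,
            "mode": mode,
            "xattrs": sorted(xattrs.items()),
            "sha256": hashlib.sha256(content).hexdigest(),
        })
    blob = json.dumps(normalized, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

# Two "builds" enumerating the same tree in different orders:
build1 = {"file_b": (0o644, {"user.b": "2", "user.a": "1"}, b"content_b\n"),
          "file_a": (0o644, {}, b"content_a\n")}
build2 = dict(reversed(list(build1.items())))  # same tree, other order
```

Both builds produce the same digest; avoiding compression sidesteps the libz-version problem entirely, at the cost of larger images.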
On 2023/1/16 23:27, Alexander Larsson wrote:
> On Mon, 2023-01-16 at 21:26 +0800, Gao Xiang wrote:

I will stop talking about the overlay permission model now, since there are more experienced folks working on this, although the SUID stuff still looks dangerous to me as an end-user: IMHO, it's hard for me to identify the proper sub-sub-subdir UID/GID in "objects" at runtime, and they can be nested much deeper, which is different from a local fs with loopback devices or overlayfs. I don't know what improper sub-sub-subdir UID/GID in "objects" could cause. It seems that ostree currently uses "root" all the time for such "objects" subdirs, but I don't know.

>>>>> Instead what we have done with composefs is to make filesystem image generation from the ostree repository 100% reproducible. Then we can
>>>>
>>>> EROFS is all 100% reproducible as well.
>>>
>>> Really, so if I today, on fedora 36 run:
>>> # tar xvf oci-image.tar
>>> # mkfs.erofs oci-dir/ oci.erofs
>>>
>>> And then in 5 years, if someone on debian 13 runs the same, with the same tar file, then both oci.erofs files will have the same sha256 checksum?
>>
>> Why wouldn't they? Reproducible builds are a MUST for Android use cases as well.
>
> Those are not quite the same requirements. A reproducible build in the traditional sense is limited to a particular build configuration. You define a set of tools for the build, and use the same ones for each build, and get a fixed output. You don't expect to be able to change e.g. the compiler and get the same result. Similarly, it is often the case that different builds or versions of compression libraries give different results, so you can't expect to use e.g. a different libz and get identical images.
>
>> Yes, it may break between versions by mistake, but I think reproducible builds are basic functionality for all image use cases.
>>> How do you handle things like different versions or builds of compression libraries creating different results? Do you guarantee to not add any new backwards compat changes by default, or change any default options? Do you guarantee that the files are read from "oci-dir" in the same order each time? It doesn't look like it.
>>
>> If you argue like that, why wouldn't mkcomposefs have the same issue, in that it may be broken by some bug?
>
> libcomposefs defines a normalized form for everything like file order, xattr orders, etc, and carefully normalizes everything such that we can guarantee these properties. It is possible that some detail was missed, because we're humans. But it was a very conscious and deliberate design choice that is deeply encoded in the code and format. For example, this is why we don't use compression but try to minimize size in other ways.

EROFS is reproducible since its dirents are all sorted by its on-disk definition, and its xattrs are sorted as well if images need to be reproducible. I don't see the difference between these two kinds of reproducible builds.

EROFS is designed for golden images: if you pass in the same set of configuration options to mkfs.erofs, it should produce the same output; otherwise those are real bugs and need to be fixed.

Compression algorithms can generate different outputs between versions, and generally compressed data is stable for most compression algorithms within a specific version, but that is another story. EROFS can live without compression.

>>>> But really, personally I think the issue above is different from loopback devices and may need to be resolved first. And if possible, I hope it could be a new overlayfs feature for everyone.
>>>
>>> Yeah.
>>> Independent of composefs, I think EROFS would be better if you could just point it to a chunk directory at mount time rather than having to route everything through a system-wide global cachefs singleton. I understand that cachefs does help with the on-demand download aspect, but when you don't need that it is just in the way.
>>
>> Just check your reply to Dave's review; it seems that the composefs dir on-disk format works much the same as EROFS's as well, see:
>>
>> https://docs.kernel.org/filesystems/erofs.html -- Directories
>>
>> a block vs a chunk = dirent + names
>>
>> cfs_dir_lookup -> erofs_namei + find_target_block_classic;
>> cfs_dir_lookup_in_chunk -> find_target_dirent.
>
> Yeah, the dirent layout looks very similar. I guess great minds think alike! My approach was simpler initially, but it kinda converged on this when I started optimizing the kernel lookup code with binary search.
>
>> Yes, great projects can be very similar to each other occasionally, not to mention open source projects ;)
>>
>> Anyway, I'm not opposed to Composefs if folks really like a new read-only filesystem for this. That is almost all I'd like to say about Composefs formally, have fun!

Because, anyway, open source projects can also fork, so (maybe) such is life. It seems rather like another, incomplete EROFS from several points of view. Also see:
https://lore.kernel.org/all/1b192a85-e1da-0925-ef26-178b93d0aa45@plexistor.com/T/#u

I will go on making EROFS better, as I initially promised the community.

Thanks,
Gao Xiang

>> Thanks,
>> Gao Xiang
>
> Cool, thanks for the feedback.
> It seems rather another an incomplete EROFS from several points > of view. Also see: > https://lore.kernel.org/all/1b192a85-e1da-0925-ef26-178b93d0aa45@plexistor.com/T/#u > Ironically, ZUFS is one of two new filesystems that were discussed in LSFMM19, where the community reactions rhyme with the reactions to composefs. The discussion on Incremental FS resembles the composefs case even more [1]. AFAIK, Android is still maintaining Incremental FS out-of-tree. Alexander and Giuseppe, I'd like to join Gao in saying that I think it is in the best interest of everyone, composefs developers and prospective users included, if the composefs requirements would drive improvements to existing kernel subsystems rather than adding a custom filesystem driver that partly duplicates other subsystems. Especially so, when the modifications to existing components (erofs and overlayfs) appear to be relatively minor and the maintainer of erofs is receptive to new features and happy to collaborate with you. w.r.t. overlayfs, I am not even sure that anything needs to be modified in the driver. overlayfs already supports the "metacopy" feature, which means that an upper layer could be composed in a way that the file content would be read from an arbitrary path in the lower fs, e.g. objects/cc/XXX. I gave a talk at LPC a few years back about overlayfs and container images [2]. The emphasis was that the overlayfs driver supports many new features, but userland tools for building advanced overlayfs images based on those new features are nowhere to be found. I may be wrong, but it looks to me like composefs could potentially fill this void, without having to modify the overlayfs driver at all, or maybe just a little bit. Please start a discussion with overlayfs developers about missing driver features if you have any. Overall, this sounds like a fun discussion to have at LSFMMBPF23 [3], so you are most welcome to submit a topic proposal for "opportunistically sharing verified image filesystem". Thanks, Amir. 
[1] https://lore.kernel.org/linux-fsdevel/CAK8JDrGRzA+yphpuX+GQ0syRwF_p2Fora+roGCnYqB5E1eOmXA@mail.gmail.com/ [2] https://lpc.events/event/7/contributions/639/attachments/501/969/Overlayfs-containers-lpc-2020.pdf [3] https://lore.kernel.org/linux-fsdevel/Y7hDVliKq+PzY1yY@localhost.localdomain/
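The metacopy behaviour described above — an upper layer carrying only metadata, with file content redirected to an arbitrary lower path such as objects/cc/XXX — can be modelled in a few lines. This is a toy sketch of the concept only; the dict-based "layers" and field names are illustrative assumptions, not the overlayfs on-disk representation:

```python
# Toy model of overlayfs metacopy: the upper layer stores only file
# metadata plus a redirect to where the data really lives in the lower
# layer (e.g. a shared content-addressed objects/ directory).
upper = {
    "/file_a": {"mode": 0o644, "redirect": "objects/cc/3da5b149"},
}
lower = {
    "objects/cc/3da5b149": b"content_a",
}

def read(path):
    meta = upper[path]
    # metadata (mode, ownership, xattrs) comes from the upper layer;
    # file data comes from the redirect target in the lower layer
    return meta["mode"], lower[meta["redirect"]]

print(read("/file_a"))
```

Because many upper entries can redirect to the same lower object, data is shared across images while each image keeps its own metadata, which is the property composefs also aims for.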
On Tue, Jan 17, 2023 at 09:05:53AM +0200, Amir Goldstein wrote: > > It seems rather another an incomplete EROFS from several points > > of view. Also see: > > https://lore.kernel.org/all/1b192a85-e1da-0925-ef26-178b93d0aa45@plexistor.com/T/#u > > > > Ironically, ZUFS is one of two new filesystems that were discussed in LSFMM19, > where the community reactions rhyme with the reactions to composefs. > The discussion on Incremental FS resembles composefs case even more [1]. > AFAIK, Android is still maintaining Incremental FS out-of-tree. > > Alexander and Giuseppe, > > I'd like to join Gao is saying that I think it is in the best interest > of everyone, > composefs developers and prospect users included, > if the composefs requirements would drive improvement to existing > kernel subsystems rather than adding a custom filesystem driver > that partly duplicates other subsystems. > > Especially so, when the modifications to existing components > (erofs and overlayfs) appear to be relatively minor and the maintainer > of erofs is receptive to new features and happy to collaborate with you. > > w.r.t overlayfs, I am not even sure that anything needs to be modified > in the driver. > overlayfs already supports "metacopy" feature which means that an upper layer > could be composed in a way that the file content would be read from an arbitrary > path in lower fs, e.g. objects/cc/XXX. > > I gave a talk on LPC a few years back about overlayfs and container images [2]. > The emphasis was that overlayfs driver supports many new features, but userland > tools for building advanced overlayfs images based on those new features are > nowhere to be found. > > I may be wrong, but it looks to me like composefs could potentially > fill this void, > without having to modify the overlayfs driver at all, or maybe just a > little bit. > Please start a discussion with overlayfs developers about missing driver > features if you have any. 
Surprising that I and others weren't Cced on this given that we had a meeting with the main developers and a few others where we had said the same thing. I hadn't followed this. We have at least 58 filesystems currently in the kernel (and that's a conservative count just based on going by obvious directories and ignoring most virtual filesystems). A non-insignificant portion is probably slowly rotting away with few fixes coming in, with few users, and not much attention being paid to syzkaller reports for them if they show up. I haven't quantified this of course. Taking a new filesystem into the kernel in the worst case means that it's dumped there once and will slowly become unmaintained. Then we'll have a few users for the next 20 years and we can't reasonably deprecate it (Maybe that's another good topic: how should we fade out filesystems?). Of course, for most fs developers it probably doesn't matter how many other filesystems there are in the kernel (aside from maybe competing for the same users). But for developers who touch the vfs, every new filesystem may increase the cost of maintaining and reworking existing functionality, or adding new functionality, making it more likely to accumulate hacks, add workarounds, or be flat-out unable to kill off infrastructure that should reasonably go away. Maybe this is an unfair complaint, but just from experience a new filesystem potentially means one or two weeks more to make a larger vfs change. I want to stress that I'm not at all saying "no more new fs", but we should be hesitant before we merge new filesystems into the kernel. Especially filesystems that are tailored to special use-cases. Every few years another filesystem tailored to container use-cases shows up. And frankly, a good portion of the issues that they are trying to solve are caused by design choices in userspace. 
And I have to say I'm especially NAK-friendly about anything that comes even close to yet another stacking filesystem or anything that layers on top of a lower filesystem/mount, such as ecryptfs, ksmbd, and overlayfs. They are hard to get right, with lots of corner cases, and they cause the most headaches when making vfs changes.
Hi Amir and Christian, On 2023/1/17 18:12, Christian Brauner wrote: > On Tue, Jan 17, 2023 at 09:05:53AM +0200, Amir Goldstein wrote: >>> It seems rather another an incomplete EROFS from several points >>> of view. Also see: >>> https://lore.kernel.org/all/1b192a85-e1da-0925-ef26-178b93d0aa45@plexistor.com/T/#u >>> >> >> Ironically, ZUFS is one of two new filesystems that were discussed in LSFMM19, >> where the community reactions rhyme with the reactions to composefs. >> The discussion on Incremental FS resembles composefs case even more [1]. >> AFAIK, Android is still maintaining Incremental FS out-of-tree. >> >> Alexander and Giuseppe, >> >> I'd like to join Gao is saying that I think it is in the best interest >> of everyone, >> composefs developers and prospect users included, >> if the composefs requirements would drive improvement to existing >> kernel subsystems rather than adding a custom filesystem driver >> that partly duplicates other subsystems. >> >> Especially so, when the modifications to existing components >> (erofs and overlayfs) appear to be relatively minor and the maintainer >> of erofs is receptive to new features and happy to collaborate with you. >> >> w.r.t overlayfs, I am not even sure that anything needs to be modified >> in the driver. >> overlayfs already supports "metacopy" feature which means that an upper layer >> could be composed in a way that the file content would be read from an arbitrary >> path in lower fs, e.g. objects/cc/XXX. >> >> I gave a talk on LPC a few years back about overlayfs and container images [2]. >> The emphasis was that overlayfs driver supports many new features, but userland >> tools for building advanced overlayfs images based on those new features are >> nowhere to be found. >> >> I may be wrong, but it looks to me like composefs could potentially >> fill this void, >> without having to modify the overlayfs driver at all, or maybe just a >> little bit. 
>> Please start a discussion with overlayfs developers about missing driver >> features if you have any. > ... > > I want to stress that I'm not at all saying "no more new fs" but we > should be hesitant before we merge new filesystems into the kernel. > > Especially for filesystems that are tailored to special use-cases. > Every few years another filesystem tailored to container use-cases shows > up. And frankly, a good portion of the issues that they are trying to > solve are caused by design choices in userspace. > > And I have to say I'm especially NAK-friendly about anything that comes > even close to yet another stacking filesystems or anything that layers > on top of a lower filesystem/mount such as ecryptfs, ksmbd, and > overlayfs. They are hard to get right, with lots of corner cases and > they cause the most headaches when making vfs changes. That is also my original (small) request, if such an overlay model is correct... In principle, it's not hard for EROFS, since EROFS already has a symlink on-disk layout; the difference is just applying it to all regular files (even without on-disk changes, though maybe we need to optimize it if there are other special requirements for specific use cases like ostree), which would make EROFS work in a stackable way... That is honestly not hard (and on-disk compatible)... But I'm not sure whether it's wise for EROFS to support this now without a proper overlay model settled by careful discussion. So if there could be some discussion of this overlay model at LSF/MM/BPF, I'd like to attend (thanks!) And I support doing it in overlayfs (if possible), but it seems EROFS could do it as well, as long as there are enough constraints to conclude on... Thanks, Gao Xiang >
Christian Brauner <brauner@kernel.org> writes: > On Tue, Jan 17, 2023 at 09:05:53AM +0200, Amir Goldstein wrote: >> > It seems rather another an incomplete EROFS from several points >> > of view. Also see: >> > https://lore.kernel.org/all/1b192a85-e1da-0925-ef26-178b93d0aa45@plexistor.com/T/#u >> > >> >> Ironically, ZUFS is one of two new filesystems that were discussed in LSFMM19, >> where the community reactions rhyme with the reactions to composefs. >> The discussion on Incremental FS resembles composefs case even more [1]. >> AFAIK, Android is still maintaining Incremental FS out-of-tree. >> >> Alexander and Giuseppe, >> >> I'd like to join Gao is saying that I think it is in the best interest >> of everyone, >> composefs developers and prospect users included, >> if the composefs requirements would drive improvement to existing >> kernel subsystems rather than adding a custom filesystem driver >> that partly duplicates other subsystems. >> >> Especially so, when the modifications to existing components >> (erofs and overlayfs) appear to be relatively minor and the maintainer >> of erofs is receptive to new features and happy to collaborate with you. >> >> w.r.t overlayfs, I am not even sure that anything needs to be modified >> in the driver. >> overlayfs already supports "metacopy" feature which means that an upper layer >> could be composed in a way that the file content would be read from an arbitrary >> path in lower fs, e.g. objects/cc/XXX. >> >> I gave a talk on LPC a few years back about overlayfs and container images [2]. >> The emphasis was that overlayfs driver supports many new features, but userland >> tools for building advanced overlayfs images based on those new features are >> nowhere to be found. >> >> I may be wrong, but it looks to me like composefs could potentially >> fill this void, >> without having to modify the overlayfs driver at all, or maybe just a >> little bit. 
>> Please start a discussion with overlayfs developers about missing driver >> features if you have any. > > Surprising that I and others weren't Cced on this given that we had a > meeting with the main developers and a few others where we had said the > same thing. I hadn't followed this. well that wasn't done on purpose, sorry for that. After our meeting, I thought it was clear that we have different needs for our use cases and that we were going to submit composefs upstream, as we did, to gather some feedback from the wider community. Of course we looked at overlay before we decided to upstream composefs. Some of the use cases we have in mind are not easily doable, some others are not possible at all. metacopy is a good starting point, but from user space it works quite differently than what we can do with composefs. Let's assume we have a git-like repository with a bunch of files stored by their checksum and that they can be shared among different containers. Using the overlayfs model: 1) We need to create the final image layout, either using reflinks or hardlinks: - reflinks: we can reflect a correct st_nlink value for the inode but we lose page cache sharing. - hardlinks: they make st_nlink bogus. Another problem is that overlay expects the lower layer to never change, and now st_nlink can change for files in other lower layers. These operations have a cost. Even if all the files are already available locally, we still need at least one operation per file to create it, and more than one if we start tweaking the inode metadata. 2) no multi repo support: Both reflinks and hardlinks do not work across mount points, so we cannot have images that span multiple file systems; one common use case is to have a network file system to share some images/files and be able to use files from there when they are available. 
At the moment we deduplicate entire layers, and with overlay we can do something like the following without problems: # mount overlay -t overlay -olowerdir=/first/disk/layer1:/second/disk/layer2 but this won't work with the file granularity we are looking at. So in this case we need to do a full copy of the files that are not on the same file system. 3) no support for fs-verity. I have no idea how overlay could ever support it; it doesn't fit there. If we want this feature we need to look at another RO file system. We looked at EROFS since it is already upstream but it is quite different than what we are doing as Alex already pointed out. Sure we could bloat EROFS and add all the new features there, after all composefs is quite simple, but I don't see how this is any cleaner than having a simple file system that does just one thing. On top of what was already said: I wish at some point we can do all of this from a user namespace. That is the main reason for having an easy on-disk format for composefs. This seems much more difficult to achieve with EROFS given its complexity. > We have at least 58 filesystems currently in the kernel (and that's a > conservative count just based on going by obvious directories and > ignoring most virtual filesystems). > > A non-insignificant portion is probably slowly rotting away with little > fixes coming in, with few users, and not much attention is being paid to > syzkaller reports for them if they show up. I haven't quantified this of > course. > > Taking in a new filesystems into kernel in the worst case means that > it's being dumped there once and will slowly become unmaintained. Then > we'll have a few users for the next 20 years and we can't reasonably > deprecate it (Maybe that's another good topic: How should we fade out > filesystems.). > > Of course, for most fs developers it probably doesn't matter how many > other filesystems there are in the kernel (aside from maybe competing > for the same users). 
> > But for developers who touch the vfs every new filesystems may increase > the cost of maintaining and reworking existing functionality, or adding > new functionality. Making it more likely to accumulate hacks, adding > workarounds, or flatout being unable to kill off infrastructure that > should reasonably go away. Maybe this is an unfair complaint but just > from experience a new filesystem potentially means one or two weeks to > make a larger vfs change. > > I want to stress that I'm not at all saying "no more new fs" but we > should be hesitant before we merge new filesystems into the kernel. > > Especially for filesystems that are tailored to special use-cases. > Every few years another filesystem tailored to container use-cases shows > up. And frankly, a good portion of the issues that they are trying to > solve are caused by design choices in userspace. Having a way to deprecate file systems seems like a good idea in general, and IMHO makes more sense than blocking new components that can be useful to some users. We are aware the bar for a new file system is high, and we were expecting criticism and push back, but so far it doesn't seem there is another way to achieve what we are trying to do. > And I have to say I'm especially NAK-friendly about anything that comes > even close to yet another stacking filesystems or anything that layers > on top of a lower filesystem/mount such as ecryptfs, ksmbd, and > overlayfs. They are hard to get right, with lots of corner cases and > they cause the most headaches when making vfs changes.
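The per-file cost argument above follows from the model itself: with a git-like store keyed by checksum, identical content is stored once, but materializing an image view still takes at least one link/reflink operation per manifest entry. A minimal sketch of such a content-addressed store (the class and method names are hypothetical, purely for illustration):

```python
import hashlib

class ObjectStore:
    """Toy content-addressed store: objects keyed by content digest,
    so identical content across images is stored exactly once."""
    def __init__(self):
        self.objects = {}

    def add(self, data):
        digest = hashlib.sha256(data).hexdigest()
        self.objects[digest] = data  # idempotent if already present
        return digest

store = ObjectStore()
a = store.add(b"content_a")
b = store.add(b"content_a")  # same content -> same object, no new storage
assert a == b and len(store.objects) == 1

# Materializing an image view still costs one operation per file:
manifest = {"/file_a": a, "/file_b": store.add(b"content_b")}
ops = len(manifest)  # one link/reflink/copy per entry, however cheap each is
print(ops)  # -> 2
```

The storage dedup is free, but setup and teardown of the per-image view scale with file count, which is the overhead composefs tries to avoid by mounting a manifest directly.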
On 2023/1/17 21:56, Giuseppe Scrivano wrote: > Christian Brauner <brauner@kernel.org> writes: > ... > > We looked at EROFS since it is already upstream but it is quite > different than what we are doing as Alex already pointed out. > Sigh.. please kindly help me find out what the difference is if EROFS uses some symlink layout for each regular inode? Some question for me to ask about this new overlay permission model once again: What's the difference between a symlink (maybe with some limitations) and this new overlay model? I'm not sure why symlink permission bits are ignored (AFAIK)? I haven't thought it through much further, since I'm not an experienced one in the unionfs field, but if possible I'm quite happy to learn new stuff as a newbie filesystem developer to gain more knowledge if it could be some topic at LSF/MM/BPF 2023. > Sure we could bloat EROFS and add all the new features there, after all > composefs is quite simple, but I don't see how this is any cleaner than > having a simple file system that does just one thing. Also, if I have time, I could do a code-truncated EROFS without any useless features specifically for ostree use cases. Or I could just separate out all of the code useless for ostree-specific use cases by using Kconfig. If you don't want to use EROFS for whatever reason, I'm not opposed to it (you also could use another in-kernel local filesystem for this as well). Except for this new overlay model, I just tried to say how similarly it works to EROFS. > > On top of what was already said: I wish at some point we can do all of > this from a user namespace. That is the main reason for having an easy > on-disk format for composefs. This seems much more difficult to achieve > with EROFS given its complexity. Why? [ Gao Xiang: this time I will try my best to stop talking about EROFS under the Composefs patchset, because I'd like to avoid appearing here in the first place (unless such a permission model is never discussed until now)... 
In any case, the cover letter never mentioned EROFS at all. ] Thanks, Gao Xiang
On Tue, Jan 17, 2023 at 02:56:56PM +0100, Giuseppe Scrivano wrote: > Christian Brauner <brauner@kernel.org> writes: > > > On Tue, Jan 17, 2023 at 09:05:53AM +0200, Amir Goldstein wrote: > >> > It seems rather another an incomplete EROFS from several points > >> > of view. Also see: > >> > https://lore.kernel.org/all/1b192a85-e1da-0925-ef26-178b93d0aa45@plexistor.com/T/#u > >> > > >> > >> Ironically, ZUFS is one of two new filesystems that were discussed in LSFMM19, > >> where the community reactions rhyme with the reactions to composefs. > >> The discussion on Incremental FS resembles composefs case even more [1]. > >> AFAIK, Android is still maintaining Incremental FS out-of-tree. > >> > >> Alexander and Giuseppe, > >> > >> I'd like to join Gao is saying that I think it is in the best interest > >> of everyone, > >> composefs developers and prospect users included, > >> if the composefs requirements would drive improvement to existing > >> kernel subsystems rather than adding a custom filesystem driver > >> that partly duplicates other subsystems. > >> > >> Especially so, when the modifications to existing components > >> (erofs and overlayfs) appear to be relatively minor and the maintainer > >> of erofs is receptive to new features and happy to collaborate with you. > >> > >> w.r.t overlayfs, I am not even sure that anything needs to be modified > >> in the driver. > >> overlayfs already supports "metacopy" feature which means that an upper layer > >> could be composed in a way that the file content would be read from an arbitrary > >> path in lower fs, e.g. objects/cc/XXX. > >> > >> I gave a talk on LPC a few years back about overlayfs and container images [2]. > >> The emphasis was that overlayfs driver supports many new features, but userland > >> tools for building advanced overlayfs images based on those new features are > >> nowhere to be found. 
> >> > >> I may be wrong, but it looks to me like composefs could potentially > >> fill this void, > >> without having to modify the overlayfs driver at all, or maybe just a > >> little bit. > >> Please start a discussion with overlayfs developers about missing driver > >> features if you have any. > > > > Surprising that I and others weren't Cced on this given that we had a > > meeting with the main developers and a few others where we had said the > > same thing. I hadn't followed this. > > well that wasn't done on purpose, sorry for that. I understand. I was just surprised given that I very much work on the vfs on a day to day basis. > > After our meeting, I thought it was clear that we have different needs > for our use cases and that we were going to submit composefs upstream, > as we did, to gather some feedbacks from the wider community. > > Of course we looked at overlay before we decided to upstream composefs. > > Some of the use cases we have in mind are not easily doable, some others > are not possible at all. metacopy is a good starting point, but from > user space it works quite differently than what we can do with > composefs. > > Let's assume we have a git like repository with a bunch of files stored > by their checksum and that they can be shared among different containers. > > Using the overlayfs model: > > 1) We need to create the final image layout, either using reflinks or > hardlinks: > > - reflinks: we can reflect a correct st_nlink value for the inode but we > lose page cache sharing. > > - hardlinks: make the st_nlink bogus. Another problem is that overlay > expects the lower layer to never change and now st_nlink can change > for files in other lower layers. > > These operations have a cost. Even if we all the files are already > available locally, we still need at least one operation per file to > create it, and more than one if we start tweaking the inode metadata. 
Which you now encode in a manifest file that changes properties on a per-file basis without any vfs involvement, which makes me pretty uneasy. If you combine overlayfs with idmapped mounts you can already change ownership on a fairly granular basis. If you need additional per-file ownership, use overlayfs, which gives you the ability to change file attributes on a per-file, per-container basis. > > 2) no multi repo support: > > Both reflinks and hardlinks do not work across mount points, so we Just fwiw, afaict reflinks work across mount points since at least 5.18. > cannot have images that span multiple file systems; one common use case > is to have a network file system to share some images/files and be able > to use files from there when they are available. > > At the moment we deduplicate entire layers, and with overlay we can do > something like the following without problems: > > # mount overlay -t overlay -olowerdir=/first/disk/layer1:/second/disk/layer2 > > but this won't work with the files granularity we are looking at. So in > this case we need to do a full copy of the files that are not on the > same file system. > > 3) no support for fs-verity. No idea how overlay could ever support it, > it doesn't fit there. If we want this feature we need to look at > another RO file system. > > We looked at EROFS since it is already upstream but it is quite > different than what we are doing as Alex already pointed out. > > Sure we could bloat EROFS and add all the new features there, after all > composefs is quite simple, but I don't see how this is any cleaner than > having a simple file system that does just one thing. > > On top of what was already said: I wish at some point we can do all of > this from a user namespace. That is the main reason for having an easy > on-disk format for composefs. This seems much more difficult to achieve I'm pretty skeptical of this plan; I'm not sure we should add more filesystems that are mountable by unprivileged users. 
FUSE and Overlayfs are adventurous enough and they don't have their own on-disk format. The track record of bugs exploitable due to userns isn't making this very attractive.
On Tue, Jan 17, 2023 at 04:27:56PM +0100, Christian Brauner wrote: > On Tue, Jan 17, 2023 at 02:56:56PM +0100, Giuseppe Scrivano wrote: > > Christian Brauner <brauner@kernel.org> writes: > > 2) no multi repo support: > > > > Both reflinks and hardlinks do not work across mount points, so we > > Just fwiw, afaict reflinks work across mount points since at least 5.18. They might work for NFS server *file clones* across different exports within the same NFS server (or server cluster), but they most certainly don't work across mountpoints for local filesystems, or across different types of filesystems. I'm not here to advocate that composefs is the right solution, I'm just pointing out that the proposed alternatives do not, in any way, have the same critical behavioural characteristics as composefs provides container orchestration systems and hence do not solve the problems that composefs is attempting to solve. In short: any solution that requires userspace to create a new filesystem hierarchy one file at a time via standard syscall mechanisms is not going to perform acceptably at scale - that's a major problem that composefs addresses. The whole problem with file copying to create images - even with reflink or hardlinks avoiding data copying - is the overhead of creating and destroying those copies in the first place. A reflink copy of tens of thousands of files in a complex directory structure is not free - each individual reflink has a time, CPU, memory and IO cost to it. The teardown cost is similar - the only way to remove the "container image" built with reflinks is "rm -rf", and that has significant time, CPU, memory and IO costs associated with it as well. Further, you can't ship container images to remote hosts using reflink copies - they can only be created at runtime on the host that the container will be instantiated on. IOWs, the entire cost of reflink copies for container instances must be taken at container instantiation and destruction time. 
When you have container instances that might only be needed for a few seconds, taking half a minute to set up the container instance and then another half a minute to tear it down just isn't viable - we need instantiation and teardown times in the order of a second or two. From my reading of the code, composefs is based around the concept of a verifiable "shipping manifest", where the filesystem namespace presented to users by the kernel is derived from the manifest rather than from some other filesystem namespace. Overlay, reflinks, etc. all use some other filesystem namespace to generate the container namespace that links to the common data, whilst composefs uses the manifest for that. The use of a manifest file means there is almost zero container setup overhead - ship the manifest file, mount it, all done - and zero teardown overhead as unmounting the filesystem is all that is needed to remove all traces of the container instance from the system. In having a custom manifest format, the manifest can easily contain verification information alongside the pointer to the content the namespace should expose. i.e. the manifest references a secure content addressed repository that is protected by fsverity and contains the fsverity digests itself. Hence it doesn't rely on the repository to self-verify, it actually ensures that the repository files actually contain the data the manifest expects them to contain. Hence if the composefs kernel module is provided with a mechanism for validating the chain of trust for the manifest file that a user is trying to mount, then we just don't care who the mounting user is. This architecture is a viable path to rootless mounting of pre-built third party container images. Also, with the host's content addressed repository being managed separately by the trusted host and distro package management, the manifest need not be unique to a single container host. 
The distro can build manifests so that containers are running known, signed and verified container images built by the distro. The container orchestration software or admin could also build manifests on demand and sign them. If the manifest is not signed, not signed with a key loaded into the kernel keyring, or does not pass verification, then we simply fall back to root-in-the-init-ns permissions being required to mount the manifest. This fallback is exactly the same security model we have for every other type of filesystem image that the linux kernel can mount - we trust root not to be mounting malicious images. Essentially, I don't think any of the filesystems in the linux kernel currently provide a viable solution to the problem that composefs is trying to solve. We need a different way of solving the ephemeral container namespace creation and destruction overhead problem. Composefs provides a mechanism that not only solves this problem and potentially several others, whilst also being easy to retrofit into existing production container stacks. As such, I think composefs is definitely worth further time and investment as a unique line of filesystem development for Linux. Solve the chain of trust problem (i.e. crypto signing for the manifest files) and we potentially have game changing container infrastructure in a couple of thousand lines of code... Cheers, Dave.
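The "verifiable shipping manifest" sketched above — a manifest that records, for each path, both the backing object and the digest its content must have, so the repository cannot merely self-certify — can be modelled roughly as follows (the digest scheme and field layout are illustrative assumptions, not the composefs on-disk format):

```python
import hashlib

# Toy manifest: path -> (backing object name, expected content digest).
# The manifest itself would be the thing that gets signed and verified.
manifest = {
    "/file_a": ("objects/cc/3da5b149",
                hashlib.sha256(b"content_a").hexdigest()),
}
repository = {"objects/cc/3da5b149": b"content_a"}

def verified_read(path):
    obj, expected = manifest[path]
    data = repository[obj]
    # check that the backing file actually holds what the manifest
    # claims, analogous to checking an fs-verity digest on read
    if hashlib.sha256(data).hexdigest() != expected:
        raise ValueError("verification failed for %s" % path)
    return data

print(verified_read("/file_a"))
```

If the repository object is tampered with, the digest comparison fails and the read is refused, which is why the manifest does not need to trust the repository to verify itself.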
On 2023/1/18 08:22, Dave Chinner wrote: > On Tue, Jan 17, 2023 at 04:27:56PM +0100, Christian Brauner wrote: >> On Tue, Jan 17, 2023 at 02:56:56PM +0100, Giuseppe Scrivano wrote: >>> Christian Brauner <brauner@kernel.org> writes: >>> 2) no multi repo support: >>> >>> Both reflinks and hardlinks do not work across mount points, so we >> >> Just fwiw, afaict reflinks work across mount points since at least 5.18. > ... > > As such, I think composefs is definitely worth further time and > investment as a unique line of filesystem development for Linux. > Solve the chain of trust problem (i.e. crypto signing for the > manifest files) and we potentially have game changing container > infrastructure in a couple of thousand lines of code... I think this is the last time I write some words in this v2 patchset. At a quick glance of the current v2 patchset: 1) struct cfs_buf { -> struct erofs_buf; 2) cfs_buf_put -> erofs_put_metabuf; 3) cfs_get_buf -> erofs_bread -> (but erofs_read_metabuf() in v5.17 is much closer); https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/fs/erofs/data.c?h=linux-5.17.y 4) cfs_dentry_s -> erofs_dirent; ... Also it drops EROFS __lexx and uses buggy uxx instead. It drops the iomap/fscache interface in favor of a stackable file interface, it doesn't have ACL support, and I don't have time to look into more. That is my current point of view of Composefs. Yes, you could use/fork any code in open-source projects, but it currently seems like an immature EROFS-truncated copy and its cover letter never mentioned EROFS at all. I'd suggest you guys refactor the similar code (if you claim that is not another EROFS) before it really gets upstreamed, otherwise I would feel uneasy as well. Apart from that, again I have no objection if folks would like a new read-only stackable filesystem like this. 
Apart from the codebase, I do hope there could be some discussion of this topic at LSF/MM/BPF 2023, as Amir suggested, because I don't think this overlay model is really safe without fs-verity enforcement.

Thanks all for your time. I'm done.

Thanks,
Gao Xiang

>
> Cheers,
>
> Dave.
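To make the safety concern concrete: in a content-addressed store, each backing file is named by the digest of its content, so a digest check at read time catches a backing object that was modified after the image was built. The following is a toy Python sketch of that idea, not how fs-verity actually works (fs-verity verifies Merkle-tree blocks incrementally rather than hashing the whole file on every read), and the dict-based "object store" is a stand-in for the objects/ directory:

```python
import hashlib

# Toy model: backing objects keyed by the SHA-256 hex digest of their
# content, mimicking an objects/xx/yyyy... store. Verifying the digest
# on read detects offline tampering with a backing object; skipping the
# check means tampered data is served silently.

def read_verified(objects, digest):
    data = objects[digest]  # stand-in for opening objects/xx/yyyy...
    if hashlib.sha256(data).hexdigest() != digest:
        raise IOError("backing object corrupted or tampered with")
    return data

content = b"content_a\n"
digest = hashlib.sha256(content).hexdigest()
objects = {digest: content}

assert read_verified(objects, digest) == content  # clean read passes

objects[digest] = b"evil\n"  # simulate offline tampering
try:
    read_verified(objects, digest)
except IOError:
    pass  # without this check, the tampered data would be returned
```

This is why the overlay-of-content-addressed-objects model is only as trustworthy as the read-time verification backing it.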
On Tue, 2023-01-17 at 11:12 +0100, Christian Brauner wrote:
> On Tue, Jan 17, 2023 at 09:05:53AM +0200, Amir Goldstein wrote:
> > > It seems rather another incomplete EROFS from several points
> > > of view. Also see:
> > > https://lore.kernel.org/all/1b192a85-e1da-0925-ef26-178b93d0aa45@plexistor.com/T/#u
> >
> > Ironically, ZUFS is one of two new filesystems that were discussed
> > at LSFMM19, where the community reactions rhyme with the reactions
> > to composefs. The discussion on Incremental FS resembles the
> > composefs case even more [1]. AFAIK, Android is still maintaining
> > Incremental FS out-of-tree.
> >
> > Alexander and Giuseppe,
> >
> > I'd like to join Gao in saying that I think it is in the best
> > interest of everyone, composefs developers and prospective users
> > included, if the composefs requirements would drive improvements to
> > existing kernel subsystems rather than adding a custom filesystem
> > driver that partly duplicates other subsystems.
> >
> > Especially so, when the modifications to existing components
> > (erofs and overlayfs) appear to be relatively minor and the
> > maintainer of erofs is receptive to new features and happy to
> > collaborate with you.
> >
> > w.r.t. overlayfs, I am not even sure that anything needs to be
> > modified in the driver. overlayfs already supports the "metacopy"
> > feature, which means that an upper layer could be composed in a way
> > that the file content would be read from an arbitrary path in the
> > lower fs, e.g. objects/cc/XXX.
> >
> > I gave a talk at LPC a few years back about overlayfs and container
> > images [2]. The emphasis was that the overlayfs driver supports
> > many new features, but userland tools for building advanced
> > overlayfs images based on those new features are nowhere to be
> > found.
> >
> > I may be wrong, but it looks to me like composefs could potentially
> > fill this void, without having to modify the overlayfs driver at
> > all, or maybe just a little bit. Please start a discussion with
> > overlayfs developers about missing driver features if you have any.
>
> Surprising that I and others weren't Cced on this given that we had a
> meeting with the main developers and a few others where we had said
> the same thing.

I hadn't followed this. Sorry about that, I'm just not very used to the kernel submission process. I'll CC you on the next version.

> We have at least 58 filesystems currently in the kernel (and that's a
> conservative count just based on going by obvious directories and
> ignoring most virtual filesystems).
>
> A non-insignificant portion is probably slowly rotting away with few
> fixes coming in, with few users, and not much attention being paid to
> syzkaller reports for them if they show up. I haven't quantified
> this, of course.
>
> Taking a new filesystem into the kernel in the worst case means that
> it's dumped there once and slowly becomes unmaintained. Then we'll
> have a few users for the next 20 years and we can't reasonably
> deprecate it (maybe that's another good topic: how should we fade out
> filesystems?).
>
> Of course, for most fs developers it probably doesn't matter how many
> other filesystems there are in the kernel (aside from maybe competing
> for the same users).
>
> But for developers who touch the vfs, every new filesystem may
> increase the cost of maintaining and reworking existing
> functionality, or adding new functionality. That makes it more likely
> to accumulate hacks, add workarounds, or be flat-out unable to kill
> off infrastructure that should reasonably go away. Maybe this is an
> unfair complaint, but just from experience a new filesystem
> potentially means one or two extra weeks for a larger vfs change.
>
> I want to stress that I'm not at all saying "no more new fs" but we
> should be hesitant before we merge new filesystems into the kernel.

Well, it sure reads as "no more new fs" to me. But I understand that there is hesitation towards this. The new version will be even simpler (based on feedback from Dave), weighing in at < 2000 lines. Hopefully this will make it easier to review and maintain, somewhat countering the cost of yet another filesystem.

> Especially for filesystems that are tailored to special use-cases.
> Every few years another filesystem tailored to container use-cases
> shows up. And frankly, a good portion of the issues that they are
> trying to solve are caused by design choices in userspace.

Well, we have at least two use cases, but sure, it is not a general-purpose filesystem.

> And I have to say I'm especially NAK-friendly about anything that
> comes even close to yet another stacking filesystem or anything that
> layers on top of a lower filesystem/mount such as ecryptfs, ksmbd,
> and overlayfs. They are hard to get right, with lots of corner cases,
> and they cause the most headaches when making vfs changes.

I can't disagree here, because I'm not a vfs maintainer, but I will say that composefs is fundamentally much simpler than these examples: first because it is completely read-only, and secondly because it doesn't rely on the lower filesystem for anything but file content (i.e. lower fs metadata or directory structure doesn't affect the upper fs).
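That last split - metadata from the image, content from the lower filesystem - can be illustrated with a toy model. This is a deliberately simplified sketch (dicts standing in for the rootfs.img manifest and the objects/ basedir; `cfs_open` is an invented name, not a real composefs API), showing that directory structure never touches the lower fs and the lower fs is only ever consulted by content digest:

```python
# Toy model of the composefs split described above: the read-only
# manifest owns all names and metadata; the lower fs ("basedir") is
# only used to fetch content by digest. Digests are truncated and
# invented for illustration.

manifest = {
    # path -> digest of the backing object (from rootfs.img)
    "file_a": "cc3da5b1",
    "file_b": "02927862",
}

basedir = {
    # digest -> content (the objects/ directory on the lower fs)
    "cc3da5b1": b"content_a\n",
    "02927862": b"content_b\n",
}

def cfs_open(path):
    digest = manifest[path]  # metadata lookup: image only
    return basedir[digest]   # content lookup: lower fs only

# Listing the directory never consults the lower fs at all:
assert sorted(manifest) == ["file_a", "file_b"]
assert cfs_open("file_a") == b"content_a\n"
```

Because identical content hashes to the same digest, any number of manifests can point into one shared basedir, which is where the on-disk and page-cache sharing between images comes from.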