Message ID: 153313703562.13253.5766498657900728120.stgit@warthog.procyon.org.uk
Series: VFS: Introduce filesystem context [ver #11]
There is a serious problem with mount options today that fsopen does not
address.  The problem is that mount options are ignored for block based
filesystems, and any other type of filesystem that follows the same
pattern.

The script below demonstrates this bug, showing that the ext4 "acl",
"quota" and "user_xattr" options can be silently ignored.

fsopen has my nack until it addresses this issue.

I don't know if we can fix this in the context of sys_mount.  But if we
are redoing the option parsing of how we mount filesystems, this needs
to be fixed before we start worrying about bug compatibility.

Hopefully this report is simple and clear enough that we can at least
agree on the problem.

Eric

# cat ~/bin/bdev-loop0.sh
#!/bin/sh
set -x
set -e

LOOP=loop0
dd if=/dev/zero bs=1024 count=1048576 of=$LOOP-file
losetup /dev/$LOOP $LOOP-file
mkfs.ext4 /dev/$LOOP
mkdir $LOOP-noacl-noquota-nouser_xattr
mount -t ext4 /dev/$LOOP -o "noacl,noquota,nouser_xattr" $LOOP-noacl-noquota-nouser_xattr
mkdir $LOOP-acl-quota-user_xattr
mount -t ext4 /dev/$LOOP -o "acl,quota,user_xattr" $LOOP-acl-quota-user_xattr
cat /proc/mounts | grep loop0

root@finagle:~# ~/bin/bdev-loop0.sh
+ set -e
+ LOOP=loop0
+ dd if=/dev/zero bs=1024 count=1048576 of=loop0-file
1048576+0 records in
1048576+0 records out
1073741824 bytes (1.1 GB) copied, 4.37645 s, 245 MB/s
+ losetup /dev/loop0 loop0-file
+ mkfs.ext4 /dev/loop0
mke2fs 1.41.12 (17-May-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
65536 inodes, 262144 blocks
13107 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=268435456
8 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
	32768, 98304, 163840, 229376

Writing inode tables: done
Creating journal (8192 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 29 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
+ mkdir loop0-noacl-noquota-nouser_xattr
+ mount -t ext4 /dev/loop0 -o noacl,noquota,nouser_xattr loop0-noacl-noquota-nouser_xattr
+ mkdir loop0-acl-quota-user_xattr
+ mount -t ext4 /dev/loop0 -o acl,quota,user_xattr loop0-acl-quota-user_xattr
+ cat /proc/mounts
+ grep loop0
/dev/loop0 /root/loop0-noacl-noquota-nouser_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
/dev/loop0 /root/loop0-acl-quota-user_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
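[For readers following along: the mechanism behind the behaviour Eric
demonstrates is that mount_bdev() looks up an existing superblock via
sget() with a test function that matches on the block device alone, so
a second mount of the same bdev reuses the existing superblock and the
freshly parsed options are simply dropped.  The userspace sketch below
models that lookup; the structure and function names are simplified
stand-ins, not the actual kernel code.]

```c
#include <stddef.h>
#include <string.h>

/* Toy model of the superblock table, keyed the way mount_bdev()'s
 * sget() test function keys it: by backing device only. */
struct super_block {
	const char *dev;      /* backing block device */
	const char *options;  /* options the fs was *first* mounted with */
};

static struct super_block table[16];
static size_t nr_supers;

/* Analogue of sget(): return the existing superblock for this device
 * if there is one, otherwise create a new one with the given options.
 * Note that 'options' plays no part in the lookup - which is exactly
 * the bug being reported: on reuse the new options are ignored. */
struct super_block *model_sget(const char *dev, const char *options)
{
	for (size_t i = 0; i < nr_supers; i++)
		if (strcmp(table[i].dev, dev) == 0)
			return &table[i];      /* reuse: new options dropped */

	table[nr_supers].dev = dev;
	table[nr_supers].options = options;
	return &table[nr_supers++];
}
```

Running the two mounts from the script through this model shows the
second mount coming back with the first mount's options.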
> On Aug 10, 2018, at 7:05 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> There is a serious problem with mount options today that fsopen does not
> address.  The problem is that mount options are ignored for block based
> filesystems, and any other type of filesystem that follows the same
> pattern.
>
> /dev/loop0 /root/loop0-noacl-noquota-nouser_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
> /dev/loop0 /root/loop0-acl-quota-user_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0

To make sure I understand correctly: the problem is that the second mount
ignored the options because the device was already mounted, right?

For the new API, I think the only remotely sane approach is to refuse to
mount or init or whatever you call it an already-mounted bdev.  If user
code genuinely needs to bind-mount an existing mount that is known only
by its bdev, we can add a specific API just for that.
Eric W. Biederman <ebiederm@xmission.com> wrote:

> There is a serious problem with mount options today that fsopen does not
> address.  The problem is that mount options are ignored for block based
> filesystems, and any other type of filesystem that follows the same
> pattern.

Yes.  Since you *absolutely* *insist* on this being fixed *right* *now*
*or* *else*, I'm working up a set of additional patches to give userspace
the option of whether they want no sharing; sharing, but only with
exactly the same parameters; or to ignore the parameter differences and
just accept sharing of what's already mounted (ie. the current
behaviour).

The second option, however, is not trivial as it needs to compare the fs
contexts, including the LSM parameters.  To make that work, I really
need to remove the old security_mnt_opts stuff - which means I need to
port btrfs to the new context stuff.

We discussed this yesterday, and I proposed a solution, and I'm working
on it.  Yes, I agree it would be nice to have, but it *doesn't* really
need supporting right this minute, since what I have now oughtn't to
break the current behaviour.

David
On 2018/08/10 23:05, Eric W. Biederman wrote:
>
> There is a serious problem with mount options today that fsopen does not
> address.  The problem is that mount options are ignored for block based
> filesystems, and any other type of filesystem that follows the same
> pattern.
>
> The script below demonstrates this bug, showing that the ext4 "acl",
> "quota" and "user_xattr" options can be silently ignored.
>
> fsopen has my nack until it addresses this issue.
>
> I don't know if we can fix this in the context of sys_mount.  But if we
> are redoing the option parsing of how we mount filesystems, this needs
> to be fixed before we start worrying about bug compatibility.
>
> Hopefully this report is simple and clear enough that we can at least
> agree on the problem.
>
> Eric

This might be related to a problem where syzbot is failing to reproduce
a previously reproducible bug.

https://groups.google.com/forum/#!msg/syzkaller-bugs/R03vI7RCVco/0PijCTrcCgAJ

syzbot found a reproducer, and the reproducer was working until
next-20180803.  But the reproducer fails to reproduce the problem on
next-20180806, despite there being no mm-related change between
next-20180803 and next-20180806.  Therefore, I suspect that the
reproducer is no longer working as intended.  And there was a parser
change (David Howells' patch) between next-20180803 and next-20180806.
I'm waiting for a response from David Howells...
Andy Lutomirski <luto@amacapital.net> wrote:

> > /dev/loop0 /root/loop0-noacl-noquota-nouser_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
> > /dev/loop0 /root/loop0-acl-quota-user_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
>
> To make sure I understand correctly: the problem is that the second mount
> ignored the options because the device was already mounted, right?
>
> For the new API, I think the only remotely sane approach is to refuse to
> mount or init or whatever you call it an already mounted bdev.  If user
> code genuinely needs to bind-mount an existing mount that is known only
> by its bdev, we can add a specific API just for that.

I'm adding some flags to fsopen() to allow userspace to say whether it
wants no sharing, same-parameters-only sharing or anything-goes sharing
(as now).  I'm also adding a flag whereby userspace can forbid anyone
else from sharing a new superblock it has just set up.

David
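[The three sharing modes David describes can be modelled as an extra
policy argument to the superblock lookup.  The enum names and helper
below are hypothetical illustrations of those behaviours, not the
actual fsopen() flag interface.]

```c
#include <errno.h>
#include <string.h>

/* Hypothetical sharing policies, modelling the flags described above. */
enum share_policy {
	SHARE_NONE,        /* refuse to reuse an existing superblock */
	SHARE_SAME_PARAMS, /* reuse only if the options match exactly */
	SHARE_ANY          /* current behaviour: reuse, ignore options */
};

struct super_block {
	const char *dev;
	const char *options;
};

/* Decide whether an existing superblock may satisfy this request.
 * Returns 0 to share it, -EBUSY to refuse, or 1 when nothing is
 * mounted yet and the caller should create a new superblock. */
int model_share_check(const struct super_block *existing,
		      const char *wanted_options,
		      enum share_policy policy)
{
	if (!existing)
		return 1;                      /* nothing mounted yet */

	switch (policy) {
	case SHARE_NONE:
		return -EBUSY;
	case SHARE_SAME_PARAMS:
		return strcmp(existing->options, wanted_options) ? -EBUSY : 0;
	case SHARE_ANY:
	default:
		return 0;
	}
}
```

The second mode is the one David notes is non-trivial in practice,
since the real comparison must cover the whole fs context, LSM
parameters included, not just an option string.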
On Fri, Aug 10, 2018 at 09:05:22AM -0500, Eric W. Biederman wrote:
>
> There is a serious problem with mount options today that fsopen does not
> address.  The problem is that mount options are ignored for block based
> filesystems, and any other type of filesystem that follows the same
> pattern.
>
> The script below demonstrates this bug, showing that the ext4 "acl",
> "quota" and "user_xattr" options can be silently ignored.
>
> fsopen has my nack until it addresses this issue.
>
> I don't know if we can fix this in the context of sys_mount.  But if we
> are redoing the option parsing of how we mount filesystems, this needs
> to be fixed before we start worrying about bug compatibility.
>
> Hopefully this report is simple and clear enough that we can at least
> agree on the problem.

Sure, it is simple.  So's the solution: MNT_USERNS_SPECIAL_SEMANTICS
that would get passed to filesystems, so that Eric would be able to
implement his mount(2)-incompatible behaviour at leisure, without
worrying about compatibility issues.

Does that address your complaint?  Because one thing we are not going to
do is change mount(2) behaviour.  Reason: the userland-visible behaviour
of hell knows how many local scripts.

Another thing that is flat-out not feasible is some kind of blanket
"compare options" stuff; it *can* be done as helpers to be used by a
filesystem when it sees that new flag, but it's simply not going to work
at the fs-independent level.  Trivial example with the same ext4:
	mount /dev/sda1 /mnt/a -o bsddf
vs.
	mount /dev/sda1 /mnt/b
ext4 can tell that these are the same.  The syscall itself has no clue.
What's more, it's not just explicitly spelled default options - it's the
stuff that has more than one form.  And while we are at it, there are
things like two NFS mounts of different trees from the same server; they
might or might not get the same superblock, depending upon the options.

Convenience helper that would allow ext4 to compare options and reject
the incompatible mount?  Not sure how much ext4-specific knowledge would
have to go into it, but if you can come up with one - more power to you.
But the decision to use it *must* be ext4-specific.  Because for e.g.
NFS such a thing as -o fsid=..., while certainly a part of the options,
has a very different meaning - it's "use a separate fs instance" (and
let the server deal with coherency issues on its end).

Decision to use sget() (and the way it's used) is up to the filesystem.
We *can't* lift that into the syscall.  Not without breaking the fuck
out of existing behaviour.  Having something like a second callback for
mount_bdev() that would be called when we'd found an existing instance
for the same block device?  Sure, no problem.  Having a helper for doing
such comparison that would work in enough cases to bother, so that
different fs could avoid boilerplate in that callback?  Again, more
power to you.  But I don't see what the hell that has to do with the
syscall interface.
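[Al's bsddf example can be made concrete: a naive string comparison of
the two option strings says they differ, while an fs-specific
canonicalization - filling in the filesystem's defaults - shows they
describe the same instance.  The sketch below is a hypothetical
illustration of such a per-filesystem helper, modelling only the
bsddf/minixdf knob and assuming bsddf is ext4's default, per mount(8).]

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical ext4-style canonicalization: expand an option string to
 * an explicit value for every knob, so that "default implied" and
 * "default spelled out" compare equal.  Only the statfs-behaviour knob
 * is modelled here. */
void model_canonicalize(const char *opts, char *out, size_t outlen)
{
	/* bsddf is the default, so an empty option string and an
	 * explicit "bsddf" canonicalize identically. */
	if (strstr(opts, "minixdf"))
		snprintf(out, outlen, "statfs=minixdf");
	else
		snprintf(out, outlen, "statfs=bsddf");
}

/* Compare two option strings the way only the filesystem itself can:
 * after canonicalization, not textually. */
int model_options_equal(const char *a, const char *b)
{
	char ca[64], cb[64];

	model_canonicalize(a, ca, sizeof(ca));
	model_canonicalize(b, cb, sizeof(cb));
	return strcmp(ca, cb) == 0;
}
```

This is why the comparison cannot live at the fs-independent level: the
syscall has no idea which options are defaults, aliases, or (as with
NFS's fsid=) not comparison material at all.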
Andy Lutomirski <luto@amacapital.net> writes:

>> On Aug 10, 2018, at 7:05 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> There is a serious problem with mount options today that fsopen does not
>> address.  The problem is that mount options are ignored for block based
>> filesystems, and any other type of filesystem that follows the same
>> pattern.
>>
>> /dev/loop0 /root/loop0-noacl-noquota-nouser_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
>> /dev/loop0 /root/loop0-acl-quota-user_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
>
> To make sure I understand correctly: the problem is that the second
> mount ignored the options because the device was already mounted,
> right?

Yes.

> For the new API, I think the only remotely sane approach is to refuse
> to mount or init or whatever you call it an already mounted bdev.  If
> user code genuinely needs to bind-mount an existing mount that is
> known only by its bdev, we can add a specific API just for that.

Eric
On Fri, Aug 10, 2018 at 07:36:17AM -0700, Andy Lutomirski wrote:
>
> > On Aug 10, 2018, at 7:05 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> >
> > There is a serious problem with mount options today that fsopen does not
> > address.  The problem is that mount options are ignored for block based
> > filesystems, and any other type of filesystem that follows the same
> > pattern.
> >
> > /dev/loop0 /root/loop0-noacl-noquota-nouser_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
> > /dev/loop0 /root/loop0-acl-quota-user_xattr ext4 rw,relatime,nouser_xattr,noacl 0 0
>
> To make sure I understand correctly: the problem is that the second
> mount ignored the options because the device was already mounted, right?
>
> For the new API, I think the only remotely sane approach is to refuse
> to mount or init or whatever you call it an already mounted bdev.  If
> user code genuinely needs to bind-mount an existing mount that is known
> only by its bdev, we can add a specific API just for that.

First of all, that does NOT belong anywhere other than the fs itself.
Example: NFS.  Not every attempt to mount something leads to the
creation of a new fs instance; moreover, whether it will or not can't be
predicted in general.

PS: for pity's sake, fix your MUA; 270-character lines are way over the
top.
On Fri, Aug 10, 2018 at 04:11:31PM +0100, David Howells wrote:
>
> Yes.  Since you *absolutely* *insist* on this being fixed *right* *now*
> *or* *else*, I'm working up a set of additional patches to give
> userspace the option of whether they want no sharing; sharing, but only
> with exactly the same parameters; or to ignore the parameter
> differences and just accept sharing of what's already mounted (ie. the
> current behaviour).

But there's no way to support "no sharing", at least not in the general
case.  A file system can only be mounted once, and without file system
support, there's no way for a file system to be mounted with both the
bsddf and minixdf mount options simultaneously.

Even *with* file system support, there's no way today for the VFS to
keep track of whether a pathname resolution came through one mountpoint
or another, so I can't do something like this:

	mount /dev/sdXX -o casefold /android-data
	mount /dev/sdXX -o nocasefold /android-data-2

Which is a pity, since if we could, we could much more easily get rid of
the horror which is Android's wrapfs...

So if the file system has been mounted with one set of mount options,
and you want to try to mount it with a conflicting set of mount options
and you don't want it to silently ignore the mount options, the *only*
thing we can do today is to refuse the mount and return an error.

I'm not sure Eric would really consider that an improvement for the
container use case....

					- Ted

P.S.  And as Al has pointed out, this would require special, per-file
system support to determine whether the mount options are conflicting or
not....
Theodore Y. Ts'o <tytso@mit.edu> wrote:

> Even *with* file system support, there's no way today for the VFS to
> keep track of whether a pathname resolution came through one
> mountpoint or another, so I can't do something like this:

Ummm...  Isn't that encoded in the vfsmount pointer in struct path?

However, the case folding stuff - is that a superblockism or a
mountpointism?

> So if the file system has been mounted with one set of mount options,
> and you want to try to mount it with a conflicting set of mount
> options and you don't want it to silently ignore the mount options,
> the *only* thing we can do today is to refuse the mount and return an
> error.

With fsopen() there is the option to have the filesystem and the LSM
attempt to compare the non-key[*] mount options and reject the attempt
to share if they differ in any way.

David

[*] sget lookup keys, that is.
On 8/10/2018 8:39 AM, Theodore Y. Ts'o wrote:
> On Fri, Aug 10, 2018 at 04:11:31PM +0100, David Howells wrote:
>> Yes.  Since you *absolutely* *insist* on this being fixed *right*
>> *now* *or* *else*, I'm working up a set of additional patches to give
>> userspace the option of whether they want no sharing; sharing, but
>> only with exactly the same parameters; or to ignore the parameter
>> differences and just accept sharing of what's already mounted (ie.
>> the current behaviour).
>
> But there's no way to support "no sharing", at least not in the
> general case.  A file system can only be mounted once, and without
> file system support, there's no way for a file system to be mounted
> with both the bsddf and minixdf mount options simultaneously.
>
> Even *with* file system support, there's no way today for the VFS to
> keep track of whether a pathname resolution came through one
> mountpoint or another, so I can't do something like this:
>
> 	mount /dev/sdXX -o casefold /android-data
> 	mount /dev/sdXX -o nocasefold /android-data-2
>
> Which is a pity, since if we could, we could much more easily get rid
> of the horror which is Android's wrapfs...
>
> So if the file system has been mounted with one set of mount options,
> and you want to try to mount it with a conflicting set of mount
> options and you don't want it to silently ignore the mount options,
> the *only* thing we can do today is to refuse the mount and return an
> error.
>
> I'm not sure Eric would really consider that an improvement for the
> container use case....
>
> 					- Ted
>
> P.S.  And as Al has pointed out, this would require special, per-file
> system support to determine whether the mount options are conflicting
> or not....

This extends to LSMs that support mount options (SELinux and Smack) as
well.
Casey Schaufler <casey@schaufler-ca.com> wrote:

> > P.S.  And as Al has pointed out, this would require special, per-file
> > system support to determine whether the mount options are conflicting
> > or not....
>
> This extends to LSMs that support mount options (SELinux and Smack)
> as well.

Yes.  I'm doing that.

David
On Fri, Aug 10, 2018 at 04:53:58PM +0100, David Howells wrote:
> Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> > Even *with* file system support, there's no way today for the VFS to
> > keep track of whether a pathname resolution came through one
> > mountpoint or another, so I can't do something like this:
>
> Ummm...  Isn't that encoded in the vfsmount pointer in struct path?

Well, yes, and we do use this as a hack to make read-only bind mounts
work.  But that's done as a special case, and it's for permissions
checking only.  The big problem is that there is a single dentry cache
object regardless of which mount point was used to access it.  So that
makes it impossible to support case folding as a mount-pointism.

> However, the case folding stuff - is that a superblockism or a
> mountpointism?

It's a superblock-ism.  As far as I know the *only* thing that we can
support as a mount-pointism is the ro flag, and that's handled as a
special case, and only if the original superblock was mounted
read/write.

That was my point; aside from the ro flag, we can't support any other
mount options as a per-mount point thing, so the only thing we can do is
to fail the mount if there are conflicting mount options.  And I'm not
really sure it helps the container use case, since the whole point is
they want their "guest" to be able to blithely run "mount /dev/sda1 -o
noxattr /mnt" and not worry about the fact that in some other container,
someone had run "mount /dev/sda1 -o xattr /mnt".  But having the second
mount fail because of a conflicting mount option breaks the illusion
that containers are functionally as rich as VM's.

So before you put in lots of work to support rejecting the attempted
mount if the mount options conflict, are we sure people will actually
find this to be useful?  Because it's not only fsopen() work for you,
but each file system is going to have to implement new functions to
answer the question "are these mount options conflicting or not?".

Are we sure it's worth the effort?

					- Ted
"Theodore Y. Ts'o" <tytso@mit.edu> writes:

> On Fri, Aug 10, 2018 at 04:11:31PM +0100, David Howells wrote:
>>
>> Yes.  Since you *absolutely* *insist* on this being fixed *right*
>> *now* *or* *else*, I'm working up a set of additional patches to give
>> userspace the option of whether they want no sharing; sharing, but
>> only with exactly the same parameters; or to ignore the parameter
>> differences and just accept sharing of what's already mounted (ie.
>> the current behaviour).
>
> But there's no way to support "no sharing", at least not in the
> general case.  A file system can only be mounted once, and without
> file system support, there's no way for a file system to be mounted
> with both the bsddf and minixdf mount options simultaneously.
>
> Even *with* file system support, there's no way today for the VFS to
> keep track of whether a pathname resolution came through one
> mountpoint or another, so I can't do something like this:
>
> 	mount /dev/sdXX -o casefold /android-data
> 	mount /dev/sdXX -o nocasefold /android-data-2
>
> Which is a pity, since if we could, we could much more easily get rid
> of the horror which is Android's wrapfs...
>
> So if the file system has been mounted with one set of mount options,
> and you want to try to mount it with a conflicting set of mount
> options and you don't want it to silently ignore the mount options,
> the *only* thing we can do today is to refuse the mount and return an
> error.
>
> I'm not sure Eric would really consider that an improvement for the
> container use case....

I think I would consider it an improvement.  I keep running into cases
where the mount options differed and something was done silently, and
that causes problems.

Eric
On Fri, Aug 10, 2018 at 9:14 AM, Theodore Y. Ts'o <tytso@mit.edu> wrote:
> And I'm not really sure it helps the container use case, since the
> whole point is they want their "guest" to be able to blithely run
> "mount /dev/sda1 -o noxattr /mnt" and not worry about the fact that in
> some other container, someone had run "mount /dev/sda1 -o xattr /mnt".
> But having the second mount fail because of a conflicting mount option
> breaks the illusion that containers are functionally as rich as VM's.

If the same block device is visible, with rw access, in two different
containers, I don't see how anything good can happen.  Sure, with the
current somewhat erratic semantics of mount(2), something kind of sort
of reasonable happens if they both mount it.  But if one or both of them
try to use, say, tune2fs or fsck, it's not going to go well.  And a
situation where they mount with different options and the result depends
on the order of the mounts is just plain bad.

I see four sane ways to deal with this:

1. Don't put the block device in the container at all.  The container
   manager mounts it.

2. Use seccomp or a similar mechanism to intercept and emulate the
   mount request.

3. Teach the filesystem driver to do something sensible.  This will
   inherently be per-fs, and probably involves some serious magic or
   allowing filesystem-specific vfsmount options.

4. Introduce a concept of a special kind of fake block device that
   refers to an existing superblock, doesn't allow direct read or
   write, and does the right thing when mounted.  Not obviously worth
   the effort.

It seems to me that the current approach mostly involves crossing our
fingers.
On Fri, Aug 10, 2018 at 01:06:54PM -0700, Andy Lutomirski wrote:
> If the same block device is visible, with rw access, in two different
> containers, I don't see how anything good can happen.

It's worse than that.  I've fixed a lot of bugs which cause the kernel
to crash, and a few that might be levered into a privilege escalation
attack, when you mount a maliciously corrupted file system using ext4.
I'm told the security researcher filed similar reports with the XFS
community, and he was told, "that's what metadata checksums are for; go
away".  Given how much time it takes to work with these security
researchers, I don't blame them.

But in light of that, I'd make a somewhat stronger statement.  If you
let an untrusted container mount arbitrary block devices where they have
rw access to the underlying block device, nothing good can happen.
Period.  :-)

Which is why I don't think the lack of being able to reject "conflicting
mount options" is really all that important.  It certainly shouldn't
block the fsopen patch series.  #1, it's a problem we have today, and
#2, I'm really not at all sure supporting bind mounts via specifying the
block device was ever a good idea to begin with.  And #3, while I've
been fixing ext4 against security issues caused by maliciously corrupted
file system images, I'm still sure that allowing untrusted containers to
mount *any* file system via a block device for which they have r/w
access is a Really Bad Idea.

> It seems to me that the current approach mostly involves crossing our
> fingers.

Agreed!

					- Ted
On Fri, Aug 10, 2018 at 04:46:39PM -0400, Theodore Y. Ts'o wrote:
> On Fri, Aug 10, 2018 at 01:06:54PM -0700, Andy Lutomirski wrote:
> > If the same block device is visible, with rw access, in two different
> > containers, I don't see how anything good can happen.
>
> It's worse than that.  I've fixed a lot of bugs which cause the kernel
> to crash, and a few that might be levered into a privilege escalation
> attack, when you mount a maliciously corrupted file system using ext4.
> I'm told the security researcher filed similar reports with the XFS
> community, and he was told, "that's what metadata checksums are for;
> go away".

Hey now, there was a little more nuance to it than that[1][2].  The
complaint in the first instance had much more to do with breaking
existing V4 filesystems by adding format requirements that mkfs didn't
know about when the filesystem was created.  Yes, you can create V4
filesystems that will hang the system if the log was totally unformatted
and metadata updates are made, but OTOH it's fairly obvious when that
happens, you have to be root to mount a disk filesystem, and we try to
avoid breaking existing users.

XFS developers have been and will continue to examine security problems
when they are brought to our attention, and strengthen validation as
needed to minimize the risk of incorrect behaviors, but filesystems are
complex machines, complex machinery is risky, and we mitigate some of
that risk by requiring administrators to elect to mount an XFS.

> Given how much time it takes to work with these security researchers,
> I don't blame them.
>
> But in light of that, I'd make a somewhat stronger statement.  If you
> let an untrusted container mount arbitrary block devices where they
> have rw access to the underlying block device, nothing good can
> happen.  Period.  :-)
>
> Which is why I don't think the lack of being able to reject
> "conflicting mount options" is really all that important.  It
> certainly shouldn't block the fsopen patch series.  #1, it's a problem
> we have today, and #2, I'm really not at all sure supporting bind
> mounts via specifying the block device was ever a good idea to begin
> with.  And #3, while I've been fixing ext4 against security issues
> caused by maliciously corrupted file system images, I'm still sure
> that allowing untrusted containers to mount *any* file system via a
> block device for which they have r/w access is a Really Bad Idea.
>
> > It seems to me that the current approach mostly involves crossing
> > our fingers.
>
> Agreed!

Crossing our fingers and demanding administrator intentionality when
mounting filesystems off some piece of storage.

--D

[1] https://lkml.org/lkml/2018/5/21/649
[2] https://lkml.org/lkml/2018/4/2/572
On Fri, Aug 10, 2018 at 03:12:34PM -0700, Darrick J. Wong wrote:
> Hey now, there was a little more nuance to it than that[1][2].  The
> complaint in the first instance had much more to do with breaking
> existing V4 filesystems by adding format requirements that mkfs didn't
> know about when the filesystem was created.  Yes, you can create V4
> filesystems that will hang the system if the log was totally
> unformatted and metadata updates are made, but OTOH it's fairly
> obvious when that happens, you have to be root to mount a disk
> filesystem, and we try to avoid breaking existing users.

I wasn't thinking about syzbot reports; I've largely written them off as
far as file system testing is concerned, but rather Wen Xu at Georgia
Tech, who is much more reasonable than Dmitry, and has helped me out a
lot; and has complained that the XFS folks haven't been engaging with
him.

In either case, both security researchers are fuzzing file system
images, then fixing the checksums, and discovering that this can lead to
kernel crashes, and in a few cases, buffer overruns that can lead to
potential privilege escalations.  Wen can generate reports faster than
syzbot, but at least he gives me file system images (as opposed to
having to dig them out of syzbot repro C files) and he actually does
some analysis and explains what he thinks is going on.

I don't think anyone was claiming that format requirements should be
added to ext4 or xfs file systems.  But rather, that kernel code should
be made more robust against maliciously corrupted file system images
that have valid checksums.  I've been more willing to work with Wen;
Dave has expressed the opinion that these are not realistic bug reports,
and since only root can mount file systems, it's not high priority.

The reason why I bring this up here is that in container land, there are
those who believe that "container root" should be able to mount file
systems, and if the "container root" isn't trusted, the fact that the
"container root" can crash the host kernel, or worse, corrupt the host
kernel and break out of the container as a result, would be sad.

I was pretty sure most file system developers are on the same page that
allowing untrusted "container roots" the ability to mount arbitrary
block device file systems is insanity.  Whether or not we try to fix
these sorts of bugs submitted by security researchers.  :-)

					- Ted
"Theodore Y. Ts'o" <tytso@mit.edu> writes:

> On Fri, Aug 10, 2018 at 04:53:58PM +0100, David Howells wrote:
>> Theodore Y. Ts'o <tytso@mit.edu> wrote:
>>
>> > Even *with* file system support, there's no way today for the VFS to
>> > keep track of whether a pathname resolution came through one
>> > mountpoint or another, so I can't do something like this:
>>
>> However, the case folding stuff - is that a superblockism or a
>> mountpointism?
>
> It's a superblock-ism.  As far as I know the *only* thing that we can
> support as a mount-pointism is the ro flag, and that's handled as a
> special case, and only if the original superblock was mounted
> read/write.  That was my point; aside from the ro flag, we can't
> support any other mount options as a per-mount point thing, so the
> only thing we can do is to fail the mount if there are conflicting
> mount options.  And I'm not really sure it helps the container use
> case, since the whole point is they want their "guest" to be able to
> blithely run "mount /dev/sda1 -o noxattr /mnt" and not worry about the
> fact that in some other container, someone had run "mount /dev/sda1 -o
> xattr /mnt".  But having the second mount fail because of a
> conflicting mount option breaks the illusion that containers are
> functionally as rich as VM's.

Ted, this isn't about some container case.  It is about the fact that
practically every filesystem in the kernel has the behavior I have
described, and it means that if root is not super careful, root will
shoot himself in the foot with the shotgun we have pointed there.  It
really is about losing acls or some other filesystem option.

Eric
On Fri, Aug 10, 2018 at 07:54:47PM -0400, Theodore Y. Ts'o wrote:
> On Fri, Aug 10, 2018 at 03:12:34PM -0700, Darrick J. Wong wrote:
> > Hey now, there was a little more nuance to it than that[1][2].  The
> > complaint in the first instance had much more to do with breaking
> > existing V4 filesystems by adding format requirements that mkfs
> > didn't know about when the filesystem was created.  Yes, you can
> > create V4 filesystems that will hang the system if the log was
> > totally unformatted and metadata updates are made, but OTOH it's
> > fairly obvious when that happens, you have to be root to mount a
> > disk filesystem, and we try to avoid breaking existing users.
>
> I wasn't thinking about syzbot reports; I've largely written them off
> as far as file system testing is concerned, but rather Wen Xu at
> Georgia Tech, who is much more reasonable than Dmitry, and has helped
> me out a lot; and has complained that the XFS folks haven't been
> engaging with him.

Ahh, ok.  Yes, Wen has been easier to work with, and gives out
filesystem images.  Hm, I'll go comb the bugzilla again...

> In either case, both security researchers are fuzzing file system
> images, then fixing the checksums, and discovering that this can lead
> to kernel crashes, and in a few cases, buffer overruns that can lead
> to potential privilege escalations.  Wen can generate reports faster
> than syzbot, but at least he gives me file system images (as opposed
> to having to dig them out of syzbot repro C files) and he actually
> does some analysis and explains what he thinks is going on.

(FWIW I tried to figure out how to add fs image dumping to syzbot and
whoah, that was horrifying.)

> I don't think anyone was claiming that format requirements should be
> added to ext4 or xfs file systems.  But rather, that kernel code
> should be made more robust against maliciously corrupted file system
> images that have valid checksums.  I've been more willing to work
> with Wen; Dave has expressed the opinion that these are not realistic
> bug reports, and since only root can mount file systems, it's not
> high priority.

I don't think they're high priority either, but they're at least worth
/some/ attention.

> The reason why I bring this up here is that in container land, there
> are those who believe that "container root" should be able to mount
> file systems, and if the "container root" isn't trusted, the fact
> that the "container root" can crash the host kernel, or worse,
> corrupt the host kernel and break out of the container as a result,
> would be sad.
>
> I was pretty sure most file system developers are on the same page
> that allowing untrusted "container roots" the ability to mount
> arbitrary block device file systems is insanity.

Agreed.

> Whether or not we try to fix these sorts of bugs submitted by
> security researchers.  :-)

and agreed.  :)

--D

> 					- Ted
Al Viro <viro@ZenIV.linux.org.uk> writes: > On Fri, Aug 10, 2018 at 09:05:22AM -0500, Eric W. Biederman wrote: >> >> There is a serious problem with mount options today that fsopen does not >> address. The problem is that mount options are ignored for block based >> filesystems, and any other type of filesystem that follows the same >> pattern. >> >> The script below demonstrates this bug. Showing this bug can cause the >> ext4 "acl" "quota" and "user_xattr" options to be silently ignored. >> >> fsopen has my nack until it addresses this issue. >> >> I don't know if we can fix this in the context of sys_mount. But we if >> we are redoing the option parsing of how we mount filesystems this needs >> to be fixed before we start worrying about bug compatibility. >> >> Hopefully this report is simple and clear enough that we can at least >> agree on the problem. > > Sure, it is simple. So's the solution: MNT_USERNS_SPECIAL_SEMANTICS that > would get passed to filesystems, so that Eric would be able to implement > his mount(2)-incompatible behaviour at leisure, without worrying about > compatibility issues. > > Does that address your complaint? Absolutely not. My complaint is that the current implemented behavior of practically every filesystem in the kernel is that it will ignore mount options when mounted a second time. It is not some weird special case. It is not some container thing. It is that mount(2), with practically every filesystem type, behaves in ways no one would expect when that filesystem is already mounted somewhere else. With the new fsopen api the easy thing to do is simply have CMD_CREATE and CMD_BIND_INTERNAL and be done with it. CMD_CREATE guarantees that a new superblock is created. CMD_BIND_INTERNAL would only work with an existing superblock. Then root would at least know that he is connecting to an already mounted filesystem and could look at the options etc. and fail if he didn't like what he saw. No surprises, no muss, no fuss. Simple. 
But I have been told the simple solution above is somehow unacceptable. And an option to compare the mount options and see if they are the same was offered. That would work too. I just care that we define the semantics in such a way that it is not easy for root to get confused and do something stupid that will bite later, and that we build the infrastructure so that all filesystems can implement it easily. So yes, this is 100% a question about how filesystems should behave with respect to their options when mounted for a second time. That is what Dave Howells' patchset is addressing. > Because one thing we are not going to do is changing mount(2) > behaviour. I have not asked for that. I have asked that we get it right for fsopen. > Reason: userland-visible behaviour of hell knows how many local scripts. > Another thing that > is flat-out not feasible is some kind of blanket "compare options" > stuff; it *can* be done as helpers to be used by filesystem when > it sees that new flag, but it's simply not going to work at the > fs-independent level. > > Trivial example with the same ext4: > mount /dev/sda1 /mnt/a -o bsddf vs. mount /dev/sda1 /mnt/b > ext4 can tell that these are the same. syscall itself has no > clue. What's more, it's not just explicitly spelled default > options - it's the stuff that has more than one form. And while > we are at it, the things like two NFS mounts of different trees > from the same server; they might or might not get the same superblock. > Depending upon the options. > > Convenience helper that would allow ext4 to compare options and reject > the incompatible mount? Not sure how much ext4-specific knowledge > would have to go in it, but if you can come up with one - more power > to you. But the decision to use it *must* be ext4-specific. Because > for e.g. 
NFS such thing as -o fsid=..., while certainly a part of > options, has a very different meaning - it's "use a separate fs instance" > (and let the server deal with coherency issues on its end). > > Decision to use sget() (and the way it's used) is up to filesystem. > We *can't* lift that into syscall. Not without breaking the fuck out > of existing behaviour. I have never proposed that. See above. I may have talked in terms of what sget does and muddied the waters. If so I apologize. All I proposed was that we distinguish between a first mount and an additional mount so that userspace knows the options will be ignored. Then the code to replicate the current behavior can look like:

    fd = fsopen(...);
    fsconfig(fd, ...);
    fsconfig(fd, ...);
    fsconfig(fd, ...);
    fsconfig(fd, ...);
    fsconfig(fd, ...);
    fsconfig(fd, ...);
    fsconfig(fd, ...);
    if (fsconfig(fd, CMD_CREATE) == -EBUSY) {
            fsconfig(fd, CMD_BIND_INTERNAL);
    }

But userspace would then be free to issue a warning or do something else if CMD_CREATE returns -EBUSY. I don't know how the above wound up being construed as asking that the code call sget directly but that is what has happened. > Having something like a second callback for mount_bdev() that would > be called when we'd found an existing instance for the same block > device? Sure, no problem. Having a helper for doing such comparison > that would work in enough cases to bother, so that different fs > could avoid boilerplate in that callback? Again, more power to you. Normal forms etc. If we want to do that it just requires a wee bit of discipline. And if all of the option parsing is being rewritten and retested anyway I don't see why we can't do something like that as well. So it does not sound unreasonable to me. It does sound like more work than what I was proposing. Eric
David Howells <dhowells@redhat.com> writes: > Eric W. Biederman <ebiederm@xmission.com> wrote: > >> There is a serious problem with mount options today that fsopen does not >> address. The problem is that mount options are ignored for block based >> filesystems, and any other type of filesystem that follows the same >> pattern. > > Yes. Since you *absolutely* *insist* on this being fixed *right* *now* *or* > *else*, I'm working up a set of additional patches to give userspace the > option of whether they want no sharing; sharing, but only with exactly the > same parameters; or to ignore the parameter differences and just accept > sharing of what's already mounted (ie. the current behaviour). > > The second option, however, is not trivial as it needs to compare the fs > contexts, including the LSM parameters. To make that work, I really need to > remove the old security_mnt_opts stuff - which means I need to port btrfs to > the new context stuff. > > We discussed this yesterday, and I proposed a solution, and I'm working on it. I repeated this because, after some comments from Al on IRC yesterday and Miklos's email reply, it appeared I had not stated my issue clearly enough for people reading the thread to understand the problem that I see. > Yes, I agree it would be nice to have, but it *doesn't* really need supporting > right this minute, since what I have now oughtn't to break the current > behaviour. I am really reluctant to endorse anything that propagates the issues of the current interface in the new mount interface. Eric
"Darrick J. Wong" <darrick.wong@oracle.com> writes: > On Fri, Aug 10, 2018 at 07:54:47PM -0400, Theodore Y. Ts'o wrote: >> The reason why I bring this up here is that in container land, there >> are those who believe that "container root" should be able to mount >> file systems, and if the "container root" isn't trusted, the fact that >> the "container root" can crash the host kernel, or worse, corrupt the >> host kernel and break out of the container as a result, that would be >> sad. >> >> I was pretty sure most file system developers are on the same page >> that allowing untrusted "container roots" the ability to mount >> arbitrary block device file systems is insanity. > > Agreed. For me I am happy with fuse. That is sufficient to cover any container use cases people have. If anyone comes bugging you for more I will be happy to push back. The only thing that containers have to do with this is I wind up touching a lot of the kernel/user boundary so I get to see a lot of it and sometimes see weird things. Eric
On Fri, Aug 10, 2018 at 08:05:44PM -0500, Eric W. Biederman wrote: > > My complaint is that the current implemented behavior of practically > every filesystem in the kernel, is that it will ignore mount options > when mounted a second time. The file system is ***not*** mounted a second time. The design bug is that we allow bind mounts to be specified via a block device. A bind mount is not "a second mount" of the file system. Bind mounts != mounts. I had assumed we had allowed bind mounts to be specified via the block device because of container use cases. If the container folks don't want it, I would be pushing to simply not allow bind mounts to be specified via block device at all. The only reason why we should support it is because we don't want to break scripts; and if the goal is not to break scripts, then we have to keep to the current semantics, however broken you think it is. - Ted
On Fri, Aug 10, 2018 at 08:05:44PM -0500, Eric W. Biederman wrote: > All I proposed was that we distinguish between a first mount and an > additional mount so that userspace knows the options will be ignored. For pity sake, just what does it take to explain to you that your notions of "first mount" and "additional mount" ARE HEAVILY FS-DEPENDENT and may depend upon the pieces of state userland (especially in container) simply does not have? One more time, slowly: mount -t nfs4 wank.example.org:/foo/bar /mnt/a mount -t nfs4 wank.example.org:/baz/barf /mnt/b yield the same superblock. Is anyone who mounts something over NFS required to know if anybody else has mounted something from the same server, and if so how the hell are they supposed to find that out, so that they could decide whether they are creating the "first" or "additional" mount, whatever that might mean in this situation? And how, kernel-side, is that supposed to be handled by generic code of any description? While we are at it, mount -t nfs4 wank.example.org:/foo/bar -o wsize=16384 /mnt/c is *NOT* the same superblock as the previous two. > I don't know how the above wound up being construed as asking that the > code call sget directly but that is what has happened. Not by me. What I'm saying is that the entire superblock-creating machinery - all of it - is nothing but library helpers. With the decision of when/how/if they are to be used being down to filesystem driver. Your "first mount"/"additional mount" simply do not map to anything universally applicable. > > Having something like a second callback for mount_bdev() that would > > be called when we'd found an existing instance for the same block > > device? Sure, no problem. Having a helper for doing such comparison > > that would work in enough cases to bother, so that different fs > > could avoid boilerplate in that callback? Again, more power to you. > > Normal forms etc. If we want to do that it just requires a wee bit of > discipline. 
And if all of the option parsing is being rewritten and > retested anyway I don't see why we can't do something like that as well. > So it does not sound unreasonable to me. See above.
On Sat, Aug 11, 2018 at 02:58:15AM +0100, Al Viro wrote: > On Fri, Aug 10, 2018 at 08:05:44PM -0500, Eric W. Biederman wrote: > > > All I proposed was that we distinguish between a first mount and an > > additional mount so that userspace knows the options will be ignored. > > For pity sake, just what does it take to explain to you that your > notions of "first mount" and "additional mount" ARE HEAVILY FS-DEPENDENT > and may depend upon the pieces of state userland (especially in container) > simply does not have? > > One more time, slowly: > > mount -t nfs4 wank.example.org:/foo/bar /mnt/a > mount -t nfs4 wank.example.org:/baz/barf /mnt/b > > yield the same superblock. Is anyone who mounts something over NFS > required to know if anybody else has mounted something from the same > server, and if so how the hell are they supposed to find that out, > so that they could decide whether they are creating the "first" or > "additional" mount, whatever that might mean in this situation? > > And how, kernel-side, is that supposed to be handled by generic code > of any description? > > While we are at it, > mount -t nfs4 wank.example.org:/foo/bar -o wsize=16384 /mnt/c > is *NOT* the same superblock as the previous two. s/as the previous two/as in the previous two cases/, that is - the first two examples yield one superblock, this one - another.
Al Viro <viro@ZenIV.linux.org.uk> writes: > On Sat, Aug 11, 2018 at 02:58:15AM +0100, Al Viro wrote: >> On Fri, Aug 10, 2018 at 08:05:44PM -0500, Eric W. Biederman wrote: >> >> > All I proposed was that we distinguish between a first mount and an >> > additional mount so that userspace knows the options will be ignored. >> >> For pity sake, just what does it take to explain to you that your >> notions of "first mount" and "additional mount" ARE HEAVILY FS-DEPENDENT >> and may depend upon the pieces of state userland (especially in container) >> simply does not have? >> >> One more time, slowly: >> >> mount -t nfs4 wank.example.org:/foo/bar /mnt/a >> mount -t nfs4 wank.example.org:/baz/barf /mnt/b >> >> yield the same superblock. Is anyone who mounts something over NFS >> required to know if anybody else has mounted something from the same >> server, and if so how the hell are they supposed to find that out, >> so that they could decide whether they are creating the "first" or >> "additional" mount, whatever that might mean in this situation? >> >> And how, kernel-side, is that supposed to be handled by generic code >> of any description? >> >> While we are at it, >> mount -t nfs4 wank.example.org:/foo/bar -o wsize=16384 /mnt/c >> is *NOT* the same superblock as the previous two. > > s/as the previous two/as in the previous two cases/, that is - the first two > examples yield one superblock, this one - another. Exactly because the mount options differ. I don't have a problem if we have something sophisticated like nfs that handles all of the hairy details and does not reuse a superblock unless the mount options match. What I have a problem with is the helper for ordinary filesystems that are not as sophisticated as nfs, that don't handle all of the option magic, and that give userspace something different from what userspace asked for. It may take a little generalization of the definitions I proposed but it still remains simple and straightforward. 
    CMD_THESE_MOUNT_OPTIONS_NO_SURPRISES
    CMD_WHATEVER_ALREADY_EXISTS

Or we can make the filesystems more sophisticated when we move them to the new API and perform the comparisons there. I think that is what David Howells is working on. Eric
"Theodore Y. Ts'o" <tytso@mit.edu> writes: > On Fri, Aug 10, 2018 at 08:05:44PM -0500, Eric W. Biederman wrote: >> >> My complaint is that the current implemented behavior of practically >> every filesystem in the kernel, is that it will ignore mount options >> when mounted a second time. > > The file system is ***not*** mounted a second time. > > The design bug is that we allow bind mounts to be specified via a > block device. A bind mount is not "a second mount" of the file > system. Bind mounts != mounts. > > I had assumed we had allowed bind mounts to be specified via the block > device because of container use cases. If the container folks don't > want it, I would be pushing to simply not allow bind mounts to be > specified via block device at all. No it is not a container thing. > The only reason why we should support it is because we don't want to > break scripts; and if the goal is not to break scripts, then we have > to keep to the current semantics, however broken you think it is. But we don't have to support returning filesystems with mismatched mount options in the new fsopen api. That is my concern. Confusing userspace this way has been shown to be harmful; let's not keep doing it. Eric
Eric W. Biederman <ebiederm@xmission.com> wrote: > > Yes, I agree it would be nice to have, but it *doesn't* really need > > supporting right this minute, since what I have now oughtn't to break the > > current behaviour. > > I am really reluctant to endorse anything that propagates the issues of > the current interface in the new mount interface. Do realise that your problem cannot be solved through fsopen() until every filesystem is converted to the new fs_context-based sget() since the flag has to make it from the VFS through the filesystem to sget(). I'm reluctant to add this flag before that point unless we error out when the flag is set against a legacy filesystem. David
> On Aug 11, 2018, at 12:29 AM, David Howells <dhowells@redhat.com> wrote: > > Eric W. Biederman <ebiederm@xmission.com> wrote: > >>> Yes, I agree it would be nice to have, but it *doesn't* really need >>> supporting right this minute, since what I have now oughtn't to break the >>> current behaviour. >> >> I am really reluctant to endorse anything that propagates the issues of >> the current interface in the new mount interface. > > Do realise that your problem cannot be solved through fsopen() until every > filesystem is converted to the new fs_context-based sget() since the flag has > to make it from the VFS through the filesystem to sget(). > > I'm reluctant to add this flag till that point until that time unless we error > out if the flag is set against a legacy filesystem. > > I don’t see why we need all this fancy “do the options match” stuff. For the handful of filesystems (like NFS) that do something intelligent when multiple non-bind mount requests against the same underlying storage happen, we can keep that behavior in the new API. For other filesystems that don’t have this feature, we should simply fail the request. IOW I see no compelling reason to call sget() at all from the new API. The only sort-of-legit use case I can think of is mounting more than one btrfs subvolume. But even that should probably not be done by asking the kernel to separately instantiate the filesystem. As another way of looking at it: for a network filesystem, mounting the same target ip and path from two different Linux machines works, so mounting it twice from the same machine should also work. But mounting the same underlying ext4 block device from two different Linux machines (using nbd, iscsi, etc) would be a catastrophe, so I see no reason that it needs to be supported if it’s two mounts from one machine. The case folding example is interesting, and I think it should probably have a slightly different API. 
A program could open_tree a nocasefold mount and then make a request to create what is functionally a bind mount but with different options. mount(8) will presumably just keep using mount(2).
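[Editorial illustration] Andy's open_tree idea might look roughly like the pseudocode below. This is purely speculative: open_tree() and move_mount() are drawn from the new mount-API direction under discussion, while the attribute-setting call and the MOUNT_ATTR_CASEFOLD flag are invented here for illustration.

```
/* pick up the existing nocasefold mount; no second mount of the
 * block device is requested */
tfd = open_tree(AT_FDCWD, "/mnt/nocasefold", OPEN_TREE_CLONE);

/* ask for one property to differ on the new attachment
 * (mount_setattr/MOUNT_ATTR_CASEFOLD are hypothetical here) */
attr.attr_set = MOUNT_ATTR_CASEFOLD;
mount_setattr(tfd, "", AT_EMPTY_PATH, &attr, sizeof(attr));

/* attach what is functionally a bind mount with different options */
move_mount(tfd, "", AT_FDCWD, "/mnt/casefold", MOVE_MOUNT_F_EMPTY_PATH);
```

The point of the sketch is that the second attachment is explicitly expressed as a derived bind mount rather than as a fresh mount of the block device, which is what Eric objects to being silently converted.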
On Sat, Aug 11, 2018 at 09:31:29AM -0700, Andy Lutomirski wrote: > I don’t see why we need all this fancy “do the options match” stuff. For the handful of filesystems (like NFS) that do something intelligent when multiple non-bind mount requests against the same underlying storage happen, we can keep that behavior in the new API. For other filesystems that don’t have this feature, we should simply fail the request. > IOW I see no compelling reason to call sget() at all from the new API. The only sort-of-legit use case I can think of is mounting more than one btrfs subvolume. But even that should probably not be done by asking the kernel to separately instantiate the filesystem. May I politely suggest the esteemed participants of that conversation to RTFS? Yes, I know that it's less fun than talking about your rather vague ideas of how the things (surely) work, but it just might avoid the feats of idiocy like the above. Andy, I don't know how to put it more plainly: read the fucking source. Even grep would do. The same NFS you've granted (among the "handful" of filesystems) an exception, *DOES* *CALL* *THE* *FUCKING* sget(). Yes, really. And in some obscure[1] cases (including the one mentioned upthread) it does reuse a pre-existing superblock. For a very good reason. [1] such as, oh, mounting two filesystems from the same server with default options - who would've ever thought of doing something so perverted?
On 8/10/2018 9:48 PM, Eric W. Biederman wrote: > "Theodore Y. Ts'o" <tytso@mit.edu> writes: > >> On Fri, Aug 10, 2018 at 08:05:44PM -0500, Eric W. Biederman wrote: >>> My complaint is that the current implemented behavior of practically >>> every filesystem in the kernel, is that it will ignore mount options >>> when mounted a second time. >> The file system is ***not*** mounted a second time. >> >> The design bug is that we allow bind mounts to be specified via a >> block device. A bind mount is not "a second mount" of the file >> system. Bind mounts != mounts. >> >> I had assumed we had allowed bind mounts to be specified via the block >> device because of container use cases. If the container folks don't >> want it, I would be pushing to simply not allow bind mounts to be >> specified via block device at all. > No it is not a container thing. Inigo: "Hello. My name is Inigo Montoya. You killed my father. Prepare to die." Rugen: "Stop saying that!" Eric: "It is not a container thing." Casey: "Stop saying that!" Yes, Virginia, it *is* a container thing. Your container manager expects all filesystems to be server-client based. It makes bad assumptions. It is doing things that we would fire a sysadmin for doing. Don't blame the filesystems for behaving as documented. Export the filesystem using NFS and mount them using the NFS mechanism, which is designed to do what you're asking for. The problem is not in the mount mechanism, it's in the way you want to abuse it. >> The only reason why we should support it is because we don't want to >> break scripts; and if the goal is not to break scripts, then we have >> to keep to the current semantics, however broken you think it is. > But we don't have to support returning filesystems with mismatched mount > options in the new fsopen api. That is my concern. Confusing > userspace this way has been shown to be harmful let's not keep doing it. It's not "userspace" that's confused. 
Developers of userspace code implementing system behavior (e.g. systemd, container managers) need to understand how the system works. The container manager needs to know that it can't mount filesystems with different options. That's the kind of thing "managers" do. If it has to go to the mount table and check on how the device is already mounted before doing a mount, so be it. Unless, of course, you want the concept of "container" introduced into the kernel. There's a whole lot of feldercarb that container managers have to deal with that would be lots easier to deal with down below. I'm not advocating that, and I understand the arguments against it. On the other hand, if you want a platform that is optimized for a container environment ... > Eric
On Sat, Aug 11, 2018 at 3:58 AM, Al Viro <viro@zeniv.linux.org.uk> wrote: > What I'm saying is that the entire superblock-creating > machinery - all of it - is nothing but library helpers. With the > decision of when/how/if they are to be used being down to filesystem > driver. Your "first mount"/"additional mount" simply do not map > to anything universally applicable. Why so? (Note: using the "mount" terminology here is fundamentally broken to start with, mounts have nothing to do with this... Filesystem instance is better word.) You bring up NFS as an example, but creating and/or reusing an nfs client instance connected to a certain server is certainly a clear and well defined concept. The question becomes: does it make sense to generalize this concept and export it to userspace with the new API? You know the Plan 9 fs interface much better, but to me it looks like there's a separate namespace for filesystem instances, and the mount command just refers to such an instance. So there's no comparing of options or any such horror, just the need to explicitly instantiate a new instance when necessary. Doesn't sound very difficult to implement in the new API. Thanks, Miklos
> If the same block device is visible, with rw access, in two different > containers, I don't see how anything good can happen. Sure, with the At the raw level there are lots of use cases involving high performance data capture, media streaming and the like. At the file system layer you can use GFS2 for example. So there are cases where it's possible. There are even cases where it's actually useful at the filesystem level although not many I agree. Alan
On Mon, Aug 13, 2018 at 9:35 AM, Alan Cox <gnomes@lxorguk.ukuu.org.uk> wrote: >> If the same block device is visible, with rw access, in two different >> containers, I don't see how anything good can happen. Sure, with the > > At the raw level there are lots of use cases involving high performance > data capture, media streaming and the like. > > At the file system layer you can use GFS2 for example. Ugh. I even thought of this case, and I should have been a bit more precise: I would consider the GFS2 case to be essentially equivalent to the NFS case. I think we can probably divide all the filesystems into three or four types: pseudo file systems: Multiple instantiations of the same fs driver pointing at the same backing store give separate filesystems. (Same backing store includes the case where there isn't any backing store.) tmpfs is an example. This isn't particularly interesting. network-like file systems: Multiple instantiations of the same fs driver pointing at the same backing store are expected. This includes NFS, GFS2, AFS, CIFS, etc. This is only really interesting to the extent that, if the fs driver internally wants to share state between multiple instantiations, it should be smart enough to make sure the options are compatible or that it can otherwise handle mismatched options correctly. NFS does this right. non-network-like filesystems: There are complicated ones like btrfs and ZFS and simple ones like ext4. In either case, multiple totally separate instantiations of the driver sharing the backing store will lead to corruption. In cases like ext4, we seem to support it for legacy reasons, because we're afraid that there are scripts that try to mount the same block device more than once, and I think the new API has no need to support this. In cases like btrfs, we also seem to support multiple user requests for "mounts" with the same underlying block devices because we need it for full functionality. But I think this is because our API is wrong. 
Are there cases I'm missing? It sounds like the API could be improved to fully model the last case, and everything will work nicely.
On Mon, Aug 13, 2018 at 09:48:53AM -0700, Andy Lutomirski wrote: > I would consider the GFS2 case to be essentially equivalent to the NFS > case. I think we can probably divide all the filesystems into three > or four types: > > pseudo file systems: Multiple instantiations of the same fs driver > pointing at the same backing store give separate filesystems. (Same > backing store includes the case where there isn't any backing store.) > tmpfs is an example. This isn't particularly interesting. > > network-like file systems: Multiple instantiations of the same fs > driver pointing at the same backing store are expected. This includes > NFS, GFS2, AFS, CIFS, etc. This is only really interesting to the > extent that, if the fs driver internally wants to share state between > multiple instantiations, it should be smart enough to make sure the > options are compatible or that it can otherwise handle mismatched > options correctly. NFS does this right. > > non-network-like filesystems: There are complicated ones like btrfs > and ZFS and simple ones like ext4. In either case, multiple totally > separate instantiations of the driver sharing the backing store will > lead to corruption. In cases like ext4, we seem to support it for > legacy reasons, because we're afraid that there are scripts that try > to mount the same block device more than once, and I think the new API > has no need to support this. In cases like btrfs, we also seem to > support multiple user requests for "mounts" with the same underlying > block devices because we need it for full functionality. But I think > this is because our API is wrong. > > Are there cases I'm missing? It sounds like the API could be improved > to fully model the last case, and everything will work nicely. You know, that's starting to remind me of this little gem of Borges: http://www.alamut.com/subj/artiface/language/johnWilkins.html Especially the delightful (fake) quote contained in there: [...] 
it is written that the animals are divided into: (a) belonging to the emperor, (b) embalmed, (c) tame, (d) sucking pigs, (e) sirens, (f) fabulous, (g) stray dogs, (h) included in the present classification, (i) frenzied, (j) innumerable, (k) drawn with a very fine camelhair brush, (l) et cetera, (m) having just broken the water pitcher, (n) that from a long way off look like flies.
On Mon, 13 Aug 2018, Al Viro wrote: > On Mon, Aug 13, 2018 at 09:48:53AM -0700, Andy Lutomirski wrote: > > Are there cases I'm missing? It sounds like the API could be improved > > to fully model the last case, and everything will work nicely. > > You know, that's starting to remind of this little gem of Borges: > http://www.alamut.com/subj/artiface/language/johnWilkins.html > Especially the delightful (fake) quote contained in there: > [...] it is written that the animals are divided into: > (a) belonging to the emperor, > (b) embalmed, > (c) tame, > (d) sucking pigs, > (e) sirens, > (f) fabulous, > (g) stray dogs, > (h) included in the present classification, > (i) frenzied, > (j) innumerable, > (k) drawn with a very fine camelhair brush, > (l) et cetera, > (m) having just broken the water pitcher, > (n) that from a long way off look like flies. Coincidentally, this was also the model for Linux capabilities.
On 8/13/2018 12:00 PM, James Morris wrote: > On Mon, 13 Aug 2018, Al Viro wrote: > >> On Mon, Aug 13, 2018 at 09:48:53AM -0700, Andy Lutomirski wrote: >>> Are there cases I'm missing? It sounds like the API could be improved >>> to fully model the last case, and everything will work nicely. >> You know, that's starting to remind of this little gem of Borges: >> http://www.alamut.com/subj/artiface/language/johnWilkins.html >> Especially the delightful (fake) quote contained in there: >> [...] it is written that the animals are divided into: >> (a) belonging to the emperor, >> (b) embalmed, >> (c) tame, >> (d) sucking pigs, >> (e) sirens, >> (f) fabulous, >> (g) stray dogs, >> (h) included in the present classification, >> (i) frenzied, >> (j) innumerable, >> (k) drawn with a very fine camelhair brush, >> (l) et cetera, >> (m) having just broken the water pitcher, >> (n) that from a long way off look like flies. > > Coincidentally, this was also the model for Linux capabilities. Linux capabilities are POSIX capabilities which are modeled closely to accommodate the historical behavior manifest in the P1003.1 specification. So except for (c), (f) and (k) you can use this characterization. On a slightly more serious note, there's a lot of Linux, mount semantics included, that has grown organically and that isn't quite up to the usage models it is being applied to. I applaud David's work in part because it may make it possible to accommodate more of those cases going forward.
Casey Schaufler <casey@schaufler-ca.com> writes:

> Don't blame the filesystems for behaving as documented.

No. This behavior is not documented. At least I certainly don't see a
word about this in any of the man pages. Where does it say mounting a
filesystem will not honor its mount options?

It is also rare enough in practice that it is reasonable to expect
people to be surprised by it.

> The problem is not in the mount mechanism, it's in the way you want to
> abuse it.

I am not asking for this behavior. I am pointing out that this behavior
exists, and that it is harmful. I am asking that we stop doing this
harmful thing in the new API, where we don't have a chance of breaking
anything.

The place where this has bitten the hardest is when someone wrote a
script to do something for Xen in a chroot. That script mounted devpts
inside the chroot and in doing so happened to change the options of the
main /dev/pts. Which resulted in ptys created with /dev/ptmx outside
the chroot having the wrong permissions. That in turn caused several
distros to retain the ancient suid pt_chown binary from libc that the
devpts filesystem was built to make obsolete. As the world turned, that
pt_chown binary could be confused into chowning the wrong pty if a pty
from a container was used.

The fix was to mount a new instance of devpts every time mount of
devpts is called. That simplified the code and allowed pt_chown to be
removed permanently. The tricky bit was figuring out how to keep
/dev/ptmx working. I wound up testing on every distribution I could
think of to ensure no one would notice the slightly changed behavior of
the devpts filesystem.

The behavior in other filesystems of ignoring the options, instead of
changing them on the filesystem, isn't quite as bad. But it still has
the potential for a lot of mischief.

Eric
Having just re-ported NFS on top of the new mount API stuff, I find that I
don't really like the idea of superblocks being separated by communication
parameters - especially when it might seem reasonable to be able to adjust
those parameters.

Does it make sense to abstract out the remote peer and allow (a) that to be
configured separately from any superblocks using it and (b) that to be used
to create superblocks?

Note that what a 'remote peer' is would be different for different
filesystems:

 (*) For NFS, it would probably be a named server, with address(es) attached
     to the name. In lieu of actually having a name, the initial IP address
     could be used.

 (*) For CIFS, it would probably be a named server. I'm not sure if CIFS
     allows an abstraction for a share that can move about inside a domain.

 (*) For AFS, it would be a cell, I think, where the actual fileserver(s)
     used are a matter of direction from the Volume Location server.

 (*) For 9P and Ceph, I don't really know.

What could be configured? Well, addresses, ports, timeouts. Maybe protocol
level negotiation - though not being able to explicitly specify, say, the
particular version and minorversion on an NFS share would be problematic for
backward compatibility.

One advantage it could give us is that it might make it easier, if someone
asks for server X, to query userspace in some way for what the default
parameters for X are.

What might this look like in terms of userspace?
Well, we could overload the new mount API:

	peer1 = fsopen("nfs", FSOPEN_CREATE_PEER);
	fsconfig(peer1, FSCONFIG_SET_NS, "net", NULL, netns_fd);
	fsconfig(peer1, FSCONFIG_SET_STRING, "peer_name", "server.home");
	fsconfig(peer1, FSCONFIG_SET_STRING, "vers", "4.2");
	fsconfig(peer1, FSCONFIG_SET_STRING, "address", "tcp:192.168.1.1");
	fsconfig(peer1, FSCONFIG_SET_STRING, "address", "tcp:192.168.1.2");
	fsconfig(peer1, FSCONFIG_SET_STRING, "timeo", "122");
	fsconfig(peer1, FSCONFIG_CMD_SET_UP_PEER, NULL, NULL, 0);

	peer2 = fsopen("nfs", FSOPEN_CREATE_PEER);
	fsconfig(peer2, FSCONFIG_SET_NS, "net", NULL, netns_fd);
	fsconfig(peer2, FSCONFIG_SET_STRING, "peer_name", "server2.home");
	fsconfig(peer2, FSCONFIG_SET_STRING, "vers", "3");
	fsconfig(peer2, FSCONFIG_SET_STRING, "address", "tcp:192.168.1.3");
	fsconfig(peer2, FSCONFIG_SET_STRING, "address", "udp:192.168.1.4+6001");
	fsconfig(peer2, FSCONFIG_CMD_SET_UP_PEER, NULL, NULL, 0);

	fs = fsopen("nfs", 0);
	fsconfig(fs, FSCONFIG_SET_PEER, "peer.1", NULL, peer1);
	fsconfig(fs, FSCONFIG_SET_PEER, "peer.2", NULL, peer2);
	fsconfig(fs, FSCONFIG_SET_STRING, "source", "/home/dhowells", 0);
	m = fsmount(fs, 0, 0);

[Note that Eric's oft-repeated point about the 'creation' operation altering
established parameters still stands here.]

You could also then reopen it for configuration, maybe by:

	peer = fspick(AT_FDCWD, "/mnt", FSPICK_PEER);

or:

	peer = fspick(AT_FDCWD, "nfs:server.home", FSPICK_PEER_BY_NAME);

though it might be better to give it its own syscall:

	peer = fspeer("nfs", "server.home", O_CLOEXEC);
	fsconfig(peer, FSCONFIG_SET_NS, "net", NULL, netns_fd);
	...
	fsconfig(peer, FSCONFIG_CMD_SET_UP_PEER, NULL, NULL, 0);

In terms of alternative interfaces, I'm not sure how easy it would be to make
it like cgroups, where you go and create a dir in a special filesystem, say,
"/sys/peers/nfs", because the peer records and names would have to be network
namespaced. Also, it might make it more difficult to use to create a root fs.
On the other hand, being able to adjust the peer configuration by:

	echo 71 >/sys/peers/nfs/server.home/timeo

does have a certain appeal.

Also, netlink might be the right option, but I'm not sure how you'd pin the
resultant object whilst you make use of it.

A further thought: is it worth making this idea more general and
encompassing non-network devices also? This would run into issues of some
logical sources being visible across namespaces but not others.

David
> On Aug 15, 2018, at 9:31 AM, David Howells <dhowells@redhat.com> wrote:
>
> Having just re-ported NFS on top of the new mount API stuff, I find that I
> don't really like the idea of superblocks being separated by communication
> parameters - especially when it might seem reasonable to be able to adjust
> those parameters.
>
> Does it make sense to abstract out the remote peer and allow (a) that to be
> configured separately from any superblocks using it and (b) that to be used to
> create superblocks?
>
> Note that what a 'remote peer' is would be different for different
> filesystems:

...

I think this looks rather nice. But maybe you should generalize the concept
of “peer” so that it works for btrfs too. In the case where you mount two
different subvolumes, you’re creating a *something*, and you’re then creating
a filesystem that references it. It’s almost the same thing.

>	fs = fsopen("nfs", 0);
>	fsconfig(fs, FSCONFIG_SET_PEER, "peer.1", NULL, peer1);

As you mention below, this seems like it might have namespacing issues.

> In terms of alternative interfaces, I'm not sure how easy it would be to make
> it like cgroups where you go and create a dir in a special filesystem, say,
> "/sys/peers/nfs", because the peers records and names would have to be network
> namespaced. Also, it might make it more difficult to use to create a root fs.
>
> On the other hand, being able to adjust the peer configuration by:
>
>	echo 71 >/sys/peers/nfs/server.home/timeo
>
> does have a certain appeal.
>
> Also, netlink might be the right option, but I'm not sure how you'd pin the
> resultant object whilst you make use of it.

My suggestion would be to avoid giving these things names at all. I think
that referring to them by fd should be sufficient, especially if you allow
them to be reopened based on a mount that uses them and allow them to get
bind-mounted somewhere a la namespaces to make them permanent if needed.
> A further thought is that is it worth making this idea more general and
> encompassing non-network devices also? This would run into issues of some
> logical sources being visible across namespaces and but not others.

Indeed :) It probably pays to rope a btrfs person into this discussion.
Quoting James Morris (jmorris@namei.org):
> On Mon, 13 Aug 2018, Al Viro wrote:
>
> > On Mon, Aug 13, 2018 at 09:48:53AM -0700, Andy Lutomirski wrote:
> >
> > > Are there cases I'm missing? It sounds like the API could be improved
> > > to fully model the last case, and everything will work nicely.
> >
> > You know, that's starting to remind of this little gem of Borges:
> > http://www.alamut.com/subj/artiface/language/johnWilkins.html
> > Especially the delightful (fake) quote contained in there:
> > 	[...] it is written that the animals are divided into:
> > 	(a) belonging to the emperor,
> > 	(b) embalmed,
> > 	(c) tame,
> > 	(d) sucking pigs,
> > 	(e) sirens,
> > 	(f) fabulous,
> > 	(g) stray dogs,
> > 	(h) included in the present classification,
> > 	(i) frenzied,
> > 	(j) innumerable,
> > 	(k) drawn with a very fine camelhair brush,
> > 	(l) et cetera,
> > 	(m) having just broken the water pitcher,
> > 	(n) that from a long way off look like flies.
>
> Coincidentally, this was also the model for Linux capabilities.

But maybe we want to split the stray dogs up by breed.
This is worth further detailed discussion re: SMB3, as there are some
fascinating protocol features that might help here, but my first thought is
just the obvious one - this could help 'DFS' (the global name space feature
that almost all modern CIFS/SMB3 implementations support) work a little
better in the client.

A share can be represented by an array of \\server\share\path targets,
although typically only one except in the DFS case (and server can be an
ipv4 or ipv6 address or a host name, which could have multiple addresses).
It could be over RDMA, TCP, and even other protocols (as the transport).
There are various examples of DFS referrals in
https://msdn.microsoft.com/en-us/library/cc227066.aspx section 4.

But since SMB3 also supports transparent failover, and "share move" and
"server move" features, as well as multichannel - I would like to better
understand the patch set to see if it helps/hurts. But until I dive into the
patch set more and try it, it is hard for me to speculate. Has anyone looked
at the CIFS/SMB3 changes needed?

On Wed, Aug 15, 2018 at 11:32 AM David Howells <dhowells@redhat.com> wrote:
>
> Having just re-ported NFS on top of the new mount API stuff, I find that I
> don't really like the idea of superblocks being separated by communication
> parameters - especially when it might seem reasonable to be able to adjust
> those parameters.
>
> Does it make sense to abstract out the remote peer and allow (a) that to be
> configured separately from any superblocks using it and (b) that to be used to
> create superblocks?
>
> Note that what a 'remote peer' is would be different for different
> filesystems:
>
> (*) For NFS, it would probably be a named server, with address(es) attached
>     to the name. In lieu of actually having a name, the initial IP address
>     could be used.
>
> (*) For CIFS, it would probably be a named server. I'm not sure if CIFS
>     allows an abstraction for a share that can move about inside a domain.
CIFS/SMB3 has fairly mature support (in the protocol) for various types of
share redirection - not just the 'DFS' that is supported by almost every NAS
server, and by Macs, Windows, and Linux clients. There are also very
interesting features introduced with SMB 3.1.1 allowing "tree connect
contexts", which some important servers have implemented in the last few
years. This is worth more discussion - SMB3 (in particular the SMB 3.1.1
dialect) has a lot of interesting features here.
David Howells <dhowells@redhat.com> writes:

> Having just re-ported NFS on top of the new mount API stuff, I find that I
> don't really like the idea of superblocks being separated by communication
> parameters - especially when it might seem reasonable to be able to adjust
> those parameters.
>
> Does it make sense to abstract out the remote peer and allow (a) that to be
> configured separately from any superblocks using it and (b) that to be used to
> create superblocks?
>
> Note that what a 'remote peer' is would be different for different
> filesystems:
>
> (*) For NFS, it would probably be a named server, with address(es) attached
>     to the name. In lieu of actually having a name, the initial IP address
>     could be used.
>
> (*) For CIFS, it would probably be a named server. I'm not sure if CIFS
>     allows an abstraction for a share that can move about inside a domain.
>
> (*) For AFS, it would be a cell, I think, where the actual fileserver(s) used
>     are a matter of direction from the Volume Location server.
>
> (*) For 9P and Ceph, I don't really know.
>
> What could be configured? Well, addresses, ports, timeouts. Maybe protocol
> level negotiation - though not being able to explicitly specify, say, the
> particular version and minorversion on an NFS share would be problematic for
> backward compatibility.
>
> One advantage it could give us is that it might make it easier if someone asks
> for server X to query userspace in some way for the default parameters for X
> are.
>
> What might this look like in terms of userspace?
> Well, we could overload the new mount API:
>
>	peer1 = fsopen("nfs", FSOPEN_CREATE_PEER);
>	fsconfig(peer1, FSCONFIG_SET_NS, "net", NULL, netns_fd);
>	fsconfig(peer1, FSCONFIG_SET_STRING, "peer_name", "server.home");
>	fsconfig(peer1, FSCONFIG_SET_STRING, "vers", "4.2");
>	fsconfig(peer1, FSCONFIG_SET_STRING, "address", "tcp:192.168.1.1");
>	fsconfig(peer1, FSCONFIG_SET_STRING, "address", "tcp:192.168.1.2");
>	fsconfig(peer1, FSCONFIG_SET_STRING, "timeo", "122");
>	fsconfig(peer1, FSCONFIG_CMD_SET_UP_PEER, NULL, NULL, 0);
>
>	peer2 = fsopen("nfs", FSOPEN_CREATE_PEER);
>	fsconfig(peer2, FSCONFIG_SET_NS, "net", NULL, netns_fd);
>	fsconfig(peer2, FSCONFIG_SET_STRING, "peer_name", "server2.home");
>	fsconfig(peer2, FSCONFIG_SET_STRING, "vers", "3");
>	fsconfig(peer2, FSCONFIG_SET_STRING, "address", "tcp:192.168.1.3");
>	fsconfig(peer2, FSCONFIG_SET_STRING, "address", "udp:192.168.1.4+6001");
>	fsconfig(peer2, FSCONFIG_CMD_SET_UP_PEER, NULL, NULL, 0);
>
>	fs = fsopen("nfs", 0);
>	fsconfig(fs, FSCONFIG_SET_PEER, "peer.1", NULL, peer1);
>	fsconfig(fs, FSCONFIG_SET_PEER, "peer.2", NULL, peer2);
>	fsconfig(fs, FSCONFIG_SET_STRING, "source", "/home/dhowells", 0);
>	m = fsmount(fs, 0, 0);
>
> [Note that Eric's oft-repeated point about the 'creation' operation altering
> established parameters still stands here.]
>
> You could also then reopen it for configuration, maybe by:
>
>	peer = fspick(AT_FDCWD, "/mnt", FSPICK_PEER);
>
> or:
>
>	peer = fspick(AT_FDCWD, "nfs:server.home", FSPICK_PEER_BY_NAME);
>
> though it might be better to give it its own syscall:
>
>	peer = fspeer("nfs", "server.home", O_CLOEXEC);
>	fsconfig(peer, FSCONFIG_SET_NS, "net", NULL, netns_fd);
>	...
>	fsconfig(peer, FSCONFIG_CMD_SET_UP_PEER, NULL, NULL, 0);
>
> In terms of alternative interfaces, I'm not sure how easy it would be to make
> it like cgroups where you go and create a dir in a special filesystem, say,
> "/sys/peers/nfs", because the peers records and names would have to be network
> namespaced. Also, it might make it more difficult to use to create a root fs.
>
> On the other hand, being able to adjust the peer configuration by:
>
>	echo 71 >/sys/peers/nfs/server.home/timeo
>
> does have a certain appeal.
>
> Also, netlink might be the right option, but I'm not sure how you'd pin the
> resultant object whilst you make use of it.
>
> A further thought is that is it worth making this idea more general and
> encompassing non-network devices also? This would run into issues of some
> logical sources being visible across namespaces and but not others.

Even network filesystems are going to have the challenge of filesystems
being visible in some network namespaces and not others, as some filesystems
will be visible on the internet and some will only be visible on the
appropriate local network. Network namespaces are sometimes used to deal
with the case of local networks with overlapping ip addresses.

I think you are proposing a model for network filesystems that is
essentially the situation we are in with most block device filesystems
today, where some parameters identify the local filesystem instance and some
parameters describe how the kernel interacts with that filesystem instance.

For system efficiency there is a strong argument for having the fewest
number of filesystem instances we can. Otherwise we will be caching the same
data twice, wasting space in RAM, etc. So I like the idea.

At least for devpts we always create a new filesystem instance every time
mount(2) is called. NFS seems to have the option to create a new filesystem
instance every time mount(2) is called as well, even if the filesystem
parameters are the same.
And depending on the case I can see the attraction for other filesystems as
well. So I don't think we can completely abandon the option for filesystems
to always create a new filesystem instance when mount(8) is called.

I most definitely support thinking this through and figuring out how it best
makes sense for the new filesystem API to create new filesystem instances,
or to fail to create new filesystem instances.

Eric
On Thu, Aug 16, 2018 at 2:56 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> David Howells <dhowells@redhat.com> writes:
>
> > Having just re-ported NFS on top of the new mount API stuff, I find that I
> > don't really like the idea of superblocks being separated by communication
> > parameters - especially when it might seem reasonable to be able to adjust
> > those parameters.
> >
> > Does it make sense to abstract out the remote peer and allow (a) that to be
> > configured separately from any superblocks using it and (b) that to be used to
> > create superblocks?
<snip>
> At least for devpts we always create a new filesystem instance every
> time mount(2) is called. NFS seems to have the option to create a new
> filesystem instance every time mount(2) is called as well, (even if the
> filesystem parameters are the same). And depending on the case I can
> see the attraction for other filesystems as well.
>
> So I don't think we can completely abandon the option for filesystems
> to always create a new filesystem instance when mount(8) is called.

In cifs we attempt to match new mounts to existing tree connections
(instances of connections to a \\server\share) from other mount(s), based
first on whether the security settings match (e.g. are both Kerberos) and
then on whether encryption is on/off and whether this is a snapshot mount
(the smb3 "previous versions" feature). If neither is mounted with a
snapshot and the encryption settings match, then we will use the same tree
id to talk with the server as the other mounts use. Interesting idea to
allow mount to force a new tree id.

What was the NFS mount option you were talking about? Looking at the nfs
man page, the only one that looked similar was "nosharecache".

> I most definitely support thinking this through and figuring out how it
> best make sense for the new filesystem API to create new filesystem
> instances or fail to create new filesystems instances.

Yes - it is an interesting question.
Steve French <smfrench@gmail.com> writes:

> On Thu, Aug 16, 2018 at 2:56 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> David Howells <dhowells@redhat.com> writes:
>>
>> > Having just re-ported NFS on top of the new mount API stuff, I find that I
>> > don't really like the idea of superblocks being separated by communication
>> > parameters - especially when it might seem reasonable to be able to adjust
>> > those parameters.
>> >
>> > Does it make sense to abstract out the remote peer and allow (a) that to be
>> > configured separately from any superblocks using it and (b) that to be used to
>> > create superblocks?
> <snip>
>> At least for devpts we always create a new filesystem instance every
>> time mount(2) is called. NFS seems to have the option to create a new
>> filesystem instance every time mount(2) is called as well, (even if the
>> filesystem parameters are the same). And depending on the case I can
>> see the attraction for other filesystems as well.
>>
>> So I don't think we can completely abandon the option for filesystems
>> to always create a new filesystem instance when mount(8) is called.
>
> In cifs we attempt to match new mounts to existing tree connections
> (instances of connections to a \\server\share) from other mount(s)
> based first on whether security settings match (e.g. are both
> Kerberos) and then on whether encryption is on/off and whether this is
> a snapshot mount (smb3 previous versions feature). If neither is
> mounted with a snaphsot and the encryption settings match then
> we will use the same tree id to talk with the server as the other
> mounts use. Interesting idea to allow mount to force a new
> tree id.
>
> What was the NFS mount option you were talking about?
> Looking at the nfs man page the only one that looked similar
> was "nosharecache"

I was remembering this from reading the nfs mount code:

	static int nfs_compare_super(struct super_block *sb, void *data)
	{
		...
		if (!nfs_compare_super_address(old, server))
			return 0;
		/* Note: NFS_MOUNT_UNSHARED == NFS4_MOUNT_UNSHARED */
		if (old->flags & NFS_MOUNT_UNSHARED)
			return 0;
		...
	}

If a filesystem has NFS_MOUNT_UNSHARED set it does not serve as a candidate
for new mount requests. Skimming the code, it looks like nosharecache is
what sets NFS_MOUNT_UNSHARED.

Another interesting and common case is tmpfs, which always creates a new
filesystem instance whenever it is mounted.

Eric
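[Editorial aside: the sharing decision quoted above can be sketched as a toy
model. This is not kernel code; the flag value and the Super class are
illustrative stand-ins for the real nfs_server/super_block structures, and
the address comparison is simplified to string equality.]

```python
# Toy model of the superblock-matching decision in nfs_compare_super():
# an existing instance is reused only when the server address matches and
# neither the existing nor the requested mount was made "nosharecache"
# (modeled here as the illustrative NFS_MOUNT_UNSHARED flag).

NFS_MOUNT_UNSHARED = 0x8000  # illustrative flag value, not the kernel's

class Super:
    def __init__(self, addr, flags=0):
        self.addr = addr      # stand-in for the server address comparison
        self.flags = flags

def compare_super(old, new):
    """Return True if a new mount request may share the 'old' instance."""
    if old.addr != new.addr:
        return False          # different server: never share
    if old.flags & NFS_MOUNT_UNSHARED:
        return False          # existing instance opted out of sharing
    if new.flags & NFS_MOUNT_UNSHARED:
        return False          # new request asked for its own instance
    return True

shared = Super("192.168.1.1:/export")
unshared = Super("192.168.1.1:/export", flags=NFS_MOUNT_UNSHARED)

print(compare_super(shared, Super("192.168.1.1:/export")))    # True: reuse
print(compare_super(unshared, Super("192.168.1.1:/export")))  # False: new instance
```

This also models the devpts and tmpfs behavior mentioned above as the
degenerate case where every instance carries the unshared flag.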
Steve French <smfrench@gmail.com> writes:
> In cifs we attempt to match new mounts to existing tree connections
> (instances of connections to a \\server\share) from other mount(s)
> based first on whether security settings match (e.g. are both
> Kerberos) and then on whether encryption is on/off and whether this is
> a snapshot mount (smb3 previous versions feature). If neither is
> mounted with a snaphsot and the encryption settings match then
> we will use the same tree id to talk with the server as the other
> mounts use. Interesting idea to allow mount to force a new
> tree id.

We actually already have this mount option in cifs.ko, it's "nosharesock".

> What was the NFS mount option you were talking about?
> Looking at the nfs man page the only one that looked similar
> was "nosharecache"

Cheers,
On Thu, Aug 16, 2018 at 12:23 PM Aurélien Aptel <aaptel@suse.com> wrote:
>
> Steve French <smfrench@gmail.com> writes:
> > In cifs we attempt to match new mounts to existing tree connections
> > (instances of connections to a \\server\share) from other mount(s)
> > based first on whether security settings match (e.g. are both
> > Kerberos) and then on whether encryption is on/off and whether this is
> > a snapshot mount (smb3 previous versions feature). If neither is
> > mounted with a snaphsot and the encryption settings match then
> > we will use the same tree id to talk with the server as the other
> > mounts use. Interesting idea to allow mount to force a new
> > tree id.
>
> We actually already have this mount option in cifs.ko, it's "nosharesock".

Yes - good point. It is very easy to do in cifs. I mainly use that to
simulate multiple clients for testing servers: each mount to the same
server, whether or not the share matches, looks like a different client,
coming from a different socket and thus with different session ids and tree
ids as well. It is very useful when trying to simulate multiple clients
running against the same server while using only one client machine (or VM).

> > What was the NFS mount option you were talking about?
> > Looking at the nfs man page the only one that looked similar
> > was "nosharecache"

The nfs man page apparently discourages its use:

  "As of kernel 2.6.18, the behavior specified by nosharecache is legacy
  caching behavior. This is considered a data risk"
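[Editorial aside: the cifs tree-connection matching that Steve describes can
be sketched as a toy model. This is illustrative Python, not cifs.ko code;
the TreeCon fields and match_tcon helper are invented names standing in for
the real session/tcon matching logic.]

```python
# Toy model of cifs.ko tree-connection reuse as described in the thread:
# a new mount reuses an existing tree connection to \\server\share only
# when the security type matches, the encryption setting matches, and
# neither side is a snapshot ("previous versions") mount; "nosharesock"
# forces a fresh connection unconditionally.

from dataclasses import dataclass

@dataclass
class TreeCon:
    unc: str        # \\server\share
    sec: str        # e.g. "krb5" or "ntlmssp"
    encrypt: bool   # SMB3 encryption on/off
    snapshot: bool  # smb3 previous-versions mount

def match_tcon(existing, want, nosharesock=False):
    """Return True if 'want' may reuse the 'existing' tree connection."""
    if nosharesock:
        return False  # pretend to be a brand-new client
    return (existing.unc == want.unc
            and existing.sec == want.sec
            and existing.encrypt == want.encrypt
            and not existing.snapshot
            and not want.snapshot)

a = TreeCon(r"\\srv\share", "krb5", encrypt=True, snapshot=False)
print(match_tcon(a, TreeCon(r"\\srv\share", "krb5", True, False)))     # True
print(match_tcon(a, TreeCon(r"\\srv\share", "ntlmssp", True, False)))  # False
```

The nosharesock path is what makes one client machine look like several
independent clients to the server, as described above.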
On Thu, Aug 16, 2018 at 12:06:06AM -0500, Eric W. Biederman wrote:
> So I don't think we can completely abandon the option for filesystems
> to always create a new filesystem instance when mount(8) is called.

Huh? If a filesystem wants to create a new instance on each ->mount(), it
can bloody well do so. Quite a few do - if that fs can handle it, more power
to it. The problem is what to do with filesystems that *can't* do that. You
really, really can't have two ext4 (or xfs, etc.) instances over the same
device at the same time. Cache coherency, locking, etc. will kill you. And
that's not to mention the joy of defining the semantics of having the same
ext4 mounted with two logs at the same time ;-)

I've seen "reject unless the options are compatible/identical/whatever", but
that ignores the real problem with the existing policy. It's *NOT* "I've
mounted this and got an existing instance with non-matching options". That's
a minor annoyance (and back when that decision had been made, mount(2) was
very definitely root-only). The real problem is different and much worse -
it's remount. I have asked to mount something and it had already been
mounted, with identical options. OK, so what happens if I do mount -o
remount on my instance?

*IF* we are operating in the "only sysadmin can mount new filesystems"
world, it's not a big deal - there are already lots of ways you can shoot
yourself in the foot and mount(2) is certainly a powerful one. But if we get
to "Joe R. Luser can do it in his container", we have a big problem.

The decision back then had been made mostly for usability reasons - it was
back in 2001 (well before the containermania, userns or anything of that
sort) and it was more about "how many hoops does one have to jump through to
get something mounted, assuming the sanity of the sysadmin doing that?". If
*anything* like userns had been a concern back then, it probably would've
been different.
However, it's 17 years too late and if anyone has a functional TARDIS, I can easily think of better uses for it...
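[Editorial aside: the remount hazard Al describes can be made concrete with
a toy model. This is illustrative Python, not VFS code; the Superblock and
Mount classes are invented stand-ins for the kernel's super_block and mount
structures.]

```python
# Toy illustration of the remount hazard: when two mount requests are
# silently given the same filesystem instance, "mount -o remount" through
# either one changes the options that both of them see.

class Superblock:
    def __init__(self, options):
        self.options = dict(options)

class Mount:
    def __init__(self, sb):
        self.sb = sb              # mounts share the superblock, not a copy
    def remount(self, **opts):
        self.sb.options.update(opts)   # remount acts on the shared instance

sb = Superblock({"acl": True})
m1, m2 = Mount(sb), Mount(sb)     # second mount() reused the first instance
m2.remount(acl=False)             # the unprivileged user remounts "his" mount...
print(m1.sb.options["acl"])       # ...and the first mount now sees acl=False
```

With root-only mount(2) this is merely a foot-gun; once unprivileged users
in containers can trigger the reuse, it becomes a cross-tenant problem.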