Message ID: 158230810644.2185128.16726948836367716086.stgit@warthog.procyon.org.uk (mailing list archive)
Series: VFS: Filesystem information and notifications [ver #17]
On Fri, 2020-02-21 at 18:01 +0000, David Howells wrote:
[...]
> ============================
> FILESYSTEM INFORMATION QUERY
> ============================
>
> The fsinfo() system call allows information about the filesystem at a
> particular path point to be queried as a set of attributes, some of
> which may have more than one value.
>
> Attribute values are of four basic types:
>
>  (1) Version-dependent-length structure (size defined by type).
>
>  (2) Variable-length string (up to 4096 bytes, including NUL).
>
>  (3) List of structures (up to INT_MAX size).
>
>  (4) Opaque blob (up to INT_MAX size).
>
> Attributes can have multiple values either as a sequence of values or
> a sequence-of-sequences of values, and all the values of a particular
> attribute must be of the same type.
>
> Note that the values of an attribute *are* allowed to vary between
> dentries within a single superblock, depending on the specific dentry
> that you're looking at, but all the values of an attribute have to be
> of the same type.
>
> I've tried to make the interface as light as possible, so an
> integer/enum attribute selector is used rather than a string, and the
> core does all the allocation and extensibility support work rather
> than leaving that to the filesystems. That means that for the first
> two attribute types, the filesystem will always see a
> sufficiently-sized buffer allocated. Further, this removes the
> possibility of the filesystem gaining access to the userspace buffer.
>
>
> fsinfo() allows a variety of information to be retrieved about a
> filesystem and the mount topology:
>
>  (1) General superblock attributes:
>
>      - Filesystem identifiers (UUID, volume label, device numbers, ...)
>      - The limits on a filesystem's capabilities.
>      - Information on supported statx fields and attributes and IOC
>        flags.
>      - A variety of single-bit flags indicating supported capabilities.
>      - Timestamp resolution and range.
>      - The amount of space/free space in a filesystem (as statfs()).
>      - Superblock notification counter.
>
>  (2) Filesystem-specific superblock attributes:
>
>      - Superblock-level timestamps.
>      - Cell name.
>      - Server names and addresses.
>      - Filesystem-specific information.
>
>  (3) VFS information:
>
>      - Mount topology information.
>      - Mount attributes.
>      - Mount notification counter.
>
>  (4) Information about what the fsinfo() syscall itself supports,
>      including the type and struct/element size of attributes.
>
> The system is extensible:
>
>  (1) New attributes can be added. There is no requirement that a
>      filesystem implement every attribute. Note that the core VFS
>      keeps a table of types and sizes so it can handle future
>      extensibility rather than delegating this to the filesystems.
>
>  (2) Version-dependent-length structure attributes can be made larger
>      and have additional information tacked on the end, provided the
>      layout of the existing fields is kept. If an older process asks
>      for a shorter structure, it will only be given the bits it asks
>      for. If a newer process asks for a longer structure on an older
>      kernel, the extra space will be set to 0. In all cases, the size
>      of the data actually available is returned.
>
>      In essence, the size of a structure is that structure's version:
>      a smaller size is an earlier version and a later version includes
>      everything that the earlier version did.
>
>  (3) New single-bit capability flags can be added. This is a
>      structure-typed attribute and, as such, (2) applies. Any bits you
>      wanted but the kernel doesn't support are automatically set to 0.
>
> fsinfo() may be called like the following, for example:
>
>	struct fsinfo_params params = {
>		.at_flags = AT_SYMLINK_NOFOLLOW,
>		.flags    = FSINFO_FLAGS_QUERY_PATH,
>		.request  = FSINFO_ATTR_AFS_SERVER_ADDRESSES,
>		.Nth      = 2,
>	};
>	struct fsinfo_server_address address;
>	len = fsinfo(AT_FDCWD, "/afs/grand.central.org/doc", &params,
>		     &address, sizeof(address));
>
> The above example would query an AFS filesystem to retrieve the
> address list for the 3rd server, and:
>
>	struct fsinfo_params params = {
>		.at_flags = AT_SYMLINK_NOFOLLOW,
>		.flags    = FSINFO_FLAGS_QUERY_PATH,
>		.request  = FSINFO_ATTR_AFS_CELL_NAME,
>	};
>	char cell_name[256];
>	len = fsinfo(AT_FDCWD, "/afs/grand.central.org/doc", &params,
>		     cell_name, sizeof(cell_name));
>
> would retrieve the name of an AFS cell as a string.
>
> In future, I want to make fsinfo() capable of querying a context
> created by fsopen() or fspick(), e.g.:
>
>	fd = fsopen("ext4", 0);
>	struct fsinfo_params params = {
>		.flags   = FSINFO_FLAGS_QUERY_FSCONTEXT,
>		.request = FSINFO_ATTR_PARAMETERS,
>	};
>	char buffer[65536];
>	fsinfo(fd, NULL, &params, buffer, sizeof(buffer));
>
> even if that context doesn't currently have a superblock attached. I
> would prefer this to contain length-prefixed strings so that there's
> no need to insert escaping, especially as any character, including
> '\', can be used as the separator in cifs, and so that binary
> parameters can be returned (though that is a lesser issue).

Could I make a suggestion about how this should be done in a way that
doesn't actually require the fsinfo syscall at all: it could just be
done with fsconfig. The idea is based on something I've wanted to do
for configfd but couldn't, because otherwise it wouldn't substitute for
fsconfig; but Christian made me think it was actually essential to the
ability of seccomp and other verifier tools in the critique of
configfd, and I believe the same critique applies here.

Instead of making fsconfig functionally configured ... as in, you pass
the attribute name, type and parameters down into the fs-specific
handler, and the handler does a string match, verifies the parameters
and then acts on them ... make it table-configured: each fstype
registers a table of attributes which can be got and optionally set
(with each attribute having a get function and an optional set
function). We'd have multiple tables per fstype, so the generic VFS can
register a table of attributes it understands for every fstype (things
like name, uuid and the like), and then each fstype would register a
table of fs-specific attributes following the same pattern. The system
would examine the fs-specific table before the generic one, allowing
overrides.

fsconfig would have the ability to both get and set attributes,
permitting retrieval as well as setting (which is how I get rid of the
fsinfo syscall). We'd have a global parameter which would retrieve the
entire table by name and type, so the whole thing is introspectable:
the upper layer knows a priori all the attributes which can be set for
a given fs type and what type they are (so we can make more of the
parsing generic). Any attribute which doesn't have a set routine would
be read-only, and all attributes would have to have a get routine,
meaning everything is queryable.

I think I know how to code this up in a way that would be fully
transparent to the existing syscalls.

James
On Fri, Feb 21, 2020 at 9:21 PM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
>
> On Fri, 2020-02-21 at 18:01 +0000, David Howells wrote:
> [...]
>
> Could I make a suggestion about how this should be done in a way that
> doesn't actually require the fsinfo syscall at all: it could just be
> done with fsconfig.
[...]
> Any attribute which doesn't have a set routine would be read only and
> all attributes would have to have a get routine meaning everything is
> queryable.

And that makes me wonder: would a
"/sys/class/fs/$ST_DEV/options/$OPTION" type interface be feasible for
this?

Thanks,
Miklos
On Mon, 2020-02-24 at 11:24 +0100, Miklos Szeredi wrote:
> On Fri, Feb 21, 2020 at 9:21 PM James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
> [...]
>
> And that makes me wonder: would a
> "/sys/class/fs/$ST_DEV/options/$OPTION" type interface be feasible
> for this?

Once it's table driven, certainly a sysfs directory becomes possible.
The problem with ST_DEV is filesystems like btrfs and xfs that may have
multiple devices. The current fsinfo takes an fspick'd directory fd, so
the input to the query is a path, which gets messy in sysfs, although I
could see something like /sys/class/fs/mount/<path>/$OPTION working.

James
On Mon, Feb 24, 2020 at 3:55 PM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> Once it's table driven, certainly a sysfs directory becomes possible.
> The problem with ST_DEV is filesystems like btrfs and xfs that may
> have multiple devices.

For XFS there's always a single sb->s_dev though; that's what st_dev
will be set to on all files.

A btrfs subvolume is sort of a lightweight superblock, so basically all
such st_dev's are aliases of the same master superblock. So lookup of
all subvolume st_dev's could result in referencing the same underlying
struct super_block (just like /proc/$PID will reference the same
underlying task group regardless of which task group member's PID is
used).

Having this info in sysfs would spare us a number of issues that a set
of new syscalls would bring. The question is, would that be enough, or
is there a reason that sysfs can't be used to present the various
filesystem-related information that fsinfo is supposed to present?

Thanks,
Miklos
Hi,

On 24/02/2020 15:28, Miklos Szeredi wrote:
[...]
> Having this info in sysfs would spare us a number of issues that a
> set of new syscalls would bring. The question is, would that be
> enough, or is there a reason that sysfs can't be used to present the
> various filesystem related information that fsinfo is supposed to
> present?

We need a unique id for superblocks anyway. I had wondered about using
s_dev some time back, but for the reasons mentioned earlier in this
thread I think it might just land up being confusing and difficult to
manage. While fake s_devs are created for sbs that don't have a device,
I can't help thinking that something closer to ifindex, but for
superblocks, is needed here. That would avoid the issue of which device
number to use.

In fact we need that anyway for the notifications, since without it
there is a race that can lead to missing remounts of the same device:
if a umount/mount pair is missed due to an overrun, fsinfo then returns
the same device as before, with potentially the same mount options too.
So I think a unique id for a superblock is a generically useful
feature, which would also allow for sensible sysfs directory naming, if
required.

Steve.
On Tue, 2020-02-25 at 12:13 +0000, Steven Whitehouse wrote:
[...]
> So I think a unique id for a superblock is a generically useful
> feature, which would also allow for sensible sysfs directory naming,
> if required.

But would this be informative and useful for the user? I'm sure we can
find a persistent id for a persistent superblock, but what about tmpfs
... that's going to have to change with every reboot. It's going to be
remarkably inconvenient, if I want to get fsinfo on /run, to have to
keep finding what the id is.

The other thing a file descriptor does that sysfs doesn't is that it
solves the information leak: if I'm in a mount namespace that has no
access to certain mounts, I can't fspick them and thus I can't see the
information. By default, with sysfs I can.

James
Hi,

On 25/02/2020 15:28, James Bottomley wrote:
[...]
> But would this be informative and useful for the user? I'm sure we
> can find a persistent id for a persistent superblock, but what about
> tmpfs ... that's going to have to change with every reboot. It's
> going to be remarkably inconvenient if I want to get fsinfo on /run
> to have to keep finding what the id is.

That is a different question though, or at least it might be... the
idea of the superblock id is to uniquely identify a particular
superblock. The mount notification should give you the association
between that superblock and any devices (assuming those are
applicable), or you can use fsinfo, if you were not listening to the
notifications at the time of the mount, to get the same information.

If someone unmounts /run and remounts it, then the superblock id would
change, but otherwise it would stay the same, so you know that it is
the same mount that is being described in future notifications. One of
the main aims here is to combine the fsinfo information with the
notifications in a race-free manner.

There are a number of ways one might want to specify a filesystem: by
device, by uuid, by volume label and so forth, but we can't use any of
those very easily as a unique id. Someone might remove a drive and
replace it with a different one (so same device, but different
content), or they might have two filesystems with the same uuid if
they've just done a dd copy to a new device. For the mount
notifications we need something that doesn't suffer from these issues,
but which can also be very easily associated with what in most cases
are more convenient ways to specify a particular filesystem.

> The other thing a file descriptor does that sysfs doesn't is that it
> solves the information leak: if I'm in a mount namespace that has no
> access to certain mounts, I can't fspick them and thus I can't see
> the information. By default, with sysfs I can.

Yes, that's true, and I wasn't advocating for the sysfs method over
fspick here, just pointing out that a unique superblock id would be a
generically useful thing to have.

Steve.
On Tue, Feb 25, 2020 at 4:29 PM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> The other thing a file descriptor does that sysfs doesn't is that it
> solves the information leak: if I'm in a mount namespace that has no
> access to certain mounts, I can't fspick them and thus I can't see
> the information. By default, with sysfs I can.

That's true, but procfs/sysfs has to deal with various namespacing
issues anyway. If this is just about hiding a number of entries, then
I don't think that's going to be a big deal.

The syscall API is efficient: a single syscall per query instead of
several, and no parsing necessary.

However, it is difficult to extend, because the ABI must be updated,
and possibly libc and util-linux also, so that scripts can consume the
new parameter. With the sysfs approach only the kernel needs to be
updated, and possibly only the filesystem code, not even the VFS.

So I think the question comes down to: do we need a highly efficient
way to query the superblock parameters all at once, or not?

Thanks,
Miklos
Hi,

On 26/02/2020 09:11, Miklos Szeredi wrote:
[...]
> So I think the question comes down to: do we need a highly efficient
> way to query the superblock parameters all at once, or not?

That is Ian's use case for autofs, I think, and it will also be what is
needed at start-up by most applications using the fs notifications, as
well as at resync time if there has been an overrun leading to lost fs
notification messages. We do need a solution that can scale to large
numbers of mounts efficiently. Being able to extend it is also an
important consideration, so hopefully David has a solution to that.

Steve.
On Wed, 2020-02-26 at 10:11 +0100, Miklos Szeredi wrote:
[...]
> That's true, but procfs/sysfs has to deal with various namespacing
> issues anyway. If this is just about hiding a number of entries,
> then I don't think that's going to be a big deal.

I didn't see name space considerations in sysfs when I was looking at
it recently. Obeying name space requirements is likely a lot of work in
sysfs.

> So I think the question comes down to: do we need a highly efficient
> way to query the superblock parameters all at once, or not?

Or a similar question could be: how could a sysfs interface work to
provide mount information? Getting information about all mounts might
not be too bad, but the sysfs directory structure that would be needed
to represent all system mounts (without considering name spaces) would
likely result in somewhat busy user space code.

For example, given a path, and the path is all I know, how do I get
mount information? Ignoring possible multiple mounts on a mount point:
call fsinfo() with the path to get the id (the path walk is low
overhead), then use that id with fsinfo() to get all the info I need
... done.

Again ignoring possible multiple mounts on a mount point, and assuming
there is a sysfs tree enumerating all the system mounts: I could open
<sysfs base> + mount point path, followed by opening and reading the
individual attribute files ... a bit more busy than that one ...
particularly if I need to do it for several thousand mounts.

Then there's the code that would need to be added to maintain the
various views in the sysfs tree, which can't be restricted only to the
VFS because there's file system specific info needed too (the
maintain-a-table idea), and that's before considering name space
handling changes to sysfs.

At the least, the question of "do we need a highly efficient way to
query the superblock parameters all at once" needs to be extended to
include mount table enumeration as well as getting the info.

But this is just me thinking about mount table handling and the quite
significant problem we now have with user space scanning the proc mount
tables to get this information.

Ian
On Thu, Feb 27, 2020 at 6:06 AM Ian Kent <raven@themaw.net> wrote:
> At the least the question of "do we need a highly efficient way
> to query the superblock parameters all at once" needs to be
> extended to include mount table enumeration as well as getting
> the info.
>
> But this is just me thinking about mount table handling and the
> quite significant problem we now have with user space scanning
> the proc mount tables to get this information.

Right.

So the problem is that currently autofs needs to rescan the proc mount
table on every change. The solution to that is to

 - add a notification mechanism
 - and a way to selectively query mount/superblock information

right?

For the notification we have uevents in sysfs, which also supplies the
changed parameters. Setting aside namespace issues and the addressing
of mounts, would this work for autofs?

Thanks,
Miklos
On Thu, 2020-02-27 at 10:36 +0100, Miklos Szeredi wrote:
> On Thu, Feb 27, 2020 at 6:06 AM Ian Kent <raven@themaw.net> wrote:
>
> > At the least the question of "do we need a highly efficient way
> > to query the superblock parameters all at once" needs to be
> > extended to include mount table enumeration as well as getting
> > the info.
> >
> > But this is just me thinking about mount table handling and the
> > quite significant problem we now have with user space scanning
> > the proc mount tables to get this information.
>
> Right.
>
> So the problem is that currently autofs needs to rescan the proc
> mount table on every change. The solution to that is to

Actually no, that's not quite the problem I see.

autofs handles large mount tables fairly well (necessarily) and in time
I plan to remove the need to read the proc tables at all (that's proven
very difficult but I'll get back to that).

This has to be done to resolve the age-old problem of autofs not being
able to handle large direct mount maps. But, because of the large
number of mounts associated with large direct mount maps, other system
processes are badly affected too.

So the problem I want to see fixed is the effect of very large mount
tables on other user space applications, particularly the effect when a
large number of mounts or umounts are performed.

Clearly large mount tables not only result from autofs, and the
problems caused by them are slightly different to the mount and umount
problem I describe. But they are a problem nevertheless, in the sense
that frequent notifications that lead to reading a large proc mount
table have significant overhead that can't be avoided, because the
table may have changed since the last time it was read.

It's easy to cause several system processes to peg a fair number of
CPUs when a large number of mounts/umounts are being performed, namely
systemd, udisks2 and some others.
Also I've seen a couple of application processes badly affected purely
by the presence of a large number of mounts in the proc tables; that's
not quite so bad though.

> > - add a notification mechanism

- lookup a mount based on path

> > - and a way to selectively query mount/superblock information

based on path ...

> > right?
> >
> > For the notification we have uevents in sysfs, which also supplies
> > the changed parameters. Taking aside namespace issues and addressing
> > mounts would this work for autofs?

The parameters supplied by the notification mechanism are important.

The place this is needed will be libmount, since it catches a broad
number of user space applications, including those I mentioned above
(well at least systemd, I think also udisks2, very probably others).

So that means mount table info needs to be maintained; whether that can
be achieved using sysfs I don't know. Creating and maintaining the
sysfs tree would be a big challenge I think.

But before trying to work out how to use a notification mechanism, just
having a way to get the info provided by the proc tables using a path
alone should give an initial immediate improvement in libmount.

Ian
On Thu, Feb 27, 2020 at 12:34 PM Ian Kent <raven@themaw.net> wrote:
>
> On Thu, 2020-02-27 at 10:36 +0100, Miklos Szeredi wrote:
> > On Thu, Feb 27, 2020 at 6:06 AM Ian Kent <raven@themaw.net> wrote:
> >
> > > At the least the question of "do we need a highly efficient way
> > > to query the superblock parameters all at once" needs to be
> > > extended to include mount table enumeration as well as getting
> > > the info.
> > >
> > > But this is just me thinking about mount table handling and the
> > > quite significant problem we now have with user space scanning
> > > the proc mount tables to get this information.
> >
> > Right.
> >
> > So the problem is that currently autofs needs to rescan the proc
> > mount table on every change. The solution to that is to
>
> Actually no, that's not quite the problem I see.
>
> autofs handles large mount tables fairly well (necessarily) and
> in time I plan to remove the need to read the proc tables at all
> (that's proven very difficult but I'll get back to that).
>
> This has to be done to resolve the age old problem of autofs not
> being able to handle large direct mount maps. But, because of
> the large number of mounts associated with large direct mount
> maps, other system processes are badly affected too.
>
> So the problem I want to see fixed is the effect of very large
> mount tables on other user space applications, particularly the
> effect when a large number of mounts or umounts are performed.
>
> Clearly large mount tables not only result from autofs and the
> problems caused by them are slightly different to the mount and
> umount problem I describe. But they are a problem nevertheless
> in the sense that frequent notifications that lead to reading
> a large proc mount table has significant overhead that can't be
> avoided because the table may have changed since the last time
> it was read.
>
> It's easy to cause several system processes to peg a fair number
> of CPU's when a large number of mounts/umounts are being performed,
> namely systemd, udisks2 and a some others. Also I've seen couple
> of application processes badly affected purely by the presence of
> a large number of mounts in the proc tables, that's not quite so
> bad though.
>
> > > - add a notification mechanism
>
> - lookup a mount based on path
>
> > > - and a way to selectively query mount/superblock information
>
> based on path ...
>
> > > right?
> > >
> > > For the notification we have uevents in sysfs, which also supplies
> > > the changed parameters. Taking aside namespace issues and addressing
> > > mounts would this work for autofs?
>
> The parameters supplied by the notification mechanism are important.
>
> The place this is needed will be libmount since it catches a broad
> number of user space applications, including those I mentioned above
> (well at least systemd, I think also udisks2, very probably others).
>
> So that means mount table info. needs to be maintained, whether that
> can be achieved using sysfs I don't know. Creating and maintaining
> the sysfs tree would be a big challenge I think.
>
> But before trying to work out how to use a notification mechanism
> just having a way to get the info provided by the proc tables using
> a path alone should give initial immediate improvement in libmount.

Adding Karel, Lennart, Zbigniew and util-linux@vger...

At a quick glance at libmount and systemd code, it appears that just
switching out the implementation in libmount will not be enough:
systemd is calling functions like mnt_table_parse_*() when it receives
a notification that the mount table changed.

What is the end purpose of parsing the mount tables? Can systemd guys
comment on that?

Thanks,
Miklos
On Thu, Feb 27, 2020 at 02:45:27PM +0100, Miklos Szeredi wrote:
> > So the problem I want to see fixed is the effect of very large
> > mount tables on other user space applications, particularly the
> > effect when a large number of mounts or umounts are performed.

Yes, now you have to generate (in kernel) and parse (in userspace) the
whole mount table to get information about just one mount table entry.
This is typical for umount or systemd.

> > > - add a notification mechanism
> > - lookup a mount based on path
> > > - and a way to selectively query mount/superblock information
> > based on path ...

For umount-like use-cases we need mountpoint/ to mount entry
conversion; I guess something like open(mountpoint/) + fsinfo() should
be good enough.

For systemd we need the same, but triggered by notification. The ideal
solution is to get a mount entry ID or FD from the notification and
later use this ID or FD to ask for details about the mount entry
(probably again fsinfo()). The notification has to be usable within an
epoll() set.

This solves 99% of our performance issues I guess.

> > So that means mount table info. needs to be maintained, whether that
> > can be achieved using sysfs I don't know. Creating and maintaining
> > the sysfs tree would be a big challenge I think.

It will still be necessary to get the complete mount table sometimes,
but not in performance sensitive scenarios.

I'm not sure about sysfs/; you need to somehow resolve namespaces, the
order of the mount entries (which one is the last one), etc. IMHO
translating a mountpoint path to a sysfs/ path will be complicated.

> > But before trying to work out how to use a notification mechanism
> > just having a way to get the info provided by the proc tables using
> > a path alone should give initial immediate improvement in libmount.
>
> Adding Karel, Lennart, Zbigniew and util-linux@vger...
>
> At a quick glance at libmount and systemd code, it appears that just
> switching out the implementation in libmount will not be enough:
> systemd is calling functions like mnt_table_parse_*() when it receives
> a notification that the mount table changed.

We're ready to change this stuff in systemd if there is something
better (something per-mount-entry). My plan is to add a new API to
libmount to query information about one mount entry (but I have had no
time to play with fsinfo yet).

> What is the end purpose of parsing the mount tables? Can systemd guys
> comment on that?

If mount/umount is triggered by systemd then it needs verification of
success and the final version of the mount options. It also reads
information from libmount to get userspace mount options (e.g. _netdev
-- libmount uses mount source, target and fsroot to join kernel and
userspace stuff).

And don't forget that mount units are part of systemd dependencies, so
umount/mount is an important event for systemd and it needs details
about the changes (what, where, ... etc.)

Karel
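For background on the notification primitive that already exists at this point (context only, not part of the fsinfo proposal): per proc(5), since Linux 2.6.15 an open fd on /proc/self/mounts is pollable, and poll()/epoll mark it with an error/priority event whenever the mount table changes. The event carries no detail about which mount changed, so the watcher must rescan the whole table -- exactly the overhead complained about in this thread. A minimal Linux-only sketch:

```python
import select

# The kernel flags POLLERR|POLLPRI on this fd on any mount-table
# change; it does not say WHICH mount changed.
mounts = open("/proc/self/mounts")
poller = select.poll()
poller.register(mounts.fileno(), select.POLLERR | select.POLLPRI)

# Non-blocking check here; a daemon would block (or put the fd in an
# epoll set, as Karel describes) and loop.
events = poller.poll(0)
if events:
    mounts.seek(0)
    table = mounts.read()   # forced full re-read and re-parse
    print("mount table changed: %d bytes to re-parse" % len(table))
```

An ID-carrying notification plus a per-entry query (the fsinfo() approach) would remove the full-table rescan step entirely.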
On Thu, 2020-02-27 at 14:45 +0100, Miklos Szeredi wrote:
> On Thu, Feb 27, 2020 at 12:34 PM Ian Kent <raven@themaw.net> wrote:
> > On Thu, 2020-02-27 at 10:36 +0100, Miklos Szeredi wrote:
> > > On Thu, Feb 27, 2020 at 6:06 AM Ian Kent <raven@themaw.net>
> > > wrote:
> > >
> > > > At the least the question of "do we need a highly efficient way
> > > > to query the superblock parameters all at once" needs to be
> > > > extended to include mount table enumeration as well as getting
> > > > the info.
> > > >
> > > > But this is just me thinking about mount table handling and the
> > > > quite significant problem we now have with user space scanning
> > > > the proc mount tables to get this information.
> > >
> > > Right.
> > >
> > > So the problem is that currently autofs needs to rescan the proc
> > > mount table on every change. The solution to that is to
> >
> > Actually no, that's not quite the problem I see.
> >
> > autofs handles large mount tables fairly well (necessarily) and
> > in time I plan to remove the need to read the proc tables at all
> > (that's proven very difficult but I'll get back to that).
> >
> > This has to be done to resolve the age old problem of autofs not
> > being able to handle large direct mount maps. But, because of
> > the large number of mounts associated with large direct mount
> > maps, other system processes are badly affected too.
> >
> > So the problem I want to see fixed is the effect of very large
> > mount tables on other user space applications, particularly the
> > effect when a large number of mounts or umounts are performed.
> >
> > Clearly large mount tables not only result from autofs and the
> > problems caused by them are slightly different to the mount and
> > umount problem I describe.
> > But they are a problem nevertheless
> > in the sense that frequent notifications that lead to reading
> > a large proc mount table has significant overhead that can't be
> > avoided because the table may have changed since the last time
> > it was read.
> >
> > It's easy to cause several system processes to peg a fair number
> > of CPU's when a large number of mounts/umounts are being performed,
> > namely systemd, udisks2 and a some others. Also I've seen couple
> > of application processes badly affected purely by the presence of
> > a large number of mounts in the proc tables, that's not quite so
> > bad though.
>
> > > > - add a notification mechanism
> >
> > - lookup a mount based on path
> >
> > > > - and a way to selectively query mount/superblock information
> >
> > based on path ...
> >
> > > > right?
> > > >
> > > > For the notification we have uevents in sysfs, which also
> > > > supplies the changed parameters. Taking aside namespace issues
> > > > and addressing mounts would this work for autofs?
> >
> > The parameters supplied by the notification mechanism are
> > important.
> >
> > The place this is needed will be libmount since it catches a broad
> > number of user space applications, including those I mentioned
> > above (well at least systemd, I think also udisks2, very probably
> > others).
> >
> > So that means mount table info. needs to be maintained, whether
> > that can be achieved using sysfs I don't know. Creating and
> > maintaining the sysfs tree would be a big challenge I think.
> >
> > But before trying to work out how to use a notification mechanism
> > just having a way to get the info provided by the proc tables using
> > a path alone should give initial immediate improvement in libmount.
>
> Adding Karel, Lennart, Zbigniew and util-linux@vger...
>
> At a quick glance at libmount and systemd code, it appears that just
> switching out the implementation in libmount will not be enough:
> systemd is calling functions like mnt_table_parse_*() when it
> receives a notification that the mount table changed.

Maybe I wasn't clear, my bad, sorry about that.

There's no question that change notification handling is needed too.

I'm claiming that an initial change to use something that can get the
mount information from a path alone, without the proc tables, will give
an "initial immediate improvement".

The work needed to implement mount table change notification handling
will take much more time, and exactly what changes that will bring is
not clear yet; I do plan to work on that too, together with Karel.

Ian
On Thu, 2020-02-27 at 16:14 +0100, Karel Zak wrote:
> On Thu, Feb 27, 2020 at 02:45:27PM +0100, Miklos Szeredi wrote:
> > > So the problem I want to see fixed is the effect of very large
> > > mount tables on other user space applications, particularly the
> > > effect when a large number of mounts or umounts are performed.
>
> Yes, now you have to generate (in kernel) and parse (in
> userspace) all mount table to get information about just
> one mount table entry. This is typical for umount or systemd.
>
> > > > - add a notification mechanism
> > > - lookup a mount based on path
> > > > - and a way to selectively query mount/superblock information
> > > based on path ...
>
> For umount-like use-cases we need mountpoint/ to mount entry
> conversion; I guess something like open(mountpoint/) + fsinfo()
> should be good enough.
>
> For systemd we need the same, but triggered by notification. The
> ideal solution is to get mount entry ID or FD from notification and
> later use this ID or FD to ask for details about the mount entry
> (probably again fsinfo()). The notification has to be usable with
> in epoll() set.
>
> This solves 99% of our performance issues I guess.
>
> > > So that means mount table info. needs to be maintained, whether
> > > that can be achieved using sysfs I don't know. Creating and
> > > maintaining the sysfs tree would be a big challenge I think.
>
> It will be still necessary to get complete mount table sometimes,
> but not in performance sensitive scenarios.

That was my understanding too.

Mount table enumeration is possible with fsinfo() but you still have to
handle each and every mount, so the improvement there is not going to
be as great as in cases where the proc mount table needs to be scanned
independently for an individual mount.

It will be somewhat more straightforward without the need to dissect
text records though.
>
> I'm not sure about sysfs/, you need somehow resolve namespaces, order
> of the mount entries (which one is the last one), etc. IMHO translate
> mountpoint path to sysfs/ path will be complicated.

I wonder about that too; after all, sysfs contains a tree of nodes from
which the view is created, unlike proc, which translates kernel
information directly based on what the process should see.

We'll need to wait a bit and see what Miklos has in mind for mount
table enumeration, and nothing has been said about namespaces yet.

While fsinfo() is not similar to proc, it does handle namespaces in a
sensible way via file handles, a bit similar to the proc fs, and
ordering is catered for in the fsinfo() enumeration in a natural way.
Not sure how that would be handled using sysfs ...

Ian
On Fri, Feb 28, 2020 at 1:43 AM Ian Kent <raven@themaw.net> wrote:
> > I'm not sure about sysfs/, you need somehow resolve namespaces, order
> > of the mount entries (which one is the last one), etc. IMHO translate
> > mountpoint path to sysfs/ path will be complicated.
>
> I wonder about that too, after all sysfs contains a tree of nodes
> from which the view is created unlike proc which translates kernel
> information directly based on what the process should see.
>
> We'll need to wait a bit and see what Miklos has in mind for mount
> table enumeration and nothing has been said about name spaces yet.

Adding Greg for sysfs knowledge.

As far as I understand, the sysfs model is, basically:

 - a list of devices sorted by class and address
 - with each class having a given set of attributes

Superblocks and mounts could get enumerated by a unique identifier.
mnt_id seems to be good for mounts; s_dev may or may not be good for
superblocks, but s_id (as introduced in this patchset) could be used
instead.

As for namespaces, that's "just" an access control issue, AFAICS. For
example, a task with a non-initial mount namespace should not have
access to attributes of mounts outside of its namespace. Checking
access to superblock attributes would be similar: scan the list of
mounts and only allow access if at least one mount would get access.

> While fsinfo() is not similar to proc it does handle name spaces
> in a sensible way via. file handles, a bit similar to the proc fs,
> and ordering is catered for in the fsinfo() enumeration in a natural
> way. Not sure how that would be handled using sysfs ...

I agree that the access control is much more straightforward with
fsinfo(2), and this may be the single biggest reason to introduce a new
syscall.

Let's see what others think.

Thanks,
Miklos
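The access rule sketched above can be written down in a few lines. This toy model (hypothetical dict-based structures, nothing like the kernel's) mainly makes visible that it is a linear scan of the namespace's mount list per superblock lookup -- the cost objected to later in the thread:

```python
def may_access_super(visible_mounts, super_id):
    # Allow reading a superblock's attributes iff at least one mount of
    # that superblock is visible in the task's mount namespace.  Note
    # the O(number of mounts) scan per lookup.
    return any(m["super"] == super_id for m in visible_mounts)

# Toy namespace with two visible mounts on two superblocks.
visible = [{"mnt_id": 26, "super": "s1"},
           {"mnt_id": 40, "super": "s2"}]
print(may_access_super(visible, "s1"), may_access_super(visible, "s3"))
```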
On Fri, Feb 28, 2020 at 09:35:17AM +0100, Miklos Szeredi wrote:
> On Fri, Feb 28, 2020 at 1:43 AM Ian Kent <raven@themaw.net> wrote:
> > > I'm not sure about sysfs/, you need somehow resolve namespaces, order
> > > of the mount entries (which one is the last one), etc. IMHO translate
> > > mountpoint path to sysfs/ path will be complicated.
> >
> > I wonder about that too, after all sysfs contains a tree of nodes
> > from which the view is created unlike proc which translates kernel
> > information directly based on what the process should see.
> >
> > We'll need to wait a bit and see what Miklos has in mind for mount
> > table enumeration and nothing has been said about name spaces yet.
>
> Adding Greg for sysfs knowledge.
>
> As far as I understand the sysfs model is, basically:
>
> - list of devices sorted by class and address
> - with each class having a given set of attributes

Close enough :)

> Superblocks and mounts could get enumerated by a unique identifier.
> mnt_id seems to be good for mounts, s_dev may or may not be good for
> superblock, but s_id (as introduced in this patchset) could be used
> instead.

So what would the sysfs tree look like with this?

> As for namespaces, that's "just" an access control issue, AFAICS.
> For example a task with a non-initial mount namespace should not have
> access to attributes of mounts outside of its namespace. Checking
> access to superblock attributes would be similar: scan the list of
> mounts and only allow access if at least one mount would get access.

sysfs does handle namespaces, look at how networking does this. But
it's not exactly the simplest thing to do, so be careful with that, as
this is going to be essential for this type of work.

thanks,

greg k-h
On Fri, 2020-02-28 at 09:35 +0100, Miklos Szeredi wrote:
> On Fri, Feb 28, 2020 at 1:43 AM Ian Kent <raven@themaw.net> wrote:
> > > I'm not sure about sysfs/, you need somehow resolve namespaces,
> > > order of the mount entries (which one is the last one), etc. IMHO
> > > translate mountpoint path to sysfs/ path will be complicated.
> >
> > I wonder about that too, after all sysfs contains a tree of nodes
> > from which the view is created unlike proc which translates kernel
> > information directly based on what the process should see.
> >
> > We'll need to wait a bit and see what Miklos has in mind for mount
> > table enumeration and nothing has been said about name spaces yet.
>
> Adding Greg for sysfs knowledge.
>
> As far as I understand the sysfs model is, basically:
>
> - list of devices sorted by class and address
> - with each class having a given set of attributes
>
> Superblocks and mounts could get enumerated by a unique identifier.
> mnt_id seems to be good for mounts, s_dev may or may not be good for
> superblock, but s_id (as introduced in this patchset) could be used
> instead.
>
> As for namespaces, that's "just" an access control issue, AFAICS.

That's an easy thing to say but not an easy thing to check: it can be
made so for label-based namespaces like the network, but the mount
namespace is shared/cloned tree based. Assessing whether a given
superblock is within your current namespace root can become a large
search exercise. You can see how much of one in fs/proc_namespace.c,
which controls how /proc/self/mounts appears in your current namespace.

> For example a task with a non-initial mount namespace should not have
> access to attributes of mounts outside of its namespace. Checking
> access to superblock attributes would be similar: scan the list of
> mounts and only allow access if at least one mount would get access.

That scan can be expensive as I explained above. That's really why I
think this is a bad idea.
Sysfs itself is currently nicely restricted to system information that
most containers don't need to know, so a lot of the sysfs issues with
containers can be solved by not mounting it. If you suddenly make it
required for filesystem information and notifications, that security
measure gets blown out of the water.

> > While fsinfo() is not similar to proc it does handle name spaces
> > in a sensible way via. file handles, a bit similar to the proc fs,
> > and ordering is catered for in the fsinfo() enumeration in a
> > natural way. Not sure how that would be handled using sysfs ...
>
> I agree that the access control is much more straightforward with
> fsinfo(2) and this may be the single biggest reason to introduce a
> new syscall.
>
> Let's see what others think.

Containers are file-based entities, so file descriptors are their most
natural thing, and they have full ACL protection within the container
(can't open the file, can't then get the fd). The other reason
container people like file descriptors (all the Xat system calls that
have been introduced) is that if we do actually need to break the
boundaries or privileges of the container, we can do so by getting the
orchestration system to pass in an fd the interior of the container
wouldn't have access to.

James
On Fri, Feb 28, 2020 at 4:09 PM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> Containers are file based entities, so file descriptors are their most
> natural thing and they have full ACL protection within the container
> (can't open the file, can't then get the fd). The other reason
> container people like file descriptors (all the Xat system calls that
> have been introduced) is that if we do actually need to break the
> boundaries or privileges of the container, we can do so by getting the
> orchestration system to pass in a fd the interior of the container
> wouldn't have access to.

Yeah, agreed about the simplicity of fd based access.

Then again, a filesystem interface would allow immediate access from
all scripts, languages, etc. That, I think, is a huge bonus compared to
the ioctl-like mess that the current proposal is, which would require
library, utility and language binding updates on all changes. Ugh.

One way to resolve that is to have the mount information
magic-symlinked from /proc/PID/fdmount/FD directly to the mountinfo
dir, which would then have a link into the sbinfo dir. With other
access denied to all except the sysadmin.

Would that work?

Thanks,
Miklos
On Tue, Feb 25, 2020 at 07:28:55AM -0800, James Bottomley wrote:
> On Tue, 2020-02-25 at 12:13 +0000, Steven Whitehouse wrote:
> > Hi,
> >
> > On 24/02/2020 15:28, Miklos Szeredi wrote:
> > > On Mon, Feb 24, 2020 at 3:55 PM James Bottomley
> > > <James.Bottomley@hansenpartnership.com> wrote:
> > >
> > > > Once it's table driven, certainly a sysfs directory becomes
> > > > possible. The problem with ST_DEV is filesystems like btrfs and
> > > > xfs that may have multiple devices.
> > >
> > > For XFS there's always a single sb->s_dev though, that's what
> > > st_dev will be set to on all files.
> > >
> > > Btrfs subvolume is sort of a lightweight superblock, so basically
> > > all such st_dev's are aliases of the same master superblock. So
> > > lookup of all subvolume st_dev's could result in referencing the
> > > same underlying struct super_block (just like /proc/$PID will
> > > reference the same underlying task group regardless of which of
> > > the task group member's PID is used).
> > >
> > > Having this info in sysfs would spare us a number of issues that a
> > > set of new syscalls would bring. The question is, would that be
> > > enough, or is there a reason that sysfs can't be used to present
> > > the various filesystem related information that fsinfo is supposed
> > > to present?
> > >
> > > Thanks,
> > > Miklos
> >
> > We need a unique id for superblocks anyway. I had wondered about
> > using s_dev some time back, but for the reasons mentioned earlier in
> > this thread I think it might just land up being confusing and
> > difficult to manage. While fake s_devs are created for sbs that
> > don't have a device, I can't help thinking that something closer to
> > ifindex, but for superblocks, is needed here. That would avoid the
> > issue of which device number to use.
> > In fact we need that anyway for the notifications, since without
> > that there is a race that can lead to missing remounts of the same
> > device, in case a umount/mount pair is missed due to an overrun, and
> > then fsinfo returns the same device as before, with potentially the
> > same mount options too. So I think a unique id for a superblock is
> > a generically useful feature, which would also allow for sensible
> > sysfs directory naming, if required,
>
> But would this be informative and useful for the user? I'm sure we can
> find a persistent id for a persistent superblock, but what about tmpfs
> ... that's going to have to change with every reboot. It's going to be
> remarkably inconvenient if I want to get fsinfo on /run to have to
> keep finding what the id is.
>
> The other thing a file descriptor does that sysfs doesn't is that it
> solves the information leak: if I'm in a mount namespace that has no
> access to certain mounts, I can't fspick them and thus I can't see the
> information. By default, with sysfs I can.

Difficult to figure out which part of the thread to reply to. :)

sysfs strikes me as fundamentally misguided for this task. Init systems
or any large-scale daemon will hate parsing things; there's that, and
part of the reason why mountinfo sucks is the parsing of a potentially
enormous file. Exposing information in sysfs will require parsing again
one way or the other. I've been discussing these bottlenecks with
Lennart quite a bit, and reliable and performant mount notifications
without needing to parse stuff are very high on the issue list.

But even if that isn't an issue for some reason, the namespace aspect
is definitely something I'd consider a no-go. James has been poking at
this a little already and I agree. More specifically, sysfs and proc
already are a security nightmare for namespace-aware workloads and
require special care. Not leaking information in any way is a difficult
task.
I mean, over the last two years I've sent quite a lot of patches to the
networking-namespace-aware part of sysfs alone, either fixing
information leaks or making other parts namespace-aware that weren't
and were causing issues (there's another large-ish series sitting in
Dave's tree right now). And tbh, network namespacing in sysfs is imho
trivial compared to what we would need to do to handle mount
namespacing and especially mount propagation.

fsinfo() is a way cleaner and ultimately simpler approach. We very much
want it file-descriptor based. The mount API opens up the road to
secure and _delegatable_ querying of filesystem information.

Christian
On Fri, Feb 28, 2020 at 1:27 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
> > Superblocks and mounts could get enumerated by a unique identifier.
> > mnt_id seems to be good for mounts, s_dev may or may not be good for
> > superblock, but s_id (as introduced in this patchset) could be used
> > instead.
>
> So what would the sysfs tree look like with this?

For a start something like this:

mounts/$MOUNT_ID/
    parent -> ../$PARENT_ID
    super -> ../../supers/$SUPER_ID
    root: path from mount root to fs root (could be optional as
          usually they are the same)
    mountpoint -> $MOUNTPOINT
    flags: mount flags
    propagation: mount propagation
    children/$CHILD_ID -> ../../$CHILD_ID

supers/$SUPER_ID/
    type: fstype
    source: mount source (devname)
    options: csv of mount options

Thanks,
Miklos
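To make the round-trip cost concrete, here is a toy that builds the layout proposed in this message in a temporary directory (the layout is hypothetical -- it was never implemented -- and the attribute values are invented) and then answers one query about one mount. Note the several open()/read() calls where a single fsinfo()-style query would do:

```python
import os, tempfile

def read_attr(*parts):
    with open(os.path.join(*parts)) as f:   # one open+read per attribute
        return f.read().rstrip("\n")

base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, "mounts/26/children"))
os.makedirs(os.path.join(base, "supers/s1"))
for name, val in [("mounts/26/flags", "rw,relatime"),
                  ("mounts/26/propagation", "shared:1"),
                  ("supers/s1/type", "ext4"),
                  ("supers/s1/source", "/dev/sda1"),
                  ("supers/s1/options", "rw,errors=remount-ro")]:
    with open(os.path.join(base, name), "w") as f:
        f.write(val + "\n")
os.symlink(os.path.join(base, "supers/s1"),
           os.path.join(base, "mounts/26/super"))

# "What are the flags, fstype, source and options of mount 26?"
sb = os.path.join(base, "mounts/26/super")   # follow the super symlink
info = {"flags": read_attr(base, "mounts/26/flags"),
        "type": read_attr(sb, "type"),
        "source": read_attr(sb, "source"),
        "options": read_attr(sb, "options")}
print(info)
```

Repeated for several thousand mounts, this is the "busy user space code" Ian describes earlier in the thread.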
sysfs also has some other disadvantages for this:

 (1) There's a potential chicken-and-egg problem in that you have to
     create a bunch of files and dirs in sysfs for every created mount
     and superblock (possibly excluding special ones like the socket
     mount) - but this includes sysfs itself. This might work -
     provided you create sysfs first.

 (2) sysfs is memory intensive. The directory structure has to be
     backed by dentries and inodes that linger as long as the
     referenced object does (procfs is more efficient in this regard
     for files that aren't being accessed).

 (3) It gives people extra, indirect ways to pin mount objects and
     superblocks.

For the moment, fsinfo() gives you three ways of referring to a
filesystem object:

 (a) Directly by path.

 (b) By path associated with an fd.

 (c) By mount ID (perm checked by working back up the tree).

but will need to add:

 (d) By fscontext fd (which is hard to find in sysfs). Indeed, the
     superblock may not even exist yet.

David
Miklos Szeredi <miklos@szeredi.hu> wrote:
> children/$CHILD_ID -> ../../$CHILD_ID

This would really suck. This bit would particularly affect rescanning
time. You also really want to read the entire child set atomically and,
ideally, include notification counters.

> supers/$SUPER_ID/
>     type: fstype
>     source: mount source (devname)
>     options: csv of mount options

There's a lot more to fsinfo() than just this lot - and there's the
possibility that some of the values may change depending on exactly
which file you're looking at.

David
On Fri, Feb 28, 2020 at 05:24:23PM +0100, Miklos Szeredi wrote:
> On Fri, Feb 28, 2020 at 1:27 PM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> >
> > > Superblocks and mounts could get enumerated by a unique identifier.
> > > mnt_id seems to be good for mounts, s_dev may or may not be good for
> > > superblock, but s_id (as introduced in this patchset) could be used
> > > instead.
> >
> > So what would the sysfs tree look like with this?
>
> For a start something like this:
>
[...]

Oh, wonderful.  So let me see if I got it right - any namespace operation
can create/destroy/move around an arbitrary amount of sysfs objects.

Better yet, we suddenly have to express the lifetime rules for struct
mount and struct superblock in terms of struct device garbage.

I'm less than thrilled by the entire fsinfo circus, but this really takes
the cake.  In case it needs to be spelled out: NAK.
On Fri, Feb 28, 2020 at 6:15 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Fri, Feb 28, 2020 at 05:24:23PM +0100, Miklos Szeredi wrote:
[...]
> Oh, wonderful.  So let me see if I got it right - any namespace operation
> can create/destroy/move around an arbitrary amount of sysfs objects.

Parent/children symlinks may be excessive...

> Better yet, we suddenly have to express the lifetime rules for struct
> mount and struct superblock in terms of struct device garbage.

How so?  struct mount and struct superblock would hold a ref on struct
device, not the other way round.

In any case, I'm not insistent on the use of sysfs device classes for
this; struct device (488B) does seem too heavy for struct mount (328B).

What I'm pretty sure about is that a read(2) based interface would be way
more useful than the syscall multiplexer that the current proposal is.

Thanks,
Miklos
On Fri, Feb 28, 2020 at 5:36 PM David Howells <dhowells@redhat.com> wrote:
>
> sysfs also has some other disadvantages for this:
>
> (1) There's a potential chicken-and-egg problem in that you have to create a
>     bunch of files and dirs in sysfs for every created mount and superblock
>     (possibly excluding special ones like the socket mount) - but this
>     includes sysfs itself.  This might work - provided you create sysfs
>     first.

Sysfs architecture looks something like this (I hope Greg will correct me
if I'm wrong):

  device driver -> kobj tree <- sysfs tree

The kobj tree is created by the device driver, and the dentry tree is
created on demand from the kobj tree.  Lifetime of kobjs is bound to both
the sysfs objects and the device but not the other way round.  I.e. device
can go away while the sysfs object is still being referenced, and sysfs
can be freely mounted and unmounted independently of device
initialization.

So there's no ordering requirement between sysfs mounts and other mounts.
I might be wrong on the details, since mounts are created very early in
the boot process...

> (2) sysfs is memory intensive.  The directory structure has to be backed by
>     dentries and inodes that linger as long as the referenced object does
>     (procfs is more efficient in this regard for files that aren't being
>     accessed).

See above: I don't think dentries and inodes are pinned, only kobjs and
their associated cruft.  Which may be too heavy, depending on the details
of the kobj tree.

> (3) It gives people extra, indirect ways to pin mount objects and
>     superblocks.

See above.

> For the moment, fsinfo() gives you three ways of referring to a filesystem
> object:
>
> (a) Directly by path.

A path is always representable by an O_PATH descriptor.

> (b) By path associated with an fd.

See my proposal about linking from /proc/$PID/fdmount/$FD ->
/sys/devices/virtual/mounts/$MOUNT_ID.

> (c) By mount ID (perm checked by working back up the tree).

Check that perm on lookup of /sys/devices/virtual/mounts/$MOUNT_ID.  The
proc symlink would bypass the lookup check by directly jumping to the
mountinfo dir.

> but will need to add:
>
> (d) By fscontext fd (which is hard to find in sysfs).  Indeed, the
>     superblock may not even exist yet.

Proc symlink would work for that too.

If sysfs is too heavy, this could be proc or a completely new filesystem.
The implementation is much less relevant at this stage of the discussion
than the interface.

Thanks,
Miklos
On Mon, Mar 02, 2020 at 10:09:51AM +0100, Miklos Szeredi wrote:
> On Fri, Feb 28, 2020 at 5:36 PM David Howells <dhowells@redhat.com> wrote:
> >
> > sysfs also has some other disadvantages for this:
[...]
> > (2) sysfs is memory intensive.  The directory structure has to be backed
> >     by dentries and inodes that linger as long as the referenced object
> >     does (procfs is more efficient in this regard for files that aren't
> >     being accessed).
>
> See above: I don't think dentries and inodes are pinned, only kobjs and
> their associated cruft.  Which may be too heavy, depending on the details
> of the kobj tree.

That is correct, they should not be pinned, that is what kernfs handles
and why we can handle 30k virtual block devices on a 31bit s390 instance :)

So you shouldn't have to worry about memory for sysfs.  There are loads of
other reasons probably not to use sysfs for this instead :)

thanks,

greg k-h
On Fri, Feb 28, 2020 at 05:24:23PM +0100, Miklos Szeredi wrote:
> On Fri, Feb 28, 2020 at 1:27 PM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> >
> > > Superblocks and mounts could get enumerated by a unique identifier.
> > > mnt_id seems to be good for mounts, s_dev may or may not be good for
> > > superblock, but s_id (as introduced in this patchset) could be used
> > > instead.
> >
> > So what would the sysfs tree look like with this?
>
> For a start something like this:
>
[...]
> supers/$SUPER_ID/
>     type: fstype
>     source: mount source (devname)
>     options:

What about use-cases where I have no ID, but I have mountpoint path (e.g.
"umount /foo")?  In this case I have to go to open() + fsinfo() and then
sysfs does not make sense for me, right?

Karel
On Mon, 2020-03-02 at 10:09 +0100, Miklos Szeredi wrote:
> On Fri, Feb 28, 2020 at 5:36 PM David Howells <dhowells@redhat.com>
> wrote:
> > sysfs also has some other disadvantages for this:
[...]
> > (c) By mount ID (perm checked by working back up the tree).
>
> Check that perm on lookup of /sys/devices/virtual/mounts/$MOUNT_ID.
> The proc symlink would bypass the lookup check by directly jumping to
> the mountinfo dir.
>
> > but will need to add:
> >
> > (d) By fscontext fd (which is hard to find in sysfs).  Indeed, the
> >     superblock may not even exist yet.
>
> Proc symlink would work for that too.

There's mounts enumeration too, ordering is required to identify the top
(or bottom depending on terminology) with more than one mount on a mount
point.

> If sysfs is too heavy, this could be proc or a completely new
> filesystem.  The implementation is much less relevant at this stage of
> the discussion than the interface.

Ha, proc with the seq file interface, that's already proved to not work
properly and looks difficult to fix.

Ian
On Tue, Mar 3, 2020 at 6:28 AM Ian Kent <raven@themaw.net> wrote:
>
> On Mon, 2020-03-02 at 10:09 +0100, Miklos Szeredi wrote:
[...]
> There's mounts enumeration too, ordering is required to identify the
> top (or bottom depending on terminology) with more than one mount on
> a mount point.
>
> > If sysfs is too heavy, this could be proc or a completely new
> > filesystem.  The implementation is much less relevant at this stage
> > of the discussion than the interface.
>
> Ha, proc with the seq file interface, that's already proved to not
> work properly and looks difficult to fix.

I'm doing a patch.  Let's see how it fares in the face of all these
preconceptions.

Thanks,
Miklos
Miklos Szeredi <miklos@szeredi.hu> wrote:

> I'm doing a patch.  Let's see how it fares in the face of all these
> preconceptions.

Don't forget the efficiency criterion.  One reason for going with
fsinfo(2) is that scanning /proc/mounts when there are a lot of mounts in
the system is slow (not to mention the global lock that is held during the
read).

Now, going with sysfs files on top of procfs links might avoid the global
lock, and you can avoid rereading the options string if you export a
change notification, but you're going to end up injecting a whole lot of
pathwalk latency into the system.

On top of that, it isn't going to help with the case that I'm working
towards implementing where a container manager can monitor for mounts
taking place inside the container and supervise them.  What I'm proposing
is that during the action phase (eg. FSCONFIG_CMD_CREATE), fsconfig()
would hand an fd referring to the context under construction to the
manager, which would then be able to call fsinfo() to query it and
fsconfig() to adjust it, reject it or permit it.  Something like:

	fd = receive_context_to_supervise();

	struct fsinfo_params params = {
		.flags	 = FSINFO_FLAGS_QUERY_FSCONTEXT,
		.request = FSINFO_ATTR_SB_OPTIONS,
	};
	fsinfo(fd, NULL, &params, sizeof(params), buffer, sizeof(buffer));
	supervise_parameters(buffer);

	fsconfig(fd, FSCONFIG_SET_FLAG, "hard", NULL, 0);
	fsconfig(fd, FSCONFIG_SET_STRING, "vers", "4.2", 0);
	fsconfig(fd, FSCONFIG_CMD_SUPERVISE_CREATE, NULL, NULL, 0);

	struct fsinfo_params params = {
		.flags	 = FSINFO_FLAGS_QUERY_FSCONTEXT,
		.request = FSINFO_ATTR_SB_NOTIFICATIONS,
	};
	struct fsinfo_sb_notifications sbnotify;
	fsinfo(fd, NULL, &params, sizeof(params), &sbnotify, sizeof(sbnotify));
	watch_super(fd, "", AT_EMPTY_PATH, watch_fd, 0x03);

	fsconfig(fd, FSCONFIG_CMD_SUPERVISE_PERMIT, NULL, NULL, 0);
	close(fd);

However, the supervised mount may be happening in a completely different
set of namespaces, in which case the supervisor presumably wouldn't be
able to see the links in procfs and the relevant portions of sysfs.

David
On Tue, Mar 3, 2020 at 10:13 AM David Howells <dhowells@redhat.com> wrote:
>
> Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> > I'm doing a patch.  Let's see how it fares in the face of all these
> > preconceptions.
>
> Don't forget the efficiency criterion.  One reason for going with
> fsinfo(2) is that scanning /proc/mounts when there are a lot of mounts in
> the system is slow (not to mention the global lock that is held during
> the read).
>
> Now, going with sysfs files on top of procfs links might avoid the global
> lock, and you can avoid rereading the options string if you export a
> change notification, but you're going to end up injecting a whole lot of
> pathwalk latency into the system.

Completely irrelevant.  Cached lookup is so much optimized, that you won't
be able to see any of it.

No, I don't think this is going to be a performance issue at all, but if
anything we could introduce a syscall

  ssize_t readfile(int dfd, const char *path, char *buf, size_t bufsize,
                   int flags);

that is basically the equivalent of open + read + close, or even a
vectored variant that reads multiple files.  But that's off topic again,
since I don't think there's going to be any performance issue even with
plain I/O syscalls.

> On top of that, it isn't going to help with the case that I'm working
> towards implementing where a container manager can monitor for mounts
> taking place inside the container and supervise them.  What I'm proposing
> is that during the action phase (eg. FSCONFIG_CMD_CREATE), fsconfig()
> would hand an fd referring to the context under construction to the
> manager, which would then be able to call fsinfo() to query it and
> fsconfig() to adjust it, reject it or permit it.  Something like:
>
[...]
>
> However, the supervised mount may be happening in a completely different
> set of namespaces, in which case the supervisor presumably wouldn't be
> able to see the links in procfs and the relevant portions of sysfs.

It would be a "jump" link to the otherwise invisible directory.

Thanks,
Miklos
On Tue, Mar 3, 2020 at 10:26 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Tue, Mar 3, 2020 at 10:13 AM David Howells <dhowells@redhat.com> wrote:
> >
> > Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > > I'm doing a patch.  Let's see how it fares in the face of all these
> > > preconceptions.
> >
> > Don't forget the efficiency criterion.  One reason for going with
> > fsinfo(2) is that scanning /proc/mounts when there are a lot of mounts
> > in the system is slow (not to mention the global lock that is held
> > during the read).

BTW, I do feel that there's room for improvement in userspace code as
well.  Even quite big mount table could be scanned for *changes* very
efficiently, i.e. cache previous contents of /proc/self/mountinfo and
compare with new contents, line-by-line.  Only need to parse the
changed/added/removed lines.

Also it would be pretty easy to throttle the number of updates so systemd
et al. wouldn't hog the system with unnecessary processing.

Thanks,
Miklos
On Tue, Mar 03, 2020 at 10:26:21AM +0100, Miklos Szeredi wrote:
> On Tue, Mar 3, 2020 at 10:13 AM David Howells <dhowells@redhat.com> wrote:
> >
> > Miklos Szeredi <miklos@szeredi.hu> wrote:
[...]
> > However, the supervised mount may be happening in a completely different
> > set of namespaces, in which case the supervisor presumably wouldn't be
> > able to see the links in procfs and the relevant portions of sysfs.
>
> It would be a "jump" link to the otherwise invisible directory.

More magic links to beam you around sounds like a bad idea.  We had a
bunch of CVEs around them in containers and they were one of the major
reasons behind us pushing for openat2().  That's why it has a
RESOLVE_NO_MAGICLINKS flag.

Christian
On Tue, Mar 3, 2020 at 11:00 AM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Tue, Mar 03, 2020 at 10:26:21AM +0100, Miklos Szeredi wrote:
[...]
> > It would be a "jump" link to the otherwise invisible directory.
>
> More magic links to beam you around sounds like a bad idea.  We had a
> bunch of CVEs around them in containers and they were one of the major
> reasons behind us pushing for openat2().  That's why it has a
> RESOLVE_NO_MAGICLINKS flag.

No, that link wouldn't beam you around at all, it would end up in an
internally mounted instance of a mountfs, a safe place where no dangerous
CVE's roam.

Thanks,
Miklos
Hi,

On 03/03/2020 09:48, Miklos Szeredi wrote:
> On Tue, Mar 3, 2020 at 10:26 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> > On Tue, Mar 3, 2020 at 10:13 AM David Howells <dhowells@redhat.com> wrote:
> > > Miklos Szeredi <miklos@szeredi.hu> wrote:
> > >
> > > > I'm doing a patch.  Let's see how it fares in the face of all these
> > > > preconceptions.
> > > Don't forget the efficiency criterion.  One reason for going with
> > > fsinfo(2) is that scanning /proc/mounts when there are a lot of mounts
> > > in the system is slow (not to mention the global lock that is held
> > > during the read).
> BTW, I do feel that there's room for improvement in userspace code as
> well.  Even quite big mount table could be scanned for *changes* very
> efficiently, i.e. cache previous contents of /proc/self/mountinfo and
> compare with new contents, line-by-line.  Only need to parse the
> changed/added/removed lines.
>
> Also it would be pretty easy to throttle the number of updates so
> systemd et al. wouldn't hog the system with unnecessary processing.

At least having patches to compare would allow us to look at the
performance here and gain some numbers, which would be helpful to frame
the discussions.  However I'm not seeing how it would be easy to throttle
updates... they occur at whatever rate they are generated and this can be
fairly high.  Also I'm not sure that I follow how the notifications and
the dumping of the whole table are synchronized in this case, either.

Al has pointed out before that a single mount operation on a subtree can
generate a large number of changes on that subtree.  That kind of scenario
will need to be dealt with efficiently so that we don't miss things, and
we also minimize the possibility of overruns, and additional overhead on
the mount changes themselves, by keeping the notification messages small.

We should also look at what the likely worst case might be.  I seem to
remember from what Ian has said in the past that there can be tens of
thousands of autofs mounts on some large systems.  I assume that worst
case might be something like that, but multiplied by however many
containers might be on a system.  Can anybody think of a situation which
might require even more mounts?

The network subsystem had a similar problem... they use rtnetlink for the
routing information, and just like the proposal here it contains a dump
mechanism, and a way to listen to events (add/remove routes) which is
synchronized with that dump.  Ian did start looking at netlink some time
ago, but it also has some issues (it is in the network namespace, not the
fs namespace, and it has various things accumulated over the years that we
don't need for filesystems), but that was part of the original inspiration
for the fs notifications.  There is also, of course, /proc/net/route,
which can be useful in many circumstances, but for efficiency and
synchronization reasons it is not the interface of choice for routing
protocols.

David's proposal has a number of the important attributes of an
rtnetlink-like (in a conceptual sense) solution, and I remain skeptical
that a sysfs or similar interface would be an efficient solution to the
original problem, even if it might perhaps make a useful addition.

There is also the chicken-and-egg issue, in the sense that if the
interface is via a filesystem (sysfs, proc or whatever), how does one
receive a notification for that filesystem itself being mounted until
after it has been mounted?  Maybe that is not a particular problem, but I
think a cleaner solution would not require a mount in order to watch for
other mounts.

Steve.
On Tue, Mar 03, 2020 at 11:13:50AM +0100, Miklos Szeredi wrote: > On Tue, Mar 3, 2020 at 11:00 AM Christian Brauner > <christian.brauner@ubuntu.com> wrote: > > > > On Tue, Mar 03, 2020 at 10:26:21AM +0100, Miklos Szeredi wrote: > > > On Tue, Mar 3, 2020 at 10:13 AM David Howells <dhowells@redhat.com> wrote: > > > > > > > > Miklos Szeredi <miklos@szeredi.hu> wrote: > > > > > > > > > I'm doing a patch. Let's see how it fares in the face of all these > > > > > preconceptions. > > > > > > > > Don't forget the efficiency criterion. One reason for going with fsinfo(2) is > > > > that scanning /proc/mounts when there are a lot of mounts in the system is > > > > slow (not to mention the global lock that is held during the read). > > > > > > > > Now, going with sysfs files on top of procfs links might avoid the global > > > > lock, and you can avoid rereading the options string if you export a change > > > > notification, but you're going to end up injecting a whole lot of pathwalk > > > > latency into the system. > > > > > > Completely irrelevant. Cached lookup is so much optimized, that you > > > won't be able to see any of it. > > > > > > No, I don't think this is going to be a performance issue at all, but > > > if anything we could introduce a syscall > > > > > > ssize_t readfile(int dfd, const char *path, char *buf, size_t > > > bufsize, int flags); > > > > > > that is basically the equivalent of open + read + close, or even a > > > vectored variant that reads multiple files. But that's off topic > > > again, since I don't think there's going to be any performance issue > > > even with plain I/O syscalls. > > > > > > > > > > > On top of that, it isn't going to help with the case that I'm working towards > > > > implementing where a container manager can monitor for mounts taking place > > > > inside the container and supervise them. What I'm proposing is that during > > > > the action phase (eg. 
FSCONFIG_CMD_CREATE), fsconfig() would hand an fd > > > > referring to the context under construction to the manager, which would then > > > > be able to call fsinfo() to query it and fsconfig() to adjust it, reject it or > > > > permit it. Something like: > > > > > > > > fd = receive_context_to_supervise(); > > > > struct fsinfo_params params = { > > > > .flags = FSINFO_FLAGS_QUERY_FSCONTEXT, > > > > .request = FSINFO_ATTR_SB_OPTIONS, > > > > }; > > > > fsinfo(fd, NULL, ¶ms, sizeof(params), buffer, sizeof(buffer)); > > > > supervise_parameters(buffer); > > > > fsconfig(fd, FSCONFIG_SET_FLAG, "hard", NULL, 0); > > > > fsconfig(fd, FSCONFIG_SET_STRING, "vers", "4.2", 0); > > > > fsconfig(fd, FSCONFIG_CMD_SUPERVISE_CREATE, NULL, NULL, 0); > > > > struct fsinfo_params params = { > > > > .flags = FSINFO_FLAGS_QUERY_FSCONTEXT, > > > > .request = FSINFO_ATTR_SB_NOTIFICATIONS, > > > > }; > > > > struct fsinfo_sb_notifications sbnotify; > > > > fsinfo(fd, NULL, ¶ms, sizeof(params), &sbnotify, sizeof(sbnotify)); > > > > watch_super(fd, "", AT_EMPTY_PATH, watch_fd, 0x03); > > > > fsconfig(fd, FSCONFIG_CMD_SUPERVISE_PERMIT, NULL, NULL, 0); > > > > close(fd); > > > > > > > > However, the supervised mount may be happening in a completely different set > > > > of namespaces, in which case the supervisor presumably wouldn't be able to see > > > > the links in procfs and the relevant portions of sysfs. > > > > > > It would be a "jump" link to the otherwise invisible directory. > > > > More magic links to beam you around sounds like a bad idea. We had a > > bunch of CVEs around them in containers and they were one of the major > > reasons behind us pushing for openat2(). That's why it has a > > RESOLVE_NO_MAGICLINKS flag. > > No, that link wouldn't beam you around at all, it would end up in an > internally mounted instance of a mountfs, a safe place where no Even if it is a magic link to a safe place it's a magic link. They aren't a great solution to this problem. 
fsinfo() is cleaner and simpler as it creates a context for a supervised mount, which gives a managing application fine-grained control and makes it easily extendable. Also, we're apparently at the point where it seems we're suggesting another (pseudo)filesystem to get information about filesystems. Christian
On Tue, Mar 3, 2020 at 11:22 AM Steven Whitehouse <swhiteho@redhat.com> wrote: > > Hi, > > On 03/03/2020 09:48, Miklos Szeredi wrote: > > On Tue, Mar 3, 2020 at 10:26 AM Miklos Szeredi <miklos@szeredi.hu> wrote: > >> On Tue, Mar 3, 2020 at 10:13 AM David Howells <dhowells@redhat.com> wrote: > >>> Miklos Szeredi <miklos@szeredi.hu> wrote: > >>> > >>>> I'm doing a patch. Let's see how it fares in the face of all these > >>>> preconceptions. > >>> Don't forget the efficiency criterion. One reason for going with fsinfo(2) is > >>> that scanning /proc/mounts when there are a lot of mounts in the system is > >>> slow (not to mention the global lock that is held during the read). > > BTW, I do feel that there's room for improvement in userspace code as > > well. Even quite big mount table could be scanned for *changes* very > > efficiently. I.e. cache previous contents of /proc/self/mountinfo and > > compare with new contents, line-by-line. Only need to parse the > > changed/added/removed lines. > > > > Also it would be pretty easy to throttle the number of updates so > > systemd et al. wouldn't hog the system with unnecessary processing. > > > > Thanks, > > Miklos > > > > At least having patches to compare would allow us to look at the > performance here and gain some numbers, which would be helpful to frame > the discussions. However I'm not seeing how it would be easy to throttle > updates... they occur at whatever rate they are generated and this can > be fairly high. Also I'm not sure that I follow how the notifications > and the dumping of the whole table are synchronized in this case, either. What I meant is optimizing current userspace without additional kernel infrastructure. Since currently there's only the monolithic /proc/self/mountinfo, it's reasonable that if the rate of change is very high, then we don't re-read this table on every change, only within a reasonable time limit (e.g. 1s) to provide timely updates. 
Re-reading the table on every change would (does?) slow down the system so that the actual updates would be even slower, so throttling in this case very much makes sense. Once we have per-mount information from the kernel, throttling updates probably does not make sense. Thanks, Miklos
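The line-by-line comparison described above could be sketched in plain C along the following lines. This is only an illustration: `split_lines`, `contains` and `diff_snapshots` are invented helper names, and a real consumer such as libmount would want a hash table rather than the O(n*m) scan used here.

```c
#include <stdio.h>
#include <string.h>

/* Split a buffer into lines in place; returns the number of lines. */
static size_t split_lines(char *buf, char **lines, size_t max)
{
	size_t n = 0;
	char *p;

	for (p = strtok(buf, "\n"); p && n < max; p = strtok(NULL, "\n"))
		lines[n++] = p;
	return n;
}

static int contains(char **lines, size_t n, const char *line)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (strcmp(lines[i], line) == 0)
			return 1;
	return 0;
}

/*
 * Report the lines that differ between two snapshots of the mount
 * table; only these need to be re-parsed by the caller.
 */
static void diff_snapshots(char **oldl, size_t nold,
			   char **newl, size_t nnew)
{
	size_t i;

	for (i = 0; i < nnew; i++)
		if (!contains(oldl, nold, newl[i]))
			printf("added:   %s\n", newl[i]);
	for (i = 0; i < nold; i++)
		if (!contains(newl, nnew, oldl[i]))
			printf("removed: %s\n", oldl[i]);
}
```

The throttling Miklos mentions would sit in front of this: coalesce change events and only re-read and diff the table at most once per interval.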
On Tue, 2020-03-03 at 11:32 +0100, Miklos Szeredi wrote: > On Tue, Mar 3, 2020 at 11:22 AM Steven Whitehouse < > swhiteho@redhat.com> wrote: > > Hi, > > > > On 03/03/2020 09:48, Miklos Szeredi wrote: > > > On Tue, Mar 3, 2020 at 10:26 AM Miklos Szeredi <miklos@szeredi.hu > > > > wrote: > > > > On Tue, Mar 3, 2020 at 10:13 AM David Howells < > > > > dhowells@redhat.com> wrote: > > > > > Miklos Szeredi <miklos@szeredi.hu> wrote: > > > > > > > > > > > I'm doing a patch. Let's see how it fares in the face of > > > > > > all these > > > > > > preconceptions. > > > > > Don't forget the efficiency criterion. One reason for going > > > > > with fsinfo(2) is > > > > > that scanning /proc/mounts when there are a lot of mounts in > > > > > the system is > > > > > slow (not to mention the global lock that is held during the > > > > > read). > > > BTW, I do feel that there's room for improvement in userspace > > > code as > > > well. Even quite big mount table could be scanned for *changes* > > > very > > > efficiently. l.e. cache previous contents of > > > /proc/self/mountinfo and > > > compare with new contents, line-by-line. Only need to parse the > > > changed/added/removed lines. > > > > > > Also it would be pretty easy to throttle the number of updates so > > > systemd et al. wouldn't hog the system with unnecessary > > > processing. > > > > > > Thanks, > > > Miklos > > > > > > > At least having patches to compare would allow us to look at the > > performance here and gain some numbers, which would be helpful to > > frame > > the discussions. However I'm not seeing how it would be easy to > > throttle > > updates... they occur at whatever rate they are generated and this > > can > > be fairly high. Also I'm not sure that I follow how the > > notifications > > and the dumping of the whole table are synchronized in this case, > > either. > > What I meant is optimizing current userspace without additional > kernel > infrastructure. 
Since currently there's only the monolithic > /proc/self/mountinfo, it's reasonable that if the rate of change is > very high, then we don't re-read this table on every change, only > within a reasonable time limit (e.g. 1s) to provide timely updates. > Re-reading the table on every change would (does?) slow down the > system so that the actual updates would even be slower, so throttling > in this case very much makes sense. Optimizing user space is a huge task. For example, consider this (which is related to a recent upstream discussion I had): https://blog.janestreet.com/troubleshooting-systemd-with-systemtap/ Working on improving libmount is really useful but that can't help with inherently inefficient approaches to keeping info current, which is actually needed at times. > > Once we have per-mount information from the kernel, throttling > updates > probably does not make sense. And can easily lead to application problems. Throttling will lead to an inability to have up-to-date information upon which application decisions are made. I don't think it's a viable solution to the separate problem of a large number of notifications either. Ian
On Tue, Mar 3, 2020 at 11:25 AM Christian Brauner <christian.brauner@ubuntu.com> wrote: > > On Tue, Mar 03, 2020 at 11:13:50AM +0100, Miklos Szeredi wrote: > > On Tue, Mar 3, 2020 at 11:00 AM Christian Brauner > > <christian.brauner@ubuntu.com> wrote: > > > More magic links to beam you around sounds like a bad idea. We had a > > > bunch of CVEs around them in containers and they were one of the major > > > reasons behind us pushing for openat2(). That's why it has a > > > RESOLVE_NO_MAGICLINKS flag. > > > > No, that link wouldn't beam you around at all, it would end up in an > > internally mounted instance of a mountfs, a safe place where no > > Even if it is a magic link to a safe place it's a magic link. They > aren't a great solution to this problem. fsinfo() is cleaner and > simpler as it creates a context for a supervised mount which gives the a > managing application fine-grained control and makes it easily > extendable. Yeah, it's a nice and clean interface in the ioctl(2) sense. Sure, fsinfo() is way better than ioctl(), but at its core it's still the same syscall-multiplexer, do-everything hack. > Also, we're apparently at the point where it seems were suggesting > another (pseudo)filesystem to get information about filesystems. Implementation detail. Why would you care? Thanks, Miklos
On Tue, Mar 03, 2020 at 10:26:21AM +0100, Miklos Szeredi wrote: > No, I don't think this is going to be a performance issue at all, but > if anything we could introduce a syscall > > ssize_t readfile(int dfd, const char *path, char *buf, size_t > bufsize, int flags); off-topic, but I'll buy you many many beers if you implement it ;-), because open + read + close is pretty common for /sys and /proc in many userspace tools; for example ps, top, lsblk, lsmem, lsns, udevd etc. is all about it. Karel
On Tue, Mar 03, 2020 at 12:33:48PM +0100, Miklos Szeredi wrote: > On Tue, Mar 3, 2020 at 11:25 AM Christian Brauner > <christian.brauner@ubuntu.com> wrote: > > > > On Tue, Mar 03, 2020 at 11:13:50AM +0100, Miklos Szeredi wrote: > > > On Tue, Mar 3, 2020 at 11:00 AM Christian Brauner > > > <christian.brauner@ubuntu.com> wrote: > > > > > More magic links to beam you around sounds like a bad idea. We had a > > > > bunch of CVEs around them in containers and they were one of the major > > > > reasons behind us pushing for openat2(). That's why it has a > > > > RESOLVE_NO_MAGICLINKS flag. > > > > > > No, that link wouldn't beam you around at all, it would end up in an > > > internally mounted instance of a mountfs, a safe place where no > > > > Even if it is a magic link to a safe place it's a magic link. They > > aren't a great solution to this problem. fsinfo() is cleaner and > > simpler as it creates a context for a supervised mount which gives the a > > managing application fine-grained control and makes it easily > > extendable. > > Yeah, it's a nice and clean interface in the ioctl(2) sense. Sure, > fsinfo() is way better than ioctl(), but it at the core it's still the > same syscall multiplexer, do everything hack. In contrast to a generic ioctl() it's a domain-specific separate syscall. You can't suddenly set kvm options through fsinfo() I would hope. I find it at least debatable that a new filesystem is preferable. And - feel free to simply dismiss the concerns I expressed - so far there has not been a lot of excitement about this idea. > > > Also, we're apparently at the point where it seems were suggesting > > another (pseudo)filesystem to get information about filesystems. > > Implementation detail. Why would you care? I wouldn't call this an implementation detail. That's quite a big design choice; it's a separate filesystem. In addition, implementation details need to be maintained. Christian
On Tue, Mar 03, 2020 at 12:38:14PM +0100, Karel Zak wrote: > On Tue, Mar 03, 2020 at 10:26:21AM +0100, Miklos Szeredi wrote: > > No, I don't think this is going to be a performance issue at all, but > > if anything we could introduce a syscall > > > > ssize_t readfile(int dfd, const char *path, char *buf, size_t > > bufsize, int flags); > > off-topic, but I'll buy you many many beers if you implement it ;-), > because open + read + close is pretty common for /sys and /proc in > many userspace tools; for example ps, top, lsblk, lsmem, lsns, udevd > etc. is all about it. Unlimited beers for a 21-line kernel patch? Sign me up! Totally untested, barely compiled patch below. Actually, I like this idea (the syscall, not just the unlimited beers). Maybe this could make a lot of sense, I'll write some actual tests for it now that syscalls are getting "heavy" again due to CPU vendors finally paying the price for their madness... thanks, greg k-h ------------------- diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index 44d510bc9b78..178cd45340e2 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -359,6 +359,7 @@ 435 common clone3 __x64_sys_clone3/ptregs 437 common openat2 __x64_sys_openat2 438 common pidfd_getfd __x64_sys_pidfd_getfd +439 common readfile __x86_sys_readfile # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/fs/open.c b/fs/open.c index 0788b3715731..1a830fada750 100644 --- a/fs/open.c +++ b/fs/open.c @@ -1340,3 +1340,23 @@ int stream_open(struct inode *inode, struct file *filp) } EXPORT_SYMBOL(stream_open); + +SYSCALL_DEFINE5(readfile, int, dfd, const char __user *, filename, + char __user *, buffer, size_t, bufsize, int, flags) +{ + int retval; + int fd; + + if (force_o_largefile()) + flags |= O_LARGEFILE; + + fd = do_sys_open(dfd, filename, flags, O_RDONLY); + if (fd <= 0) + return fd; + + retval = ksys_read(fd, buffer, bufsize); + 
+ __close_fd(current->files, fd); + + return retval; +}
On Tue, Mar 03, 2020 at 02:03:47PM +0100, Greg Kroah-Hartman wrote: > On Tue, Mar 03, 2020 at 12:38:14PM +0100, Karel Zak wrote: > > On Tue, Mar 03, 2020 at 10:26:21AM +0100, Miklos Szeredi wrote: > > > No, I don't think this is going to be a performance issue at all, but > > > if anything we could introduce a syscall > > > > > > ssize_t readfile(int dfd, const char *path, char *buf, size_t > > > bufsize, int flags); > > > > off-topic, but I'll buy you many many beers if you implement it ;-), > > because open + read + close is pretty common for /sys and /proc in > > many userspace tools; for example ps, top, lsblk, lsmem, lsns, udevd > > etc. is all about it. > > Unlimited beers for a 21-line kernel patch? Sign me up! > > Totally untested, barely compiled patch below. Ok, that didn't even build, let me try this for real now...
On Tue, Mar 3, 2020 at 2:14 PM Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote: > > Unlimited beers for a 21-line kernel patch? Sign me up! > > > > Totally untested, barely compiled patch below. > > Ok, that didn't even build, let me try this for real now... Some comments on the interface: O_LARGEFILE can be unconditional, since offsets are not exposed to the caller. Use the openat2 style arguments; limit the accepted flags to sane ones (e.g. don't let this syscall create a file). If buffer is too small to fit the whole file, return error. Verify that the number of bytes read matches the file size, otherwise return error (may need to loop?). Thanks, Miklos
On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote: > On Tue, Mar 3, 2020 at 2:14 PM Greg Kroah-Hartman > <gregkh@linuxfoundation.org> wrote: > > > > Unlimited beers for a 21-line kernel patch? Sign me up! > > > > > > Totally untested, barely compiled patch below. > > > > Ok, that didn't even build, let me try this for real now... > > Some comments on the interface: Ok, hey, let's do this proper :) > O_LARGEFILE can be unconditional, since offsets are not exposed to the caller. Good point. > Use the openat2 style arguments; limit the accepted flags to sane ones > (e.g. don't let this syscall create a file). Yeah, I just added that check to my local version: /* Mask off all O_ flags as we only want to read from the file */ flags &= ~(VALID_OPEN_FLAGS); flags |= O_RDONLY | O_LARGEFILE; > If buffer is too small to fit the whole file, return error. Why? What's wrong with just returning the bytes asked for? If someone only wants 5 bytes from the front of a file, it should be fine to give that to them, right? > Verify that the number of bytes read matches the file size, otherwise > return error (may need to loop?). No, we can't "match file size" as sysfs files do not really have a sane "size". So I don't want to loop at all here, one-shot, that's all you get :) Let me actually do this and try it out for real. /me has no idea what he is getting himself into... thanks, greg k-h
Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:
> Actually, I like this idea (the syscall,
It might mesh well with atomic_open in some way.
David
On Tue, Mar 03, 2020 at 02:43:16PM +0100, Greg Kroah-Hartman wrote: > On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote: > > On Tue, Mar 3, 2020 at 2:14 PM Greg Kroah-Hartman > > <gregkh@linuxfoundation.org> wrote: > > > > > > Unlimited beers for a 21-line kernel patch? Sign me up! > > > > > > > > Totally untested, barely compiled patch below. > > > > > > Ok, that didn't even build, let me try this for real now... > > > > Some comments on the interface: > > Ok, hey, let's do this proper :) Alright, how about this patch. Actually tested with some simple sysfs files. If people don't strongly object, I'll add "real" tests to it, hook it up to all arches, write a manpage, and all the fun fluff a new syscall deserves and submit it "for real". It feels like I'm doing something wrong in that the actual syscall logic is just so small. Maybe I'll benchmark this thing to see if it makes any real difference... thanks, greg k-h From: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Subject: [PATCH] readfile: implement readfile syscall It's a tiny syscall, meant to allow a user to do a single "open this file, read into this buffer, and close the file" all in a single shot. Should be good for reading "tiny" files like sysfs, procfs, and other "small" files. There is no restarting the syscall, am trying to keep it simple. At least for now. 
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> --- arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + fs/open.c | 21 +++++++++++++++++++++ include/linux/syscalls.h | 2 ++ include/uapi/asm-generic/unistd.h | 4 +++- 5 files changed, 28 insertions(+), 1 deletion(-) diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index c17cb77eb150..a79cd025e72b 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -442,3 +442,4 @@ 435 i386 clone3 sys_clone3 __ia32_sys_clone3 437 i386 openat2 sys_openat2 __ia32_sys_openat2 438 i386 pidfd_getfd sys_pidfd_getfd __ia32_sys_pidfd_getfd +439 i386 readfile sys_readfile __ia32_sys_readfile diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index 44d510bc9b78..4f518f4e0e30 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -359,6 +359,7 @@ 435 common clone3 __x64_sys_clone3/ptregs 437 common openat2 __x64_sys_openat2 438 common pidfd_getfd __x64_sys_pidfd_getfd +439 common readfile __x64_sys_readfile # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/fs/open.c b/fs/open.c index 0788b3715731..109bad47d542 100644 --- a/fs/open.c +++ b/fs/open.c @@ -1340,3 +1340,24 @@ int stream_open(struct inode *inode, struct file *filp) } EXPORT_SYMBOL(stream_open); + +SYSCALL_DEFINE5(readfile, int, dfd, const char __user *, filename, + char __user *, buffer, size_t, bufsize, int, flags) +{ + int retval; + int fd; + + /* Mask off all O_ flags as we only want to read from the file */ + flags &= ~(VALID_OPEN_FLAGS); + flags |= O_RDONLY | O_LARGEFILE; + + fd = do_sys_open(dfd, filename, flags, 0000); + if (fd <= 0) + return fd; + + retval = ksys_read(fd, buffer, bufsize); + + __close_fd(current->files, fd); + + return retval; +} diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h 
index 1815065d52f3..3a636a913437 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -1003,6 +1003,8 @@ asmlinkage long sys_pidfd_send_signal(int pidfd, int sig, siginfo_t __user *info, unsigned int flags); asmlinkage long sys_pidfd_getfd(int pidfd, int fd, unsigned int flags); +asmlinkage long sys_readfile(int dfd, const char __user *filename, + char __user *buffer, size_t bufsize, int flags); /* * Architecture-specific system calls diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index 3a3201e4618e..31f84500915d 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -855,9 +855,11 @@ __SYSCALL(__NR_clone3, sys_clone3) __SYSCALL(__NR_openat2, sys_openat2) #define __NR_pidfd_getfd 438 __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd) +#define __NR_readfile 439 +__SYSCALL(__NR_readfile, sys_readfile) #undef __NR_syscalls -#define __NR_syscalls 439 +#define __NR_syscalls 440 /* * 32 bit systems traditionally used different
On Tue, Mar 3, 2020 at 2:43 PM Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote: > > On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote: > > > > If buffer is too small to fit the whole file, return error. > > > > Why? What's wrong with just returning the bytes asked for? If someone > > only wants 5 bytes from the front of a file, it should be fine to give > > that to them, right? I think we need to signal in some way to the caller that the result was truncated (see readlink(2), getxattr(2), getcwd(2)), otherwise the caller might be surprised. > > > Verify that the number of bytes read matches the file size, otherwise > > return error (may need to loop?). > > No, we can't "match file size" as sysfs files do not really have a sane > "size". So I don't want to loop at all here, one-shot, that's all you > get :) Hmm. I understand the no-size thing. But looping until EOF (i.e. until read returns zero) might be a good idea regardless, because short reads are allowed. Thanks, Miklos
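The looping variant Miklos suggests, retrying until read() returns zero so that legal short reads are absorbed, looks something like this in userspace terms (a sketch; `read_full` is an assumed helper name, not anything from the patch):

```c
#include <errno.h>
#include <unistd.h>

/*
 * Read from fd until EOF or until the buffer is full, absorbing
 * short reads and EINTR.  Returns total bytes read, or -1 on error.
 */
static ssize_t read_full(int fd, char *buf, size_t bufsize)
{
	size_t total = 0;

	while (total < bufsize) {
		ssize_t n = read(fd, buf + total, bufsize - total);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (n == 0)	/* EOF */
			break;
		total += n;
	}
	return total;
}
```

The trade-off being debated is whether this loop belongs in the kernel's readfile() implementation or stays the caller's problem.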
On Tue, Mar 3, 2020 at 3:10 PM Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote: > > On Tue, Mar 03, 2020 at 02:43:16PM +0100, Greg Kroah-Hartman wrote: > > On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote: > > > On Tue, Mar 3, 2020 at 2:14 PM Greg Kroah-Hartman > > > <gregkh@linuxfoundation.org> wrote: > > > > > > > > Unlimited beers for a 21-line kernel patch? Sign me up! > > > > > > > > > > Totally untested, barely compiled patch below. > > > > > > > > Ok, that didn't even build, let me try this for real now... > > > > > > Some comments on the interface: > > > > Ok, hey, let's do this proper :) > > Alright, how about this patch. > > Actually tested with some simple sysfs files. > > If people don't strongly object, I'll add "real" tests to it, hook it up > to all arches, write a manpage, and all the fun fluff a new syscall > deserves and submit it "for real". Just FYI, io_uring is moving towards the same kind of thing... IIRC you can already use it to batch a bunch of open() calls, then batch a bunch of read() calls on all the new fds and close them at the same time. And I think they're planning to add support for doing open()+read()+close() all in one go, too, except that it's a bit complicated because passing forward the file descriptor in a generic way is a bit complicated. > It feels like I'm doing something wrong in that the actuall syscall > logic is just so small. Maybe I'll benchmark this thing to see if it > makes any real difference... > > thanks, > > greg k-h > > From: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > Subject: [PATCH] readfile: implement readfile syscall > > It's a tiny syscall, meant to allow a user to do a single "open this > file, read into this buffer, and close the file" all in a single shot. > > Should be good for reading "tiny" files like sysfs, procfs, and other > "small" files. > > There is no restarting the syscall, am trying to keep it simple. At > least for now. 
> > Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [...] > +SYSCALL_DEFINE5(readfile, int, dfd, const char __user *, filename, > + char __user *, buffer, size_t, bufsize, int, flags) > +{ > + int retval; > + int fd; > + > + /* Mask off all O_ flags as we only want to read from the file */ > + flags &= ~(VALID_OPEN_FLAGS); > + flags |= O_RDONLY | O_LARGEFILE; > + > + fd = do_sys_open(dfd, filename, flags, 0000); > + if (fd <= 0) > + return fd; > + > + retval = ksys_read(fd, buffer, bufsize); > + > + __close_fd(current->files, fd); > + > + return retval; > +} If you're gonna do something like that, wouldn't you want to also elide the use of the file descriptor table completely? do_sys_open() will have to do atomic operations in the fd table and stuff, which is probably moderately bad in terms of cacheline bouncing if this is used in a multithreaded context; and as a side effect, the fd would be inherited by anyone who calls fork() concurrently. You'll probably want to use APIs like do_filp_open() and filp_close(), or something like that, instead.
Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote: > + fd = do_sys_open(dfd, filename, flags, 0000); > + if (fd <= 0) > + return fd; > + > + retval = ksys_read(fd, buffer, bufsize); > + > + __close_fd(current->files, fd); If you can use dentry_open() and vfs_read(), you might be able to avoid dealing with file descriptors entirely. That might make it worth a syscall. You're going to be asked for writefile() you know ;-) David
On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote: > On Tue, Mar 3, 2020 at 2:14 PM Greg Kroah-Hartman > <gregkh@linuxfoundation.org> wrote: > > > > Unlimited beers for a 21-line kernel patch? Sign me up! > > > > > > Totally untested, barely compiled patch below. > > > > Ok, that didn't even build, let me try this for real now... > > Some comments on the interface: > > O_LARGEFILE can be unconditional, since offsets are not exposed to the caller. > > Use the openat2 style arguments; limit the accepted flags to sane ones > (e.g. don't let this syscall create a file). If we think this is worth it, might even good to either have it support struct open_how or have it accept two flag arguments. We sure want openat2()s RESOLVE_* flags in there. Christian
On Tue, Mar 03, 2020 at 03:13:26PM +0100, Jann Horn wrote: > On Tue, Mar 3, 2020 at 3:10 PM Greg Kroah-Hartman > <gregkh@linuxfoundation.org> wrote: > > > > On Tue, Mar 03, 2020 at 02:43:16PM +0100, Greg Kroah-Hartman wrote: > > > On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote: > > > > On Tue, Mar 3, 2020 at 2:14 PM Greg Kroah-Hartman > > > > <gregkh@linuxfoundation.org> wrote: > > > > > > > > > > Unlimited beers for a 21-line kernel patch? Sign me up! > > > > > > > > > > > > Totally untested, barely compiled patch below. > > > > > > > > > > Ok, that didn't even build, let me try this for real now... > > > > > > > > Some comments on the interface: > > > > > > Ok, hey, let's do this proper :) > > > > Alright, how about this patch. > > > > Actually tested with some simple sysfs files. > > > > If people don't strongly object, I'll add "real" tests to it, hook it up > > to all arches, write a manpage, and all the fun fluff a new syscall > > deserves and submit it "for real". > > Just FYI, io_uring is moving towards the same kind of thing... IIRC > you can already use it to batch a bunch of open() calls, then batch a > bunch of read() calls on all the new fds and close them at the same > time. And I think they're planning to add support for doing > open()+read()+close() all in one go, too, except that it's a bit > complicated because passing forward the file descriptor in a generic > way is a bit complicated. It is complicated, I wouldn't recommend using io_ring for reading a bunch of procfs or sysfs files, that feels like a ton of overkill with too much setup/teardown to make it worth while. But maybe not, will have to watch and see how it goes. > > It feels like I'm doing something wrong in that the actuall syscall > > logic is just so small. Maybe I'll benchmark this thing to see if it > > makes any real difference... 
> > > > thanks, > > > > greg k-h > > > > From: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > > Subject: [PATCH] readfile: implement readfile syscall > > > > It's a tiny syscall, meant to allow a user to do a single "open this > > file, read into this buffer, and close the file" all in a single shot. > > > > Should be good for reading "tiny" files like sysfs, procfs, and other > > "small" files. > > > > There is no restarting the syscall, am trying to keep it simple. At > > least for now. > > > > Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > [...] > > +SYSCALL_DEFINE5(readfile, int, dfd, const char __user *, filename, > > + char __user *, buffer, size_t, bufsize, int, flags) > > +{ > > + int retval; > > + int fd; > > + > > + /* Mask off all O_ flags as we only want to read from the file */ > > + flags &= ~(VALID_OPEN_FLAGS); > > + flags |= O_RDONLY | O_LARGEFILE; > > + > > + fd = do_sys_open(dfd, filename, flags, 0000); > > + if (fd <= 0) > > + return fd; > > + > > + retval = ksys_read(fd, buffer, bufsize); > > + > > + __close_fd(current->files, fd); > > + > > + return retval; > > +} > > If you're gonna do something like that, wouldn't you want to also > elide the use of the file descriptor table completely? do_sys_open() > will have to do atomic operations in the fd table and stuff, which is > probably moderately bad in terms of cacheline bouncing if this is used > in a multithreaded context; and as a side effect, the fd would be > inherited by anyone who calls fork() concurrently. You'll probably > want to use APIs like do_filp_open() and filp_close(), or something > like that, instead. Ah, nice, that does make more sense. I'll play around with that, and benchmarking this thing later tonight. Have to go get some stable kernels out first... thanks for the quick review, much appreciated. greg k-h
On Tue, Mar 03, 2020 at 03:10:50PM +0100, Miklos Szeredi wrote: > On Tue, Mar 3, 2020 at 2:43 PM Greg Kroah-Hartman > <gregkh@linuxfoundation.org> wrote: > > > > On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote: > > > > If buffer is too small to fit the whole file, return error. > > > > Why? What's wrong with just returning the bytes asked for? If someone > > only wants 5 bytes from the front of a file, it should be fine to give > > that to them, right? > > I think we need to signal in some way to the caller that the result > was truncated (see readlink(2), getxattr(2), getcwd(2)), otherwise the > caller might be surprised. But that's not the way a "normal" read works. Short reads are fine, if the file isn't big enough. That's how char device nodes work all the time as well, and this kind of is like that, or some kind of "stream" to read from. If you think the file is bigger, then you, as the caller, can just pass in a bigger buffer if you want to (i.e. you can stat the thing and determine the size beforehand.) Think of the "normal" use case here, a sysfs read with a PAGE_SIZE buffer. That way userspace "knows" it will always read all of the data it can from the file, we don't have to do any seeking or determining real file size, or anything else like that. We return the number of bytes read as well, so we "know" if we did a short read, and also, you could imply, if the number of bytes read are the exact same as the number of bytes of the buffer, maybe the file is either that exact size, or bigger. This should be "simple", let's not make it complex if we can help it :) > > > Verify that the number of bytes read matches the file size, otherwise > > > return error (may need to loop?). > > > > No, we can't "match file size" as sysfs files do not really have a sane > > "size". So I don't want to loop at all here, one-shot, that's all you > > get :) > > Hmm. I understand the no-size thing. But looping until EOF (i.e. 
> until read return zero) might be a good idea regardless, because short > reads are allowed. If you want to loop, then do a userspace open/read-loop/close cycle. That's not what this syscall should be for. Should we call it: readfile-only-one-try-i-hope-my-buffer-is-big-enough()? :) thanks, greg k-h
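Under the single-shot semantics Greg describes, a caller can still detect possible truncation from the return value alone: if the byte count equals the buffer size, the content may be larger, so grow the buffer and retry. A sketch of that caller-side pattern follows, using plain open()/read() as a stand-in since readfile(2) is only a proposal, with `read_pseudo_file` as an invented name:

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Caller-side pattern for one-shot reads of pseudo files: start with
 * a page-sized buffer and, if the read fills it completely, assume
 * the content may have been truncated and retry with a larger one.
 * Returns bytes read and stores a malloc'd buffer in *out, or -1.
 */
static ssize_t read_pseudo_file(const char *path, char **out)
{
	size_t bufsize = 4096;	/* typical PAGE_SIZE */

	for (;;) {
		char *buf = malloc(bufsize);
		ssize_t n;
		int fd;

		if (!buf)
			return -1;
		fd = open(path, O_RDONLY);
		if (fd < 0) {
			free(buf);
			return -1;
		}
		n = read(fd, buf, bufsize);	/* stand-in for readfile(2) */
		close(fd);
		if (n < 0) {
			free(buf);
			return -1;
		}
		if ((size_t)n < bufsize) {	/* buffer not filled: done */
			*out = buf;
			return n;
		}
		free(buf);
		bufsize *= 2;		/* possibly truncated: retry */
	}
}
```

This matches Greg's point: the "normal" sysfs case terminates on the very first iteration, and only unusually large files pay for a second open.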
Would you permit readfile() to access FIFOs, chardevs and blockdevs?
Certainly allowing use with blockdevs seems reasonable.

David
On Tue, Mar 3, 2020 at 3:30 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
> On Tue, Mar 03, 2020 at 03:10:50PM +0100, Miklos Szeredi wrote:
[...]
> If you want to loop, then do a userspace open/read-loop/close cycle.
> That's not what this syscall should be for.
>
> Should we call it: readfile-only-one-try-i-hope-my-buffer-is-big-enough()? :)

So how is this supposed to work in e.g. the following case?

========================================
$ cat map_lots_and_read_maps.c
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

int main(void) {
  for (int i=0; i<1000; i++) {
    mmap(NULL, 0x1000, (i&1)?PROT_READ:PROT_NONE,
         MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
  }
  int maps = open("/proc/self/maps", O_RDONLY);
  static char buf[0x100000];
  int res;
  do {
    res = read(maps, buf, sizeof(buf));
  } while (res > 0);
}
$ gcc -o map_lots_and_read_maps map_lots_and_read_maps.c
$ strace -e trace='!mmap' ./map_lots_and_read_maps
execve("./map_lots_and_read_maps", ["./map_lots_and_read_maps"], 0x7ffebd297ac0 /* 51 vars */) = 0
brk(NULL)                               = 0x563a1184f000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=208479, ...}) = 0
close(3)                                = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320l\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1820104, ...}) = 0
mprotect(0x7fb5c2d1a000, 1642496, PROT_NONE) = 0
close(3)                                = 0
arch_prctl(ARCH_SET_FS, 0x7fb5c2eb6500) = 0
mprotect(0x7fb5c2eab000, 12288, PROT_READ) = 0
mprotect(0x563a103e4000, 4096, PROT_READ) = 0
mprotect(0x7fb5c2f12000, 4096, PROT_READ) = 0
munmap(0x7fb5c2eb7000, 208479)          = 0
openat(AT_FDCWD, "/proc/self/maps", O_RDONLY) = 3
read(3, "563a103e1000-563a103e2000 r--p 0"..., 1048576) = 4075
read(3, "7fb5c2985000-7fb5c2986000 ---p 0"..., 1048576) = 4067
read(3, "7fb5c29d8000-7fb5c29d9000 r--p 0"..., 1048576) = 4067
read(3, "7fb5c2a2b000-7fb5c2a2c000 ---p 0"..., 1048576) = 4067
read(3, "7fb5c2a7e000-7fb5c2a7f000 r--p 0"..., 1048576) = 4067
read(3, "7fb5c2ad1000-7fb5c2ad2000 ---p 0"..., 1048576) = 4067
read(3, "7fb5c2b24000-7fb5c2b25000 r--p 0"..., 1048576) = 4067
read(3, "7fb5c2b77000-7fb5c2b78000 ---p 0"..., 1048576) = 4067
read(3, "7fb5c2bca000-7fb5c2bcb000 r--p 0"..., 1048576) = 4067
read(3, "7fb5c2c1d000-7fb5c2c1e000 ---p 0"..., 1048576) = 4067
read(3, "7fb5c2c70000-7fb5c2c71000 r--p 0"..., 1048576) = 4067
read(3, "7fb5c2cc3000-7fb5c2cc4000 ---p 0"..., 1048576) = 4078
read(3, "7fb5c2eca000-7fb5c2ecb000 r--p 0"..., 1048576) = 2388
read(3, "", 1048576)                    = 0
exit_group(0)                           = ?
+++ exited with 0 +++
$
========================================

The kernel is randomly returning short reads *with different lengths*
that are vaguely around PAGE_SIZE, no matter how big the buffer
supplied by userspace is.  And while repeated read() calls will return
consistent state thanks to the seqfile magic, repeated readfile()
calls will probably return garbage with half-complete lines.
On Tue, Mar 03, 2020 at 03:23:51PM +0100, Christian Brauner wrote:
> On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
> > On Tue, Mar 3, 2020 at 2:14 PM Greg Kroah-Hartman
> > <gregkh@linuxfoundation.org> wrote:
> >
> > > > Unlimited beers for a 21-line kernel patch?  Sign me up!
> > > >
> > > > Totally untested, barely compiled patch below.
> > >
> > > Ok, that didn't even build, let me try this for real now...
> >
> > Some comments on the interface:
> >
> > O_LARGEFILE can be unconditional, since offsets are not exposed to the caller.
> >
> > Use the openat2 style arguments; limit the accepted flags to sane ones
> > (e.g. don't let this syscall create a file).
>
> If we think this is worth it, it might even be good to either have it
> support struct open_how or have it accept two flag arguments.  We sure
> want openat2()s RESOLVE_* flags in there.

If you look at the patch I posted in this thread, I think it properly
supports open_how and RESOLVE_* flags.  But remember it's opening a file
that is already present, in RO mode, no creation allowed, so most of the
open_how interactions are limited.

thanks,

greg k-h
On 3/3/20 7:24 AM, Greg Kroah-Hartman wrote:
> On Tue, Mar 03, 2020 at 03:13:26PM +0100, Jann Horn wrote:
>> On Tue, Mar 3, 2020 at 3:10 PM Greg Kroah-Hartman
>> <gregkh@linuxfoundation.org> wrote:
[...]
>> Just FYI, io_uring is moving towards the same kind of thing... IIRC
>> you can already use it to batch a bunch of open() calls, then batch a
>> bunch of read() calls on all the new fds and close them at the same
>> time.  And I think they're planning to add support for doing
>> open()+read()+close() all in one go, too, except that it's a bit
>> complicated because passing forward the file descriptor in a generic
>> way is a bit complicated.
>
> It is complicated, I wouldn't recommend using io_uring for reading a
> bunch of procfs or sysfs files, that feels like a ton of overkill with
> too much setup/teardown to make it worthwhile.
>
> But maybe not, will have to watch and see how it goes.

It really isn't, and I too think it makes more sense than having a
system call just for the explicit purpose of open/read/close.  As Jann
said, you can't currently do a linked sequence of open/read/close,
because the fd passing between them isn't done.  But that will come in
the future.  If the use case is "a bunch of files", then you could
trivially do "open bunch", "read bunch", "close bunch" in three
separate steps.

Curious what the use case is for this that warrants a special system
call?
Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:

> If you look at the patch I posted in this thread, I think it properly
> supports open_how and RESOLVE_* flags.  But remember it's opening a file
> that is already present, in RO mode, no creation allowed, so most of the
> open_how interactions are limited.

Something we should consider adding to openat2() at some point is the
ability to lock on open/create.  Various network filesystems support it.

David
On Tue, Mar 03, 2020 at 08:44:43AM -0700, Jens Axboe wrote:
> On 3/3/20 7:24 AM, Greg Kroah-Hartman wrote:
[...]
> It really isn't, and I too think it makes more sense than having a
> system call just for the explicit purpose of open/read/close.  As Jann
> said, you can't currently do a linked sequence of open/read/close,
> because the fd passing between them isn't done.  But that will come in
> the future.  If the use case is "a bunch of files", then you could
> trivially do "open bunch", "read bunch", "close bunch" in three
> separate steps.
>
> Curious what the use case is for this that warrants a special system
> call?

All of the open/read/close cycles for sysfs and procfs files were the
one reported use case, as it seems we have lots of utilities that do
that all the time (top and other monitoring tools).

thanks,

greg k-h
On Tue, 2020-03-03 at 08:44 -0700, Jens Axboe wrote:
> On 3/3/20 7:24 AM, Greg Kroah-Hartman wrote:
[...]
> It really isn't, and I too think it makes more sense than having a
> system call just for the explicit purpose of open/read/close.
[...]
> Curious what the use case is for this that warrants a special system
> call?

Agreed.  I'd really rather see something more general-purpose than the
proposed readfile().  At least with NFS and SMB, you can compound
together fairly arbitrary sorts of operations, and it'd be nice to be
able to pattern calls into the kernel for those sorts of uses.

So, NFSv4 has the concept of a current_stateid that is maintained by the
server.  So basically you can do all this (e.g.) in a single compound:

open  <some filehandle, get a stateid>
write <using that stateid>
close <same stateid>

It'd be nice to be able to do something similar with io_uring.  Make it
so that when you do an open, you set the "current fd" inside the
kernel's context, and then be able to issue io_uring requests that
specify a magic "fd" value that use it.

That would be a really useful pattern.
On Tue, Mar 03, 2020 at 03:40:24PM +0100, Jann Horn wrote:
> On Tue, Mar 3, 2020 at 3:30 PM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
[...]
> So how is this supposed to work in e.g. the following case?
[...]
> The kernel is randomly returning short reads *with different lengths*
> that are vaguely around PAGE_SIZE, no matter how big the buffer
> supplied by userspace is.  And while repeated read() calls will return
> consistent state thanks to the seqfile magic, repeated readfile()
> calls will probably return garbage with half-complete lines.

Ah crap, I forgot about seqfile, I was only considering the "simple"
cases that sysfs provides.

Ok, Miklos, you were totally right, I'll loop and read until the end of
file or buffer, whichever comes first.

thanks,

greg k-h
On 3/3/20 9:51 AM, Jeff Layton wrote:
> On Tue, 2020-03-03 at 08:44 -0700, Jens Axboe wrote:
[...]
> Agreed.  I'd really rather see something more general-purpose than the
> proposed readfile().
[...]
> It'd be nice to be able to do something similar with io_uring.  Make it
> so that when you do an open, you set the "current fd" inside the
> kernel's context, and then be able to issue io_uring requests that
> specify a magic "fd" value that use it.
>
> That would be a really useful pattern.

For io_uring, you can link requests that you submit into a chain.  Each
link in the chain is done in sequence.  Which means that you could do:

<open some file><read from that file><close that file>

in a single sequence.  The only thing that is missing right now is a way
to have the return of that open propagated to the 'fd' of the read and
close, and it's actually one of the topics to discuss at LSFMM next
month.

One approach would be to use BPF to handle this passing, another
suggestion has been to have the read/close specify some magic 'fd' value
that just means "inherit fd from result of previous".  The latter sounds
very close to the stateid you mention above, and the upside here is that
it wouldn't explode the necessary toolchain to need to include BPF.

In other words, this is really close to being reality and practically
feasible.
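The linked chain Jens describes can be sketched in pseudocode. IOSQE_IO_LINK is the real io_uring link flag; the FD_LAST_RESULT sentinel is hypothetical and stands for exactly the missing "inherit fd from result of previous" piece discussed above:

```
sqe = get_sqe(ring); prep_openat(sqe, AT_FDCWD, "/proc/self/stat", O_RDONLY);
sqe->flags |= IOSQE_IO_LINK;       /* next request waits for this one */

sqe = get_sqe(ring); prep_read(sqe, FD_LAST_RESULT, buf, sizeof(buf), 0);
sqe->flags |= IOSQE_IO_LINK;       /* FD_LAST_RESULT: hypothetical sentinel */

sqe = get_sqe(ring); prep_close(sqe, FD_LAST_RESULT);

submit(ring);   /* three completions: the new fd, the byte count, close status */
```

One submission, three completions, no fd ever surfacing in userspace between the operations.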
On Tue, Mar 3, 2020 at 5:51 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
> On Tue, Mar 03, 2020 at 03:40:24PM +0100, Jann Horn wrote:
[...]
> > The kernel is randomly returning short reads *with different lengths*
> > that are vaguely around PAGE_SIZE, no matter how big the buffer
> > supplied by userspace is.  And while repeated read() calls will return
> > consistent state thanks to the seqfile magic, repeated readfile()
> > calls will probably return garbage with half-complete lines.
>
> Ah crap, I forgot about seqfile, I was only considering the "simple"
> cases that sysfs provides.
>
> Ok, Miklos, you were totally right, I'll loop and read until the end of
> file or buffer, whichever comes first.

I wonder what we should do when one of the later reads returns an error
code.  As in, we start the first read, get a short read (maybe because a
signal arrived), try a second read, get -EINTR.  Do we just return the
error code?  That'd probably work fine for most use cases - e.g. if
"top" is reading stuff from procfs, and that gets interrupted by
SIGWINCH or so, it doesn't matter that we've already started the first
read; the only thing "top" really needs to know is that the read was a
short read and it has to retry.
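The convention Jann is asking about - once some bytes have been copied, report the partial count and only surface an error when nothing was read at all - is how many userspace read loops behave. A hedged sketch of that convention (`read_all_or_error` is a hypothetical helper, not the proposed kernel implementation):

```c
#include <errno.h>
#include <unistd.h>

/* One possible error convention for a looping readfile(): if a later
 * read() fails after some bytes were already copied, report the partial
 * count; only return the error when nothing was read at all.  Sketch
 * only -- "read_all_or_error" is a hypothetical helper. */
static ssize_t read_all_or_error(int fd, char *buf, size_t bufsize)
{
	size_t done = 0;

	while (done < bufsize) {
		ssize_t n = read(fd, buf + done, bufsize - done);
		if (n < 0) {
			if (errno == EINTR)
				continue;	/* retry interrupted reads; one
						 * could instead break and
						 * report 'done' */
			return done ? (ssize_t)done : -1;  /* partial data wins */
		}
		if (n == 0)		/* EOF */
			break;
		done += n;
	}
	return done;
}
```

Under this convention "top" gets exactly what it needs: either a byte count (possibly short, so retry) or a clean error with no data consumed.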
On Tue, Mar 03, 2020 at 02:19:58PM +0000, David Howells wrote:
> Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:
>
> > +	fd = do_sys_open(dfd, filename, flags, 0000);
> > +	if (fd <= 0)
> > +		return fd;
> > +
> > +	retval = ksys_read(fd, buffer, bufsize);
> > +
> > +	__close_fd(current->files, fd);
>
> If you can use dentry_open() and vfs_read(), you might be able to avoid
> dealing with file descriptors entirely.  That might make it worth a syscall.

Will poke at that...

> You're going to be asked for writefile() you know ;-)

Yup, that just got asked on this thread already :)

greg k-h
On Tue, 2020-03-03 at 09:55 -0700, Jens Axboe wrote:
> On 3/3/20 9:51 AM, Jeff Layton wrote:
[...]
> For io_uring, you can link requests that you submit into a chain.  Each
> link in the chain is done in sequence.  Which means that you could do:
>
> <open some file><read from that file><close that file>
>
> in a single sequence.
The only thing that is missing right now is a way > to have the return of that open propagated to the 'fd' of the read and > close, and it's actually one of the topics to discuss at LSFMM next > month. > > One approach would be to use BPF to handle this passing, another > suggestion has been to have the read/close specify some magic 'fd' value > that just means "inherit fd from result of previous". The latter sounds > very close to the stateid you mention above, and the upside here is that > it wouldn't explode the necessary toolchain to need to include BPF. > > In other words, this is really close to being reality and practically > feasible. > Excellent. Yes, the latter is exactly what I had in mind for this. I suspect that that would cover a large fraction of the potential use-cases for this. Basically, all you'd need to do is keep a pointer to struct file in the internal state for the chain. Then, allow userland to specify some magic fd value for subsequent chained operations that says to use that instead of consulting the fdtable. Maybe use -4096 (-MAX_ERRNO - 1)? That would cover the smb or nfs server sort of use cases, I think. For the sysfs cases, I guess you'd need to dispatch several chains, but that doesn't sound _too_ onerous. In fact, with that you should even be able to emulate the proposed readfile syscall in a userland library.
On 3/3/20 12:02 PM, Jeff Layton wrote: > On Tue, 2020-03-03 at 09:55 -0700, Jens Axboe wrote: [...] >> For io_uring, you can link requests that you submit into a chain. Each >> link in the chain is done in sequence. Which means that you could do: >> >> <open some file><read from that file><close that file> >> >> in a single sequence. 
The only thing that is missing right now is a way >> to have the return of that open propagated to the 'fd' of the read and >> close, and it's actually one of the topics to discuss at LSFMM next >> month. >> >> One approach would be to use BPF to handle this passing, another >> suggestion has been to have the read/close specify some magic 'fd' value >> that just means "inherit fd from result of previous". The latter sounds >> very close to the stateid you mention above, and the upside here is that >> it wouldn't explode the necessary toolchain to need to include BPF. >> >> In other words, this is really close to being reality and practically >> feasible. >> > > Excellent. > > Yes, the latter is exactly what I had in mind for this. I suspect that > that would cover a large fraction of the potential use-cases for this. > > Basically, all you'd need to do is keep a pointer to struct file in the > internal state for the chain. Then, allow userland to specify some magic > fd value for subsequent chained operations that says to use that instead > of consulting the fdtable. Maybe use -4096 (-MAX_ERRNO - 1)? Yeah I think that'd be a suitable way to signal that. > That would cover the smb or nfs server sort of use cases, I think. For > the sysfs cases, I guess you'd need to dispatch several chains, but that > doesn't sound _too_ onerous. The magic fd would be per-chain, so doing multiple chains wouldn't really matter at all. Let me try and hack this up, should be pretty trivial. > In fact, with that you should even be able to emulate the proposed > readfile syscall in a userland library. Exactly
On 3/3/20 12:02 PM, Jeff Layton wrote: > Basically, all you'd need to do is keep a pointer to struct file in the > internal state for the chain. Then, allow userland to specify some magic > fd value for subsequent chained operations that says to use that instead > of consulting the fdtable. Maybe use -4096 (-MAX_ERRNO - 1)? BTW, I think we need two magics here. One that says "result from previous is fd for next", and one that says "fd from previous is fd for next". The former allows inheritance from open -> read, the latter from read -> write.
On Tue, 2020-03-03 at 12:23 -0700, Jens Axboe wrote: > On 3/3/20 12:02 PM, Jeff Layton wrote: > > Basically, all you'd need to do is keep a pointer to struct file in the > > internal state for the chain. Then, allow userland to specify some magic > > fd value for subsequent chained operations that says to use that instead > > of consulting the fdtable. Maybe use -4096 (-MAX_ERRNO - 1)? > > BTW, I think we need two magics here. One that says "result from > previous is fd for next", and one that says "fd from previous is fd for > next". The former allows inheritance from open -> read, the latter from > read -> write. > Do we? I suspect that in almost all of the cases, all we'd care about is the last open. Also if you have unrelated operations in there you still have to chain the fd through somehow to the next op which is a bit hard to do with that scheme. I'd just have a single magic carveout that means "use the result of last open call done in this chain". If you do a second open (or pipe, or...), then that would put the old struct file pointer and drop a new one in there. If we really do want to enable multiple opens in a single chain though, then we might want to rethink this and consider some sort of slot table for storing open fds.
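As a rough userspace model of that single carveout (all names here are invented; in the kernel this state would hang off io_uring's submission chain, and -4096 simply echoes the -MAX_ERRNO - 1 value suggested above):

```c
/* Made-up magic value, mirroring the -4096 (-MAX_ERRNO - 1) suggestion */
#define FD_LAST_OPEN (-4096)

/* Per-chain state: only the most recent open is remembered */
struct chain_state {
	int last_open_fd;	/* -1 until an open op in the chain completes */
};

/*
 * Resolve the fd field of a chained op: the magic value refers to the
 * result of the last open in this chain; anything else is a normal fd
 * looked up the usual way.
 */
static int resolve_fd(const struct chain_state *c, int fd)
{
	if (fd == FD_LAST_OPEN)
		return c->last_open_fd;
	return fd;
}

/* A completing open op simply replaces the remembered fd */
static void chain_note_open(struct chain_state *c, int newfd)
{
	c->last_open_fd = newfd;
}
```

This is only a sketch of the bookkeeping, not io_uring's actual implementation; it shows why a second open in the same chain naturally displaces the first.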
On Tue, Mar 03, 2020 at 05:51:03PM +0100, Greg Kroah-Hartman wrote: > On Tue, Mar 03, 2020 at 03:40:24PM +0100, Jann Horn wrote: > > On Tue, Mar 3, 2020 at 3:30 PM Greg Kroah-Hartman > > <gregkh@linuxfoundation.org> wrote: > > > On Tue, Mar 03, 2020 at 03:10:50PM +0100, Miklos Szeredi wrote: > > > > On Tue, Mar 3, 2020 at 2:43 PM Greg Kroah-Hartman > > > > <gregkh@linuxfoundation.org> wrote: > > > > > > > > > > On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote: > > > > > > > > > > If buffer is too small to fit the whole file, return error. > > > > > > > > > > Why? What's wrong with just returning the bytes asked for? If someone > > > > > only wants 5 bytes from the front of a file, it should be fine to give > > > > > that to them, right? > > > > > > > > I think we need to signal in some way to the caller that the result > > > > was truncated (see readlink(2), getxattr(2), getcwd(2)), otherwise the > > > > caller might be surprised. > > > > > > But that's not the way a "normal" read works. Short reads are fine, if > > > the file isn't big enough. That's how char device nodes work all the > > > time as well, and this kind of is like that, or some kind of "stream" to > > > read from. > > > > > > If you think the file is bigger, then you, as the caller, can just pass > > > in a bigger buffer if you want to (i.e. you can stat the thing and > > > determine the size beforehand.) > > > > > > Think of the "normal" use case here, a sysfs read with a PAGE_SIZE > > > buffer. That way userspace "knows" it will always read all of the data > > > it can from the file, we don't have to do any seeking or determining > > > real file size, or anything else like that. > > > > > > We return the number of bytes read as well, so we "know" if we did a > > > short read, and also, you could imply, if the number of bytes read are > > > the exact same as the number of bytes of the buffer, maybe the file is > > > either that exact size, or bigger. 
> > > > > > This should be "simple", let's not make it complex if we can help it :) > > > > > > > > > Verify that the number of bytes read matches the file size, otherwise > > > > > > return error (may need to loop?). > > > > > > > > > > No, we can't "match file size" as sysfs files do not really have a sane > > > > > "size". So I don't want to loop at all here, one-shot, that's all you > > > > > get :) > > > > > > > > Hmm. I understand the no-size thing. But looping until EOF (i.e. > > > > until read return zero) might be a good idea regardless, because short > > > > reads are allowed. > > > > > > If you want to loop, then do a userspace open/read-loop/close cycle. > > > That's not what this syscall should be for. > > > > > > Should we call it: readfile-only-one-try-i-hope-my-buffer-is-big-enough()? :) > > > > So how is this supposed to work in e.g. the following case? > > > > ======================================== > > $ cat map_lots_and_read_maps.c > > #include <sys/mman.h> > > #include <fcntl.h> > > #include <unistd.h> > > > > int main(void) { > > for (int i=0; i<1000; i++) { > > mmap(NULL, 0x1000, (i&1)?PROT_READ:PROT_NONE, > > MAP_PRIVATE|MAP_ANONYMOUS, -1, 0); > > } > > int maps = open("/proc/self/maps", O_RDONLY); > > static char buf[0x100000]; > > int res; > > do { > > res = read(maps, buf, sizeof(buf)); > > } while (res > 0); > > } > > $ gcc -o map_lots_and_read_maps map_lots_and_read_maps.c > > $ strace -e trace='!mmap' ./map_lots_and_read_maps > > execve("./map_lots_and_read_maps", ["./map_lots_and_read_maps"], > > 0x7ffebd297ac0 /* 51 vars */) = 0 > > brk(NULL) = 0x563a1184f000 > > access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) > > openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3 > > fstat(3, {st_mode=S_IFREG|0644, st_size=208479, ...}) = 0 > > close(3) = 0 > > openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3 > > read(3, 
"\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320l\2\0\0\0\0\0"..., > > 832) = 832 > > fstat(3, {st_mode=S_IFREG|0755, st_size=1820104, ...}) = 0 > > mprotect(0x7fb5c2d1a000, 1642496, PROT_NONE) = 0 > > close(3) = 0 > > arch_prctl(ARCH_SET_FS, 0x7fb5c2eb6500) = 0 > > mprotect(0x7fb5c2eab000, 12288, PROT_READ) = 0 > > mprotect(0x563a103e4000, 4096, PROT_READ) = 0 > > mprotect(0x7fb5c2f12000, 4096, PROT_READ) = 0 > > munmap(0x7fb5c2eb7000, 208479) = 0 > > openat(AT_FDCWD, "/proc/self/maps", O_RDONLY) = 3 > > read(3, "563a103e1000-563a103e2000 r--p 0"..., 1048576) = 4075 > > read(3, "7fb5c2985000-7fb5c2986000 ---p 0"..., 1048576) = 4067 > > read(3, "7fb5c29d8000-7fb5c29d9000 r--p 0"..., 1048576) = 4067 > > read(3, "7fb5c2a2b000-7fb5c2a2c000 ---p 0"..., 1048576) = 4067 > > read(3, "7fb5c2a7e000-7fb5c2a7f000 r--p 0"..., 1048576) = 4067 > > read(3, "7fb5c2ad1000-7fb5c2ad2000 ---p 0"..., 1048576) = 4067 > > read(3, "7fb5c2b24000-7fb5c2b25000 r--p 0"..., 1048576) = 4067 > > read(3, "7fb5c2b77000-7fb5c2b78000 ---p 0"..., 1048576) = 4067 > > read(3, "7fb5c2bca000-7fb5c2bcb000 r--p 0"..., 1048576) = 4067 > > read(3, "7fb5c2c1d000-7fb5c2c1e000 ---p 0"..., 1048576) = 4067 > > read(3, "7fb5c2c70000-7fb5c2c71000 r--p 0"..., 1048576) = 4067 > > read(3, "7fb5c2cc3000-7fb5c2cc4000 ---p 0"..., 1048576) = 4078 > > read(3, "7fb5c2eca000-7fb5c2ecb000 r--p 0"..., 1048576) = 2388 > > read(3, "", 1048576) = 0 > > exit_group(0) = ? > > +++ exited with 0 +++ > > $ > > ======================================== > > > > The kernel is randomly returning short reads *with different lengths* > > that are vaguely around PAGE_SIZE, no matter how big the buffer > > supplied by userspace is. And while repeated read() calls will return > > consistent state thanks to the seqfile magic, repeated readfile() > > calls will probably return garbage with half-complete lines. > > Ah crap, I forgot about seqfile, I was only considering the "simple" > cases that sysfs provides. 
> > Ok, Miklos, you were totally right, I'll loop and read until the end of > file or buffer, whichever comes first. Hm, nope, this works just fine with the single "read" call. I can read /proc/self/maps with a single buffer, also larger files like /sys/kernel/debug/usb/devices work just fine. So maybe it is all sane without a loop. I'll try to get rid of the fd now, and despite the interest in io_uring, this might be a lot more "simple" overall. thanks, greg k-h
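The short-read behaviour is easy to check from userspace. The helper below (names invented for illustration) reads to EOF in a loop and counts the read() calls; on seq_file-backed files such as /proc/self/maps the kernel may return roughly a page per call regardless of the buffer size, so the count can exceed one even with a large buffer:

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Read a file to EOF with a read() loop.  *nreads counts how many
 * read() calls were needed, including the final zero-length read that
 * signals EOF.  Returns total bytes read, or -1 on error.
 */
static ssize_t read_to_eof(const char *path, char *buf, size_t bufsize,
			   int *nreads)
{
	size_t total = 0;
	ssize_t n = 0;
	int fd = open(path, O_RDONLY);

	*nreads = 0;
	if (fd < 0)
		return -1;
	while (total < bufsize) {
		n = read(fd, buf + total, bufsize - total);
		(*nreads)++;
		if (n <= 0)
			break;		/* EOF or error */
		total += (size_t)n;
	}
	close(fd);
	return n < 0 ? -1 : (ssize_t)total;
}
```

A one-shot caller that skips this loop is betting that the file fits in a single kernel-chosen chunk, which is exactly the assumption Jann's strace output calls into question.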
On 3/3/20 12:43 PM, Jeff Layton wrote: > On Tue, 2020-03-03 at 12:23 -0700, Jens Axboe wrote: [...] >> BTW, I think we need two magics here. One that says "result from >> previous is fd for next", and one that says "fd from previous is fd for >> next". The former allows inheritance from open -> read, the latter from >> read -> write. > Do we? I suspect that in almost all of the cases, all we'd care about is > the last open. Also if you have unrelated operations in there you still > have to chain the fd through somehow to the next op which is a bit hard > to do with that scheme. > > I'd just have a single magic carveout that means "use the result of last > open call done in this chain". If you do a second open (or pipe, or...), > then that would put the old struct file pointer and drop a new one in > there. > > If we really do want to enable multiple opens in a single chain though, > then we might want to rethink this and consider some sort of slot table > for storing open fds. I think the one magic can work, you just have to define your chain appropriately for the case where you have multiple opens. That's true for the two magic approach as well, of course, I don't want a stack of open fds, just "last open" should suffice. I don't like the implicit close, if your op opens an fd, something should close it again. You pass it back to the application in any case for io_uring, so the app can just close it. Which means that your chain should just include a close for whatever fd you open, unless you plan on using it in the application afterwards.
On Tue, 2020-03-03 at 13:33 -0700, Jens Axboe wrote: > On 3/3/20 12:43 PM, Jeff Layton wrote: [...] > I think the one magic can work, you just have to define your chain > appropriately for the case where you have multiple opens. That's true > for the two magic approach as well, of course, I don't want a stack of > open fds, just "last open" should suffice. > Yep. > I don't like the implicit close, if your op opens an fd, something > should close it again. You pass it back to the application in any case > for io_uring, so the app can just close it. 
Which means that your chain > should just include a close for whatever fd you open, unless you plan on > using it in the application afterwards. > Yeah sorry, I didn't word that correctly. Let me try again: My thinking was that you would still return the result of the open to userland, but also stash a struct file pointer in the internal chain representation. Then you just refer to that when you get the "magic" fd. You'd still need to explicitly close the file though if you didn't want to use it past the end of the current chain. So, I guess you _do_ need the actual fd to properly close the file in that case. On another note, what happens if you do open+write+close and the write fails? Does the close still happen, or would you have to issue one separately after getting the result?
On 3/3/20 2:03 PM, Jeff Layton wrote: > On Tue, 2020-03-03 at 13:33 -0700, Jens Axboe wrote: [...] >> I don't like the implicit close, if your op opens an fd, something >> should close it again. You pass it back to the application in any case >> for io_uring, so the app can just close it. 
Which means that your chain >> should just include a close for whatever fd you open, unless you plan on >> using it in the application afterwards. >> > Yeah sorry, I didn't word that correctly. Let me try again: > > My thinking was that you would still return the result of the open to > userland, but also stash a struct file pointer in the internal chain > representation. Then you just refer to that when you get the "magic" fd. > > You'd still need to explicitly close the file though if you didn't want > to use it past the end of the current chain. So, I guess you _do_ need > the actual fd to properly close the file in that case. Right, I'm caching both the fd and the file, we'll need both. See below, quick hack. Needs a few prep patches, we always prepare the file upfront and the prep handlers expect it to be there, we'd have to do things a bit differently for that. And attached small test app that does open+read+close. > On another note, what happens if you do open+write+close and the write > fails? Does the close still happen, or would you have to issue one > separately after getting the result? For any io_uring chain, if any link in the chain fails, then the rest of the chain is errored. So if your open fails, you'd get -ECANCELED for your read+close. If your read fails, just the close is errored. So yes, you'd have to close the fd again if the chain doesn't fully execute. 
diff --git a/fs/io_uring.c b/fs/io_uring.c index 0e2065614ace..bbaea6b3e16a 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -488,6 +488,10 @@ enum { REQ_F_NEED_CLEANUP_BIT, REQ_F_OVERFLOW_BIT, REQ_F_POLLED_BIT, + REQ_F_OPEN_FD_BIT, + + /* not a real bit, just to check we're not overflowing the space */ + __REQ_F_LAST_BIT, }; enum { @@ -532,6 +536,8 @@ enum { REQ_F_OVERFLOW = BIT(REQ_F_OVERFLOW_BIT), /* already went through poll handler */ REQ_F_POLLED = BIT(REQ_F_POLLED_BIT), + /* use chain previous open fd */ + REQ_F_OPEN_FD = BIT(REQ_F_OPEN_FD_BIT), }; struct async_poll { @@ -593,6 +599,8 @@ struct io_kiocb { struct callback_head task_work; struct hlist_node hash_node; struct async_poll *apoll; + struct file *last_open_file; + int last_open_fd; }; struct io_wq_work work; }; @@ -1292,7 +1300,7 @@ static inline void io_put_file(struct io_kiocb *req, struct file *file, { if (fixed) percpu_ref_put(&req->ctx->file_data->refs); - else + else if (!(req->flags & REQ_F_OPEN_FD)) fput(file); } @@ -1435,6 +1443,12 @@ static void io_req_link_next(struct io_kiocb *req, struct io_kiocb **nxtptr) list_del_init(&req->link_list); if (!list_empty(&nxt->link_list)) nxt->flags |= REQ_F_LINK; + if (nxt->flags & REQ_F_OPEN_FD) { + WARN_ON_ONCE(nxt->file); + nxt->last_open_file = req->last_open_file; + nxt->last_open_fd = req->last_open_fd; + nxt->file = req->last_open_file; + } *nxtptr = nxt; break; } @@ -1957,37 +1971,20 @@ static bool io_file_supports_async(struct file *file) return false; } -static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, - bool force_nonblock) +static int __io_prep_rw(struct io_kiocb *req, bool force_nonblock) { struct io_ring_ctx *ctx = req->ctx; struct kiocb *kiocb = &req->rw.kiocb; - unsigned ioprio; - int ret; if (S_ISREG(file_inode(req->file)->i_mode)) req->flags |= REQ_F_ISREG; - kiocb->ki_pos = READ_ONCE(sqe->off); if (kiocb->ki_pos == -1 && !(req->file->f_mode & FMODE_STREAM)) { req->flags |= REQ_F_CUR_POS; kiocb->ki_pos = 
req->file->f_pos; } kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp)); - kiocb->ki_flags = iocb_flags(kiocb->ki_filp); - ret = kiocb_set_rw_flags(kiocb, READ_ONCE(sqe->rw_flags)); - if (unlikely(ret)) - return ret; - - ioprio = READ_ONCE(sqe->ioprio); - if (ioprio) { - ret = ioprio_check_cap(ioprio); - if (ret) - return ret; - - kiocb->ki_ioprio = ioprio; - } else - kiocb->ki_ioprio = get_current_ioprio(); + kiocb->ki_flags |= iocb_flags(kiocb->ki_filp); /* don't allow async punt if RWF_NOWAIT was requested */ if ((kiocb->ki_flags & IOCB_NOWAIT) || @@ -2011,6 +2008,31 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, kiocb->ki_complete = io_complete_rw; } + return 0; +} + +static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe) +{ + struct kiocb *kiocb = &req->rw.kiocb; + unsigned ioprio; + int ret; + + kiocb->ki_pos = READ_ONCE(sqe->off); + + ret = kiocb_set_rw_flags(kiocb, READ_ONCE(sqe->rw_flags)); + if (unlikely(ret)) + return ret; + + ioprio = READ_ONCE(sqe->ioprio); + if (ioprio) { + ret = ioprio_check_cap(ioprio); + if (ret) + return ret; + + kiocb->ki_ioprio = ioprio; + } else + kiocb->ki_ioprio = get_current_ioprio(); + req->rw.addr = READ_ONCE(sqe->addr); req->rw.len = READ_ONCE(sqe->len); /* we own ->private, reuse it for the buffer index */ @@ -2273,13 +2295,10 @@ static int io_read_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, struct iov_iter iter; ssize_t ret; - ret = io_prep_rw(req, sqe, force_nonblock); + ret = io_prep_rw(req, sqe); if (ret) return ret; - if (unlikely(!(req->file->f_mode & FMODE_READ))) - return -EBADF; - /* either don't need iovec imported or already have it */ if (!req->io || req->flags & REQ_F_NEED_CLEANUP) return 0; @@ -2304,6 +2323,13 @@ static int io_read(struct io_kiocb *req, bool force_nonblock) size_t iov_count; ssize_t io_size, ret; + ret = __io_prep_rw(req, force_nonblock); + if (ret) + return ret; + + if (unlikely(!(req->file->f_mode & 
FMODE_READ))) + return -EBADF; + ret = io_import_iovec(READ, req, &iovec, &iter); if (ret < 0) return ret; @@ -2362,13 +2388,10 @@ static int io_write_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe, struct iov_iter iter; ssize_t ret; - ret = io_prep_rw(req, sqe, force_nonblock); + ret = io_prep_rw(req, sqe); if (ret) return ret; - if (unlikely(!(req->file->f_mode & FMODE_WRITE))) - return -EBADF; - /* either don't need iovec imported or already have it */ if (!req->io || req->flags & REQ_F_NEED_CLEANUP) return 0; @@ -2393,6 +2416,13 @@ static int io_write(struct io_kiocb *req, bool force_nonblock) size_t iov_count; ssize_t ret, io_size; + ret = __io_prep_rw(req, force_nonblock); + if (ret) + return ret; + + if (unlikely(!(req->file->f_mode & FMODE_WRITE))) + return -EBADF; + ret = io_import_iovec(WRITE, req, &iovec, &iter); if (ret < 0) return ret; @@ -2737,8 +2767,8 @@ static int io_openat2_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) static int io_openat2(struct io_kiocb *req, bool force_nonblock) { + struct file *file = NULL; struct open_flags op; - struct file *file; int ret; if (force_nonblock) @@ -2763,8 +2793,12 @@ static int io_openat2(struct io_kiocb *req, bool force_nonblock) err: putname(req->open.filename); req->flags &= ~REQ_F_NEED_CLEANUP; - if (ret < 0) + if (ret < 0) { req_set_fail_links(req); + } else if (req->flags & REQ_F_LINK) { + req->last_open_file = file; + req->last_open_fd = ret; + } io_cqring_add_event(req, ret); io_put_req(req); return 0; @@ -2980,10 +3014,6 @@ static int io_close_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return -EBADF; req->close.fd = READ_ONCE(sqe->fd); - if (req->file->f_op == &io_uring_fops || - req->close.fd == req->ctx->ring_fd) - return -EBADF; - return 0; } @@ -3013,6 +3043,18 @@ static int io_close(struct io_kiocb *req, bool force_nonblock) { int ret; + if (req->flags & REQ_F_OPEN_FD) { + if (req->close.fd != IOSQE_FD_LAST_OPEN) + return -EBADF; + req->close.fd = 
req->last_open_fd; + req->last_open_file = NULL; + req->last_open_fd = -1; + } + + if (req->file->f_op == &io_uring_fops || + req->close.fd == req->ctx->ring_fd) + return -EBADF; + req->close.put_file = NULL; ret = __close_fd_get_file(req->close.fd, &req->close.put_file); if (ret < 0) @@ -3437,8 +3479,14 @@ static int __io_accept(struct io_kiocb *req, bool force_nonblock) return -EAGAIN; if (ret == -ERESTARTSYS) ret = -EINTR; - if (ret < 0) + if (ret < 0) { req_set_fail_links(req); + } else if (req->flags & REQ_F_LINK) { + rcu_read_lock(); + req->last_open_file = fcheck_files(current->files, ret); + rcu_read_unlock(); + req->last_open_fd = ret; + } io_cqring_add_event(req, ret); io_put_req(req); return 0; @@ -4779,6 +4827,14 @@ static int io_req_set_file(struct io_submit_state *state, struct io_kiocb *req, return 0; fixed = (flags & IOSQE_FIXED_FILE); + if (fd == IOSQE_FD_LAST_OPEN) { + if (fixed) + return -EBADF; + req->flags |= REQ_F_OPEN_FD; + req->file = NULL; + return 0; + } + if (unlikely(!fixed && req->needs_fixed_file)) return -EBADF; @@ -7448,6 +7504,7 @@ static int __init io_uring_init(void) BUILD_BUG_SQE_ELEM(44, __s32, splice_fd_in); BUILD_BUG_ON(ARRAY_SIZE(io_op_defs) != IORING_OP_LAST); + BUILD_BUG_ON(__REQ_F_LAST_BIT >= 8 * sizeof(int)); req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC); return 0; }; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 53b36311cdac..3ccf74efe381 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -77,6 +77,12 @@ enum { /* always go async */ #define IOSQE_ASYNC (1U << IOSQE_ASYNC_BIT) +/* + * 'magic' ->fd values don't point to a real fd, but rather define how fds + * can be inherited through links in a chain + */ +#define IOSQE_FD_LAST_OPEN (-4096) /* previous result is fd */ + /* * io_uring_setup() flags */
On Tue, 2020-03-03 at 14:03 +0100, Greg Kroah-Hartman wrote: > On Tue, Mar 03, 2020 at 12:38:14PM +0100, Karel Zak wrote: > > On Tue, Mar 03, 2020 at 10:26:21AM +0100, Miklos Szeredi wrote: > > > No, I don't think this is going to be a performance issue at all, but > > > if anything we could introduce a syscall > > > > > > ssize_t readfile(int dfd, const char *path, char *buf, size_t bufsize, int flags); > > > > off-topic, but I'll buy you many many beers if you implement it ;-), > > because open + read + close is pretty common for /sys and /proc in > > many userspace tools; for example ps, top, lsblk, lsmem, lsns, udevd > > etc. is all about it. > > Unlimited beers for a 21-line kernel patch? Sign me up! > > Totally untested, barely compiled patch below. > > Actually, I like this idea (the syscall, not just the unlimited beers). > Maybe this could make a lot of sense, I'll write some actual tests for > it now that syscalls are getting "heavy" again due to CPU vendors > finally paying the price for their madness... The problem isn't with open->read->close but with the mount info changing between reads (i.e. seq file read takes and drops the needed lock between reads at least once). The problem is you don't know the buffer size needed to get this in one hit; how is this different from read(2)? 
>
> thanks,
>
> greg k-h
> -------------------
>
>
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index 44d510bc9b78..178cd45340e2 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -359,6 +359,7 @@
> 435	common	clone3		__x64_sys_clone3/ptregs
> 437	common	openat2		__x64_sys_openat2
> 438	common	pidfd_getfd	__x64_sys_pidfd_getfd
> +439	common	readfile	__x64_sys_readfile
>
> #
> # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/fs/open.c b/fs/open.c
> index 0788b3715731..1a830fada750 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -1340,3 +1340,23 @@ int stream_open(struct inode *inode, struct file *filp)
> }
>
> EXPORT_SYMBOL(stream_open);
> +
> +SYSCALL_DEFINE5(readfile, int, dfd, const char __user *, filename,
> +		char __user *, buffer, size_t, bufsize, int, flags)
> +{
> +	int retval;
> +	int fd;
> +
> +	if (force_o_largefile())
> +		flags |= O_LARGEFILE;
> +
> +	fd = do_sys_open(dfd, filename, flags, O_RDONLY);
> +	if (fd < 0)
> +		return fd;
> +
> +	retval = ksys_read(fd, buffer, bufsize);
> +
> +	__close_fd(current->files, fd);
> +
> +	return retval;
> +}
On Tue, 2020-03-03 at 15:10 +0100, Miklos Szeredi wrote:
> On Tue, Mar 3, 2020 at 2:43 PM Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> > On Tue, Mar 03, 2020 at 02:34:42PM +0100, Miklos Szeredi wrote:
> > > If buffer is too small to fit the whole file, return error.
> >
> > Why?  What's wrong with just returning the bytes asked for?  If
> > someone only wants 5 bytes from the front of a file, it should be
> > fine to give that to them, right?
>
> I think we need to signal in some way to the caller that the result
> was truncated (see readlink(2), getxattr(2), getcwd(2)), otherwise
> the caller might be surprised.
>
> > > Verify that the number of bytes read matches the file size,
> > > otherwise return error (may need to loop?).
> >
> > No, we can't "match file size" as sysfs files do not really have a
> > sane "size".  So I don't want to loop at all here, one-shot, that's
> > all you get :)
>
> Hmm.  I understand the no-size thing.  But looping until EOF (i.e.
> until read return zero) might be a good idea regardless, because
> short reads are allowed.

Surely a short read then equates to an error.  That has to be the
definition of readfile(), because you can already do the looping thing
with read(2) and get the entire file anyway.

If you think about it, don't you arrive at the conclusion that this can
be done with read(2) alone?  You have to loop to get the entire file
either way, otherwise there's no point to the syscall!

Ian
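The short-read question is easy to get wrong, so the loop Miklos refers to is worth spelling out. With plain read(2) the caller must iterate until a return of 0 (EOF), retrying on EINTR; a sketch (illustrative code for this thread, not taken from any patch):

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Read until EOF or until the buffer is full.  read(2) may legally
 * return fewer bytes than requested, so a single call is not enough
 * to know you have the whole file -- this loop is what a one-shot
 * readfile() would have to replace or define away.
 */
static ssize_t read_full(int fd, char *buf, size_t bufsize)
{
	size_t off = 0;

	while (off < bufsize) {
		ssize_t n = read(fd, buf + off, bufsize - off);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (n == 0)
			break;		/* EOF */
		off += (size_t)n;
	}
	return (ssize_t)off;
}
```

A readfile() that does not loop internally would have to either treat short reads as errors, as Ian suggests, or push this loop back to userspace, at which point read(2) already suffices.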
On Wed, Mar 04, 2020 at 10:01:33AM +0800, Ian Kent wrote:
> On Tue, 2020-03-03 at 14:03 +0100, Greg Kroah-Hartman wrote:
> > Actually, I like this idea (the syscall, not just the unlimited
> > beers).  Maybe this could make a lot of sense, I'll write some
> > actual tests for it now that syscalls are getting "heavy" again
> > due to CPU vendors finally paying the price for their madness...
>
> The problem isn't with open->read->close but with the mount info
> changing between reads (ie. the seq file read takes and drops the
> needed lock between reads at least once).

readfile() is not a reaction to mountinfo.

The motivation is that we have many places with a trivial
open->read->close for very small text files due to /sys and /proc.  The
current way the kernel delivers these small strings to userspace seems
pretty inefficient if we can do the same with one syscall.

    Karel

$ strace -e openat,read,close -c ps aux
...
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 43.32    0.004190           4       987           read
 31.42    0.003039           3       844         4 openat
 25.26    0.002443           2       842           close
------ ----------- ----------- --------- --------- ----------------
100.00    0.009672                  2673         4 total

$ strace -e openat,read,close -c lsns
...
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 39.95    0.001567           2       593           openat
 30.93    0.001213           2       597           close
 29.12    0.001142           3       365           read
------ ----------- ----------- --------- --------- ----------------
100.00    0.003922                  1555           total

$ strace -e openat,read,close -c lscpu
...
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 44.67    0.001480           7       189        52 openat
 34.77    0.001152           6       180           read
 20.56    0.000681           4       140           close
------ ----------- ----------- --------- --------- ----------------
100.00    0.003313                   509        52 total
On Wed, Mar 04, 2020 at 04:22:41PM +0100, Karel Zak wrote:
> On Wed, Mar 04, 2020 at 10:01:33AM +0800, Ian Kent wrote:
> > The problem isn't with open->read->close but with the mount info
> > changing between reads (ie. the seq file read takes and drops the
> > needed lock between reads at least once).
>
> readfile() is not a reaction to mountinfo.
>
> The motivation is that we have many places with a trivial
> open->read->close for very small text files due to /sys and /proc.  The
> current way the kernel delivers these small strings to userspace seems
> pretty inefficient if we can do the same with one syscall.
>
[...]

As a "real-world" test, would you recommend me converting one of the
above tools to my implementation of readfile to see how/if it actually
makes sense, or do you have some other tool you would rather see me try?

thanks,

greg k-h
On Wed, Mar 04, 2020 at 05:49:13PM +0100, Greg Kroah-Hartman wrote:
> On Wed, Mar 04, 2020 at 04:22:41PM +0100, Karel Zak wrote:
[...]
> As a "real-world" test, would you recommend me converting one of the
> above tools to my implementation of readfile to see how/if it actually
> makes sense, or do you have some other tool you would rather see me try?

See lib/path.c and lib/sysfs.c in util-linux
(https://github.com/karelzak/util-linux).  For example ul_path_read()
and ul_path_scanf().  We use it for lsblk, lsmem, lscpu, etc.

$ git grep -c ul_path_read misc-utils/lsblk.c sys-utils/lscpu.c
misc-utils/lsblk.c:30
sys-utils/lscpu.c:31

We're probably a little bit off-topic here, no problem to continue on
util-linux@vger.kernel.org or by private mails.

 Thanks!
    Karel
On Tue, Mar 03, 2020 at 08:46:09AM +0100, Miklos Szeredi wrote:
>
> I'm doing a patch.  Let's see how it fares in the face of all these
> preconceptions.

Here's a first cut.  Doesn't yet have superblock info, just mount info.
Probably has rough edges, but appears to work.

I started with sysfs, then kernfs, then went with a custom filesystem,
because neither could do what I wanted.

Anyway, this is more for review of the concept than for a code review,
but obviously if you see a fatal flaw in the design, please let me know.

get mountinfo from open file:

  cat /proc/$PID/fdmount/$FD/*

get mountinfo by mount ID:

  mount -t mountfs mountfs /mountfs
  cat /mountfs/$MNT_ID/*

Thanks,
Miklos

---
 fs/Makefile              |    1 
 fs/mount.h               |   11 +
 fs/mountfs/Makefile      |    1 
 fs/mountfs/super.c       |  497 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/namespace.c           |   60 +++++
 fs/proc/base.c           |    2 
 fs/proc/fd.c             |   82 +++++++
 fs/proc/fd.h             |    3 
 fs/proc_namespace.c      |   22 --
 fs/seq_file.c            |   23 ++
 include/linux/seq_file.h |    1 
 11 files changed, 682 insertions(+), 21 deletions(-)

--- a/fs/Makefile
+++ b/fs/Makefile
@@ -135,3 +135,4 @@ obj-$(CONFIG_EFIVAR_FS)		+= efivarfs/
 obj-$(CONFIG_EROFS_FS)		+= erofs/
 obj-$(CONFIG_VBOXSF_FS)		+= vboxsf/
 obj-$(CONFIG_ZONEFS_FS)		+= zonefs/
+obj-y				+= mountfs/
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -72,6 +72,7 @@ struct mount {
 	int mnt_expiry_mark;		/* true if marked for expiry */
 	struct hlist_head mnt_pins;
 	struct hlist_head mnt_stuck_children;
+	struct mountfs_entry *mnt_mountfs_entry;
 } __randomize_layout;
 
 #define MNT_NS_INTERNAL ERR_PTR(-EINVAL) /* distinct from any mnt_namespace */
@@ -153,3 +154,13 @@ static inline bool is_anon_ns(struct mnt
 {
 	return ns->seq == 0;
 }
+
+extern struct mount *get_mount(struct mount *mnt);
+extern void mntput_no_expire(struct mount *mnt);
+
+void mountfs_create(struct mount *mnt, struct mnt_namespace *mnt_ns);
+extern void mountfs_remove(struct mount *mnt);
+void seq_mount_children(struct seq_file *sf, struct mount *mnt);
+void seq_mount_propagate_from(struct seq_file *sf, struct mount *mnt,
+			      const struct path *root);
+int mountfs_lookup_internal(struct vfsmount *m, struct path *path);
--- /dev/null
+++ b/fs/mountfs/Makefile
@@ -0,0 +1 @@
+obj-y += super.o
--- /dev/null
+++ b/fs/mountfs/super.c
@@ -0,0 +1,497 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include "../pnode.h"
+#include <linux/fs.h>
+#include <linux/kref.h>
+#include <linux/nsproxy.h>
+#include <linux/fs_struct.h>
+#include <linux/fs_context.h>
+
+#define MOUNTFS_SUPER_MAGIC 0x4e756f4d
+
+static DEFINE_MUTEX(mountfs_lock);
+static struct rb_root mountfs_entries = RB_ROOT;
+static struct vfsmount *mountfs_mnt __read_mostly;
+
+struct mountfs_entry {
+	struct kref kref;
+	struct mount *mnt;
+	struct rb_node node;
+	int id;
+};
+
+static const char *mountfs_attrs[] = {
+	"root", "mountpoint", "id", "parent", "options", "children",
+	"group", "master", "propagate_from"
+};
+
+#define MOUNTFS_INO(id) (((unsigned long) id + 1) * \
+			 (ARRAY_SIZE(mountfs_attrs) + 1))
+
+void mountfs_entry_release(struct kref *kref)
+{
+	kfree(container_of(kref, struct mountfs_entry, kref));
+}
+
+void mountfs_entry_put(struct mountfs_entry *entry)
+{
+	kref_put(&entry->kref, mountfs_entry_release);
+}
+
+static struct mount *mountfs_get_mount(struct mountfs_entry *entry)
+{
+	struct mount *mnt;
+
+	rcu_read_lock();
+	mnt = get_mount(rcu_dereference(entry->mnt));
+	rcu_read_unlock();
+
+	return mnt;
+}
+
+static bool mountfs_entry_visible(struct mountfs_entry *entry)
+{
+	struct mount *mnt;
+	bool visible = false;
+
+	rcu_read_lock();
+	mnt = rcu_dereference(entry->mnt);
+	if (mnt && mnt->mnt_ns == current->nsproxy->mnt_ns)
+		visible = true;
+	rcu_read_unlock();
+
+	return visible;
+}
+
+static int mountfs_attr_show(struct seq_file *sf, void *v)
+{
+	const char *name = sf->file->f_path.dentry->d_name.name;
+	struct mountfs_entry *entry = sf->private;
+	struct mount *mnt = mountfs_get_mount(entry);
+	struct vfsmount *m;
+	struct super_block *sb;
+	struct path root;
+	int err = 0;
+
+	if (!mnt)
+		return -ENODEV;
+
+	m = &mnt->mnt;
+	sb = m->mnt_sb;
+
+	if (strcmp(name, "root") == 0) {
+		if (sb->s_op->show_path) {
+			err = sb->s_op->show_path(sf, m->mnt_root);
+		} else {
+			seq_dentry(sf, m->mnt_root, " \t\n\\");
+		}
+		seq_putc(sf, '\n');
+	} else if (strcmp(name, "mountpoint") == 0) {
+		struct path mnt_path = { .dentry = m->mnt_root, .mnt = m };
+
+		get_fs_root(current->fs, &root);
+		err = seq_path_root(sf, &mnt_path, &root, " \t\n\\");
+		path_put(&root);
+		if (err == SEQ_SKIP) {
+			seq_puts(sf, "(unreachable)");
+			err = 0;
+		}
+		seq_putc(sf, '\n');
+	} else if (strcmp(name, "id") == 0) {
+		seq_printf(sf, "%i\n", mnt->mnt_id);
+	} else if (strcmp(name, "parent") == 0) {
+		int parent;
+
+		rcu_read_lock();
+		parent = rcu_dereference(mnt->mnt_parent)->mnt_id;
+		rcu_read_unlock();
+
+		seq_printf(sf, "%i\n", parent);
+	} else if (strcmp(name, "options") == 0) {
+		int mnt_flags = READ_ONCE(m->mnt_flags);
+
+		seq_puts(sf, mnt_flags & MNT_READONLY ? "ro" : "rw");
+		seq_mnt_opts(sf, mnt_flags);
+		seq_putc(sf, '\n');
+	} else if (strcmp(name, "children") == 0) {
+		seq_mount_children(sf, mnt);
+	} else if (strcmp(name, "group") == 0) {
+		if (IS_MNT_SHARED(mnt))
+			seq_printf(sf, "%i\n", mnt->mnt_group_id);
+	} else if (strcmp(name, "master") == 0) {
+		if (IS_MNT_SLAVE(mnt)) {
+			int master;
+
+			rcu_read_lock();
+			master = rcu_dereference(mnt->mnt_master)->mnt_group_id;
+			rcu_read_unlock();
+			seq_printf(sf, "%i\n", master);
+		}
+	} else if (strcmp(name, "propagate_from") == 0) {
+		if (IS_MNT_SLAVE(mnt)) {
+			get_fs_root(current->fs, &root);
+			seq_mount_propagate_from(sf, mnt, &root);
+			path_put(&root);
+		}
+	}
+	mntput_no_expire(mnt);
+
+	return err;
+}
+
+static int mountfs_attr_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, mountfs_attr_show, inode->i_private);
+}
+
+static const struct file_operations mountfs_attr_fops = {
+	.open		= mountfs_attr_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static struct mountfs_entry *mountfs_node_to_entry(struct rb_node *node)
+{
+	return rb_entry(node, struct mountfs_entry, node);
+}
+
+static struct rb_node **mountfs_find_node(int id, struct rb_node **parent)
+{
+	struct rb_node **link = &mountfs_entries.rb_node;
+
+	*parent = NULL;
+	while (*link) {
+		struct mountfs_entry *entry = mountfs_node_to_entry(*link);
+
+		*parent = *link;
+		if (id < entry->id)
+			link = &entry->node.rb_left;
+		else if (id > entry->id)
+			link = &entry->node.rb_right;
+		else
+			break;
+	}
+	return link;
+}
+
+void mountfs_create(struct mount *mnt, struct mnt_namespace *mnt_ns)
+{
+	struct mountfs_entry *entry;
+	struct rb_node **link, *parent;
+
+	if (mnt->mnt.mnt_flags & MNT_INTERNAL)
+		return;
+
+	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+	if (!entry) {
+		WARN(1, "failed to allocate mountfs entry");
+		return;
+	}
+	kref_init(&entry->kref);
+	entry->mnt = mnt;
+	entry->id = mnt->mnt_id;
+
+	mutex_lock(&mountfs_lock);
+	link = mountfs_find_node(entry->id, &parent);
+	if (!WARN_ON(*link)) {
+		rb_link_node(&entry->node, parent, link);
+		rb_insert_color(&entry->node, &mountfs_entries);
+		mnt->mnt_mountfs_entry = entry;
+	} else {
+		kfree(entry);
+	}
+	mutex_unlock(&mountfs_lock);
+}
+
+void mountfs_remove(struct mount *mnt)
+{
+	struct mountfs_entry *entry = mnt->mnt_mountfs_entry;
+
+	if (!entry)
+		return;
+
+	mutex_lock(&mountfs_lock);
+	entry->mnt = NULL;
+	rb_erase(&entry->node, &mountfs_entries);
+	mutex_unlock(&mountfs_lock);
+
+	mountfs_entry_put(entry);
+
+	mnt->mnt_mountfs_entry = NULL;
+}
+
+static struct mountfs_entry *mountfs_get_entry(const char *name)
+{
+	struct mountfs_entry *entry = NULL;
+	struct rb_node **link, *dummy;
+	unsigned long mnt_id;
+	char buf[32];
+	int ret;
+
+	ret = kstrtoul(name, 10, &mnt_id);
+	if (ret || mnt_id > INT_MAX)
+		return NULL;
+
+	if (WARN_ON(snprintf(buf, sizeof(buf), "%lu", mnt_id) >= sizeof(buf)) ||
+	    strcmp(buf, name) != 0)
+		return NULL;
+
+	mutex_lock(&mountfs_lock);
+	link = mountfs_find_node(mnt_id, &dummy);
+	if (*link) {
+		entry = mountfs_node_to_entry(*link);
+		if (!mountfs_entry_visible(entry))
+			entry = NULL;
+		else
+			kref_get(&entry->kref);
+	}
+	mutex_unlock(&mountfs_lock);
+
+	return entry;
+}
+
+static void mountfs_init_inode(struct inode *inode, umode_t mode);
+
+static struct dentry *mountfs_lookup_entry(struct dentry *dentry,
+					   struct mountfs_entry *entry,
+					   int idx)
+{
+	struct inode *inode;
+
+	inode = new_inode(dentry->d_sb);
+	if (!inode) {
+		mountfs_entry_put(entry);
+		return ERR_PTR(-ENOMEM);
+	}
+	inode->i_private = entry;
+	inode->i_ino = MOUNTFS_INO(entry->id) + idx;
+	mountfs_init_inode(inode, idx ? S_IFREG | 0444 : S_IFDIR | 0555);
+	return d_splice_alias(inode, dentry);
+
+}
+
+static struct dentry *mountfs_lookup(struct inode *dir, struct dentry *dentry,
+				     unsigned int flags)
+{
+	struct mountfs_entry *entry = dir->i_private;
+	int i = 0;
+
+	if (entry) {
+		for (i = 0; i < ARRAY_SIZE(mountfs_attrs); i++)
+			if (strcmp(mountfs_attrs[i], dentry->d_name.name) == 0)
+				break;
+		if (i == ARRAY_SIZE(mountfs_attrs))
+			return ERR_PTR(-ENOMEM);
+		i++;
+	} else {
+		entry = mountfs_get_entry(dentry->d_name.name);
+		if (!entry)
+			return ERR_PTR(-ENOENT);
+	}
+
+	return mountfs_lookup_entry(dentry, entry, i);
+}
+
+static int mountfs_d_revalidate(struct dentry *dentry, unsigned int flags)
+{
+	struct mountfs_entry *entry = dentry->d_inode->i_private;
+
+	/* root: valid */
+	if (!entry)
+		return 1;
+
+	/* removed: invalid */
+	if (!entry->mnt)
+		return 0;
+
+	/* attribute or visible in this namespace: valid */
+	if (!d_can_lookup(dentry) || mountfs_entry_visible(entry))
+		return 1;
+
+	/* invisible in this namespace: valid but deny entry */
+	return -ENOENT;
+}
+
+static int mountfs_readdir(struct file *file, struct dir_context *ctx)
+{
+	struct rb_node *node;
+	struct mountfs_entry *entry = file_inode(file)->i_private;
+	char name[32];
+	const char *s;
+	unsigned int len;
+
+	if (ctx->pos - 2 > INT_MAX || !dir_emit_dots(file, ctx))
+		return 0;
+
+	if (entry) {
+		while (ctx->pos - 2 < ARRAY_SIZE(mountfs_attrs)) {
+			s = mountfs_attrs[ctx->pos - 2];
+			if (!dir_emit(ctx, s, strlen(s),
+				      MOUNTFS_INO(entry->id) + ctx->pos,
+				      DT_REG))
+				break;
+			ctx->pos++;
+		}
+		return 0;
+	}
+
+	mutex_lock(&mountfs_lock);
+	mountfs_find_node(ctx->pos - 2, &node);
+	for (; node; node = rb_next(node)) {
+		entry = mountfs_node_to_entry(node);
+		len = snprintf(name, sizeof(name), "%i", entry->id);
+		if (WARN_ON(len >= sizeof(name)))
+			goto out_unlock;
+		if (!mountfs_entry_visible(entry))
+			continue;
+		ctx->pos = (loff_t) entry->id + 2;
+		if (!dir_emit(ctx, name, len, MOUNTFS_INO(entry->id), DT_DIR))
+			goto out_unlock;
+	}
+	ctx->pos = (loff_t) INT_MAX + 3;
+out_unlock:
+	mutex_unlock(&mountfs_lock);
+	return 0;
+}
+
+int mountfs_lookup_internal(struct vfsmount *m, struct path *path)
+{
+	char name[32];
+	struct qstr this = { .name = name };
+	struct mount *mnt = real_mount(m);
+	struct mountfs_entry *entry = mnt->mnt_mountfs_entry;
+	struct dentry *dentry, *old, *root = mountfs_mnt->mnt_root;
+
+	this.len = snprintf(name, sizeof(name), "%i", mnt->mnt_id);
+	if (WARN_ON(this.len >= sizeof(name)))
+		return -EIO;
+
+	dentry = d_hash_and_lookup(root, &this);
+	if (!dentry) {
+		DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
+
+		dentry = d_alloc_parallel(root, &this, &wq);
+		if (!IS_ERR(dentry) && d_in_lookup(dentry)) {
+			kref_get(&entry->kref);
+			old = mountfs_lookup_entry(dentry, entry, 0);
+			d_lookup_done(dentry);
+			if (unlikely(old)) {
+				dput(dentry);
+				dentry = old;
+			}
+		}
+		if (IS_ERR(dentry))
+			return PTR_ERR(dentry);
+	}
+
+	*path = (struct path) { .mnt = mountfs_mnt, .dentry = dentry };
+	return 0;
+}
+
+static const struct dentry_operations mountfs_dops = {
+	.d_revalidate = mountfs_d_revalidate,
+};
+
+static const struct inode_operations mountfs_iops = {
+	.lookup = mountfs_lookup,
+};
+
+static const struct file_operations mountfs_fops = {
+	.iterate_shared = mountfs_readdir,
+	.read = generic_read_dir,
+	.llseek = generic_file_llseek,
+};
+
+static void mountfs_init_inode(struct inode *inode, umode_t mode)
+{
+	inode->i_mode = mode;
+	inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
+	if (S_ISREG(mode)) {
+		inode->i_size = PAGE_SIZE;
+		inode->i_fop = &mountfs_attr_fops;
+	} else {
+		inode->i_op = &mountfs_iops;
+		inode->i_fop = &mountfs_fops;
+	}
+}
+
+static void mountfs_evict_inode(struct inode *inode)
+{
+	struct mountfs_entry *entry = inode->i_private;
+
+	clear_inode(inode);
+	if (entry)
+		mountfs_entry_put(entry);
+}
+
+static const struct super_operations mountfs_sops = {
+	.statfs = simple_statfs,
+	.drop_inode = generic_delete_inode,
+	.evict_inode = mountfs_evict_inode,
+};
+
+static int mountfs_fill_super(struct super_block *sb, struct fs_context *fc)
+{
+	struct inode *root;
+
+	sb->s_iflags |= SB_I_NOEXEC | SB_I_NODEV;
+	sb->s_blocksize = PAGE_SIZE;
+	sb->s_blocksize_bits = PAGE_SHIFT;
+	sb->s_magic = MOUNTFS_SUPER_MAGIC;
+	sb->s_time_gran = 1;
+	sb->s_shrink.seeks = 0;
+	sb->s_op = &mountfs_sops;
+	sb->s_d_op = &mountfs_dops;
+
+	root = new_inode(sb);
+	if (!root)
+		return -ENOMEM;
+
+	root->i_ino = 1;
+	mountfs_init_inode(root, S_IFDIR | 0444);
+
+	sb->s_root = d_make_root(root);
+	if (!sb->s_root)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int mountfs_get_tree(struct fs_context *fc)
+{
+	return get_tree_single(fc, mountfs_fill_super);
+}
+
+static const struct fs_context_operations mountfs_context_ops = {
+	.get_tree = mountfs_get_tree,
+};
+
+static int mountfs_init_fs_context(struct fs_context *fc)
+{
+	fc->ops = &mountfs_context_ops;
+	fc->global = true;
+	return 0;
+}
+
+static struct file_system_type mountfs_fs_type = {
+	.name = "mountfs",
+	.init_fs_context = mountfs_init_fs_context,
+	.kill_sb = kill_anon_super,
+};
+
+static int __init mountfs_init(void)
+{
+	int err;
+
+	err = register_filesystem(&mountfs_fs_type);
+	if (!err) {
+		mountfs_mnt = kern_mount(&mountfs_fs_type);
+		if (IS_ERR(mountfs_mnt)) {
+			err = PTR_ERR(mountfs_mnt);
+			unregister_filesystem(&mountfs_fs_type);
+		}
+	}
+	return err;
+}
+fs_initcall(mountfs_init);
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -172,6 +172,24 @@ unsigned int mnt_get_count(struct mount
 #endif
 }
 
+struct mount *get_mount(struct mount *mnt)
+{
+	if (mnt) {
+		/* see comment in mntput_no_expire() */
+		if (likely(READ_ONCE(mnt->mnt_ns))) {
+			mnt_add_count(mnt, 1);
+		} else {
+			lock_mount_hash();
+			if (mnt->mnt.mnt_flags & MNT_DOOMED)
+				mnt = NULL;
+			else
+				mnt_add_count(mnt, 1);
+			unlock_mount_hash();
+		}
+	}
+	return mnt;
+}
+
 static struct mount *alloc_vfsmnt(const char *name)
 {
 	struct mount *mnt = kmem_cache_zalloc(mnt_cache, GFP_KERNEL);
@@ -1091,6 +1109,9 @@ static void cleanup_mnt(struct mount *mn
 	 * so mnt_get_writers() below is safe.
 	 */
 	WARN_ON(mnt_get_writers(mnt));
+
+	mountfs_remove(mnt);
+
 	if (unlikely(mnt->mnt_pins.first))
 		mnt_pin_kill(mnt);
 	hlist_for_each_entry_safe(m, p, &mnt->mnt_stuck_children, mnt_umount) {
@@ -1120,7 +1141,7 @@ static void delayed_mntput(struct work_s
 }
 static DECLARE_DELAYED_WORK(delayed_mntput_work, delayed_mntput);
 
-static void mntput_no_expire(struct mount *mnt)
+void mntput_no_expire(struct mount *mnt)
 {
 	LIST_HEAD(list);
 
@@ -1296,6 +1317,37 @@ const struct seq_operations mounts_op =
 };
 #endif  /* CONFIG_PROC_FS */
 
+void seq_mount_children(struct seq_file *sf, struct mount *mnt)
+{
+	struct mount *child;
+	bool first = true;
+
+	down_read(&namespace_sem);
+	list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
+		if (!first)
+			seq_putc(sf, ',');
+		else
+			first = false;
+		seq_printf(sf, "%i", child->mnt_id);
+	}
+	up_read(&namespace_sem);
+	if (!first)
+		seq_putc(sf, '\n');
+}
+
+void seq_mount_propagate_from(struct seq_file *sf, struct mount *mnt,
+			      const struct path *root)
+{
+	int dom;
+
+	down_read(&namespace_sem);
+	dom = get_dominating_id(mnt, root);
+	up_read(&namespace_sem);
+
+	if (dom)
+		seq_printf(sf, "%i\n", dom);
+}
+
 /**
  * may_umount_tree - check if a mount tree is busy
  * @mnt: root of mount tree
@@ -2062,6 +2114,9 @@ static int attach_recursive_mnt(struct m
 		err = count_mounts(ns, source_mnt);
 		if (err)
 			goto out;
+
+		for (p = source_mnt; p; p = next_mnt(p, source_mnt))
+			mountfs_create(p, ns);
 	}
 
 	if (IS_MNT_SHARED(dest_mnt)) {
@@ -3224,6 +3279,7 @@ struct mnt_namespace *copy_mnt_ns(unsign
 	p = old;
 	q = new;
 	while (p) {
+		mountfs_create(q, new_ns);
 		q->mnt_ns = new_ns;
 		new_ns->mounts++;
 		if (new_fs) {
@@ -3686,6 +3742,8 @@ static void __init init_mount_tree(void)
 	if (IS_ERR(ns))
 		panic("Can't allocate initial namespace");
 	m = real_mount(mnt);
+
+	mountfs_create(m, ns);
 	m->mnt_ns = ns;
 	ns->root = m;
 	ns->mounts = 1;
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3092,6 +3092,7 @@ static const struct pid_entry tgid_base_
 	DIR("fd",         S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
 	DIR("map_files",  S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
 	DIR("fdinfo",     S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
+	DIR("fdmount",    S_IRUSR|S_IXUSR, proc_fdmount_inode_operations, proc_fdmount_operations),
 	DIR("ns",	  S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
 #ifdef CONFIG_NET
 	DIR("net",        S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations),
@@ -3497,6 +3498,7 @@ static const struct inode_operations pro
 static const struct pid_entry tid_base_stuff[] = {
 	DIR("fd",        S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
 	DIR("fdinfo",    S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
+	DIR("fdmount",   S_IRUSR|S_IXUSR, proc_fdmount_inode_operations, proc_fdmount_operations),
 	DIR("ns",	 S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
 #ifdef CONFIG_NET
 	DIR("net",        S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations),
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -361,3 +361,85 @@ const struct file_operations proc_fdinfo
 	.iterate_shared	= proc_readfdinfo,
 	.llseek		= generic_file_llseek,
 };
+
+static int proc_fdmount_link(struct dentry *dentry, struct path *path)
+{
+	struct files_struct *files = NULL;
+	struct task_struct *task;
+	struct path fd_path;
+	int ret = -ENOENT;
+
+	task = get_proc_task(d_inode(dentry));
+	if (task) {
+		files = get_files_struct(task);
+		put_task_struct(task);
+	}
+
+	if (files) {
+		unsigned int fd = proc_fd(d_inode(dentry));
+		struct file *fd_file;
+
+		spin_lock(&files->file_lock);
+		fd_file = fcheck_files(files, fd);
+		if (fd_file) {
+			fd_path = fd_file->f_path;
+			path_get(&fd_path);
+			ret = 0;
+		}
+		spin_unlock(&files->file_lock);
+		put_files_struct(files);
+	}
+	if (!ret) {
+		ret = mountfs_lookup_internal(fd_path.mnt, path);
+		path_put(&fd_path);
+	}
+
+	return ret;
+}
+
+static struct dentry *proc_fdmount_instantiate(struct dentry *dentry,
+	struct task_struct *task, const void *ptr)
+{
+	const struct fd_data *data = ptr;
+	struct proc_inode *ei;
+	struct inode *inode;
+
+	inode = proc_pid_make_inode(dentry->d_sb, task, S_IFLNK | 0400);
+	if (!inode)
+		return ERR_PTR(-ENOENT);
+
+	ei = PROC_I(inode);
+	ei->fd = data->fd;
+
+	inode->i_op = &proc_pid_link_inode_operations;
+	inode->i_size = 64;
+
+	ei->op.proc_get_link = proc_fdmount_link;
+	tid_fd_update_inode(task, inode, 0);
+
+	d_set_d_op(dentry, &tid_fd_dentry_operations);
+	return d_splice_alias(inode, dentry);
+}
+
+static struct dentry *
+proc_lookupfdmount(struct inode *dir, struct dentry *dentry, unsigned int flags)
+{
+	return proc_lookupfd_common(dir, dentry, proc_fdmount_instantiate);
+}
+
+static int proc_readfdmount(struct file *file, struct dir_context *ctx)
+{
+	return proc_readfd_common(file, ctx,
+				  proc_fdmount_instantiate);
+}
+
+const struct inode_operations proc_fdmount_inode_operations = {
+	.lookup		= proc_lookupfdmount,
+	.setattr	= proc_setattr,
+};
+
+const struct file_operations proc_fdmount_operations = {
+	.read		= generic_read_dir,
+	.iterate_shared	= proc_readfdmount,
+	.llseek		= generic_file_llseek,
+};
--- a/fs/proc/fd.h
+++ b/fs/proc/fd.h
@@ -10,6 +10,9 @@ extern const struct inode_operations pro
 extern const struct file_operations proc_fdinfo_operations;
 extern const struct inode_operations proc_fdinfo_inode_operations;
 
+extern const struct file_operations proc_fdmount_operations;
+extern const struct inode_operations proc_fdmount_inode_operations;
+
 extern int proc_fd_permission(struct inode *inode, int mask);
 
 static inline unsigned int proc_fd(struct inode *inode)
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -61,24 +61,6 @@ static int show_sb_opts(struct seq_file
 	return security_sb_show_options(m, sb);
 }
 
-static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
-{
-	static const struct proc_fs_info mnt_info[] = {
-		{ MNT_NOSUID, ",nosuid" },
-		{ MNT_NODEV, ",nodev" },
-		{ MNT_NOEXEC, ",noexec" },
-		{ MNT_NOATIME, ",noatime" },
-		{ MNT_NODIRATIME, ",nodiratime" },
-		{ MNT_RELATIME, ",relatime" },
-		{ 0, NULL }
-	};
-	const struct proc_fs_info *fs_infop;
-
-	for (fs_infop = mnt_info; fs_infop->flag; fs_infop++) {
-		if (mnt->mnt_flags & fs_infop->flag)
-			seq_puts(m, fs_infop->str);
-	}
-}
 
 static inline void mangle(struct seq_file *m, const char *s)
 {
@@ -120,7 +102,7 @@ static int show_vfsmnt(struct seq_file *
 	err = show_sb_opts(m, sb);
 	if (err)
 		goto out;
-	show_mnt_opts(m, mnt);
+	seq_mnt_opts(m, mnt->mnt_flags);
 	if (sb->s_op->show_options)
 		err = sb->s_op->show_options(m, mnt_path.dentry);
 	seq_puts(m, " 0 0\n");
@@ -153,7 +135,7 @@ static int show_mountinfo(struct seq_fil
 		goto out;
 
 	seq_puts(m, mnt->mnt_flags & MNT_READONLY ? " ro" : " rw");
-	show_mnt_opts(m, mnt);
+	seq_mnt_opts(m, mnt->mnt_flags);
 
 	/* Tagged fields ("foo:X" or "bar") */
 	if (IS_MNT_SHARED(r))
--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ -15,6 +15,7 @@
 #include <linux/cred.h>
 #include <linux/mm.h>
 #include <linux/printk.h>
+#include <linux/mount.h>
 #include <linux/string_helpers.h>
 #include <linux/uaccess.h>
 
@@ -548,6 +549,28 @@ int seq_dentry(struct seq_file *m, struc
 }
 EXPORT_SYMBOL(seq_dentry);
 
+void seq_mnt_opts(struct seq_file *m, int mnt_flags)
+{
+	unsigned int i;
+	static const struct {
+		int flag;
+		const char *str;
+	} mnt_info[] = {
+		{ MNT_NOSUID, ",nosuid" },
+		{ MNT_NODEV, ",nodev" },
+		{ MNT_NOEXEC, ",noexec" },
+		{ MNT_NOATIME, ",noatime" },
+		{ MNT_NODIRATIME, ",nodiratime" },
+		{ MNT_RELATIME, ",relatime" },
+		{ 0, NULL }
+	};
+
+	for (i = 0; mnt_info[i].flag; i++) {
+		if (mnt_flags & mnt_info[i].flag)
+			seq_puts(m, mnt_info[i].str);
+	}
+}
+
 static void *single_start(struct seq_file *p, loff_t *pos)
 {
 	return NULL + (*pos == 0);
--- a/include/linux/seq_file.h
+++ b/include/linux/seq_file.h
@@ -138,6 +138,7 @@ int seq_file_path(struct seq_file *, str
 int seq_dentry(struct seq_file *, struct dentry *, const char *);
 int seq_path_root(struct seq_file *m, const struct path *path,
 		  const struct path *root, const char *esc);
+void seq_mnt_opts(struct seq_file *m, int mnt_flags);
 
 int single_open(struct file *, int (*)(struct seq_file *, void *), void *);
 int single_open_size(struct file *, int (*)(struct seq_file *, void *),
		     void *, size_t);
On Fri, Mar 06, 2020 at 05:25:49PM +0100, Miklos Szeredi wrote:
> On Tue, Mar 03, 2020 at 08:46:09AM +0100, Miklos Szeredi wrote:
> >
> > I'm doing a patch.  Let's see how it fares in the face of all these
> > preconceptions.
>
> Here's a first cut.  Doesn't yet have superblock info, just mount info.
> Probably has rough edges, but appears to work.

For starters, you have just made namespace_sem held over copy_to_user().
This is not going to fly.
On Fri, Mar 6, 2020 at 8:43 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Mar 06, 2020 at 05:25:49PM +0100, Miklos Szeredi wrote:
> > On Tue, Mar 03, 2020 at 08:46:09AM +0100, Miklos Szeredi wrote:
> > >
> > > I'm doing a patch.  Let's see how it fares in the face of all these
> > > preconceptions.
> >
> > Here's a first cut.  Doesn't yet have superblock info, just mount info.
> > Probably has rough edges, but appears to work.
>
> For starters, you have just made namespace_sem held over copy_to_user().
> This is not going to fly.

Where?

Thanks,
Miklos
On Fri, Mar 06, 2020 at 07:43:22PM +0000, Al Viro wrote:
> On Fri, Mar 06, 2020 at 05:25:49PM +0100, Miklos Szeredi wrote:
> > On Tue, Mar 03, 2020 at 08:46:09AM +0100, Miklos Szeredi wrote:
> > >
> > > I'm doing a patch.  Let's see how it fares in the face of all these
> > > preconceptions.
> >
> > Here's a first cut.  Doesn't yet have superblock info, just mount info.
> > Probably has rough edges, but appears to work.
>
> For starters, you have just made namespace_sem held over copy_to_user().
> This is not going to fly.

In case if the above is too terse: you grab your mutex while under
namespace_sem (see attach_recursive_mnt()); the same mutex is held
while calling dir_emit().  Which can (and normally does) copy data
to userland-supplied buffer.

NAK for that reason alone, and to be honest I had been too busy
suppressing the gag reflex to read and comment any deeper.

I really hate that approach, in case it's not clear from the above.
To the degree that I don't trust myself to filter out the obscenities
if I try to comment on it right now.

The only blocking thing we can afford under namespace_sem is GFP_KERNEL
allocation.
On Fri, Mar 06, 2020 at 07:58:23PM +0000, Al Viro wrote:
> On Fri, Mar 06, 2020 at 07:43:22PM +0000, Al Viro wrote:
> > On Fri, Mar 06, 2020 at 05:25:49PM +0100, Miklos Szeredi wrote:
> > > On Tue, Mar 03, 2020 at 08:46:09AM +0100, Miklos Szeredi wrote:
> > > >
> > > > I'm doing a patch. Let's see how it fares in the face of all these
> > > > preconceptions.
> > >
> > > Here's a first cut. Doesn't yet have superblock info, just mount info.
> > > Probably has rough edges, but appears to work.
> >
> > For starters, you have just made namespace_sem held over copy_to_user().
> > This is not going to fly.
>
> In case if the above is too terse: you grab your mutex while under
> namespace_sem (see attach_recursive_mnt()); the same mutex is held
> while calling dir_emit(). Which can (and normally does) copy data
> to userland-supplied buffer.
>
> NAK for that reason alone, and to be honest I had been too busy
> suppressing the gag reflex to read and comment any deeper.
>
> I really hate that approach, in case it's not clear from the above.
> To the degree that I don't trust myself to filter out the obscenities
> if I try to comment on it right now.
>
> The only blocking thing we can afford under namespace_sem is GFP_KERNEL
> allocation.

Incidentally, attach_recursive_mnt() only gets you the root(s) of
attached tree(s); try mount --rbind and see how much you've missed.
On Fri, Mar 6, 2020 at 9:05 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Fri, Mar 06, 2020 at 07:58:23PM +0000, Al Viro wrote:
> > On Fri, Mar 06, 2020 at 07:43:22PM +0000, Al Viro wrote:
> > > On Fri, Mar 06, 2020 at 05:25:49PM +0100, Miklos Szeredi wrote:
> > > > On Tue, Mar 03, 2020 at 08:46:09AM +0100, Miklos Szeredi wrote:
> > > > >
> > > > > I'm doing a patch. Let's see how it fares in the face of all these
> > > > > preconceptions.
> > > >
> > > > Here's a first cut. Doesn't yet have superblock info, just mount info.
> > > > Probably has rough edges, but appears to work.
> > >
> > > For starters, you have just made namespace_sem held over copy_to_user().
> > > This is not going to fly.
> >
> > In case if the above is too terse: you grab your mutex while under
> > namespace_sem (see attach_recursive_mnt()); the same mutex is held
> > while calling dir_emit(). Which can (and normally does) copy data
> > to userland-supplied buffer.
> >
> > NAK for that reason alone, and to be honest I had been too busy
> > suppressing the gag reflex to read and comment any deeper.
> >
> > I really hate that approach, in case it's not clear from the above.
> > To the degree that I don't trust myself to filter out the obscenities
> > if I try to comment on it right now.
> >
> > The only blocking thing we can afford under namespace_sem is GFP_KERNEL
> > allocation.
>
> Incidentally, attach_recursive_mnt() only gets you the root(s) of
> attached tree(s); try mount --rbind and see how much you've missed.

Okay. Both trivially fixable:

 - the dir_emit() can be taken out from under the mutex and the rb tree
   search repeated for every entry; possibly not as efficient, but I
   guess at this point that's irrelevant

 - addition of the mountfs entry moved to the right places

Thanks,
Miklos
On Fri, Mar 06, 2020 at 08:05:22PM +0000, Al Viro wrote:
> On Fri, Mar 06, 2020 at 07:58:23PM +0000, Al Viro wrote:
> > On Fri, Mar 06, 2020 at 07:43:22PM +0000, Al Viro wrote:
> > > On Fri, Mar 06, 2020 at 05:25:49PM +0100, Miklos Szeredi wrote:
> > > > On Tue, Mar 03, 2020 at 08:46:09AM +0100, Miklos Szeredi wrote:
> > > > >
> > > > > I'm doing a patch. Let's see how it fares in the face of all these
> > > > > preconceptions.
> > > >
> > > > Here's a first cut. Doesn't yet have superblock info, just mount info.
> > > > Probably has rough edges, but appears to work.
> > >
> > > For starters, you have just made namespace_sem held over copy_to_user().
> > > This is not going to fly.
> >
> > In case if the above is too terse: you grab your mutex while under
> > namespace_sem (see attach_recursive_mnt()); the same mutex is held
> > while calling dir_emit(). Which can (and normally does) copy data
> > to userland-supplied buffer.
> >
> > NAK for that reason alone, and to be honest I had been too busy
> > suppressing the gag reflex to read and comment any deeper.
> >
> > I really hate that approach, in case it's not clear from the above.
> > To the degree that I don't trust myself to filter out the obscenities
> > if I try to comment on it right now.
> >
> > The only blocking thing we can afford under namespace_sem is GFP_KERNEL
> > allocation.
>
> Incidentally, attach_recursive_mnt() only gets you the root(s) of
> attached tree(s); try mount --rbind and see how much you've missed.

You are misreading mntput_no_expire(), BTW - your get_mount() can
bloody well race with umount(2), hitting the moment when we are done
figuring out whether it's busy but hadn't cleaned ->mnt_ns (let alone
set MNT_DOOMED) yet. If somebody calls umount(2) on a filesystem that
is not mounted anywhere else, they are not supposed to see the sucker
return 0 until the filesystem is shut down. You break that.
On Fri, Mar 06, 2020 at 08:37:05PM +0000, Al Viro wrote:

> You are misreading mntput_no_expire(), BTW - your get_mount() can
> bloody well race with umount(2), hitting the moment when we are done
> figuring out whether it's busy but hadn't cleaned ->mnt_ns (let alone
> set MNT_DOOMED) yet. If somebody calls umount(2) on a filesystem that
> is not mounted anywhere else, they are not supposed to see the sucker
> return 0 until the filesystem is shut down. You break that.

While we are at it, d_alloc_parallel() requires i_rwsem on parent held
at least shared.
On Fri, Mar 06, 2020 at 08:38:44PM +0000, Al Viro wrote:
> On Fri, Mar 06, 2020 at 08:37:05PM +0000, Al Viro wrote:
>
> > You are misreading mntput_no_expire(), BTW - your get_mount() can
> > bloody well race with umount(2), hitting the moment when we are done
> > figuring out whether it's busy but hadn't cleaned ->mnt_ns (let alone
> > set MNT_DOOMED) yet. If somebody calls umount(2) on a filesystem that
> > is not mounted anywhere else, they are not supposed to see the sucker
> > return 0 until the filesystem is shut down. You break that.
>
> While we are at it, d_alloc_parallel() requires i_rwsem on parent held
> at least shared.

Egads... Let me see if I got it right - you are providing procfs symlinks
to objects on the internal mount of that thing. And those objects happen
to be directories, so one can get to their parent that way. Or am I
misreading that thing?
On Fri, Mar 06, 2020 at 08:45:23PM +0000, Al Viro wrote:
> On Fri, Mar 06, 2020 at 08:38:44PM +0000, Al Viro wrote:
> > On Fri, Mar 06, 2020 at 08:37:05PM +0000, Al Viro wrote:
> >
> > > You are misreading mntput_no_expire(), BTW - your get_mount() can
> > > bloody well race with umount(2), hitting the moment when we are done
> > > figuring out whether it's busy but hadn't cleaned ->mnt_ns (let alone
> > > set MNT_DOOMED) yet. If somebody calls umount(2) on a filesystem that
> > > is not mounted anywhere else, they are not supposed to see the sucker
> > > return 0 until the filesystem is shut down. You break that.
> >
> > While we are at it, d_alloc_parallel() requires i_rwsem on parent held
> > at least shared.
>
> Egads... Let me see if I got it right - you are providing procfs symlinks
> to objects on the internal mount of that thing. And those objects happen
> to be directories, so one can get to their parent that way. Or am I
> misreading that thing?

IDGI. You have (in your lookup) kstrtoul, followed by snprintf, followed
by strcmp and WARN_ON() in case of mismatch? Is there any point in having
stat(2) on "00" spew into syslog? Confused...
On Fri, Mar 6, 2020 at 9:45 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Fri, Mar 06, 2020 at 08:38:44PM +0000, Al Viro wrote:
> > On Fri, Mar 06, 2020 at 08:37:05PM +0000, Al Viro wrote:
> >
> > > You are misreading mntput_no_expire(), BTW - your get_mount() can
> > > bloody well race with umount(2), hitting the moment when we are done
> > > figuring out whether it's busy but hadn't cleaned ->mnt_ns (let alone
> > > set MNT_DOOMED) yet. If somebody calls umount(2) on a filesystem that
> > > is not mounted anywhere else, they are not supposed to see the sucker
> > > return 0 until the filesystem is shut down. You break that.

Ah, good point.

> >
> > While we are at it, d_alloc_parallel() requires i_rwsem on parent held
> > at least shared.

Okay.

> Egads... Let me see if I got it right - you are providing procfs symlinks
> to objects on the internal mount of that thing. And those objects happen
> to be directories, so one can get to their parent that way. Or am I
> misreading that thing?

Yes.

Thanks,
Miklos
On Fri, Mar 6, 2020 at 9:49 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Fri, Mar 06, 2020 at 08:45:23PM +0000, Al Viro wrote:
> > On Fri, Mar 06, 2020 at 08:38:44PM +0000, Al Viro wrote:
> > > On Fri, Mar 06, 2020 at 08:37:05PM +0000, Al Viro wrote:
> > >
> > > > You are misreading mntput_no_expire(), BTW - your get_mount() can
> > > > bloody well race with umount(2), hitting the moment when we are done
> > > > figuring out whether it's busy but hadn't cleaned ->mnt_ns (let alone
> > > > set MNT_DOOMED) yet. If somebody calls umount(2) on a filesystem that
> > > > is not mounted anywhere else, they are not supposed to see the sucker
> > > > return 0 until the filesystem is shut down. You break that.
> > >
> > > While we are at it, d_alloc_parallel() requires i_rwsem on parent held
> > > at least shared.
> >
> > Egads... Let me see if I got it right - you are providing procfs symlinks
> > to objects on the internal mount of that thing. And those objects happen
> > to be directories, so one can get to their parent that way. Or am I
> > misreading that thing?
>
> IDGI. You have (in your lookup) kstrtoul, followed by snprintf, followed
> by strcmp and WARN_ON() in case of mismatch? Is there any point in having
> stat(2) on "00" spew into syslog? Confused...

The WARN_ON() is for the buffer overrun, not for the strcmp mismatch.

Thanks,
Miklos
On Fri, Mar 06, 2020 at 08:49:26PM +0000, Al Viro wrote:
> On Fri, Mar 06, 2020 at 08:45:23PM +0000, Al Viro wrote:
> > On Fri, Mar 06, 2020 at 08:38:44PM +0000, Al Viro wrote:
> > > On Fri, Mar 06, 2020 at 08:37:05PM +0000, Al Viro wrote:
> > >
> > > > You are misreading mntput_no_expire(), BTW - your get_mount() can
> > > > bloody well race with umount(2), hitting the moment when we are done
> > > > figuring out whether it's busy but hadn't cleaned ->mnt_ns (let alone
> > > > set MNT_DOOMED) yet. If somebody calls umount(2) on a filesystem that
> > > > is not mounted anywhere else, they are not supposed to see the sucker
> > > > return 0 until the filesystem is shut down. You break that.
> > >
> > > While we are at it, d_alloc_parallel() requires i_rwsem on parent held
> > > at least shared.
> >
> > Egads... Let me see if I got it right - you are providing procfs symlinks
> > to objects on the internal mount of that thing. And those objects happen
> > to be directories, so one can get to their parent that way. Or am I
> > misreading that thing?
>
> IDGI. You have (in your lookup) kstrtoul, followed by snprintf, followed
> by strcmp and WARN_ON() in case of mismatch? Is there any point in having
> stat(2) on "00" spew into syslog? Confused...

AFAICS, refcounting in there cannot be right:

+static struct dentry *mountfs_lookup(struct inode *dir, struct dentry *dentry,
+				     unsigned int flags)
+{
+	struct mountfs_entry *entry = dir->i_private;
+	int i = 0;
+
+	if (entry) {
+		for (i = 0; i < ARRAY_SIZE(mountfs_attrs); i++)
+			if (strcmp(mountfs_attrs[i], dentry->d_name.name) == 0)
+				break;
+		if (i == ARRAY_SIZE(mountfs_attrs))
+			return ERR_PTR(-ENOMEM);
+		i++;
+	} else {
+		entry = mountfs_get_entry(dentry->d_name.name);
+		if (!entry)
+			return ERR_PTR(-ENOENT);
+	}
+
+	return mountfs_lookup_entry(dentry, entry, i);
+}

ends up consuming a reference in mountfs_lookup_entry() (at the very
least, it drops one in case of inode allocation hitting OOM), and
nothing in that loop in mountfs_lookup() appears to do a matching
reference grab.
On Fri, Mar 06, 2020 at 09:51:50PM +0100, Miklos Szeredi wrote:
> On Fri, Mar 6, 2020 at 9:49 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > On Fri, Mar 06, 2020 at 08:45:23PM +0000, Al Viro wrote:
> > > On Fri, Mar 06, 2020 at 08:38:44PM +0000, Al Viro wrote:
> > > > On Fri, Mar 06, 2020 at 08:37:05PM +0000, Al Viro wrote:
> > > >
> > > > > You are misreading mntput_no_expire(), BTW - your get_mount() can
> > > > > bloody well race with umount(2), hitting the moment when we are done
> > > > > figuring out whether it's busy but hadn't cleaned ->mnt_ns (let alone
> > > > > set MNT_DOOMED) yet. If somebody calls umount(2) on a filesystem that
> > > > > is not mounted anywhere else, they are not supposed to see the sucker
> > > > > return 0 until the filesystem is shut down. You break that.
> > > >
> > > > While we are at it, d_alloc_parallel() requires i_rwsem on parent held
> > > > at least shared.
> > >
> > > Egads... Let me see if I got it right - you are providing procfs symlinks
> > > to objects on the internal mount of that thing. And those objects happen
> > > to be directories, so one can get to their parent that way. Or am I
> > > misreading that thing?
> >
> > IDGI. You have (in your lookup) kstrtoul, followed by snprintf, followed
> > by strcmp and WARN_ON() in case of mismatch? Is there any point in having
> > stat(2) on "00" spew into syslog? Confused...
>
> The WARN_ON() is for the buffer overrun, not for the strcmp mismatch.

That makes even less sense - buffer overrun on snprintf of an int into
a 32-character array? That's what, future-proofing it for the time we
manage to issue 10^31 syscalls since the (much closer) moment when we
get 128bit int?
On Fri, Mar 06, 2020 at 05:25:49PM +0100, Miklos Szeredi wrote:
> On Tue, Mar 03, 2020 at 08:46:09AM +0100, Miklos Szeredi wrote:
> >
> > I'm doing a patch. Let's see how it fares in the face of all these
> > preconceptions.
>
> Here's a first cut. Doesn't yet have superblock info, just mount info.
> Probably has rough edges, but appears to work.
>
> I started with sysfs, then kernfs, then went with a custom filesystem,
> because neither could do what I wanted.

Hm, what is wrong with kernfs that prevented you from using it here?
Just complexity or something else?

thanks,

greg k-h
On Sat, Mar 7, 2020 at 10:48 AM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> On Fri, Mar 06, 2020 at 05:25:49PM +0100, Miklos Szeredi wrote:
> > On Tue, Mar 03, 2020 at 08:46:09AM +0100, Miklos Szeredi wrote:
> > >
> > > I'm doing a patch. Let's see how it fares in the face of all these
> > > preconceptions.
> >
> > Here's a first cut. Doesn't yet have superblock info, just mount info.
> > Probably has rough edges, but appears to work.
> >
> > I started with sysfs, then kernfs, then went with a custom filesystem,
> > because neither could do what I wanted.
>
> Hm, what is wrong with kernfs that prevented you from using it here?
> Just complexity or something else?

I wanted to have a single instance covering all the namespaces, with
just a filtered view depending on which namespace the task is looking
at it from. Having a kernfs_node for each attribute is also rather
heavy compared to the size of struct mount.

Thanks,
Miklos