Message ID | a7c93559-4ba1-df2f-7a85-55a143696405@tu-darmstadt.de (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Information Leak: FIDEDUPERANGE ioctl allows reading writeonly files | expand |
[ Adding random people who get blamed for lines in this remap_range thing to the participants ] On Tue, Jul 12, 2022 at 5:11 AM Ansgar Lößer <ansgar.loesser@tu-darmstadt.de> wrote: > > using the deduplication API we found out, that the FIDEDUPERANGE ioctl > syscall can be used to read a writeonly file. So I think your patch is slightly wrong, but I think this is worth fixing - just likely differently. First off - an odd technicality: you can already read write-only files by simply mmap'ing them, because on many architectures PROT_WRITE ends up implying PROT_READ too. So you should basically expect that "I have permissions to write to the file" automatically means that you can read it too. People simply do that "open for writing, mmap to change it" and expect it to work - not realizing that that means you have to be able to read it too. Anybody who thought otherwise was sadly wrong, and if you depend on "this is write-only" as some kind of security measure for secrets, you need to fix your setup. Now, is that a "feature or a bug"? You be the judge.It is what it is, and it's always been that way. Writability trumps readability, even if you have to do special things to get there. That said, this file remap case was clearly not intentional, and despite the mmap() issue I think this is just plain wrong and we should fix it as a QoI issue. A dedupe may only write to the destination file, but at the same time it does obviously have that implication of "I need to be able to read it to see that it's duplicate". However, your patch is wrong: > --- a/fs/remap_range.c > +++ b/fs/remap_range.c > @@ -414,11 +414,11 @@ static bool allow_file_dedupe(struct file *file) > > if (capable(CAP_SYS_ADMIN)) > return true; > - if (file->f_mode & FMODE_WRITE) > + if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == (FMODE_READ | > FMODE_WRITE)) > return true; This part looks like a good idea - although it is possible that people will argue that this is the same kind of issue as 'mmap()' has (but unlike mmap, we don't have "this is how the hardware works" issues, or "long history of uses"). But > - if (!inode_permission(mnt_userns, inode, MAY_WRITE)) > + if (!inode_permission(mnt_userns, inode, MAY_READ | MAY_WRITE)) looks wrong. Note that readability is about the file *descriptor*, not the inode. Because the file descriptor may have been opened by somebody who had permission to read the file even for a write-only file. That happens for capability reasons, but it also happens for things like "open(O_RDWR | O_CREAT, 0444)" which creates a new file that is write-only in the filesystem, but despite that the file descriptor is actually readable by the opener. I wonder if the inode_permission() check should just be removed entirely (ie the MAY_WRITE check smells bogus too, for the same reason I don't like the added MAY_READ one) The file permission check - that was done at open time - is the correct one, and is the one that read/write already uses. Any permission checks done at IO time are basically always buggy: things may have changed since the 'open()', and those changes explicitly should *not* matter for the IO. That's really fundamentally how UNIX file permissions work. Linus
On Tue, Jul 12, 2022 at 10:33:01AM -0700, Linus Torvalds wrote: > [ Adding random people who get blamed for lines in this remap_range > thing to the participants ] > > On Tue, Jul 12, 2022 at 5:11 AM Ansgar Lößer > <ansgar.loesser@tu-darmstadt.de> wrote: > > > > using the deduplication API we found out, that the FIDEDUPERANGE ioctl > > syscall can be used to read a writeonly file. > > So I think your patch is slightly wrong, but I think this is worth > fixing - just likely differently. I'm going to leave discussing the permissions aspect to the experts in that realm, but from a practical point of view, why do we allow the dedupe ioctl to investigate arbitrary byte ranges? If you're going to dedupe, it has to be block aligned (both start and length). If we enforce that in the ioctl, this attack becomes impractical (maybe you can investigate 512-byte blobs of an 8192-bit key, but we seem to max out at 4096-bit keys before switching to a fundamentally harder algorithm).
On Tue, Jul 12, 2022 at 11:43 AM Matthew Wilcox <willy@infradead.org> wrote: > > I'm going to leave discussing the permissions aspect to the experts in > that realm, but from a practical point of view, why do we allow the dedupe > ioctl to investigate arbitrary byte ranges? If you're going to dedupe, > it has to be block aligned (both start and length). I agree, although I think that's a separate issue. I suspect it should just check for inode->i_blkbits alignment. I think that's what DIO does, and it seems like a sane minimum. Linus
On Tue, Jul 12, 2022 at 11:47 AM Linus Torvalds <torvalds@linuxfoundation.org> wrote: > > I suspect it should just check for inode->i_blkbits alignment. I think > that's what DIO does, and it seems like a sane minimum. .. actually, looking closer, DIO also knows about and tries to deal with blksize_bits() of the underlying device. Which makes sense for IO patterns, because that's basically the "physical alignment" (vs "logical alignment") part. However, from a filesystem dedupe standpoint, I think the logical alignment is what matters, so just i_blkbits seems to be the best model. Linus
On Tue, Jul 12, 2022 at 1:33 PM Linus Torvalds <torvalds@linuxfoundation.org> wrote: > > [ Adding random people who get blamed for lines in this remap_range > thing to the participants ] > > On Tue, Jul 12, 2022 at 5:11 AM Ansgar Lößer > <ansgar.loesser@tu-darmstadt.de> wrote: > > > > using the deduplication API we found out, that the FIDEDUPERANGE ioctl > > syscall can be used to read a writeonly file. > > So I think your patch is slightly wrong, but I think this is worth > fixing - just likely differently. > > First off - an odd technicality: you can already read write-only files > by simply mmap'ing them, because on many architectures PROT_WRITE ends > up implying PROT_READ too. > > So you should basically expect that "I have permissions to write to > the file" automatically means that you can read it too. > > People simply do that "open for writing, mmap to change it" and expect > it to work - not realizing that that means you have to be able to read > it too. > > Anybody who thought otherwise was sadly wrong, and if you depend on > "this is write-only" as some kind of security measure for secrets, you > need to fix your setup. > > Now, is that a "feature or a bug"? You be the judge.It is what it is, > and it's always been that way. Writability trumps readability, even if > you have to do special things to get there. > > That said, this file remap case was clearly not intentional, and > despite the mmap() issue I think this is just plain wrong and we > should fix it as a QoI issue. > > A dedupe may only write to the destination file, but at the same time > it does obviously have that implication of "I need to be able to read > it to see that it's duplicate". Yeah the implication is there of course, we might as well honor it I think? Clearly it's sort of silly to say that the write doesn't imply read, especially since we can get around it in other ways, but at the same time I don't really see a harm in adding the extra "hey I can read this too, right?" since DEDUPE does imply we need to be able to read both sides. > > However, your patch is wrong: > > > --- a/fs/remap_range.c > > +++ b/fs/remap_range.c > > @@ -414,11 +414,11 @@ static bool allow_file_dedupe(struct file *file) > > > > if (capable(CAP_SYS_ADMIN)) > > return true; > > - if (file->f_mode & FMODE_WRITE) > > + if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == (FMODE_READ | > > FMODE_WRITE)) > > return true; > > This part looks like a good idea - although it is possible that people > will argue that this is the same kind of issue as 'mmap()' has (but > unlike mmap, we don't have "this is how the hardware works" issues, or > "long history of uses"). > > But > > > - if (!inode_permission(mnt_userns, inode, MAY_WRITE)) > > + if (!inode_permission(mnt_userns, inode, MAY_READ | MAY_WRITE)) > > looks wrong. > > Note that readability is about the file *descriptor*, not the inode. > Because the file descriptor may have been opened by somebody who had > permission to read the file even for a write-only file. > > That happens for capability reasons, but it also happens for things > like "open(O_RDWR | O_CREAT, 0444)" which creates a new file that is > write-only in the filesystem, but despite that the file descriptor is > actually readable by the opener. > > I wonder if the inode_permission() check should just be removed > entirely (ie the MAY_WRITE check smells bogus too, for the same reason > I don't like the added MAY_READ one) > > The file permission check - that was done at open time - is the > correct one, and is the one that read/write already uses. > > Any permission checks done at IO time are basically always buggy: > things may have changed since the 'open()', and those changes > explicitly should *not* matter for the IO. That's really fundamentally > how UNIX file permissions work. > I don't think we should go this far, after all the normal write()/read() syscalls do the permission checking each time as well, so this is consistent with any other file modification operation. Of course it's racey, but we should probably be consistently racey with any other file modification operation. Thanks, Josef
On Tue, Jul 12, 2022 at 12:02 PM Josef Bacik <josef@toxicpanda.com> wrote: > > > Any permission checks done at IO time are basically always buggy: > > things may have changed since the 'open()', and those changes > > explicitly should *not* matter for the IO. That's really fundamentally > > how UNIX file permissions work. > > I don't think we should go this far, after all the normal > write()/read() syscalls do the permission checking each time as well, No, they really don't. The permission check is ONLY DONE AT OPEN TIME. Really. Go look. Anything else is a bug. If you open a file, and then change the permissions of the file (or the ownership, or whatever) afterwards, the open file descriptor is still supposed to be readable or writable. Doing IO time permission checks is not only wrong, it's ACTIVELY BUGGY, and is a fairly common source of security problems (ie passing a file descriptor off to a suid binary, and then using the suid permissions to make changes that the original opener didn't have the rights to do). So if you do permission checks at read/write time, you are a buggy mess. It really is that simple. This is why read and write check FMODE_READ and FMODE_WRITE. That's the *open* time check. The fact that dedupe does that inode_permission() check at IO time really looks completely bogus and buggy. Linus
On Tue, Jul 12, 2022 at 12:07 PM Linus Torvalds <torvalds@linuxfoundation.org> wrote: > > Anything else is a bug. If you open a file, and then change the > permissions of the file (or the ownership, or whatever) afterwards, > the open file descriptor is still supposed to be readable or writable. .. it works the other way too. If you have higher capabilities and open a file, and then drop the capabilities after the open, the file descriptor is still supposed to be available. Again, doing some new permission check at IO time is wrong. Of course, the sad part is that we have done that wrong thing many times. Several /proc files have had fairly lax open-time permission checks, and then they do stricter checks at IO time, and it's been a serious security issue many many times. Similarly, the traditionally horribly broken BSD model of "do SCSI ioctl's using read/write calls" was a complete disaster in this area, exactly because the permission checks were then done based on the IO details (ie whatever command was written). And then that was fundamental to the whole interface, because some commands require more permissions than others, so you can't do the permission checks early. This is largely why we then have that "file->f_cred" thing, so that you can at least do things like "use the open-time credentials for checking at IO time". Or just say "if open-time credentials are different from IO time credentials, just abort". That solves the SUID issue when the actor permissions ("credentials") change, but it doesn't solve things like "somebody actually changed the target file permissions themselves" issue. So those really have to be tested at open time, and IO should not do "inode_permission()" checks, because it's just fundamentally too late by then. Of course, on eg stateless network filesystems, the IO-time permission checks may be the only ones that the server side *can* do, and then you end up with non-POSIX behavior and potential breakage. The world is a messy place. Linus
On Tue, Jul 12, 2022 at 12:07:46PM -0700, Linus Torvalds wrote: > On Tue, Jul 12, 2022 at 12:02 PM Josef Bacik <josef@toxicpanda.com> wrote: > > > > > Any permission checks done at IO time are basically always buggy: > > > things may have changed since the 'open()', and those changes > > > explicitly should *not* matter for the IO. That's really fundamentally > > > how UNIX file permissions work. > > > > I don't think we should go this far, after all the normal > > write()/read() syscalls do the permission checking each time as well, > > No, they really don't. > > The permission check is ONLY DONE AT OPEN TIME. > > Really. Go look. > I did, I just misread what rw_verify_area was doing, it's just doing the security check, not a full POSIX permissions check, my mistake. > Anything else is a bug. If you open a file, and then change the > permissions of the file (or the ownership, or whatever) afterwards, > the open file descriptor is still supposed to be readable or writable. > > Doing IO time permission checks is not only wrong, it's ACTIVELY > BUGGY, and is a fairly common source of security problems (ie passing > a file descriptor off to a suid binary, and then using the suid > permissions to make changes that the original opener didn't have the > rights to do). > > So if you do permission checks at read/write time, you are a buggy > mess. It really is that simple. > > This is why read and write check FMODE_READ and FMODE_WRITE. That's > the *open* time check. > > The fact that dedupe does that inode_permission() check at IO time > really looks completely bogus and buggy. > Yeah I'm fine with removing the inode_permission(), I just want to make sure we're consistent across all of our IO related operations. It looks like at the very least we're getting security_*_permission on things like read/write/fallocate, so perhaps switch to that here and call it good enough? Thanks, Josef
On Tue, Jul 12, 2022 at 1:04 PM Josef Bacik <josef@toxicpanda.com> wrote: > > Yeah I'm fine with removing the inode_permission(), I just want to make sure > we're consistent across all of our IO related operations. It looks like at the > very least we're getting security_*_permission on things like > read/write/fallocate, so perhaps switch to that here and call it good enough? remap_verify_area() already does that, afaik. The more I look at the remap_range stuff, the more I feel it is very ad-hoc and nobody really thought deeply about it. What about an append-only destination? Is that kind of write supposed to be ok because it's just REMAP_FILE_DEDUP? The open side should already have checked for IS_IMMUTABLE, but O_APPEND is a thing? I'm getting the feeling that somebody really needs to think about what the semantics should be. In the meantime, I think that requiring the block size alignment is the important part here. The "check read permissions" is kind of a non-issue, since we already have that mmap() case. Strangely, it *does* check that the position is aligned for remapping in .generic_remap_checks(). And not at all for dedupe. And even for remapping, the size alignment is a bit strange. It takes the "source EOF" into account, but what if the destination file is big enough that that's not the end? And the inode_permission() check is wrong, but at least it does have the important check there (ie that FMODE_WRITE one). So doing the inode_permissions() check at worst just makes it fail too often, but since it's all a "optimistically dedupe" anyway, that kind of "fail in odd situations" doesn't really matter. So for allow_file_dedupe(), I'd suggest: (a) remove the inode_permission() check in allow_file_dedupe() (b) remove the uid_eq() check for the same reason (if you didn't open the destination for write, you have no business deduping anything, even if you're the owner) (c) add a "can't do it for APPEND_ONLY" (but let the CAP_SYS_ADMIN override it) AND I'd add a "make sure it's all block-aligned" check for both the source and each destination chunk. Something like the attached, IOW. Entirely untested, this is not meant to be applied as-is, this is meant to be the basis of discussion. Btw, the generic_remap_file_range_prep() IS_IMMUTABLE check seems bogus too. How could it possibly be immutable if we've opened the target for writing? So all of this seems a bit confused. It really smells like "filesystem people wrote this with low-level filesystem rules in mind, rather than any kind of high-level understanding or conceptual rules" Hmm? Linus
On Tue, Jul 12, 2022 at 01:48:18PM -0700, Linus Torvalds wrote: > On Tue, Jul 12, 2022 at 1:04 PM Josef Bacik <josef@toxicpanda.com> wrote: > > > > Yeah I'm fine with removing the inode_permission(), I just want to make sure > > we're consistent across all of our IO related operations. It looks like at the > > very least we're getting security_*_permission on things like > > read/write/fallocate, so perhaps switch to that here and call it good enough? > > remap_verify_area() already does that, afaik. > > The more I look at the remap_range stuff, the more I feel it is very > ad-hoc and nobody really thought deeply about it. > > What about an append-only destination? Is that kind of write supposed > to be ok because it's just REMAP_FILE_DEDUP? The open side should > already have checked for IS_IMMUTABLE, but O_APPEND is a thing? This whole area of dedupe preconditions is a giant mess that makes my head hurt every time I think about it, so I don't really think about it. I hoisted all of it from btrfs to reuse the system call entry point without breaking existing userspace. Were I designing this from scratch for XFS, I would've required either CAP_SYS_ADMIN; or FMODE_READ on the src, and FMODE_READ|WRITE on the dest, and left it at that. Neither file can be IMMUTABLE because, frankly, I don't see much point in having such a flag if its behavior is the same as chmod 0000; I'll come back to this. Deduping between readable O_APPEND files would be fine because we're not writing to either file. (But that's my own fantasyland.) AFAICT, the reason why dedupe does all the weird checks it does is because apparently some of the dedupe tools expect that they can do weird things like dedupe into a file that they own but haven't opened for writes or could open for writes. Change those bits, and you break userspace. Given that you can already mmap the write-only file and read data from the mapping, I don't see what the leak is. If someone really wants to break userspace on these grounds, they can own all the howling that results. > I'm getting the feeling that somebody really needs to think about what > the semantics should be. > > In the meantime, I think that requiring the block size alignment is > the important part here. The "check read permissions" is kind of a > non-issue, since we already have that mmap() case. > > Strangely, it *does* check that the position is aligned for remapping > in .generic_remap_checks(). And not at all for dedupe. The dedupe implementations /do/ check file blocksize, it's under generic_remap_file_range_prep. The reason this exploit program works on the 7-byte file is that we stop comparing at EOF, which means that there are fewer bytes to guess. But. You can already read the file anyway. > And even for remapping, the size alignment is a bit strange. It takes > the "source EOF" into account, but what if the destination file is big > enough that that's not the end? generic_remap_check_len is supposed to catch those. > And the inode_permission() check is wrong, but at least it does have > the important check there (ie that FMODE_WRITE one). So doing the > inode_permissions() check at worst just makes it fail too often, but > since it's all a "optimistically dedupe" anyway, that kind of "fail in > odd situations" doesn't really matter. > > So for allow_file_dedupe(), I'd suggest: > > (a) remove the inode_permission() check in allow_file_dedupe() > > (b) remove the uid_eq() check for the same reason (if you didn't open > the destination for write, you have no business deduping anything, > even if you're the owner) So we're going to break userspace? https://github.com/markfasheh/duperemove/issues/129 > (c) add a "can't do it for APPEND_ONLY" (but let the CAP_SYS_ADMIN override it) > > AND I'd add a "make sure it's all block-aligned" check for both the > source and each destination chunk. ...and we're going to break deduping the EOF block again? > Something like the attached, IOW. Entirely untested, this is not meant > to be applied as-is, this is meant to be the basis of discussion. > > Btw, the generic_remap_file_range_prep() IS_IMMUTABLE check seems > bogus too. How could it possibly be immutable if we've opened the > target for writing? What /is/ the meaning of S_IMMUTABLE? Documentation/ only says that the fsverity author thinks it means "read-only contents". If it's the same as "chmod 0000" in that anyone with a writable fd can still modify the supposedly immutable file, then what was the point? The manual page for the getflags/setflags ioctls (~2017) say this: "FS_IMMUTABLE_FL 'i' The file is immutable: no changes are permitted to the file contents or metadata (permissions, timestamps, ownership, link count, and so on). (This restriction applies even to the superuser.) Only a privileged process (CAP_LINUX_IMMUTABLE) can set or clear this attribute." https://man7.org/linux/man-pages/man2/ioctl_iflags.2.html Going back to e2fsprogs 1.02 in 1997, the manual page for chattr says this: " i A file with the 'i' attribute cannot be modified: it cannot be deleted or renamed, no link can be created to this file, most of the file's metadata can not be modified, and the file can not be opened in write mode. Only the superuser or a process possessing the CAP_LINUX_IMMUTABLE capability can set or clear this attribute." https://man7.org/linux/man-pages/man1/chattr.1.html To me, that sounds like you shouldn't be able to change the contents of an immutable file, full stop, and that's what the authors of the clone/dedupe/copy_file_range calls have all gone with. True, you can write() to such a file if you have a writable fd, but that's not what I would have expected from the documentation. ISTR Ted or someone muttering that the current behavior seems much more like a historical accident than planned out semantics. > So all of this seems a bit confused. It really smells like "filesystem > people wrote this with low-level filesystem rules in mind, rather than > any kind of high-level understanding or conceptual rules" Frankly, I don't know what all the high level conceptual rules are for Linux filesystems. They don't seem to be written down in Documentation/ and this has made writing fstests very difficult for me when I want to wire up XFS to some syscall or another, and what documentation there is doesn't seem to be consistent with what the kernel actually does. --D > Hmm? > > Linus > fs/remap_range.c | 27 +++++++++++++++++++-------- > 1 file changed, 19 insertions(+), 8 deletions(-) > > diff --git a/fs/remap_range.c b/fs/remap_range.c > index e112b5424cdb..ba71ceb8dde3 100644 > --- a/fs/remap_range.c > +++ b/fs/remap_range.c > @@ -409,17 +409,12 @@ EXPORT_SYMBOL(vfs_clone_file_range); > /* Check whether we are allowed to dedupe the destination file */ > static bool allow_file_dedupe(struct file *file) > { > - struct user_namespace *mnt_userns = file_mnt_user_ns(file); > - struct inode *inode = file_inode(file); > - > if (capable(CAP_SYS_ADMIN)) > return true; > + if (file->f_flags & O_APPEND) > + return false; > if (file->f_mode & FMODE_WRITE) > return true; > - if (uid_eq(current_fsuid(), i_uid_into_mnt(mnt_userns, inode))) > - return true; > - if (!inode_permission(mnt_userns, inode, MAY_WRITE)) > - return true; > return false; > } > > @@ -428,6 +423,8 @@ loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos, > loff_t len, unsigned int remap_flags) > { > loff_t ret; > + struct inode *dst; > + unsigned long blocksize; > > WARN_ON_ONCE(remap_flags & ~(REMAP_FILE_DEDUP | > REMAP_FILE_CAN_SHORTEN)); > @@ -457,10 +454,17 @@ loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos, > goto out_drop_write; > > ret = -EISDIR; > - if (S_ISDIR(file_inode(dst_file)->i_mode)) > + dst = file_inode(dst_file); > + if (S_ISDIR(dst->i_mode)) > goto out_drop_write; > > ret = -EINVAL; > + blocksize = 1ul << dst->i_blkbits; > + if (dst_pos & (blocksize-1)) > + goto out_drop_write; > + if (len & (blocksize-1)) > + goto out_drop_write; > + > if (!dst_file->f_op->remap_file_range) > goto out_drop_write; > > @@ -488,6 +492,7 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same) > int ret; > u16 count = same->dest_count; > loff_t deduped; > + unsigned long blocksize; > > if (!(file->f_mode & FMODE_READ)) > return -EINVAL; > @@ -507,6 +512,12 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same) > if (!file->f_op->remap_file_range) > return -EOPNOTSUPP; > > + blocksize = 1ul << src->i_blkbits; > + if (off & (blocksize-1)) > + return -EINVAL; > + if (len & (blocksize-1)) > + return -EINVAL; > + > ret = remap_verify_area(file, off, len, false); > if (ret < 0) > return ret;
On Tue, Jul 12, 2022 at 5:48 PM Darrick J. Wong <djwong@kernel.org> wrote: > > AFAICT, the reason why dedupe does all the weird checks it does is > because apparently some of the dedupe tools expect that they can do > weird things like dedupe into a file that they own but haven't opened > for writes or could open for writes. Change those bits, and you break > userspace. Christ, what a crock. > The dedupe implementations /do/ check file blocksize, it's under > generic_remap_file_range_prep. The reason this exploit program works on > the 7-byte file is that we stop comparing at EOF, which means that there > are fewer bytes to guess. But. You can already read the file anyway. The dedupe clearly does *not* check for blocksize. It doesn't even call generic_remap_file_range_prep(). Just check it yourself: do_vfs_ioctl -> case FIDEDUPERANGE: ioctl_file_dedupe_range(filp, argp) -> vfs_dedupe_file_range(file, same) -> and there it is. All that calls is remap_verify_area() for both files, and allow_file_dedupe() for the target. Now, maybe some filesystem then calls the checking functions, but considering the posted code, that can't be true, because no, the the whole EOF argument is garbage, because even if the *source* was at EOF, the destination offset should still have been checked for being at a block boundary. And it clearly wasn't, since the code just walks the destination offset one byte at a time. So you may *think* that it checks the file blocksize, but I'm afraid reality disagrees. And no amount of "source is at EOF" is possibly relevant for the destination block size check. > So we're going to break userspace? > https://github.com/markfasheh/duperemove/issues/129 Well, if you're supposed to be able to dedupe read-only files, then the whole "check for writing" is just bogus to begin with. So in no way are any of those checks possibly correct. The destination offset is clearly not checked at all. And if writability isn't something you want honored, then FMODE_WRITE shouldn't have been an issue. > ...and we're going to break deduping the EOF block again? Why are you arguing about something that is clearly broken. The lack of destination offset checking cannot possibly be rigth. No amount of "EOF block" is relevant AT ALL. Stop it. You're making an argument that has nothing to do with the bug at hand. > What /is/ the meaning of S_IMMUTABLE? Documentation/ only says that the > fsverity author thinks it means "read-only contents". If it's the same > as "chmod 0000" in that anyone with a writable fd can still modify the > supposedly immutable file, then what was the point? No, the whole point is that you cannot ever get a writable file descriptor to an immutable file. So if that FMODE_WRITE check is correct, then IS_IMMUTABLE is nonsense. And if MODE_WRITE isn't correct, and dedupe is considered a read-only operation, then *that* isn't valid. You can't have it both ways. Which is it? That's really my point here - all those checks are completely random noise. There is absoltuely no rhyme or reason to them. And the offset check is clearly not there, and you should stop talking about "source EOF", because that is irrelevant. Source EOF may affect the *length* of the block, but it does not affect the offset of the destination. Linus
On Tue, Jul 12, 2022 at 7:58 PM Linus Torvalds <torvalds@linuxfoundation.org> wrote: > > Well, if you're supposed to be able to dedupe read-only files, then > the whole "check for writing" is just bogus to begin with. Hmm. Maybe that's too strong. The "if you can write to it, you can always dedup it" does make sense, and then couple that with "other situations are also ok". And if you were to want to dedup things like symlinks (do people?) you need to be able to handle FMODE_PATH cases that aren't FMODE_WRITE. And at that point, the "inode owner" check starts making sense. Except the code doesn't allow symlinks anyway right now, so that's kind of theoretical. But then you really do need to check for IS_IMMUTABLE there too and that it's a valid file type, and not just hope that it's checked for elsewhere. And those checks shouldn't pass just because of CAP_SYS_ADMIN, so that check seems to be in the wrong place too. So maybe it mostly works (apart from how clearly the destination offset alignment check is somehow missing or done too late or whatever), but it all seems very random, with checks spread out and not with any consistency or logic. As an example of this randomness, for the dedup source file, we have three different error returns for checking whether it's ok in vfs_dedupe_file_range(*): if (S_ISDIR(src->i_mode)) return -EISDIR; if (!S_ISREG(src->i_mode)) return -EINVAL; if (!file->f_op->remap_file_range) return -EOPNOTSUPP; but then for the destination check the code does ret = -EISDIR; if (S_ISDIR(file_inode(dst_file)->i_mode)) goto out_drop_write; ret = -EINVAL; if (!dst_file->f_op->remap_file_range) goto out_drop_write; and again - maybe this works, and it's just that the error return is inconsistent for source-vs-target, but it just reinforces that whole "this is all completely ad-hoc and doesn't really make much sense". Linus
On Tue, Jul 12, 2022 at 07:58:11PM -0700, Linus Torvalds wrote: > On Tue, Jul 12, 2022 at 5:48 PM Darrick J. Wong <djwong@kernel.org> wrote: > > > > AFAICT, the reason why dedupe does all the weird checks it does is > > because apparently some of the dedupe tools expect that they can do > > weird things like dedupe into a file that they own but haven't opened > > for writes or could open for writes. Change those bits, and you break > > userspace. > > Christ, what a crock. Yes, the dedupe API is ... poor. But we're stuck with it because someone called Linus Torvalds who keeps telling us that we should not break userspace if there's *any way possible* to avoid doing so.... .... and in this case, I think the problem is avoided by a simple fix to generic_remap_checks(). > > The dedupe implementations /do/ check file blocksize, it's under > > generic_remap_file_range_prep. The reason this exploit program works on > > the 7-byte file is that we stop comparing at EOF, which means that there > > are fewer bytes to guess. But. You can already read the file anyway. > > The dedupe clearly does *not* check for blocksize. It doesn't even > call generic_remap_file_range_prep(). > > Just check it yourself: I did, hours ago, and concluded that you were on the wrong track and I didn't care enough to get involved in a shouting match. But you're not actually understanding the code, nor are you listening to the people who are trying to point out that you haven't understood how the code works. You're causing those people stress and angst by shouting them down even though you are wrong - these people know the code far better than you, so you need to listen to them rather than shout over them. So, let's clear all that away, and just follow the code. I'll map out the path to block alignment for you: > do_vfs_ioctl -> > case FIDEDUPERANGE: > ioctl_file_dedupe_range(filp, argp) -> > vfs_dedupe_file_range(file, same) -> There's no checks at this point because we can't do them safely. We have to first lock the inodes before we check against EOF so that dedupe does not race against truncate, fallocate, or extending writes. This is important because we have to support the "dedupe whole file" case that is defined by src = {0, EOF}, dst = {0, EOF}, and that's where all the complexity lies. Hence we have to continue down the dedupe stack to lock the inodes and perform the block alignment checks: vfs_dedupe_file_range_one() file->f_op->remap_file_range() xfs_file_remap_range() <locks inodes> <performs XFS specific remap checks> generic_remap_file_range_prep() generic_remap_checks() generic_remap_checks() does: ..... loff_t bs = inode_out->i_sb->s_blocksize; int ret; /* The start of both ranges must be aligned to an fs block. */ if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_out, bs)) return -EINVAL; ..... /* * If the user wanted us to link to the infile's EOF, round up to the * next block boundary for this check. * * Otherwise, make sure the count is also block-aligned, having * already confirmed the starting offsets' block alignment. */ if (pos_in + count == size_in) { bcount = ALIGN(size_in, bs) - pos_in; } else { if (!IS_ALIGNED(count, bs)) count = ALIGN_DOWN(count, bs); bcount = count; } It should be clear that this code does not allow unaligned file offsets at all. It should also be clear that we silently round the dedupe length down to block size alignment to avoid sub-block dedupe attempts (not possible) rather than erroring out because it's always valid to dedupe less range than was asked for. However, we also have a special case for the "dedupe to EOF" operation that allows whole file dedupe. In that case, we round the length *up* to capture the whole EOF block in the remap attempt. That's the case where bug the test case exploits lies - it fails to check whether the dst range is also deduping all the way to EOF. We haven't got to the data match checks yet - we're still deciding on the bounds for the data match checks at this point. Hence if we get the bound checks wrong here, the data checks might match when they shouldn't. e.g. by only checking for partial block matches instead of checking all the valid data in both src and dst match. That's the bug in this code - the dedupe length EOF rounding needs to be more constrained as it's allowing an EOF block to be partially matched against any full filesystem block as long as the range from start to EOF in the block matches. That's the information leak right there, and it has nothing to do with permissions. Hence if we restrict EOF block deduping to both the src and dst files having matching EOF offsets in their EOF blocks like so: - if (pos_in + count == size_in) { + if (pos_in + count == size_in && + (!(remap_flags & REMAP_FILE_DEDUP) || pos_out + count == size_out)) { bcount = ALIGN(size_in, bs) - pos_in; } else { if (!IS_ALIGNED(count, bs)) count = ALIGN_DOWN(count, bs); bcount = count; } This should fix the bug that was reported as it prevents dedupe an EOF block against non-EOF blocks or even EOF blocks where EOF is at a different offset into the EOF block. So, yeah, I think arguing about permissions and API and all that stuff is just going completely down the wrong track because it doesn't actually address the root cause of the information leak.... Cheers, Dave.
On Tue, Jul 12, 2022 at 11:46 PM Dave Chinner <david@fromorbit.com> wrote: > > Hence if we restrict EOF block deduping to both the src and dst > files having matching EOF offsets in their EOF blocks like so: > > - if (pos_in + count == size_in) { > + if (pos_in + count == size_in && > + (!(remap_flags & REMAP_FILE_DEDUP) || pos_out + count == size_out)) { > bcount = ALIGN(size_in, bs) - pos_in; I agree with checking the target size too. And I can see how missing that might cause the problem. I don't think that is limited to the REMAP_FILE_DEDUP case, though. Even if you a clone operation, you cannot just clone the EOF block to some random part of the destination. Anyway, isn't all of this supposed to be done by generic_remap_check_len()? That function already takes care of a similar concern for REMAP_FILE_CAN_SHORTEN, where the size of the *output* file matters. So generic_remap_check_len() basically already does one EOF block check for the output file. It just doesn't do it for the input side. And currently generic_remap_check_len() is done too late for REMAP_FILE_DEDUP, which did its handling just before calling it. So while I agree with your patch from a "this seems to be the underlying bug", I think the fix should be to move this "both EOF blocks have to match" logic to generic_remap_check_len(), and just do that *before* that if (remap_flags & REMAP_FILE_DEDUP) { in generic_remap_file_range_prep(). No? That said, the rest of that code in generic_remap_checks() still makes little to no sense to me. Look: > bcount = ALIGN(size_in, bs) - pos_in; and we literally *just* checked that "pos_in + count == size_in". So we can write that as > bcount = ALIGN(pos_in + count, bs) - pos_in; That doesn't look simpler, but... Again, just a few lines above this all, we had > if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_out, bs)) > return -EINVAL; so we know that 'pos_in' is aligned wrt bs. So we can rewrite that "ALIGN(pos_in + count, bs)" as "pos_in + ALIGN(count, bs)", because 'pos_in' doesn't change anything wrt an alignment operation. And then trivial simplification ("pos_in - pos_in goes away") makes the whole expression be just > bcount = ALIGN(count, bs); which just once more makes me go "maybe this code works, but it is clearly written for maximum nonsensical value". The "else" side is equally overly complex too, and does if (!IS_ALIGNED(count, bs)) count = ALIGN_DOWN(count, bs); bcount = count; which is just a really complicated way to write bcount = ALIGN_DOWN(count, bs); count = bcount; so that side if the if-statement knew that it could just align the count directly, but decided to do that in the least obvious way too. If 'count' was already aligned, ALIGN_DOWN() does nothing. And masking is much cheaper than testing and branching. Not to mention just *simpler*: One case aligns up to the next block boundary ("include the shared EOF block"), the other case aligns down ("only try to merge full blocks")./ Now the code makes sense, although it's still somewhat subtle in that the align-down case will also update 'count' (which is returned), while the EOF code will only set 'bcount' (which is only used for the overlapping range check) . But then, when you look at that and understand what's going on, that in turn then makes *another* thing obvious: the whole existence of 'bcount' is entirely pointless. Because 'bcount' is only used for that range check, and for the non-EOF case it's the same as 'count'. And for the EOF case, doing that alignment is entirely pointless, since if the in/out inodes are the same, then the file size is going to be the same, and the EOF block is going to overlap whether bcount was aligned to block boundary or not. So the EOF case might as well just have made 'bcount = count' without any alignment to the next block boundary at all. And once it does that, now bcount is _always_ the same as count, and there is no point in having bcount at all. So after doing all the above simplification, you can then get rid of 'bcount' entirely. > So, yeah, I think arguing about permissions and API and all that > stuff is just going completely down the wrong track because it > doesn't actually address the root cause of the information leak.... I agree that getting this "check the right range" thing right is the prime thing. The code being *very* hard to follow and not having any obvious rules really does not help, though. The permission checks are odd. And the range checks were odd and inscrutable and buggy to boot. Yes, I require that people don't break user space. That's the #1 rule of kernel development. But that does not mean or imply "write incomprehensible code". And honestly, I think your suggested patch just makes incomprehensibly and pointlessly complex code even more so. Which is why I'm suggesting the real fix is to clean it up, and mover that EOF offset check to generic_remap_check_len() where it belongs, and where we already have that comment: * ... For deduplication we accept a partial EOF block only if it ends at the * destination file's EOF (can not link it into the middle of a file). but that comment doesn't actually match the code in that function. In fact, that comment currently doesn't match any existing code at all (your suggested patch would be that code, but in another place entirely). Can we please all agree that this code is too obscure for its own good? Linus
> First off - an odd technicality: you can already read write-only files > by simply mmap'ing them, because on many architectures PROT_WRITE ends > up implying PROT_READ too. > > So you should basically expect that "I have permissions to write to > the file" automatically means that you can read it too. > > People simply do that "open for writing, mmap to change it" and expect > it to work - not realizing that that means you have to be able to read > it too. Thank you for the explanation. Unfortunately I was not able to reproduce this. I do understand, that being able to write to memory without being able to read from it cannot be implemented because of hardware limitations on many architectures. However using a writeonly fd in a call to mmap() in the first place already consistently fails for me. According to the man pages this is actually intended behavior. "Errors: EACCESS: [... If] a file mapping was requested, but fd is not open for reading." Therefore I do not see how it is possible to read out data without a readable fd, since no mapping can be created without read permissions. I assumed readability not being a subset of writability for files was the exact reason for this limitation on the fd in mmap(). Do I miss something here? > But > >> - if (!inode_permission(mnt_userns, inode, MAY_WRITE)) >> + if (!inode_permission(mnt_userns, inode, MAY_READ | MAY_WRITE)) > > looks wrong. > > Note that readability is about the file *descriptor*, not the inode. > Because the file descriptor may have been opened by somebody who had > permission to read the file even for a write-only file. I do agree. At least it did not look typical for me. The idea for the patch was simply to have the smallest possible change for this specific issue. Best regards, Ansgar
>> And the inode_permission() check is wrong, but at least it does have >> the important check there (ie that FMODE_WRITE one). So doing the >> inode_permissions() check at worst just makes it fail too often, but >> since it's all a "optimistically dedupe" anyway, that kind of "fail in >> odd situations" doesn't really matter. >> >> So for allow_file_dedupe(), I'd suggest: >> >> (a) remove the inode_permission() check in allow_file_dedupe() >> >> (b) remove the uid_eq() check for the same reason (if you didn't open >> the destination for write, you have no business deduping anything, >> even if you're the owner) > So we're going to break userspace? > https://github.com/markfasheh/duperemove/issues/129 > I am actually not sure why writability is needed for 'dest' at all. Of course, the deduplication might cause a write on the block device or underlying storage, but from the abstract fs perspective, neither the data nor any metadata can change, regardless of the success of the deduplication. The only attribute that might change is the position of the block on the blockdevice. Does changing this information require write permissions? If I interpret the linked issue correctly, only needing read permissions on both fds would also solve this problem. Best regards, Ansgar
> That's the bug in this code - the dedupe length EOF rounding needs > to be more constrained as it's allowing an EOF block to be partially > matched against any full filesystem block as long as the range from > start to EOF in the block matches. That's the information leak right > there, and it has nothing to do with permissions. > > Hence if we restrict EOF block deduping to both the src and dst > files having matching EOF offsets in their EOF blocks like so: > > - if (pos_in + count == size_in) { > + if (pos_in + count == size_in && > + (!(remap_flags & REMAP_FILE_DEDUP) || pos_out + count == size_out)) { > bcount = ALIGN(size_in, bs) - pos_in; > } else { > if (!IS_ALIGNED(count, bs)) > count = ALIGN_DOWN(count, bs); > bcount = count; > } > > This should fix the bug that was reported as it prevents dedupe an > EOF block against non-EOF blocks or even EOF blocks where EOF is at > a different offset into the EOF block. > > So, yeah, I think arguing about permissions and API and all that > stuff is just going completely down the wrong track because it > doesn't actually address the root cause of the information leak.... I am sorry, while I do agree about the patch, I strongly disagree regarding the permissions. Of course, this exact approach will not work anymore in practice, since up to 2^^(4096*8) tries would be needed to guess a single 4K block correctly. Nevertheless it would be possible to guess the correct data for the whole block, since a comparison of data is still possible. So if I want to check whether a file contains a specific block (or one of a set) I can still do so, without having read access. Whether to keep this behavior to not break backwards compatibility is not my decision, but the root problem of this leak is *not* about the minimum block size that can be deduplicated, or that it can be lowered by using the EOF behavior. It is about not needing read permissions to compare data from a file. Best regards, Ansgar
On Wed, Jul 13, 2022 at 10:15 AM Ansgar Lößer <ansgar.loesser@tu-darmstadt.de> wrote: > > > Thank you for the explanation. Unfortunately I was not able to reproduce > this. I do understand, that being able to write to memory without being > able to read from it cannot be implemented because of hardware > limitations on many architectures. Oh, you are right, we actually catch that situation, and require FMODE_READ for all mmap's. For some reason I was entirely sure that this had come up and we didn't, but I see do_mmap() clearly doing fallthrough; case MAP_PRIVATE: if (!(file->f_mode & FMODE_READ)) return -EACCES; where that "fallthrough" is for the non-MAP_PRIVATE cases, so it hits shared mappings too. I was sure we had hit this case and it caused problems to check for it, but that test goes back to before the git days (and in fact to before the BK days). So clearly my "clear memory" was complete garbage. And thinking about it, I suspect said "clear memory" goes back to me having issues with the original alpha port, where the hardware *did* technically support write-only mappings (_PAGE_FOR set - "Fault On Read", but _PAGE_FOW not set). So alpha was the first port I did (and a big influence for how the portable VM model came about), and page protections could be done "right" in theory from a memory management standpoint. But even then you can't actually do it, because writable maps required reading _anyway_ - because the common alpha sequence was to do byte accesses as "read-modify-write" longword accesses. So that's likely the source of my conviction that write-only mappings always require read accesses, but I had it exactly the wrong way around - it's not even hardware-specific, but general, and just means that we actually refuse to mmap() files that have been opened write-only. That should teach me to actually go and look at the code (or test), not go by "I have a distinct memory". Linus
On Wed, Jul 13, 2022 at 07:16:14PM +0200, Ansgar Lößer wrote: > > > And the inode_permission() check is wrong, but at least it does have > > > the important check there (ie that FMODE_WRITE one). So doing the > > > inode_permissions() check at worst just makes it fail too often, but > > > since it's all a "optimistically dedupe" anyway, that kind of "fail in > > > odd situations" doesn't really matter. > > > > > > So for allow_file_dedupe(), I'd suggest: > > > > > > (a) remove the inode_permission() check in allow_file_dedupe() > > > > > > (b) remove the uid_eq() check for the same reason (if you didn't open > > > the destination for write, you have no business deduping anything, > > > even if you're the owner) > > So we're going to break userspace? > > https://github.com/markfasheh/duperemove/issues/129 > > > > I am actually not sure why writability is needed for 'dest' at all. Of > course, the deduplication might cause a write on the block device or > underlying storage, but from the abstract fs perspective, neither the data > nor any metadata can change, regardless of the success of the deduplication. Metadata is most definitely changed by a dedupe operation. Run filefrag or FIEMAP before/after and you'll see that the block mapping for at least the destination file (and maybe the source, too!) as a result of the dedupe operation. > The only attribute that might change is the position of the block on the > blockdevice. Does changing this information require write permissions? Yes. The block map is needed to access the user data, hence POSIX requires modifications to the block map be covered by fdatasync() persistence guarantees. i.e. modifying the block map of a file/inode is most definitely considered a write operation that needs a post-completion fdatasync/fsync operation to provide the modification with persistence guarantees. Indeed, we require write permissions for fallocate() operations that directly modify the block list but don't change data, too. e.g. preallocation over a hole, sparsifying a file by punching out data ranges containing only zeros, etc. These are not changing the user visible data; they only modify the underlying block map of the inode. This is exactly the same as dedupe - these operations are not modifying user visible data, they just modify the index needed to find the physical location of the user data in the filesystem. In fact, there's an fallocate() operation called FALLOC_FL_UNSHARE_RANGE, which is the exact opposite of dedupe - it takes a shared extent and explicitly breaks the sharing, copying data and changing the underlying block map of the file to do that. It doesn't change user visible data, but it most definitely changes the metadata and the data layout of the file. So, yeah, operations that directly manipulate the extent layout of the file (reflink/clone, dedupe, fallocate, etc) are most definitely modification operations. Always have been, always will be, and should always require write permissions to have been granted to the caller... Cheers, Dave.
On Wed, Jul 13, 2022 at 01:16:37AM -0700, Linus Torvalds wrote: > > Can we please all agree that this code is too obscure for its own good? Oh, there's no denying that, or that the API is .... poor. The problem is that touching this code has a very high validation burden. Like Darrick, I'm familiar with this code because we were the poor shmucks who decided dedupe needed data integrity testing before we could support it in XFS. Three months of finding and fixing data corruption after data corruption in the ioctl/vfs layers and tens of billions of fsx ops later... ... and the code has really not changed very much since then. That's the fundamental problem with rewriting this code - validating that changes have not introduced new data corruption bugs on XFS, btrfs, NFS and other idedupe supporting filesystems is time consuming and resource intensive. And it can't be skipped, because corrupting user data is even worse than breaking userspace applications (i.e. you can fix the kernel so the apps run again, but corrupted data is gone forever). So while the code might be somewhat inscrutible for outsiders with no familiarity of the dedupe/clone APIs or it's required behaviours, it works as advertised and doesn't corrupt data. Hence for the past few years the rule "don't try to fix what ain't broke" has applied - we've all got plenty of stuff that is actually broken or deficient and needs to be fixed to deal with first.... Cheers, Dave.
diff --git a/fs/remap_range.c b/fs/remap_range.c index e112b54..ad5b44d 100644 --- a/fs/remap_range.c +++ b/fs/remap_range.c @@ -414,11 +414,11 @@ static bool allow_file_dedupe(struct file *file) if (capable(CAP_SYS_ADMIN)) return true; - if (file->f_mode & FMODE_WRITE) + if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == (FMODE_READ | FMODE_WRITE)) return true; if (uid_eq(current_fsuid(), i_uid_into_mnt(mnt_userns, inode))) return true; - if (!inode_permission(mnt_userns, inode, MAY_WRITE)) + if (!inode_permission(mnt_userns, inode, MAY_READ | MAY_WRITE)) return true; return false;