btrfs,vfs: allow FILE_EXTENT_SAME on a file opened ro

Message ID 1463715912-8005-1-git-send-email-kilobyte@angband.pl
State Not Applicable

Commit Message

Adam Borowski May 20, 2016, 3:45 a.m. UTC
(Only btrfs currently implements dedupe_file_range.)

Instead of checking the mode of the file descriptor, let's check whether
it could have been opened rw.  This allows fixing failures when deduping
a live system: anyone trying to exec a file currently being deduped gets
ETXTBSY.

Issuing this ioctl on a ro file was already allowed for root/cap.

Signed-off-by: Adam Borowski <kilobyte@angband.pl>
---
 fs/read_write.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Adam Borowski May 26, 2016, 10:57 p.m. UTC | #1
On Fri, May 20, 2016 at 05:45:12AM +0200, Adam Borowski wrote:
> (Only btrfs currently implements dedupe_file_range.)
> 
> Instead of checking the mode of the file descriptor, let's check whether
> it could have been opened rw.  This allows fixing failures when deduping
> a live system: anyone trying to exec a file currently being deduped gets
> ETXTBSY.
> 
> Issuing this ioctl on a ro file was already allowed for root/cap.

>  fs/read_write.c | 2 +-

I'm unsure whom to bother with this patch.  I've sent it to -btrfs as it's
relevant only to btrfs (at least currently), but technically the file
changed belongs to vfs.  Was CCing -fsdevel enough or should I
retarget/resend the patch?


Apologies for a newbish question.
Meow!
Mark Fasheh May 27, 2016, 12:04 a.m. UTC | #2
On Fri, May 20, 2016 at 05:45:12AM +0200, Adam Borowski wrote:
> (Only btrfs currently implements dedupe_file_range.)
> 
> Instead of checking the mode of the file descriptor, let's check whether
> it could have been opened rw.  This allows fixing failures when deduping
> a live system: anyone trying to exec a file currently being deduped gets
> ETXTBSY.
> 
> Issuing this ioctl on a ro file was already allowed for root/cap.
> 
> Signed-off-by: Adam Borowski <kilobyte@angband.pl>

Hi Adam, this patch seems reasonable to me but I have to admit to being
worried about 'unintended consequences'. I poked around the code in fs/ for
a bit and saw mostly checks against file open mode. It might be that dedupe
is a special case due to the potential for longer running operations, but
theoretically you'd see the same problem if trying to exec against a file
being cloned too, correct? If that's the case then I wonder how this issue
gets solved for other ioctls.
	--Mark

--
Mark Fasheh
Adam Borowski May 27, 2016, 12:48 a.m. UTC | #3
On Thu, May 26, 2016 at 05:04:01PM -0700, Mark Fasheh wrote:
> On Fri, May 20, 2016 at 05:45:12AM +0200, Adam Borowski wrote:
> > (Only btrfs currently implements dedupe_file_range.)
> > 
> > Instead of checking the mode of the file descriptor, let's check whether
> > it could have been opened rw.  This allows fixing failures when deduping
> > a live system: anyone trying to exec a file currently being deduped gets
> > ETXTBSY.
> > 
> > Issuing this ioctl on a ro file was already allowed for root/cap.
> 
> Hi Adam, this patch seems reasonable to me but I have to admit to being
> worried about 'unintended consequences'. I poked around the code in fs/ for
> a bit and saw mostly checks against file open mode.

I can't think of any unintended consequences:
* root already could dedupe a file opened ro, so the code can handle that
* a file being open ro but you having rw rights means you could have opened
  it rw

There are details related to inode_permission() I admit I don't fully
understand but I believe those don't really matter as reasons for not just
allowing FILE_EXTENT_SAME for anyone who can read the file are quite
far-fetched.

> It might be that dedupe is a special case due to the potential for longer
> running operations, but theoretically you'd see the same problem if trying
> to exec against a file being cloned too, correct?  If that's the case then
> I wonder how this issue gets solved for other ioctls.

Clone is a destructive operation that overwrites the file.  FILE_EXTENT_SAME,
on the other hand, makes no changes to the POSIX view of the file, just to
its internal representation.


Meow!
Zygo Blaxell May 28, 2016, 1:59 a.m. UTC | #4
On Thu, May 26, 2016 at 05:04:01PM -0700, Mark Fasheh wrote:
> On Fri, May 20, 2016 at 05:45:12AM +0200, Adam Borowski wrote:
> > (Only btrfs currently implements dedupe_file_range.)
> > 
> > Instead of checking the mode of the file descriptor, let's check whether
> > it could have been opened rw.  This allows fixing failures when deduping
> > a live system: anyone trying to exec a file currently being deduped gets
> > ETXTBSY.

Isn't there another copy of this check in fs/btrfs/ioctl.c?

> > Issuing this ioctl on a ro file was already allowed for root/cap.
> > 
> > Signed-off-by: Adam Borowski <kilobyte@angband.pl>
> 
> Hi Adam, this patch seems reasonable to me but I have to admit to being
> worried about 'unintended consequences'. I poked around the code in fs/ for
> a bit and saw mostly checks against file open mode. It might be that dedupe
> is a special case due to the potential for longer running operations, 

The length of the operation is irrelevant.

If dedup is run over frequently executed files (e.g. /bin/sh), opening
the files in write mode will randomly disrupt some normal execution no
matter how quickly the dedup agent opens and closes the files.

The interference works both ways, too:  if a file is already being
executed, it can't be opened with write access.   Executables often
run for extended periods of time, effectively preventing them from ever
being deduped by non-root users.
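
The write-open half of this interference can be observed from userspace. The sketch below is an illustration only: it assumes a Linux system with /proc mounted, and the helper name try_write_open_self() is hypothetical. It tries to open the running program's own executable for writing and reports the resulting errno. For a native ELF binary this is expected to be ETXTBSY when the caller has write permission on the file, and EACCES (or EROFS on a read-only mount) otherwise.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: attempt to open our own running executable for
 * writing. Returns 0 if the open unexpectedly succeeded, otherwise the
 * errno reported by open().
 */
static int try_write_open_self(void)
{
    int fd = open("/proc/self/exe", O_WRONLY);

    if (fd >= 0) {
        close(fd);
        return 0;
    }
    return errno;
}
```

A dedupe agent that insists on O_RDWR hits the same failure for every binary that happens to be running at the time.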

Executables are often really good candidates for dedup, so we don't want
to avoid them.  Executable files on build servers are often duplicated
several times over in various .o and .a files and install/packaging trees.

> theoretically you'd see the same problem if trying to exec against a file
> being cloned too, correct? If that's the case then I wonder how this issue
> gets solved for other ioctls.

Clone could be used to change the content of the destination file, so
it is right to insist on having the file opened for writing in that case.

Dedupe is an unusual case because it will never change the content of
the destination file even if it is opened *with* write access.  The
defrag ioctl is similar in this respect.

Opening either dedup or defrag to non-root O_RDONLY access does have
some concerns.  It could be used to create a lot of extra I/O load,
particularly write load, that might otherwise be denied to an unprivileged
user...but then, so can repeatedly calling sync(), so maybe this isn't
a real problem.

There is a potential risk of exposing bugs in other parts of the system
(e.g. the various recently fixed dedup races) but whatever those bugs are,
we had them already without changing the permission checks.

I wonder if there is a risk of damaging files like this:

	open A O_RDWR

	open B O_RDONLY

	copy B to A

	do _not_ call fsync() here

	dedup(A, B)

	do _not_ call fsync() here either

	some time passes

	power fail

If A's extent isn't flushed to disk, what happens to B?  Does dedup imply
fsync or data ordering such that A is written to disk before the extent
ref in B is updated, or can the content of B change?  If root is running
dedup, we can assume that whoever authorized root code execution also made
sure that this case never arises (e.g. by ensuring the dedup agent always
calls fsync, or otherwise ensuring that the extent at A is stable on disk
before calling dedup).  If we allow non-root to do this then we have no
such assurance--but on the other hand maybe the assurance is weak and
bad things can happen here already even if we require root privilege.

I also wonder what happens if an executable that has called
mlockall(MCL_FUTURE) gets its pages replaced by a deduped extent while
it's running...do the replaced pages become un-mlockall-ed?  Is the VFS
smart enough to swap in the new pages?  Am I just silly and insane for
expecting shared mmap() to have meaningful real-time properties on btrfs?
Adam Borowski May 29, 2016, 12:21 a.m. UTC | #5
On Fri, May 27, 2016 at 09:59:22PM -0400, Zygo Blaxell wrote:
> On Thu, May 26, 2016 at 05:04:01PM -0700, Mark Fasheh wrote:
> > On Fri, May 20, 2016 at 05:45:12AM +0200, Adam Borowski wrote:
> > > (Only btrfs currently implements dedupe_file_range.)
> > > 
> > > Instead of checking the mode of the file descriptor, let's check whether
> > > it could have been opened rw.  This allows fixing failures when deduping
> > > a live system: anyone trying to exec a file currently being deduped gets
> > > ETXTBSY.
> 
> Isn't there another copy of this check in fs/btrfs/ioctl.c?

I can't seem to find one.  But even if there were, dedupe_file_range() is
not supposed to be specific to btrfs; btrfs merely happens to be the only
implementation at the moment.  It'd be welcome in zfs, new xfs, ...

> > Hi Adam, this patch seems reasonable to me but I have to admit to being
> > worried about 'unintended consequences'. I poked around the code in fs/ for
> > a bit and saw mostly checks against file open mode. It might be that dedupe
> > is a special case due to the potential for longer running operations, 
> 
> The length of the operation is irrelevant.
> 
> If dedup is run over frequently executed files (e.g. /bin/sh), opening
> the files in write mode will randomly disrupt some normal execution no
> matter how quickly the dedup agent opens and closes the files.

And this is exactly the failure this patch aims to fix.

> The interference works both ways, too:  if a file is already being
> executed, it can't be opened with write access.   Executables often
> run for extended periods of time, effectively preventing them from ever
> being deduped by non-root users.

That's sadly not 100% true: it depends on the handler in question.  A native
ELF file open for exec can't be opened for write; a binfmt-ed or hashbanged
one can, with disastrous results.

> > theoretically you'd see the same problem if trying to exec against a file
> > being cloned too, correct? If that's the case then I wonder how this issue
> > gets solved for other ioctls.
> 
> Clone could be used to change the content of the destination file, so
> it is right to insist on having the file opened for writing in that case.

Clone always writes to the file.  It may, as a special case, write the exact
same content as the file had before, but that's impossible to ascertain
without a race.

> Dedupe is an unusual case because it will never change the content of
> the destination file even if it is opened *with* write access.  The
> defrag ioctl is similar in this respect.

Changing defrag this way might be a good idea, too.  These ioctls are in
different parts of the kernel (vfs vs btrfs) though so let's do dedup first.

> Opening either dedup or defrag to non-root O_RDONLY access does have
> some concerns.  It could be used to create a lot of extra I/O load,
> particularly write load, that might otherwise be denied to an unprivileged
> user...but then, so can repeatedly calling sync(), so maybe this isn't
> a real problem.

As you have write access to the file, you can cause such load by... writing!

> There is a potential risk of exposing bugs in other parts of the system
> (e.g. the various recently fixed dedup races) but whatever those bugs are,
> we had them already without changing the permission checks.
> 
> I wonder if there is a risk of damaging files like this:
> 
> 	open A O_RDWR
> 
> 	open B O_RDONLY
> 
> 	copy B to A
> 
> 	do _not_ call fsync() here
> 
> 	dedup(A, B)
> 
> 	do _not_ call fsync() here either
> 
> 	some time passes
> 
> 	power fail
> 
> If A's extent isn't flushed to disk, what happens to B?  Does dedup imply
> fsync or data ordering such that A is written to disk before the extent
> ref in B is updated, or can the content of B change?  If root is running
> dedup, we can assume that whoever authorized root code execution also made
> sure that this case never arises (e.g. by ensuring the dedup agent always
> calls fsync, or otherwise ensuring that the extent at A is stable on disk
> before calling dedup).  If we allow non-root to do this then we have no
> such assurance--but on the other hand maybe the assurance is weak and
> bad things can happen here already even if we require root privilege.

I don't think this can happen on btrfs: the superblock is updated only after
a barrier when both the data and extent refs are already on the disk.
However, your scenario may apply to other filesystems that may implement
dedupe_file_range() in the future, thus it's prudent to require write
permissions. 

> I also wonder what happens if an executable that has called
> mlockall(MCL_FUTURE) gets its pages replaced by a deduped extent while
> it's running...do the replaced pages become un-mlockall-ed?  Is the VFS
> smart enough to swap in the new pages?  Am I just silly and insane for
> expecting shared mmap() to have meaningful real-time properties on btrfs?

mmap() has currently no relation to extents, even entirely identical files
don't share any pages:

dd if=/dev/urandom of=a bs=1048576 count=1024  # 1GB of junk
cp --reflink a b
time cat a >/dev/null  # all in page cache -- fast
time cat b >/dev/null  # every extent is shared with a, but...

It'd be nice to have such sharing in the future.  I don't see much
difference for mlock between pages that were shared before the mapping and
ones which become shared at runtime, though.  The complexity will be in CoW
splitting rather than joining...

In any case, this patch doesn't introduce any cases not already triggerable
by root.


Meow!
Zygo Blaxell May 29, 2016, 12:56 a.m. UTC | #6
On Sun, May 29, 2016 at 02:21:03AM +0200, Adam Borowski wrote:
> On Fri, May 27, 2016 at 09:59:22PM -0400, Zygo Blaxell wrote:
> > I wonder if there is a risk of damaging files like this:
> > 
> > 	open A O_RDWR
> > 
> > 	open B O_RDONLY
> > 
> > 	copy B to A
> > 
> > 	do _not_ call fsync() here
> > 
> > 	dedup(A, B)
> > 
> > 	do _not_ call fsync() here either
> > 
> > 	some time passes
> > 
> > 	power fail
> > 
> > If A's extent isn't flushed to disk, what happens to B?  Does dedup imply
> > fsync or data ordering such that A is written to disk before the extent
> > ref in B is updated, or can the content of B change?  If root is running
> > dedup, we can assume that whoever authorized root code execution also made
> > sure that this case never arises (e.g. by ensuring the dedup agent always
> > calls fsync, or otherwise ensuring that the extent at A is stable on disk
> > before calling dedup).  If we allow non-root to do this then we have no
> > such assurance--but on the other hand maybe the assurance is weak and
> > bad things can happen here already even if we require root privilege.
> 
> I don't think this can happen on btrfs: the superblock is updated only after
> a barrier when both the data and extent refs are already on the disk.

If and only if the filesystem is mounted with the flushoncommit option,
that's true.  This is not the default, though, and I lost a fair amount
of time and data before I discovered this.

> > I also wonder what happens if an executable that has called
> > mlockall(MCL_FUTURE) gets its pages replaced by a deduped extent while
> > it's running...do the replaced pages become un-mlockall-ed?  Is the VFS
> > smart enough to swap in the new pages?  Am I just silly and insane for
> > expecting shared mmap() to have meaningful real-time properties on btrfs?
> 
> mmap() has currently no relation to extents, even entirely identical files
> don't share any pages:
> 
> dd if=/dev/urandom of=a bs=1048576 count=1024  # 1GB of junk
> cp --reflink a b
> time cat a >/dev/null  # all in page cache -- fast
> time cat b >/dev/null  # every extent is shared with a, but...

That's precisely my point.  If we've called mlockall(MCL_FUTURE), the
executable pages are locked into RAM.  If we then replace the underlying
physical pages, this would normally invalidate the page cache for the
replaced extent; however mlockall(MCL_FUTURE) and the exec() DENYWRITE
lock combined mean that such an invalidation should not be possible.
So what happens?

(now that I think of it, whatever happens would be the same thing that happens
during device delete or balance, so it's probably OK, maybe)

> In any case, this patch doesn't introduce any cases not already triggerable
> by root.

It allows non-root to trigger cases that previously could only be
triggered by root.
Andrei Borzenkov May 29, 2016, 6:53 a.m. UTC | #7
29.05.2016 03:56, Zygo Blaxell wrote:
>>
>> I don't think this can happen on btrfs: the superblock is updated only after
>> a barrier when both the data and extent refs are already on the disk.
> 
> If and only if the filesystem is mounted with the flushoncommit option,
> that's true.  This is not the default, though, and I lost a fair amount
> of time and data before I discovered this.
> 

According to the wiki, this is the default on "reasonably recent kernels";
unfortunately it does not say what kernel is recent enough. I am surprised
it can be disabled at all.
Adam Borowski May 30, 2016, 12:24 p.m. UTC | #8
On Sat, May 28, 2016 at 08:56:39PM -0400, Zygo Blaxell wrote:
> On Sun, May 29, 2016 at 02:21:03AM +0200, Adam Borowski wrote:
> > In any case, this patch doesn't introduce any cases not already triggerable
> > by root.
> 
> It allows non-root to trigger cases that previously could only be
> triggered by root.

Only the proposed "ro is enough" variant does.  The patch, as written,
requires write permission on the inode, thus alleviating your concerns:
* mangling the contents of dstfile: the user has rw access so he can do that
  already
* triggering a bug: the user could have opened dstfile rw (like duperemove
  currently does)
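
To make the difference concrete, the two conditions from the diff can be modelled as plain predicates. This is only a sketch of the boolean logic, not kernel code: allowed_before() and allowed_after() are hypothetical names, and the second argument of allowed_after() stands in for the return value of inode_permission(dst, MAY_WRITE), which is 0 when write access is permitted and a negative errno otherwise.

```c
#include <stdbool.h>

/* Old check: admin, or the destination fd was opened for writing. */
static bool allowed_before(bool is_admin, bool fd_opened_for_write)
{
    return is_admin || fd_opened_for_write;
}

/*
 * New check: admin, or the caller has write permission on the inode.
 * inode_permission_ret models inode_permission(dst, MAY_WRITE):
 * 0 means permitted, a negative errno means denied.
 */
static bool allowed_after(bool is_admin, int inode_permission_ret)
{
    return is_admin || inode_permission_ret == 0;
}
```

In this model, a descriptor opened read-only by a user who has write permission on the file passes the new check but not the old one, while a user with read access only is rejected by both.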


Meow!

Patch

diff --git a/fs/read_write.c b/fs/read_write.c
index cf377cf..6c414d8 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1736,7 +1736,7 @@  int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
 
 		if (info->reserved) {
 			info->status = -EINVAL;
-		} else if (!(is_admin || (dst_file->f_mode & FMODE_WRITE))) {
+		} else if (!(is_admin || !inode_permission(dst, MAY_WRITE))) {
 			info->status = -EINVAL;
 		} else if (file->f_path.mnt != dst_file->f_path.mnt) {
 			info->status = -EXDEV;