diff mbox series

[2/2] dlmfs: convert dlmfs_file_read() to copy_to_user()

Message ID 20200529000419.4106697-2-viro@ZenIV.linux.org.uk (mailing list archive)
State New, archived
Headers show
Series [1/2] esas2r: don't bother with __copy_to_user() | expand

Commit Message

Al Viro May 29, 2020, 12:04 a.m. UTC
From: Al Viro <viro@zeniv.linux.org.uk>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/ocfs2/dlmfs/dlmfs.c | 33 ++++++++++++++-------------------
 1 file changed, 14 insertions(+), 19 deletions(-)

Comments

Linus Torvalds May 29, 2020, 1:27 a.m. UTC | #1
On Thu, May 28, 2020 at 5:04 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
>         if (*ppos >= i_size_read(inode))
>                 return 0;
>
> +       /* don't read past the lvb */
> +       if (count > i_size_read(inode) - *ppos)
> +               count = i_size_read(inode) - *ppos;

This isn't a new problem, since you effectively just moved this code,
but it's perhaps more obvious now..

"i_size_read()" is not necessarily stable - we do special things on
32-bit to make sure that we get _a_ stable value for it, but it's not
necessarily guaranteed to be the same value when called twice. Think
concurrent pread() with a write..

So the inode size could change in between those two accesses, and the
subtraction might end up underflowing despite the check just above.

This might not be an issue with ocfs2 (I didn't check locking), but ..

                  Linus
Al Viro May 29, 2020, 1:47 a.m. UTC | #2
On Thu, May 28, 2020 at 06:27:36PM -0700, Linus Torvalds wrote:
> On Thu, May 28, 2020 at 5:04 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> >         if (*ppos >= i_size_read(inode))
> >                 return 0;
> >
> > +       /* don't read past the lvb */
> > +       if (count > i_size_read(inode) - *ppos)
> > +               count = i_size_read(inode) - *ppos;
> 
> This isn't a new problem, since you effectively just moved this code,
> but it's perhaps more obvious now..
> 
> "i_size_read()" is not necessarily stable - we do special things on
> 32-bit to make sure that we get _a_ stable value for it, but it's not
> necessarily guaranteed to be the same value when called twice. Think
> concurrent pread() with a write..
> 
> So the inode size could change in between those two accesses, and the
> subtraction might end up underflowing despite the check just above.
> 
> This might not be an issue with ocfs2 (I didn't check locking), but ..

        case S_IFREG:
                inode->i_op = &dlmfs_file_inode_operations;
                inode->i_fop = &dlmfs_file_operations;

                i_size_write(inode,  DLM_LVB_LEN);
is the only thing that does anything to size of that sucker.  IOW, that
i_size_read() might as well had been an explicit 64.  Actually,
looking at that thing I would suggest simply

static ssize_t dlmfs_file_read(struct file *filp,
                               char __user *buf,
                               size_t count,
                               loff_t *ppos)
{
        struct inode *inode = file_inode(filp);
	char lvb_buf[DLM_LVB_LEN];

	if (!user_dlm_read_lvb(inode, lvb_buf, DLM_LVB_LEN))
		return 0;
	return simple_read_from_buffer(buf, count, ppos,
				       lvb_buf, DLM_LVB_LEN);
}

But that's belongs in a followup, IMO.
Linus Torvalds May 29, 2020, 1:54 a.m. UTC | #3
On Thu, May 28, 2020 at 6:47 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
>         case S_IFREG:
>                 inode->i_op = &dlmfs_file_inode_operations;
>                 inode->i_fop = &dlmfs_file_operations;
>
>                 i_size_write(inode,  DLM_LVB_LEN);
> is the only thing that does anything to size of that sucker.  IOW, that
> i_size_read() might as well had been an explicit 64.

Heh. Indeed. I did actually grep for i_size_write() use in ocfs2 and
saw several. But I didn't realize to limit it to just the dlmfs part.

So it does that crazy sequence number lock dance on 32-bit just to
read a constant value.

Oh well.

It would be nice to get those follow-up cleanups eventually, but I
guess the general user access cleanups are more important than this
very odd special case silliness.

             Linus
Al Viro May 29, 2020, 3:10 a.m. UTC | #4
On Thu, May 28, 2020 at 06:54:11PM -0700, Linus Torvalds wrote:
> On Thu, May 28, 2020 at 6:47 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> >         case S_IFREG:
> >                 inode->i_op = &dlmfs_file_inode_operations;
> >                 inode->i_fop = &dlmfs_file_operations;
> >
> >                 i_size_write(inode,  DLM_LVB_LEN);
> > is the only thing that does anything to size of that sucker.  IOW, that
> > i_size_read() might as well had been an explicit 64.
> 
> Heh. Indeed. I did actually grep for i_size_write() use in ocfs2 and
> saw several. But I didn't realize to limit it to just the dlmfs part.
> 
> So it does that crazy sequence number lock dance on 32-bit just to
> read a constant value.
> 
> Oh well.
> 
> It would be nice to get those follow-up cleanups eventually, but I
> guess the general user access cleanups are more important than this
> very odd special case silliness.

Not a problem - I'll put it into work.misc for the next cycle...
BTW, regarding uaccess - how badly does the following offend your taste?
Normally I'd just go for copy_from_user(), but these syscalls just might
be hot enough for overhead to matter...

commit 40f443f132306d724f43eaff5330b31c632455a6
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Wed Feb 19 09:54:24 2020 -0500

    pselect6() and friends: take handling the combined 6th/7th args into helper
    
    ... and use unsafe_get_user(), while we are at it.
    
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

diff --git a/fs/select.c b/fs/select.c
index 11d0285d46b7..ff0489c67e3f 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -766,6 +766,24 @@ static long do_pselect(int n, fd_set __user *inp, fd_set __user *outp,
  * which has a pointer to the sigset_t itself followed by a size_t containing
  * the sigset size.
  */
+static inline int unkludge_sigmask(void __user *sig,
+				   sigset_t __user **up,
+				   size_t *sigsetsize)
+{
+	if (sig) {
+		if (!user_read_access_begin(sig, sizeof(void *)+sizeof(size_t)))
+			return -EFAULT;
+		unsafe_get_user(*up, (sigset_t __user * __user *)sig, Efault);
+		unsafe_get_user(*sigsetsize,
+				(size_t __user *)(sig+sizeof(void *)), Efault);
+		user_read_access_end();
+	}
+	return 0;
+Efault:
+	user_access_end();
+	return -EFAULT;
+}
+
 SYSCALL_DEFINE6(pselect6, int, n, fd_set __user *, inp, fd_set __user *, outp,
 		fd_set __user *, exp, struct __kernel_timespec __user *, tsp,
 		void __user *, sig)
@@ -773,13 +791,8 @@ SYSCALL_DEFINE6(pselect6, int, n, fd_set __user *, inp, fd_set __user *, outp,
 	size_t sigsetsize = 0;
 	sigset_t __user *up = NULL;
 
-	if (sig) {
-		if (!access_ok(sig, sizeof(void *)+sizeof(size_t))
-		    || __get_user(up, (sigset_t __user * __user *)sig)
-		    || __get_user(sigsetsize,
-				(size_t __user *)(sig+sizeof(void *))))
-			return -EFAULT;
-	}
+	if (unkludge_sigmask(sig, &up, &sigsetsize))
+		return -EFAULT;
 
 	return do_pselect(n, inp, outp, exp, tsp, up, sigsetsize, PT_TIMESPEC);
 }
@@ -793,13 +806,8 @@ SYSCALL_DEFINE6(pselect6_time32, int, n, fd_set __user *, inp, fd_set __user *,
 	size_t sigsetsize = 0;
 	sigset_t __user *up = NULL;
 
-	if (sig) {
-		if (!access_ok(sig, sizeof(void *)+sizeof(size_t))
-		    || __get_user(up, (sigset_t __user * __user *)sig)
-		    || __get_user(sigsetsize,
-				(size_t __user *)(sig+sizeof(void *))))
-			return -EFAULT;
-	}
+	if (unkludge_sigmask(sig, &up, &sigsetsize))
+		return -EFAULT;
 
 	return do_pselect(n, inp, outp, exp, tsp, up, sigsetsize, PT_OLD_TIMESPEC);
 }
@@ -1325,6 +1333,25 @@ static long do_compat_pselect(int n, compat_ulong_t __user *inp,
 	return poll_select_finish(&end_time, tsp, type, ret);
 }
 
+static inline int unkludge_compat_sigmask(void __user *sig,
+				   compat_uptr_t *up,
+				   compat_size_t *sigsetsize)
+{
+	if (sig) {
+		if (!user_read_access_begin(sig,
+				sizeof(compat_uptr_t)+sizeof(compat_size_t)))
+			return -EFAULT;
+		unsafe_get_user(*up, (compat_uptr_t __user *)sig, Efault);
+		unsafe_get_user(*sigsetsize,
+				(compat_size_t __user *)(sig+sizeof(up)), Efault);
+		user_read_access_end();
+	}
+	return 0;
+Efault:
+	user_access_end();
+	return -EFAULT;
+}
+
 COMPAT_SYSCALL_DEFINE6(pselect6_time64, int, n, compat_ulong_t __user *, inp,
 	compat_ulong_t __user *, outp, compat_ulong_t __user *, exp,
 	struct __kernel_timespec __user *, tsp, void __user *, sig)
@@ -1332,14 +1359,8 @@ COMPAT_SYSCALL_DEFINE6(pselect6_time64, int, n, compat_ulong_t __user *, inp,
 	compat_size_t sigsetsize = 0;
 	compat_uptr_t up = 0;
 
-	if (sig) {
-		if (!access_ok(sig,
-				sizeof(compat_uptr_t)+sizeof(compat_size_t)) ||
-				__get_user(up, (compat_uptr_t __user *)sig) ||
-				__get_user(sigsetsize,
-				(compat_size_t __user *)(sig+sizeof(up))))
-			return -EFAULT;
-	}
+	if (unkludge_compat_sigmask(sig, &up, &sigsetsize))
+		return -EFAULT;
 
 	return do_compat_pselect(n, inp, outp, exp, tsp, compat_ptr(up),
 				 sigsetsize, PT_TIMESPEC);
@@ -1354,14 +1375,8 @@ COMPAT_SYSCALL_DEFINE6(pselect6_time32, int, n, compat_ulong_t __user *, inp,
 	compat_size_t sigsetsize = 0;
 	compat_uptr_t up = 0;
 
-	if (sig) {
-		if (!access_ok(sig,
-				sizeof(compat_uptr_t)+sizeof(compat_size_t)) ||
-		    	__get_user(up, (compat_uptr_t __user *)sig) ||
-		    	__get_user(sigsetsize,
-				(compat_size_t __user *)(sig+sizeof(up))))
-			return -EFAULT;
-	}
+	if (unkludge_compat_sigmask(sig, &up, &sigsetsize))
+		return -EFAULT;
 
 	return do_compat_pselect(n, inp, outp, exp, tsp, compat_ptr(up),
 				 sigsetsize, PT_OLD_TIMESPEC);
Linus Torvalds May 29, 2020, 3:42 a.m. UTC | #5
On Thu, May 28, 2020 at 8:10 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> BTW, regarding uaccess - how badly does the following offend your taste?
> Normally I'd just go for copy_from_user(), but these syscalls just might
> be hot enough for overhead to matter...

Hmm. So the code itself per se doesn't really offend me, but:

> +static inline int unkludge_sigmask(void __user *sig,
> +                                  sigset_t __user **up,
> +                                  size_t *sigsetsize)

That's a rather odd function, and if there's a reason for it I have no
issue, but I dislike the combination of "odd semantics" together with
"nondescriptive naming".

"unkludge" really doesn't describe anything.

Why is that "sig" pointer "void __user *" instead of being an actually
descriptive structure pointer:

   struct sigset_argpack {
        sigset_t __user *sigset;
        size_t sigset_size;
  };

and then it would be "struct sigset_size_argpack __user *" instead?
And same with compat_uptr_t */compat_size_t for the compat case?

Yeah, yeah, maybe I got that struct definition wrong when writing it
in the email, but wouldn't that make it much more understandable?

Then the output arguments could be just a pointer to that struct too
(except now in kernel space), and change that "unkludge" to
"get_sigset_argpack()", and the end result would be

    static inline int get_sigset_argpack(
          struct sigset_argpack __user *uarg,
          struct sigset_argpack *out)

and I think the implementation would be simpler and more
understandable too when it didn't need those odd casts and "+sizeof"
things etc..

So then the call-site would go from

>         size_t sigsetsize = 0;
>         sigset_t __user *up = NULL;
>
>         if (unkludge_sigmask(sig, &up, &sigsetsize))
>                 return -EFAULT;

to

>         struct sigset_argpack argpack = { NULL, 0 };
>
>         if (get_sigset_argpack(sig, &argpack))
>                 return -EFAULT;

and now you can use "argpack.sigset" and "argpack.sigset_size".

No?

Same exact deal for the compat case, where you'd just need that compat
struct (using "compat_uptr_t" and "compat_size_t"), and then

>         struct compat_sigset_argpack argpack = { 0, 0 };
>
> +       if (get_compat_sigset_argpack(sig, &argpack))
> +               return -EFAULT;

and then you use the result with "compat_ptr(argpack.sigset)" and
"argpack.sigset_size".

Or did I mis-read anything and get confused by that code in your patch?

                 Linus
Al Viro May 29, 2020, 8:46 p.m. UTC | #6
On Thu, May 28, 2020 at 08:42:25PM -0700, Linus Torvalds wrote:

> >         struct sigset_argpack argpack = { NULL, 0 };
> >
> >         if (get_sigset_argpack(sig, &argpack))
> >                 return -EFAULT;
> 
> and now you can use "argpack.sigset" and "argpack.sigset_size".
> 
> No?
> 
> Same exact deal for the compat case, where you'd just need that compat
> struct (using "compat_uptr_t" and "compat_size_t"), and then
> 
> >         struct compat_sigset_argpack argpack = { 0, 0 };
> >
> > +       if (get_compat_sigset_argpack(sig, &argpack))
> > +               return -EFAULT;
> 
> and then you use the result with "compat_ptr(argpack.sigset)" and
> "argpack.sigset_size".
> 
> Or did I mis-read anything and get confused by that code in your patch?

Umm...  I'd been concerned about code generation, but it actually gets
split into a pair of scalars just fine...

Al, trying to resist the temptation to call those struct bad_idea and
struct bad_idea_32...

All jokes aside, when had we (or anybody else, really) _not_ gotten
into trouble when passing structs across the kernel boundary?  Sure,
sometimes you have to (stat, for example), but just look at the amount
of PITA stat() has spawned...
Linus Torvalds May 29, 2020, 8:57 p.m. UTC | #7
On Fri, May 29, 2020 at 1:46 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Umm...  I'd been concerned about code generation, but it actually gets
> split into a pair of scalars just fine...

We actually have depended on that for a long time: our 'pte_t' etc on
32-bit kernels were very much about "structs of two words are handled
fairly well by gcc".

IIrc, we (for a while) had a config option to switch between "long
long" and the struct, but passing and returning two-word structs ends
up working fine even when it's a function call, and when it's all
inlined it ends up generating pretty good code on just two registers
instead.

> Al, trying to resist the temptation to call those struct bad_idea and
> struct bad_idea_32...

I'm sure you can contain yourself.

> All jokes aside, when had we (or anybody else, really) _not_ gotten
> into trouble when passing structs across the kernel boundary?  Sure,
> sometimes you have to (stat, for example), but just look at the amount
> of PITA stat() has spawned...

I'd rather see the struct than some ugly manual address calculations
and casts...

Because that's fundamentally what a struct _is_, after all.

               Linus
Al Viro May 29, 2020, 9:06 p.m. UTC | #8
On Fri, May 29, 2020 at 01:57:36PM -0700, Linus Torvalds wrote:

> > All jokes aside, when had we (or anybody else, really) _not_ gotten
> > into trouble when passing structs across the kernel boundary?  Sure,
> > sometimes you have to (stat, for example), but just look at the amount
> > of PITA stat() has spawned...
> 
> I'd rather see the struct than some ugly manual address calculations
> and casts...
> 
> Because that's fundamentally what a struct _is_, after all.

Sure; the bad idea I was refering to had been to pass the arguments from
userland that way, not the syntax used for it.  And it's obviously cast
in stone by now - userland ABI and all such...
diff mbox series

Patch

diff --git a/fs/ocfs2/dlmfs/dlmfs.c b/fs/ocfs2/dlmfs/dlmfs.c
index 8e4f1ace467c..92f0a3bc3ac5 100644
--- a/fs/ocfs2/dlmfs/dlmfs.c
+++ b/fs/ocfs2/dlmfs/dlmfs.c
@@ -227,7 +227,7 @@  static ssize_t dlmfs_file_read(struct file *filp,
 			       loff_t *ppos)
 {
 	int bytes_left;
-	ssize_t readlen, got;
+	ssize_t got;
 	char *lvb_buf;
 	struct inode *inode = file_inode(filp);
 
@@ -237,36 +237,31 @@  static ssize_t dlmfs_file_read(struct file *filp,
 	if (*ppos >= i_size_read(inode))
 		return 0;
 
+	/* don't read past the lvb */
+	if (count > i_size_read(inode) - *ppos)
+		count = i_size_read(inode) - *ppos;
+
 	if (!count)
 		return 0;
 
-	if (!access_ok(buf, count))
-		return -EFAULT;
-
-	/* don't read past the lvb */
-	if ((count + *ppos) > i_size_read(inode))
-		readlen = i_size_read(inode) - *ppos;
-	else
-		readlen = count;
-
-	lvb_buf = kmalloc(readlen, GFP_NOFS);
+	lvb_buf = kmalloc(count, GFP_NOFS);
 	if (!lvb_buf)
 		return -ENOMEM;
 
-	got = user_dlm_read_lvb(inode, lvb_buf, readlen);
+	got = user_dlm_read_lvb(inode, lvb_buf, count);
 	if (got) {
-		BUG_ON(got != readlen);
-		bytes_left = __copy_to_user(buf, lvb_buf, readlen);
-		readlen -= bytes_left;
+		BUG_ON(got != count);
+		bytes_left = copy_to_user(buf, lvb_buf, count);
+		count -= bytes_left;
 	} else
-		readlen = 0;
+		count = 0;
 
 	kfree(lvb_buf);
 
-	*ppos = *ppos + readlen;
+	*ppos = *ppos + count;
 
-	mlog(0, "read %zd bytes\n", readlen);
-	return readlen;
+	mlog(0, "read %zu bytes\n", count);
+	return count;
 }
 
 static ssize_t dlmfs_file_write(struct file *filp,