[RFC] get rid of the use of set_fs() (by way of kernel_recvmsg()) in sunrpc
diff mbox

Message ID 20180118030634.GY13338@ZenIV.linux.org.uk
State New
Headers show

Commit Message

Al Viro Jan. 18, 2018, 3:06 a.m. UTC
On Wed, Jan 17, 2018 at 06:52:32PM +0000, Al Viro wrote:

[use of set_fs() by sunrpc]
> We are asking for recvmsg() with zero data length; what we really want is
> ->msg_control.  And _that_ is why we need that set_fs() - we want the damn
> thing to go into local variable.
> 
> But note that filling ->msg_control will happen in put_cmsg(), called
> from ip_cmsg_recv_pktinfo(), called from ip_cmsg_recv_offset(),
> called from udp_recvmsg(), called from sock_recvmsg_nosec(), called
> from sock_recvmsg().  Or in another path in case of IPv6.
> Sure, we can arrange for propagation of that all way down those
> call chains.  My preference would be to try and mark that (one and
> only) case in ->msg_flags, so that put_cmsg() would be able to
> check.  ___sys_recvmsg() sets that as
>         msg_sys->msg_flags = flags & (MSG_CMSG_CLOEXEC|MSG_CMSG_COMPAT);
> so we ought to be free to use any bit other than those two.  Since
> put_cmsg() already checks ->msg_flags, that shouldn't put too much
> overhead.

Folks, does anybody have objections against the following:

Get rid of kernel_recvmsg() (and thus set_fs()) use in sunrpc

In net/sunrpc/svcsock.c:svc_udp_recvfrom() we want to get
IP_PKTINFO; currently we stash the address of local variable
into ->msg_control (which normall contains a userland pointer
in recepients) and issue zero-length ->recvmsg() under
set_fs(KERNEL_DS).

Similar to the way put_cmsg() handles 32bit case on biarch
targets, introduce a flag telling put_cmsg() to treat
->msg_control as kernel pointer, using memcpy instead of
copy_to_user().  That allows to avoid the use of kernel_recvmsg()
with its set_fs().

All places that might have non-NULL ->msg_control either pass the
msghdr only to ->sendmsg() and its ilk, or have ->msg_flags
sanitized before passing msghdr anywhere.  IOW, there no
way the new flag to reach put_cmsg() in the mainline kernel,
and after this change it will only be seen in sunrpc case.

Incidentally, all other kernel_recvmsg() users are very easy to
convert to sock_recvmsg(), so that would allow to kill
kernel_recvmsg() off...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---

Comments

Linus Torvalds Jan. 18, 2018, 3:16 a.m. UTC | #1
On Wed, Jan 17, 2018 at 7:06 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Similar to the way put_cmsg() handles 32bit case on biarch
> targets, introduce a flag telling put_cmsg() to treat
> ->msg_control as kernel pointer, using memcpy instead of
> copy_to_user().  That allows to avoid the use of kernel_recvmsg()
> with its set_fs().

If this is the only case that kernel_recvmsg() exists for, then by all
means, that patch certainly looks like a good thing.

                 Linus
Al Viro Jan. 18, 2018, 4:43 a.m. UTC | #2
On Wed, Jan 17, 2018 at 07:16:02PM -0800, Linus Torvalds wrote:
> On Wed, Jan 17, 2018 at 7:06 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Similar to the way put_cmsg() handles 32bit case on biarch
> > targets, introduce a flag telling put_cmsg() to treat
> > ->msg_control as kernel pointer, using memcpy instead of
> > copy_to_user().  That allows to avoid the use of kernel_recvmsg()
> > with its set_fs().
> 
> If this is the only case that kernel_recvmsg() exists for, then by all
> means, that patch certainly looks like a good thing.

In -next that's the only remaining caller.  kernel_recvmsg() is
{
        mm_segment_t oldfs = get_fs();
        int result;

        iov_iter_kvec(&msg->msg_iter, READ | ITER_KVEC, vec, num, size);
        set_fs(KERNEL_DS);
        result = sock_recvmsg(sock, msg, flags);
        set_fs(oldfs);
        return result;
}
and 
        iov_iter_kvec(&msg->msg_iter, READ | ITER_KVEC, vec, num, size);
        result = sock_recvmsg(sock, msg, flags);
works just fine for copying the data - that gets handled by copy_to_iter()
and copy_page_to_iter().  Those don't care about set_fs(); the trouble with
sunrpc call site is that we want to fill msg_control-pointed kernel object.
That gets copied by put_cmsg().

	We could turn ->msg_control/->msg_controllen into another
iov_iter, but seeing that we never do scatter-gather for those
IMO that would be a massive overkill.  A flag controlling whether
->msg_control is kernel or userland pointer would do, especially
since we already have a flag for "do we want a native or compat
layout for cmsg" in there.

	That's the only caller we need it for, but that thing looks cheap
enough.  Obviously needs to pass testing, including "is it too ugly to
live as far as Davem is concerned" test, though...
Christoph Hellwig Jan. 18, 2018, 4:29 p.m. UTC | #3
> 	We could turn ->msg_control/->msg_controllen into another
> iov_iter, but seeing that we never do scatter-gather for those
> IMO that would be a massive overkill.  A flag controlling whether
> ->msg_control is kernel or userland pointer would do, especially
> since we already have a flag for "do we want a native or compat
> layout for cmsg" in there.

While your current hack seems like a nice short term improvement
I think we need an iov_iter or iov_iter-light there in the long run.
Same for ioctl so that we can pass properly typed kernel or user
buffers through without all these set_fs hacks.
Al Viro Jan. 18, 2018, 5:10 p.m. UTC | #4
On Thu, Jan 18, 2018 at 08:29:57AM -0800, Christoph Hellwig wrote:
> > 	We could turn ->msg_control/->msg_controllen into another
> > iov_iter, but seeing that we never do scatter-gather for those
> > IMO that would be a massive overkill.  A flag controlling whether
> > ->msg_control is kernel or userland pointer would do, especially
> > since we already have a flag for "do we want a native or compat
> > layout for cmsg" in there.
> 
> While your current hack seems like a nice short term improvement
> I think we need an iov_iter or iov_iter-light there in the long run.

For one caller in the entire history of the kernel?

> Same for ioctl so that we can pass properly typed kernel or user
> buffers through without all these set_fs hacks.

Umm...  Most of the PITA with ioctls is due to compat ones being
reformatted for native and fed under set_fs().  I actually have
a series dealing with most of such places for net ioctls.  Sure,
there's also ioctl_by_bdev(), but for those we might be better
off exposing the things like ->get_last_session() and its ilk to
filesystems that want to deal with cdroms...

It's kernel_setsockopt() that is the real PITA...
Al Viro Jan. 18, 2018, 7:31 p.m. UTC | #5
On Thu, Jan 18, 2018 at 04:43:02AM +0000, Al Viro wrote:

> 	We could turn ->msg_control/->msg_controllen into another
> iov_iter, but seeing that we never do scatter-gather for those
> IMO that would be a massive overkill.  A flag controlling whether
> ->msg_control is kernel or userland pointer would do, especially
> since we already have a flag for "do we want a native or compat
> layout for cmsg" in there.
> 
> 	That's the only caller we need it for, but that thing looks cheap
> enough.  Obviously needs to pass testing, including "is it too ugly to
> live as far as Davem is concerned" test, though...

	BTW, there's another series of set_fs-removal patches in
net ioctls; still needs review, though.  With that one we would be down
to 11 instances in the entire net/*:

* SO_RCVTIMEO/SO_SNDTIMEO handling in compat [sg]etsockopt()
* passing SIOC{ADD,DEL}TUNNEL down (ipmr_del_tunnel(),ipmr_new_tunnel(),
  addrconf_set_dstaddr())
* SIOCGSTAMP/SIOCGSTAMPNS in compat ioctls
* SIOCADDRT/SIOCDELRT in compat ioctls
* kernel_[gs]etsockopt()
* ipv6_renew_options_kern()

I don't know if all of that stuff can be realistically done without set_fs().
kernel_setsockopt(), in particular, is unpleasant...

The patches need review and testing, obviously; I'll post them in followups,
the entire series (on top of net/master) is in
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.net-ioctl
Al Viro Jan. 18, 2018, 8:33 p.m. UTC | #6
On Thu, Jan 18, 2018 at 07:31:56PM +0000, Al Viro wrote:

> * SO_RCVTIMEO/SO_SNDTIMEO handling in compat [sg]etsockopt()
> * passing SIOC{ADD,DEL}TUNNEL down (ipmr_del_tunnel(),ipmr_new_tunnel(),
>   addrconf_set_dstaddr())
> * SIOCGSTAMP/SIOCGSTAMPNS in compat ioctls
> * SIOCADDRT/SIOCDELRT in compat ioctls
> * kernel_[gs]etsockopt()
> * ipv6_renew_options_kern()
> 
> I don't know if all of that stuff can be realistically done without set_fs().
> kernel_setsockopt(), in particular, is unpleasant...

Speaking of weird indirect calls: in net/packet/af_packet.c:packet_ioctl() we
have this:
#ifdef CONFIG_INET
        case SIOCADDRT:
        case SIOCDELRT:
        case SIOCDARP:
        case SIOCGARP:
        case SIOCSARP:
        case SIOCGIFADDR:
        case SIOCSIFADDR:
        case SIOCGIFBRDADDR:
        case SIOCSIFBRDADDR:
        case SIOCGIFNETMASK:
        case SIOCSIFNETMASK:
        case SIOCGIFDSTADDR:
        case SIOCSIFDSTADDR:
        case SIOCSIFFLAGS:
                return inet_dgram_ops.ioctl(sock, cmd, arg);
#endif

That's inet_ioctl(sock, cmd, arg) disguised by indirect.  AFAICS,
that line dates back to 2.1.89; back then inet_dgram_ops had been
exported and inet_ioctl() had been static.  When SCTP went in
they'd exported inet_ioctl() rather than playing that kind of
games.

Is there anything subtle I'm missing here that would make it
wrong to replace that with explicit call of inet_ioctl()?
Al Viro Jan. 19, 2018, 3:27 a.m. UTC | #7
On Thu, Jan 18, 2018 at 07:31:56PM +0000, Al Viro wrote:

> * SIOCADDRT/SIOCDELRT in compat ioctls

To bring back a question I'd asked back in October - what do
we do about SIOC...RT compat?

To recap:
	* AF_INET sockets expect struct rtentry; it differs
between 32bit and 64bit, so routing_ioctl() in net/socket.c
is called from compat_sock_ioctl_trans() and does the right
thing.  All proto_ops instances with .family = PF_INET (and
only they) have inet_ioctl() as ->ioctl(), and end up with
ip_rt_ioctl() called for native ones.  Three of those have
->compat_ioctl() set to inet_compat_ioctl(), the rest have
it NULL.  In any case, inet_compat_ioctl() ignores those,
leaving them to compat_sock_ioctl_trans() to pick up.
	* for AF_INET6 the situation is similar, except that
they use struct in6_rtmsg.  Compat is also dealt with in
routing_ioctl().  inet6_ioctl() for all such proto_ops
(and only those), ipv6_route_ioctl() is what ends up
handling the native ones.  No ->compat_ioctl() in any
of those.
	* AF_PACKET sockets expect struct rt_entry and
actually bounce the native calls to inet_ioctl().  No
->compat_ioctl() there, but routing_ioctl() in net/socket.c
does the right thing.
	* AF_APPLETALK sockets expect struct rt_entry.
Native handled in atrtr_ioctl(); there is ->compat_ioctl(),
but it ignores those ioctls, so we go through the conversion
in net/socket.c.  Also happens to work correctly.

	* ax25, ipx, netrom, rose and x25 use structures
of their own, and those structures have identical layouts on
32bit and 64bit.  x25 has ->compat_ioctl() that does the
right thing (bounces to native), the rest either have
->compat_ioctl() ignoring those ioctls (ipx) or do not
have ->compat_ioctl() at all.  That ends up with generic
code picking those and buggering them up - routing_ioctl()
assumes that we want either in6_rtmsg (ipv6) or rtentry
(everything else).  Unfortunately, in case of these
protocols we should just leave the suckers alone.
	Back then Ralf has verified that the bug exists
and said he'd put together a fix.  Looks like that fix
has fallen through the cracks, though.

	* all other protocols fail those; usually with
ENOTTY, except for AF_QIPCRTR that fails with EINVAL.
Either way, compat is not an issue.

	Note that handling of SIOCADDRT on e.g. raw ipv4
sockets from 32bit process is convoluted as hell.  The
call chain is
	compat_sys_ioctl()
		compat_sock_ioctl()
			inet_compat_ioctl()
				compat_raw_ioctl()
					=> -ENOIOCTLCMD, possibly
					    by way of ipmr_compat_ioctl()
			compat_sock_ioctl_trans()
				routing_ioctl() [conversion done here]
					sock_do_ioctl()
						inet_ioctl()
							ip_rt_ioctl()
A lot of those are method calls, BTW, and the overhead on those has
just grown...

Does anybody have objections against the following?

1) Somewhere in net/core (or net/compat.c, for that matter) add
int compat_get_rtentry(struct rtentry *r, struct rtentry32 __user *p);

2) In inet_compat_ioctl() recognize SIOC{ADD,DEL}RT and do
		err = compat_get_rtentry(&r, (void __user *)arg);
		if (!err)
			err = ip_rt_ioctl(...)
		return err;

3) Add inet_compat_ioctl() as ->compat_ioctl in all PF_INET proto_ops.

4) Lift copyin from atrtr_ioctl() to atalk_ioctl(), teach
atalk_compat_ioctl() about these ioctls (using compat_get_rtentry()
and atrtr_ioctl(), that is).

5) Add ->compat_ioctl() to AF_PACKET, let it just call inet_compat_ioctl()
for those two.

6) Lift copyin from ipv6_route_ioctl() to inet6_ioctl().  
Add inet6_compat_ioctl() that would recognize those two, do compat copyin
and call ipv6_route_ioctl().  Make it ->compat_ioctl for all PF_INET6
proto_ops.

7) Tell compat_sock_ioctl_trans() to move these two into the "just call
sock_do_ioctl()" group of cases.  Or, with Ralf's fix, just remove these
two cases from compat_sock_ioctl_trans() completely.  Either way,
routing_ioctl() becomes dead code and can be removed.

Patch
diff mbox

diff --git a/include/linux/socket.h b/include/linux/socket.h
index 9286a5a8c60c..60947da9c806 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -298,6 +298,7 @@  struct ucred {
 #else
 #define MSG_CMSG_COMPAT	0		/* We never have 32 bit fixups */
 #endif
+#define MSG_CMSG_KERNEL	0x10000000
 
 
 /* Setsockoptions(2) level. Thanks to BSD these must match IPPROTO_xxx */
diff --git a/net/core/scm.c b/net/core/scm.c
index b1ff8a441748..1b73b682e441 100644
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -237,10 +237,17 @@  int put_cmsg(struct msghdr * msg, int level, int type, int len, void *data)
 	cmhdr.cmsg_len = cmlen;
 
 	err = -EFAULT;
-	if (copy_to_user(cm, &cmhdr, sizeof cmhdr))
-		goto out;
-	if (copy_to_user(CMSG_DATA(cm), data, cmlen - sizeof(struct cmsghdr)))
-		goto out;
+	if (unlikely(MSG_CMSG_KERNEL & msg->msg_flags)) {
+		struct cmsghdr *p = msg->msg_control;
+		memcpy(p, &cmhdr, sizeof cmhdr);
+		memcpy(CMSG_DATA(p), data, cmlen - sizeof(struct cmsghdr));
+	} else {
+		if (copy_to_user(cm, &cmhdr, sizeof cmhdr))
+			goto out;
+		if (copy_to_user(CMSG_DATA(cm), data,
+				 cmlen - sizeof(struct cmsghdr)))
+			goto out;
+	}
 	cmlen = CMSG_SPACE(len);
 	if (msg->msg_controllen < cmlen)
 		cmlen = msg->msg_controllen;
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 5570719e4787..774904f67c93 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -545,7 +545,7 @@  static int svc_udp_recvfrom(struct svc_rqst *rqstp)
 		.msg_name = svc_addr(rqstp),
 		.msg_control = cmh,
 		.msg_controllen = sizeof(buffer),
-		.msg_flags = MSG_DONTWAIT,
+		.msg_flags = MSG_DONTWAIT | MSG_CMSG_KERNEL,
 	};
 	size_t len;
 	int err;
@@ -565,8 +565,7 @@  static int svc_udp_recvfrom(struct svc_rqst *rqstp)
 
 	clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
 	skb = NULL;
-	err = kernel_recvmsg(svsk->sk_sock, &msg, NULL,
-			     0, 0, MSG_PEEK | MSG_DONTWAIT);
+	err = sock_recvmsg(svsk->sk_sock, &msg, MSG_PEEK | MSG_DONTWAIT);
 	if (err >= 0)
 		skb = skb_recv_udp(svsk->sk_sk, 0, 1, &err);