diff mbox series

[1/9] kernel: add a PF_FORCE_COMPAT flag

Message ID	20200918124533.3487701-2-hch@lst.de (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=yEt/=C3=kvack.org=owner-linux-mm@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org AD6652376F From: Christoph Hellwig <hch@lst.de> To: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org>, Jens Axboe <axboe@kernel.dk>, Arnd Bergmann <arnd@arndb.de>, David Howells <dhowells@redhat.com>, linux-arm-kernel@lists.infradead.org, x86@kernel.org, linux-kernel@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, linux-s390@vger.kernel.org, sparclinux@vger.kernel.org, linux-block@vger.kernel.org, linux-scsi@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, io-uring@vger.kernel.org, linux-arch@vger.kernel.org, linux-mm@kvack.org, netdev@vger.kernel.org, keyrings@vger.kernel.org, linux-security-module@vger.kernel.org Subject: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag Date: Fri, 18 Sep 2020 14:45:25 +0200 Message-Id: <20200918124533.3487701-2-hch@lst.de> In-Reply-To: <20200918124533.3487701-1-hch@lst.de> References: <20200918124533.3487701-1-hch@lst.de> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	[1/9] kernel: add a PF_FORCE_COMPAT flag \| expand [1/9] kernel: add a PF_FORCE_COMPAT flag [2/9] compat.h: fix a spelling error in <linux/compat.h> [3/9] fs: explicitly check for CHECK_IOVEC_ONLY in rw_copy_check_uvector [4/9] fs: handle the compat case in import_iovec [5/9] fs: remove various compat readv/writev helpers [6/9] fs: remove the compat readv/writev syscalls [7/9] fs: remove compat_sys_vmsplice [8/9] mm: remove compat_process_vm_{readv,writev} [9/9] security/keys: remove compat_keyctl_instantiate_key_iov

Commit Message

Christoph Hellwig Sept. 18, 2020, 12:45 p.m. UTC

Add a flag to force processing a syscall as a compat syscall.  This is
required so that in_compat_syscall() works for I/O submitted by io_uring
helper threads on behalf of compat syscalls.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 arch/sparc/include/asm/compat.h | 3 ++-
 arch/x86/include/asm/compat.h   | 2 +-
 fs/io_uring.c                   | 9 +++++++++
 include/linux/compat.h          | 5 ++++-
 include/linux/sched.h           | 1 +
 5 files changed, 17 insertions(+), 3 deletions(-)

Comments

Al Viro Sept. 18, 2020, 1:40 p.m. UTC | #1

On Fri, Sep 18, 2020 at 02:45:25PM +0200, Christoph Hellwig wrote:
> Add a flag to force processing a syscall as a compat syscall.  This is
> required so that in_compat_syscall() works for I/O submitted by io_uring
> helper threads on behalf of compat syscalls.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  arch/sparc/include/asm/compat.h | 3 ++-
>  arch/x86/include/asm/compat.h   | 2 +-
>  fs/io_uring.c                   | 9 +++++++++
>  include/linux/compat.h          | 5 ++++-
>  include/linux/sched.h           | 1 +
>  5 files changed, 17 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/sparc/include/asm/compat.h b/arch/sparc/include/asm/compat.h
> index 40a267b3bd5208..fee6c51d36e869 100644
> --- a/arch/sparc/include/asm/compat.h
> +++ b/arch/sparc/include/asm/compat.h
> @@ -211,7 +211,8 @@ static inline int is_compat_task(void)
>  static inline bool in_compat_syscall(void)
>  {
>  	/* Vector 0x110 is LINUX_32BIT_SYSCALL_TRAP */
> -	return pt_regs_trap_type(current_pt_regs()) == 0x110;
> +	return pt_regs_trap_type(current_pt_regs()) == 0x110 ||
> +		(current->flags & PF_FORCE_COMPAT);

Can't say I like that approach ;-/  Reasoning about the behaviour is much
harder when it's controlled like that - witness set_fs() shite...

Christoph Hellwig Sept. 18, 2020, 1:44 p.m. UTC | #2

On Fri, Sep 18, 2020 at 02:40:12PM +0100, Al Viro wrote:
> >  	/* Vector 0x110 is LINUX_32BIT_SYSCALL_TRAP */
> > -	return pt_regs_trap_type(current_pt_regs()) == 0x110;
> > +	return pt_regs_trap_type(current_pt_regs()) == 0x110 ||
> > +		(current->flags & PF_FORCE_COMPAT);
> 
> Can't say I like that approach ;-/  Reasoning about the behaviour is much
> harder when it's controlled like that - witness set_fs() shite...

I don't particularly like it either.  But do you have a better idea
how to deal with io_uring vs compat tasks?

Al Viro Sept. 18, 2020, 1:58 p.m. UTC | #3

On Fri, Sep 18, 2020 at 03:44:06PM +0200, Christoph Hellwig wrote:
> On Fri, Sep 18, 2020 at 02:40:12PM +0100, Al Viro wrote:
> > >  	/* Vector 0x110 is LINUX_32BIT_SYSCALL_TRAP */
> > > -	return pt_regs_trap_type(current_pt_regs()) == 0x110;
> > > +	return pt_regs_trap_type(current_pt_regs()) == 0x110 ||
> > > +		(current->flags & PF_FORCE_COMPAT);
> > 
> > Can't say I like that approach ;-/  Reasoning about the behaviour is much
> > harder when it's controlled like that - witness set_fs() shite...
> 
> I don't particularly like it either.  But do you have a better idea
> how to deal with io_uring vs compat tasks?

<wry> git rm fs/io_uring.c would make a good starting point </wry>
Yes, I know it's not going to happen, but one can dream...

Said that, why not provide a variant that would take an explicit
"is it compat" argument and use it there?  And have the normal
one pass in_compat_syscall() to that...

Arnd Bergmann Sept. 18, 2020, 1:59 p.m. UTC | #4

On Fri, Sep 18, 2020 at 3:44 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Fri, Sep 18, 2020 at 02:40:12PM +0100, Al Viro wrote:
> > >     /* Vector 0x110 is LINUX_32BIT_SYSCALL_TRAP */
> > > -   return pt_regs_trap_type(current_pt_regs()) == 0x110;
> > > +   return pt_regs_trap_type(current_pt_regs()) == 0x110 ||
> > > +           (current->flags & PF_FORCE_COMPAT);
> >
> > Can't say I like that approach ;-/  Reasoning about the behaviour is much
> > harder when it's controlled like that - witness set_fs() shite...
>
> I don't particularly like it either.  But do you have a better idea
> how to deal with io_uring vs compat tasks?

Do we need to worry about something other than the compat_iovec
struct for now? Regarding the code in io_import_iovec(), it would
seem that can easily be handled by exposing an internal helper.
Instead of

#ifdef CONFIG_COMPAT
     if (req->ctx->compat)
            return compat_import_iovec(rw, buf, sqe_len, UIO_FASTIOV,
iovec, iter);
#endif
        return import_iovec(rw, buf, sqe_len, UIO_FASTIOV, iovec, iter);

This could do

    __import_iovec(rw, buf, sqe_len, UIO_FASTIOV, iovec,
                     iter, req->ctx->compat);

With the normal import_iovec() becoming a trivial wrapper around
the same thing:

ssize_t import_iovec(int type, const struct iovec __user * uvector,
                 unsigned nr_segs, unsigned fast_segs,
                 struct iovec **iov, struct iov_iter *i)
{
     return __import_iovec(type, uvector, nr_segs, fast_segs, iov,
              i, in_compat_syscall());
}


         Arnd

Christoph Hellwig Sept. 18, 2020, 3:16 p.m. UTC | #5

On Fri, Sep 18, 2020 at 02:58:22PM +0100, Al Viro wrote:
> Said that, why not provide a variant that would take an explicit
> "is it compat" argument and use it there?  And have the normal
> one pass in_compat_syscall() to that...

That would help to not introduce a regression with this series yes.
But it wouldn't fix existing bugs when io_uring is used to access
read or write methods that use in_compat_syscall().  One example that
I recently ran into is drivers/scsi/sg.c.

David Laight Sept. 19, 2020, 2:53 p.m. UTC | #6

From: Al Viro
> Sent: 18 September 2020 14:58
> 
> On Fri, Sep 18, 2020 at 03:44:06PM +0200, Christoph Hellwig wrote:
> > On Fri, Sep 18, 2020 at 02:40:12PM +0100, Al Viro wrote:
> > > >  	/* Vector 0x110 is LINUX_32BIT_SYSCALL_TRAP */
> > > > -	return pt_regs_trap_type(current_pt_regs()) == 0x110;
> > > > +	return pt_regs_trap_type(current_pt_regs()) == 0x110 ||
> > > > +		(current->flags & PF_FORCE_COMPAT);
> > >
> > > Can't say I like that approach ;-/  Reasoning about the behaviour is much
> > > harder when it's controlled like that - witness set_fs() shite...
> >
> > I don't particularly like it either.  But do you have a better idea
> > how to deal with io_uring vs compat tasks?
> 
> <wry> git rm fs/io_uring.c would make a good starting point </wry>
> Yes, I know it's not going to happen, but one can dream...

Maybe the io_uring code needs some changes to make it vaguely safe.
- No support for 32-bit compat mixed working (or at all?).
  Plausibly a special worker could do 32bit work.
- ring structure (I'm assuming mapped by mmap()) never mapped
  in more than one process (not cloned by fork()).
- No implicit handover of files to another process.
  Would need an munmap, handover, mmap sequence.

In any case the io_ring rather abuses the import_iovec() interface.

The canonical sequence is (types from memory):
	struct iovec cache[8], *iov = cache;
	struct iter iter;
	...
	rval = import_iovec(..., &iov, 8, &iter);
	// Do read/write user using 'iter'
	free(iov);

I don't think there is any strict requirement that iter.iov
is set to either 'cache' or 'iov' (it probably must point
into one of them.)
But the io_uring code will make that assumption because the
actual copies can be done much later and it doesn't save 'iter'.
It gets itself in a right mess because it doesn't separate
the 'address I need to free' from 'the iov[] for any transfers'.

io_uring is also the only code that relies on import_iovec()
returning the iter.count on success.
It would be much better to have:
	iov = import_iovec(..., &cache, ...);
	free(iov);
and use ERR_PTR() et al for error detectoion.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Andy Lutomirski Sept. 19, 2020, 4:21 p.m. UTC | #7

On Fri, Sep 18, 2020 at 8:16 AM Christoph Hellwig <hch@lst.de> wrote:
>
> On Fri, Sep 18, 2020 at 02:58:22PM +0100, Al Viro wrote:
> > Said that, why not provide a variant that would take an explicit
> > "is it compat" argument and use it there?  And have the normal
> > one pass in_compat_syscall() to that...
>
> That would help to not introduce a regression with this series yes.
> But it wouldn't fix existing bugs when io_uring is used to access
> read or write methods that use in_compat_syscall().  One example that
> I recently ran into is drivers/scsi/sg.c.

Aside from the potentially nasty use of per-task variables, one thing
I don't like about PF_FORCE_COMPAT is that it's one-way.  If we're
going to have a generic mechanism for this, shouldn't we allow a full
override of the syscall arch instead of just allowing forcing compat
so that a compat syscall can do a non-compat operation?

Arnd Bergmann Sept. 19, 2020, 9:16 p.m. UTC | #8

On Sat, Sep 19, 2020 at 6:21 PM Andy Lutomirski <luto@kernel.org> wrote:
> On Fri, Sep 18, 2020 at 8:16 AM Christoph Hellwig <hch@lst.de> wrote:
> > On Fri, Sep 18, 2020 at 02:58:22PM +0100, Al Viro wrote:
> > > Said that, why not provide a variant that would take an explicit
> > > "is it compat" argument and use it there?  And have the normal
> > > one pass in_compat_syscall() to that...
> >
> > That would help to not introduce a regression with this series yes.
> > But it wouldn't fix existing bugs when io_uring is used to access
> > read or write methods that use in_compat_syscall().  One example that
> > I recently ran into is drivers/scsi/sg.c.

Ah, so reading /dev/input/event* would suffer from the same issue,
and that one would in fact be broken by your patch in the hypothetical
case that someone tried to use io_uring to read /dev/input/event on x32...

For reference, I checked the socket timestamp handling that has a
number of corner cases with time32/time64 formats in compat mode,
but none of those appear to be affected by the problem.

> Aside from the potentially nasty use of per-task variables, one thing
> I don't like about PF_FORCE_COMPAT is that it's one-way.  If we're
> going to have a generic mechanism for this, shouldn't we allow a full
> override of the syscall arch instead of just allowing forcing compat
> so that a compat syscall can do a non-compat operation?

The only reason it's needed here is that the caller is in a kernel
thread rather than a system call. Are there any possible scenarios
where one would actually need the opposite?

       Arnd

Finn Thain Sept. 19, 2020, 9:52 p.m. UTC | #9

On Sat, 19 Sep 2020, Arnd Bergmann wrote:

> On Sat, Sep 19, 2020 at 6:21 PM Andy Lutomirski <luto@kernel.org> wrote:
> > On Fri, Sep 18, 2020 at 8:16 AM Christoph Hellwig <hch@lst.de> wrote:
> > > On Fri, Sep 18, 2020 at 02:58:22PM +0100, Al Viro wrote:
> > > > Said that, why not provide a variant that would take an explicit 
> > > > "is it compat" argument and use it there?  And have the normal one 
> > > > pass in_compat_syscall() to that...
> > >
> > > That would help to not introduce a regression with this series yes. 
> > > But it wouldn't fix existing bugs when io_uring is used to access 
> > > read or write methods that use in_compat_syscall().  One example 
> > > that I recently ran into is drivers/scsi/sg.c.
> 
> Ah, so reading /dev/input/event* would suffer from the same issue, and 
> that one would in fact be broken by your patch in the hypothetical case 
> that someone tried to use io_uring to read /dev/input/event on x32...
> 
> For reference, I checked the socket timestamp handling that has a number 
> of corner cases with time32/time64 formats in compat mode, but none of 
> those appear to be affected by the problem.
> 
> > Aside from the potentially nasty use of per-task variables, one thing 
> > I don't like about PF_FORCE_COMPAT is that it's one-way.  If we're 
> > going to have a generic mechanism for this, shouldn't we allow a full 
> > override of the syscall arch instead of just allowing forcing compat 
> > so that a compat syscall can do a non-compat operation?
> 
> The only reason it's needed here is that the caller is in a kernel 
> thread rather than a system call. Are there any possible scenarios where 
> one would actually need the opposite?
> 

Quite possibly. The ext4 vs. compat getdents bug is still unresolved. 
Please see,
https://lore.kernel.org/lkml/CAFEAcA9W+JK7_TrtTnL1P2ES1knNPJX9wcUvhfLwxLq9augq1w@mail.gmail.com/

>        Arnd
>

Al Viro Sept. 19, 2020, 10:09 p.m. UTC | #10

On Fri, Sep 18, 2020 at 05:16:15PM +0200, Christoph Hellwig wrote:
> On Fri, Sep 18, 2020 at 02:58:22PM +0100, Al Viro wrote:
> > Said that, why not provide a variant that would take an explicit
> > "is it compat" argument and use it there?  And have the normal
> > one pass in_compat_syscall() to that...
> 
> That would help to not introduce a regression with this series yes.
> But it wouldn't fix existing bugs when io_uring is used to access
> read or write methods that use in_compat_syscall().  One example that
> I recently ran into is drivers/scsi/sg.c.

So screw such read/write methods - don't use them with io_uring.
That, BTW, is one of the reasons I'm sceptical about burying the
decisions deep into the callchain - we don't _want_ different
data layouts on read/write depending upon the 32bit vs. 64bit
caller, let alone the pointer-chasing garbage that is /dev/sg.

Andy Lutomirski Sept. 19, 2020, 10:22 p.m. UTC | #11

> On Sep 19, 2020, at 2:16 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> 
> On Sat, Sep 19, 2020 at 6:21 PM Andy Lutomirski <luto@kernel.org> wrote:
>>> On Fri, Sep 18, 2020 at 8:16 AM Christoph Hellwig <hch@lst.de> wrote:
>>> On Fri, Sep 18, 2020 at 02:58:22PM +0100, Al Viro wrote:
>>>> Said that, why not provide a variant that would take an explicit
>>>> "is it compat" argument and use it there?  And have the normal
>>>> one pass in_compat_syscall() to that...
>>> 
>>> That would help to not introduce a regression with this series yes.
>>> But it wouldn't fix existing bugs when io_uring is used to access
>>> read or write methods that use in_compat_syscall().  One example that
>>> I recently ran into is drivers/scsi/sg.c.
> 
> Ah, so reading /dev/input/event* would suffer from the same issue,
> and that one would in fact be broken by your patch in the hypothetical
> case that someone tried to use io_uring to read /dev/input/event on x32...
> 
> For reference, I checked the socket timestamp handling that has a
> number of corner cases with time32/time64 formats in compat mode,
> but none of those appear to be affected by the problem.
> 
>> Aside from the potentially nasty use of per-task variables, one thing
>> I don't like about PF_FORCE_COMPAT is that it's one-way.  If we're
>> going to have a generic mechanism for this, shouldn't we allow a full
>> override of the syscall arch instead of just allowing forcing compat
>> so that a compat syscall can do a non-compat operation?
> 
> The only reason it's needed here is that the caller is in a kernel
> thread rather than a system call. Are there any possible scenarios
> where one would actually need the opposite?
> 

I can certainly imagine needing to force x32 mode from a kernel thread.

As for the other direction: what exactly are the desired bitness/arch semantics of io_uring?  Is the operation bitness chosen by the io_uring creation or by the io_uring_enter() bitness?

Andy Lutomirski Sept. 19, 2020, 10:23 p.m. UTC | #12

> On Sep 19, 2020, at 3:09 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> 
> On Fri, Sep 18, 2020 at 05:16:15PM +0200, Christoph Hellwig wrote:
>>> On Fri, Sep 18, 2020 at 02:58:22PM +0100, Al Viro wrote:
>>> Said that, why not provide a variant that would take an explicit
>>> "is it compat" argument and use it there?  And have the normal
>>> one pass in_compat_syscall() to that...
>> 
>> That would help to not introduce a regression with this series yes.
>> But it wouldn't fix existing bugs when io_uring is used to access
>> read or write methods that use in_compat_syscall().  One example that
>> I recently ran into is drivers/scsi/sg.c.
> 
> So screw such read/write methods - don't use them with io_uring.
> That, BTW, is one of the reasons I'm sceptical about burying the
> decisions deep into the callchain - we don't _want_ different
> data layouts on read/write depending upon the 32bit vs. 64bit
> caller, let alone the pointer-chasing garbage that is /dev/sg.

Well, we could remove in_compat_syscall(), etc and instead have an implicit parameter in DEFINE_SYSCALL.  Then everything would have to be explicit.  This would probably be a win, although it could be quite a bit of work.

Al Viro Sept. 19, 2020, 10:41 p.m. UTC | #13

On Sat, Sep 19, 2020 at 03:23:54PM -0700, Andy Lutomirski wrote:
> 
> > On Sep 19, 2020, at 3:09 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > 
> > On Fri, Sep 18, 2020 at 05:16:15PM +0200, Christoph Hellwig wrote:
> >>> On Fri, Sep 18, 2020 at 02:58:22PM +0100, Al Viro wrote:
> >>> Said that, why not provide a variant that would take an explicit
> >>> "is it compat" argument and use it there?  And have the normal
> >>> one pass in_compat_syscall() to that...
> >> 
> >> That would help to not introduce a regression with this series yes.
> >> But it wouldn't fix existing bugs when io_uring is used to access
> >> read or write methods that use in_compat_syscall().  One example that
> >> I recently ran into is drivers/scsi/sg.c.
> > 
> > So screw such read/write methods - don't use them with io_uring.
> > That, BTW, is one of the reasons I'm sceptical about burying the
> > decisions deep into the callchain - we don't _want_ different
> > data layouts on read/write depending upon the 32bit vs. 64bit
> > caller, let alone the pointer-chasing garbage that is /dev/sg.
> 
> Well, we could remove in_compat_syscall(), etc and instead have an implicit parameter in DEFINE_SYSCALL.  Then everything would have to be explicit.  This would probably be a win, although it could be quite a bit of work.

It would not be a win - most of the syscalls don't give a damn
about 32bit vs. 64bit...

Andy Lutomirski Sept. 19, 2020, 10:53 p.m. UTC | #14

> On Sep 19, 2020, at 3:41 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> 
> On Sat, Sep 19, 2020 at 03:23:54PM -0700, Andy Lutomirski wrote:
>> 
>>>> On Sep 19, 2020, at 3:09 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>> 
>>> On Fri, Sep 18, 2020 at 05:16:15PM +0200, Christoph Hellwig wrote:
>>>>> On Fri, Sep 18, 2020 at 02:58:22PM +0100, Al Viro wrote:
>>>>> Said that, why not provide a variant that would take an explicit
>>>>> "is it compat" argument and use it there?  And have the normal
>>>>> one pass in_compat_syscall() to that...
>>>> 
>>>> That would help to not introduce a regression with this series yes.
>>>> But it wouldn't fix existing bugs when io_uring is used to access
>>>> read or write methods that use in_compat_syscall().  One example that
>>>> I recently ran into is drivers/scsi/sg.c.
>>> 
>>> So screw such read/write methods - don't use them with io_uring.
>>> That, BTW, is one of the reasons I'm sceptical about burying the
>>> decisions deep into the callchain - we don't _want_ different
>>> data layouts on read/write depending upon the 32bit vs. 64bit
>>> caller, let alone the pointer-chasing garbage that is /dev/sg.
>> 
>> Well, we could remove in_compat_syscall(), etc and instead have an implicit parameter in DEFINE_SYSCALL.  Then everything would have to be explicit.  This would probably be a win, although it could be quite a bit of work.
> 
> It would not be a win - most of the syscalls don't give a damn
> about 32bit vs. 64bit...

Any reasonable implementation would optimize it out for syscalls that don’t care.  Or it could be explicit:

DEFINE_MULTIARCH_SYSCALL(...)

Al Viro Sept. 19, 2020, 11:24 p.m. UTC | #15

On Sat, Sep 19, 2020 at 03:53:40PM -0700, Andy Lutomirski wrote:

> > It would not be a win - most of the syscalls don't give a damn
> > about 32bit vs. 64bit...
> 
> Any reasonable implementation would optimize it out for syscalls that don’t care.  Or it could be explicit:
> 
> DEFINE_MULTIARCH_SYSCALL(...)

1) what would that look like?
2) have you counted the syscalls that do and do not need that?
3) how many of those realistically *can* be unified with their
compat counterparts?  [hint: ioctl(2) cannot]

Andy Lutomirski Sept. 20, 2020, 12:14 a.m. UTC | #16

On Sat, Sep 19, 2020 at 4:24 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Sat, Sep 19, 2020 at 03:53:40PM -0700, Andy Lutomirski wrote:
>
> > > It would not be a win - most of the syscalls don't give a damn
> > > about 32bit vs. 64bit...
> >
> > Any reasonable implementation would optimize it out for syscalls that don’t care.  Or it could be explicit:
> >
> > DEFINE_MULTIARCH_SYSCALL(...)
>
> 1) what would that look like?

In effect, it would work like this:

/* Arch-specific, but there's a generic case for sane architectures. */
enum syscall_arch {
  SYSCALL_NATIVE,
  SYSCALL_COMPAT,
  SYSCALL_X32,
};

DEFINE_MULTIARCH_SYSCALLn(args, arch)
{
  args are the args here, and arch is the arch.
}

> 2) have you counted the syscalls that do and do not need that?

No.

> 3) how many of those realistically *can* be unified with their
> compat counterparts?  [hint: ioctl(2) cannot]

There would be no requirement to unify anything.  The idea is that
we'd get rid of all the global state flags.

For ioctl, we'd have a new file_operation:

long ioctl(struct file *, unsigned int, unsigned long, enum syscall_arch);

I'm not saying this is easy, but I think it's possible and the result
would be more obviously correct than what we have now.

Al Viro Sept. 20, 2020, 2:57 a.m. UTC | #17

On Sat, Sep 19, 2020 at 05:14:41PM -0700, Andy Lutomirski wrote:

> > 2) have you counted the syscalls that do and do not need that?
> 
> No.

Might be illuminating...

> > 3) how many of those realistically *can* be unified with their
> > compat counterparts?  [hint: ioctl(2) cannot]
> 
> There would be no requirement to unify anything.  The idea is that
> we'd get rid of all the global state flags.

_What_ global state flags?  When you have separate SYSCALL_DEFINE3(ioctl...)
and COMPAT_SYSCALL_DEFINE3(ioctl...), there's no flags at all, global or
local.  They only come into the play when you try to share the same function
for both, right on the top level.

> For ioctl, we'd have a new file_operation:
> 
> long ioctl(struct file *, unsigned int, unsigned long, enum syscall_arch);
> 
> I'm not saying this is easy, but I think it's possible and the result
> would be more obviously correct than what we have now.

No, it would not.  Seriously, from time to time a bit of RTFS before grand
proposals turns out to be useful.

Arnd Bergmann Sept. 20, 2020, 1:55 p.m. UTC | #18

On Sun, Sep 20, 2020 at 12:09 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Sep 18, 2020 at 05:16:15PM +0200, Christoph Hellwig wrote:
> > On Fri, Sep 18, 2020 at 02:58:22PM +0100, Al Viro wrote:
> > > Said that, why not provide a variant that would take an explicit
> > > "is it compat" argument and use it there?  And have the normal
> > > one pass in_compat_syscall() to that...
> >
> > That would help to not introduce a regression with this series yes.
> > But it wouldn't fix existing bugs when io_uring is used to access
> > read or write methods that use in_compat_syscall().  One example that
> > I recently ran into is drivers/scsi/sg.c.
>
> So screw such read/write methods - don't use them with io_uring.
> That, BTW, is one of the reasons I'm sceptical about burying the
> decisions deep into the callchain - we don't _want_ different
> data layouts on read/write depending upon the 32bit vs. 64bit
> caller, let alone the pointer-chasing garbage that is /dev/sg.

Would it be too late to limit what kind of file descriptors we allow
io_uring to read/write on?

If io_uring can get changed to return -EINVAL on trying to
read/write something other than S_IFREG file descriptors,
that particular problem space gets a lot simpler, but this
is of course only possible if nobody actually relies on it yet.

      Arnd

Al Viro Sept. 20, 2020, 3:02 p.m. UTC | #19

On Sun, Sep 20, 2020 at 03:55:47PM +0200, Arnd Bergmann wrote:
> On Sun, Sep 20, 2020 at 12:09 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Fri, Sep 18, 2020 at 05:16:15PM +0200, Christoph Hellwig wrote:
> > > On Fri, Sep 18, 2020 at 02:58:22PM +0100, Al Viro wrote:
> > > > Said that, why not provide a variant that would take an explicit
> > > > "is it compat" argument and use it there?  And have the normal
> > > > one pass in_compat_syscall() to that...
> > >
> > > That would help to not introduce a regression with this series yes.
> > > But it wouldn't fix existing bugs when io_uring is used to access
> > > read or write methods that use in_compat_syscall().  One example that
> > > I recently ran into is drivers/scsi/sg.c.
> >
> > So screw such read/write methods - don't use them with io_uring.
> > That, BTW, is one of the reasons I'm sceptical about burying the
> > decisions deep into the callchain - we don't _want_ different
> > data layouts on read/write depending upon the 32bit vs. 64bit
> > caller, let alone the pointer-chasing garbage that is /dev/sg.
> 
> Would it be too late to limit what kind of file descriptors we allow
> io_uring to read/write on?
> 
> If io_uring can get changed to return -EINVAL on trying to
> read/write something other than S_IFREG file descriptors,
> that particular problem space gets a lot simpler, but this
> is of course only possible if nobody actually relies on it yet.

S_IFREG is almost certainly too heavy as a restriction.  Looking through
the stuff sensitive to 32bit/64bit, we seem to have
	* /dev/sg - pointer-chasing horror
	* sysfs files for efivar - different layouts for compat and native,
shitty userland ABI design (
struct efi_variable {
        efi_char16_t  VariableName[EFI_VAR_NAME_LEN/sizeof(efi_char16_t)];
        efi_guid_t    VendorGuid;
        unsigned long DataSize;
        __u8          Data[1024];
        efi_status_t  Status;
        __u32         Attributes;
} __attribute__((packed));
) is the piece of crap in question; 'DataSize' is where the headache comes
from.  Regular files, BTW...
	* uhid - character device, milder pointer-chasing horror.  Trouble
comes from this:
/* Obsolete! Use UHID_CREATE2. */
struct uhid_create_req {
        __u8 name[128];
        __u8 phys[64];
        __u8 uniq[64];
        __u8 __user *rd_data;
        __u16 rd_size;

        __u16 bus;
        __u32 vendor;
        __u32 product;
        __u32 version;
        __u32 country;
} __attribute__((__packed__));
and suggested replacement doesn't do any pointer-chasing (rd_data is an
embedded array in the end of struct uhid_create2_req).
	* evdev, uinput - bitness-sensitive layout, due to timestamps
	* /proc/bus/input/devices - weird crap with printing bitmap, different
_text_ layouts seen by 32bit and 64bit readers.  Binary structures are PITA,
but with sufficient effort you can screw the text just as hard...  Oh, and it's
a regular file.
	* similar in sysfs analogue

And AFAICS, that's it for read/write-related method instances.

Matthew Wilcox (Oracle) Sept. 20, 2020, 3:15 p.m. UTC | #20

On Fri, Sep 18, 2020 at 02:45:25PM +0200, Christoph Hellwig wrote:
> Add a flag to force processing a syscall as a compat syscall.  This is
> required so that in_compat_syscall() works for I/O submitted by io_uring
> helper threads on behalf of compat syscalls.

Al doesn't like this much, but my suggestion is to introduce two new
opcodes -- IORING_OP_READV32 and IORING_OP_WRITEV32.  The compat code
can translate IORING_OP_READV to IORING_OP_READV32 and then the core
code can know what that user pointer is pointing to.

William Kucharski Sept. 20, 2020, 3:55 p.m. UTC | #21

I really like that as it’s self-documenting and anyone debugging it can see what is actually being used at a glance.

> On Sep 20, 2020, at 09:15, Matthew Wilcox <willy@infradead.org> wrote:
> 
> On Fri, Sep 18, 2020 at 02:45:25PM +0200, Christoph Hellwig wrote:
>> Add a flag to force processing a syscall as a compat syscall.  This is
>> required so that in_compat_syscall() works for I/O submitted by io_uring
>> helper threads on behalf of compat syscalls.
> 
> Al doesn't like this much, but my suggestion is to introduce two new
> opcodes -- IORING_OP_READV32 and IORING_OP_WRITEV32.  The compat code
> can translate IORING_OP_READV to IORING_OP_READV32 and then the core
> code can know what that user pointer is pointing to.

Arnd Bergmann Sept. 20, 2020, 4 p.m. UTC | #22

On Sun, Sep 20, 2020 at 5:15 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Sep 18, 2020 at 02:45:25PM +0200, Christoph Hellwig wrote:
> > Add a flag to force processing a syscall as a compat syscall.  This is
> > required so that in_compat_syscall() works for I/O submitted by io_uring
> > helper threads on behalf of compat syscalls.
>
> Al doesn't like this much, but my suggestion is to introduce two new
> opcodes -- IORING_OP_READV32 and IORING_OP_WRITEV32.  The compat code
> can translate IORING_OP_READV to IORING_OP_READV32 and then the core
> code can know what that user pointer is pointing to.

How is that different from the current approach of storing the ABI as
a flag in ctx->compat?

     Arnd

Andy Lutomirski Sept. 20, 2020, 4:59 p.m. UTC | #23

On Sat, Sep 19, 2020 at 7:57 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Sat, Sep 19, 2020 at 05:14:41PM -0700, Andy Lutomirski wrote:
>
> > > 2) have you counted the syscalls that do and do not need that?
> >
> > No.
>
> Might be illuminating...
>
> > > 3) how many of those realistically *can* be unified with their
> > > compat counterparts?  [hint: ioctl(2) cannot]
> >
> > There would be no requirement to unify anything.  The idea is that
> > we'd get rid of all the global state flags.
>
> _What_ global state flags?  When you have separate SYSCALL_DEFINE3(ioctl...)
> and COMPAT_SYSCALL_DEFINE3(ioctl...), there's no flags at all, global or
> local.  They only come into the play when you try to share the same function
> for both, right on the top level.

...

>
> > For ioctl, we'd have a new file_operation:
> >
> > long ioctl(struct file *, unsigned int, unsigned long, enum syscall_arch);
> >
> > I'm not saying this is easy, but I think it's possible and the result
> > would be more obviously correct than what we have now.
>
> No, it would not.  Seriously, from time to time a bit of RTFS before grand
> proposals turns out to be useful.

As one example, look at __sys_setsockopt().  It's called for the
native and compat versions, and it contains an in_compat_syscall()
check.  (This particularly check looks dubious to me, but that's
another story.)  If this were to be done with equivalent semantics
without a separate COMPAT_DEFINE_SYSCALL and without
in_compat_syscall(), there would need to be some indication as to
whether this is compat or native setsockopt.  There are other
setsockopt implementations in the net stack with more
legitimate-seeming uses of in_compat_syscall() that would need some
other mechanism if in_compat_syscall() were to go away.

setsockopt is (I hope!) out of scope for io_uring, but the situation
isn't fundamentally different from read and write.

Al Viro Sept. 20, 2020, 6:07 p.m. UTC | #24

On Sun, Sep 20, 2020 at 04:15:10PM +0100, Matthew Wilcox wrote:
> On Fri, Sep 18, 2020 at 02:45:25PM +0200, Christoph Hellwig wrote:
> > Add a flag to force processing a syscall as a compat syscall.  This is
> > required so that in_compat_syscall() works for I/O submitted by io_uring
> > helper threads on behalf of compat syscalls.
> 
> Al doesn't like this much, but my suggestion is to introduce two new
> opcodes -- IORING_OP_READV32 and IORING_OP_WRITEV32.  The compat code
> can translate IORING_OP_READV to IORING_OP_READV32 and then the core
> code can know what that user pointer is pointing to.

Let's separate two issues:
	1) compat syscalls want 32bit iovecs.  Nothing to do with the
drivers, dealt with just fine.
	2) a few drivers are really fucked in head.  They use different
*DATA* layouts for reads/writes, depending upon the calling process.
IOW, if you fork/exec a 32bit binary and your stdin is one of those,
reads from stdin in parent and child will yield different data layouts.
On the same struct file.
	That's what Christoph worries about (/dev/sg he'd mentioned is
one of those).

	IMO we should simply have that dozen or so of pathological files
marked with FMODE_SHITTY_ABI; it's not about how they'd been opened -
it describes the userland ABI provided by those.  And it's cast in stone.

	Any in_compat_syscall() in ->read()/->write() instances is an ABI
bug, plain and simple.  Some are unfixable for compatibility reasons, but
any new caller like that should be a big red flag.

	How we import iovec array is none of the drivers' concern; we do
not need to mess with in_compat_syscall() reporting the matching value,
etc. for that.  It's about the instances that want in_compat_syscall() to
decide between the 32bit and 64bit data layouts.  And I believe that
we should simply have them marked as such and rejected by io_uring.  With
any new occurences getting slapped down hard.

	Current list of those turds:
/dev/sg (pointer-chasing, generally insane)
/sys/firmware/efi/vars/*/raw_var (fucked binary structure)
/sys/firmware/efi/vars/new_var (fucked binary structure)
/sys/firmware/efi/vars/del_var (fucked binary structure)
/dev/uhid	(pointer-chasing for one obsolete command)
/dev/input/event* (timestamps)
/dev/uinput (timestamps)
/proc/bus/input/devices (fucked bitmap-to-text representation)
/sys/class/input/*/capabilities/* (fucked bitmap-to-text representation)

Al Viro Sept. 20, 2020, 6:12 p.m. UTC | #25

On Sun, Sep 20, 2020 at 09:59:36AM -0700, Andy Lutomirski wrote:

> As one example, look at __sys_setsockopt().  It's called for the
> native and compat versions, and it contains an in_compat_syscall()
> check.  (This particularly check looks dubious to me, but that's
> another story.)  If this were to be done with equivalent semantics
> without a separate COMPAT_DEFINE_SYSCALL and without
> in_compat_syscall(), there would need to be some indication as to
> whether this is compat or native setsockopt.  There are other
> setsockopt implementations in the net stack with more
> legitimate-seeming uses of in_compat_syscall() that would need some
> other mechanism if in_compat_syscall() were to go away.
> 
> setsockopt is (I hope!) out of scope for io_uring, but the situation
> isn't fundamentally different from read and write.

	Except that setsockopt() had that crap very widespread; for read()
and write() those are very rare exceptions.

	Andy, please RTFS.  Or dig through archives.  The situation
with setsockopt() is *NOT* a good thing - it's (probably) the least
of the evils.  The last thing we need is making that the norm.

Al Viro Sept. 20, 2020, 6:41 p.m. UTC | #26

On Sun, Sep 20, 2020 at 07:07:42PM +0100, Al Viro wrote:

> /proc/bus/input/devices (fucked bitmap-to-text representation)

To illustrate the, er, beauty of that stuff:

; cat32 /proc/bus/input/devices >/tmp/a
; cat /proc/bus/input/devices >/tmp/b
; diff -u /tmp/a /tmp/b|grep '^[-+]'
--- /tmp/a      2020-09-20 14:28:43.442560691 -0400
+++ /tmp/b      2020-09-20 14:28:49.018543230 -0400
-B: KEY=1100f 2902000 8380307c f910f001 feffffdf ffefffff ffffffff fffffffe
+B: KEY=1100f02902000 8380307cf910f001 feffffdfffefffff fffffffffffffffe
-B: KEY=70000 0 0 0 0 0 0 0 0
+B: KEY=70000 0 0 0 0
-B: KEY=420 0 70000 0 0 0 0 0 0 0 0
+B: KEY=420 70000 0 0 0 0
-B: KEY=100000 0 0 0
+B: KEY=10000000000000 0
-B: KEY=4000 0 0 0 0
+B: KEY=4000 0 0
-B: KEY=8000 0 0 0 0 0 1100b 800 2 200000 0 0 0 0
+B: KEY=800000000000 0 0 1100b00000800 200200000 0 0
-B: KEY=3e000b 0 0 0 0 0 0 0
+B: KEY=3e000b00000000 0 0 0
-B: KEY=ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff fffffffe
+B: KEY=ffffffffffffffff ffffffffffffffff ffffffffffffffff fffffffffffffffe
; 
(cat32 being a 32bit binary of cat)
All the differences are due to homegrown bitmap-to-text conversion.

Note that feeding that into a pipe leaves the recepient with a lovely problem -
you can't go by the width of words (they are not zero-padded) and you can't
go by the number of words either - it varies from device to device.

And there's nothing we can do with that - it's a userland ABI, can't be
changed without breaking stuff.  I would prefer to avoid additional examples
like that, TYVM...

Matthew Wilcox (Oracle) Sept. 20, 2020, 7:01 p.m. UTC | #27

On Sun, Sep 20, 2020 at 07:07:42PM +0100, Al Viro wrote:
> 	2) a few drivers are really fucked in head.  They use different
> *DATA* layouts for reads/writes, depending upon the calling process.
> IOW, if you fork/exec a 32bit binary and your stdin is one of those,
> reads from stdin in parent and child will yield different data layouts.
> On the same struct file.
> 	That's what Christoph worries about (/dev/sg he'd mentioned is
> one of those).
> 
> 	IMO we should simply have that dozen or so of pathological files
> marked with FMODE_SHITTY_ABI; it's not about how they'd been opened -
> it describes the userland ABI provided by those.  And it's cast in stone.
> 
> 	Any in_compat_syscall() in ->read()/->write() instances is an ABI
> bug, plain and simple.  Some are unfixable for compatibility reasons, but
> any new caller like that should be a big red flag.

So an IOCB_COMPAT flag would let us know whether the caller is expecting
a 32-bit or 64-bit layout?  And io_uring could set it based on the
ctx->compat flag.

> 	Current list of those turds:
> /dev/sg (pointer-chasing, generally insane)
> /sys/firmware/efi/vars/*/raw_var (fucked binary structure)
> /sys/firmware/efi/vars/new_var (fucked binary structure)
> /sys/firmware/efi/vars/del_var (fucked binary structure)
> /dev/uhid	(pointer-chasing for one obsolete command)
> /dev/input/event* (timestamps)
> /dev/uinput (timestamps)
> /proc/bus/input/devices (fucked bitmap-to-text representation)
> /sys/class/input/*/capabilities/* (fucked bitmap-to-text representation)

Al Viro Sept. 20, 2020, 7:10 p.m. UTC | #28

On Sun, Sep 20, 2020 at 08:01:59PM +0100, Matthew Wilcox wrote:
> On Sun, Sep 20, 2020 at 07:07:42PM +0100, Al Viro wrote:
> > 	2) a few drivers are really fucked in head.  They use different
> > *DATA* layouts for reads/writes, depending upon the calling process.
> > IOW, if you fork/exec a 32bit binary and your stdin is one of those,
> > reads from stdin in parent and child will yield different data layouts.
> > On the same struct file.
> > 	That's what Christoph worries about (/dev/sg he'd mentioned is
> > one of those).
> > 
> > 	IMO we should simply have that dozen or so of pathological files
> > marked with FMODE_SHITTY_ABI; it's not about how they'd been opened -
> > it describes the userland ABI provided by those.  And it's cast in stone.
> > 
> > 	Any in_compat_syscall() in ->read()/->write() instances is an ABI
> > bug, plain and simple.  Some are unfixable for compatibility reasons, but
> > any new caller like that should be a big red flag.
> 
> So an IOCB_COMPAT flag would let us know whether the caller is expecting
> a 32-bit or 64-bit layout?  And io_uring could set it based on the
> ctx->compat flag.

So you want to convert all that crap to a form that would see iocb
(IOW, ->read_iter()/->write_iter())?  And, since quite a few are sysfs
attributes, propagate that through sysfs, changing the method signatures
to match that and either modifying fuckloads of instances or introducing
new special methods for the ones that are sensitive to that crap?

IMO it's much saner to mark those and refuse to touch them from io_uring...

Andy Lutomirski Sept. 20, 2020, 7:14 p.m. UTC | #29

On Sun, Sep 20, 2020 at 11:07 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Sun, Sep 20, 2020 at 04:15:10PM +0100, Matthew Wilcox wrote:
> > On Fri, Sep 18, 2020 at 02:45:25PM +0200, Christoph Hellwig wrote:
> > > Add a flag to force processing a syscall as a compat syscall.  This is
> > > required so that in_compat_syscall() works for I/O submitted by io_uring
> > > helper threads on behalf of compat syscalls.
> >
> > Al doesn't like this much, but my suggestion is to introduce two new
> > opcodes -- IORING_OP_READV32 and IORING_OP_WRITEV32.  The compat code
> > can translate IORING_OP_READV to IORING_OP_READV32 and then the core
> > code can know what that user pointer is pointing to.
>
> Let's separate two issues:
>         1) compat syscalls want 32bit iovecs.  Nothing to do with the
> drivers, dealt with just fine.
>         2) a few drivers are really fucked in head.  They use different
> *DATA* layouts for reads/writes, depending upon the calling process.
> IOW, if you fork/exec a 32bit binary and your stdin is one of those,
> reads from stdin in parent and child will yield different data layouts.
> On the same struct file.
>         That's what Christoph worries about (/dev/sg he'd mentioned is
> one of those).
>
>         IMO we should simply have that dozen or so of pathological files
> marked with FMODE_SHITTY_ABI; it's not about how they'd been opened -
> it describes the userland ABI provided by those.  And it's cast in stone.
>

I wonder if this is really quite cast in stone.  We could also have
FMODE_SHITTY_COMPAT and set that when a file like this is *opened* in
compat mode.  Then that particular struct file would be read and
written using the compat data format.  The change would be
user-visible, but the user that would see it would be very strange
indeed.

I don't have a strong opinion as to whether that is better or worse
than denying io_uring access to these things, but at least it moves
the special case out of io_uring.

--Andy

Matthew Wilcox (Oracle) Sept. 20, 2020, 7:22 p.m. UTC | #30

On Sun, Sep 20, 2020 at 08:10:31PM +0100, Al Viro wrote:
> IMO it's much saner to mark those and refuse to touch them from io_uring...

Simpler solution is to remove io_uring from the 32-bit syscall list.
If you're a 32-bit process, you don't get to use io_uring.  Would
any real users actually care about that?

Andy Lutomirski Sept. 20, 2020, 7:28 p.m. UTC | #31

On Sun, Sep 20, 2020 at 12:23 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sun, Sep 20, 2020 at 08:10:31PM +0100, Al Viro wrote:
> > IMO it's much saner to mark those and refuse to touch them from io_uring...
>
> Simpler solution is to remove io_uring from the 32-bit syscall list.
> If you're a 32-bit process, you don't get to use io_uring.  Would
> any real users actually care about that?

We could go one step farther and declare that we're done adding *any*
new compat syscalls :)

Arnd Bergmann Sept. 20, 2020, 8:49 p.m. UTC | #32

On Sun, Sep 20, 2020 at 9:28 PM Andy Lutomirski <luto@kernel.org> wrote:
> On Sun, Sep 20, 2020 at 12:23 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Sun, Sep 20, 2020 at 08:10:31PM +0100, Al Viro wrote:
> > > IMO it's much saner to mark those and refuse to touch them from io_uring...
> >
> > Simpler solution is to remove io_uring from the 32-bit syscall list.
> > If you're a 32-bit process, you don't get to use io_uring.  Would
> > any real users actually care about that?
>
> We could go one step farther and declare that we're done adding *any*
> new compat syscalls :)

Would you also stop adding system calls to native 32-bit systems then?

On memory constrained systems (less than 2GB a.t.m.), there is still a
strong demand for running 32-bit user space, but all of the recent Arm
cores (after Cortex-A55) dropped the ability to run 32-bit kernels, so
that compat mode may eventually become the primary way to run
Linux on cheap embedded systems.

I don't think there is any chance we can realistically take away io_uring
from the 32-bit ABI any more now.

      Arnd

David Laight Sept. 20, 2020, 9:13 p.m. UTC | #33

From: Arnd Bergmann
> Sent: 20 September 2020 21:49
> 
> On Sun, Sep 20, 2020 at 9:28 PM Andy Lutomirski <luto@kernel.org> wrote:
> > On Sun, Sep 20, 2020 at 12:23 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Sun, Sep 20, 2020 at 08:10:31PM +0100, Al Viro wrote:
> > > > IMO it's much saner to mark those and refuse to touch them from io_uring...
> > >
> > > Simpler solution is to remove io_uring from the 32-bit syscall list.
> > > If you're a 32-bit process, you don't get to use io_uring.  Would
> > > any real users actually care about that?
> >
> > We could go one step farther and declare that we're done adding *any*
> > new compat syscalls :)
> 
> Would you also stop adding system calls to native 32-bit systems then?
> 
> On memory constrained systems (less than 2GB a.t.m.), there is still a
> strong demand for running 32-bit user space, but all of the recent Arm
> cores (after Cortex-A55) dropped the ability to run 32-bit kernels, so
> that compat mode may eventually become the primary way to run
> Linux on cheap embedded systems.
> 
> I don't think there is any chance we can realistically take away io_uring
> from the 32-bit ABI any more now.

Can't it just run requests from 32bit apps in a kernel thread that has
the 'in_compat_syscall' flag set?
Not that i recall seeing the code where it saves the 'compat' nature
of any requests.

It is already completely f*cked if you try to pass the command ring
to a child process - it uses the wrong 'mm'.
I suspect there are some really horrid security holes in that area.

	David.

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Al Viro Sept. 20, 2020, 9:42 p.m. UTC | #34

On Sun, Sep 20, 2020 at 08:22:59PM +0100, Matthew Wilcox wrote:
> On Sun, Sep 20, 2020 at 08:10:31PM +0100, Al Viro wrote:
> > IMO it's much saner to mark those and refuse to touch them from io_uring...
> 
> Simpler solution is to remove io_uring from the 32-bit syscall list.
> If you're a 32-bit process, you don't get to use io_uring.  Would
> any real users actually care about that?

What for?  I mean, is there any reason to try and keep those bugs as
first-class citizens?  IDGI...  Yes, we have several special files
(out of thousands) that have read()/write() user-visible semantics
broken wrt 32bit/64bit.  And we have to keep them working that way
for existing syscalls.  Why would we want to pretend that their
behaviour is normal and isn't an ABI bug, not to be repeated for
anything new?

Christoph Hellwig Sept. 21, 2020, 4:28 a.m. UTC | #35

On Sun, Sep 20, 2020 at 12:14:49PM -0700, Andy Lutomirski wrote:
> I wonder if this is really quite cast in stone.  We could also have
> FMODE_SHITTY_COMPAT and set that when a file like this is *opened* in
> compat mode.  Then that particular struct file would be read and
> written using the compat data format.  The change would be
> user-visible, but the user that would see it would be very strange
> indeed.
> 
> I don't have a strong opinion as to whether that is better or worse
> than denying io_uring access to these things, but at least it moves
> the special case out of io_uring.

open could have happened through an io_uring thread a well, so I don't
see how this would do anything but move the problem to a different
place.

> 
> --Andy
---end quoted text---

Pavel Begunkov Sept. 21, 2020, 4:10 p.m. UTC | #36

On 20/09/2020 01:22, Andy Lutomirski wrote:
> 
>> On Sep 19, 2020, at 2:16 PM, Arnd Bergmann <arnd@arndb.de> wrote:
>>
>> On Sat, Sep 19, 2020 at 6:21 PM Andy Lutomirski <luto@kernel.org> wrote:
>>>> On Fri, Sep 18, 2020 at 8:16 AM Christoph Hellwig <hch@lst.de> wrote:
>>>> On Fri, Sep 18, 2020 at 02:58:22PM +0100, Al Viro wrote:
>>>>> Said that, why not provide a variant that would take an explicit
>>>>> "is it compat" argument and use it there?  And have the normal
>>>>> one pass in_compat_syscall() to that...
>>>>
>>>> That would help to not introduce a regression with this series yes.
>>>> But it wouldn't fix existing bugs when io_uring is used to access
>>>> read or write methods that use in_compat_syscall().  One example that
>>>> I recently ran into is drivers/scsi/sg.c.
>>
>> Ah, so reading /dev/input/event* would suffer from the same issue,
>> and that one would in fact be broken by your patch in the hypothetical
>> case that someone tried to use io_uring to read /dev/input/event on x32...
>>
>> For reference, I checked the socket timestamp handling that has a
>> number of corner cases with time32/time64 formats in compat mode,
>> but none of those appear to be affected by the problem.
>>
>>> Aside from the potentially nasty use of per-task variables, one thing
>>> I don't like about PF_FORCE_COMPAT is that it's one-way.  If we're
>>> going to have a generic mechanism for this, shouldn't we allow a full
>>> override of the syscall arch instead of just allowing forcing compat
>>> so that a compat syscall can do a non-compat operation?
>>
>> The only reason it's needed here is that the caller is in a kernel
>> thread rather than a system call. Are there any possible scenarios
>> where one would actually need the opposite?
>>
> 
> I can certainly imagine needing to force x32 mode from a kernel thread.
> 
> As for the other direction: what exactly are the desired bitness/arch semantics of io_uring?  Is the operation bitness chosen by the io_uring creation or by the io_uring_enter() bitness?

It's rather the second one. Even though AFAIR it wasn't discussed
specifically, that how it works now (_partially_).

Pavel Begunkov Sept. 21, 2020, 4:13 p.m. UTC | #37

On 21/09/2020 19:10, Pavel Begunkov wrote:
> On 20/09/2020 01:22, Andy Lutomirski wrote:
>>
>>> On Sep 19, 2020, at 2:16 PM, Arnd Bergmann <arnd@arndb.de> wrote:
>>>
>>> On Sat, Sep 19, 2020 at 6:21 PM Andy Lutomirski <luto@kernel.org> wrote:
>>>>> On Fri, Sep 18, 2020 at 8:16 AM Christoph Hellwig <hch@lst.de> wrote:
>>>>> On Fri, Sep 18, 2020 at 02:58:22PM +0100, Al Viro wrote:
>>>>>> Said that, why not provide a variant that would take an explicit
>>>>>> "is it compat" argument and use it there?  And have the normal
>>>>>> one pass in_compat_syscall() to that...
>>>>>
>>>>> That would help to not introduce a regression with this series yes.
>>>>> But it wouldn't fix existing bugs when io_uring is used to access
>>>>> read or write methods that use in_compat_syscall().  One example that
>>>>> I recently ran into is drivers/scsi/sg.c.
>>>
>>> Ah, so reading /dev/input/event* would suffer from the same issue,
>>> and that one would in fact be broken by your patch in the hypothetical
>>> case that someone tried to use io_uring to read /dev/input/event on x32...
>>>
>>> For reference, I checked the socket timestamp handling that has a
>>> number of corner cases with time32/time64 formats in compat mode,
>>> but none of those appear to be affected by the problem.
>>>
>>>> Aside from the potentially nasty use of per-task variables, one thing
>>>> I don't like about PF_FORCE_COMPAT is that it's one-way.  If we're
>>>> going to have a generic mechanism for this, shouldn't we allow a full
>>>> override of the syscall arch instead of just allowing forcing compat
>>>> so that a compat syscall can do a non-compat operation?
>>>
>>> The only reason it's needed here is that the caller is in a kernel
>>> thread rather than a system call. Are there any possible scenarios
>>> where one would actually need the opposite?
>>>
>>
>> I can certainly imagine needing to force x32 mode from a kernel thread.
>>
>> As for the other direction: what exactly are the desired bitness/arch semantics of io_uring?  Is the operation bitness chosen by the io_uring creation or by the io_uring_enter() bitness?
> 
> It's rather the second one. Even though AFAIR it wasn't discussed
> specifically, that how it works now (_partially_).

Double checked -- I'm wrong, that's the former one. Most of it is based
on a flag that was set an creation.

Pavel Begunkov Sept. 21, 2020, 4:20 p.m. UTC | #38

On 20/09/2020 18:55, William Kucharski wrote:
> I really like that as it’s self-documenting and anyone debugging it can see what is actually being used at a glance.

Also creates special cases for things that few people care about,
and makes it a pain for cross-platform (cross-bitness) development.

> 
>> On Sep 20, 2020, at 09:15, Matthew Wilcox <willy@infradead.org> wrote:
>>
>> On Fri, Sep 18, 2020 at 02:45:25PM +0200, Christoph Hellwig wrote:
>>> Add a flag to force processing a syscall as a compat syscall.  This is
>>> required so that in_compat_syscall() works for I/O submitted by io_uring
>>> helper threads on behalf of compat syscalls.
>>
>> Al doesn't like this much, but my suggestion is to introduce two new
>> opcodes -- IORING_OP_READV32 and IORING_OP_WRITEV32.  The compat code
>> can translate IORING_OP_READV to IORING_OP_READV32 and then the core
>> code can know what that user pointer is pointing to.

Pavel Begunkov Sept. 21, 2020, 4:26 p.m. UTC | #39

On 20/09/2020 22:22, Matthew Wilcox wrote:
> On Sun, Sep 20, 2020 at 08:10:31PM +0100, Al Viro wrote:
>> IMO it's much saner to mark those and refuse to touch them from io_uring...
> 
> Simpler solution is to remove io_uring from the 32-bit syscall list.
> If you're a 32-bit process, you don't get to use io_uring.  Would
> any real users actually care about that?

There were .net and\or wine (which AFAIK often works in compat) guys
experimenting with io_uring, they might want it.

Pavel Begunkov Sept. 21, 2020, 4:31 p.m. UTC | #40

On 21/09/2020 00:13, David Laight wrote:
> From: Arnd Bergmann
>> Sent: 20 September 2020 21:49
>>
>> On Sun, Sep 20, 2020 at 9:28 PM Andy Lutomirski <luto@kernel.org> wrote:
>>> On Sun, Sep 20, 2020 at 12:23 PM Matthew Wilcox <willy@infradead.org> wrote:
>>>>
>>>> On Sun, Sep 20, 2020 at 08:10:31PM +0100, Al Viro wrote:
>>>>> IMO it's much saner to mark those and refuse to touch them from io_uring...
>>>>
>>>> Simpler solution is to remove io_uring from the 32-bit syscall list.
>>>> If you're a 32-bit process, you don't get to use io_uring.  Would
>>>> any real users actually care about that?
>>>
>>> We could go one step farther and declare that we're done adding *any*
>>> new compat syscalls :)
>>
>> Would you also stop adding system calls to native 32-bit systems then?
>>
>> On memory constrained systems (less than 2GB a.t.m.), there is still a
>> strong demand for running 32-bit user space, but all of the recent Arm
>> cores (after Cortex-A55) dropped the ability to run 32-bit kernels, so
>> that compat mode may eventually become the primary way to run
>> Linux on cheap embedded systems.
>>
>> I don't think there is any chance we can realistically take away io_uring
>> from the 32-bit ABI any more now.
> 
> Can't it just run requests from 32bit apps in a kernel thread that has
> the 'in_compat_syscall' flag set?
> Not that i recall seeing the code where it saves the 'compat' nature
> of any requests.
> 
> It is already completely f*cked if you try to pass the command ring
> to a child process - it uses the wrong 'mm'.

And how so? io_uring uses mm of a submitter. The exception is SQPOLL
mode, but it requires CAP_SYS_ADMIN or CAP_SYS_NICE anyway.

> I suspect there are some really horrid security holes in that area.
> 
> 	David.
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>

Andy Lutomirski Sept. 21, 2020, 11:51 p.m. UTC | #41

On Mon, Sep 21, 2020 at 9:15 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 21/09/2020 19:10, Pavel Begunkov wrote:
> > On 20/09/2020 01:22, Andy Lutomirski wrote:
> >>
> >>> On Sep 19, 2020, at 2:16 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> >>>
> >>> On Sat, Sep 19, 2020 at 6:21 PM Andy Lutomirski <luto@kernel.org> wrote:
> >>>>> On Fri, Sep 18, 2020 at 8:16 AM Christoph Hellwig <hch@lst.de> wrote:
> >>>>> On Fri, Sep 18, 2020 at 02:58:22PM +0100, Al Viro wrote:
> >>>>>> Said that, why not provide a variant that would take an explicit
> >>>>>> "is it compat" argument and use it there?  And have the normal
> >>>>>> one pass in_compat_syscall() to that...
> >>>>>
> >>>>> That would help to not introduce a regression with this series yes.
> >>>>> But it wouldn't fix existing bugs when io_uring is used to access
> >>>>> read or write methods that use in_compat_syscall().  One example that
> >>>>> I recently ran into is drivers/scsi/sg.c.
> >>>
> >>> Ah, so reading /dev/input/event* would suffer from the same issue,
> >>> and that one would in fact be broken by your patch in the hypothetical
> >>> case that someone tried to use io_uring to read /dev/input/event on x32...
> >>>
> >>> For reference, I checked the socket timestamp handling that has a
> >>> number of corner cases with time32/time64 formats in compat mode,
> >>> but none of those appear to be affected by the problem.
> >>>
> >>>> Aside from the potentially nasty use of per-task variables, one thing
> >>>> I don't like about PF_FORCE_COMPAT is that it's one-way.  If we're
> >>>> going to have a generic mechanism for this, shouldn't we allow a full
> >>>> override of the syscall arch instead of just allowing forcing compat
> >>>> so that a compat syscall can do a non-compat operation?
> >>>
> >>> The only reason it's needed here is that the caller is in a kernel
> >>> thread rather than a system call. Are there any possible scenarios
> >>> where one would actually need the opposite?
> >>>
> >>
> >> I can certainly imagine needing to force x32 mode from a kernel thread.
> >>
> >> As for the other direction: what exactly are the desired bitness/arch semantics of io_uring?  Is the operation bitness chosen by the io_uring creation or by the io_uring_enter() bitness?
> >
> > It's rather the second one. Even though AFAIR it wasn't discussed
> > specifically, that how it works now (_partially_).
>
> Double checked -- I'm wrong, that's the former one. Most of it is based
> on a flag that was set an creation.
>

Could we get away with making io_uring_enter() return -EINVAL (or
maybe -ENOTTY?) if you try to do it with bitness that doesn't match
the io_uring?  And disable SQPOLL in compat mode?

--Andy

Pavel Begunkov Sept. 22, 2020, 12:22 a.m. UTC | #42

On 22/09/2020 02:51, Andy Lutomirski wrote:
> On Mon, Sep 21, 2020 at 9:15 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> On 21/09/2020 19:10, Pavel Begunkov wrote:
>>> On 20/09/2020 01:22, Andy Lutomirski wrote:
>>>>
>>>>> On Sep 19, 2020, at 2:16 PM, Arnd Bergmann <arnd@arndb.de> wrote:
>>>>>
>>>>> On Sat, Sep 19, 2020 at 6:21 PM Andy Lutomirski <luto@kernel.org> wrote:
>>>>>>> On Fri, Sep 18, 2020 at 8:16 AM Christoph Hellwig <hch@lst.de> wrote:
>>>>>>> On Fri, Sep 18, 2020 at 02:58:22PM +0100, Al Viro wrote:
>>>>>>>> Said that, why not provide a variant that would take an explicit
>>>>>>>> "is it compat" argument and use it there?  And have the normal
>>>>>>>> one pass in_compat_syscall() to that...
>>>>>>>
>>>>>>> That would help to not introduce a regression with this series yes.
>>>>>>> But it wouldn't fix existing bugs when io_uring is used to access
>>>>>>> read or write methods that use in_compat_syscall().  One example that
>>>>>>> I recently ran into is drivers/scsi/sg.c.
>>>>>
>>>>> Ah, so reading /dev/input/event* would suffer from the same issue,
>>>>> and that one would in fact be broken by your patch in the hypothetical
>>>>> case that someone tried to use io_uring to read /dev/input/event on x32...
>>>>>
>>>>> For reference, I checked the socket timestamp handling that has a
>>>>> number of corner cases with time32/time64 formats in compat mode,
>>>>> but none of those appear to be affected by the problem.
>>>>>
>>>>>> Aside from the potentially nasty use of per-task variables, one thing
>>>>>> I don't like about PF_FORCE_COMPAT is that it's one-way.  If we're
>>>>>> going to have a generic mechanism for this, shouldn't we allow a full
>>>>>> override of the syscall arch instead of just allowing forcing compat
>>>>>> so that a compat syscall can do a non-compat operation?
>>>>>
>>>>> The only reason it's needed here is that the caller is in a kernel
>>>>> thread rather than a system call. Are there any possible scenarios
>>>>> where one would actually need the opposite?
>>>>>
>>>>
>>>> I can certainly imagine needing to force x32 mode from a kernel thread.
>>>>
>>>> As for the other direction: what exactly are the desired bitness/arch semantics of io_uring?  Is the operation bitness chosen by the io_uring creation or by the io_uring_enter() bitness?
>>>
>>> It's rather the second one. Even though AFAIR it wasn't discussed
>>> specifically, that how it works now (_partially_).
>>
>> Double checked -- I'm wrong, that's the former one. Most of it is based
>> on a flag that was set an creation.
>>
> 
> Could we get away with making io_uring_enter() return -EINVAL (or
> maybe -ENOTTY?) if you try to do it with bitness that doesn't match
> the io_uring?  And disable SQPOLL in compat mode?

Something like below. If PF_FORCE_COMPAT or any other solution
doesn't lend by the time, I'll take a look whether other io_uring's
syscalls need similar checks, etc.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 0458f02d4ca8..aab20785fa9a 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -8671,6 +8671,10 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 	if (ctx->flags & IORING_SETUP_R_DISABLED)
 		goto out;
 
+	ret = -EINVAl;
+	if (ctx->compat != in_compat_syscall())
+		goto out;
+
 	/*
 	 * For SQ polling, the thread will do all submissions and completions.
 	 * Just return the requested submit count, and wake the thread if
@@ -9006,6 +9010,10 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p,
 	if (ret)
 		goto err;
 
+	ret = -EINVAL;
+	if (ctx->compat)
+		goto err;
+
 	/* Only gets the ring fd, doesn't install it in the file table */
 	fd = io_uring_get_fd(ctx, &file);
 	if (fd < 0) {

Andy Lutomirski Sept. 22, 2020, 12:58 a.m. UTC | #43

On Mon, Sep 21, 2020 at 5:24 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
>
>
> On 22/09/2020 02:51, Andy Lutomirski wrote:
> > On Mon, Sep 21, 2020 at 9:15 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
> >>
> >> On 21/09/2020 19:10, Pavel Begunkov wrote:
> >>> On 20/09/2020 01:22, Andy Lutomirski wrote:
> >>>>
> >>>>> On Sep 19, 2020, at 2:16 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> >>>>>
> >>>>> On Sat, Sep 19, 2020 at 6:21 PM Andy Lutomirski <luto@kernel.org> wrote:
> >>>>>>> On Fri, Sep 18, 2020 at 8:16 AM Christoph Hellwig <hch@lst.de> wrote:
> >>>>>>> On Fri, Sep 18, 2020 at 02:58:22PM +0100, Al Viro wrote:
> >>>>>>>> Said that, why not provide a variant that would take an explicit
> >>>>>>>> "is it compat" argument and use it there?  And have the normal
> >>>>>>>> one pass in_compat_syscall() to that...
> >>>>>>>
> >>>>>>> That would help to not introduce a regression with this series yes.
> >>>>>>> But it wouldn't fix existing bugs when io_uring is used to access
> >>>>>>> read or write methods that use in_compat_syscall().  One example that
> >>>>>>> I recently ran into is drivers/scsi/sg.c.
> >>>>>
> >>>>> Ah, so reading /dev/input/event* would suffer from the same issue,
> >>>>> and that one would in fact be broken by your patch in the hypothetical
> >>>>> case that someone tried to use io_uring to read /dev/input/event on x32...
> >>>>>
> >>>>> For reference, I checked the socket timestamp handling that has a
> >>>>> number of corner cases with time32/time64 formats in compat mode,
> >>>>> but none of those appear to be affected by the problem.
> >>>>>
> >>>>>> Aside from the potentially nasty use of per-task variables, one thing
> >>>>>> I don't like about PF_FORCE_COMPAT is that it's one-way.  If we're
> >>>>>> going to have a generic mechanism for this, shouldn't we allow a full
> >>>>>> override of the syscall arch instead of just allowing forcing compat
> >>>>>> so that a compat syscall can do a non-compat operation?
> >>>>>
> >>>>> The only reason it's needed here is that the caller is in a kernel
> >>>>> thread rather than a system call. Are there any possible scenarios
> >>>>> where one would actually need the opposite?
> >>>>>
> >>>>
> >>>> I can certainly imagine needing to force x32 mode from a kernel thread.
> >>>>
> >>>> As for the other direction: what exactly are the desired bitness/arch semantics of io_uring?  Is the operation bitness chosen by the io_uring creation or by the io_uring_enter() bitness?
> >>>
> >>> It's rather the second one. Even though AFAIR it wasn't discussed
> >>> specifically, that how it works now (_partially_).
> >>
> >> Double checked -- I'm wrong, that's the former one. Most of it is based
> >> on a flag that was set an creation.
> >>
> >
> > Could we get away with making io_uring_enter() return -EINVAL (or
> > maybe -ENOTTY?) if you try to do it with bitness that doesn't match
> > the io_uring?  And disable SQPOLL in compat mode?
>
> Something like below. If PF_FORCE_COMPAT or any other solution
> doesn't lend by the time, I'll take a look whether other io_uring's
> syscalls need similar checks, etc.
>
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 0458f02d4ca8..aab20785fa9a 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -8671,6 +8671,10 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
>         if (ctx->flags & IORING_SETUP_R_DISABLED)
>                 goto out;
>
> +       ret = -EINVAl;
> +       if (ctx->compat != in_compat_syscall())
> +               goto out;
> +

This seems entirely reasonable to me.  Sharing an io_uring ring
between programs with different ABIs seems a bit nutty.

>         /*
>          * For SQ polling, the thread will do all submissions and completions.
>          * Just return the requested submit count, and wake the thread if
> @@ -9006,6 +9010,10 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p,
>         if (ret)
>                 goto err;
>
> +       ret = -EINVAL;
> +       if (ctx->compat)
> +               goto err;
> +

I may be looking at a different kernel than you, but aren't you
preventing creating an io_uring regardless of whether SQPOLL is
requested?

>         /* Only gets the ring fd, doesn't install it in the file table */
>         fd = io_uring_get_fd(ctx, &file);
>         if (fd < 0) {
> --
> Pavel Begunkov

Pavel Begunkov Sept. 22, 2020, 6:30 a.m. UTC | #44

On 22/09/2020 03:58, Andy Lutomirski wrote:
> On Mon, Sep 21, 2020 at 5:24 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>>>>>> Ah, so reading /dev/input/event* would suffer from the same issue,
>>>>>>> and that one would in fact be broken by your patch in the hypothetical
>>>>>>> case that someone tried to use io_uring to read /dev/input/event on x32...
>>>>>>>
>>>>>>> For reference, I checked the socket timestamp handling that has a
>>>>>>> number of corner cases with time32/time64 formats in compat mode,
>>>>>>> but none of those appear to be affected by the problem.
>>>>>>>
>>>>>>>> Aside from the potentially nasty use of per-task variables, one thing
>>>>>>>> I don't like about PF_FORCE_COMPAT is that it's one-way.  If we're
>>>>>>>> going to have a generic mechanism for this, shouldn't we allow a full
>>>>>>>> override of the syscall arch instead of just allowing forcing compat
>>>>>>>> so that a compat syscall can do a non-compat operation?
>>>>>>>
>>>>>>> The only reason it's needed here is that the caller is in a kernel
>>>>>>> thread rather than a system call. Are there any possible scenarios
>>>>>>> where one would actually need the opposite?
>>>>>>>
>>>>>>
>>>>>> I can certainly imagine needing to force x32 mode from a kernel thread.
>>>>>>
>>>>>> As for the other direction: what exactly are the desired bitness/arch semantics of io_uring?  Is the operation bitness chosen by the io_uring creation or by the io_uring_enter() bitness?
>>>>>
>>>>> It's rather the second one. Even though AFAIR it wasn't discussed
>>>>> specifically, that how it works now (_partially_).
>>>>
>>>> Double checked -- I'm wrong, that's the former one. Most of it is based
>>>> on a flag that was set an creation.
>>>>
>>>
>>> Could we get away with making io_uring_enter() return -EINVAL (or
>>> maybe -ENOTTY?) if you try to do it with bitness that doesn't match
>>> the io_uring?  And disable SQPOLL in compat mode?
>>
>> Something like below. If PF_FORCE_COMPAT or any other solution
>> doesn't lend by the time, I'll take a look whether other io_uring's
>> syscalls need similar checks, etc.
>>
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 0458f02d4ca8..aab20785fa9a 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -8671,6 +8671,10 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
>>         if (ctx->flags & IORING_SETUP_R_DISABLED)
>>                 goto out;
>>
>> +       ret = -EINVAl;
>> +       if (ctx->compat != in_compat_syscall())
>> +               goto out;
>> +
> 
> This seems entirely reasonable to me.  Sharing an io_uring ring
> between programs with different ABIs seems a bit nutty.
> 
>>         /*
>>          * For SQ polling, the thread will do all submissions and completions.
>>          * Just return the requested submit count, and wake the thread if
>> @@ -9006,6 +9010,10 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p,
>>         if (ret)
>>                 goto err;
>>
>> +       ret = -EINVAL;
>> +       if (ctx->compat)
>> +               goto err;
>> +
> 
> I may be looking at a different kernel than you, but aren't you
> preventing creating an io_uring regardless of whether SQPOLL is
> requested?

I diffed a not-saved file on a sleepy head, thanks for noticing.
As you said, there should be an SQPOLL check.

...
if (ctx->compat && (p->flags & IORING_SETUP_SQPOLL))
	goto err;

Arnd Bergmann Sept. 22, 2020, 7:23 a.m. UTC | #45

On Tue, Sep 22, 2020 at 8:32 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
> On 22/09/2020 03:58, Andy Lutomirski wrote:
> > On Mon, Sep 21, 2020 at 5:24 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
> > I may be looking at a different kernel than you, but aren't you
> > preventing creating an io_uring regardless of whether SQPOLL is
> > requested?
>
> I diffed a not-saved file on a sleepy head, thanks for noticing.
> As you said, there should be an SQPOLL check.
>
> ...
> if (ctx->compat && (p->flags & IORING_SETUP_SQPOLL))
>         goto err;

Wouldn't that mean that now 32-bit containers behave differently
between compat and native execution?

I think if you want to prevent 32-bit applications from using SQPOLL,
it needs to be done the same way on both to be consistent:

   if ((!IS_ENABLED(CONFIG_64BIT) || ctx->compat) &&
        (p->flags & IORING_SETUP_SQPOLL))
            goto err;

I don't really see how taking away SQPOLL from 32-bit tasks is
any better than just preventing access to the known-broken files
as Al suggested, or adding the hack to make it work as in
Christoph's original patch.

Can we expect all existing and future user space to have a sane
fallback when IORING_SETUP_SQPOLL fails?

      Arnd

Pavel Begunkov Sept. 22, 2020, 7:57 a.m. UTC | #46

On 22/09/2020 10:23, Arnd Bergmann wrote:
> On Tue, Sep 22, 2020 at 8:32 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>> On 22/09/2020 03:58, Andy Lutomirski wrote:
>>> On Mon, Sep 21, 2020 at 5:24 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>> I may be looking at a different kernel than you, but aren't you
>>> preventing creating an io_uring regardless of whether SQPOLL is
>>> requested?
>>
>> I diffed a not-saved file on a sleepy head, thanks for noticing.
>> As you said, there should be an SQPOLL check.
>>
>> ...
>> if (ctx->compat && (p->flags & IORING_SETUP_SQPOLL))
>>         goto err;
> 
> Wouldn't that mean that now 32-bit containers behave differently
> between compat and native execution?
> 
> I think if you want to prevent 32-bit applications from using SQPOLL,
> it needs to be done the same way on both to be consistent:

The intention was to disable only compat not native 32-bit.

> 
>    if ((!IS_ENABLED(CONFIG_64BIT) || ctx->compat) &&
>         (p->flags & IORING_SETUP_SQPOLL))
>             goto err;
> 
> I don't really see how taking away SQPOLL from 32-bit tasks is
> any better than just preventing access to the known-broken files
> as Al suggested, or adding the hack to make it work as in
> Christoph's original patch.

That's why I'm hoping that Christoph's work and the discussion will
reach consensus, but the bug should be patched in the end. IMHO,
it's a good and easy enough fallback option (temporal?).

> 
> Can we expect all existing and future user space to have a sane
> fallback when IORING_SETUP_SQPOLL fails?

SQPOLL has a few differences with non-SQPOLL modes, but it's easy
to convert between them. Anyway, SQPOLL is a privileged special
case that's here for performance/latency reasons, I don't think
there will be any non-accidental users of it.

Arnd Bergmann Sept. 22, 2020, 9:01 a.m. UTC | #47

On Tue, Sep 22, 2020 at 9:59 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
> On 22/09/2020 10:23, Arnd Bergmann wrote:
> > On Tue, Sep 22, 2020 at 8:32 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
> >> On 22/09/2020 03:58, Andy Lutomirski wrote:
> >>> On Mon, Sep 21, 2020 at 5:24 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
> >>> I may be looking at a different kernel than you, but aren't you
> >>> preventing creating an io_uring regardless of whether SQPOLL is
> >>> requested?
> >>
> >> I diffed a not-saved file on a sleepy head, thanks for noticing.
> >> As you said, there should be an SQPOLL check.
> >>
> >> ...
> >> if (ctx->compat && (p->flags & IORING_SETUP_SQPOLL))
> >>         goto err;
> >
> > Wouldn't that mean that now 32-bit containers behave differently
> > between compat and native execution?
> >
> > I think if you want to prevent 32-bit applications from using SQPOLL,
> > it needs to be done the same way on both to be consistent:
>
> The intention was to disable only compat not native 32-bit.

I'm not following why that would be considered a valid option,
as that clearly breaks existing users that update from a 32-bit
kernel to a 64-bit one.

Taking away the features from users that are still on 32-bit kernels
already seems questionable to me, but being inconsistent
about it seems much worse, in particular when the regression
is on the upgrade path.

> > Can we expect all existing and future user space to have a sane
> > fallback when IORING_SETUP_SQPOLL fails?
>
> SQPOLL has a few differences with non-SQPOLL modes, but it's easy
> to convert between them. Anyway, SQPOLL is a privileged special
> case that's here for performance/latency reasons, I don't think
> there will be any non-accidental users of it.

Ok, so the behavior of 32-bit tasks would be the same as running
the same application as unprivileged 64-bit tasks, with applications
already having to implement that fallback, right?

      Arnd

Andy Lutomirski Sept. 22, 2020, 4:20 p.m. UTC | #48

> On Sep 22, 2020, at 2:01 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> 
> On Tue, Sep 22, 2020 at 9:59 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>> On 22/09/2020 10:23, Arnd Bergmann wrote:
>>> On Tue, Sep 22, 2020 at 8:32 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>>> On 22/09/2020 03:58, Andy Lutomirski wrote:
>>>>> On Mon, Sep 21, 2020 at 5:24 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>>>> I may be looking at a different kernel than you, but aren't you
>>>>> preventing creating an io_uring regardless of whether SQPOLL is
>>>>> requested?
>>>> 
>>>> I diffed a not-saved file on a sleepy head, thanks for noticing.
>>>> As you said, there should be an SQPOLL check.
>>>> 
>>>> ...
>>>> if (ctx->compat && (p->flags & IORING_SETUP_SQPOLL))
>>>>        goto err;
>>> 
>>> Wouldn't that mean that now 32-bit containers behave differently
>>> between compat and native execution?
>>> 
>>> I think if you want to prevent 32-bit applications from using SQPOLL,
>>> it needs to be done the same way on both to be consistent:
>> 
>> The intention was to disable only compat not native 32-bit.
> 
> I'm not following why that would be considered a valid option,
> as that clearly breaks existing users that update from a 32-bit
> kernel to a 64-bit one.
> 
> Taking away the features from users that are still on 32-bit kernels
> already seems questionable to me, but being inconsistent
> about it seems much worse, in particular when the regression
> is on the upgrade path.
> 
>>> Can we expect all existing and future user space to have a sane
>>> fallback when IORING_SETUP_SQPOLL fails?
>> 
>> SQPOLL has a few differences with non-SQPOLL modes, but it's easy
>> to convert between them. Anyway, SQPOLL is a privileged special
>> case that's here for performance/latency reasons, I don't think
>> there will be any non-accidental users of it.
> 
> Ok, so the behavior of 32-bit tasks would be the same as running
> the same application as unprivileged 64-bit tasks, with applications
> already having to implement that fallback, right?
> 
> 

I don’t have any real preference wrt SQPOLL, and it may be that we have a problem even without SQPOLL when IO gets punted without one of the fixes discussed.

But banning the mismatched io_uring and io_uring_enter seems like it may be worthwhile regardless.

Pavel Begunkov Sept. 23, 2020, 8:01 a.m. UTC | #49

On 22/09/2020 12:01, Arnd Bergmann wrote:
> On Tue, Sep 22, 2020 at 9:59 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>> On 22/09/2020 10:23, Arnd Bergmann wrote:
>>> On Tue, Sep 22, 2020 at 8:32 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>>> On 22/09/2020 03:58, Andy Lutomirski wrote:
>>>>> On Mon, Sep 21, 2020 at 5:24 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>>>> I may be looking at a different kernel than you, but aren't you
>>>>> preventing creating an io_uring regardless of whether SQPOLL is
>>>>> requested?
>>>>
>>>> I diffed a not-saved file on a sleepy head, thanks for noticing.
>>>> As you said, there should be an SQPOLL check.
>>>>
>>>> ...
>>>> if (ctx->compat && (p->flags & IORING_SETUP_SQPOLL))
>>>>         goto err;
>>>
>>> Wouldn't that mean that now 32-bit containers behave differently
>>> between compat and native execution?
>>>
>>> I think if you want to prevent 32-bit applications from using SQPOLL,
>>> it needs to be done the same way on both to be consistent:
>>
>> The intention was to disable only compat not native 32-bit.
> 
> I'm not following why that would be considered a valid option,
> as that clearly breaks existing users that update from a 32-bit
> kernel to a 64-bit one.

Do you mean users who move 32-bit binaries (without recompiling) to a
new x64 kernel? Does the kernel guarantees that to work? I'd personally
care more native-bitness apps.

> 
> Taking away the features from users that are still on 32-bit kernels
> already seems questionable to me, but being inconsistent
> about it seems much worse, in particular when the regression
> is on the upgrade path.

TBH, this won't fix that entirely (e.g. passing non-compat io_uring
to a compat process should yield the same problem). So, let's put
it aside for now until this bikeshedding would be relevant.

> 
>>> Can we expect all existing and future user space to have a sane
>>> fallback when IORING_SETUP_SQPOLL fails?
>>
>> SQPOLL has a few differences with non-SQPOLL modes, but it's easy
>> to convert between them. Anyway, SQPOLL is a privileged special
>> case that's here for performance/latency reasons, I don't think
>> there will be any non-accidental users of it.
> 
> Ok, so the behavior of 32-bit tasks would be the same as running
> the same application as unprivileged 64-bit tasks, with applications

Yes, something like that, but that's not automatic and in some
(hopefully rare) cases there may be pitfalls. That's in short,
I can expand the idea a bit if anyone would be interested.

> already having to implement that fallback, right?

Well, not everyone _have_ to implement such a fallback, e.g.
applications working only whilst privileged may use SQPOLL only.

Al Viro Sept. 23, 2020, 1:22 p.m. UTC | #50

On Wed, Sep 23, 2020 at 11:01:34AM +0300, Pavel Begunkov wrote:

> > I'm not following why that would be considered a valid option,
> > as that clearly breaks existing users that update from a 32-bit
> > kernel to a 64-bit one.
> 
> Do you mean users who move 32-bit binaries (without recompiling) to a
> new x64 kernel? Does the kernel guarantees that to work?

Yes.

No further (printable) comments for now...

diff mbox series

Patch

diff --git a/arch/sparc/include/asm/compat.h b/arch/sparc/include/asm/compat.h
index 40a267b3bd5208..fee6c51d36e869 100644
--- a/arch/sparc/include/asm/compat.h
+++ b/arch/sparc/include/asm/compat.h
@@ -211,7 +211,8 @@  static inline int is_compat_task(void)
 static inline bool in_compat_syscall(void)
 {
 	/* Vector 0x110 is LINUX_32BIT_SYSCALL_TRAP */
-	return pt_regs_trap_type(current_pt_regs()) == 0x110;
+	return pt_regs_trap_type(current_pt_regs()) == 0x110 ||
+		(current->flags & PF_FORCE_COMPAT);
 }
 #define in_compat_syscall in_compat_syscall
 #endif
diff --git a/arch/x86/include/asm/compat.h b/arch/x86/include/asm/compat.h
index d4edf281fff49d..fbab072d4e5b31 100644
--- a/arch/x86/include/asm/compat.h
+++ b/arch/x86/include/asm/compat.h
@@ -208,7 +208,7 @@  static inline bool in_32bit_syscall(void)
 #ifdef CONFIG_COMPAT
 static inline bool in_compat_syscall(void)
 {
-	return in_32bit_syscall();
+	return in_32bit_syscall() || (current->flags & PF_FORCE_COMPAT);
 }
 #define in_compat_syscall in_compat_syscall	/* override the generic impl */
 #endif
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 3790c7fe9fee22..5755d557c3f7bc 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -5449,6 +5449,9 @@  static int io_req_defer_prep(struct io_kiocb *req,
 	if (unlikely(ret))
 		return ret;
 
+	if (req->ctx->compat)
+		current->flags |= PF_FORCE_COMPAT;
+
 	switch (req->opcode) {
 	case IORING_OP_NOP:
 		break;
@@ -5546,6 +5549,7 @@  static int io_req_defer_prep(struct io_kiocb *req,
 		break;
 	}
 
+	current->flags &= ~PF_FORCE_COMPAT;
 	return ret;
 }
 
@@ -5669,6 +5673,9 @@  static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	struct io_ring_ctx *ctx = req->ctx;
 	int ret;
 
+	if (ctx->compat)
+		current->flags |= PF_FORCE_COMPAT;
+
 	switch (req->opcode) {
 	case IORING_OP_NOP:
 		ret = io_nop(req, cs);
@@ -5898,6 +5905,8 @@  static int io_issue_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		break;
 	}
 
+	current->flags &= ~PF_FORCE_COMPAT;
+
 	if (ret)
 		return ret;
 
diff --git a/include/linux/compat.h b/include/linux/compat.h
index b354ce58966e2d..685066f7ad325f 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -891,7 +891,10 @@  asmlinkage long compat_sys_socketcall(int call, u32 __user *args);
  */
 
 #ifndef in_compat_syscall
-static inline bool in_compat_syscall(void) { return is_compat_task(); }
+static inline bool in_compat_syscall(void)
+{
+	return is_compat_task() || (current->flags & PF_FORCE_COMPAT);
+}
 #endif
 
 /**
diff --git a/include/linux/sched.h b/include/linux/sched.h
index afe01e232935fa..c8b183b5655a1e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1491,6 +1491,7 @@  extern struct pid *cad_pid;
  */
 #define PF_IDLE			0x00000002	/* I am an IDLE thread */
 #define PF_EXITING		0x00000004	/* Getting shut down */
+#define PF_FORCE_COMPAT		0x00000008	/* acting as compat task */
 #define PF_VCPU			0x00000010	/* I'm a virtual CPU */
 #define PF_WQ_WORKER		0x00000020	/* I'm a workqueue worker */
 #define PF_FORKNOEXEC		0x00000040	/* Forked but didn't exec */