diff mbox

tracing/syscalls: ignore numbers outside NR_syscalls' range

Message ID 1414620418-29472-1-git-send-email-rabin@rab.in (mailing list archive)
State New, archived
Headers show

Commit Message

Rabin Vincent Oct. 29, 2014, 10:06 p.m. UTC
ARM has some private syscalls (for example, set_tls(2)) which lie
outside the range of NR_syscalls.  If any of these are called while
syscall tracing is being performed, out-of-bounds array access will
occur in the ftrace and perf sys_{enter,exit} handlers.

 # trace-cmd record -e raw_syscalls:* true && trace-cmd report
 ...
 true-653   [000]   384.675777: sys_enter:            NR 192 (0, 1000, 3, 4000022, ffffffff, 0)
 true-653   [000]   384.675812: sys_exit:             NR 192 = 1995915264
 true-653   [000]   384.675971: sys_enter:            NR 983045 (76f74480, 76f74000, 76f74b28, 76f74480, 76f76f74, 1)
 true-653   [000]   384.675988: sys_exit:             NR 983045 = 0
 ...

 # trace-cmd record -e syscalls:* true
 [   17.289329] Unable to handle kernel paging request at virtual address aaaaaace
 [   17.289590] pgd = 9e71c000
 [   17.289696] [aaaaaace] *pgd=00000000
 [   17.289985] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
 [   17.290169] Modules linked in:
 [   17.290391] CPU: 0 PID: 704 Comm: true Not tainted 3.18.0-rc2+ #21
 [   17.290585] task: 9f4dab00 ti: 9e710000 task.ti: 9e710000
 [   17.290747] PC is at ftrace_syscall_enter+0x48/0x1f8
 [   17.290866] LR is at syscall_trace_enter+0x124/0x184

Fix this by ignoring out-of-NR_syscalls-bounds syscall numbers.

Signed-off-by: Rabin Vincent <rabin@rab.in>
---
 kernel/trace/trace_syscalls.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Comments

Christoph Hellwig Oct. 30, 2014, 8:26 a.m. UTC | #1
On Wed, Oct 29, 2014 at 11:06:58PM +0100, Rabin Vincent wrote:
> ARM has some private syscalls (for example, set_tls(2)) which lie
> outside the range of NR_syscalls.  If any of these are called while
> syscall tracing is being performed, out-of-bounds array access will
> occur in the ftrace and perf sys_{enter,exit} handlers.

While this patch looks like good caution, having syscalls outside of
NR_syscalls seems like a receipe for a disaster.  Can you try to fix
that issue as ell, please?
Russell King - ARM Linux Oct. 30, 2014, 10:18 a.m. UTC | #2
On Thu, Oct 30, 2014 at 01:26:06AM -0700, Christoph Hellwig wrote:
> On Wed, Oct 29, 2014 at 11:06:58PM +0100, Rabin Vincent wrote:
> > ARM has some private syscalls (for example, set_tls(2)) which lie
> > outside the range of NR_syscalls.  If any of these are called while
> > syscall tracing is being performed, out-of-bounds array access will
> > occur in the ftrace and perf sys_{enter,exit} handlers.
> 
> While this patch looks like good caution, having syscalls outside of
> NR_syscalls seems like a receipe for a disaster.  Can you try to fix
> that issue as ell, please?

No.  We've had them since the inception of Linux on ARM.  They predate
this tracing crap by more than a decade.  We're not changing them
because that would be a massive user API breakage.
Steven Rostedt Oct. 30, 2014, 11:10 a.m. UTC | #3
On Thu, 30 Oct 2014 10:18:08 +0000
Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:

> On Thu, Oct 30, 2014 at 01:26:06AM -0700, Christoph Hellwig wrote:
> > On Wed, Oct 29, 2014 at 11:06:58PM +0100, Rabin Vincent wrote:
> > > ARM has some private syscalls (for example, set_tls(2)) which lie
> > > outside the range of NR_syscalls.  If any of these are called while
> > > syscall tracing is being performed, out-of-bounds array access will
> > > occur in the ftrace and perf sys_{enter,exit} handlers.
> > 
> > While this patch looks like good caution, having syscalls outside of
> > NR_syscalls seems like a receipe for a disaster.  Can you try to fix
> > that issue as ell, please?
> 
> No.  We've had them since the inception of Linux on ARM.  They predate
> this tracing crap by more than a decade.  We're not changing them
> because that would be a massive user API breakage.
> 

Since syscall tracing is only broken on ARM, then the fix needs to be
ARM specific, and not remove the check for all other architectures that
have a sane NR_syscalls variable.

-- Steve
Russell King - ARM Linux Oct. 30, 2014, 11:14 a.m. UTC | #4
On Thu, Oct 30, 2014 at 07:10:39AM -0400, Steven Rostedt wrote:
> On Thu, 30 Oct 2014 10:18:08 +0000
> Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:
> 
> > On Thu, Oct 30, 2014 at 01:26:06AM -0700, Christoph Hellwig wrote:
> > > On Wed, Oct 29, 2014 at 11:06:58PM +0100, Rabin Vincent wrote:
> > > > ARM has some private syscalls (for example, set_tls(2)) which lie
> > > > outside the range of NR_syscalls.  If any of these are called while
> > > > syscall tracing is being performed, out-of-bounds array access will
> > > > occur in the ftrace and perf sys_{enter,exit} handlers.
> > > 
> > > While this patch looks like good caution, having syscalls outside of
> > > NR_syscalls seems like a receipe for a disaster.  Can you try to fix
> > > that issue as ell, please?
> > 
> > No.  We've had them since the inception of Linux on ARM.  They predate
> > this tracing crap by more than a decade.  We're not changing them
> > because that would be a massive user API breakage.
> > 
> 
> Since syscall tracing is only broken on ARM, then the fix needs to be
> ARM specific, and not remove the check for all other architectures that
> have a sane NR_syscalls variable.

This issue came up before.

We have always had syscall number range of 0x900000 or so.  The tracing
design does not expect that.  Therefore, the tracing design did not take
account of ARM when it was created.  Therefore, it's up to the tracing
people to decide how to properly fit their ill-designed subsystem into
one of the popular and well-established kernel architectures - or at
least suggest a way to work around this issue.
Steven Rostedt Oct. 30, 2014, 11:30 a.m. UTC | #5
On Thu, 30 Oct 2014 11:14:41 +0000
Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:


> We have always had syscall number range of 0x900000 or so.  The tracing
> design does not expect that.  Therefore, the tracing design did not take
> account of ARM when it was created.  Therefore, it's up to the tracing
> people to decide how to properly fit their ill-designed subsystem into
> one of the popular and well-established kernel architectures - or at
> least suggest a way to work around this issue.
> 


Fine, lets define a MAX_SYSCALL_NR that is by default NR_syscalls, but
an architecture can override it.

In trace_syscalls.c, where the checks are done, have this:

#ifndef MAX_SYSCALL_NR
# define MAX_SYSCALL_NR NR_syscalls
#endif

change all the checks to test against MAX_SYSCALL_NR instead of
NR_syscalls.

Then in arch/arm/include/asm/syscall.h have:

#define MAX_SYSCALL_NR 0xa00000

or whatever would be the highest syscall number for ARM.

-- Steve
Russell King - ARM Linux Oct. 30, 2014, 11:35 a.m. UTC | #6
On Thu, Oct 30, 2014 at 07:30:28AM -0400, Steven Rostedt wrote:
> On Thu, 30 Oct 2014 11:14:41 +0000
> Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:
> 
> 
> > We have always had syscall number range of 0x900000 or so.  The tracing
> > design does not expect that.  Therefore, the tracing design did not take
> > account of ARM when it was created.  Therefore, it's up to the tracing
> > people to decide how to properly fit their ill-designed subsystem into
> > one of the popular and well-established kernel architectures - or at
> > least suggest a way to work around this issue.
> > 
> 
> 
> Fine, lets define a MAX_SYSCALL_NR that is by default NR_syscalls, but
> an architecture can override it.
> 
> In trace_syscalls.c, where the checks are done, have this:
> 
> #ifndef MAX_SYSCALL_NR
> # define MAX_SYSCALL_NR NR_syscalls
> #endif
> 
> change all the checks to test against MAX_SYSCALL_NR instead of
> NR_syscalls.
> 
> Then in arch/arm/include/asm/syscall.h have:
> 
> #define MAX_SYSCALL_NR 0xa00000
> 
> or whatever would be the highest syscall number for ARM.

Or do we just ignore the high "special" ARM syscalls and treat them (from
the tracing point of view) as non-syscalls, avoiding the allocation of
something around 1.2MB for the syscall bitmap.  I really don't know, I
don't use any of this tracing stuff, so it isn't something I care about.

Maybe those who do use the facility should have an input here?
Steven Rostedt Oct. 30, 2014, 11:52 a.m. UTC | #7
On Thu, 30 Oct 2014 07:10:39 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Thu, 30 Oct 2014 10:18:08 +0000
> Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:
> 
> > On Thu, Oct 30, 2014 at 01:26:06AM -0700, Christoph Hellwig wrote:
> > > On Wed, Oct 29, 2014 at 11:06:58PM +0100, Rabin Vincent wrote:
> > > > ARM has some private syscalls (for example, set_tls(2)) which lie
> > > > outside the range of NR_syscalls.  If any of these are called while
> > > > syscall tracing is being performed, out-of-bounds array access will
> > > > occur in the ftrace and perf sys_{enter,exit} handlers.
> > > 
> > > While this patch looks like good caution, having syscalls outside of
> > > NR_syscalls seems like a receipe for a disaster.  Can you try to fix
> > > that issue as ell, please?
> > 
> > No.  We've had them since the inception of Linux on ARM.  They predate
> > this tracing crap by more than a decade.  We're not changing them
> > because that would be a massive user API breakage.
> > 
> 
> Since syscall tracing is only broken on ARM, then the fix needs to be
> ARM specific, and not remove the check for all other architectures that
> have a sane NR_syscalls variable.

Bah, I misread the patch. I shouldn't read patches before having my
morning coffee :-/

I read it backwards. I thought it was removing the checks for
NR_syscalls, and not adding them.

I'm fine with the patch as is, and will take it.

But I agree that the syscall tracing code needs a rewrite to handle
these types of issues. It has problems with compat calls as well, which
we simply ignore.

Sorry for the confusion.

-- Steve
Steven Rostedt Oct. 30, 2014, 11:55 a.m. UTC | #8
On Thu, 30 Oct 2014 07:52:23 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

 
> Bah, I misread the patch. I shouldn't read patches before having my
> morning coffee :-/
> 

That's what I get by reading email before doing my morning Physical
Therapy treatment.

I'm off to do my PT exercises and then have breakfast. I'll pull the
patch in today, test it out, and then push it off for 3.18-rcX.

Thanks,

-- Steve
Ingo Molnar Oct. 31, 2014, 10:01 a.m. UTC | #9
* Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:

> On Thu, Oct 30, 2014 at 01:26:06AM -0700, Christoph Hellwig wrote:
> > On Wed, Oct 29, 2014 at 11:06:58PM +0100, Rabin Vincent wrote:
> > > ARM has some private syscalls (for example, set_tls(2)) which lie
> > > outside the range of NR_syscalls.  If any of these are called while
> > > syscall tracing is being performed, out-of-bounds array access will
> > > occur in the ftrace and perf sys_{enter,exit} handlers.
> > 
> > While this patch looks like good caution, having syscalls 
> > outside of NR_syscalls seems like a receipe for a disaster.  
> > Can you try to fix that issue as ell, please?
> 
> No.  We've had them since the inception of Linux on ARM.  They 
> predate this tracing crap by more than a decade.  We're not 
> changing them because that would be a massive user API 
> breakage.

So if you go around calling other people's code 'crap' so easily: 
if we should call something 'crap' in this area it's the decision 
of ARM to deviate from all other architectures arbitrarily and to 
introduce 'private' syscalls outside NR_syscalls...

There's a reason why we have NR_syscalls with relatively tighly 
packed syscall numbers and there's a reason why we don't do 
'private' syscalls on other architectures.

I'd probably have NAK-ed ARM's 'private syscalls' had I known 
about it when this was introduced for ARM. IMO you should be 
ashamed for it instead of blaming others for the complication ...

But yes, it's probably an ABI, albeit a crappy one, which is now 
hurting the introduction of a generic kernel facility in the ARM 
space.

Thanks,

	Ingo
Nathan Lynch Nov. 3, 2014, 5:08 p.m. UTC | #10
On 10/30/2014 06:35 AM, Russell King - ARM Linux wrote:
> On Thu, Oct 30, 2014 at 07:30:28AM -0400, Steven Rostedt wrote:
>> On Thu, 30 Oct 2014 11:14:41 +0000
>> Russell King - ARM Linux <linux@arm.linux.org.uk> wrote:
>>
>>
>>> We have always had syscall number range of 0x900000 or so.  The tracing
>>> design does not expect that.  Therefore, the tracing design did not take
>>> account of ARM when it was created.  Therefore, it's up to the tracing
>>> people to decide how to properly fit their ill-designed subsystem into
>>> one of the popular and well-established kernel architectures - or at
>>> least suggest a way to work around this issue.
>>>
>>
>>
>> Fine, lets define a MAX_SYSCALL_NR that is by default NR_syscalls, but
>> an architecture can override it.
>>
>> In trace_syscalls.c, where the checks are done, have this:
>>
>> #ifndef MAX_SYSCALL_NR
>> # define MAX_SYSCALL_NR NR_syscalls
>> #endif
>>
>> change all the checks to test against MAX_SYSCALL_NR instead of
>> NR_syscalls.
>>
>> Then in arch/arm/include/asm/syscall.h have:
>>
>> #define MAX_SYSCALL_NR 0xa00000
>>
>> or whatever would be the highest syscall number for ARM.
> 
> Or do we just ignore the high "special" ARM syscalls and treat them (from
> the tracing point of view) as non-syscalls, avoiding the allocation of
> something around 1.2MB for the syscall bitmap.  I really don't know, I
> don't use any of this tracing stuff, so it isn't something I care about.
> 
> Maybe those who do use the facility should have an input here?

I checked strace and it knows about ARM's high syscalls.  I wouldn't
want to go from casually using strace to digging deeper with ftrace only
to get the impression that syscalls are disappearing.
Steven Rostedt Nov. 3, 2014, 5:58 p.m. UTC | #11
On Mon, 3 Nov 2014 11:08:03 -0600
Nathan Lynch <Nathan_Lynch@mentor.com> wrote:


> > Or do we just ignore the high "special" ARM syscalls and treat them (from
> > the tracing point of view) as non-syscalls, avoiding the allocation of
> > something around 1.2MB for the syscall bitmap.  I really don't know, I
> > don't use any of this tracing stuff, so it isn't something I care about.
> > 
> > Maybe those who do use the facility should have an input here?
> 
> I checked strace and it knows about ARM's high syscalls.  I wouldn't
> want to go from casually using strace to digging deeper with ftrace only
> to get the impression that syscalls are disappearing.

I agree, but currently the syscall tracing does not support different
mappings, and if there's a group of calls outside of NR_syscalls range,
they will currently be ignored.

The fix may be to restructure how syscall tracing works. But for now,
the only answer we have is to just ignore them. x86 has the same issue
with compat calls (i386 syscalls on x86_64 kernels).

-- Steve
diff mbox

Patch

diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 4dc8b79..29228c4 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -313,7 +313,7 @@  static void ftrace_syscall_enter(void *data, struct pt_regs *regs, long id)
 	int size;
 
 	syscall_nr = trace_get_syscall_nr(current, regs);
-	if (syscall_nr < 0)
+	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
 		return;
 
 	/* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE) */
@@ -360,7 +360,7 @@  static void ftrace_syscall_exit(void *data, struct pt_regs *regs, long ret)
 	int syscall_nr;
 
 	syscall_nr = trace_get_syscall_nr(current, regs);
-	if (syscall_nr < 0)
+	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
 		return;
 
 	/* Here we're inside tp handler's rcu_read_lock_sched (__DO_TRACE()) */
@@ -567,7 +567,7 @@  static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
 	int size;
 
 	syscall_nr = trace_get_syscall_nr(current, regs);
-	if (syscall_nr < 0)
+	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
 		return;
 	if (!test_bit(syscall_nr, enabled_perf_enter_syscalls))
 		return;
@@ -641,7 +641,7 @@  static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
 	int size;
 
 	syscall_nr = trace_get_syscall_nr(current, regs);
-	if (syscall_nr < 0)
+	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
 		return;
 	if (!test_bit(syscall_nr, enabled_perf_exit_syscalls))
 		return;