__register_sysctl_table: do not drop subdir
diff mbox series

Message ID 20200527104848.GA7914@nixbox
State New
Headers show
Series
  • __register_sysctl_table: do not drop subdir
Related show

Commit Message

Boris Sukholitko May 27, 2020, 10:48 a.m. UTC
Successful get_subdir returns dir with its header.nreg properly
adjusted. No need to drop the dir in that case.

Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
Fixes: 7ec66d06362d (sysctl: Stop requiring explicit management of sysctl directories)
---
 fs/proc/proc_sysctl.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Comments

Luis Chamberlain May 27, 2020, 12:58 p.m. UTC | #1
Eric since you authored the code which this code claism to fix, your
review would be appreciated.

On Wed, May 27, 2020 at 01:48:48PM +0300, Boris Sukholitko wrote:
> Successful get_subdir returns dir with its header.nreg properly
> adjusted. No need to drop the dir in that case.

This commit log is not that clear to me, can you explain what happens
without this patch, and how critical it is to fix it. How did you
notice this issue? If you don't apply this patch what issue do you
see?

Do we test for it? Can we?

  Luis

> 
> Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
> Fixes: 7ec66d06362d (sysctl: Stop requiring explicit management of sysctl directories)
> ---
>  fs/proc/proc_sysctl.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> index b6f5d459b087..6f237f0eb531 100644
> --- a/fs/proc/proc_sysctl.c
> +++ b/fs/proc/proc_sysctl.c
> @@ -1286,8 +1286,8 @@ struct ctl_table_header *__register_sysctl_table(
>  {
>  	struct ctl_table_root *root = set->dir.header.root;
>  	struct ctl_table_header *header;
> +	struct ctl_dir *dir, *start_dir;
>  	const char *name, *nextname;
> -	struct ctl_dir *dir;
>  	struct ctl_table *entry;
>  	struct ctl_node *node;
>  	int nr_entries = 0;
> @@ -1307,6 +1307,7 @@ struct ctl_table_header *__register_sysctl_table(
>  
>  	spin_lock(&sysctl_lock);
>  	dir = &set->dir;
> +	start_dir = dir;
>  	/* Reference moved down the diretory tree get_subdir */
>  	dir->header.nreg++;
>  	spin_unlock(&sysctl_lock);
> @@ -1333,7 +1334,8 @@ struct ctl_table_header *__register_sysctl_table(
>  	if (insert_header(dir, header))
>  		goto fail_put_dir_locked;
>  
> -	drop_sysctl_table(&dir->header);
> +	if (start_dir == dir)
> +		drop_sysctl_table(&dir->header);
>  	spin_unlock(&sysctl_lock);
>  
>  	return header;
> -- 
> 2.23.1
>
Boris Sukholitko May 28, 2020, 8:08 a.m. UTC | #2
On Wed, May 27, 2020 at 12:58:05PM +0000, Luis Chamberlain wrote:
> Eric since you authored the code which this code claism to fix, your
> review would be appreciated.
> 
> On Wed, May 27, 2020 at 01:48:48PM +0300, Boris Sukholitko wrote:
> > Successful get_subdir returns dir with its header.nreg properly
> > adjusted. No need to drop the dir in that case.
> 
> This commit log is not that clear to me
> can you explain what happens
> without this patch, and how critical it is to fix it. How did you
> notice this issue?

Apologies for being too terse with my explanation. I'll try to expand
below.

In testing of our kernel (based on 4.19, tainted, sorry!) on our aarch64 based hardware
we've come upon the following oops (lightly edited to omit irrelevant details):

000:50:01.133 Unable to handle kernel paging request at virtual address 0000000000007a12
000:50:02.209 Process brctl (pid: 14467, stack limit = 0x00000000bcf7a578)
000:50:02.209 CPU: 1 PID: 14467 Comm: brctl Tainted: P                  4.19.122 #1
000:50:02.209 Hardware name: Broadcom-v8A (DT)
000:50:02.209 pstate: 60000005 (nZCv daif -PAN -UAO)
000:50:02.209 pc : unregister_sysctl_table+0x1c/0xa0
000:50:02.209 lr : unregister_net_sysctl_table+0xc/0x20
000:50:02.209 sp : ffffff800e5ab9e0
000:50:02.209 x29: ffffff800e5ab9e0 x28: ffffffc016439ec0 
000:50:02.209 x27: 0000000000000000 x26: ffffff8008804078 
000:50:02.209 x25: ffffff80087b4dd8 x24: ffffffc015d65000 
000:50:02.209 x23: ffffffc01f0d6010 x22: ffffffc01f0d6000 
000:50:02.209 x21: ffffffc0166c4eb0 x20: 00000000000000bd 
000:50:02.209 x19: ffffffc01f0d6030 x18: 0000000000000400 
000:50:02.256 x17: 0000000000000000 x16: 0000000000000000 
000:50:02.256 x15: 0000000000000400 x14: 0000000000000129 
000:50:02.256 x13: 0000000000000001 x12: 0000000000000030 
000:50:02.256 x11: 0101010101010101 x10: 7f7f7f7f7f7f7f7f 
000:50:02.256 x9 : feff646663687161 x8 : ffffffffffffffff 
000:50:02.256 x7 : fefefefefefefefe x6 : 0000000000008080 
000:50:02.256 x5 : 00000000ffffffff x4 : ffffff8008905c38 
000:50:02.256 x3 : ffffffc01f0d602c x2 : 00000000000000bd 
000:50:02.256 x1 : ffffffc01f0d60c0 x0 : 0000000000007a12 
000:50:02.256 Call trace:
000:50:02.256  unregister_sysctl_table+0x1c/0xa0
000:50:02.256  unregister_net_sysctl_table+0xc/0x20
000:50:02.256  __devinet_sysctl_unregister.isra.0+0x2c/0x60
000:50:02.256  inetdev_event+0x198/0x510
000:50:02.256  notifier_call_chain+0x58/0xa0
000:50:02.303  raw_notifier_call_chain+0x14/0x20
000:50:02.303  call_netdevice_notifiers_info+0x34/0x80
000:50:02.303  rollback_registered_many+0x384/0x600
000:50:02.303  unregister_netdevice_queue+0x8c/0x110
000:50:02.303  br_dev_delete+0x8c/0xa0
000:50:02.303  br_del_bridge+0x44/0x70
000:50:02.303  br_ioctl_deviceless_stub+0xcc/0x310
000:50:02.303  sock_ioctl+0x194/0x3f0
000:50:02.303  compat_sock_ioctl+0x678/0xc00
000:50:02.303  __arm64_compat_sys_ioctl+0xf0/0xcb0
000:50:02.303  el0_svc_common+0x70/0x170
000:50:02.303  el0_svc_compat_handler+0x1c/0x30
000:50:02.303  el0_svc_compat+0x8/0x18
000:50:02.303 Code: a90153f3 aa0003f3 f9401000 b40000c0 (f9400001) 

The crash is in the call to count_subheaders(header->ctl_table_arg).

Although the header (being in x19 == 0xffffffc01f0d6030) looks like a
normal kernel pointer, ctl_table_arg (x0 == 0x0000000000007a12) looks
invalid.

Trying to find the issue, we've started tracing header allocation being
done by kzalloc in __register_sysctl_table and header freeing being done
in drop_sysctl_table.

Then we've noticed headers being freed which where not allocated before.
The faulty freeing was done on parent->header at the end of
drop_sysctl_table.

From this we've started to suspect some infelicity in header.nreg
refcounting, thus leading us the __register_sysctl_table fix in the
patch.

Here is more detailed explanation of the fix.

The current __register_sysctl_table logic looks like:

1. We start with some root dir, incrementing its header.nreg.

2. Then we find suitable dir using get_subdir function.

3. get_subdir decrements nreg on the parent dir and increments it on the
   dir being returned. See found label there.

4. We decrement dir's header.nreg for the symmetry with step 1.

IMHO, the bug is on step 4. If another dir is being returned by
get_subdir we decrement its nreg. I.e. the returned dir nreg stays 1
despite having children added to it.

This leads eventually to the innocent parent header being freed.

> If you don't apply this patch what issue do you see?

For some unexplained reason, the crashes are very rare and require
stressing the system while creating and destroing network interfaces.

> 
> Do we test for it? Can we?
> 

With some printk tracing the issue is easy to see while doing simple
brctl addbr / delbr to create and destroy bridge interface.

Probably there is some SLUB debug option which may allow to catch the
faulty free.

Thanks,
Boris.

>   Luis
> 
> > 
> > Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
> > Fixes: 7ec66d06362d (sysctl: Stop requiring explicit management of sysctl directories)
> > ---
> >  fs/proc/proc_sysctl.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> > index b6f5d459b087..6f237f0eb531 100644
> > --- a/fs/proc/proc_sysctl.c
> > +++ b/fs/proc/proc_sysctl.c
> > @@ -1286,8 +1286,8 @@ struct ctl_table_header *__register_sysctl_table(
> >  {
> >  	struct ctl_table_root *root = set->dir.header.root;
> >  	struct ctl_table_header *header;
> > +	struct ctl_dir *dir, *start_dir;
> >  	const char *name, *nextname;
> > -	struct ctl_dir *dir;
> >  	struct ctl_table *entry;
> >  	struct ctl_node *node;
> >  	int nr_entries = 0;
> > @@ -1307,6 +1307,7 @@ struct ctl_table_header *__register_sysctl_table(
> >  
> >  	spin_lock(&sysctl_lock);
> >  	dir = &set->dir;
> > +	start_dir = dir;
> >  	/* Reference moved down the diretory tree get_subdir */
> >  	dir->header.nreg++;
> >  	spin_unlock(&sysctl_lock);
> > @@ -1333,7 +1334,8 @@ struct ctl_table_header *__register_sysctl_table(
> >  	if (insert_header(dir, header))
> >  		goto fail_put_dir_locked;
> >  
> > -	drop_sysctl_table(&dir->header);
> > +	if (start_dir == dir)
> > +		drop_sysctl_table(&dir->header);
> >  	spin_unlock(&sysctl_lock);
> >  
> >  	return header;
> > -- 
> > 2.23.1
> >
Eric W. Biederman May 28, 2020, 2:04 p.m. UTC | #3
Boris Sukholitko <boris.sukholitko@broadcom.com> writes:

> On Wed, May 27, 2020 at 12:58:05PM +0000, Luis Chamberlain wrote:
>> Eric since you authored the code which this code claism to fix, your
>> review would be appreciated.
>> 
>> On Wed, May 27, 2020 at 01:48:48PM +0300, Boris Sukholitko wrote:
>> > Successful get_subdir returns dir with its header.nreg properly
>> > adjusted. No need to drop the dir in that case.
>> 
>> This commit log is not that clear to me
>> can you explain what happens
>> without this patch, and how critical it is to fix it. How did you
>> notice this issue?
>
> Apologies for being too terse with my explanation. I'll try to expand
> below.
>
> In testing of our kernel (based on 4.19, tainted, sorry!) on our aarch64 based hardware
> we've come upon the following oops (lightly edited to omit irrelevant
> details):

How does your 4.19 proc_sysctl.c compare with the latest proc-sysctl.c?
Have you backported all of the most recent bug fixes?

> 000:50:01.133 Unable to handle kernel paging request at virtual address 0000000000007a12
> 000:50:02.209 Process brctl (pid: 14467, stack limit = 0x00000000bcf7a578)
> 000:50:02.209 CPU: 1 PID: 14467 Comm: brctl Tainted: P                  4.19.122 #1
> 000:50:02.209 Hardware name: Broadcom-v8A (DT)
> 000:50:02.209 pstate: 60000005 (nZCv daif -PAN -UAO)
> 000:50:02.209 pc : unregister_sysctl_table+0x1c/0xa0
> 000:50:02.209 lr : unregister_net_sysctl_table+0xc/0x20
> 000:50:02.209 sp : ffffff800e5ab9e0
> 000:50:02.209 x29: ffffff800e5ab9e0 x28: ffffffc016439ec0 
> 000:50:02.209 x27: 0000000000000000 x26: ffffff8008804078 
> 000:50:02.209 x25: ffffff80087b4dd8 x24: ffffffc015d65000 
> 000:50:02.209 x23: ffffffc01f0d6010 x22: ffffffc01f0d6000 
> 000:50:02.209 x21: ffffffc0166c4eb0 x20: 00000000000000bd 
> 000:50:02.209 x19: ffffffc01f0d6030 x18: 0000000000000400 
> 000:50:02.256 x17: 0000000000000000 x16: 0000000000000000 
> 000:50:02.256 x15: 0000000000000400 x14: 0000000000000129 
> 000:50:02.256 x13: 0000000000000001 x12: 0000000000000030 
> 000:50:02.256 x11: 0101010101010101 x10: 7f7f7f7f7f7f7f7f 
> 000:50:02.256 x9 : feff646663687161 x8 : ffffffffffffffff 
> 000:50:02.256 x7 : fefefefefefefefe x6 : 0000000000008080 
> 000:50:02.256 x5 : 00000000ffffffff x4 : ffffff8008905c38 
> 000:50:02.256 x3 : ffffffc01f0d602c x2 : 00000000000000bd 
> 000:50:02.256 x1 : ffffffc01f0d60c0 x0 : 0000000000007a12 
> 000:50:02.256 Call trace:
> 000:50:02.256  unregister_sysctl_table+0x1c/0xa0
> 000:50:02.256  unregister_net_sysctl_table+0xc/0x20
> 000:50:02.256  __devinet_sysctl_unregister.isra.0+0x2c/0x60
> 000:50:02.256  inetdev_event+0x198/0x510
> 000:50:02.256  notifier_call_chain+0x58/0xa0
> 000:50:02.303  raw_notifier_call_chain+0x14/0x20
> 000:50:02.303  call_netdevice_notifiers_info+0x34/0x80
> 000:50:02.303  rollback_registered_many+0x384/0x600
> 000:50:02.303  unregister_netdevice_queue+0x8c/0x110
> 000:50:02.303  br_dev_delete+0x8c/0xa0
> 000:50:02.303  br_del_bridge+0x44/0x70
> 000:50:02.303  br_ioctl_deviceless_stub+0xcc/0x310
> 000:50:02.303  sock_ioctl+0x194/0x3f0
> 000:50:02.303  compat_sock_ioctl+0x678/0xc00
> 000:50:02.303  __arm64_compat_sys_ioctl+0xf0/0xcb0
> 000:50:02.303  el0_svc_common+0x70/0x170
> 000:50:02.303  el0_svc_compat_handler+0x1c/0x30
> 000:50:02.303  el0_svc_compat+0x8/0x18
> 000:50:02.303 Code: a90153f3 aa0003f3 f9401000 b40000c0 (f9400001) 
>
> The crash is in the call to count_subheaders(header->ctl_table_arg).
>
> Although the header (being in x19 == 0xffffffc01f0d6030) looks like a
> normal kernel pointer, ctl_table_arg (x0 == 0x0000000000007a12) looks
> invalid.
>
> Trying to find the issue, we've started tracing header allocation being
> done by kzalloc in __register_sysctl_table and header freeing being done
> in drop_sysctl_table.
>
> Then we've noticed headers being freed which where not allocated before.
> The faulty freeing was done on parent->header at the end of
> drop_sysctl_table.
>
> From this we've started to suspect some infelicity in header.nreg
> refcounting, thus leading us the __register_sysctl_table fix in the
> patch.
>
> Here is more detailed explanation of the fix.
>
> The current __register_sysctl_table logic looks like:
>
> 1. We start with some root dir, incrementing its header.nreg.
>
> 2. Then we find suitable dir using get_subdir function.
>
> 3. get_subdir decrements nreg on the parent dir and increments it on the
>    dir being returned. See found label there.
>
> 4. We decrement dir's header.nreg for the symmetry with step 1.
>
> IMHO, the bug is on step 4. If another dir is being returned by
> get_subdir we decrement its nreg. I.e. the returned dir nreg stays 1
> despite having children added to it.
>
> This leads eventually to the innocent parent header being freed.
>

But the insertion of children in insert_header also increases the count
so it does not look like that should be true.

>> If you don't apply this patch what issue do you see?
>
> For some unexplained reason, the crashes are very rare and require
> stressing the system while creating and destroing network interfaces.
>
>> 
>> Do we test for it? Can we?
>> 
>
> With some printk tracing the issue is easy to see while doing simple
> brctl addbr / delbr to create and destroy bridge interface.
>
> Probably there is some SLUB debug option which may allow to catch the
> faulty free.

I see some recent (within the last year) fixes to proc_sysctl.c in this
area.  Do you have those?  It looks like bridge up and down is stressing
this code.  Either those most recent fixes are wrong, your kernel is
missing them or this needs some more investigation.

Eric
Luis Chamberlain May 28, 2020, 2:20 p.m. UTC | #4
On Thu, May 28, 2020 at 09:04:02AM -0500, Eric W. Biederman wrote:
> I see some recent (within the last year) fixes to proc_sysctl.c in this
> area.  Do you have those?  It looks like bridge up and down is stressing
> this code.  Either those most recent fixes are wrong, your kernel is
> missing them or this needs some more investigation.

Thanks for the review Eric.

Boris, the elaborate deatils you provided would also be what would be
needed for a commit log, specially since this is fixing a crash. If
you confirm this is still present upstream by reproducing with a test
case it would be wonderful.

If you are familiar with what might be the issue, you can even construct
your own kernel-code proof of concept test using lib/test_sysctl.c. We use the
script tools/testing/selftests/sysctl/sysctl.sh to hammer on that.

  Luis
Boris Sukholitko May 31, 2020, 11:39 a.m. UTC | #5
On Thu, May 28, 2020 at 09:04:02AM -0500, Eric W. Biederman wrote:

[snip incorrect root cause analysis]

> 
> But the insertion of children in insert_header also increases the count
> so it does not look like that should be true.
> 

Yes, I have missed the insert_header nreg increment, thus making my root
cause analysis incorrect.

IMHO, the bug does exist in the latest kernel. Could you please see:

[PATCH] get_subdir: do not drop new subdir if returning it

I've just sent to the fsdevel mailing list. In it I've added invariant:

WARN_ON(dir->header.nreg < 2);

at the end of __register_sysctl_table which the latest kernel fails to
obey.

The fix seems to belong to get_subdir function, therefore I've broke the
thread and sent a new patch.

Thanks,
Boris.
Boris Sukholitko May 31, 2020, 11:44 a.m. UTC | #6
On Thu, May 28, 2020 at 02:20:45PM +0000, Luis Chamberlain wrote:
> On Thu, May 28, 2020 at 09:04:02AM -0500, Eric W. Biederman wrote:
> > I see some recent (within the last year) fixes to proc_sysctl.c in this
> > area.  Do you have those?  It looks like bridge up and down is stressing
> > this code.  Either those most recent fixes are wrong, your kernel is
> > missing them or this needs some more investigation.
> 
> Thanks for the review Eric.
> 

Seconded. My first try at fixing it was incorrect, hopefully the new
attempt ([PATCH] get_subdir: do not drop new subdir if returning it)
I've just sent will fare better.

> Boris, the elaborate deatils you provided would also be what would be
> needed for a commit log, specially since this is fixing a crash. If
> you confirm this is still present upstream by reproducing with a test
> case it would be wonderful.

Yes. It is present in the upstream kernel. In my patch I've added
warning to catch this bad condition. The warning fires easily during
boot.

Thanks,
Boris.
Luis Chamberlain June 1, 2020, 1:17 p.m. UTC | #7
On Sun, May 31, 2020 at 02:44:32PM +0300, Boris Sukholitko wrote:
> On Thu, May 28, 2020 at 02:20:45PM +0000, Luis Chamberlain wrote:
> > On Thu, May 28, 2020 at 09:04:02AM -0500, Eric W. Biederman wrote:
> > > I see some recent (within the last year) fixes to proc_sysctl.c in this
> > > area.  Do you have those?  It looks like bridge up and down is stressing
> > > this code.  Either those most recent fixes are wrong, your kernel is
> > > missing them or this needs some more investigation.
> > 
> > Thanks for the review Eric.
> > 
> 
> Seconded. My first try at fixing it was incorrect, hopefully the new
> attempt ([PATCH] get_subdir: do not drop new subdir if returning it)
> I've just sent will fare better.
> 
> > Boris, the elaborate deatils you provided would also be what would be
> > needed for a commit log, specially since this is fixing a crash. If
> > you confirm this is still present upstream by reproducing with a test
> > case it would be wonderful.
> 
> Yes. It is present in the upstream kernel. In my patch I've added
> warning to catch this bad condition. The warning fires easily during
> boot.

Can you add a test case for this to reproduce in lib/test_sysct.c?

  Luis

Patch
diff mbox series

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index b6f5d459b087..6f237f0eb531 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -1286,8 +1286,8 @@  struct ctl_table_header *__register_sysctl_table(
 {
 	struct ctl_table_root *root = set->dir.header.root;
 	struct ctl_table_header *header;
+	struct ctl_dir *dir, *start_dir;
 	const char *name, *nextname;
-	struct ctl_dir *dir;
 	struct ctl_table *entry;
 	struct ctl_node *node;
 	int nr_entries = 0;
@@ -1307,6 +1307,7 @@  struct ctl_table_header *__register_sysctl_table(
 
 	spin_lock(&sysctl_lock);
 	dir = &set->dir;
+	start_dir = dir;
 	/* Reference moved down the diretory tree get_subdir */
 	dir->header.nreg++;
 	spin_unlock(&sysctl_lock);
@@ -1333,7 +1334,8 @@  struct ctl_table_header *__register_sysctl_table(
 	if (insert_header(dir, header))
 		goto fail_put_dir_locked;
 
-	drop_sysctl_table(&dir->header);
+	if (start_dir == dir)
+		drop_sysctl_table(&dir->header);
 	spin_unlock(&sysctl_lock);
 
 	return header;