From patchwork Sun Jul 10 09:04:19 2016 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Topi Miettinen X-Patchwork-Id: 9222409 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id AF7A160772 for ; Sun, 10 Jul 2016 09:04:53 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 9C4BB2521F for ; Sun, 10 Jul 2016 09:04:53 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 8E79C264FB; Sun, 10 Jul 2016 09:04:53 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.8 required=2.0 tests=BAYES_00, DKIM_ADSP_CUSTOM_MED, DKIM_SIGNED, FREEMAIL_FROM, RCVD_IN_DNSWL_HI, T_DKIM_INVALID, T_TVD_MIME_EPI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id AB54A2521F for ; Sun, 10 Jul 2016 09:04:52 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1750970AbcGJJEh (ORCPT ); Sun, 10 Jul 2016 05:04:37 -0400 Received: from mail-wm0-f66.google.com ([74.125.82.66]:32920 "EHLO mail-wm0-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750822AbcGJJEa (ORCPT ); Sun, 10 Jul 2016 05:04:30 -0400 Received: by mail-wm0-f66.google.com with SMTP id o80so1444075wme.0; Sun, 10 Jul 2016 02:04:29 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=subject:to:references:cc:from:openpgp:message-id:date:user-agent :mime-version:in-reply-to; bh=gPnaAGzdtx3autrk4SGIGgkqokJWS5ms83GTBv+J0ug=; b=F6REqyEWXq7M+aEjBg1GxA1bcWkJVnsvVcpohxBLdgkn2JijucSfAQlX/x9EgV/K0S ymOdorHWVEZX9wvYjyCV3yi5NluNPHNCg/3QUKrn6xfoSCPe8Xyfia1CJnnpxJzyfpy+ T24b9kpxu2UjMDBynJy+Xj5hYDlocbPJB56Nx30EjAwRHSQB8IoEuKXznOWSj1UPqIUi 85te6FzF8IUTySPaPfzKZSZosm6TZii0mSqRZP/TLSxFGC5qWLlhXByGxDeWnyKRXGCY nrYGlU9VD6wx0V2inaGfBI9Fyxl+klWbNVX3FW+iAQv33dYQxG+h7NYGlDZHEIZlywlY i9+A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:subject:to:references:cc:from:openpgp:message-id :date:user-agent:mime-version:in-reply-to; bh=gPnaAGzdtx3autrk4SGIGgkqokJWS5ms83GTBv+J0ug=; b=XTjTTr4drVeDgXYhLmFq0k3PIeKZ7rOT66uCpzZp8uH/GhKK2ArNBYJSgPX6fHE1b7 ZkPXB6K33En6w2ZyJYq4+R3tAETJ9QlydI3BnDOOzkiFQHR7wWhGJii6YvIqJia0Tnxc AhIv4xCqOFNfhQNq8xjLY95EmK0+Xj5/bcwO9KnaVswxFf+7Jif/b6Oo1mT0U91EZj3u TunWv/6tsFI73VHY3SeoPo4p0+Lcx9B1e34lwtTZE2Dw6zzgWuWlPaOln9V42F28RC/o 2MplO5dqStdi36SU430yvxXp7In7jXNPAefvDt74a1TO7NAZjHx7XNVVnvVwnJWmTXLF Zqrw== X-Gm-Message-State: ALyK8tI/pika9e8yG5Bol5y/pxRAmMA2RieVT4AkK7kzpRha1GpJsEKuxRG4XYgJRlgo5A== X-Received: by 10.28.47.77 with SMTP id v74mr6301088wmv.90.1468141468219; Sun, 10 Jul 2016 02:04:28 -0700 (PDT) Received: from ?IPv6:2001:2003:f54a:b700:2ab2:bdff:fe10:799e? ([2001:2003:f54a:b700:2ab2:bdff:fe10:799e]) by smtp.gmail.com with ESMTPSA id f73sm18039469wmg.1.2016.07.10.02.04.25 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 10 Jul 2016 02:04:27 -0700 (PDT) Subject: Re: [PATCH] capabilities: add capability cgroup controller To: Petr Mladek References: <20160624172447.GA3262@mtj.duckdns.org> <47890d79-0891-dd13-4f60-e7e5f1f3fed3@gmail.com> <20160627145457.GA26980@mail.hallyn.com> <58938c8b-aca6-a5b8-9533-58e78d878e85@gmail.com> <20160627194941.GA31843@mail.hallyn.com> <218f2bef-5e5e-89c4-154b-24dc49c82c31@gmail.com> <20160707091645.GG3238@pathway.suse.cz> <20160708091332.GD3556@pathway.suse.cz> Cc: "Serge E. Hallyn" , "Eric W. Biederman" , Tejun Heo , lkml , luto@kernel.org, Kees Cook , Jonathan Corbet , Li Zefan , Johannes Weiner , James Morris , Andrew Morton , David Howells , David Woodhouse , Ard Biesheuvel , "Paul E. McKenney" , "open list:DOCUMENTATION" , "open list:CONTROL GROUP (CGROUP)" , "open list:CAPABILITIES" From: Topi Miettinen Openpgp: id=A0F2EB0D8452DA908BEC8E911CF9ADDBD610E936 Message-ID: <5e3cd49c-2ae1-efc1-ea85-afa1ff7b22cb@gmail.com> Date: Sun, 10 Jul 2016 09:04:19 +0000 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Icedove/45.1.0 MIME-Version: 1.0 In-Reply-To: <20160708091332.GD3556@pathway.suse.cz> Sender: owner-linux-security-module@vger.kernel.org Precedence: bulk List-ID: X-Virus-Scanned: ClamAV using ClamSMTP On 07/08/16 09:13, Petr Mladek wrote: > On Thu 2016-07-07 20:27:13, Topi Miettinen wrote: >> On 07/07/16 09:16, Petr Mladek wrote: >>> On Sun 2016-07-03 15:08:07, Topi Miettinen wrote: >>>> The attached patch would make any uses of capabilities generate audit >>>> messages. It works for simple tests as you can see from the commit >>>> message, but unfortunately the call to audit_cgroup_list() deadlocks the >>>> system when booting a full blown OS. There's no deadlock when the call >>>> is removed. >>>> >>>> I guess that in some cases, cgroup_mutex and/or css_set_lock could be >>>> already held earlier before entering audit_cgroup_list(). Holding the >>>> locks is however required by task_cgroup_from_root(). Is there any way >>>> to avoid this? For example, only print some kind of cgroup ID numbers >>>> (are there unique and stable IDs, available without locks?) for those >>>> cgroups where the task is registered in the audit message? >>> >>> I am not sure if anyone know what really happens here. I suggest to >>> enable lockdep. It might detect possible deadlock even before it >>> really happens, see Documentation/locking/lockdep-design.txt >>> >>> It can be enabled by >>> >>> CONFIG_PROVE_LOCKING=y >>> >>> It depends on >>> >>> CONFIG_DEBUG_KERNEL=y >>> >>> and maybe some more options, see lib/Kconfig.debug >> >> Thanks a lot! I caught this stack dump: >> >> starting version 230 >> [ 3.416647] ------------[ cut here ]------------ >> [ 3.417310] WARNING: CPU: 0 PID: 95 at >> /home/topi/d/linux.git/kernel/locking/lockdep.c:2871 >> lockdep_trace_alloc+0xb4/0xc0 >> [ 3.417605] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags)) >> [ 3.417923] Modules linked in: >> [ 3.418288] CPU: 0 PID: 95 Comm: systemd-udevd Not tainted 4.7.0-rc5+ #97 >> [ 3.418444] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), >> BIOS Debian-1.8.2-1 04/01/2014 >> [ 3.418726] 0000000000000086 000000007970f3b0 ffff88000016fb00 >> ffffffff813c9c45 >> [ 3.418993] ffff88000016fb50 0000000000000000 ffff88000016fb40 >> ffffffff81091e9b >> [ 3.419176] 00000b3705e2c798 0000000000000046 0000000000000410 >> 00000000ffffffff >> [ 3.419374] Call Trace: >> [ 3.419511] [] dump_stack+0x67/0x92 >> [ 3.419644] [] __warn+0xcb/0xf0 >> [ 3.419745] [] warn_slowpath_fmt+0x5f/0x80 >> [ 3.419868] [] lockdep_trace_alloc+0xb4/0xc0 >> [ 3.419988] [] kmem_cache_alloc_node+0x42/0x600 >> [ 3.420156] [] ? debug_lockdep_rcu_enabled+0x1d/0x20 >> [ 3.420170] [] __alloc_skb+0x5b/0x1d0 >> [ 3.420170] [] audit_log_start+0x29b/0x480 >> [ 3.420170] [] ? __lock_task_sighand+0x95/0x270 >> [ 3.420170] [] audit_log_cap_use+0x39/0xf0 >> [ 3.420170] [] ns_capable+0x45/0x70 >> [ 3.420170] [] capable+0x17/0x20 >> [ 3.420170] [] oom_score_adj_write+0x150/0x2f0 >> [ 3.420170] [] __vfs_write+0x37/0x160 >> [ 3.420170] [] ? update_fast_ctr+0x17/0x30 >> [ 3.420170] [] ? percpu_down_read+0x49/0x90 >> [ 3.420170] [] ? __sb_start_write+0xb7/0xf0 >> [ 3.420170] [] ? __sb_start_write+0xb7/0xf0 >> [ 3.420170] [] vfs_write+0xb8/0x1b0 >> [ 3.420170] [] ? __fget_light+0x66/0x90 >> [ 3.420170] [] SyS_write+0x58/0xc0 >> [ 3.420170] [] do_syscall_64+0x5c/0x300 >> [ 3.420170] [] entry_SYSCALL64_slow_path+0x25/0x25 >> [ 3.420170] ---[ end trace fb586899fb556a5e ]--- >> [ 3.447922] random: systemd-udevd urandom read with 3 bits of entropy >> available >> [ 4.014078] clocksource: Switched to clocksource tsc >> Begin: Loading essential drivers ... done. >> >> This is with qemu and the boot continues normally. With real computer, >> there's no such output and system just seems to freeze. >> >> Could it be possible that the deadlock happens because there's some IO >> towards /sys/fs/cgroup, which causes a capability check and that in turn >> causes locking problems when we try to print cgroup list? > > The above warning is printed by the code from > kernel/locking/lockdep.c:2871 > > static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags) > { > [...] > /* We're only interested __GFP_FS allocations for now */ > if (!(gfp_mask & __GFP_FS)) > return; > > /* > * Oi! Can't be having __GFP_FS allocations with IRQs disabled. > */ > if (DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))) > return; > > > The backtrace shows that your new audit_log_cap_use() is called > from vfs_write(). You might try to use audit_log_start() with > GFP_NOFS instead of GFP_KERNEL. > > Note that this is rather intuitive advice. I still need to learn a lot > about memory management and kernel in general to be more sure about > a correct solution. > > Best Regards, > Petr > With the attached patch, the system boots without deadlock. Locking problems still remain: [ 3.652221] ====================================================== [ 3.652221] [ INFO: possible circular locking dependency detected ] [ 3.652221] 4.7.0-rc5+ #101 Not tainted [ 3.652221] ------------------------------------------------------- [ 3.652221] systemd/1 is trying to acquire lock: [ 3.652221] (tasklist_lock){.+.+..}, at: [] cgroup_mount+0x7ed/0xbc0 [ 3.652221] but task is already holding lock: [ 3.652221] (css_set_lock){......}, at: [] cgroup_mount+0x669/0xbc0 [ 3.652221] which lock already depends on the new lock. [ 3.652221] the existing dependency chain (in reverse order) is: [ 3.652221] -> #3 (css_set_lock){......}: [ 3.652221] [] lock_acquire+0xe3/0x1c0 [ 3.652221] [] _raw_spin_lock_irq+0x37/0x50 [ 3.652221] [] cgroup_setup_root+0x19e/0x2d0 [ 3.652221] [] cgroup_init+0xec/0x41d [ 3.652221] [] start_kernel+0x40c/0x465 [ 3.652221] [] x86_64_start_reservations+0x2f/0x31 [ 3.652221] [] x86_64_start_kernel+0x178/0x18b [ 3.652221] -> #2 (cgroup_mutex){+.+...}: [ 3.652221] [] lock_acquire+0xe3/0x1c0 [ 3.652221] [] mutex_lock_nested+0x5f/0x350 [ 3.652221] [] audit_cgroup_list+0x4a/0x2f0 [ 3.652221] [] audit_log_cap_use+0xd9/0xf0 [ 3.652221] [] ns_capable+0x45/0x70 [ 3.652221] [] capable+0x17/0x20 [ 3.652221] [] oom_score_adj_write+0x150/0x2f0 [ 3.652221] [] __vfs_write+0x37/0x160 [ 3.652221] [] vfs_write+0xb8/0x1b0 [ 3.652221] [] SyS_write+0x58/0xc0 [ 3.652221] [] do_syscall_64+0x5c/0x300 [ 3.652221] [] return_from_SYSCALL_64+0x0/0x7a [ 3.652221] -> #1 (&(&sighand->siglock)->rlock){+.+...}: [ 3.652221] [] lock_acquire+0xe3/0x1c0 [ 3.652221] [] _raw_spin_lock+0x31/0x40 [ 3.652221] [] copy_process.part.34+0x10f9/0x1b40 [ 3.652221] [] _do_fork+0xf3/0x6b0 [ 3.652221] [] kernel_thread+0x29/0x30 [ 3.652221] [] kthreadd+0x187/0x1e0 [ 3.652221] [] ret_from_fork+0x1f/0x40 [ 3.652221] -> #0 (tasklist_lock){.+.+..}: [ 3.652221] [] __lock_acquire+0x13cb/0x1440 [ 3.652221] [] lock_acquire+0xe3/0x1c0 [ 3.652221] [] _raw_read_lock+0x34/0x50 [ 3.652221] [] cgroup_mount+0x7ed/0xbc0 [ 3.652221] [] mount_fs+0x38/0x170 [ 3.652221] [] vfs_kern_mount+0x6b/0x150 [ 3.652221] [] do_mount+0x24c/0xe30 [ 3.652221] [] SyS_mount+0x95/0xe0 [ 3.652221] [] entry_SYSCALL_64_fastpath+0x18/0xa8 [ 3.652221] other info that might help us debug this: [ 3.652221] Chain exists of: tasklist_lock --> cgroup_mutex --> css_set_lock [ 3.652221] Possible unsafe locking scenario: [ 3.652221] CPU0 CPU1 [ 3.652221] ---- ---- [ 3.652221] lock(css_set_lock); [ 3.652221] lock(cgroup_mutex); [ 3.652221] lock(css_set_lock); [ 3.652221] lock(tasklist_lock); [ 3.652221] *** DEADLOCK *** [ 3.652221] 1 lock held by systemd/1: [ 3.652221] #0: (css_set_lock){......}, at: [] cgroup_mount+0x669/0xbc0 [ 3.652221] stack backtrace: [ 3.652221] CPU: 0 PID: 1 Comm: systemd Not tainted 4.7.0-rc5+ #101 [ 3.652221] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014 [ 3.652221] 0000000000000086 0000000024b7c1ed ffff880006d13bb0 ffffffff813c9bf5 [ 3.652221] ffffffff829dbd20 ffffffff829cf2a0 ffff880006d13bf0 ffffffff810e60a3 [ 3.652221] ffff880006d13c30 ffff880006d067b0 ffff880006d06040 0000000000000001 [ 3.652221] Call Trace: [ 3.652221] [] dump_stack+0x67/0x92 [ 3.652221] [] print_circular_bug+0x1e3/0x250 [ 3.652221] [] __lock_acquire+0x13cb/0x1440 [ 3.652221] [] ? __kmalloc_track_caller+0x2bd/0x630 [ 3.652221] [] lock_acquire+0xe3/0x1c0 [ 3.652221] [] ? cgroup_mount+0x7ed/0xbc0 [ 3.652221] [] _raw_read_lock+0x34/0x50 [ 3.652221] [] ? cgroup_mount+0x7ed/0xbc0 [ 3.652221] [] cgroup_mount+0x7ed/0xbc0 [ 3.652221] [] ? lockdep_init_map+0x57/0x1f0 [ 3.652221] [] mount_fs+0x38/0x170 [ 3.652221] [] vfs_kern_mount+0x6b/0x150 [ 3.652221] [] do_mount+0x24c/0xe30 [ 3.652221] [] ? kmem_cache_alloc_trace+0x28b/0x5e0 [ 3.652221] [] ? strndup_user+0x46/0x80 [ 3.652221] [] SyS_mount+0x95/0xe0 [ 3.652221] [] entry_SYSCALL_64_fastpath+0x18/0xa8 Rate limiting would not be a bad idea, there were 329 audit log entries about capability use. -Topi From b857ec7d436f6fbb8c33e8d6a16d2f16175517d8 Mon Sep 17 00:00:00 2001 From: Topi Miettinen Date: Sun, 10 Jul 2016 11:45:30 +0300 Subject: [PATCH] cgroup: check options and capabilities before locking to avoid a deadlock Signed-off-by: Topi Miettinen --- kernel/cgroup.c | 30 +++++++++++++++--------------- 1 file changed, 15 insertions(+), 15 deletions(-) diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 1931679..1190559 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -2100,6 +2100,21 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type, return ERR_PTR(-EPERM); } + /* First find the desired set of subsystems */ + ret = parse_cgroupfs_options(data, &opts); + if (ret) + goto out_free; + + /* + * We know this subsystem has not yet been bound. Users in a non-init + * user namespace may only mount hierarchies with no bound subsystems, + * i.e. 'none,name=user1' + */ + if (!opts.none && !capable(CAP_SYS_ADMIN)) { + ret = -EPERM; + goto out_free; + } + /* * The first time anyone tries to mount a cgroup, enable the list * linking each css_set to its tasks and fix up all existing tasks. @@ -2121,11 +2136,6 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type, cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp); - /* First find the desired set of subsystems */ - ret = parse_cgroupfs_options(data, &opts); - if (ret) - goto out_unlock; - /* * Destruction of cgroup root is asynchronous, so subsystems may * still be dying after the previous unmount. Let's drain the @@ -2216,16 +2226,6 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type, goto out_unlock; } - /* - * We know this subsystem has not yet been bound. Users in a non-init - * user namespace may only mount hierarchies with no bound subsystems, - * i.e. 'none,name=user1' - */ - if (!opts.none && !capable(CAP_SYS_ADMIN)) { - ret = -EPERM; - goto out_unlock; - } - root = kzalloc(sizeof(*root), GFP_KERNEL); if (!root) { ret = -ENOMEM; -- 2.8.1