From patchwork Fri Mar 24 21:56:27 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Tahsin Erdogan X-Patchwork-Id: 9644063 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id 1E8666020B for ; Fri, 24 Mar 2017 21:56:56 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 045C122B27 for ; Fri, 24 Mar 2017 21:56:56 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id E9B5622701; Fri, 24 Mar 2017 21:56:55 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 6007022701 for ; Fri, 24 Mar 2017 21:56:55 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751264AbdCXV4z (ORCPT ); Fri, 24 Mar 2017 17:56:55 -0400 Received: from mail-pf0-f172.google.com ([209.85.192.172]:33894 "EHLO mail-pf0-f172.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750903AbdCXV4y (ORCPT ); Fri, 24 Mar 2017 17:56:54 -0400 Received: by mail-pf0-f172.google.com with SMTP id p189so1202037pfp.1 for ; Fri, 24 Mar 2017 14:56:53 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=ktDKQd39Pwzai0KKfNF9w8M+iDUQ1V32HlH+affuDD8=; b=lsn0akwp+JvbbwefTkdo6gT6eOrDx2vHWSozXodji509CfjHlxxw8usIwoSuswDxWG d9Qx/4R0o5bdx0rp2B7ZBi+8HmVG3b86RyGKswa00YqEKbjTdhTAyNjny9lsK+fDejCl 0UGGnv2xwamsOo0gceFrJ3jqF+6adEke6GW5MuE9w1KxWvMiWul7HtuO08CvMxn/5iir 8/jXsmIutj7KmiLFytsoj/RRvJ6+XKz3fPF0j2TZrg71EYV/zLD/xhRmcsLyiy5gmFxm aaEDNuLXdgBa63z3TGcOMs7DBoybWq4zMYnL9rWm+u734Ikn3G9LRkF6hyc7idgZkGQo cpJg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=ktDKQd39Pwzai0KKfNF9w8M+iDUQ1V32HlH+affuDD8=; b=sJCGnaIzybPC6b5kkbLF0z7IX528SfxR14RO0W5v0ZAbHlpjhlfPVhKWGJiSgBuqDt e2KD29e0h1RUXnvRfTW/YyoQ9UHtRZs4DQC5IsHCpb9H/7W9u1IumRjZF269bx5edE/k UiwyPy+DznT+FTsbGffoT1u3FRZiqQ0Pca1iRirbWDkp33te8Nlmhtsbjb1XFIdP+Ss4 YoUInDGOCGidc9/S26+2DR5827vknyMgmtIrib6lAxHwBjx01t9NimJsQka8gmxCDKfJ tSZaCth2OA5Gxg+7oK0EIPCMv/A536SX3hyEPLk3lMsmI5PAq6BqePtHN+R2p3lP9dsA e3nw== X-Gm-Message-State: AFeK/H2WAvPu9NncnEFPTBGt40UqxXk5tlI+M2dFyN14Xk6GVKzWB0gAa4AUc2EYCsnd44e5 X-Received: by 10.99.149.6 with SMTP id p6mr11356931pgd.122.1490392612465; Fri, 24 Mar 2017 14:56:52 -0700 (PDT) Received: from tahsin1.mtv.corp.google.com ([100.99.140.90]) by smtp.gmail.com with ESMTPSA id s21sm6448074pgg.65.2017.03.24.14.56.51 (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Fri, 24 Mar 2017 14:56:51 -0700 (PDT) From: Tahsin Erdogan To: Tejun Heo , Jens Axboe Cc: linux-block@vger.kernel.org, David Rientjes , linux-kernel@vger.kernel.org, Tahsin Erdogan Subject: [PATCH] blkcg: allocate struct blkcg_gq outside request queue spinlock Date: Fri, 24 Mar 2017 14:56:27 -0700 Message-Id: <20170324215627.12831-1-tahsin@google.com> X-Mailer: git-send-email 2.12.1.578.ge9c3154ca4-goog In-Reply-To: References: Sender: linux-block-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-block@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP blkg_conf_prep() currently calls blkg_lookup_create() while holding request queue spinlock. This means allocating memory for struct blkcg_gq has to be made non-blocking. This causes occasional -ENOMEM failures in call paths like below: pcpu_alloc+0x68f/0x710 __alloc_percpu_gfp+0xd/0x10 __percpu_counter_init+0x55/0xc0 cfq_pd_alloc+0x3b2/0x4e0 blkg_alloc+0x187/0x230 blkg_create+0x489/0x670 blkg_lookup_create+0x9a/0x230 blkg_conf_prep+0x1fb/0x240 __cfqg_set_weight_device.isra.105+0x5c/0x180 cfq_set_weight_on_dfl+0x69/0xc0 cgroup_file_write+0x39/0x1c0 kernfs_fop_write+0x13f/0x1d0 __vfs_write+0x23/0x120 vfs_write+0xc2/0x1f0 SyS_write+0x44/0xb0 entry_SYSCALL_64_fastpath+0x18/0xad In the code path above, percpu allocator cannot call vmalloc() due to queue spinlock. A failure in this call path gives grief to tools which are trying to configure io weights. We see occasional failures happen shortly after reboots even when system is not under any memory pressure. Machines with a lot of cpus are more vulnerable to this condition. Do struct blkcg_gq allocations outside the queue spinlock to allow blocking during memory allocations. Suggested-by: Tejun Heo Signed-off-by: Tahsin Erdogan Acked-by: Tejun Heo --- v6: Due to Jens' objection to conditionally dropping locks based on gfp flags, go back to v1 approach. Perform queue bypass and policy enabled checks at every iteration. Add blkg_lookup_check() to reduce code duplication. v5: Removed stale blkg_alloc() in blkcg_init_queue() Pushed down radix_tree_preload() into blkg_create() because it disables preemption on return and makes it unsafe to call blocking memory allocations. v4: Simplified error checking in blkg_create() Factored out __blkg_lookup_create() v3: Pushed down all blkg allocations into blkg_create() v2: Moved blkg creation into blkg_lookup_create() to avoid duplicating blkg_lookup_create() logic. block/blk-cgroup.c | 123 ++++++++++++++++++++++++++++++++++++++++++----------- 1 file changed, 98 insertions(+), 25 deletions(-) diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index bbe7ee00bd3d..7c2947128f58 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -772,6 +772,27 @@ struct blkg_rwstat blkg_rwstat_recursive_sum(struct blkcg_gq *blkg, } EXPORT_SYMBOL_GPL(blkg_rwstat_recursive_sum); +/* Performs queue bypass and policy enabled checks then looks up blkg. */ +static struct blkcg_gq *blkg_lookup_check(struct blkcg *blkcg, + const struct blkcg_policy *pol, + struct request_queue *q) +{ + WARN_ON_ONCE(!rcu_read_lock_held()); + lockdep_assert_held(q->queue_lock); + + if (!blkcg_policy_enabled(q, pol)) + return ERR_PTR(-EOPNOTSUPP); + + /* + * This could be the first entry point of blkcg implementation and + * we shouldn't allow anything to go through for a bypassing queue. + */ + if (unlikely(blk_queue_bypass(q))) + return ERR_PTR(blk_queue_dying(q) ? -ENODEV : -EBUSY); + + return __blkg_lookup(blkcg, q, true /* update_hint */); +} + /** * blkg_conf_prep - parse and prepare for per-blkg config update * @blkcg: target block cgroup @@ -789,6 +810,7 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol, __acquires(rcu) __acquires(disk->queue->queue_lock) { struct gendisk *disk; + struct request_queue *q; struct blkcg_gq *blkg; struct module *owner; unsigned int major, minor; @@ -807,44 +829,95 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol, if (!disk) return -ENODEV; if (part) { - owner = disk->fops->owner; - put_disk(disk); - module_put(owner); - return -ENODEV; + ret = -ENODEV; + goto fail; } - rcu_read_lock(); - spin_lock_irq(disk->queue->queue_lock); + q = disk->queue; - if (blkcg_policy_enabled(disk->queue, pol)) - blkg = blkg_lookup_create(blkcg, disk->queue); - else - blkg = ERR_PTR(-EOPNOTSUPP); + rcu_read_lock(); + spin_lock_irq(q->queue_lock); + blkg = blkg_lookup_check(blkcg, pol, q); if (IS_ERR(blkg)) { ret = PTR_ERR(blkg); + goto fail_unlock; + } + + if (blkg) + goto success; + + /* + * Create blkgs walking down from blkcg_root to @blkcg, so that all + * non-root blkgs have access to their parents. + */ + while (true) { + struct blkcg *pos = blkcg; + struct blkcg *parent; + struct blkcg_gq *new_blkg; + + parent = blkcg_parent(blkcg); + while (parent && !__blkg_lookup(parent, q, false)) { + pos = parent; + parent = blkcg_parent(parent); + } + + /* Drop locks to do new blkg allocation with GFP_KERNEL. */ + spin_unlock_irq(q->queue_lock); rcu_read_unlock(); - spin_unlock_irq(disk->queue->queue_lock); - owner = disk->fops->owner; - put_disk(disk); - module_put(owner); - /* - * If queue was bypassing, we should retry. Do so after a - * short msleep(). It isn't strictly necessary but queue - * can be bypassing for some time and it's always nice to - * avoid busy looping. - */ - if (ret == -EBUSY) { - msleep(10); - ret = restart_syscall(); + + new_blkg = blkg_alloc(pos, q, GFP_KERNEL); + if (unlikely(!new_blkg)) { + ret = -ENOMEM; + goto fail; } - return ret; - } + rcu_read_lock(); + spin_lock_irq(q->queue_lock); + + blkg = blkg_lookup_check(pos, pol, q); + if (IS_ERR(blkg)) { + ret = PTR_ERR(blkg); + goto fail_unlock; + } + + if (blkg) { + blkg_free(new_blkg); + } else { + blkg = blkg_create(pos, q, new_blkg); + if (unlikely(IS_ERR(blkg))) { + ret = PTR_ERR(blkg); + goto fail_unlock; + } + } + + if (pos == blkcg) + goto success; + } +success: ctx->disk = disk; ctx->blkg = blkg; ctx->body = body; return 0; + +fail_unlock: + spin_unlock_irq(q->queue_lock); + rcu_read_unlock(); +fail: + owner = disk->fops->owner; + put_disk(disk); + module_put(owner); + /* + * If queue was bypassing, we should retry. Do so after a + * short msleep(). It isn't strictly necessary but queue + * can be bypassing for some time and it's always nice to + * avoid busy looping. + */ + if (ret == -EBUSY) { + msleep(10); + ret = restart_syscall(); + } + return ret; } EXPORT_SYMBOL_GPL(blkg_conf_prep);