From patchwork Mon Oct 3 15:44:57 2022
From: Waiman Long <longman@redhat.com>
To: Tejun Heo, Jens Axboe
Cc: cgroups@vger.kernel.org, linux-block@vger.kernel.org,
    linux-kernel@vger.kernel.org, Ming Lei, Andy Shevchenko,
    Andrew Morton, Huang Ying, Michal Koutný, Waiman Long
Subject: [PATCH v7 1/3] llist: Add a lock-less list variant terminated by a sentinel node
Date: Mon, 3 Oct 2022 11:44:57 -0400
Message-Id: <20221003154459.207538-2-longman@redhat.com>
In-Reply-To: <20221003154459.207538-1-longman@redhat.com>
References: <20221003154459.207538-1-longman@redhat.com>

The lock-less list API is useful for dealing with lists in a lock-less
manner. However, one drawback of the current API is that there is no
easy way to determine whether an entry has already been put on a
lock-less list. This has to be tracked externally, and the tracking
will not be atomic unless some external synchronization logic is in
place.

This patch introduces a new variant of the lock-less list that is
terminated by a sentinel node instead of by NULL. The function names
start with "sllist" instead of "llist".

The advantage of this scheme is that we can atomically determine
whether an entry has been put on a lock-less list by looking at the
next pointer of its llist_node. Of course, callers must clear the next
pointer when an entry is removed from the lock-less list. This is done
automatically when the sllist_for_each_safe or sllist_for_each_entry_safe
iterators are used. Non-safe versions of the sllist iterators are not
provided.
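For illustration, a minimal usage sketch of the new API (struct item,
queue_item() and drain_items() are made-up names for this example, not
part of the patch; note that the check-then-add on the update side must
be serialized per entry by the caller, as the per-cpu context does in
patch 3 of this series):

	struct item {
		struct llist_node lnode;	/* NULL when not queued, e.g. from kzalloc() */
		int data;
	};

	static SLLIST_HEAD(pending);		/* first points at &__llist_end */

	/* Producer: queue an item only if it is not already queued. */
	static void queue_item(struct item *it)
	{
		if (!READ_ONCE(it->lnode.next))	/* non-NULL => already queued */
			sllist_add(&it->lnode, &pending);
	}

	/*
	 * Consumer: take the whole list; the _safe iterator resets each
	 * lnode.next back to NULL so the entry can be queued again later.
	 */
	static void drain_items(void)
	{
		struct item *it, *next;
		struct llist_node *first = sllist_del_all(&pending);

		if (!first)
			return;
		sllist_for_each_entry_safe(it, next, first, lnode) {
			/* process 'it'; it->lnode.next is already NULL here */
		}
	}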
Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/llist.h | 132 +++++++++++++++++++++++++++++++++++++++++-
 lib/llist.c           |  79 ++++++++++++++++++++++++-
 2 files changed, 209 insertions(+), 2 deletions(-)

diff --git a/include/linux/llist.h b/include/linux/llist.h
index 85bda2d02d65..d9a921656adb 100644
--- a/include/linux/llist.h
+++ b/include/linux/llist.h
@@ -2,7 +2,8 @@
 #ifndef LLIST_H
 #define LLIST_H
 /*
- * Lock-less NULL terminated single linked list
+ * Lock-less NULL terminated singly linked list
+ * --------------------------------------------
  *
  * Cases where locking is not needed:
  * If there are multiple producers and multiple consumers, llist_add can be
@@ -46,6 +47,16 @@
  *
  * Copyright 2010,2011 Intel Corp.
  *   Author: Huang Ying
+ *
+ * Lock-less sentinel node terminated singly linked list
+ * -----------------------------------------------------
+ *
+ * This is a variant of the generic lock-less list where the end of the list
+ * is terminated by a sentinel node instead of NULL. The advantage of this
+ * scheme is that we can use the next pointer of the llist_node to determine
+ * if it has been put into a lock-less list. However, the next pointer of
+ * the llist_node should be cleared ASAP after it has been removed from a
+ * lock-less list.
  */

 #include <linux/atomic.h>
@@ -64,6 +75,13 @@ struct llist_node {
 #define LLIST_HEAD_INIT(name)	{ NULL }
 #define LLIST_HEAD(name)	struct llist_head name = LLIST_HEAD_INIT(name)

+/*
+ * Sentinel lock-less list
+ */
+extern struct llist_node __llist_end;
+#define SLLIST_HEAD_INIT(name)	{ &__llist_end }
+#define SLLIST_HEAD(name)	struct llist_head name = SLLIST_HEAD_INIT(name)
+
 /**
  * init_llist_head - initialize lock-less list head
  * @head:	the head for your lock-less list
@@ -73,6 +91,15 @@ static inline void init_llist_head(struct llist_head *list)
 	list->first = NULL;
 }

+/**
+ * init_sllist_head - initialize sentinel lock-less list head
+ * @head:	the head for your sentinel lock-less list
+ */
+static inline void init_sllist_head(struct llist_head *list)
+{
+	list->first = &__llist_end;
+}
+
 /**
  * llist_entry - get the struct of this entry
  * @ptr:	the &struct llist_node pointer.
@@ -99,6 +126,15 @@ static inline void init_llist_head(struct llist_head *list)
 #define member_address_is_nonnull(ptr, member)	\
 	((uintptr_t)(ptr) + offsetof(typeof(*(ptr)), member) != 0)

+/**
+ * member_address_is_notsentinel - check whether member address is not sentinel
+ * @ptr:	the object pointer (struct type * that contains the llist_node)
+ * @member:	the name of the llist_node within the struct.
+ */
+#define member_address_is_notsentinel(ptr, member)	\
+	((uintptr_t)(ptr) + offsetof(typeof(*(ptr)), member) \
+			!= (uintptr_t)&__llist_end)
+
 /**
  * llist_for_each - iterate over some deleted entries of a lock-less list
  * @pos:	the &struct llist_node to use as a loop cursor
  * @node:	the first entry of deleted list entries
@@ -135,6 +171,18 @@ static inline void init_llist_head(struct llist_head *list)
 #define llist_for_each_safe(pos, n, node)			\
 	for ((pos) = (node); (pos) && ((n) = (pos)->next, true); (pos) = (n))

+/**
+ * sllist_for_each_safe - iterate over entries of a sentinel lock-less list
+ *			  safe against removal of list entry
+ * @pos:	the &struct llist_node to use as a loop cursor
+ * @n:		another &struct llist_node to use as temporary storage
+ * @node:	the first entry of deleted list entries
+ */
+#define sllist_for_each_safe(pos, n, node)			\
+	for ((pos) = (node); ((pos) != &__llist_end) &&		\
+	     ((n) = (pos)->next,				\
+	      ({ WRITE_ONCE((pos)->next, NULL); }), true); (pos) = (n))
+
 /**
  * llist_for_each_entry - iterate over some deleted entries of lock-less list of given type
  * @pos:	the type * to use as a loop cursor.
@@ -178,6 +226,21 @@ static inline void init_llist_head(struct llist_head *list)
 	     (n = llist_entry(pos->member.next, typeof(*n), member), true); \
 	     pos = n)

+/**
+ * sllist_for_each_entry_safe - iterate over sentinel entries of lock-less list
+ *				of given type safe against removal of list entry
+ * @pos:	the type * to use as a loop cursor.
+ * @n:		another type * to use as temporary storage
+ * @node:	the first entry of deleted list entries.
+ * @member:	the name of the llist_node within the struct.
+ */
+#define sllist_for_each_entry_safe(pos, n, node, member)	\
+	for (pos = llist_entry((node), typeof(*(pos)), member);	\
+	     member_address_is_notsentinel(pos, member) &&	\
+	     (n = llist_entry((pos)->member.next, typeof(*(n)), member), \
+	      ({ WRITE_ONCE((pos)->member.next, NULL); }), true);	\
+	     pos = n)
+
 /**
  * llist_empty - tests whether a lock-less list is empty
  * @head:	the list to test
@@ -191,15 +254,35 @@ static inline bool llist_empty(const struct llist_head *head)
 	return READ_ONCE(head->first) == NULL;
 }

+/**
+ * sllist_empty - tests whether a lock-less list is empty
+ * @head:	the list to test
+ */
+static inline bool sllist_empty(const struct llist_head *head)
+{
+	return READ_ONCE(head->first) == &__llist_end;
+}
+
 static inline struct llist_node *llist_next(struct llist_node *node)
 {
 	return node->next;
 }

+static inline struct llist_node *sllist_next(struct llist_node *node)
+{
+	struct llist_node *next = node->next;
+
+	return (next == &__llist_end) ? NULL : next;
+}
+
 extern bool llist_add_batch(struct llist_node *new_first,
 			    struct llist_node *new_last,
 			    struct llist_head *head);

+extern bool sllist_add_batch(struct llist_node *new_first,
+			     struct llist_node *new_last,
+			     struct llist_head *head);
+
 static inline bool __llist_add_batch(struct llist_node *new_first,
 				     struct llist_node *new_last,
 				     struct llist_head *head)
@@ -209,6 +292,15 @@ static inline bool __llist_add_batch(struct llist_node *new_first,
 	return new_last->next == NULL;
 }

+static inline bool __sllist_add_batch(struct llist_node *new_first,
+				      struct llist_node *new_last,
+				      struct llist_head *head)
+{
+	new_last->next = head->first;
+	head->first = new_first;
+	return new_last->next == &__llist_end;
+}
+
 /**
  * llist_add - add a new entry
  * @new:	new entry to be added
 * @head:	the head for your lock-less list
 *
 * Returns true if the list was empty prior to adding this entry.
 */
@@ -221,11 +313,28 @@ static inline bool llist_add(struct llist_node *new, struct llist_head *head)
 {
 	return llist_add_batch(new, new, head);
 }

+/**
+ * sllist_add - add a new entry
+ * @new:	new entry to be added
+ * @head:	the head for your lock-less list
+ *
+ * Returns true if the list was empty prior to adding this entry.
+ */
+static inline bool sllist_add(struct llist_node *new, struct llist_head *head)
+{
+	return sllist_add_batch(new, new, head);
+}
+
 static inline bool __llist_add(struct llist_node *new, struct llist_head *head)
 {
 	return __llist_add_batch(new, new, head);
 }

+static inline bool __sllist_add(struct llist_node *new, struct llist_head *head)
+{
+	return __sllist_add_batch(new, new, head);
+}
+
 /**
  * llist_del_all - delete all entries from lock-less list
  * @head:	the head of lock-less list to delete all entries
@@ -239,6 +348,17 @@ static inline struct llist_node *llist_del_all(struct llist_head *head)
 	return xchg(&head->first, NULL);
 }

+/**
+ * sllist_del_all - delete all entries from sentinel lock-less list
+ * @head:	the head of lock-less list to delete all entries
+ */
+static inline struct llist_node *sllist_del_all(struct llist_head *head)
+{
+	struct llist_node *first = xchg(&head->first, &__llist_end);
+
+	return (first == &__llist_end) ? NULL : first;
+}
+
 static inline struct llist_node *__llist_del_all(struct llist_head *head)
 {
 	struct llist_node *first = head->first;
@@ -247,8 +367,18 @@ static inline struct llist_node *__llist_del_all(struct llist_head *head)
 	return first;
 }

+static inline struct llist_node *__sllist_del_all(struct llist_head *head)
+{
+	struct llist_node *first = head->first;
+
+	head->first = &__llist_end;
+	return (first == &__llist_end) ? NULL : first;
+}
+
 extern struct llist_node *llist_del_first(struct llist_head *head);
+extern struct llist_node *sllist_del_first(struct llist_head *head);

 struct llist_node *llist_reverse_order(struct llist_node *head);
+struct llist_node *sllist_reverse_order(struct llist_node *head);

 #endif /* LLIST_H */
diff --git a/lib/llist.c b/lib/llist.c
index 611ce4881a87..418327394735 100644
--- a/lib/llist.c
+++ b/lib/llist.c
@@ -1,6 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0-only
 /*
- * Lock-less NULL terminated single linked list
+ * Lock-less NULL and sentinel node terminated singly linked lists
 *
 * The basic atomic operation of this list is cmpxchg on long.  On
 * architectures that don't have NMI-safe cmpxchg implementation, the
@@ -12,8 +12,11 @@
  */

 #include <linux/kernel.h>
 #include <linux/export.h>
+#include <linux/cache.h>
 #include <linux/llist.h>

+struct llist_node __llist_end __ro_after_init;
+EXPORT_SYMBOL_GPL(__llist_end);

 /**
  * llist_add_batch - add several linked entries in batch
@@ -36,6 +39,27 @@ bool llist_add_batch(struct llist_node *new_first, struct llist_node *new_last,
 }
 EXPORT_SYMBOL_GPL(llist_add_batch);

+/**
+ * sllist_add_batch - add several linked entries in batch
+ * @new_first:	first entry in batch to be added
+ * @new_last:	last entry in batch to be added
+ * @head:	the head for your lock-less list
+ *
+ * Return whether list is empty before adding.
+ */
+bool sllist_add_batch(struct llist_node *new_first, struct llist_node *new_last,
+		      struct llist_head *head)
+{
+	struct llist_node *first;
+
+	do {
+		new_last->next = first = READ_ONCE(head->first);
+	} while (cmpxchg(&head->first, first, new_first) != first);
+
+	return first == &__llist_end;
+}
+EXPORT_SYMBOL_GPL(sllist_add_batch);
+
 /**
  * llist_del_first - delete the first entry of lock-less list
  * @head:	the head for your lock-less list
@@ -69,6 +93,33 @@ struct llist_node *llist_del_first(struct llist_head *head)
 }
 EXPORT_SYMBOL_GPL(llist_del_first);

+/**
+ * sllist_del_first - delete the first entry of sentinel lock-less list
+ * @head:	the head for your lock-less list
+ *
+ * If list is empty, return NULL, otherwise, return the first entry
+ * deleted, this is the newest added one.
+ */
+struct llist_node *sllist_del_first(struct llist_head *head)
+{
+	struct llist_node *entry, *old_entry, *next;
+
+	entry = smp_load_acquire(&head->first);
+	for (;;) {
+		if (entry == &__llist_end)
+			return NULL;
+		old_entry = entry;
+		next = READ_ONCE(entry->next);
+		entry = cmpxchg(&head->first, old_entry, next);
+		if (entry == old_entry)
+			break;
+	}
+	WRITE_ONCE(entry->next, NULL);
+
+	return entry;
+}
+EXPORT_SYMBOL_GPL(sllist_del_first);
+
 /**
  * llist_reverse_order - reverse order of a llist chain
  * @head:	first item of the list to be reversed
@@ -90,3 +141,29 @@ struct llist_node *llist_reverse_order(struct llist_node *head)
 	return new_head;
 }
 EXPORT_SYMBOL_GPL(llist_reverse_order);
+
+/**
+ * sllist_reverse_order - reverse order of a llist chain
+ * @head:	first item of the list to be reversed
+ *
+ * Reverse the order of a chain of llist entries and return the
+ * new first entry.
+ */
+struct llist_node *sllist_reverse_order(struct llist_node *head)
+{
+	struct llist_node *new_head = &__llist_end;
+
+	if (!head || (head == &__llist_end))
+		return NULL;
+
+	while (head != &__llist_end) {
+		struct llist_node *tmp = head;
+
+		head = head->next;
+		tmp->next = new_head;
+		new_head = tmp;
+	}
+
+	return new_head;
+}
+EXPORT_SYMBOL_GPL(sllist_reverse_order);

From patchwork Mon Oct 3 15:44:58 2022
From: Waiman Long <longman@redhat.com>
To: Tejun Heo, Jens Axboe
Cc: cgroups@vger.kernel.org, linux-block@vger.kernel.org,
    linux-kernel@vger.kernel.org, Ming Lei, Andy Shevchenko,
    Andrew Morton, Huang Ying, Michal Koutný, Waiman Long
Subject: [PATCH v7 2/3] blk-cgroup: Return -ENOMEM directly in blkcg_css_alloc() error path
Date: Mon, 3 Oct 2022 11:44:58 -0400
Message-Id: <20221003154459.207538-3-longman@redhat.com>
In-Reply-To: <20221003154459.207538-1-longman@redhat.com>
References: <20221003154459.207538-1-longman@redhat.com>

For blkcg_css_alloc(), the only error that can be returned is -ENOMEM.
Simplify the error handling code by returning this error directly
instead of setting an intermediate "ret" variable.
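In other words, the per-site error assignments collapse into a single
exit path. A minimal standalone sketch of the resulting pattern
(hypothetical struct foo and foo_alloc(), not the blkcg code):

	struct foo {
		void *buf;
	};

	/*
	 * When -ENOMEM is the only possible failure, each failure site
	 * can jump straight to a label and the error pointer is formed
	 * exactly once at the end.
	 */
	static struct foo *foo_alloc(void)
	{
		struct foo *foo = kzalloc(sizeof(*foo), GFP_KERNEL);

		if (!foo)
			goto fail;

		foo->buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
		if (!foo->buf)
			goto free_foo;

		return foo;

	free_foo:
		kfree(foo);
	fail:
		return ERR_PTR(-ENOMEM);	/* single error exit */
	}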
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Ming Lei
Acked-by: Tejun Heo
---
 block/blk-cgroup.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 869af9d72bcf..946592249795 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1177,7 +1177,6 @@ static struct cgroup_subsys_state *
 blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 {
 	struct blkcg *blkcg;
-	struct cgroup_subsys_state *ret;
 	int i;

 	mutex_lock(&blkcg_pol_mutex);
@@ -1186,10 +1185,8 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 		blkcg = &blkcg_root;
 	} else {
 		blkcg = kzalloc(sizeof(*blkcg), GFP_KERNEL);
-		if (!blkcg) {
-			ret = ERR_PTR(-ENOMEM);
+		if (!blkcg)
 			goto unlock;
-		}
 	}

 	for (i = 0; i < BLKCG_MAX_POLS ; i++) {
@@ -1206,10 +1203,9 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 			continue;

 		cpd = pol->cpd_alloc_fn(GFP_KERNEL);
-		if (!cpd) {
-			ret = ERR_PTR(-ENOMEM);
+		if (!cpd)
 			goto free_pd_blkcg;
-		}
+
 		blkcg->cpd[i] = cpd;
 		cpd->blkcg = blkcg;
 		cpd->plid = i;
@@ -1238,7 +1234,7 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 		kfree(blkcg);
 unlock:
 	mutex_unlock(&blkcg_pol_mutex);
-	return ret;
+	return ERR_PTR(-ENOMEM);
 }

 static int blkcg_css_online(struct cgroup_subsys_state *css)

From patchwork Mon Oct 3 15:44:59 2022
From: Waiman Long <longman@redhat.com>
To: Tejun Heo, Jens Axboe
Cc: cgroups@vger.kernel.org, linux-block@vger.kernel.org,
    linux-kernel@vger.kernel.org, Ming Lei, Andy Shevchenko,
    Andrew Morton, Huang Ying, Michal Koutný, Waiman Long
Subject: [PATCH v7 3/3] blk-cgroup: Optimize blkcg_rstat_flush()
Date: Mon, 3 Oct 2022 11:44:59 -0400
Message-Id: <20221003154459.207538-4-longman@redhat.com>
In-Reply-To: <20221003154459.207538-1-longman@redhat.com>
References: <20221003154459.207538-1-longman@redhat.com>

For a system with many CPUs and block devices, the time to do
blkcg_rstat_flush() from cgroup_rstat_flush() can be rather long. It
can be especially problematic as interrupts are disabled during the
flush. It was reported that it might take seconds to complete in some
extreme cases, leading to hard lockup messages.

As it is likely that not all the percpu blkg_iostat_set's have been
updated since the last flush, those stale blkg_iostat_set's don't need
to be flushed in this case. This patch optimizes blkcg_rstat_flush() by
keeping a lockless list of recently updated blkg_iostat_set's in a
newly added percpu blkcg->lhead pointer. A blkg_iostat_set is added to
the sentinel lockless list on the update side in blk_cgroup_bio_start()
and removed from the list when flushed in blkcg_rstat_flush(). Due to
racing, it is possible that blkg_iostat_set's in the lockless list may
have no new IO stats to be flushed, but that is OK.

To protect against destruction of blkg, a percpu reference is taken
when a blkg_iostat_set is put onto the lockless list and released when
it is removed. A blkg_iostat_set can determine whether it is on a
lockless list by checking the content of its lnode.next pointer, which
is non-NULL while on a sentinel lockless list.

When booting up an instrumented test kernel with this patch on a
2-socket 96-thread system with cgroup v2, 1788 of the 2051 calls to
cgroup_rstat_flush() after bootup returned immediately because of an
empty lockless list. After an all-CPU kernel build, the ratio became
6295424/6340513, more than 99%.
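Condensed from the diff below, the core of the protocol looks like this
(fragments adapted from the patch, not standalone C; the lnode.next
invariant is what ties the two sides together):

	/*
	 * Update side, in blk_cgroup_bio_start(). Runs under the per-cpu
	 * u64_stats sequence count, so the check-then-add below cannot
	 * race with itself on the same CPU.
	 */
	if (!READ_ONCE(bis->lnode.next)) {		/* not queued yet */
		sllist_add(&bis->lnode, this_cpu_ptr(blkcg->lhead));
		percpu_ref_get(&bis->blkg->refcnt);	/* pin blkg while queued */
	}

	/*
	 * Flush side, in blkcg_rstat_flush(). Take the whole per-cpu list,
	 * fold each queued entry and drop its blkg reference. The _safe
	 * iterator resets lnode.next to NULL, re-arming the update side.
	 */
	lnode = sllist_del_all(per_cpu_ptr(blkcg->lhead, cpu));
	if (lnode) {
		sllist_for_each_entry_safe(bisc, next_bisc, lnode, lnode) {
			/* ... merge bisc->cur into the blkg hierarchy ... */
			percpu_ref_put(&bisc->blkg->refcnt);
		}
	}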
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Tejun Heo
---
 block/blk-cgroup.c | 73 ++++++++++++++++++++++++++++++++++++++++++----
 block/blk-cgroup.h |  9 ++++++
 2 files changed, 76 insertions(+), 6 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 946592249795..f1f580c46190 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -59,6 +59,37 @@ static struct workqueue_struct *blkcg_punt_bio_wq;

 #define BLKG_DESTROY_BATCH_SIZE  64

+/*
+ * Lockless lists for tracking IO stats update
+ *
+ * New IO stats are stored in the percpu iostat_cpu within blkcg_gq (blkg).
+ * There are multiple blkg's (one for each block device) attached to each
+ * blkcg. The rstat code keeps track of which cpu has IO stats updated,
+ * but it doesn't know which blkg has the updated stats. If there are many
+ * block devices in a system, the cost of iterating all the blkg's to flush
+ * out the IO stats can be high. To reduce such overhead, a set of percpu
+ * lockless lists (lhead) per blkcg are used to track the set of recently
+ * updated iostat_cpu's since the last flush. An iostat_cpu will be put
+ * onto the lockless list on the update side [blk_cgroup_bio_start()] if
+ * not there yet and then removed when being flushed [blkcg_rstat_flush()].
+ * References to blkg are gotten and then put back in the process to
+ * protect against blkg removal.
+ *
+ * Return: 0 if successful or -ENOMEM if allocation fails.
+ */
+static int init_blkcg_sllists(struct blkcg *blkcg)
+{
+	int cpu;
+
+	blkcg->lhead = alloc_percpu_gfp(struct llist_head, GFP_KERNEL);
+	if (!blkcg->lhead)
+		return -ENOMEM;
+
+	for_each_possible_cpu(cpu)
+		init_sllist_head(per_cpu_ptr(blkcg->lhead, cpu));
+	return 0;
+}
+
 /**
  * blkcg_css - find the current css
@@ -236,8 +267,10 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 	blkg->blkcg = blkcg;

 	u64_stats_init(&blkg->iostat.sync);
-	for_each_possible_cpu(cpu)
+	for_each_possible_cpu(cpu) {
 		u64_stats_init(&per_cpu_ptr(blkg->iostat_cpu, cpu)->sync);
+		per_cpu_ptr(blkg->iostat_cpu, cpu)->blkg = blkg;
+	}

 	for (i = 0; i < BLKCG_MAX_POLS; i++) {
 		struct blkcg_policy *pol = blkcg_policy[i];
@@ -864,7 +897,9 @@ static void blkcg_iostat_update(struct blkcg_gq *blkg, struct blkg_iostat *cur,
 static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 {
 	struct blkcg *blkcg = css_to_blkcg(css);
-	struct blkcg_gq *blkg;
+	struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
+	struct llist_node *lnode;
+	struct blkg_iostat_set *bisc, *next_bisc;

 	/* Root-level stats are sourced from system-wide IO stats */
 	if (!cgroup_parent(css->cgroup))
@@ -872,9 +907,16 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)

 	rcu_read_lock();

-	hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
+	lnode = sllist_del_all(lhead);
+	if (!lnode)
+		goto out;
+
+	/*
+	 * Iterate only the iostat_cpu's queued in the lockless list.
+	 */
+	sllist_for_each_entry_safe(bisc, next_bisc, lnode, lnode) {
+		struct blkcg_gq *blkg = bisc->blkg;
 		struct blkcg_gq *parent = blkg->parent;
-		struct blkg_iostat_set *bisc = per_cpu_ptr(blkg->iostat_cpu, cpu);
 		struct blkg_iostat cur;
 		unsigned int seq;
@@ -890,8 +932,10 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 		if (parent && parent->parent)
 			blkcg_iostat_update(parent, &blkg->iostat.cur,
 					    &blkg->iostat.last);
+		percpu_ref_put(&blkg->refcnt);
 	}

+out:
 	rcu_read_unlock();
 }
@@ -1170,6 +1214,7 @@ static void blkcg_css_free(struct cgroup_subsys_state *css)

 	mutex_unlock(&blkcg_pol_mutex);

+	free_percpu(blkcg->lhead);
 	kfree(blkcg);
 }
@@ -1189,6 +1234,9 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 			goto unlock;
 	}

+	if (init_blkcg_sllists(blkcg))
+		goto free_blkcg;
+
 	for (i = 0; i < BLKCG_MAX_POLS ; i++) {
 		struct blkcg_policy *pol = blkcg_policy[i];
 		struct blkcg_policy_data *cpd;
@@ -1229,7 +1277,8 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
 	for (i--; i >= 0; i--)
 		if (blkcg->cpd[i])
 			blkcg_policy[i]->cpd_free_fn(blkcg->cpd[i]);
-
+	free_percpu(blkcg->lhead);
+free_blkcg:
 	if (blkcg != &blkcg_root)
 		kfree(blkcg);
 unlock:
@@ -1990,6 +2039,7 @@ static int blk_cgroup_io_type(struct bio *bio)

 void blk_cgroup_bio_start(struct bio *bio)
 {
+	struct blkcg *blkcg = bio->bi_blkg->blkcg;
 	int rwd = blk_cgroup_io_type(bio), cpu;
 	struct blkg_iostat_set *bis;
 	unsigned long flags;
@@ -2008,9 +2058,20 @@ void blk_cgroup_bio_start(struct bio *bio)
 	}
 	bis->cur.ios[rwd]++;

+	/*
+	 * If the iostat_cpu isn't in a lockless list, put it into the
+	 * list to indicate that a stat update is pending.
+	 */
+	if (!READ_ONCE(bis->lnode.next)) {
+		struct llist_head *lhead = this_cpu_ptr(blkcg->lhead);
+
+		sllist_add(&bis->lnode, lhead);
+		percpu_ref_get(&bis->blkg->refcnt);
+	}
+
 	u64_stats_update_end_irqrestore(&bis->sync, flags);
 	if (cgroup_subsys_on_dfl(io_cgrp_subsys))
-		cgroup_rstat_updated(bio->bi_blkg->blkcg->css.cgroup, cpu);
+		cgroup_rstat_updated(blkcg->css.cgroup, cpu);
 	put_cpu();
 }

diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index d2724d1dd7c9..0968b6c8ea12 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -18,6 +18,7 @@
 #include <linux/cgroup.h>
 #include <linux/kthread.h>
 #include <linux/blk-mq.h>
+#include <linux/llist.h>

 struct blkcg_gq;
 struct blkg_policy_data;
@@ -43,6 +44,8 @@ struct blkg_iostat {

 struct blkg_iostat_set {
 	struct u64_stats_sync		sync;
+	struct llist_node		lnode;
+	struct blkcg_gq			*blkg;
 	struct blkg_iostat		cur;
 	struct blkg_iostat		last;
 };
@@ -97,6 +100,12 @@ struct blkcg {
 	struct blkcg_policy_data	*cpd[BLKCG_MAX_POLS];

 	struct list_head		all_blkcgs_node;
+
+	/*
+	 * List of updated percpu blkg_iostat_set's since the last flush.
+	 */
+	struct llist_head __percpu	*lhead;
+
 #ifdef CONFIG_BLK_CGROUP_FC_APPID
 	char				fc_app_id[FC_APPID_LEN];
 #endif