From: Barry Song
Subject: [RFC PATCH v4 0/3] scheduler: expose the topology of clusters and add cluster scheduler
Date: Tue, 2 Mar 2021 11:59:37 +1300
Message-ID: <20210301225940.16728-1-song.bao.hua@hisilicon.com>
X-Patchwork-Submitter: "Song Bao Hua (Barry Song)"
X-Patchwork-Id: 12110745
Cc: juri.lelli@redhat.com, mark.rutland@arm.com, aubrey.li@linux.intel.com,
    hpa@zytor.com, prime.zeng@hisilicon.com, guodong.xu@linaro.org,
    gregkh@linuxfoundation.org, sudeep.holla@arm.com,
    linux-kernel@vger.kernel.org, linuxarm@openeuler.org,
    linux-acpi@vger.kernel.org, xuwei5@huawei.com,
    jonathan.cameron@huawei.com, yangyicong@huawei.com, x86@kernel.org,
    msys.mizuma@gmail.com, liguozhu@hisilicon.com,
    valentin.schneider@arm.com, linux-arm-kernel@lists.infradead.org,
    Barry Song

The ARM64 server chip Kunpeng 920 has 6 or 8 clusters in each NUMA node,
and each cluster has 4 CPUs. All clusters share the L3 cache data, while
each cluster has its own local L3 tag. In addition, the CPUs of each
cluster share some internal system bus. This means cache affinity is much
stronger inside one cluster than across clusters.

+-----------------------------------+                           +---------+
|  +------+    +------+             +---------------------------+         |
|  | CPU0 |    | cpu1 |             |    +-----------+          |         |
|  +------+    +------+             |    |           |          |         |
|                                   +----+    L3     |          |         |
|  +------+    +------+   cluster   |    |    tag    |          |         |
|  | CPU2 |    | CPU3 |             |    |           |          |         |
|  +------+    +------+             |    +-----------+          |         |
|                                   |                           |         |
+-----------------------------------+                           |         |
+-----------------------------------+                           |         |
|  +------+    +------+             +---------------------------+         |
|  |      |    |      |             |    +-----------+          |         |
|  +------+    +------+             |    |           |          |         |
|                                   |    |    L3     |          |         |
|  +------+    +------+             +----+    tag    |          |         |
|  |      |    |      |             |    |           |          |         |
|  +------+    +------+             |    +-----------+          |         |
|                                   |                           |         |
+-----------------------------------+                           |   L3    |
                                                                |  data   |
                ...                                                 ...

There is a similar need for clustering on x86. Some x86 cores can share
an L2 cache, which is similar to the cluster in Kunpeng 920 (e.g. on
Jacobsville there are 6 clusters of 4 Atom cores, each cluster sharing a
separate L2, and all 24 cores sharing the L3).

Having a sched_domain for clusters brings two improvements:

1. spreading unrelated tasks among clusters, which decreases resource
   contention and improves throughput.
   Without the cluster sched_domain, unrelated tasks might be placed
   randomly:

   +-------------------+     +-----------------+
   | +----+   +----+   |     |                 |
   | |task|   |task|   |     |                 |
   | |1   |   |2   |   |     |                 |
   | +----+   +----+   |     |                 |
   |                   |     |                 |
   |       cluster1    |     |     cluster2    |
   +-------------------+     +-----------------+

   With the cluster sched_domain, they are likely to be spread across
   clusters by load balancing (LB):

   +-------------------+     +-----------------+
   | +----+            |     | +----+          |
   | |task|            |     | |task|          |
   | |1   |            |     | |2   |          |
   | +----+            |     | +----+          |
   |                   |     |                 |
   |       cluster1    |     |     cluster2    |
   +-------------------+     +-----------------+

2. gathering related tasks within a cluster, which improves the cache
   affinity of tasks that communicate with each other.

   Without the cluster sched_domain, related tasks might be placed
   randomly. Suppose task1-8 are related as follows:

   Task1 wakes up task4
   Task2 wakes up task5
   Task3 wakes up task6
   Task4 wakes up task7

   With select_idle_cpu() tuned to scan the local cluster first, those
   tasks get a chance to be gathered like this:

   +---------------------------+     +----------------------+
   | +----+    +-----+         |     | +----+    +-----+    |
   | |task|    |task |         |     | |task|    |task |    |
   | |1   |    | 4   |         |     | |2   |    |5    |    |
   | +----+    +-----+         |     | +----+    +-----+    |
   |                           |     |                      |
   |       cluster1            |     |     cluster2         |
   |                           |     |                      |
   |                           |     |                      |
   | +-----+   +------+        |     | +-----+   +------+   |
   | |task |   | task |        |     | |task |   |task  |   |
   | |3    |   |  6   |        |     | |4    |   |8     |   |
   | +-----+   +------+        |     | +-----+   +------+   |
   +---------------------------+     +----------------------+

   Otherwise, the result might be:

   +---------------------------+     +----------------------+
   | +----+    +-----+         |     | +----+    +-----+    |
   | |task|    |task |         |     | |task|    |task |    |
   | |1   |    | 2   |         |     | |5   |    |6    |    |
   | +----+    +-----+         |     | +----+    +-----+    |
   |                           |     |                      |
   |       cluster1            |     |     cluster2         |
   |                           |     |                      |
   |                           |     |                      |
   | +-----+   +------+        |     | +-----+   +------+   |
   | |task |   | task |        |     | |task |   |task  |   |
   | |3    |   |  4   |        |     | |7    |   |8     |   |
   | +-----+   +------+        |     | +-----+   +------+   |
   +---------------------------+     +----------------------+

-v4:
  * rebased to tip/sched/core with the latest unified code of
    select_idle_cpu
  * added Tim's patch for x86 Jacobsville
  * also added benchmark data of spreading unrelated tasks
  * avoided the iteration of sched_domain by moving to static_key
    (addressing Vincent's comment)
  * used acpi_cpu_id for acpi_find_processor_node (addressing Masa's
    comment)

Barry Song (1):
  scheduler: add scheduler level for clusters

Jonathan Cameron (1):
  topology: Represent clusters of CPUs within a die.

Tim Chen (1):
  scheduler: Add cluster scheduler level for x86

 Documentation/admin-guide/cputopology.rst | 26 ++++++++++--
 arch/arm64/Kconfig                        |  7 ++++
 arch/arm64/kernel/topology.c              |  2 +
 arch/x86/Kconfig                          |  8 ++++
 arch/x86/include/asm/smp.h                |  7 ++++
 arch/x86/include/asm/topology.h           |  1 +
 arch/x86/kernel/cpu/cacheinfo.c           |  1 +
 arch/x86/kernel/cpu/common.c              |  3 ++
 arch/x86/kernel/smpboot.c                 | 43 +++++++++++++++++++-
 drivers/acpi/pptt.c                       | 63 +++++++++++++++++++++++++++++
 drivers/base/arch_topology.c              | 14 +++++++
 drivers/base/topology.c                   | 10 +++++
 include/linux/acpi.h                      |  5 +++
 include/linux/arch_topology.h             |  5 +++
 include/linux/sched/cluster.h             | 19 +++++++++
 include/linux/sched/sd_flags.h            |  9 +++++
 include/linux/sched/topology.h            |  7 ++++
 include/linux/topology.h                  | 13 ++++++
 kernel/sched/core.c                       | 18 +++++++++
 kernel/sched/fair.c                       | 66 ++++++++++++++++++++++++-------
 kernel/sched/sched.h                      |  1 +
 kernel/sched/topology.c                   |  6 +++
 22 files changed, 315 insertions(+), 19 deletions(-)
 create mode 100644 include/linux/sched/cluster.h
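
For readers who want a feel for point 2 above (scanning the local
cluster first when looking for an idle CPU), here is a minimal
user-space sketch in Python. It is not the kernel code from this series;
the real change lives in kernel/sched/fair.c. The names cluster_of() and
pick_idle_cpu(), and the assumption that CPUs are numbered contiguously
within a cluster, are hypothetical and chosen only for illustration.

```python
from typing import Optional, Sequence, Set

# From the cover letter: each Kunpeng 920 cluster has 4 CPUs.
CPUS_PER_CLUSTER = 4


def cluster_of(cpu: int) -> int:
    """Hypothetical mapping; assumes CPUs are numbered contiguously per cluster."""
    return cpu // CPUS_PER_CLUSTER


def pick_idle_cpu(target: int, llc_cpus: Sequence[int], idle: Set[int]) -> Optional[int]:
    """Return an idle CPU, trying the target's own cluster before the rest.

    llc_cpus stands in for the CPUs that share the last-level cache with
    the target; idle is the set of currently idle CPUs.
    """
    local = [c for c in llc_cpus if cluster_of(c) == cluster_of(target)]
    remote = [c for c in llc_cpus if cluster_of(c) != cluster_of(target)]
    # Scan the local cluster first, then fall back to the other clusters.
    for cpu in local + remote:
        if cpu in idle:
            return cpu
    return None  # no idle CPU found in this domain


if __name__ == "__main__":
    cpus = list(range(8))  # two clusters: CPUs 0-3 and CPUs 4-7
    # The waker runs on CPU 0; CPU 2 (same cluster) and CPU 5 are idle.
    print(pick_idle_cpu(0, cpus, {2, 5}))
    # Only remote CPUs are idle, so the scan falls back to the other cluster.
    print(pick_idle_cpu(0, cpus, {5, 6}))
```

With this ordering a wakee tends to land next to its waker whenever the
local cluster has an idle CPU, which is the gathering effect described
in point 2; the fallback keeps the search from failing when the local
cluster is busy.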