From patchwork Sat Apr 5 14:52:10 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Mathieu Desnoyers
X-Patchwork-Id: 14039151
From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Sweet Tea Dorminy
Cc: linux-kernel@vger.kernel.org, Mathieu Desnoyers, Andrew Morton,
    "Paul E. McKenney", Steven Rostedt, Masami Hiramatsu, Dennis Zhou,
    Tejun Heo, Christoph Lameter, Martin Liu, David Rientjes,
    christian.koenig@amd.com, Shakeel Butt, Johannes Weiner,
    Lorenzo Stoakes, "Liam R. Howlett", Suren Baghdasaryan,
    Vlastimil Babka, Christian Brauner, Wei Yang, David Hildenbrand,
    Miaohe Lin, Al Viro, linux-mm@kvack.org,
    linux-trace-kernel@vger.kernel.org, Yu Zhao, Roman Gushchin,
    Mateusz Guzik, Matthew Wilcox
Subject: [RFC PATCH] Introduce Hierarchical Per-CPU Counters
Date: Sat, 5 Apr 2025 10:52:10 -0400
Message-Id: <20250405145210.616206-1-mathieu.desnoyers@efficios.com>
X-Mailer: git-send-email 2.39.5

* Motivation

The purpose of this hierarchical split-counter scheme is to:

- Minimize contention when incrementing and decrementing counters,
- Provide fast access to a sum approximation,
- Provide a sum approximation with an acceptable accuracy level when
  scaling to many-core systems.

It aims at fixing the per-mm RSS tracking, which has become too
inaccurate for OOM killer purposes on large many-core systems [1].

* Design

The hierarchical per-CPU counters propagate a sum approximation
through a binary tree. When the batch size is reached, the carry is
propagated through a binary tree which consists of log2(nr_cpu_ids)
levels. The batch size for each level is twice the batch size of the
prior level.

Example propagation diagram with 8 CPUs:

Level 0:  0    1    2    3    4    5    6    7
          |   /     |   /     |   /     |   /
          |  /      |  /      |  /      |  /
          | /       | /       | /       | /
Level 1:  0         1         2         3
          |       /           |       /
          |     /             |     /
          |   /               |   /
Level 2:  0                   1
          |                /
          |           /
          |      /
Level 3:  0

The maximum inaccuracy is bounded by:

  batch_size * log2(nr_cpus) * nr_cpus

which evolves with O(n*log(n)) as the number of CPUs increases.
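For example, with nr_cpus = 8 and batch_size = 32, the approximate sum
stays within 32 * log2(8) * 8 = 768 of the precise sum.

A minimal usage sketch of the API introduced below (illustrative only;
error handling and the surrounding call sites are simplified):

  struct percpu_counter_tree counter;
  int ret, approx, exact;

  ret = percpu_counter_tree_init(&counter, 32); /* batch size must be a power of 2 */
  if (ret)
          return ret;

  percpu_counter_tree_add(&counter, 1);         /* per-CPU fast path, rare carry propagation */
  approx = percpu_counter_tree_approx_sum(&counter);  /* cheap, within ±inaccuracy */
  exact = percpu_counter_tree_precise_sum(&counter);  /* iterates all possible CPUs */

  percpu_counter_tree_destroy(&counter);
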
Link: https://lore.kernel.org/lkml/20250331223516.7810-2-sweettea-kernel@dorminy.me/ # [1]
Signed-off-by: Mathieu Desnoyers
Cc: Andrew Morton
Cc: "Paul E. McKenney"
Cc: Steven Rostedt
Cc: Masami Hiramatsu
Cc: Mathieu Desnoyers
Cc: Dennis Zhou
Cc: Tejun Heo
Cc: Christoph Lameter
Cc: Martin Liu
Cc: David Rientjes
Cc: christian.koenig@amd.com
Cc: Shakeel Butt
Cc: Johannes Weiner
Cc: Sweet Tea Dorminy
Cc: Lorenzo Stoakes
Cc: "Liam R. Howlett"
Cc: Suren Baghdasaryan
Cc: Vlastimil Babka
Cc: Christian Brauner
Cc: Wei Yang
Cc: David Hildenbrand
Cc: Miaohe Lin
Cc: Al Viro
Cc: linux-mm@kvack.org
Cc: linux-trace-kernel@vger.kernel.org
Cc: Yu Zhao
Cc: Roman Gushchin
Cc: Mateusz Guzik
Cc: Matthew Wilcox
---
 include/linux/percpu_counter_tree.h |  98 ++++++++++
 lib/Makefile                        |   1 +
 lib/percpu_counter_tree.c           | 265 ++++++++++++++++++++++++++++
 3 files changed, 364 insertions(+)
 create mode 100644 include/linux/percpu_counter_tree.h
 create mode 100644 lib/percpu_counter_tree.c

diff --git a/include/linux/percpu_counter_tree.h b/include/linux/percpu_counter_tree.h
new file mode 100644
index 000000000000..9668e9673099
--- /dev/null
+++ b/include/linux/percpu_counter_tree.h
@@ -0,0 +1,98 @@
+/* SPDX-License-Identifier: GPL-2.0+ OR MIT */
+/* SPDX-FileCopyrightText: 2025 Mathieu Desnoyers */
+
+#ifndef _PERCPU_COUNTER_TREE_H
+#define _PERCPU_COUNTER_TREE_H
+
+#include <linux/atomic.h>
+#include <linux/cache.h>
+#include <linux/cleanup.h>
+#include <linux/percpu.h>
+
+struct percpu_counter_tree_level_item {
+	atomic_t count;
+} ____cacheline_aligned_in_smp;
+
+struct percpu_counter_tree {
+	unsigned int level0_bit_mask;
+	unsigned int __percpu *level0;
+
+	unsigned int nr_levels;
+	unsigned int nr_cpus;
+	unsigned int batch_size;
+	struct percpu_counter_tree_level_item *items;
+	unsigned int inaccuracy;	/* approximation imprecise within ± inaccuracy */
+	int bias;			/* bias for counter_set */
+};
+
+int percpu_counter_tree_init(struct percpu_counter_tree *counter, unsigned int batch_size);
+void percpu_counter_tree_destroy(struct percpu_counter_tree *counter);
+void percpu_counter_tree_add_slowpath(struct percpu_counter_tree *counter, int inc);
+int percpu_counter_tree_precise_sum_unbiased(struct percpu_counter_tree *counter);
+int percpu_counter_tree_precise_sum(struct percpu_counter_tree *counter);
+int percpu_counter_tree_approximate_compare(struct percpu_counter_tree *counter, int v);
+int percpu_counter_tree_precise_compare(struct percpu_counter_tree *counter, int v);
+void percpu_counter_tree_set_bias(struct percpu_counter_tree *counter, int bias);
+void percpu_counter_tree_set(struct percpu_counter_tree *counter, int v);
+unsigned int percpu_counter_tree_inaccuracy(struct percpu_counter_tree *counter);
+
+/* Fast paths */
+
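+/*
+ * Example of the carry computation below, with bit_mask = 32 (batch
+ * size 32): for orig = 30, inc = 5, res = 35, the used bits of inc are
+ * cleared (5 & ~31 == 0) and bit 5 flips between orig (0b011110) and
+ * res (0b100011), so a carry of +32 is returned for propagation to the
+ * next level. For orig = 10, inc = 5, res = 15, bit 5 does not flip
+ * and no carry is propagated.
+ */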
+static inline
+int percpu_counter_tree_carry(int orig, int res, int inc, unsigned int bit_mask)
+{
+	if (inc < 0) {
+		inc = -(-inc & ~(bit_mask - 1));
+		/*
+		 * xor bit_mask: underflow.
+		 *
+		 * If inc has bit set, decrement an additional bit if
+		 * there is _no_ bit transition between orig and res.
+		 * Else, inc has bit cleared, decrement an additional
+		 * bit if there is a bit transition between orig and
+		 * res.
+		 */
+		if ((inc ^ orig ^ res) & bit_mask)
+			inc -= bit_mask;
+	} else {
+		inc &= ~(bit_mask - 1);
+		/*
+		 * xor bit_mask: overflow.
+		 *
+		 * If inc has bit set, increment an additional bit if
+		 * there is _no_ bit transition between orig and res.
+		 * Else, inc has bit cleared, increment an additional
+		 * bit if there is a bit transition between orig and
+		 * res.
+		 */
+		if ((inc ^ orig ^ res) & bit_mask)
+			inc += bit_mask;
+	}
+	return inc;
+}
+
+static inline
+void percpu_counter_tree_add(struct percpu_counter_tree *counter, int inc)
+{
+	unsigned int bit_mask = counter->level0_bit_mask, orig, res;
+
+	if (!inc)
+		return;
+	/* Make sure the fast and slow paths use the same cpu number. */
+	guard(migrate)();
+	res = this_cpu_add_return(*counter->level0, inc);
+	orig = res - inc;
+	inc = percpu_counter_tree_carry(orig, res, inc, bit_mask);
+	if (!inc)
+		return;
+	percpu_counter_tree_add_slowpath(counter, inc);
+}
+
+static inline
+int percpu_counter_tree_approx_sum(struct percpu_counter_tree *counter)
+{
+	return (int) (atomic_read(&counter->items[counter->nr_cpus - 2].count) +
+		      READ_ONCE(counter->bias));
+}
+
+#endif /* _PERCPU_COUNTER_TREE_H */
diff --git a/lib/Makefile b/lib/Makefile
index d5cfc7afbbb8..d803a3a63288 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -201,6 +201,7 @@ obj-$(CONFIG_TEXTSEARCH_KMP) += ts_kmp.o
 obj-$(CONFIG_TEXTSEARCH_BM) += ts_bm.o
 obj-$(CONFIG_TEXTSEARCH_FSM) += ts_fsm.o
 obj-$(CONFIG_SMP) += percpu_counter.o
+obj-$(CONFIG_SMP) += percpu_counter_tree.o
 
 obj-$(CONFIG_AUDIT_GENERIC) += audit.o
 obj-$(CONFIG_AUDIT_COMPAT_GENERIC) += compat_audit.o
diff --git a/lib/percpu_counter_tree.c b/lib/percpu_counter_tree.c
new file mode 100644
index 000000000000..8e456bfd0907
--- /dev/null
+++ b/lib/percpu_counter_tree.c
@@ -0,0 +1,265 @@
+// SPDX-License-Identifier: GPL-2.0+ OR MIT
+// SPDX-FileCopyrightText: 2025 Mathieu Desnoyers
+
+/*
+ * Split Counters With Binary Tree Approximation Propagation
+ *
+ * * Propagation diagram when reaching batch size thresholds (± batch size):
+ *
+ * Example diagram for 8 CPUs:
+ *
+ * log2(8) = 3 levels
+ *
+ * At each level, each pair propagates its values to the next level when
+ * reaching the batch size thresholds.
+ *
+ * Counters at levels 0, 1, 2 can be kept on a single byte (±128 range),
+ * although it may be relevant to keep them on 32-bit counters for
+ * simplicity. (complexity vs memory footprint tradeoff)
+ *
+ * Counter at level 3 can be kept on a 32-bit counter.
+ *
+ * Level 0:  0    1    2    3    4    5    6    7
+ *           |   /     |   /     |   /     |   /
+ *           |  /      |  /      |  /      |  /
+ *           | /       | /       | /       | /
+ * Level 1:  0         1         2         3
+ *           |       /           |       /
+ *           |     /             |     /
+ *           |   /               |   /
+ * Level 2:  0                   1
+ *           |                /
+ *           |           /
+ *           |      /
+ * Level 3:  0
+ *
+ * * Approximation inaccuracy:
+ *
+ * BATCH(level N): Level N batch size.
+ *
+ * Example for BATCH(level 0) = 32.
+ *
+ * BATCH(level 0) = 32
+ * BATCH(level 1) = 64
+ * BATCH(level 2) = 128
+ * BATCH(level N) = BATCH(level 0) * 2^N
+ *
+ *              per-counter      global
+ *              inaccuracy       inaccuracy
+ * Level 0:     [ -32 ..  +31]   ±256 (8 * 32)
+ * Level 1:     [ -64 ..  +63]   ±256 (4 * 64)
+ * Level 2:     [-128 .. +127]   ±256 (2 * 128)
+ * Total:       ------           ±768 (log2(nr_cpu_ids) * BATCH(level 0) * nr_cpu_ids)
+ *
+ * -----
+ *
+ * Approximate Sum Carry Propagation
+ *
+ * Let's define a number of counter bits for each level, e.g.:
+ *
+ * log2(BATCH(level 0)) = log2(32) = 5
+ *
+ *              nr_bit     value_mask                  range
+ * Level 0:     5 bits     v                           0 ..  +31
+ * Level 1:     1 bit      (v & ~((1UL << 5) - 1))     0 ..  +63
+ * Level 2:     1 bit      (v & ~((1UL << 6) - 1))     0 .. +127
+ * Level 3:     25 bits    (v & ~((1UL << 7) - 1))     0 .. 2^32-1
+ *
+ * Note: Use a full 32-bit per-cpu counter at level 0 to allow precise sum.
+ *
+ * Note: Use cacheline aligned counters at levels above 0 to prevent false sharing.
+ *       If memory footprint is an issue, a specialized allocator could be used
+ *       to eliminate padding.
+ *
+ * Example with expanded values:
+ *
+ * counter_add(counter, inc):
+ *
+ *	if (!inc)
+ *		return;
+ *
+ *	res = percpu_add_return(counter @ Level 0, inc);
+ *	orig = res - inc;
+ *	if (inc < 0) {
+ *		inc = -(-inc & ~0b00011111);	// Clear used bits
+ *		// xor bit 5: underflow
+ *		if ((inc ^ orig ^ res) & 0b00100000)
+ *			inc -= 0b00100000;
+ *	} else {
+ *		inc &= ~0b00011111;		// Clear used bits
+ *		// xor bit 5: overflow
+ *		if ((inc ^ orig ^ res) & 0b00100000)
+ *			inc += 0b00100000;
+ *	}
+ *	if (!inc)
+ *		return;
+ *
+ *	res = atomic_add_return(counter @ Level 1, inc);
+ *	orig = res - inc;
+ *	if (inc < 0) {
+ *		inc = -(-inc & ~0b00111111);	// Clear used bits
+ *		// xor bit 6: underflow
+ *		if ((inc ^ orig ^ res) & 0b01000000)
+ *			inc -= 0b01000000;
+ *	} else {
+ *		inc &= ~0b00111111;		// Clear used bits
+ *		// xor bit 6: overflow
+ *		if ((inc ^ orig ^ res) & 0b01000000)
+ *			inc += 0b01000000;
+ *	}
+ *	if (!inc)
+ *		return;
+ *
+ *	res = atomic_add_return(counter @ Level 2, inc);
+ *	orig = res - inc;
+ *	if (inc < 0) {
+ *		inc = -(-inc & ~0b01111111);	// Clear used bits
+ *		// xor bit 7: underflow
+ *		if ((inc ^ orig ^ res) & 0b10000000)
+ *			inc -= 0b10000000;
+ *	} else {
+ *		inc &= ~0b01111111;		// Clear used bits
+ *		// xor bit 7: overflow
+ *		if ((inc ^ orig ^ res) & 0b10000000)
+ *			inc += 0b10000000;
+ *	}
+ *	if (!inc)
+ *		return;
+ *
+ *	atomic_add(counter @ Level 3, inc);
+ */
+
+#include <linux/percpu_counter_tree.h>
+#include <linux/atomic.h>
+#include <linux/cpumask.h>
+#include <linux/log2.h>
+#include <linux/percpu.h>
+#include <linux/slab.h>
+#include <linux/smp.h>
+
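+/*
+ * Example of the values derived at init time (illustration): with
+ * nr_cpu_ids = 8 and batch_size = 32, nr_levels = 3, nr_cpus = 8 and
+ * level0_bit_mask = 32; the items array holds nr_cpus - 1 = 7
+ * intermediate counters (4 + 2 + 1), and the reported inaccuracy is
+ * 3 * 32 * 8 = 768, matching the propagation diagram above.
+ */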
+int percpu_counter_tree_init(struct percpu_counter_tree *counter, unsigned int batch_size)
+{
+	/* Batch size must be power of 2 */
+	if (!batch_size || (batch_size & (batch_size - 1)))
+		return -EINVAL;
+	counter->nr_levels = get_count_order(nr_cpu_ids);
+	counter->nr_cpus = 1UL << counter->nr_levels;
+	counter->batch_size = batch_size;
+	counter->level0_bit_mask = 1UL << get_count_order(batch_size);
+	counter->inaccuracy = counter->nr_levels * batch_size * counter->nr_cpus;
+	counter->bias = 0;
+	counter->level0 = alloc_percpu(unsigned int);
+	if (!counter->level0)
+		return -ENOMEM;
+	counter->items = kzalloc((counter->nr_cpus - 1) *
+				 sizeof(struct percpu_counter_tree_level_item),
+				 GFP_KERNEL);
+	if (!counter->items) {
+		free_percpu(counter->level0);
+		return -ENOMEM;
+	}
+	return 0;
+}
+
+void percpu_counter_tree_destroy(struct percpu_counter_tree *counter)
+{
+	free_percpu(counter->level0);
+	kfree(counter->items);
+}
+
+/* Called with migration disabled. */
+void percpu_counter_tree_add_slowpath(struct percpu_counter_tree *counter, int inc)
+{
+	struct percpu_counter_tree_level_item *item = counter->items;
+	unsigned int level_items = counter->nr_cpus >> 1;
+	unsigned int level, nr_levels = counter->nr_levels;
+	unsigned int bit_mask = counter->level0_bit_mask;
+	unsigned int cpu = smp_processor_id();
+
+	for (level = 1; level < nr_levels; level++) {
+		atomic_t *count = &item[cpu & (level_items - 1)].count;
+		unsigned int orig, res;
+
+		bit_mask <<= 1;
+		res = atomic_add_return_relaxed(inc, count);
+		orig = res - inc;
+		inc = percpu_counter_tree_carry(orig, res, inc, bit_mask);
+		item += level_items;
+		level_items >>= 1;
+		if (!inc)
+			return;
+	}
+	atomic_add(inc, &item->count);
+}
+
+/*
+ * Precise sum.
+ * Keep "int" counters per-cpu, and perform the sum of all per-cpu
+ * counters.
+ */
+int percpu_counter_tree_precise_sum_unbiased(struct percpu_counter_tree *counter)
+{
+	unsigned int sum = 0;
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		sum += *per_cpu_ptr(counter->level0, cpu);
+	return (int) sum;
+}
+
+int percpu_counter_tree_precise_sum(struct percpu_counter_tree *counter)
+{
+	return percpu_counter_tree_precise_sum_unbiased(counter) + READ_ONCE(counter->bias);
+}
+
+/*
+ * Do an approximate comparison of a counter against a given value.
+ * Return 1 if the value is greater than the counter,
+ * Return -1 if the value is lower than the counter,
+ * Return 0 if the value is within the inaccuracy range of the counter.
+ */
+int percpu_counter_tree_approximate_compare(struct percpu_counter_tree *counter, int v)
+{
+	int count = percpu_counter_tree_approx_sum(counter);
+
+	if (abs(v - count) <= counter->inaccuracy)
+		return 0;
+	if (v > count)
+		return 1;
+	return -1;
+}
+
+/*
+ * Compare a counter against a given value.
+ * Return 1 if the value is greater than the counter,
+ * Return -1 if the value is lower than the counter,
+ * Return 0 if the value is equal to the counter.
+ */
+int percpu_counter_tree_precise_compare(struct percpu_counter_tree *counter, int v)
+{
+	int count = percpu_counter_tree_approx_sum(counter);
+
+	if (abs(v - count) <= counter->inaccuracy)
+		count = percpu_counter_tree_precise_sum(counter);
+	if (v > count)
+		return 1;
+	if (v < count)
+		return -1;
+	return 0;
+}
+
+void percpu_counter_tree_set_bias(struct percpu_counter_tree *counter, int bias)
+{
+	WRITE_ONCE(counter->bias, bias);
+}
+
+void percpu_counter_tree_set(struct percpu_counter_tree *counter, int v)
+{
+	percpu_counter_tree_set_bias(counter,
+				     v - percpu_counter_tree_precise_sum_unbiased(counter));
+}
+
+unsigned int percpu_counter_tree_inaccuracy(struct percpu_counter_tree *counter)
+{
+	return counter->inaccuracy;
+}
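
Usage note (an illustrative sketch, not taken from the patch; "counter"
and "limit" are hypothetical names): the precise comparison only resorts
to summing every per-CPU counter when the compared value falls within
the counter's inaccuracy window, so a threshold check can stay on the
cheap approximation most of the time:

  bool over_limit;
  int cmp;

  cmp = percpu_counter_tree_approximate_compare(&counter, limit);
  if (cmp == 0)   /* within ±inaccuracy: resolve with the precise sum */
          cmp = percpu_counter_tree_precise_compare(&counter, limit);
  over_limit = (cmp < 0);   /* limit is below the counter value */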