From patchwork Fri Oct 13 00:32:37 2017
X-Patchwork-Submitter: Timofey Titovets
X-Patchwork-Id: 10003319
From: Timofey Titovets
To: linux-btrfs@vger.kernel.org
Cc: Timofey Titovets
Subject: [RFC PATCH] Btrfs: heuristic replace heap sort with radix sort
Date: Fri, 13 Oct 2017 03:32:37 +0300
Message-Id: <20171013003237.3061-1-nefelim4ag@gmail.com>
X-Mailer: git-send-email 2.14.2

The slowest part of the heuristic is currently the kernel heap sort():
it can take up to 55% of the runtime just sorting the bucket items.

Since the sort is called on most data sets to compute byte_core_set_size
correctly, the only way to speed up the heuristic is to speed up the sort
of the bucket.

So add a general radix_sort() function. Radix sort needs two buffers: one
the full size of the input array and one to store the counters (jump
addresses).

For the buffer array, just allocate BUCKET_SIZE*2 for the bucket and use
the free tail as the buffer, which also improves data locality. That
increases the usage per heuristic workspace by 1KiB:
  8KiB + 1KiB -> 8KiB + 2KiB

This is an LSD radix sort; 4 bits are used as the base so that the counter
array stays acceptably small (16 elements * 8 bytes).

Not tested on big endian; I try to handle that with the kernel endianness
macros.

Performance tested on a userspace copy of the heuristic code, throughput:
 - average <-> random data: ~3500 MiB/s - heap sort
 - average <-> random data: ~6000 MiB/s (+71%) - radix sort

Signed-off-by: Timofey Titovets
---
 fs/btrfs/compression.c | 153 +++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 141 insertions(+), 12 deletions(-)

-- 
2.14.2
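The sketch below is not part of the patch; it is a minimal userspace model
of the same idea (plain C99, hypothetical names such as radix_sort_counts,
with BUCKET_SIZE/RADIX_BASE/get4bits mirroring the patch). It shows how
inverting the 4-bit digit makes an ascending LSD radix sort hand back the
bucket ordered by count, largest first, using the counters as jump
addresses:

/* Standalone model of the 4-bit LSD radix pass; not kernel code. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define BUCKET_SIZE 256
#define RADIX_BASE 4
#define COUNTERS_SIZE (1 << RADIX_BASE)

struct bucket_item { uint32_t count; };

/* Inverting the digit turns an ascending LSD radix sort into a descending one */
static uint8_t get4bits(uint64_t num, int shift)
{
	return (COUNTERS_SIZE - 1) - ((num >> shift) % COUNTERS_SIZE);
}

static void radix_sort_counts(struct bucket_item *array, struct bucket_item *buf, int num)
{
	uint64_t counters[COUNTERS_SIZE];
	struct bucket_item *from = array, *to = buf;

	/* counts are 32 bit, so eight 4-bit passes always suffice */
	for (int shift = 0; shift < 32; shift += RADIX_BASE) {
		memset(counters, 0, sizeof(counters));
		for (int i = 0; i < num; i++)		/* histogram of digits */
			counters[get4bits(from[i].count, shift)]++;
		for (int i = 1; i < COUNTERS_SIZE; i++)	/* prefix sums = jump addresses */
			counters[i] += counters[i - 1];
		for (int i = num - 1; i >= 0; i--)	/* backwards keeps the sort stable */
			to[--counters[get4bits(from[i].count, shift)]] = from[i];
		struct bucket_item *tmp = from;		/* swap buffer roles, no copy back */
		from = to;
		to = tmp;
	}
	/* even number of passes: the sorted data is back in 'array' */
}

int main(void)
{
	struct bucket_item bucket[BUCKET_SIZE], buf[BUCKET_SIZE];

	for (int i = 0; i < BUCKET_SIZE; i++)
		bucket[i].count = (i * 2654435761u) % 1000;	/* pseudo-random counts */

	radix_sort_counts(bucket, buf, BUCKET_SIZE);

	for (int i = 0; i < 4; i++)
		printf("%u\n", bucket[i].count);	/* largest counts first */
	return 0;
}

The model swaps the roles of the two buffers after every pass; the patch
gets the same effect by unrolling two passes per while-loop iteration, so
the data always lands back in the original bucket array without a memcpy().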
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 01738b9a8dc7..f320fbb1de17 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -753,6 +753,7 @@ struct heuristic_ws {
 	u32 sample_size;
 	/* Bucket store counter for each byte type */
 	struct bucket_item *bucket;
+	struct bucket_item *bucket_buf;
 	struct list_head list;
 };
 
@@ -778,10 +779,12 @@ static struct list_head *alloc_heuristic_ws(void)
 	if (!ws->sample)
 		goto fail;
 
-	ws->bucket = kcalloc(BUCKET_SIZE, sizeof(*ws->bucket), GFP_KERNEL);
+	ws->bucket = kcalloc(BUCKET_SIZE*2, sizeof(*ws->bucket), GFP_KERNEL);
 	if (!ws->bucket)
 		goto fail;
 
+	ws->bucket_buf = &ws->bucket[BUCKET_SIZE];
+
 	INIT_LIST_HEAD(&ws->list);
 	return &ws->list;
 fail:
@@ -1225,6 +1228,137 @@ int btrfs_decompress_buf2page(const char *buf, unsigned long buf_start,
 	return 1;
 }
 
+#define RADIX_BASE 4
+#define COUNTERS_SIZE (1 << RADIX_BASE)
+
+static inline u8 get4bits(u64 num, int shift) {
+	u8 low4bits;
+	num = num >> shift;
+	/* Reverse order */
+	low4bits = (COUNTERS_SIZE - 1) - (num % COUNTERS_SIZE);
+	return low4bits;
+}
+
+static inline void copy_cell(void *dst, const void *src)
+{
+	struct bucket_item *dstv = (struct bucket_item *) dst;
+	struct bucket_item *srcv = (struct bucket_item *) src;
+	*dstv = *srcv;
+}
+
+static inline u64 get_num(const void *a)
+{
+	struct bucket_item *av = (struct bucket_item *) a;
+	return cpu_to_le32(av->count);
+}
+
+/*
+ * Kernel compatible radix sort implementation
+ * Uses 4 bits as the radix base
+ * Uses 16 u64 counters for calculating new positions in the buf array
+ * Tested only on little endian
+ *
+ * @array     - array that will be sorted
+ * @array_buf - buffer array to store sorting results
+ *              must be equal in size to @array
+ * @num       - array size
+ * @size      - item size
+ * @max_cell  - pointer to the element with the maximum possible value,
+ *              can be used to cap the radix sort iterations
+ *              if the maximum value is known before calling sort
+ * @get_num   - function to extract the number from the array
+ * @copy_cell - function to copy data from the array to array_buf
+ *              and vice versa
+ * @get4bits  - function to get 4 bits from the number at a given shift
+ */
+static void radix_sort(void *array, void *array_buf,
+		       int num, int size,
+		       const void *max_cell,
+		       u64 (*get_num)(const void *),
+		       void (*copy_cell)(void *dest, const void *src),
+		       u8 (*get4bits)(u64 num, int shift))
+{
+	u64 max_num;
+	u64 buf_num;
+	u64 counters[COUNTERS_SIZE];
+	u64 new_addr;
+	s64 i;
+	int addr;
+	int bitlen;
+	int shift;
+
+	/*
+	 * Try to avoid useless loop iterations for small numbers
+	 * stored in wide counters, e.g. 48 33 4 ... stored in a
+	 * 64bit array.
+	 */
+	if (!max_cell) {
+		max_num = get_num(array);
+		for (i = 0 + size; i < num*size; i += size) {
+			buf_num = get_num(array + i);
+			if (le64_to_cpu(buf_num) > le64_to_cpu(max_num))
+				max_num = buf_num;
+		}
+	} else {
+		max_num = get_num(max_cell);
+	}
+
+	buf_num = ilog2(le64_to_cpu(max_num));
+	bitlen = ALIGN(buf_num, RADIX_BASE*2);
+
+	shift = 0;
+	while (shift < bitlen) {
+		memset(counters, 0, sizeof(counters));
+
+		for (i = 0; i < num*size; i += size) {
+			buf_num = get_num(array + i);
+			addr = get4bits(buf_num, shift);
+			counters[addr]++;
+		}
+
+		for (i = 1; i < COUNTERS_SIZE; i++)
+			counters[i] += counters[i-1];
+
+		for (i = (num - 1) * size; i >= 0; i -= size) {
+			buf_num = get_num(array + i);
+			addr = get4bits(buf_num, shift);
+			counters[addr]--;
+			new_addr = counters[addr];
+			copy_cell(array_buf + (new_addr*size), array + i);
+		}
+
+		shift += RADIX_BASE;
+
+		/*
+		 * A normal radix sort would now move the data from the
+		 * tmp array back to the main one, which costs CPU time.
+		 * Avoid that cost by doing another sort iteration
+		 * straight back into the origin array instead of
+		 * a memcpy().
+		 */
+		memset(counters, 0, sizeof(counters));
+
+		for (i = 0; i < num*size; i += size) {
+			buf_num = get_num(array_buf + i);
+			addr = get4bits(buf_num, shift);
+			counters[addr]++;
+		}
+
+		for (i = 1; i < COUNTERS_SIZE; i++)
+			counters[i] += counters[i-1];
+
+		for (i = (num - 1) * size; i >= 0; i -= size) {
+			buf_num = get_num(array_buf + i);
+			addr = get4bits(buf_num, shift);
+			counters[addr]--;
+			new_addr = counters[addr];
+			copy_cell(array + (new_addr*size), array_buf + i);
+		}
+
+		shift += RADIX_BASE;
+	}
+}
+
 /*
  * Shannon Entropy calculation
  *
@@ -1280,15 +1414,6 @@ static u32 shannon_entropy(struct heuristic_ws *ws)
 	return entropy_sum * 100 / entropy_max;
 }
 
-/* Compare buckets by size, ascending */
-static inline int bucket_comp_rev(const void *lv, const void *rv)
-{
-	const struct bucket_item *l = (struct bucket_item *)(lv);
-	const struct bucket_item *r = (struct bucket_item *)(rv);
-
-	return r->count - l->count;
-}
-
 /*
  * Byte Core set size
  * How many bytes use 90% of sample
@@ -1317,9 +1442,13 @@ static int byte_core_set_size(struct heuristic_ws *ws)
 	u32 coreset_sum = 0;
 	u32 core_set_threshold = ws->sample_size * 90 / 100;
 	struct bucket_item *bucket = ws->bucket;
+	struct bucket_item max_cell;
 
-	/* Sort in reverse order */
-	sort(bucket, BUCKET_SIZE, sizeof(*bucket), &bucket_comp_rev, NULL);
+	max_cell.count = MAX_SAMPLE_SIZE;
+	radix_sort(bucket, ws->bucket_buf,
+		   BUCKET_SIZE, sizeof(*bucket),
+		   &max_cell,
+		   get_num, copy_cell, get4bits);
 
 	for (i = 0; i < BYTE_CORE_SET_LOW; i++)
 		coreset_sum += bucket[i].count;
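A note on @max_cell, under the assumption that MAX_SAMPLE_SIZE here is the
8KiB sample mentioned above: no bucket count can ever exceed the sample
size, so passing a cell with count = MAX_SAMPLE_SIZE caps the key width the
sort has to cover:

	bitlen = ALIGN(ilog2(8192), RADIX_BASE * 2);	/* ALIGN(13, 8) = 16 */

i.e. the while loop runs only twice (four 4-bit counting passes) instead of
the eight passes a full 32bit count would otherwise need.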