From patchwork Fri Oct 13 00:32:37 2017
X-Patchwork-Submitter: Timofey Titovets
X-Patchwork-Id: 10003319
From: Timofey Titovets
To: linux-btrfs@vger.kernel.org
Cc: Timofey Titovets
Subject: [RFC PATCH] Btrfs: heuristic replace heap sort with radix sort
Date: Fri, 13 Oct 2017 03:32:37 +0300
Message-Id: <20171013003237.3061-1-nefelim4ag@gmail.com>
X-Mailer: git-send-email 2.14.2

The slowest part of the heuristic is currently the kernel heap sort():
it can take up to 55% of the runtime just sorting the bucket items.

Since the sort is called on most data sets to compute byte_core_set_size
correctly, the only way to speed up the heuristic is to speed up the sort
of the bucket.

So add a general radix_sort() function. Radix sort needs two buffers: one
the full size of the input array and one to store the counters (jump
addresses).

For the buffer array, just allocate BUCKET_SIZE*2 for the bucket and use
the free tail as the buffer, which also improves data locality. That
increases the usage per heuristic workspace by 1KiB:
  8KiB + 1KiB -> 8KiB + 2KiB

This is an LSD radix sort; 4 bits are used as the base so that the counter
array stays acceptably small (16 elements * 8 bytes).

Not tested on big endian; I try to handle that with the kernel endianness
macros.

Performance tested on a userspace copy of the heuristic code, throughput:
 - average <-> random data: ~3500 MiB/s - heap sort
 - average <-> random data: ~6000 MiB/s (+71%) - radix sort

Signed-off-by: Timofey Titovets
---
 fs/btrfs/compression.c | 153 +++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 141 insertions(+), 12 deletions(-)

-- 
2.14.2
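The sketch below is not part of the patch; it is a minimal userspace model
of the same idea (plain C99, hypothetical names such as radix_sort_counts,
with BUCKET_SIZE/RADIX_BASE/get4bits mirroring the patch). It shows how
inverting the 4-bit digit makes an ascending LSD radix sort hand back the
bucket ordered by count, largest first, using the counters as jump
addresses:

/* Standalone model of the 4-bit LSD radix pass; not kernel code. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define BUCKET_SIZE 256
#define RADIX_BASE 4
#define COUNTERS_SIZE (1 << RADIX_BASE)

struct bucket_item { uint32_t count; };

/* Inverting the digit turns an ascending LSD radix sort into a descending one */
static uint8_t get4bits(uint64_t num, int shift)
{
	return (COUNTERS_SIZE - 1) - ((num >> shift) % COUNTERS_SIZE);
}

static void radix_sort_counts(struct bucket_item *array, struct bucket_item *buf, int num)
{
	uint64_t counters[COUNTERS_SIZE];
	struct bucket_item *from = array, *to = buf;

	/* counts are 32 bit, so eight 4-bit passes always suffice */
	for (int shift = 0; shift < 32; shift += RADIX_BASE) {
		memset(counters, 0, sizeof(counters));
		for (int i = 0; i < num; i++)		/* histogram of digits */
			counters[get4bits(from[i].count, shift)]++;
		for (int i = 1; i < COUNTERS_SIZE; i++)	/* prefix sums = jump addresses */
			counters[i] += counters[i - 1];
		for (int i = num - 1; i >= 0; i--)	/* backwards keeps the sort stable */
			to[--counters[get4bits(from[i].count, shift)]] = from[i];
		struct bucket_item *tmp = from;		/* swap buffer roles, no copy back */
		from = to;
		to = tmp;
	}
	/* even number of passes: the sorted data is back in 'array' */
}

int main(void)
{
	struct bucket_item bucket[BUCKET_SIZE], buf[BUCKET_SIZE];

	for (int i = 0; i < BUCKET_SIZE; i++)
		bucket[i].count = (i * 2654435761u) % 1000;	/* pseudo-random counts */

	radix_sort_counts(bucket, buf, BUCKET_SIZE);

	for (int i = 0; i < 4; i++)
		printf("%u\n", bucket[i].count);	/* largest counts first */
	return 0;
}

The model swaps the roles of the two buffers after every pass; the patch
gets the same effect by unrolling two passes per while-loop iteration, so
the data always lands back in the original bucket array without a memcpy().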
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 01738b9a8dc7..f320fbb1de17 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -753,6 +753,7 @@ struct heuristic_ws {
 	u32 sample_size;
 	/* Bucket store counter for each byte type */
 	struct bucket_item *bucket;
+	struct bucket_item *bucket_buf;
 	struct list_head list;
 };
 
@@ -778,10 +779,12 @@ static struct list_head *alloc_heuristic_ws(void)
 	if (!ws->sample)
 		goto fail;
 
-	ws->bucket = kcalloc(BUCKET_SIZE, sizeof(*ws->bucket), GFP_KERNEL);
+	ws->bucket = kcalloc(BUCKET_SIZE*2, sizeof(*ws->bucket), GFP_KERNEL);
 	if (!ws->bucket)
 		goto fail;
 
+	ws->bucket_buf = &ws->bucket[BUCKET_SIZE];
+
 	INIT_LIST_HEAD(&ws->list);
 	return &ws->list;
 fail:
@@ -1225,6 +1228,137 @@ int btrfs_decompress_buf2page(const char *buf, unsigned long buf_start,
 	return 1;
 }
 
+#define RADIX_BASE 4
+#define COUNTERS_SIZE (1 << RADIX_BASE)
+
+static inline u8 get4bits(u64 num, int shift) {
+	u8 low4bits;
+	num = num >> shift;
+	/* Reverse order */
+	low4bits = (COUNTERS_SIZE - 1) - (num % COUNTERS_SIZE);
+	return low4bits;
+}
+
+static inline void copy_cell(void *dst, const void *src)
+{
+	struct bucket_item *dstv = (struct bucket_item *) dst;
+	struct bucket_item *srcv = (struct bucket_item *) src;
+	*dstv = *srcv;
+}
+
+static inline u64 get_num(const void *a)
+{
+	struct bucket_item *av = (struct bucket_item *) a;
+	return cpu_to_le32(av->count);
+}
+
+/*
+ * Kernel compatible radix sort implementation
+ * Uses 4 bits as the radix base
+ * Uses 16 u64 counters for calculating new positions in the buf array
+ * Tested only on little endian
+ *
+ * @array     - array that will be sorted
+ * @array_buf - buffer array to store sorting results
+ *              must be equal in size to @array
+ * @num       - array size
+ * @size      - item size
+ * @max_cell  - pointer to the element with the maximum possible value,
+ *              can be used to cap the radix sort iterations
+ *              if the maximum value is known before calling sort
+ * @get_num   - function to extract the number from the array
+ * @copy_cell - function to copy data from the array to array_buf
+ *              and vice versa
+ * @get4bits  - function to get 4 bits from the number at a given shift
+ */
+static void radix_sort(void *array, void *array_buf,
+		       int num, int size,
+		       const void *max_cell,
+		       u64 (*get_num)(const void *),
+		       void (*copy_cell)(void *dest, const void *src),
+		       u8 (*get4bits)(u64 num, int shift))
+{
+	u64 max_num;
+	u64 buf_num;
+	u64 counters[COUNTERS_SIZE];
+	u64 new_addr;
+	s64 i;
+	int addr;
+	int bitlen;
+	int shift;
+
+	/*
+	 * Try to avoid useless loop iterations for small numbers
+	 * stored in wide counters, e.g. 48 33 4 ... stored in a
+	 * 64bit array.
+	 */
+	if (!max_cell) {
+		max_num = get_num(array);
+		for (i = 0 + size; i < num*size; i += size) {
+			buf_num = get_num(array + i);
+			if (le64_to_cpu(buf_num) > le64_to_cpu(max_num))
+				max_num = buf_num;
+		}
+	} else {
+		max_num = get_num(max_cell);
+	}
+
+	buf_num = ilog2(le64_to_cpu(max_num));
+	bitlen = ALIGN(buf_num, RADIX_BASE*2);
+
+	shift = 0;
+	while (shift < bitlen) {
+		memset(counters, 0, sizeof(counters));
+
+		for (i = 0; i < num*size; i += size) {
+			buf_num = get_num(array + i);
+			addr = get4bits(buf_num, shift);
+			counters[addr]++;
+		}
+
+		for (i = 1; i < COUNTERS_SIZE; i++)
+			counters[i] += counters[i-1];
+
+		for (i = (num - 1) * size; i >= 0; i -= size) {
+			buf_num = get_num(array + i);
+			addr = get4bits(buf_num, shift);
+			counters[addr]--;
+			new_addr = counters[addr];
+			copy_cell(array_buf + (new_addr*size), array + i);
+		}
+
+		shift += RADIX_BASE;
+
+		/*
+		 * A normal radix sort would now move the data from the
+		 * tmp array back to the main one, which costs CPU time.
+		 * Avoid that cost by doing another sort iteration
+		 * straight back into the origin array instead of
+		 * a memcpy().
+		 */
+		memset(counters, 0, sizeof(counters));
+
+		for (i = 0; i < num*size; i += size) {
+			buf_num = get_num(array_buf + i);
+			addr = get4bits(buf_num, shift);
+			counters[addr]++;
+		}
+
+		for (i = 1; i < COUNTERS_SIZE; i++)
+			counters[i] += counters[i-1];
+
+		for (i = (num - 1) * size; i >= 0; i -= size) {
+			buf_num = get_num(array_buf + i);
+			addr = get4bits(buf_num, shift);
+			counters[addr]--;
+			new_addr = counters[addr];
+			copy_cell(array + (new_addr*size), array_buf + i);
+		}
+
+		shift += RADIX_BASE;
+	}
+}
+
 /*
  * Shannon Entropy calculation
  *
@@ -1280,15 +1414,6 @@ static u32 shannon_entropy(struct heuristic_ws *ws)
 	return entropy_sum * 100 / entropy_max;
 }
 
-/* Compare buckets by size, ascending */
-static inline int bucket_comp_rev(const void *lv, const void *rv)
-{
-	const struct bucket_item *l = (struct bucket_item *)(lv);
-	const struct bucket_item *r = (struct bucket_item *)(rv);
-
-	return r->count - l->count;
-}
-
 /*
  * Byte Core set size
  * How many bytes use 90% of sample
@@ -1317,9 +1442,13 @@ static int byte_core_set_size(struct heuristic_ws *ws)
 	u32 coreset_sum = 0;
 	u32 core_set_threshold = ws->sample_size * 90 / 100;
 	struct bucket_item *bucket = ws->bucket;
+	struct bucket_item max_cell;
 
-	/* Sort in reverse order */
-	sort(bucket, BUCKET_SIZE, sizeof(*bucket), &bucket_comp_rev, NULL);
+	max_cell.count = MAX_SAMPLE_SIZE;
+	radix_sort(bucket, ws->bucket_buf,
+		   BUCKET_SIZE, sizeof(*bucket),
+		   &max_cell,
+		   get_num, copy_cell, get4bits);
 
 	for (i = 0; i < BYTE_CORE_SET_LOW; i++)
 		coreset_sum += bucket[i].count;
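A note on @max_cell, under the assumption that MAX_SAMPLE_SIZE here is the
8KiB sample mentioned above: no bucket count can ever exceed the sample
size, so passing a cell with count = MAX_SAMPLE_SIZE caps the key width the
sort has to cover:

	bitlen = ALIGN(ilog2(8192), RADIX_BASE * 2);	/* ALIGN(13, 8) = 16 */

i.e. the while loop runs only twice (four 4-bit counting passes) instead of
the eight passes a full 32bit count would otherwise need.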