From patchwork Sun Oct 8 13:11:59 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Timofey Titovets X-Patchwork-Id: 9991829 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id 1692E60244 for ; Sun, 8 Oct 2017 13:12:18 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 07087285C8 for ; Sun, 8 Oct 2017 13:12:18 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id EDA59285CD; Sun, 8 Oct 2017 13:12:17 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.3 required=2.0 tests=BAYES_00, DKIM_ADSP_CUSTOM_MED, DKIM_SIGNED, FREEMAIL_FROM, RCVD_IN_DNSWL_HI, RCVD_IN_SORBS_SPAM, T_DKIM_INVALID autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 7152E285C8 for ; Sun, 8 Oct 2017 13:12:17 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752382AbdJHNML (ORCPT ); Sun, 8 Oct 2017 09:12:11 -0400 Received: from mail-lf0-f67.google.com ([209.85.215.67]:35735 "EHLO mail-lf0-f67.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751168AbdJHNMK (ORCPT ); Sun, 8 Oct 2017 09:12:10 -0400 Received: by mail-lf0-f67.google.com with SMTP id a132so817921lfa.2 for ; Sun, 08 Oct 2017 06:12:09 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id; bh=sTcZGn/Nlv+XliNYOHhZ+5LUag7tKWOhiSBREW5Q5FI=; b=CIUYZBnyrXnEngvkLCuEekhNi2nyrZTRSz7NUEcdiVAP4tmgkjj/1HT2aR9FGWl1Wy 8XplTGHvtyXAjQrtE1Sd2bcvxcGIbdaIKkzdhH16i3HBG1V2bWLZ0x9mcCUFYEpHgqv+ ftxCoL9rFwiWvHGQ0fwW9EvMOoNEqrCkMDwR9CghJpU3BYcyUzja3y+6QaJra1pbYQJr 4ZUvSWHRa0ajoS4r5n+Qj2u4q9kWGY/yM1fInk0IzdSICKssxqsVvem807bv5hU72/qN kT1yal+6IrqrgDuLb4fciTBkH8LnOORhlFt3mwMEQg5HWLlEiqtwMr1Wh8Rj6nJ8sVzb fh7A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id; bh=sTcZGn/Nlv+XliNYOHhZ+5LUag7tKWOhiSBREW5Q5FI=; b=mQ2EUXUuAu3rhxsla9xb6xSw0/HZl3pduRTUTicw5Ms1kEiCIYKTUSEFpX1/eupod2 nhcvB1tDjEgP0f4epcy5wsWs8WN+5BXdkUyBGC2lLKsLB+XDWQEhLzpa541ZN0J/xzJp +koWkzjjKv/pwS2plsQZZLPqNRHjSwcn5Kna/ZdXOPRF+WlOd6lsFPy1HUlIgKldfUMY 0vozJkOXwpH+f6Mq1L9N9UJTz9/WBtgDMrzfXGfnbwhSKJE79fVqITYHiB3Bw+rqShiw cJ+dJm4dMOwl8aM2Ta9jyqCFUP2vGvaqyy74H90j8Urc06nbONZwMeHOco5OswptAgV2 AHAQ== X-Gm-Message-State: AMCzsaVSQELscNcdhmiJYdP+PhZ1Bvm8IacmsPSBtVVg5NHLtLPoT5CW qmD5+AcvXBzg0+Gz0rfN7sItfA== X-Google-Smtp-Source: AOwi7QApMm1Z/+VscT7AjTELDGv/GLpivs02R5KxAOMVEiKHqgYAJAYjTYsaU+IXvRGZsDMKMNaupg== X-Received: by 10.46.16.218 with SMTP id 87mr26842ljq.115.1507468328349; Sun, 08 Oct 2017 06:12:08 -0700 (PDT) Received: from titovetst-l.lan (nat6-minsk-pool-46-53-208-190.telecom.by. [46.53.208.190]) by smtp.gmail.com with ESMTPSA id 65sm1408193ljb.69.2017.10.08.06.12.07 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 08 Oct 2017 06:12:07 -0700 (PDT) From: Timofey Titovets To: linux-btrfs@vger.kernel.org Cc: Timofey Titovets Subject: [PATCH v2] Btrfs: heuristic add shannon entropy calculation Date: Sun, 8 Oct 2017 16:11:59 +0300 Message-Id: <20171008131159.31239-1-nefelim4ag@gmail.com> X-Mailer: git-send-email 2.14.2 Sender: linux-btrfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Byte distribution check in heuristic will filter edge data cases and some time fail to classificate input data Let's fix that by adding Shannon entropy calculation, that will cover classification of most other data types. As Shannon entropy need log2 with some precision to work, let's use ilog2(N) and for increase precision, by do ilog2(pow(N, 4)) Shannon entropy slightly changed for avoiding signed numbers and divisions. Changes: v1 -> v2: - Replace log2_lshift4 wiht ilog2 - To compensate accuracy errors of ilog2 @ENTROPY_LVL_ACEPTABLE 70 -> 65 @ENTROPY_LVL_HIGH 85 -> 80 - Drop usage of u64 from shannon calculation Signed-off-by: Timofey Titovets --- fs/btrfs/compression.c | 83 +++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 82 insertions(+), 1 deletion(-) -- 2.14.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index 0060bc4ae5f2..8efbce5633b5 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -34,6 +34,7 @@ #include #include #include +#include #include "ctree.h" #include "disk-io.h" #include "transaction.h" @@ -1224,6 +1225,60 @@ int btrfs_decompress_buf2page(const char *buf, unsigned long buf_start, return 1; } +/* + * Shannon Entropy calculation + * + * Pure byte distribution analyze fail to determine + * compressiability of data. Try calculate entropy to + * estimate the average minimum number of bits needed + * to encode a sample data. + * + * For comfort, use return of percentage of needed bit's, + * instead of bit's amaount directly. + * + * @ENTROPY_LVL_ACEPTABLE - below that threshold sample has low byte + * entropy and can be compressible with high probability + * + * @ENTROPY_LVL_HIGH - data are not compressible with high probability, + * + * Use of ilog2() decrease precision, so to see ~ correct answer + * LVL decreased by 5. + */ + +#define ENTROPY_LVL_ACEPTABLE 65 +#define ENTROPY_LVL_HIGH 80 + +/* + * For increase precision in shannon_entropy calculation + * Let's do pow(n, M) to save more digits after comma + * + * Max int bit lengh 64 + * ilog2(MAX_SAMPLE_SIZE) -> 13 + * 13*4 = 52 < 64 -> M = 4 + * So use pow(n, 4) + */ +static inline u32 ilog2_w(u64 n) +{ + return ilog2(n * n * n * n); +} + +static u32 shannon_entropy(struct heuristic_ws *ws) +{ + const u32 entropy_max = 8*ilog2_w(2); + u32 entropy_sum = 0; + u32 p, p_base, sz_base; + u32 i; + + sz_base = ilog2_w(ws->sample_size); + for (i = 0; i < BUCKET_SIZE && ws->bucket[i].count > 0; i++) { + p = ws->bucket[i].count; + p_base = ilog2_w(p); + entropy_sum += p * (sz_base - p_base); + } + + entropy_sum /= ws->sample_size; + return entropy_sum * 100 / entropy_max; +} /* Compare buckets by size, ascending */ static inline int bucket_comp_rev(const void *lv, const void *rv) @@ -1404,7 +1459,7 @@ int btrfs_compress_heuristic(struct inode *inode, u64 start, u64 end) struct heuristic_ws *ws; u32 i; u8 byte; - int ret = 1; + int ret = 0; ws = list_entry(ws_list, struct heuristic_ws, list); @@ -1439,6 +1494,32 @@ int btrfs_compress_heuristic(struct inode *inode, u64 start, u64 end) goto out; } + i = shannon_entropy(ws); + if (i <= ENTROPY_LVL_ACEPTABLE) { + ret = 4; + goto out; + } + + /* + * At below entropy levels additional + * analysis needed for show green light to compression + * For now just assume that compression at that level are + * not worth for resources, becuase: + * 1. that possible to defrag data later + * 2. Case where that will really show compressible data are rare + * ex. 150 byte types, every bucket have counter at level ~54 + * Heuristic will be confused, that can happen when data + * have some internal repeated patterns abbacbbc.. + * that can be detected only by analize sample for byte pairs + */ + if (i < ENTROPY_LVL_HIGH) { + ret = 5; + goto out; + } else { + ret = 0; + goto out; + } + out: __free_workspace(0, ws_list, true); return ret;