
[090/165] lz4: fix kernel decompression speed

Message ID 20200812013500.TPqi6kXv2%akpm@linux-foundation.org (mailing list archive)
State New, archived
Series [001/165] percpu: return number of released bytes from pcpu_free_area()

Commit Message

Andrew Morton Aug. 12, 2020, 1:35 a.m. UTC
From: Nick Terrell <terrelln@fb.com>
Subject: lz4: fix kernel decompression speed

This patch replaces all memcpy() calls with LZ4_memcpy() which calls
__builtin_memcpy() so the compiler can inline it.

LZ4 relies heavily on memcpy() with a constant size being inlined.  In x86
and i386 pre-boot environments memcpy() cannot be inlined because memcpy()
doesn't get defined as __builtin_memcpy().
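
As an illustration, a minimal sketch of the difference (the copy_token_*()
helpers below are hypothetical and not part of this patch; only the
LZ4_memcpy() definition matches the change):

#include <stddef.h>	/* for size_t */

/*
 * In a freestanding pre-boot environment, memcpy() is typically only
 * available as an out-of-line routine, so even a copy with a constant
 * size is compiled as a function call.
 */
void *memcpy(void *dst, const void *src, size_t n);

static void copy_token_slow(void *op, const void *ip)
{
	memcpy(op, ip, 8);		/* may be emitted as "call memcpy" */
}

/* The wrapper added by this patch routes the copy through the builtin. */
#define LZ4_memcpy(dst, src, size) __builtin_memcpy(dst, src, size)

static void copy_token_fast(void *op, const void *ip)
{
	LZ4_memcpy(op, ip, 8);		/* can be inlined into a single 8-byte move */
}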

An equivalent patch has been applied upstream so that the next import
won't lose this change [1].

I've measured the kernel decompression speed using QEMU before and after
this patch for the x86_64 and i386 architectures.  The speed-up is about
10x as shown below.

Code	Arch	Kernel Size	Time	Speed
v5.8	x86_64	11504832 B	148 ms	 79 MB/s
patch	x86_64	11503872 B	 13 ms	885 MB/s
v5.8	i386	 9621216 B	 91 ms	106 MB/s
patch	i386	 9620224 B	 10 ms	962 MB/s

I also measured the time to decompress the initramfs on x86_64, i386, and
arm.  All three show the same decompression speed before and after, as
expected.

[1] https://github.com/lz4/lz4/pull/890

Link: http://lkml.kernel.org/r/20200803194022.2966806-1-nickrterrell@gmail.com
Signed-off-by: Nick Terrell <terrelln@fb.com>
Cc: Yann Collet <yann.collet.73@gmail.com>
Cc: Gao Xiang <gaoxiang25@huawei.com>
Cc: Sven Schmidt <4sschmid@informatik.uni-hamburg.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/lz4/lz4_compress.c   |    4 ++--
 lib/lz4/lz4_decompress.c |   18 +++++++++---------
 lib/lz4/lz4defs.h        |   10 ++++++++++
 lib/lz4/lz4hc_compress.c |    2 +-
 4 files changed, 22 insertions(+), 12 deletions(-)

Comments

Linus Torvalds Aug. 12, 2020, 5:54 p.m. UTC | #1
On Tue, Aug 11, 2020 at 6:35 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> From: Nick Terrell <terrelln@fb.com>
> Subject: lz4: fix kernel decompression speed
>
> This patch replaces all memcpy() calls with LZ4_memcpy() which calls
> __builtin_memcpy() so the compiler can inline it.

Wasn't this LZ4_memcpy() wrapper made unnecessary by just making
memcpy() work properly on its own?

So I'm dropping this patch.

If it turns out that I mis-remembered (or mis-understood), please re-send. Nick?

               Linus
Arvind Sankar Aug. 14, 2020, 5:20 p.m. UTC | #2
On Wed, Aug 12, 2020 at 10:54:08AM -0700, Linus Torvalds wrote:
> On Tue, Aug 11, 2020 at 6:35 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > From: Nick Terrell <terrelln@fb.com>
> > Subject: lz4: fix kernel decompression speed
> >
> > This patch replaces all memcpy() calls with LZ4_memcpy() which calls
> > __builtin_memcpy() so the compiler can inline it.
> 
> Wasn't this LZ4_memcpy() wrapper made unnecessary by just making
> memcpy() work properly on its own?
> 
> So I'm dropping this patch.
> 
> If it turns out that I mis-remembered (or mis-understood), please re-send. Nick?
> 
>                Linus

I assume you're referring to my patch [0]. It hasn't been picked up yet,
and it is only for x86.

I think Nick wanted his patch to be merged as well [1], because it will
also address other architectures (e.g. it looks like at least s390 LZ4
probably has the same issue), and is already in upstream LZ4.

Thanks.

[0] https://lore.kernel.org/lkml/20200804234817.3922187-1-nivedita@alum.mit.edu/
[1] https://lore.kernel.org/lkml/3961E1BD-8F58-4240-A3B3-B7032A405B42@fb.com/

Patch

--- a/lib/lz4/lz4_compress.c~lz4-fix-kernel-decompression-speed
+++ a/lib/lz4/lz4_compress.c
@@ -446,7 +446,7 @@  _last_literals:
 			*op++ = (BYTE)(lastRun << ML_BITS);
 		}
 
-		memcpy(op, anchor, lastRun);
+		LZ4_memcpy(op, anchor, lastRun);
 
 		op += lastRun;
 	}
@@ -708,7 +708,7 @@  _last_literals:
 		} else {
 			*op++ = (BYTE)(lastRunSize<<ML_BITS);
 		}
-		memcpy(op, anchor, lastRunSize);
+		LZ4_memcpy(op, anchor, lastRunSize);
 		op += lastRunSize;
 	}
 
--- a/lib/lz4/lz4_decompress.c~lz4-fix-kernel-decompression-speed
+++ a/lib/lz4/lz4_decompress.c
@@ -153,7 +153,7 @@  static FORCE_INLINE int LZ4_decompress_g
 		   && likely((endOnInput ? ip < shortiend : 1) &
 			     (op <= shortoend))) {
 			/* Copy the literals */
-			memcpy(op, ip, endOnInput ? 16 : 8);
+			LZ4_memcpy(op, ip, endOnInput ? 16 : 8);
 			op += length; ip += length;
 
 			/*
@@ -172,9 +172,9 @@  static FORCE_INLINE int LZ4_decompress_g
 			    (offset >= 8) &&
 			    (dict == withPrefix64k || match >= lowPrefix)) {
 				/* Copy the match. */
-				memcpy(op + 0, match + 0, 8);
-				memcpy(op + 8, match + 8, 8);
-				memcpy(op + 16, match + 16, 2);
+				LZ4_memcpy(op + 0, match + 0, 8);
+				LZ4_memcpy(op + 8, match + 8, 8);
+				LZ4_memcpy(op + 16, match + 16, 2);
 				op += length + MINMATCH;
 				/* Both stages worked, load the next token. */
 				continue;
@@ -263,7 +263,7 @@  static FORCE_INLINE int LZ4_decompress_g
 				}
 			}
 
-			memcpy(op, ip, length);
+			LZ4_memcpy(op, ip, length);
 			ip += length;
 			op += length;
 
@@ -350,7 +350,7 @@  _copy_match:
 				size_t const copySize = (size_t)(lowPrefix - match);
 				size_t const restSize = length - copySize;
 
-				memcpy(op, dictEnd - copySize, copySize);
+				LZ4_memcpy(op, dictEnd - copySize, copySize);
 				op += copySize;
 				if (restSize > (size_t)(op - lowPrefix)) {
 					/* overlap copy */
@@ -360,7 +360,7 @@  _copy_match:
 					while (op < endOfMatch)
 						*op++ = *copyFrom++;
 				} else {
-					memcpy(op, lowPrefix, restSize);
+					LZ4_memcpy(op, lowPrefix, restSize);
 					op += restSize;
 				}
 			}
@@ -386,7 +386,7 @@  _copy_match:
 				while (op < copyEnd)
 					*op++ = *match++;
 			} else {
-				memcpy(op, match, mlen);
+				LZ4_memcpy(op, match, mlen);
 			}
 			op = copyEnd;
 			if (op == oend)
@@ -400,7 +400,7 @@  _copy_match:
 			op[2] = match[2];
 			op[3] = match[3];
 			match += inc32table[offset];
-			memcpy(op + 4, match, 4);
+			LZ4_memcpy(op + 4, match, 4);
 			match -= dec64table[offset];
 		} else {
 			LZ4_copy8(op, match);
--- a/lib/lz4/lz4defs.h~lz4-fix-kernel-decompression-speed
+++ a/lib/lz4/lz4defs.h
@@ -137,6 +137,16 @@  static FORCE_INLINE void LZ4_writeLE16(v
 	return put_unaligned_le16(value, memPtr);
 }
 
+/*
+ * LZ4 relies on memcpy with a constant size being inlined. In freestanding
+ * environments, the compiler can't assume the implementation of memcpy() is
+ * standard compliant, so it can't apply its specialized memcpy() inlining
+ * logic. When possible, use __builtin_memcpy() to tell the compiler to analyze
+ * memcpy() as-if it were standard compliant, so it can inline it in freestanding
+ * environments. This is needed when decompressing the Linux Kernel, for example.
+ */
+#define LZ4_memcpy(dst, src, size) __builtin_memcpy(dst, src, size)
+
 static FORCE_INLINE void LZ4_copy8(void *dst, const void *src)
 {
 #if LZ4_ARCH64
--- a/lib/lz4/lz4hc_compress.c~lz4-fix-kernel-decompression-speed
+++ a/lib/lz4/lz4hc_compress.c
@@ -570,7 +570,7 @@  _Search3:
 			*op++ = (BYTE) lastRun;
 		} else
 			*op++ = (BYTE)(lastRun<<ML_BITS);
-		memcpy(op, anchor, iend - anchor);
+		LZ4_memcpy(op, anchor, iend - anchor);
 		op += iend - anchor;
 	}