
[V6,2/2,RESEND] ksm: replace jhash2 with faster hash

Message ID 20180418193220.4603-3-timofey.titovets@synesis.ru (mailing list archive)
State New, archived

Commit Message

Timofey Titovets April 18, 2018, 7:32 p.m. UTC
From: Timofey Titovets <nefelim4ag@gmail.com>

1. Pick up Sioh Lee's crc32 patch, after a long conversation
2. Merge it with my work on xxhash
3. Add autoselect code to choose the fastest hash helper.

The base idea is the same: replace jhash2 with something faster.

Perf numbers:
Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz
ksm: crc32c   hash() 12081 MB/s
ksm: xxh64    hash()  8770 MB/s
ksm: xxh32    hash()  4529 MB/s
ksm: jhash2   hash()  1569 MB/s

As jhash2 will always be slower (for data sizes like PAGE_SIZE),
just drop it from the candidates.

Add a function to autoselect the hash algorithm at boot,
based on hashing speed, as the raid6 code does.

Move the init of zero_checksum from ksm_init() to the first call of fasthash():
  1. KSM init runs early in kernel init;
     running perf-testing code on the main kernel boot thread looks bad to me.
  2. The crypto subsystem is not available that early in boot,
     so crc32c, even when compiled in, is not available.
     As crypto and ksm init both run at the subsys_initcall() (4) kernel
     init level, all possible consumers will run later, at levels 5+.
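The lazy first-use flow described above can be sketched as standalone C. This is a sketch of the dispatch shape only: the two hash bodies are placeholder stubs (not the real crc32c()/xxhash()), and choose_fastest_hash() skips the benchmark and just picks one, where the patch runs the 125 ms speed test and then initializes zero_checksum before retrying.

```c
#include <stdint.h>
#include <stddef.h>

#define HASH_NONE   0
#define HASH_CRC32C 1
#define HASH_XXHASH 2

static int fastest_hash = HASH_NONE;

/* Placeholder stubs standing in for the kernel's crc32c() and xxhash(). */
static uint32_t stub_crc32c(const void *p, size_t len)
{
	(void)p;
	return (uint32_t)len ^ 0xC;
}

static uint32_t stub_xxhash(const void *p, size_t len)
{
	(void)p;
	return (uint32_t)len ^ 0xA;
}

/* In the patch this runs the timed benchmark; here it just decides. */
static void choose_fastest_hash(void)
{
	fastest_hash = HASH_CRC32C;
}

/* Lazy dispatch: the first caller triggers selection, then retries,
 * so nothing runs on the boot thread and crypto is available by then. */
static uint32_t fasthash(const void *input, size_t length)
{
again:
	switch (fastest_hash) {
	case HASH_CRC32C:
		return stub_crc32c(input, length);
	case HASH_XXHASH:
		return stub_xxhash(input, length);
	default:
		choose_fastest_hash();
		/* the patch also sets zero_checksum here, via fasthash() */
		goto again;
	}
}
```

After the first call, every later call takes the direct switch case with no extra branch cost beyond the dispatch itself.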

Output after KSM's first attempt to hash a page:
ksm: crc32c hash() 15218 MB/s
ksm: xxhash hash()  8640 MB/s
ksm: choice crc32c as hash function

Thanks.

Changes:
  v1 -> v2:
    - Move xxhash() to xxhash.h/c and separate patches
  v2 -> v3:
    - Move xxhash() from xxhash.c to xxhash.h
    - Replace xxhash_t with 'unsigned long'
    - Update the kerneldoc above xxhash()
  v3 -> v4:
    - Merge the xxhash/crc32 patches
    - Replace crc32 with crc32c (crc32 has the same speed as jhash2)
    - Add the auto speed test and auto choice of the fastest hash function
  v4 -> v5:
    - Pick up the missed xxhash patch
    - Update the code with compile-time-chosen xxhash
    - Add more macros to make the code more readable
    - As it is now only possible to use xxhash or crc32c,
      on crc32c allocation error skip the speed test and fall back to xxhash
    - To work around the too-early-init problem (crc32c not available),
      move the zero_checksum init to the first call of fasthash()
    - Don't allocate a page for hash testing; use the arch zero page for that
  v5 -> v6:
    - Use libcrc32c instead of the CRYPTO API, mainly to simplify
      the code/Kconfig deps
    - Add crc32c_available():
      libcrc32c will BUG_ON on crc32c problems,
      so test crc32c availability with crc32c_available()
    - Simplify choice_fastest_hash()
    - Simplify fasthash()
    - struct rmap_item and struct stable_node have sizeof == 64 on x86_64,
      which makes them cache friendly. As we don't suffer from hash collisions,
      change the hash type from unsigned long back to u32.
    - Fix the kbuild robot warning; make all local functions static

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Signed-off-by: leesioh <solee@os.korea.ac.kr>
CC: Andrea Arcangeli <aarcange@redhat.com>
CC: linux-mm@kvack.org
CC: kvm@vger.kernel.org
---
 mm/Kconfig |  2 ++
 mm/ksm.c   | 93 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 91 insertions(+), 4 deletions(-)

Comments

Claudio Imbrenda May 8, 2018, 3:26 p.m. UTC | #1
On Wed, 18 Apr 2018 22:32:20 +0300
Timofey Titovets <nefelim4ag@gmail.com> wrote:

> From: Timofey Titovets <nefelim4ag@gmail.com>
> 
> 1. Pickup, Sioh Lee crc32 patch, after some long conversation
> 2. Merge with my work on xxhash
> 3. Add autoselect code to choice fastest hash helper.
> 
> Base idea are same, replace jhash2 with something faster.
> 
> Perf numbers:
> Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz
> ksm: crc32c   hash() 12081 MB/s
> ksm: xxh64    hash()  8770 MB/s
> ksm: xxh32    hash()  4529 MB/s
> ksm: jhash2   hash()  1569 MB/s
> 
> As jhash2 always will be slower (for data size like PAGE_SIZE),
> just drop it from choice.
> 
> Add function to autoselect hash algo on boot,
> based on hashing speed, like raid6 code does.
> 
> Move init of zero_checksum from init, to first call of fasthash():
>   1. KSM Init run on early kernel init,
>      run perf testing stuff on main kernel boot thread looks bad to

This is my personal opinion, but I think it would be better and more
uniform to have it during boot, like raid6. It doesn't take too much
time, and it lets you see immediately in dmesg what is going on.

> me. 2. Crypto subsystem not avaliable at that early booting,
>      so crc32c even, compiled in, not avaliable
>      As crypto and ksm init, run at subsys_initcall() (4) kernel
> level of init, all possible consumers will run later at 5+ levels

Have you tried moving ksm to a later stage? Before commit 
a64fb3cd610c8e680, KSM was in fact initialized at level 6. After all, KSM
cannot be triggered until userspace starts.

> Output after first try of KSM to hash page:
> ksm: crc32c hash() 15218 MB/s
> ksm: xxhash hash()  8640 MB/s
> ksm: choice crc32c as hash function
> 
> Thanks.
> 
> Changes:
>   v1 -> v2:
>     - Move xxhash() to xxhash.h/c and separate patches
>   v2 -> v3:
>     - Move xxhash() xxhash.c -> xxhash.h
>     - replace xxhash_t with 'unsigned long'
>     - update kerneldoc above xxhash()
>   v3 -> v4:
>     - Merge xxhash/crc32 patches
>     - Replace crc32 with crc32c (crc32 have same as jhash2 speed)
>     - Add auto speed test and auto choice of fastest hash function
>   v4 -> v5:
>     - Pickup missed xxhash patch
>     - Update code with compile time choicen xxhash
>     - Add more macros to make code more readable
>     - As now that only possible use xxhash or crc32c,
>       on crc32c allocation error, skip speed test and fallback to
> xxhash
>     - For workaround too early init problem (crc32c not avaliable),
>       move zero_checksum init to first call of fastcall()
>     - Don't alloc page for hash testing, use arch zero pages for that
>   v5 -> v6:
>     - Use libcrc32c instead of CRYPTO API, mainly for
>       code/Kconfig deps Simplification
>     - Add crc32c_available():
>       libcrc32c will BUG_ON on crc32c problems,
>       so test crc32c avaliable by crc32c_available()
>     - Simplify choice_fastest_hash()
>     - Simplify fasthash()
>     - struct rmap_item && stable_node have sizeof == 64 on x86_64,
>       that makes them cache friendly. As we don't suffer from hash
> collisions, change hash type from unsigned long back to u32.
>     - Fix kbuild robot warning, make all local functions static
> 
> Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
> Signed-off-by: leesioh <solee@os.korea.ac.kr>
> CC: Andrea Arcangeli <aarcange@redhat.com>
> CC: linux-mm@kvack.org
> CC: kvm@vger.kernel.org
> ---
>  mm/Kconfig |  2 ++
>  mm/ksm.c   | 93
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--- 2
> files changed, 91 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 03ff7703d322..b60bee4bb07e 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -305,6 +305,8 @@ config MMU_NOTIFIER
>  config KSM
>  	bool "Enable KSM for page merging"
>  	depends on MMU
> +	select XXHASH
> +	select LIBCRC32C
>  	help
>  	  Enable Kernel Samepage Merging: KSM periodically scans
> those areas of an application's address space that an app has advised
> may be diff --git a/mm/ksm.c b/mm/ksm.c
> index c406f75957ad..2b84407fb918 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -25,7 +25,6 @@
>  #include <linux/pagemap.h>
>  #include <linux/rmap.h>
>  #include <linux/spinlock.h>
> -#include <linux/jhash.h>
>  #include <linux/delay.h>
>  #include <linux/kthread.h>
>  #include <linux/wait.h>
> @@ -41,6 +40,13 @@
>  #include <linux/numa.h>
> 
>  #include <asm/tlbflush.h>
> +
> +/* Support for xxhash and crc32c */
> +#include <crypto/hash.h>
> +#include <linux/crc32c.h>
> +#include <linux/xxhash.h>
> +#include <linux/sizes.h>
> +
>  #include "internal.h"
> 
>  #ifdef CONFIG_NUMA
> @@ -284,6 +290,87 @@ static DEFINE_SPINLOCK(ksm_mmlist_lock);
>  		sizeof(struct __struct), __alignof__(struct
> __struct),\ (__flags), NULL)
> 
> +#define TIME_125MS  (HZ >> 3)
> +#define PERF_TO_MBS(X) (X*PAGE_SIZE*(1 << 3)/(SZ_1M))
> +
> +#define HASH_NONE   0
> +#define HASH_CRC32C 1
> +#define HASH_XXHASH 2
> +
> +static int fastest_hash = HASH_NONE;
> +
> +static bool __init crc32c_available(void)
> +{
> +	static struct shash_desc desc;
> +
> +	desc.tfm = crypto_alloc_shash("crc32c", 0, 0);

will this work without the crypto api?

> +	desc.flags = 0;
> +
> +	if (IS_ERR(desc.tfm)) {
> +		pr_warn("ksm: alloc crc32c shash error %ld\n",
> +			-PTR_ERR(desc.tfm));
> +		return false;
> +	}
> +
> +	crypto_free_shash(desc.tfm);
> +	return true;
> +}
> +
> +static void __init choice_fastest_hash(void)

s/choice/choose/

> +{
> +
> +	unsigned long je;
> +	unsigned long perf_crc32c = 0;
> +	unsigned long perf_xxhash = 0;
> +
> +	fastest_hash = HASH_XXHASH;
> +	if (!crc32c_available())
> +		goto out;
> +
> +	preempt_disable();
> +	je = jiffies + TIME_125MS;
> +	while (time_before(jiffies, je)) {
> +		crc32c(0, ZERO_PAGE(0), PAGE_SIZE);
> +		perf_crc32c++;
> +	}
> +	preempt_enable();
> +
> +	preempt_disable();
> +	je = jiffies + TIME_125MS;
> +	while (time_before(jiffies, je)) {
> +		xxhash(ZERO_PAGE(0), PAGE_SIZE, 0);
> +		perf_xxhash++;
> +	}
> +	preempt_enable();
> +
> +	pr_info("ksm: crc32c hash() %5ld MB/s\n",
> PERF_TO_MBS(perf_crc32c));
> +	pr_info("ksm: xxhash hash() %5ld MB/s\n",
> PERF_TO_MBS(perf_xxhash)); +
> +	if (perf_crc32c > perf_xxhash)
> +		fastest_hash = HASH_CRC32C;
> +out:
> +	if (fastest_hash == HASH_CRC32C)
> +		pr_info("ksm: choice crc32c as hash function\n");
> +	else
> +		pr_info("ksm: choice xxhash as hash function\n");
> +}

I wonder if this can be generalized to have a list of possible hash
functions, filtered by availability, and then tested for performance,
more like the raid6 functions.
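One way to sketch the generalization suggested here is a candidate table, in the spirit of raid6's algorithm list. Everything below is hypothetical illustration, not code from the patch: the struct, names, and the constant-returning availability/benchmark stubs are mine; in a real version available() would probe the crypto/lib support and bench() would run the timed loop.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* One entry per hash candidate: an availability probe plus a benchmark
 * that returns iterations completed in a fixed time window. */
struct hash_candidate {
	const char *name;
	bool (*available)(void);
	unsigned long (*bench)(void);
};

/* Stubs for illustration; real probes/benchmarks would go here. */
static bool always_available(void) { return true; }
static bool never_available(void)  { return false; }
static unsigned long bench_fast(void) { return 1000; }
static unsigned long bench_slow(void) { return 100; }

static const struct hash_candidate candidates[] = {
	{ "crc32c", always_available, bench_fast },
	{ "xxhash", always_available, bench_slow },
	{ "jhash2", never_available,  bench_slow },	/* filtered out */
};

/* Filter by availability, then keep the best performer. */
static const char *choose_hash(void)
{
	const char *best = NULL;
	unsigned long best_perf = 0;
	size_t i;

	for (i = 0; i < sizeof(candidates) / sizeof(candidates[0]); i++) {
		unsigned long perf;

		if (!candidates[i].available())
			continue;
		perf = candidates[i].bench();
		if (perf > best_perf) {
			best_perf = perf;
			best = candidates[i].name;
		}
	}
	return best;
}
```

Adding a new candidate then only means appending a table entry, which is the main appeal of the raid6-style approach.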

> +
> +static u32 fasthash(const void *input, size_t length)
> +{
> +again:
> +	switch (fastest_hash) {
> +	case HASH_CRC32C:
> +		return crc32c(0, input, length);
> +	case HASH_XXHASH:
> +		return xxhash(input, length, 0);
> +	default:
> +		choice_fastest_hash();

same here s/choice/choose/

> +		/* The correct value depends on page size and
> endianness */
> +		zero_checksum = fasthash(ZERO_PAGE(0), PAGE_SIZE);
> +		goto again;
> +	}
> +}
> +

so if I understand correctly, the benchmark function will be called
only when the function is called for the first time?

>  static int __init ksm_slab_init(void)
>  {
>  	rmap_item_cache = KSM_KMEM_CACHE(rmap_item, 0);
> @@ -979,7 +1066,7 @@ static u32 calc_checksum(struct page *page)
>  {
>  	u32 checksum;
>  	void *addr = kmap_atomic(page);
> -	checksum = jhash2(addr, PAGE_SIZE / 4, 17);
> +	checksum = fasthash(addr, PAGE_SIZE);
>  	kunmap_atomic(addr);
>  	return checksum;
>  }
> @@ -3061,8 +3148,6 @@ static int __init ksm_init(void)
>  	struct task_struct *ksm_thread;
>  	int err;
> 
> -	/* The correct value depends on page size and endianness */
> -	zero_checksum = calc_checksum(ZERO_PAGE(0));
>  	/* Default to false for backwards compatibility */
>  	ksm_use_zero_pages = false;
>
Timofey Titovets May 11, 2018, 11:06 p.m. UTC | #2
вт, 8 мая 2018 г. в 18:26, Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>:

> On Wed, 18 Apr 2018 22:32:20 +0300
> Timofey Titovets <nefelim4ag@gmail.com> wrote:

> > From: Timofey Titovets <nefelim4ag@gmail.com>
> >
> > 1. Pickup, Sioh Lee crc32 patch, after some long conversation
> > 2. Merge with my work on xxhash
> > 3. Add autoselect code to choice fastest hash helper.
> >
> > Base idea are same, replace jhash2 with something faster.
> >
> > Perf numbers:
> > Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz
> > ksm: crc32c   hash() 12081 MB/s
> > ksm: xxh64    hash()  8770 MB/s
> > ksm: xxh32    hash()  4529 MB/s
> > ksm: jhash2   hash()  1569 MB/s
> >
> > As jhash2 always will be slower (for data size like PAGE_SIZE),
> > just drop it from choice.
> >
> > Add function to autoselect hash algo on boot,
> > based on hashing speed, like raid6 code does.
> >
> > Move init of zero_checksum from init, to first call of fasthash():
> >   1. KSM Init run on early kernel init,
> >      run perf testing stuff on main kernel boot thread looks bad to

> This is my personal opinion, but I think it would be better and more
> uniform to have it during boot like raid6. It doesn't take too much
> time, and it allows to see immediately in dmesg what is going on.

I don't like such things at boot; they slow down boot and do useless
work in *MOST* cases.

E.g. anyone who uses btrfs as rootfs must wait for raid6_pq init before
mount, even if they don't use the raid56 functionality.

Same for ksm: who uses ksm? I think 90% of current users
are servers with KVM VMs.

I.e. I don't think you use it on your notebook, and adding 250ms to
every boot, even if you never use ksm, looks like a bad idea to me.

And as this is an mm subsystem, that would mean *every Linux device in
the world* with ksm compiled in spends time and energy on ksm init.

> > me. 2. Crypto subsystem not avaliable at that early booting,
> >      so crc32c even, compiled in, not avaliable
> >      As crypto and ksm init, run at subsys_initcall() (4) kernel
> > level of init, all possible consumers will run later at 5+ levels

> have you tried moving ksm to a later stage? before commit
> a64fb3cd610c8e680 KSM was in fact initialized at level 6. After all, KSM
> cannot be triggered until userspace starts.

Of course, and that works,
but I don't have sufficient competence
to suggest such changes.

> > Output after first try of KSM to hash page:
> > ksm: crc32c hash() 15218 MB/s
> > ksm: xxhash hash()  8640 MB/s
> > ksm: choice crc32c as hash function
> >
> > Thanks.
> >
> > Changes:
> >   v1 -> v2:
> >     - Move xxhash() to xxhash.h/c and separate patches
> >   v2 -> v3:
> >     - Move xxhash() xxhash.c -> xxhash.h
> >     - replace xxhash_t with 'unsigned long'
> >     - update kerneldoc above xxhash()
> >   v3 -> v4:
> >     - Merge xxhash/crc32 patches
> >     - Replace crc32 with crc32c (crc32 have same as jhash2 speed)
> >     - Add auto speed test and auto choice of fastest hash function
> >   v4 -> v5:
> >     - Pickup missed xxhash patch
> >     - Update code with compile time choicen xxhash
> >     - Add more macros to make code more readable
> >     - As now that only possible use xxhash or crc32c,
> >       on crc32c allocation error, skip speed test and fallback to
> > xxhash
> >     - For workaround too early init problem (crc32c not avaliable),
> >       move zero_checksum init to first call of fastcall()
> >     - Don't alloc page for hash testing, use arch zero pages for that
> >   v5 -> v6:
> >     - Use libcrc32c instead of CRYPTO API, mainly for
> >       code/Kconfig deps Simplification
> >     - Add crc32c_available():
> >       libcrc32c will BUG_ON on crc32c problems,
> >       so test crc32c avaliable by crc32c_available()
> >     - Simplify choice_fastest_hash()
> >     - Simplify fasthash()
> >     - struct rmap_item && stable_node have sizeof == 64 on x86_64,
> >       that makes them cache friendly. As we don't suffer from hash
> > collisions, change hash type from unsigned long back to u32.
> >     - Fix kbuild robot warning, make all local functions static
> >
> > Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
> > Signed-off-by: leesioh <solee@os.korea.ac.kr>
> > CC: Andrea Arcangeli <aarcange@redhat.com>
> > CC: linux-mm@kvack.org
> > CC: kvm@vger.kernel.org
> > ---
> >  mm/Kconfig |  2 ++
> >  mm/ksm.c   | 93
> > +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--- 2
> > files changed, 91 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 03ff7703d322..b60bee4bb07e 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -305,6 +305,8 @@ config MMU_NOTIFIER
> >  config KSM
> >       bool "Enable KSM for page merging"
> >       depends on MMU
> > +     select XXHASH
> > +     select LIBCRC32C
> >       help
> >         Enable Kernel Samepage Merging: KSM periodically scans
> > those areas of an application's address space that an app has advised
> > may be diff --git a/mm/ksm.c b/mm/ksm.c
> > index c406f75957ad..2b84407fb918 100644
> > --- a/mm/ksm.c
> > +++ b/mm/ksm.c
> > @@ -25,7 +25,6 @@
> >  #include <linux/pagemap.h>
> >  #include <linux/rmap.h>
> >  #include <linux/spinlock.h>
> > -#include <linux/jhash.h>
> >  #include <linux/delay.h>
> >  #include <linux/kthread.h>
> >  #include <linux/wait.h>
> > @@ -41,6 +40,13 @@
> >  #include <linux/numa.h>
> >
> >  #include <asm/tlbflush.h>
> > +
> > +/* Support for xxhash and crc32c */
> > +#include <crypto/hash.h>
> > +#include <linux/crc32c.h>
> > +#include <linux/xxhash.h>
> > +#include <linux/sizes.h>
> > +
> >  #include "internal.h"
> >
> >  #ifdef CONFIG_NUMA
> > @@ -284,6 +290,87 @@ static DEFINE_SPINLOCK(ksm_mmlist_lock);
> >               sizeof(struct __struct), __alignof__(struct
> > __struct),\ (__flags), NULL)
> >
> > +#define TIME_125MS  (HZ >> 3)
> > +#define PERF_TO_MBS(X) (X*PAGE_SIZE*(1 << 3)/(SZ_1M))
> > +
> > +#define HASH_NONE   0
> > +#define HASH_CRC32C 1
> > +#define HASH_XXHASH 2
> > +
> > +static int fastest_hash = HASH_NONE;
> > +
> > +static bool __init crc32c_available(void)
> > +{
> > +     static struct shash_desc desc;
> > +
> > +     desc.tfm = crypto_alloc_shash("crc32c", 0, 0);

> will this work without the crypto api?

I don't know a way to compile the kernel without the crypto API;
too many different subsystems depend on it,
if I read Kconfig correctly, of course.

> > +     desc.flags = 0;
> > +
> > +     if (IS_ERR(desc.tfm)) {
> > +             pr_warn("ksm: alloc crc32c shash error %ld\n",
> > +                     -PTR_ERR(desc.tfm));
> > +             return false;
> > +     }
> > +
> > +     crypto_free_shash(desc.tfm);
> > +     return true;
> > +}
> > +
> > +static void __init choice_fastest_hash(void)

> s/choice/choose/

> > +{
> > +
> > +     unsigned long je;
> > +     unsigned long perf_crc32c = 0;
> > +     unsigned long perf_xxhash = 0;
> > +
> > +     fastest_hash = HASH_XXHASH;
> > +     if (!crc32c_available())
> > +             goto out;
> > +
> > +     preempt_disable();
> > +     je = jiffies + TIME_125MS;
> > +     while (time_before(jiffies, je)) {
> > +             crc32c(0, ZERO_PAGE(0), PAGE_SIZE);
> > +             perf_crc32c++;
> > +     }
> > +     preempt_enable();
> > +
> > +     preempt_disable();
> > +     je = jiffies + TIME_125MS;
> > +     while (time_before(jiffies, je)) {
> > +             xxhash(ZERO_PAGE(0), PAGE_SIZE, 0);
> > +             perf_xxhash++;
> > +     }
> > +     preempt_enable();
> > +
> > +     pr_info("ksm: crc32c hash() %5ld MB/s\n",
> > PERF_TO_MBS(perf_crc32c));
> > +     pr_info("ksm: xxhash hash() %5ld MB/s\n",
> > PERF_TO_MBS(perf_xxhash)); +
> > +     if (perf_crc32c > perf_xxhash)
> > +             fastest_hash = HASH_CRC32C;
> > +out:
> > +     if (fastest_hash == HASH_CRC32C)
> > +             pr_info("ksm: choice crc32c as hash function\n");
> > +     else
> > +             pr_info("ksm: choice xxhash as hash function\n");
> > +}

> I wonder if this can be generalized to have a list of possible hash
> functions, filtered by availability, and then tested for performance,
> more like the raid6 functions.

IIRC:
We talked about that on an old version of the patch set,
and we decided that:
  - In the ideal situation, ksm should always use only one hash function.
    But we were afraid that hardware-accelerated crc32c could somehow be
    missing, so xxhash was added as an appropriate general-purpose
    fallback, which should work well enough for ksm in most cases.

So adding more complex logic, like raid6_pq has with all its different
instruction sets, is overkill.

> > +
> > +static u32 fasthash(const void *input, size_t length)
> > +{
> > +again:
> > +     switch (fastest_hash) {
> > +     case HASH_CRC32C:
> > +             return crc32c(0, input, length);
> > +     case HASH_XXHASH:
> > +             return xxhash(input, length, 0);
> > +     default:
> > +             choice_fastest_hash();

> same here s/choice/choose/

> > +             /* The correct value depends on page size and
> > endianness */
> > +             zero_checksum = fasthash(ZERO_PAGE(0), PAGE_SIZE);
> > +             goto again;
> > +     }
> > +}
> > +

> so if I understand correctly, the benchmark function will be called
> only when the function is called for the first time?

Yes, that's right.
It's a little bit tricky,
but it will be called only from the KSM thread,
and only when the KSM thread tries to do some useful work.

So it must not block anything.

Thanks.
Claudio Imbrenda May 14, 2018, 10:17 a.m. UTC | #3
On Sat, 12 May 2018 02:06:20 +0300
Timofey Titovets <nefelim4ag@gmail.com> wrote:

> вт, 8 мая 2018 г. в 18:26, Claudio Imbrenda
> <imbrenda@linux.vnet.ibm.com>:
> 
> > On Wed, 18 Apr 2018 22:32:20 +0300
> > Timofey Titovets <nefelim4ag@gmail.com> wrote:  
> 
> > > From: Timofey Titovets <nefelim4ag@gmail.com>
> > >
> > > 1. Pickup, Sioh Lee crc32 patch, after some long conversation
> > > 2. Merge with my work on xxhash
> > > 3. Add autoselect code to choice fastest hash helper.
> > >
> > > Base idea are same, replace jhash2 with something faster.
> > >
> > > Perf numbers:
> > > Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz
> > > ksm: crc32c   hash() 12081 MB/s
> > > ksm: xxh64    hash()  8770 MB/s
> > > ksm: xxh32    hash()  4529 MB/s
> > > ksm: jhash2   hash()  1569 MB/s
> > >
> > > As jhash2 always will be slower (for data size like PAGE_SIZE),
> > > just drop it from choice.
> > >
> > > Add function to autoselect hash algo on boot,
> > > based on hashing speed, like raid6 code does.
> > >
> > > Move init of zero_checksum from init, to first call of fasthash():
> > >   1. KSM Init run on early kernel init,
> > >      run perf testing stuff on main kernel boot thread looks bad
> > > to  
> 
> > This is my personal opinion, but I think it would be better and more
> > uniform to have it during boot like raid6. It doesn't take too much
> > time, and it allows to see immediately in dmesg what is going on.  
> 
> I don't like such things at boot, that will slowdown boot and add
> useless work in *MOST* cases.
> 
> ex. Anyone who use btrfs as rootfs must wait raid6_pq init, for mount.
> Even if they didn't use raid56 functionality.
> 
> Same for ksm, who use ksm? I think that 90% of users currently
> are servers with KVM's VMs.
> 
> i.e. i don't think that you use it on your notebook,
> and add 250ms to every bootup, even, if you did not use ksm
> looks as bad idea for me.
> 
> And as that a mm subsystem, that will lead to *every linux device in
> the world*
> with compiled in ksm, will spend time and energy to ksm init.

fair enough

> > > me. 2. Crypto subsystem not avaliable at that early booting,
> > >      so crc32c even, compiled in, not avaliable
> > >      As crypto and ksm init, run at subsys_initcall() (4) kernel
> > > level of init, all possible consumers will run later at 5+
> > > levels  
> 
> > have you tried moving ksm to a later stage? before commit
> > a64fb3cd610c8e680 KSM was in fact initialized at level 6. After
> > all, KSM cannot be triggered until userspace starts.  
> 
> Of course and that works,
> but i didn't have sufficient competence,
> to suggest such changes.
> 
> > > Output after first try of KSM to hash page:
> > > ksm: crc32c hash() 15218 MB/s
> > > ksm: xxhash hash()  8640 MB/s
> > > ksm: choice crc32c as hash function
> > >
> > > Thanks.
> > >
> > > Changes:
> > >   v1 -> v2:
> > >     - Move xxhash() to xxhash.h/c and separate patches
> > >   v2 -> v3:
> > >     - Move xxhash() xxhash.c -> xxhash.h
> > >     - replace xxhash_t with 'unsigned long'
> > >     - update kerneldoc above xxhash()
> > >   v3 -> v4:
> > >     - Merge xxhash/crc32 patches
> > >     - Replace crc32 with crc32c (crc32 have same as jhash2 speed)
> > >     - Add auto speed test and auto choice of fastest hash function
> > >   v4 -> v5:
> > >     - Pickup missed xxhash patch
> > >     - Update code with compile time choicen xxhash
> > >     - Add more macros to make code more readable
> > >     - As now that only possible use xxhash or crc32c,
> > >       on crc32c allocation error, skip speed test and fallback to
> > > xxhash
> > >     - For workaround too early init problem (crc32c not
> > > avaliable), move zero_checksum init to first call of fastcall()
> > >     - Don't alloc page for hash testing, use arch zero pages for
> > > that v5 -> v6:
> > >     - Use libcrc32c instead of CRYPTO API, mainly for
> > >       code/Kconfig deps Simplification
> > >     - Add crc32c_available():
> > >       libcrc32c will BUG_ON on crc32c problems,
> > >       so test crc32c avaliable by crc32c_available()
> > >     - Simplify choice_fastest_hash()
> > >     - Simplify fasthash()
> > >     - struct rmap_item && stable_node have sizeof == 64 on x86_64,
> > >       that makes them cache friendly. As we don't suffer from hash
> > > collisions, change hash type from unsigned long back to u32.
> > >     - Fix kbuild robot warning, make all local functions static
> > >
> > > Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
> > > Signed-off-by: leesioh <solee@os.korea.ac.kr>
> > > CC: Andrea Arcangeli <aarcange@redhat.com>
> > > CC: linux-mm@kvack.org
> > > CC: kvm@vger.kernel.org
> > > ---
> > >  mm/Kconfig |  2 ++
> > >  mm/ksm.c   | 93
> > > +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--- 2
> > > files changed, 91 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/mm/Kconfig b/mm/Kconfig
> > > index 03ff7703d322..b60bee4bb07e 100644
> > > --- a/mm/Kconfig
> > > +++ b/mm/Kconfig
> > > @@ -305,6 +305,8 @@ config MMU_NOTIFIER
> > >  config KSM
> > >       bool "Enable KSM for page merging"
> > >       depends on MMU
> > > +     select XXHASH
> > > +     select LIBCRC32C
> > >       help
> > >         Enable Kernel Samepage Merging: KSM periodically scans
> > > those areas of an application's address space that an app has
> > > advised may be diff --git a/mm/ksm.c b/mm/ksm.c
> > > index c406f75957ad..2b84407fb918 100644
> > > --- a/mm/ksm.c
> > > +++ b/mm/ksm.c
> > > @@ -25,7 +25,6 @@
> > >  #include <linux/pagemap.h>
> > >  #include <linux/rmap.h>
> > >  #include <linux/spinlock.h>
> > > -#include <linux/jhash.h>
> > >  #include <linux/delay.h>
> > >  #include <linux/kthread.h>
> > >  #include <linux/wait.h>
> > > @@ -41,6 +40,13 @@
> > >  #include <linux/numa.h>
> > >
> > >  #include <asm/tlbflush.h>
> > > +
> > > +/* Support for xxhash and crc32c */
> > > +#include <crypto/hash.h>
> > > +#include <linux/crc32c.h>
> > > +#include <linux/xxhash.h>
> > > +#include <linux/sizes.h>
> > > +
> > >  #include "internal.h"
> > >
> > >  #ifdef CONFIG_NUMA
> > > @@ -284,6 +290,87 @@ static DEFINE_SPINLOCK(ksm_mmlist_lock);
> > >               sizeof(struct __struct), __alignof__(struct
> > > __struct),\ (__flags), NULL)
> > >
> > > +#define TIME_125MS  (HZ >> 3)
> > > +#define PERF_TO_MBS(X) (X*PAGE_SIZE*(1 << 3)/(SZ_1M))
> > > +
> > > +#define HASH_NONE   0
> > > +#define HASH_CRC32C 1
> > > +#define HASH_XXHASH 2
> > > +
> > > +static int fastest_hash = HASH_NONE;
> > > +
> > > +static bool __init crc32c_available(void)
> > > +{
> > > +     static struct shash_desc desc;
> > > +
> > > +     desc.tfm = crypto_alloc_shash("crc32c", 0, 0);  
> 
> > will this work without the crypto api?  
> 
> I didn't know a way to compile kernel without crypto api,
> To many different sub systems depends on him,
> if i read Kconfig correctly of course.

I'm confused here. Why did you want to drop the dependency on the
crypto API in Kconfig if you are using it anyway? Or did I
misunderstand?

> > > +     desc.flags = 0;
> > > +
> > > +     if (IS_ERR(desc.tfm)) {
> > > +             pr_warn("ksm: alloc crc32c shash error %ld\n",
> > > +                     -PTR_ERR(desc.tfm));
> > > +             return false;
> > > +     }
> > > +
> > > +     crypto_free_shash(desc.tfm);
> > > +     return true;
> > > +}
> > > +
> > > +static void __init choice_fastest_hash(void)  
> 
> > s/choice/choose/  
> 
> > > +{
> > > +
> > > +     unsigned long je;
> > > +     unsigned long perf_crc32c = 0;
> > > +     unsigned long perf_xxhash = 0;
> > > +
> > > +     fastest_hash = HASH_XXHASH;
> > > +     if (!crc32c_available())
> > > +             goto out;
> > > +
> > > +     preempt_disable();
> > > +     je = jiffies + TIME_125MS;
> > > +     while (time_before(jiffies, je)) {
> > > +             crc32c(0, ZERO_PAGE(0), PAGE_SIZE);
> > > +             perf_crc32c++;
> > > +     }
> > > +     preempt_enable();
> > > +
> > > +     preempt_disable();
> > > +     je = jiffies + TIME_125MS;
> > > +     while (time_before(jiffies, je)) {
> > > +             xxhash(ZERO_PAGE(0), PAGE_SIZE, 0);
> > > +             perf_xxhash++;
> > > +     }
> > > +     preempt_enable();
> > > +
> > > +     pr_info("ksm: crc32c hash() %5ld MB/s\n",
> > > PERF_TO_MBS(perf_crc32c));
> > > +     pr_info("ksm: xxhash hash() %5ld MB/s\n",
> > > PERF_TO_MBS(perf_xxhash)); +
> > > +     if (perf_crc32c > perf_xxhash)
> > > +             fastest_hash = HASH_CRC32C;
> > > +out:
> > > +     if (fastest_hash == HASH_CRC32C)
> > > +             pr_info("ksm: choice crc32c as hash function\n");
> > > +     else
> > > +             pr_info("ksm: choice xxhash as hash function\n");
> > > +}  
> 
> > I wonder if this can be generalized to have a list of possible hash
> > functions, filtered by availability, and then tested for
> > performance, more like the raid6 functions.  
> 
> IIRC:
> We was talk about that on old version of patch set.
> And we decide what:
>   - in ideal situation, ksm must use only one hash function, always.
>     But, we afraid about that crc32c with hardware acceleration, can
> be missed by some way.
>     So, as appropriate fallback, xxhash added, as general proporse,
> which must work
>     good enough for ksm in most cases.
> 
> So adding more complex logic, like raid6_pq have with all of different
> instruction set are overkill.

fair enough

> > > +
> > > +static u32 fasthash(const void *input, size_t length)
> > > +{
> > > +again:
> > > +     switch (fastest_hash) {
> > > +     case HASH_CRC32C:
> > > +             return crc32c(0, input, length);
> > > +     case HASH_XXHASH:
> > > +             return xxhash(input, length, 0);
> > > +     default:
> > > +             choice_fastest_hash();  
> 
> > same here s/choice/choose/  
> 
> > > +             /* The correct value depends on page size and
> > > endianness */
> > > +             zero_checksum = fasthash(ZERO_PAGE(0), PAGE_SIZE);
> > > +             goto again;
> > > +     }
> > > +}
> > > +  
> 
> > so if I understand correctly, the benchmark function will be called
> > only when the function is called for the first time?  
> 
> yes, that is.
> That a little bit tricky,
> but it's will be called only from KSM thread,
> and only what KSM thread will try do some useful work.
> 
> So that must not block anything.
> 
> Thanks.


best regards

Claudio Imbrenda
Timofey Titovets May 16, 2018, 10:26 a.m. UTC | #4
пн, 14 мая 2018 г. в 13:17, Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>:

> On Sat, 12 May 2018 02:06:20 +0300
> Timofey Titovets <nefelim4ag@gmail.com> wrote:

> > вт, 8 мая 2018 г. в 18:26, Claudio Imbrenda
> > <imbrenda@linux.vnet.ibm.com>:
> >
> > > On Wed, 18 Apr 2018 22:32:20 +0300
> > > Timofey Titovets <nefelim4ag@gmail.com> wrote:
> >
> > > > From: Timofey Titovets <nefelim4ag@gmail.com>
> > > >
> > > > 1. Pickup, Sioh Lee crc32 patch, after some long conversation
> > > > 2. Merge with my work on xxhash
> > > > 3. Add autoselect code to choice fastest hash helper.
> > > >
> > > > Base idea are same, replace jhash2 with something faster.
> > > >
> > > > Perf numbers:
> > > > Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz
> > > > ksm: crc32c   hash() 12081 MB/s
> > > > ksm: xxh64    hash()  8770 MB/s
> > > > ksm: xxh32    hash()  4529 MB/s
> > > > ksm: jhash2   hash()  1569 MB/s
> > > >
> > > > As jhash2 always will be slower (for data size like PAGE_SIZE),
> > > > just drop it from choice.
> > > >
> > > > Add function to autoselect hash algo on boot,
> > > > based on hashing speed, like raid6 code does.
> > > >
> > > > Move init of zero_checksum from init, to first call of fasthash():
> > > >   1. KSM Init run on early kernel init,
> > > >      run perf testing stuff on main kernel boot thread looks bad
> > > > to
> >
> > > This is my personal opinion, but I think it would be better and more
> > > uniform to have it during boot like raid6. It doesn't take too much
> > > time, and it allows to see immediately in dmesg what is going on.
> >
> > I don't like such things at boot: they slow down boot and add
> > useless work in *MOST* cases.
> >
> > e.g. anyone who uses btrfs as rootfs must wait for raid6_pq init
> > before mount, even if they don't use the raid56 functionality.
> >
> > Same for ksm: who uses ksm? I think that 90% of users currently
> > are servers with KVM VMs.
> >
> > i.e. I don't think you use it on your notebook,
> > and adding 250ms to every bootup, even if you never use ksm,
> > looks like a bad idea to me.
> >
> > And as this is an mm subsystem, that would lead to *every linux
> > device in the world* with ksm compiled in spending time and energy
> > on ksm init.

> fair enough

> > > > me. 2. Crypto subsystem not avaliable at that early booting,
> > > >      so crc32c even, compiled in, not avaliable
> > > >      As crypto and ksm init, run at subsys_initcall() (4) kernel
> > > > level of init, all possible consumers will run later at 5+
> > > > levels
> >
> > > have you tried moving ksm to a later stage? before commit
> > > a64fb3cd610c8e680 KSM was in fact initialized at level 6. After
> > > all, KSM cannot be triggered until userspace starts.
> >
> > Of course, and that works,
> > but I don't have sufficient competence
> > to suggest such changes.
> >
> > > > Output after first try of KSM to hash page:
> > > > ksm: crc32c hash() 15218 MB/s
> > > > ksm: xxhash hash()  8640 MB/s
> > > > ksm: choice crc32c as hash function
> > > >
> > > > Thanks.
> > > >
> > > > Changes:
> > > >   v1 -> v2:
> > > >     - Move xxhash() to xxhash.h/c and separate patches
> > > >   v2 -> v3:
> > > >     - Move xxhash() xxhash.c -> xxhash.h
> > > >     - replace xxhash_t with 'unsigned long'
> > > >     - update kerneldoc above xxhash()
> > > >   v3 -> v4:
> > > >     - Merge xxhash/crc32 patches
> > > >     - Replace crc32 with crc32c (crc32 have same as jhash2 speed)
> > > >     - Add auto speed test and auto choice of fastest hash function
> > > >   v4 -> v5:
> > > >     - Pickup missed xxhash patch
> > > >     - Update code with compile time choicen xxhash
> > > >     - Add more macros to make code more readable
> > > >     - As now that only possible use xxhash or crc32c,
> > > >       on crc32c allocation error, skip speed test and fallback to
> > > > xxhash
> > > >     - For workaround too early init problem (crc32c not
> > > > avaliable), move zero_checksum init to first call of fastcall()
> > > >     - Don't alloc page for hash testing, use arch zero pages for
> > > > that v5 -> v6:
> > > >     - Use libcrc32c instead of CRYPTO API, mainly for
> > > >       code/Kconfig deps Simplification
> > > >     - Add crc32c_available():
> > > >       libcrc32c will BUG_ON on crc32c problems,
> > > >       so test crc32c avaliable by crc32c_available()
> > > >     - Simplify choice_fastest_hash()
> > > >     - Simplify fasthash()
> > > >     - struct rmap_item && stable_node have sizeof == 64 on x86_64,
> > > >       that makes them cache friendly. As we don't suffer from hash
> > > > collisions, change hash type from unsigned long back to u32.
> > > >     - Fix kbuild robot warning, make all local functions static
> > > >
> > > > Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
> > > > Signed-off-by: leesioh <solee@os.korea.ac.kr>
> > > > CC: Andrea Arcangeli <aarcange@redhat.com>
> > > > CC: linux-mm@kvack.org
> > > > CC: kvm@vger.kernel.org
> > > > ---
> > > >  mm/Kconfig |  2 ++
> > > >  mm/ksm.c   | 93
> > > > +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--- 2
> > > > files changed, 91 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/mm/Kconfig b/mm/Kconfig
> > > > index 03ff7703d322..b60bee4bb07e 100644
> > > > --- a/mm/Kconfig
> > > > +++ b/mm/Kconfig
> > > > @@ -305,6 +305,8 @@ config MMU_NOTIFIER
> > > >  config KSM
> > > >       bool "Enable KSM for page merging"
> > > >       depends on MMU
> > > > +     select XXHASH
> > > > +     select LIBCRC32C
> > > >       help
> > > >         Enable Kernel Samepage Merging: KSM periodically scans
> > > > those areas of an application's address space that an app has
> > > > advised may be diff --git a/mm/ksm.c b/mm/ksm.c
> > > > index c406f75957ad..2b84407fb918 100644
> > > > --- a/mm/ksm.c
> > > > +++ b/mm/ksm.c
> > > > @@ -25,7 +25,6 @@
> > > >  #include <linux/pagemap.h>
> > > >  #include <linux/rmap.h>
> > > >  #include <linux/spinlock.h>
> > > > -#include <linux/jhash.h>
> > > >  #include <linux/delay.h>
> > > >  #include <linux/kthread.h>
> > > >  #include <linux/wait.h>
> > > > @@ -41,6 +40,13 @@
> > > >  #include <linux/numa.h>
> > > >
> > > >  #include <asm/tlbflush.h>
> > > > +
> > > > +/* Support for xxhash and crc32c */
> > > > +#include <crypto/hash.h>
> > > > +#include <linux/crc32c.h>
> > > > +#include <linux/xxhash.h>
> > > > +#include <linux/sizes.h>
> > > > +
> > > >  #include "internal.h"
> > > >
> > > >  #ifdef CONFIG_NUMA
> > > > @@ -284,6 +290,87 @@ static DEFINE_SPINLOCK(ksm_mmlist_lock);
> > > >               sizeof(struct __struct), __alignof__(struct
> > > > __struct),\ (__flags), NULL)
> > > >
> > > > +#define TIME_125MS  (HZ >> 3)
> > > > +#define PERF_TO_MBS(X) (X*PAGE_SIZE*(1 << 3)/(SZ_1M))
> > > > +
> > > > +#define HASH_NONE   0
> > > > +#define HASH_CRC32C 1
> > > > +#define HASH_XXHASH 2
> > > > +
> > > > +static int fastest_hash = HASH_NONE;
> > > > +
> > > > +static bool __init crc32c_available(void)
> > > > +{
> > > > +     static struct shash_desc desc;
> > > > +
> > > > +     desc.tfm = crypto_alloc_shash("crc32c", 0, 0);
> >
> > > will this work without the crypto api?
> >
> > I don't know of a way to compile the kernel without the crypto API;
> > too many different subsystems depend on it,
> > if I read Kconfig correctly, of course.

> I'm confused here. Why did you want to drop the dependency on the
> crypto API in Kconfig if you are using it anyway? Or did I
> misunderstand?

A misunderstanding, I think.

You asked:
`will this work without the crypto api?`

My answer: I don't see an obvious way to even try compiling this without
the crypto API.
Even without KSM adding a dependency on the crypto API, the kernel has
plenty of other subsystems which depend on it.

(As the patch adds a dependency on the crypto API, we have no way to
compile KSM without the crypto API, so I'm not sure that KSM without the
crypto API is an issue which must be fixed.)

Thanks.

> > > > +     desc.flags = 0;
> > > > +
> > > > +     if (IS_ERR(desc.tfm)) {
> > > > +             pr_warn("ksm: alloc crc32c shash error %ld\n",
> > > > +                     -PTR_ERR(desc.tfm));
> > > > +             return false;
> > > > +     }
> > > > +
> > > > +     crypto_free_shash(desc.tfm);
> > > > +     return true;
> > > > +}
> > > > +
> > > > +static void __init choice_fastest_hash(void)
> >
> > > s/choice/choose/
> >
> > > > +{
> > > > +
> > > > +     unsigned long je;
> > > > +     unsigned long perf_crc32c = 0;
> > > > +     unsigned long perf_xxhash = 0;
> > > > +
> > > > +     fastest_hash = HASH_XXHASH;
> > > > +     if (!crc32c_available())
> > > > +             goto out;
> > > > +
> > > > +     preempt_disable();
> > > > +     je = jiffies + TIME_125MS;
> > > > +     while (time_before(jiffies, je)) {
> > > > +             crc32c(0, ZERO_PAGE(0), PAGE_SIZE);
> > > > +             perf_crc32c++;
> > > > +     }
> > > > +     preempt_enable();
> > > > +
> > > > +     preempt_disable();
> > > > +     je = jiffies + TIME_125MS;
> > > > +     while (time_before(jiffies, je)) {
> > > > +             xxhash(ZERO_PAGE(0), PAGE_SIZE, 0);
> > > > +             perf_xxhash++;
> > > > +     }
> > > > +     preempt_enable();
> > > > +
> > > > +     pr_info("ksm: crc32c hash() %5ld MB/s\n",
> > > > PERF_TO_MBS(perf_crc32c));
> > > > +     pr_info("ksm: xxhash hash() %5ld MB/s\n",
> > > > PERF_TO_MBS(perf_xxhash)); +
> > > > +     if (perf_crc32c > perf_xxhash)
> > > > +             fastest_hash = HASH_CRC32C;
> > > > +out:
> > > > +     if (fastest_hash == HASH_CRC32C)
> > > > +             pr_info("ksm: choice crc32c as hash function\n");
> > > > +     else
> > > > +             pr_info("ksm: choice xxhash as hash function\n");
> > > > +}
> >
> > > I wonder if this can be generalized to have a list of possible hash
> > > functions, filtered by availability, and then tested for
> > > performance, more like the raid6 functions.
> >
> > IIRC:
> > We talked about that on an old version of the patch set,
> > and we decided that:
> >   - in an ideal situation, ksm should always use only one hash function.
> >     But we were afraid that crc32c hardware acceleration could somehow
> >     be missing, so xxhash was added as an appropriate general-purpose
> >     fallback, which should work well enough for ksm in most cases.
> >
> > So adding more complex logic, like raid6_pq has with all the different
> > instruction sets, is overkill.

> fair enough

> > > > +
> > > > +static u32 fasthash(const void *input, size_t length)
> > > > +{
> > > > +again:
> > > > +     switch (fastest_hash) {
> > > > +     case HASH_CRC32C:
> > > > +             return crc32c(0, input, length);
> > > > +     case HASH_XXHASH:
> > > > +             return xxhash(input, length, 0);
> > > > +     default:
> > > > +             choice_fastest_hash();
> >
> > > same here s/choice/choose/
> >
> > > > +             /* The correct value depends on page size and endianness */
> > > > +             zero_checksum = fasthash(ZERO_PAGE(0), PAGE_SIZE);
> > > > +             goto again;
> > > > +     }
> > > > +}
> > > > +
> >
> > > so if I understand correctly, the benchmark function will be called
> > > only when the function is called for the first time?
> >
> > Yes, that's right.
> > That's a little bit tricky,
> > but it will be called only from the KSM thread,
> > and only when the KSM thread tries to do some useful work.
> >
> > So it should not block anything.
> >
> > Thanks.


> best regards

> Claudio Imbrenda



--
Have a nice day,
Timofey.
Pavel Tatashin May 22, 2018, 8:22 p.m. UTC | #5
Hi Timofey,

> 
> Perf numbers:
> Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz
> ksm: crc32c   hash() 12081 MB/s
> ksm: xxh64    hash()  8770 MB/s
> ksm: xxh32    hash()  4529 MB/s
> ksm: jhash2   hash()  1569 MB/s

That is a very nice improvement over jhash2!

> Add function to autoselect hash algo on boot,
> based on hashing speed, like raid6 code does.

Are you aware of hardware where crc32c is slower compared to xxhash?
Perhaps always use crc32c when available?

> +
> +static u32 fasthash(const void *input, size_t length)
> +{
> +again:
> +	switch (fastest_hash) {
> +	case HASH_CRC32C:
> +		return crc32c(0, input, length);
> +	case HASH_XXHASH:
> +		return xxhash(input, length, 0);

You are losing half of the 64-bit word in the xxh64 case? Is this acceptable?
Maybe do one more xor: in the 64-bit case, in xxhash() do (v >> 32) ^ (u32)v?

> +	default:
> +		choice_fastest_hash();
> +		/* The correct value depends on page size and endianness */
> +		zero_checksum = fasthash(ZERO_PAGE(0), PAGE_SIZE);
> +		goto again;
> +	}
> +}

choice_fastest_hash() does not belong in fasthash(). We are losing leaf
function optimizations if you keep it in this hot path. Also, fastest_hash
should really be a static branch in order to avoid an extra load and
conditional branch.

I think, crc32c should simply be used when it is available, and use xxhash
otherwise, the decision should be made in ksm_init()

Thank you,
Pavel
Timofey Titovets May 23, 2018, 1:45 p.m. UTC | #6
Tue, 22 May 2018 at 23:22, Pavel Tatashin <pasha.tatashin@oracle.com>:

> Hi Timofey,

> >
> > Perf numbers:
> > Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz
> > ksm: crc32c   hash() 12081 MB/s
> > ksm: xxh64    hash()  8770 MB/s
> > ksm: xxh32    hash()  4529 MB/s
> > ksm: jhash2   hash()  1569 MB/s

> That is a very nice improvement over jhash2!

> > Add function to autoselect hash algo on boot,
> > based on hashing speed, like raid6 code does.

> Are you aware of hardware where crc32c is slower compared to xxhash?
> Perhaps always use crc32c when available?

crc32c will always be available, because of Kconfig.
But if crc32c doesn't have HW acceleration, it will be slower.

To talk about the whole range of HW I would have to own that HW,
so I can't say that *all* supported HW has accelerated crc32c.

> > +
> > +static u32 fasthash(const void *input, size_t length)
> > +{
> > +again:
> > +     switch (fastest_hash) {
> > +     case HASH_CRC32C:
> > +             return crc32c(0, input, length);
> > +     case HASH_XXHASH:
> > +             return xxhash(input, length, 0);

> You are losing half of the 64-bit word in the xxh64 case? Is this acceptable?
> Maybe do one more xor: in the 64-bit case, in xxhash() do (v >> 32) ^ (u32)v?

AFAIK, that would make the hash function worse.
Anyway, in ksm the hash is used only to check whether a page has changed
since the last scan, so it doesn't really matter (IMHO).

> > +     default:
> > +             choice_fastest_hash();
> > +             /* The correct value depends on page size and endianness */
> > +             zero_checksum = fasthash(ZERO_PAGE(0), PAGE_SIZE);
> > +             goto again;
> > +     }
> > +}

> choice_fastest_hash() does not belong in fasthash(). We are losing leaf
> function optimizations if you keep it in this hot path. Also, fastest_hash
> should really be a static branch in order to avoid an extra load and
> conditional branch.

I don't think that will give any noticeable performance benefit
compared to the hash computation and the memcmp in the RB tree.

In theory, the switch could be replaced with a self-written jump table to
*avoid* run-time overhead.
AFAIK at 5 entries gcc converts a switch to a jump table by itself.

> I think, crc32c should simply be used when it is available, and use xxhash
> otherwise, the decision should be made in ksm_init()

I already said, in the conversation above, why I think doing that in
ksm_init() is a bad idea.

> Thank you,
> Pavel

Thanks.
Pavel Tatashin May 23, 2018, 2:24 p.m. UTC | #7
Hi Timofey,

> crc32c will always be available, because of Kconfig.
> But if crc32c doesn't have HW acceleration, it will be slower.

> For talk about range of HW, i must have that HW,
> so i can't say that *all* supported HW, have crc32c with acceleration.

How about always defaulting to crc32c when HW acceleration is present
without doing timings?
Do you have performance numbers of crc32c without acceleration?

> > You are losing half of the 64-bit word in the xxh64 case? Is this acceptable?
> > Maybe do one more xor: in the 64-bit case, in xxhash() do (v >> 32) ^ (u32)v?

> AFAIK, that would make the hash function worse.
> Anyway, in ksm the hash is used only to check whether a page has changed
> since the last scan, so it doesn't really matter (IMHO).

I understand that losing half of the hash result might be acceptable in
this case, but I am not really sure how XORing one more time can possibly
make the hash function worse; could you please elaborate?

> > choice_fastest_hash() does not belong in fasthash(). We are losing leaf
> > function optimizations if you keep it in this hot path. Also, fastest_hash
> > should really be a static branch in order to avoid an extra load and
> > conditional branch.

> I don't think that will give any noticeable performance benefit
> compared to the hash computation and the memcmp in the RB tree.

You are right, it is small compared to the hash and memcmp, but I still think
it makes sense to use a static branch; after all, the value will never change
at runtime after the first time it is set.


> In theory, the switch could be replaced with a self-written jump table to
> *avoid* run-time overhead.
> AFAIK at 5 entries gcc converts a switch to a jump table by itself.

> > I think, crc32c should simply be used when it is available, and use xxhash
> > otherwise, the decision should be made in ksm_init()

> I already said, in the conversation above, why I think doing that in
> ksm_init() is a bad idea.

It really feels wrong to keep choice_fastest_hash() in fasthash(); it is
done only once and really belongs in an init function, like ksm_init(). As
I understand it, you think it is a bad idea to keep it in ksm_init() because
it slows down boot by 0.25s, which I agree is substantial. But I
really do not think that we should spend those 0.25s at all deciding which
hash function is optimal; instead we should default to one or the other during
boot based on the hardware we are booting on. If crc32c without hw
acceleration is no worse than jhash2, maybe we should simply switch to crc32c?

Thank you,
Pavel
Timofey Titovets May 24, 2018, 8:01 a.m. UTC | #8
Wed, 23 May 2018 at 17:24, Pavel Tatashin <pasha.tatashin@oracle.com>:

> Hi Timofey,

> > crc32c will always be available, because of Kconfig.
> > But if crc32c doesn't have HW acceleration, it will be slower.

> > To talk about the whole range of HW I would have to own that HW,
> > so I can't say that *all* supported HW has accelerated crc32c.

> How about always defaulting to crc32c when HW acceleration is present
> without doing timings?
IIRC, yes, the shash API can return 'cra_priority'.

> Do you have performance numbers of crc32c without acceleration?
Yes, https://lkml.org/lkml/2017/12/30/222

The experimental results (the experimental value is the average of the
measured values)
crc32c_intel: 1084.10ns
crc32c (no hardware acceleration): 7012.51ns
xxhash32: 2227.75ns
xxhash64: 1413.16ns
jhash2: 5128.30ns

> > > You are losing half of the 64-bit word in the xxh64 case? Is this acceptable?
> > > Maybe do one more xor: in the 64-bit case, in xxhash() do (v >> 32) ^ (u32)v?

> > AFAIK, that would make the hash function worse.
> > Anyway, in ksm the hash is used only to check whether a page has changed
> > since the last scan, so it doesn't really matter (IMHO).

> I understand that losing half of the hash result might be acceptable in
> this case, but I am not really sure how XORing one more time can possibly
> make the hash function worse; could you please elaborate?

IIRC, because xor is symmetric,
i.e. shift:
0b01011010 >> 4 = 0b0101
and xor:
0b0101 ^ 0b1010 = 0b1111
xor will decrease randomness/entropy and can lead to hash collisions.

> > > choice_fastest_hash() does not belong in fasthash(). We are losing leaf
> > > function optimizations if you keep it in this hot path. Also, fastest_hash
> > > should really be a static branch in order to avoid an extra load and
> > > conditional branch.

> > I don't think that will give any noticeable performance benefit
> > compared to the hash computation and the memcmp in the RB tree.

> You are right, it is small compared to the hash and memcmp, but I still think
> it makes sense to use a static branch; after all, the value will never change
> at runtime after the first time it is set.


> > In theory, the switch could be replaced with a self-written jump table to
> > *avoid* run-time overhead.
> > AFAIK at 5 entries gcc converts a switch to a jump table by itself.

> > > I think, crc32c should simply be used when it is available, and use xxhash
> > > otherwise, the decision should be made in ksm_init()

> > I already said, in the conversation above, why I think doing that in
> > ksm_init() is a bad idea.

> It really feels wrong to keep choice_fastest_hash() in fasthash(); it is
> done only once and really belongs in an init function, like ksm_init().

It's possible to move the decision from lazy load to the ksm thread;
that would allow us to run the benchmark without slowing down boot.

But for that to work, ksm must start later, after crypto init.

> I understand you think it is a bad idea to keep it in ksm_init() because
> it slows down boot by 0.25s, which I agree is substantial. But I
> really do not think that we should spend those 0.25s at all deciding which
> hash function is optimal, and instead default to one or the other during
> boot based on the hardware we are booting on. If crc32c without hw
> acceleration is no worse than jhash2, maybe we should simply switch to crc32c?

crc32c without hw acceleration is slower than jhash2 on x86, so I think
the result will be the same on other arches.

> Thank you,
> Pavel

Thanks.

--
Have a nice day,
Timofey.
Mike Rapoport May 27, 2018, 1:03 p.m. UTC | #9
Hi,

On Thu, May 24, 2018 at 11:01:20AM +0300, Timofey Titovets wrote:
> Wed, 23 May 2018 at 17:24, Pavel Tatashin <pasha.tatashin@oracle.com>:
> 
> > Hi Timofey,

[ ... ]
 
> > It really feels wrong to keep choice_fastest_hash() in fasthash(); it is
> > done only once and really belongs in an init function, like ksm_init().
> 
> It's possible to move the decision from lazy load to the ksm thread;
> that would allow us to run the benchmark without slowing down boot.
> 
> But for that to work, ksm must start later, after crypto init.
 
What about moving choice_fastest_hash() to run_store()?

KSM anyway starts with ksm_run = KSM_RUN_STOP and does not scan until
userspace writes !0 to /sys/kernel/mm/ksm/run.

Selection of the hash function when KSM is actually enabled seems quite
appropriate...

> > I understand you think it is a bad idea to keep it in ksm_init() because
> > it slows down boot by 0.25s, which I agree is substantial. But I
> > really do not think that we should spend those 0.25s at all deciding which
> > hash function is optimal, and instead default to one or the other during
> > boot based on the hardware we are booting on. If crc32c without hw
> > acceleration is no worse than jhash2, maybe we should simply switch to crc32c?
> 
> crc32c without hw acceleration is slower than jhash2 on x86, so I think
> the result will be the same on other arches.
> 
> > Thank you,
> > Pavel
> 
> Thanks.
> 
> --
> Have a nice day,
> Timofey.
>
Pavel Tatashin May 29, 2018, 2:45 p.m. UTC | #10
> What about moving choice_fastest_hash() to run_store()?

> KSM anyway starts with ksm_run = KSM_RUN_STOP and does not scan until
> userspace writes !0 to /sys/kernel/mm/ksm/run.

> Selection of the hash function when KSM is actually enabled seems quite
> appropriate...

Hi Mike,

It is a good idea to select the hash function from run_store() when
(flags & KSM_RUN_MERGE) is set for the first time.

Pavel
Timofey Titovets June 7, 2018, 8:58 a.m. UTC | #11
Tue, 29 May 2018 at 17:46, Pavel Tatashin <pasha.tatashin@oracle.com>:
>
> > What about moving choice_fastest_hash() to run_store()?
>
> > KSM anyway starts with ksm_run = KSM_RUN_STOP and does not scan until
> > userspace writes !0 to /sys/kernel/mm/ksm/run.
>
> > Selection of the hash function when KSM is actually enabled seems quite
> > appropriate...
>
> Hi Mike,
>
> This is a good idea to select hash function from run_store() when (flags &
> KSM_RUN_MERGE) is set for the first time.
>
> Pavel

IIRC, run_store() is hidden under '#ifdef CONFIG_SYSFS'.
So what about the case without CONFIG_SYSFS?
Mike Rapoport June 7, 2018, 11:52 a.m. UTC | #12
On Thu, Jun 07, 2018 at 11:58:05AM +0300, Timofey Titovets wrote:
> Tue, 29 May 2018 at 17:46, Pavel Tatashin <pasha.tatashin@oracle.com>:
> >
> > > What about moving choice_fastest_hash() to run_store()?
> >
> > > KSM anyway starts with ksm_run = KSM_RUN_STOP and does not scan until
> > > userspace writes !0 to /sys/kernel/mm/ksm/run.
> >
> > > Selection of the hash function when KSM is actually enabled seems quite
> > > appropriate...
> >
> > Hi Mike,
> >
> > This is a good idea to select hash function from run_store() when (flags &
> > KSM_RUN_MERGE) is set for the first time.
> >
> > Pavel
> 
> IIRC, run_store() is hidden under '#ifdef CONFIG_SYSFS'.
> So what about the case without CONFIG_SYSFS?

With CONFIG_SYSFS=n there is nothing that will set ksm_run to anything but
zero and ksm_do_scan will never be called.
 
> -- 
> Have a nice day,
> Timofey.
>
Pavel Tatashin June 8, 2018, 1:29 a.m. UTC | #13
> With CONFIG_SYSFS=n there is nothing that will set ksm_run to anything but
> zero and ksm_do_scan will never be called.
>

Unfortunately, this is not so:

In: /linux-master/mm/ksm.c

3143#else
3144 ksm_run = KSM_RUN_MERGE; /* no way for user to start it */
3145
3146#endif /* CONFIG_SYSFS */

So, we do set ksm_run to run right from ksm_init() when CONFIG_SYSFS=n.

I wonder if it is acceptable to only use xxhash when CONFIG_SYSFS=n?

Thank you,
Pavel
Mike Rapoport June 10, 2018, 5:38 a.m. UTC | #14
On Thu, Jun 07, 2018 at 09:29:49PM -0400, Pavel Tatashin wrote:
> > With CONFIG_SYSFS=n there is nothing that will set ksm_run to anything but
> > zero and ksm_do_scan will never be called.
> >
> 
> Unfortunately, this is not so:
> 
> In: /linux-master/mm/ksm.c
> 
> 3143#else
> 3144 ksm_run = KSM_RUN_MERGE; /* no way for user to start it */
> 3145
> 3146#endif /* CONFIG_SYSFS */
> 
> So, we do set ksm_run to run right from ksm_init() when CONFIG_SYSFS=n.

Huh, missed that one...
 
> I wonder if it is acceptable to only use xxhash when CONFIG_SYSFS=n?

A bit unrelated to CONFIG_SYSFS, but rather for rare use-cases in general.
What will happen in the following scenario:

* The system has crc32c HW acceleration
* KSM chooses crc32c
* KSM runs with crc32c
* user removes crc32c HW acceleration module

If I understand correctly, we'll then fall back to pure SW crc32c
calculations, right?

> Thank you,
> Pavel
>
Pavel Tatashin June 22, 2018, 6:48 p.m. UTC | #15
> A bit unrelated to CONFIG_SYSFS, but rather for rare use-cases in general.
> What will happen in the following scenario:
>
> * The system has crc32c HW acceleration
> * KSM chooses crc32c
> * KSM runs with crc32c
> * user removes crc32c HW acceleration module
>
> If I understand correctly, we'll then fall back to pure SW crc32c
> calculations, right?

Yes, we fall back to the SW crc32c, which is slower than the hw-optimized
one, but we won't change the hash function once it is set. I do not
think it makes sense to add any extra logic to ksm for that, even
after every page is unmerged and the ksm thread is stopped.

Pavel
Mike Rapoport June 25, 2018, 8:48 a.m. UTC | #16
On Thu, Jun 07, 2018 at 09:29:49PM -0400, Pavel Tatashin wrote:
> > With CONFIG_SYSFS=n there is nothing that will set ksm_run to anything but
> > zero and ksm_do_scan will never be called.
> >
> 
> Unfortunately, this is not so:
> 
> In: /linux-master/mm/ksm.c
> 
> 3143#else
> 3144 ksm_run = KSM_RUN_MERGE; /* no way for user to start it */
> 3145
> 3146#endif /* CONFIG_SYSFS */
> 
> So, we do set ksm_run to run right from ksm_init() when CONFIG_SYSFS=n.
> 
> I wonder if it is acceptable to only use xxhash when CONFIG_SYSFS=n?

BTW, with CONFIG_SYSFS=n KSM may start running before hardware acceleration
for crc32c is initialized...
 
> Thank you,
> Pavel
>
Timofey Titovets Sept. 13, 2018, 10:35 a.m. UTC | #17
Mon, 25 Jun 2018 at 11:48, Mike Rapoport <rppt@linux.vnet.ibm.com>:
>
> On Thu, Jun 07, 2018 at 09:29:49PM -0400, Pavel Tatashin wrote:
> > > With CONFIG_SYSFS=n there is nothing that will set ksm_run to anything but
> > > zero and ksm_do_scan will never be called.
> > >
> >
> > Unfortunately, this is not so:
> >
> > In: /linux-master/mm/ksm.c
> >
> > 3143#else
> > 3144 ksm_run = KSM_RUN_MERGE; /* no way for user to start it */
> > 3145
> > 3146#endif /* CONFIG_SYSFS */
> >
> > So, we do set ksm_run to run right from ksm_init() when CONFIG_SYSFS=n.
> >
> > I wonder if it is acceptable to only use xxhash when CONFIG_SYSFS=n?
>
> BTW, with CONFIG_SYSFS=n KSM may start running before hardware acceleration
> for crc32c is initialized...
>
> > Thank you,
> > Pavel
> >
>
> --
> Sincerely yours,
> Mike.
>

A little thread bump.
This patch set hasn't been able to move forward for about ~8 months.
As I see it, the main question in the thread is that we have a race
between ksm initialization and the availability of the crypto API.
Maybe we can fall back to a simple plan and just replace our good old
buddy jhash with the faster xxhash?
That would push the crypto API & crc32c question into the background
and make things 2-3 times faster for now.

What do you all think about that?

> crc32c_intel: 1084.10ns
> crc32c (no hardware acceleration): 7012.51ns
> xxhash32: 2227.75ns
> xxhash64: 1413.16ns
> jhash2: 5128.30ns
Mike Rapoport Sept. 13, 2018, 6:01 p.m. UTC | #18
(updated Pasha's e-mail)

On Thu, Sep 13, 2018 at 01:35:20PM +0300, Timofey Titovets wrote:
> Mon, 25 Jun 2018 at 11:48, Mike Rapoport <rppt@linux.vnet.ibm.com>:
> >
> > On Thu, Jun 07, 2018 at 09:29:49PM -0400, Pavel Tatashin wrote:
> > > > With CONFIG_SYSFS=n there is nothing that will set ksm_run to anything but
> > > > zero and ksm_do_scan will never be called.
> > > >
> > >
> > > Unfortunately, this is not so:
> > >
> > > In: /linux-master/mm/ksm.c
> > >
> > > 3143#else
> > > 3144 ksm_run = KSM_RUN_MERGE; /* no way for user to start it */
> > > 3145
> > > 3146#endif /* CONFIG_SYSFS */
> > >
> > > So, we do set ksm_run to run right from ksm_init() when CONFIG_SYSFS=n.
> > >
> > > I wonder if it is acceptable to only use xxhash when CONFIG_SYSFS=n?
> >
> > BTW, with CONFIG_SYSFS=n KSM may start running before hardware acceleration
> > for crc32c is initialized...
> >
> > > Thank you,
> > > Pavel
> > >
> >
> > --
> > Sincerely yours,
> > Mike.
> >
> 
> A little thread bump.
> This patch set hasn't been able to move forward for about ~8 months.
> As I see it, the main question in the thread is that we have a race
> between ksm initialization and the availability of the crypto API.
> Maybe we can fall back to a simple plan and just replace our good old
> buddy jhash with the faster xxhash?
> That would push the crypto API & crc32c question into the background
> and make things 2-3 times faster for now.
> 
> What do you all think about that?

Sounds reasonable to me

> > crc32c_intel: 1084.10ns
> > crc32c (no hardware acceleration): 7012.51ns
> > xxhash32: 2227.75ns
> > xxhash64: 1413.16ns
> > jhash2: 5128.30ns
> 
> -- 
> Have a nice day,
> Timofey.
>
Pasha Tatashin Sept. 13, 2018, 6:10 p.m. UTC | #19
On 9/13/18 2:01 PM, Mike Rapoport wrote:
> (updated Pasha's e-mail)
> 
> Thu, Sep 13, 2018 at 01:35:20PM +0300, Timofey Titovets wrote:
>> Mon, 25 Jun 2018 at 11:48, Mike Rapoport <rppt@linux.vnet.ibm.com>:
>>>
>>> On Thu, Jun 07, 2018 at 09:29:49PM -0400, Pavel Tatashin wrote:
>>>>> With CONFIG_SYSFS=n there is nothing that will set ksm_run to anything but
>>>>> zero and ksm_do_scan will never be called.
>>>>>
>>>>
>>>> Unfortunately, this is not so:
>>>>
>>>> In: /linux-master/mm/ksm.c
>>>>
>>>> 3143#else
>>>> 3144 ksm_run = KSM_RUN_MERGE; /* no way for user to start it */
>>>> 3145
>>>> 3146#endif /* CONFIG_SYSFS */
>>>>
>>>> So, we do set ksm_run to run right from ksm_init() when CONFIG_SYSFS=n.
>>>>
>>>> I wonder if it is acceptable to only use xxhash when CONFIG_SYSFS=n?
>>>
>>> BTW, with CONFIG_SYSFS=n KSM may start running before hardware acceleration
>>> for crc32c is initialized...
>>>
>>>> Thank you,
>>>> Pavel
>>>>
>>>
>>> --
>>> Sincerely yours,
>>> Mike.
>>>
>>
>> A little thread bump.
>> This patch set hasn't been able to move forward for about ~8 months.
>> As I see it, the main question in the thread is that we have a race
>> between ksm initialization and the availability of the crypto API.
>> Maybe we can fall back to a simple plan and just replace our good old
>> buddy jhash with the faster xxhash?
>> That would push the crypto API & crc32c question into the background
>> and make things 2-3 times faster for now.
>>
>> What do you all think about that?
> 
> Sounds reasonable to me

Same here, please send a new patch with xxhash, and after that we can
work on a faster crc32.

Thank you,
Pavel

> 
>>> crc32c_intel: 1084.10ns
>>> crc32c (no hardware acceleration): 7012.51ns
>>> xxhash32: 2227.75ns
>>> xxhash64: 1413.16ns
>>> jhash2: 5128.30ns
>>
>> -- 
>> Have a nice day,
>> Timofey.
>>
>

Patch

diff --git a/mm/Kconfig b/mm/Kconfig
index 03ff7703d322..b60bee4bb07e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -305,6 +305,8 @@  config MMU_NOTIFIER
 config KSM
 	bool "Enable KSM for page merging"
 	depends on MMU
+	select XXHASH
+	select LIBCRC32C
 	help
 	  Enable Kernel Samepage Merging: KSM periodically scans those areas
 	  of an application's address space that an app has advised may be
diff --git a/mm/ksm.c b/mm/ksm.c
index c406f75957ad..2b84407fb918 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -25,7 +25,6 @@ 
 #include <linux/pagemap.h>
 #include <linux/rmap.h>
 #include <linux/spinlock.h>
-#include <linux/jhash.h>
 #include <linux/delay.h>
 #include <linux/kthread.h>
 #include <linux/wait.h>
@@ -41,6 +40,13 @@ 
 #include <linux/numa.h>
 
 #include <asm/tlbflush.h>
+
+/* Support for xxhash and crc32c */
+#include <crypto/hash.h>
+#include <linux/crc32c.h>
+#include <linux/xxhash.h>
+#include <linux/sizes.h>
+
 #include "internal.h"
 
 #ifdef CONFIG_NUMA
@@ -284,6 +290,87 @@  static DEFINE_SPINLOCK(ksm_mmlist_lock);
 		sizeof(struct __struct), __alignof__(struct __struct),\
 		(__flags), NULL)
 
+#define TIME_125MS  (HZ >> 3)
+#define PERF_TO_MBS(X) (X*PAGE_SIZE*(1 << 3)/(SZ_1M))
+
+#define HASH_NONE   0
+#define HASH_CRC32C 1
+#define HASH_XXHASH 2
+
+static int fastest_hash = HASH_NONE;
+
+static bool __init crc32c_available(void)
+{
+	static struct shash_desc desc;
+
+	desc.tfm = crypto_alloc_shash("crc32c", 0, 0);
+	desc.flags = 0;
+
+	if (IS_ERR(desc.tfm)) {
+		pr_warn("ksm: alloc crc32c shash error %ld\n",
+			-PTR_ERR(desc.tfm));
+		return false;
+	}
+
+	crypto_free_shash(desc.tfm);
+	return true;
+}
+
+static void __init choice_fastest_hash(void)
+{
+
+	unsigned long je;
+	unsigned long perf_crc32c = 0;
+	unsigned long perf_xxhash = 0;
+
+	fastest_hash = HASH_XXHASH;
+	if (!crc32c_available())
+		goto out;
+
+	preempt_disable();
+	je = jiffies + TIME_125MS;
+	while (time_before(jiffies, je)) {
+		crc32c(0, ZERO_PAGE(0), PAGE_SIZE);
+		perf_crc32c++;
+	}
+	preempt_enable();
+
+	preempt_disable();
+	je = jiffies + TIME_125MS;
+	while (time_before(jiffies, je)) {
+		xxhash(ZERO_PAGE(0), PAGE_SIZE, 0);
+		perf_xxhash++;
+	}
+	preempt_enable();
+
+	pr_info("ksm: crc32c hash() %5ld MB/s\n", PERF_TO_MBS(perf_crc32c));
+	pr_info("ksm: xxhash hash() %5ld MB/s\n", PERF_TO_MBS(perf_xxhash));
+
+	if (perf_crc32c > perf_xxhash)
+		fastest_hash = HASH_CRC32C;
+out:
+	if (fastest_hash == HASH_CRC32C)
+		pr_info("ksm: choice crc32c as hash function\n");
+	else
+		pr_info("ksm: choice xxhash as hash function\n");
+}
+
+static u32 fasthash(const void *input, size_t length)
+{
+again:
+	switch (fastest_hash) {
+	case HASH_CRC32C:
+		return crc32c(0, input, length);
+	case HASH_XXHASH:
+		return xxhash(input, length, 0);
+	default:
+		choice_fastest_hash();
+		/* The correct value depends on page size and endianness */
+		zero_checksum = fasthash(ZERO_PAGE(0), PAGE_SIZE);
+		goto again;
+	}
+}
+
 static int __init ksm_slab_init(void)
 {
 	rmap_item_cache = KSM_KMEM_CACHE(rmap_item, 0);
@@ -979,7 +1066,7 @@  static u32 calc_checksum(struct page *page)
 {
 	u32 checksum;
 	void *addr = kmap_atomic(page);
-	checksum = jhash2(addr, PAGE_SIZE / 4, 17);
+	checksum = fasthash(addr, PAGE_SIZE);
 	kunmap_atomic(addr);
 	return checksum;
 }
@@ -3061,8 +3148,6 @@  static int __init ksm_init(void)
 	struct task_struct *ksm_thread;
 	int err;
 
-	/* The correct value depends on page size and endianness */
-	zero_checksum = calc_checksum(ZERO_PAGE(0));
 	/* Default to false for backwards compatibility */
 	ksm_use_zero_pages = false;