diff mbox

[RFC] change non-atomic bitops method

Message ID 35FD53F367049845BC99AC72306C23D1044A02027E0E@CNBJMBX05.corpusers.net (mailing list archive)
State New, archived
Headers show

Commit Message

Wang, Yalin Feb. 3, 2015, 8:42 a.m. UTC
> -----Original Message-----
> From: Wang, Yalin
> Sent: Tuesday, February 03, 2015 3:04 PM
> To: 'Andrew Morton'
> Cc: 'Kirill A. Shutemov'; 'arnd@arndb.de'; 'linux-arch@vger.kernel.org';
> 'linux-kernel@vger.kernel.org'; 'linux@arm.linux.org.uk'; 'linux-arm-
> kernel@lists.infradead.org'
> Subject: RE: [RFC] change non-atomic bitops method
> 
> > -----Original Message-----
> > From: Andrew Morton [mailto:akpm@linux-foundation.org]
> > Sent: Tuesday, February 03, 2015 2:39 PM
> > To: Wang, Yalin
> > Cc: 'Kirill A. Shutemov'; 'arnd@arndb.de'; 'linux-arch@vger.kernel.org';
> > 'linux-kernel@vger.kernel.org'; 'linux@arm.linux.org.uk'; 'linux-arm-
> > kernel@lists.infradead.org'
> > Subject: Re: [RFC] change non-atomic bitops method
> >
> > On Tue, 3 Feb 2015 13:42:45 +0800 "Wang, Yalin"
> <Yalin.Wang@sonymobile.com>
> > wrote:
> > >
> > > ...
> > >
> > > #ifdef CHECK_BEFORE_SET
> > > 			if (p[i] != times)
> > > #endif
> > >
> > > ...
> > >
> > > ----
> > > One run on CPU0, reader thread run on CPU1,
> > > Test result:
> > > sudo ./cache_test
> > > reader:8.426228173
> > > 8.672198335
> > >
> > > With -DCHECK_BEFORE_SET
> > > sudo ./cache_test_check
> > > reader:7.537036819
> > > 10.799746531
> > >
> >
> > You aren't measuring the right thing.  You should compare
> >
> > 	if (p[i] != x)
> > 		p[i] = x;
> >
> > versus
> >
> > 	p[i] = x;
> >
> > and you should do this for two cases:
> >
> > a) p[i] == x
> >
> > b) p[i] != x
> >
> >
> > The first code sequence will be slower when (p[i] != x) and faster when
> > (p[i] == x).
> >
> >
> > Next, we should instrument the kernel to work out the frequency of
> > set_bit on an already-set bit.
> >
> > It is only with both these ratios that we can work out whether the
> > patch is a net gain.  My suspicion is that set_bit on an already-set
> > bit is so rare that the patch will be a loss.
> I see, let's change the test a little:
> 1)
> 	memset(p, 0, SIZE);
> 	if (p[i] != 0)
> 		p[i] = 0;  // never called
> 
> 	#sudo ./cache_test_check
> 	6.698153838
> 	reader:7.529402625
> 
> 
> 2)
> 	memset(p, 0, SIZE);
> 	if (p[i] == 0)
> 		p[i] = 0; // always called
> 	#sudo ./cache_test_check
> 	reader:7.895421311
> 	9.000889973
> 
> Thanks
> 
> 
I make a change in kernel to test hit/miss ratio:
---
root@D5303:/ # cat /proc/meminfo
VmallocUsed:       10348 kB
VmallocChunk:      75632 kB
__set_bit_miss_count:10002 __set_bit_success_count:1096661
__clear_bit_miss_count:359484 __clear_bit_success_count:3674617
__test_and_set_bit_miss_count:7 __test_and_set_bit_success_count:221
__test_and_clear_bit_miss_count:924611 __test_and_clear_bit_success_count:193

__test_and_clear_bit_miss_count has a very high miss rate.
In fact, I think set/clear/test_and_set(clear)_bit atomic version can also
Be investigated to see its miss ratio,
I have not tested the atomic version,
Because it reside in different architectures.

Thanks

Comments

Andrew Morton Feb. 3, 2015, 10:59 a.m. UTC | #1
On Tue, 3 Feb 2015 16:42:14 +0800 "Wang, Yalin" <Yalin.Wang@sonymobile.com> wrote:

> I make a change in kernel to test hit/miss ratio:

Neat, thanks.

>
> ...
>
> After use the phone some time:
> root@D5303:/ # cat /proc/meminfo
> VmallocUsed:       10348 kB
> VmallocChunk:      75632 kB
> __set_bit_miss_count:10002 __set_bit_success_count:1096661
> __clear_bit_miss_count:359484 __clear_bit_success_count:3674617
> __test_and_set_bit_miss_count:7 __test_and_set_bit_success_count:221
> __test_and_clear_bit_miss_count:924611 __test_and_clear_bit_success_count:193
> 
> __test_and_clear_bit_miss_count has a very high miss rate.
> In fact, I think set/clear/test_and_set(clear)_bit atomic version can also
> Be investigated to see its miss ratio,
> I have not tested the atomic version,
> Because it reside in different architectures.

Hopefully misses in test_and_X_bit are not a problem.  The CPU
implementation would be pretty stupid to go and dirty the cacheline
when it knows it didn't change anything.  But maybe I'm wrong about
that.  

That we're running clear_bit against a cleared bit 10% of the time is a
bit alarming.  I wonder where that's coming from.

The enormous miss count in test_and_clear_bit() might indicate an
inefficiency somewhere.
Wang, Yalin Feb. 9, 2015, 8:18 a.m. UTC | #2
> -----Original Message-----
> From: Andrew Morton [mailto:akpm@linux-foundation.org]
> Sent: Tuesday, February 03, 2015 6:59 PM
> To: Wang, Yalin
> Cc: 'Kirill A. Shutemov'; 'arnd@arndb.de'; 'linux-arch@vger.kernel.org';
> 'linux-kernel@vger.kernel.org'; 'linux@arm.linux.org.uk'; 'linux-arm-
> kernel@lists.infradead.org'
> Subject: Re: [RFC] change non-atomic bitops method
> 
> On Tue, 3 Feb 2015 16:42:14 +0800 "Wang, Yalin" <Yalin.Wang@sonymobile.com>
> wrote:
> 
> > I make a change in kernel to test hit/miss ratio:
> 
> Neat, thanks.
> 
> >
> > ...
> >
> > After use the phone some time:
> > root@D5303:/ # cat /proc/meminfo
> > VmallocUsed:       10348 kB
> > VmallocChunk:      75632 kB
> > __set_bit_miss_count:10002 __set_bit_success_count:1096661
> > __clear_bit_miss_count:359484 __clear_bit_success_count:3674617
> > __test_and_set_bit_miss_count:7 __test_and_set_bit_success_count:221
> > __test_and_clear_bit_miss_count:924611
> __test_and_clear_bit_success_count:193
> >
> > __test_and_clear_bit_miss_count has a very high miss rate.
> > In fact, I think set/clear/test_and_set(clear)_bit atomic version can
> also
> > Be investigated to see its miss ratio,
> > I have not tested the atomic version,
> > Because it reside in different architectures.
> 
> Hopefully misses in test_and_X_bit are not a problem.  The CPU
> implementation would be pretty stupid to go and dirty the cacheline
> when it knows it didn't change anything.  But maybe I'm wrong about
> that.
> 
> That we're running clear_bit against a cleared bit 10% of the time is a
> bit alarming.  I wonder where that's coming from.
> 
> The enormous miss count in test_and_clear_bit() might indicate an
> inefficiency somewhere.
I te-test the patch on 3.10 kernel.
The result like this:

VmallocChunk:   251498164 kB
__set_bit_miss_count:11730 __set_bit_success_count:1036316
__clear_bit_miss_count:209640 __clear_bit_success_count:4806556
__test_and_set_bit_miss_count:0 __test_and_set_bit_success_count:121
__test_and_clear_bit_miss_count:0 __test_and_clear_bit_success_count:445

__clear_bit miss rate is a little high,
I check the log, and most miss coming from this code:

<6>[  442.701798] [<ffffffc00021d084>] warn_slowpath_fmt+0x4c/0x58
<6>[  442.701805] [<ffffffc0002461a8>] __clear_bit+0x98/0xa4
<6>[  442.701813] [<ffffffc0003126ac>] __alloc_fd+0xc8/0x124
<6>[  442.701821] [<ffffffc000312768>] get_unused_fd_flags+0x28/0x34
<6>[  442.701828] [<ffffffc0002f9370>] do_sys_open+0x10c/0x1c0
<6>[  442.701835] [<ffffffc0002f9458>] SyS_openat+0xc/0x18
In __clear_close_on_exec(fd, fdt);



<6>[  442.695354] [<ffffffc00021d084>] warn_slowpath_fmt+0x4c/0x58
<6>[  442.695359] [<ffffffc0002461a8>] __clear_bit+0x98/0xa4
<6>[  442.695367] [<ffffffc000312340>] dup_fd+0x1d4/0x280
<6>[  442.695375] [<ffffffc00021b07c>] copy_process.part.56+0x42c/0xe38
<6>[  442.695382] [<ffffffc00021bb9c>] do_fork+0xe0/0x360
<6>[  442.695389] [<ffffffc00021beb4>] SyS_clone+0x10/0x1c
In __clear_open_fd(open_files - i, new_fdt);

Do we need test_bit() before clear_bit()at these 2 place?

Thanks
Rasmus Villemoes Feb. 9, 2015, 9:42 p.m. UTC | #3
On Mon, Feb 09 2015, "Wang, Yalin" <Yalin.Wang@sonymobile.com> wrote:

> I te-test the patch on 3.10 kernel.
> The result like this:
>
> VmallocChunk:   251498164 kB
> __set_bit_miss_count:11730 __set_bit_success_count:1036316
> __clear_bit_miss_count:209640 __clear_bit_success_count:4806556
> __test_and_set_bit_miss_count:0 __test_and_set_bit_success_count:121
> __test_and_clear_bit_miss_count:0 __test_and_clear_bit_success_count:445
>
> __clear_bit miss rate is a little high,
> I check the log, and most miss coming from this code:
>
> <6>[  442.701798] [<ffffffc00021d084>] warn_slowpath_fmt+0x4c/0x58
> <6>[  442.701805] [<ffffffc0002461a8>] __clear_bit+0x98/0xa4
> <6>[  442.701813] [<ffffffc0003126ac>] __alloc_fd+0xc8/0x124
> <6>[  442.701821] [<ffffffc000312768>] get_unused_fd_flags+0x28/0x34
> <6>[  442.701828] [<ffffffc0002f9370>] do_sys_open+0x10c/0x1c0
> <6>[  442.701835] [<ffffffc0002f9458>] SyS_openat+0xc/0x18
> In __clear_close_on_exec(fd, fdt);
>
>
>
> <6>[  442.695354] [<ffffffc00021d084>] warn_slowpath_fmt+0x4c/0x58
> <6>[  442.695359] [<ffffffc0002461a8>] __clear_bit+0x98/0xa4
> <6>[  442.695367] [<ffffffc000312340>] dup_fd+0x1d4/0x280
> <6>[  442.695375] [<ffffffc00021b07c>] copy_process.part.56+0x42c/0xe38
> <6>[  442.695382] [<ffffffc00021bb9c>] do_fork+0xe0/0x360
> <6>[  442.695389] [<ffffffc00021beb4>] SyS_clone+0x10/0x1c
> In __clear_open_fd(open_files - i, new_fdt);
>
> Do we need test_bit() before clear_bit()at these 2 place?
>

In the second case, new_fdt->open_fds has just been filled by a
memcpy, and no-one can possibly have written to that cache line in the
meantime. 

In the first case, testing is also likely wasteful if fdt->max_fds is
less than half the number of bits in a cacheline (fdt->close_on_exec and
fdt->open_fds are always contiguous, and the latter is unconditionally
written to).

Rasmus
diff mbox

Patch

diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 80e4645..a82937b 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -2,6 +2,7 @@ 
 #include <linux/hugetlb.h>
 #include <linux/init.h>
 #include <linux/kernel.h>
+#include <linux/module.h>
 #include <linux/mm.h>
 #include <linux/mman.h>
 #include <linux/mmzone.h>
@@ -15,6 +16,41 @@ 
 #include <asm/pgtable.h>
 #include "internal.h"
 
+atomic_t __set_bit_success_count = ATOMIC_INIT(0);
+atomic_t __set_bit_miss_count = ATOMIC_INIT(0);
+EXPORT_SYMBOL_GPL(__set_bit_success_count);
+EXPORT_SYMBOL_GPL(__set_bit_miss_count);
+
+atomic_t __clear_bit_success_count = ATOMIC_INIT(0);
+atomic_t __clear_bit_miss_count = ATOMIC_INIT(0);
+EXPORT_SYMBOL_GPL(__clear_bit_success_count);
+EXPORT_SYMBOL_GPL(__clear_bit_miss_count);
+
+atomic_t __test_and_set_bit_success_count = ATOMIC_INIT(0);
+atomic_t __test_and_set_bit_miss_count = ATOMIC_INIT(0);
+EXPORT_SYMBOL_GPL(__test_and_set_bit_success_count);
+EXPORT_SYMBOL_GPL(__test_and_set_bit_miss_count);
+
+atomic_t __test_and_clear_bit_success_count = ATOMIC_INIT(0);
+atomic_t __test_and_clear_bit_miss_count = ATOMIC_INIT(0);
+EXPORT_SYMBOL_GPL(__test_and_clear_bit_success_count);
+EXPORT_SYMBOL_GPL(__test_and_clear_bit_miss_count);
+
+/*
+ * atomic bitops
+ */
+atomic_t set_bit_success_count = ATOMIC_INIT(0);
+atomic_t set_bit_miss_count = ATOMIC_INIT(0);
+
+atomic_t clear_bit_success_count = ATOMIC_INIT(0);
+atomic_t clear_bit_miss_count = ATOMIC_INIT(0);
+
+atomic_t test_and_set_bit_success_count = ATOMIC_INIT(0);
+atomic_t test_and_set_bit_miss_count = ATOMIC_INIT(0);
+
+atomic_t test_and_clear_bit_success_count = ATOMIC_INIT(0);
+atomic_t test_and_clear_bit_miss_count = ATOMIC_INIT(0);
+
 void __attribute__((weak)) arch_report_meminfo(struct seq_file *m)
 {
 }
@@ -165,6 +201,18 @@  static int meminfo_proc_show(struct seq_file *m, void *v)
 		   HPAGE_PMD_NR)
 #endif
 		);
+	seq_printf(m,   "__set_bit_miss_count:%d __set_bit_success_count:%d\n"
+			"__clear_bit_miss_count:%d __clear_bit_success_count:%d\n"
+			"__test_and_set_bit_miss_count:%d __test_and_set_bit_success_count:%d\n"
+			"__test_and_clear_bit_miss_count:%d __test_and_clear_bit_success_count:%d\n",
+			atomic_read(&__set_bit_miss_count), atomic_read(&__set_bit_success_count),
+			atomic_read(&__clear_bit_miss_count), atomic_read(&__clear_bit_success_count),
+
+			atomic_read(&__test_and_set_bit_miss_count),
+			atomic_read(&__test_and_set_bit_success_count),
+
+			atomic_read(&__test_and_clear_bit_miss_count),
+			atomic_read(&__test_and_clear_bit_success_count));
 
 	hugetlb_report_meminfo(m);
 
diff --git a/include/asm-generic/bitops/non-atomic.h b/include/asm-generic/bitops/non-atomic.h
index 697cc2b..1895133 100644
--- a/include/asm-generic/bitops/non-atomic.h
+++ b/include/asm-generic/bitops/non-atomic.h
@@ -2,7 +2,18 @@ 
 #define _ASM_GENERIC_BITOPS_NON_ATOMIC_H_
 
 #include <asm/types.h>
+#include <asm/atomic.h>
+extern atomic_t __set_bit_success_count;
+extern atomic_t __set_bit_miss_count;
 
+extern atomic_t __clear_bit_success_count;
+extern atomic_t __clear_bit_miss_count;
+
+extern atomic_t __test_and_set_bit_success_count;
+extern atomic_t __test_and_set_bit_miss_count;
+
+extern atomic_t __test_and_clear_bit_success_count;
+extern atomic_t __test_and_clear_bit_miss_count;
 /**
  * __set_bit - Set a bit in memory
  * @nr: the bit to set
@@ -17,7 +28,13 @@  static inline void __set_bit(int nr, volatile unsigned long *addr)
 	unsigned long mask = BIT_MASK(nr);
 	unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
 
-	*p  |= mask;
+	if ((*p & mask) == 0) {
+		atomic_inc(&__set_bit_success_count);
+		*p  |= mask;
+	} else {
+		atomic_inc(&__set_bit_miss_count);
+	}
+
 }
 
 static inline void __clear_bit(int nr, volatile unsigned long *addr)
@@ -25,7 +42,12 @@  static inline void __clear_bit(int nr, volatile unsigned long *addr)
 	unsigned long mask = BIT_MASK(nr);
 	unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
 
-	*p &= ~mask;
+	if ((*p & mask) != 0) {
+		atomic_inc(&__clear_bit_success_count);
+		*p &= ~mask;
+	} else {
+		atomic_inc(&__clear_bit_miss_count);
+	}
 }
 
 /**
@@ -60,7 +82,12 @@  static inline int __test_and_set_bit(int nr, volatile unsigned long *addr)
 	unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
 	unsigned long old = *p;
 
-	*p = old | mask;
+	if ((old & mask) == 0) {
+		atomic_inc(&__test_and_set_bit_success_count);
+		*p = old | mask;
+	} else {
+		atomic_inc(&__test_and_set_bit_miss_count);
+	}
 	return (old & mask) != 0;
 }
 
@@ -79,7 +106,12 @@  static inline int __test_and_clear_bit(int nr, volatile unsigned long *addr)
 	unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);
 	unsigned long old = *p;
 
-	*p = old & ~mask;
+	if ((old & mask) != 0) {
+		atomic_inc(&__test_and_clear_bit_success_count);
+		*p = old & ~mask;
+	} else {
+		atomic_inc(&__test_and_clear_bit_miss_count);
+	}
 	return (old & mask) != 0;
 }
---
After use the phone some time: