
[RFC,V2] test_bit before clear files_struct bits

Message ID 35FD53F367049845BC99AC72306C23D1044A02027E19@CNBJMBX05.corpusers.net (mailing list archive)
State New, archived

Commit Message

Wang, Yalin Feb. 10, 2015, 7:11 a.m. UTC
Add a test_bit() check before clearing the close_on_exec and open_fds
bits.  Tracing __clear_bit() shows that at these two call sites the bit
is already clear most of the time, so testing first lets us skip the
clear and win in the common case.

Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
---
 fs/file.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Comments

Andrew Morton Feb. 10, 2015, 8:22 p.m. UTC | #1
(cc Linus for CPU-fu)

On Tue, 10 Feb 2015 15:11:37 +0800 "Wang, Yalin" <Yalin.Wang@sonymobile.com> wrote:

> Add a test_bit() check before clearing the close_on_exec and open_fds
> bits.  Tracing __clear_bit() shows that at these two call sites the bit
> is already clear most of the time, so testing first lets us skip the
> clear and win in the common case.
> 
> ...
>
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -209,7 +209,8 @@ static inline void __set_close_on_exec(int fd, struct fdtable *fdt)
>  
>  static inline void __clear_close_on_exec(int fd, struct fdtable *fdt)
>  {
> -	__clear_bit(fd, fdt->close_on_exec);
> +	if (test_bit(fd, fdt->close_on_exec))
> +		__clear_bit(fd, fdt->close_on_exec);
>  }
>  
>  static inline void __set_open_fd(int fd, struct fdtable *fdt)
> @@ -309,7 +310,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
>  		struct file *f = *old_fds++;
>  		if (f) {
>  			get_file(f);
> -		} else {
> +		} else if (test_bit(open_files - i, new_fdt->open_fds)) {
>  			/*
>  			 * The fd may be claimed in the fd bitmap but not yet
>  			 * instantiated in the files array if a sibling thread

The patch is good but I'm still wondering if any CPUs can do this
speedup for us.  The CPU has to pull in the target word to modify the
bit and what it *could* do is to avoid dirtying the cacheline if it
sees that the bit is already in the desired state.

However, some elapsed-time testing I did on a couple of Intel
machines indicates that these CPUs don't perform that optimisation. 
Perhaps there's some reason why they don't, dunno.
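
(Andrew's test program wasn't posted; purely as an illustration, an
elapsed-time comparison of that sort might look something like the
userspace sketch below. The buffer size and loop shape are assumptions,
not the actual test.)

/*
 * Clear a bit that is already clear in every word of a large buffer,
 * either unconditionally or guarded by a test, and time both passes.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NWORDS	(64UL * 1024 * 1024 / sizeof(unsigned long))

static double timed(void (*fn)(unsigned long *), unsigned long *buf)
{
	struct timespec a, b;

	clock_gettime(CLOCK_MONOTONIC, &a);
	fn(buf);
	clock_gettime(CLOCK_MONOTONIC, &b);
	return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

/* Unconditionally clear bit 0 of every word; every bit is already 0. */
static void clear_always(unsigned long *buf)
{
	for (size_t i = 0; i < NWORDS; i++)
		buf[i] &= ~1UL;
}

/* Clear bit 0 only if it is set; the store never happens here. */
static void clear_guarded(unsigned long *buf)
{
	for (size_t i = 0; i < NWORDS; i++)
		if (buf[i] & 1UL)
			buf[i] &= ~1UL;
}

int main(void)
{
	unsigned long *buf = calloc(NWORDS, sizeof(*buf));

	if (!buf)
		return 1;
	/* Touch the buffer once so page faults don't skew the first run. */
	clear_always(buf);
	printf("unconditional clear: %f s\n", timed(clear_always, buf));
	printf("guarded clear:       %f s\n", timed(clear_guarded, buf));
	free(buf);
	return 0;
}

If the CPU avoided dirtying an already-clean line, the two runs would be
expected to converge; on the machines described above they apparently
don't.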


Still, I think we should encapsulate the above (common) pattern into
helper functions in include/linux/bitops.h because

- it's cleaner

- it's self-documenting

- it permits us to eliminate the if(test_bit) on any CPU which does
  perform the optimisation internally, if such exists.
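
A sketch of what such helpers might look like (the names and their
placement in include/linux/bitops.h are assumptions, not existing API):

static inline void __clear_bit_if_set(int nr, volatile unsigned long *addr)
{
	/* Skip the read-modify-write store when the bit is already clear. */
	if (test_bit(nr, addr))
		__clear_bit(nr, addr);
}

static inline void __set_bit_if_clear(int nr, volatile unsigned long *addr)
{
	/* Skip the read-modify-write store when the bit is already set. */
	if (!test_bit(nr, addr))
		__set_bit(nr, addr);
}

An architecture whose CPUs did perform the optimisation internally could
then override these with plain __clear_bit()/__set_bit() without touching
the call sites.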


You actually have measurement results for these (and other)
set-bit-on-already-set-bit call sites.  Please include all of that info
in the changelog.

Linus Torvalds Feb. 10, 2015, 8:49 p.m. UTC | #2
On Tue, Feb 10, 2015 at 12:22 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> The patch is good but I'm still wondering if any CPUs can do this
> speedup for us.  The CPU has to pull in the target word to modify the
> bit and what it *could* do is to avoid dirtying the cacheline if it
> sees that the bit is already in the desired state.

Sadly, no CPU I know of actually does this.  Probably because it would
take a bit more core resources, and conditional writes to memory are
not normally part of an x86 core (it might be more natural for
something like old-style ARM that has conditional writes).

Also, even if the store were to be conditional, the cacheline would
have been acquired in exclusive state, and in many cache protocols the
state machine is from exclusive to dirty (since normally the only
reason to get a cacheline for exclusive use is in order to write to
it). So a "read, test, conditional write" ends up actually being more
complicated than you'd think - because you *want* that
exclusive->dirty state for the case where you really are going to
change the bit, and to avoid extra cache protocol stages you don't
generally want to read the cacheline into a shared read mode first
(only to then have to turn it into exclusive/dirty as a second state).

So at least on current x86 (and for reasons above, likely in the
future, including other architectures with read-modify-write memory
access models), the default assumption is that the bit operations will
actually change the bit, and unlikely bit setting/clearing for
cachelines that are very likely to otherwise stay clean should
probably be conditionalized in software. Like in this patch.

                         Linus
Kirill A. Shutemov Feb. 10, 2015, 10:46 p.m. UTC | #3
On Tue, Feb 10, 2015 at 12:49:46PM -0800, Linus Torvalds wrote:
> On Tue, Feb 10, 2015 at 12:22 PM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
> >
> > The patch is good but I'm still wondering if any CPUs can do this
> > speedup for us.  The CPU has to pull in the target word to modify the
> > bit and what it *could* do is to avoid dirtying the cacheline if it
> > sees that the bit is already in the desired state.
> 
> Sadly, no CPU I know of actually does this.  Probably because it would
> take a bit more core resources, and conditional writes to memory are
> not normally part of an x86 core (it might be more natural for
> something like old-style ARM that has conditional writes).
> 
> Also, even if the store were to be conditional, the cacheline would
> have been acquired in exclusive state, and in many cache protocols the
> state machine is from exclusive to dirty (since normally the only
> reason to get a cacheline for exclusive use is in order to write to
> it). So a "read, test, conditional write" ends up actually being more
> complicated than you'd think - because you *want* that
> exclusive->dirty state for the case where you really are going to
> change the bit, and to avoid extra cache protocol stages you don't
> generally want to read the cacheline into a shared read mode first
> (only to then have to turn it into exclusive/dirty as a second state).

That all sounds reasonable.

But I still fail to understand why my micro-benchmark is faster with a
branch before the store compared to a plain store.

http://article.gmane.org/gmane.linux.kernel.cross-arch/26254

In this case we would not have an intermediate shared state, because there
is nobody else to share the cache line with. So with the branch we would
have the same E->M transition and write-back to memory as without it. But
that doesn't explain why the branch makes the code faster.

Any ideas?
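
(Kirill's benchmark itself is only available via the gmane link above;
purely as a guess at its shape, a one-word-per-cacheline loop comparing a
plain store against a branch-before-store might look like the userspace
sketch below. The sizes and stride are assumptions.)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUFSIZE	(256UL * 1024 * 1024)
#define NWORDS	(BUFSIZE / sizeof(unsigned long))
#define STRIDE	(64 / sizeof(unsigned long))	/* one word per 64-byte line */

static double since(const struct timespec *a)
{
	struct timespec b;

	clock_gettime(CLOCK_MONOTONIC, &b);
	return (b.tv_sec - a->tv_sec) + (b.tv_nsec - a->tv_nsec) / 1e9;
}

int main(void)
{
	unsigned long *buf = calloc(NWORDS, sizeof(*buf));
	struct timespec t;
	size_t i;

	if (!buf)
		return 1;
	memset(buf, 0, BUFSIZE);	/* fault all pages in up front */

	/* Plain store: unconditionally set bit 0 in one word per line. */
	clock_gettime(CLOCK_MONOTONIC, &t);
	for (i = 0; i < NWORDS; i += STRIDE)
		buf[i] |= 1UL;
	printf("plain store:         %f s\n", since(&t));

	memset(buf, 0, BUFSIZE);	/* clear again so the branch never skips */

	/* Branch before store: load and test the bit, then do the same store. */
	clock_gettime(CLOCK_MONOTONIC, &t);
	for (i = 0; i < NWORDS; i += STRIDE)
		if (!(buf[i] & 1UL))
			buf[i] |= 1UL;
	printf("branch before store: %f s\n", since(&t));

	free(buf);
	return 0;
}
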
Linus Torvalds Feb. 10, 2015, 11:29 p.m. UTC | #4
On Tue, Feb 10, 2015 at 2:46 PM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
>
> But I still fail to understand why my micro-benchmark is faster with a
> branch before the store compared to a plain store.

Very tight artificial loops like that tend to be horrible for
performance analysis on modern cores, because you end up seeing mostly
random microarchitectural details rather than any real performance.

At a guess, since you write just one word per cacheline, what happens
is that the store buffer continually fills up faster than the stores
get drained to cache. So then the stores start stalling. The extra load
- that you expect to slow things down - likely ends up effectively just
prefetching the hot L2 cacheline into L1 so that the store buffer then
drains more cleanly.

And both the load and the branch are effectively free, because the
branch predicts perfectly, and the load just prefetches a cacheline
that will have to be fetched for the subsequent store buffer drain
anyway.

And as you say, there are no cacheline bouncing issues, and the working
set presumably fits in the caches - even if it doesn't fit in the L1.

But that's all just a wild guess.

It could equally easily be some very specific microarchitectural store
buffer stall due to the simpler loop hitting just the right cycle
count between stores.  There are all kinds of random odd small
corner-cases that are generally very very rare and hidden in the
noise, but then a loop with just the right strides can happen to just
hit them. It used to be *trivial* to hit things like address
generation stalls, and even though modern Intel CPU's tend to be quite
robust performance-wise, it's not true that they always handle any
code sequence "perfectly".

                            Linus

Patch

diff --git a/fs/file.c b/fs/file.c
index ee738ea..b0e059c 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -209,7 +209,8 @@  static inline void __set_close_on_exec(int fd, struct fdtable *fdt)
 
 static inline void __clear_close_on_exec(int fd, struct fdtable *fdt)
 {
-	__clear_bit(fd, fdt->close_on_exec);
+	if (test_bit(fd, fdt->close_on_exec))
+		__clear_bit(fd, fdt->close_on_exec);
 }
 
 static inline void __set_open_fd(int fd, struct fdtable *fdt)
@@ -309,7 +310,7 @@  struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
 		struct file *f = *old_fds++;
 		if (f) {
 			get_file(f);
-		} else {
+		} else if (test_bit(open_files - i, new_fdt->open_fds)) {
 			/*
 			 * The fd may be claimed in the fd bitmap but not yet
 			 * instantiated in the files array if a sibling thread