
[v2,1/3] fs/file.c: add fast path in alloc_fd()

Message ID 20240622154904.3774273-2-yu.ma@intel.com (mailing list archive)
State New
Series fs/file.c: optimize the critical section of file_lock in

Commit Message

Ma, Yu June 22, 2024, 3:49 p.m. UTC
For most cases there is an available fd in the lower 64 bits of the open_fds
bitmap when we look for an available fd slot. Skip the 2-level search via
find_next_zero_bit() for this common fast path.

Look directly for an open bit in the lower 64 bits of the open_fds bitmap
when a free slot is available there, as:
(1) The fd allocation algorithm always allocates fds from small to large, so
the lower bits of the open_fds bitmap are used much more frequently than the
higher bits.
(2) After the fdt is expanded (the bitmap size doubles on each expansion), it
is never shrunk. The search size increases, but few open fds are available in
the higher bits.
(3) find_next_zero_bit() itself has an internal fast path to speed up the
search when size <= 64.

Besides, "!start" is added to the fast path condition to ensure the allocated
fd is not smaller than start (with start == 0, any fd qualifies), given that
alloc_fd() is called in only two scenarios:
(1) Allocating a new fd (the most common usage scenario) via
get_unused_fd_flags(), which searches from bit 0 in the fdt (i.e. start == 0).
(2) Duplicating an fd (less common usage) via dup_fd(), which searches from
the old fd's index in the fdt and is only reached via the fcntl syscall.

With the fast path added in alloc_fd(), pts/blogbench-1.1.0 read improves by
17% and write by 9% on an Intel ICX 160-core configuration with v6.10-rc4.

Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Yu Ma <yu.ma@intel.com>
---
 fs/file.c | 35 +++++++++++++++++++++--------------
 1 file changed, 21 insertions(+), 14 deletions(-)

Comments

Jan Kara June 25, 2024, 11:52 a.m. UTC | #1
On Sat 22-06-24 11:49:02, Yu Ma wrote:
> There is available fd in the lower 64 bits of open_fds bitmap for most cases
> when we look for an available fd slot. Skip 2-levels searching via
> find_next_zero_bit() for this common fast path.
> 
> Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
> free slot is available there, as:
> (1) The fd allocation algorithm would always allocate fd from small to large.
> Lower bits in open_fds bitmap would be used much more frequently than higher
> bits.
> (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
> it would never be shrunk. The search size increases but there are few open fds
> available here.
> (3) find_next_zero_bit() itself has a fast path inside to speed up searching
> when size<=64.
> 
> Besides, "!start" is added to fast path condition to ensure the allocated fd is
> greater than start (i.e. >=0), given alloc_fd() is only called in two scenarios:
> (1) Allocating a new fd (the most common usage scenario) via
> get_unused_fd_flags() to find fd start from bit 0 in fdt (i.e. start==0).
> (2) Duplicating a fd (less common usage) via dup_fd() to find a fd start from
> old_fd's index in fdt, which is only called by syscall fcntl.
> 
> With the fast path added in alloc_fd(), pts/blogbench-1.1.0 read is improved
> by 17% and write by 9% on Intel ICX 160 cores configuration with v6.10-rc4.
> 
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Yu Ma <yu.ma@intel.com>
> ---
>  fs/file.c | 35 +++++++++++++++++++++--------------
>  1 file changed, 21 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/file.c b/fs/file.c
> index a3b72aa64f11..50e900a47107 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -515,28 +515,35 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
>  	if (fd < files->next_fd)
>  		fd = files->next_fd;
>  
> -	if (fd < fdt->max_fds)
> +	error = -EMFILE;
> +	if (likely(fd < fdt->max_fds)) {
> +		if (~fdt->open_fds[0] && !start) {
> +			fd = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, fd);

So I don't think this is quite correct. If files->next_fd is set, we could
end up calling find_next_zero_bit() starting from quite high offset causing
a regression? Also because we don't expand in this case, we could cause access
beyond end of fdtable?

Finally, AFAIU this speeds up the lookup for cases where fd < 64 is
available at the cost of cases where the first long is full (there we
unnecessarily load open_fds[0] into cache). Did you check if the cost is
visible (e.g. by making blogbench occupy first 64 fds before starting its
load)?

								Honza

> +			goto fastreturn;
> +		}
>  		fd = find_next_fd(fdt, fd);
> +	}
> +
> +	if (unlikely(fd >= fdt->max_fds)) {
> +		error = expand_files(files, fd);
> +		if (error < 0)
> +			goto out;
> +		/*
> +		 * If we needed to expand the fs array we
> +		 * might have blocked - try again.
> +		 */
> +		if (error)
> +			goto repeat;
> +	}
>  
> +fastreturn:
>  	/*
>  	 * N.B. For clone tasks sharing a files structure, this test
>  	 * will limit the total number of files that can be opened.
>  	 */
> -	error = -EMFILE;
> -	if (fd >= end)
> +	if (unlikely(fd >= end))
>  		goto out;
>  
> -	error = expand_files(files, fd);
> -	if (error < 0)
> -		goto out;
> -
> -	/*
> -	 * If we needed to expand the fs array we
> -	 * might have blocked - try again.
> -	 */
> -	if (error)
> -		goto repeat;
> -
>  	if (start <= files->next_fd)
>  		files->next_fd = fd + 1;
>  
> -- 
> 2.43.0
>
Jan Kara June 25, 2024, 12:53 p.m. UTC | #2
On Tue 25-06-24 13:52:57, Jan Kara wrote:
> On Sat 22-06-24 11:49:02, Yu Ma wrote:
> > There is available fd in the lower 64 bits of open_fds bitmap for most cases
> > when we look for an available fd slot. Skip 2-levels searching via
> > find_next_zero_bit() for this common fast path.
> > 
> > Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
> > free slot is available there, as:
> > (1) The fd allocation algorithm would always allocate fd from small to large.
> > Lower bits in open_fds bitmap would be used much more frequently than higher
> > bits.
> > (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
> > it would never be shrunk. The search size increases but there are few open fds
> > available here.
> > (3) find_next_zero_bit() itself has a fast path inside to speed up searching
> > when size<=64.
> > 
> > Besides, "!start" is added to fast path condition to ensure the allocated fd is
> > greater than start (i.e. >=0), given alloc_fd() is only called in two scenarios:
> > (1) Allocating a new fd (the most common usage scenario) via
> > get_unused_fd_flags() to find fd start from bit 0 in fdt (i.e. start==0).
> > (2) Duplicating a fd (less common usage) via dup_fd() to find a fd start from
> > old_fd's index in fdt, which is only called by syscall fcntl.
> > 
> > With the fast path added in alloc_fd(), pts/blogbench-1.1.0 read is improved
> > by 17% and write by 9% on Intel ICX 160 cores configuration with v6.10-rc4.
> > 
> > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > ---
> >  fs/file.c | 35 +++++++++++++++++++++--------------
> >  1 file changed, 21 insertions(+), 14 deletions(-)
> > 
> > diff --git a/fs/file.c b/fs/file.c
> > index a3b72aa64f11..50e900a47107 100644
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -515,28 +515,35 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> >  	if (fd < files->next_fd)
> >  		fd = files->next_fd;
> >  
> > -	if (fd < fdt->max_fds)
> > +	error = -EMFILE;
> > +	if (likely(fd < fdt->max_fds)) {
> > +		if (~fdt->open_fds[0] && !start) {
> > +			fd = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, fd);
> 
> So I don't think this is quite correct. If files->next_fd is set, we could
> end up calling find_next_zero_bit() starting from quite high offset causing
> a regression? Also because we don't expand in this case, we could cause access
> beyond end of fdtable?

OK, I've misunderstood the next_fd logic. next_fd is the lowest 0-bit in
the open_fds bitmap so if next_fd is big, the ~fdt->open_fds[0] should
be false. As such the above condition could be rewritten as:

		if (!start && files->next_fd < BITS_PER_LONG)

to avoid loading the first bitmap long if we know it is full? Or we could
maybe go as far as:

		if (!start && fd < BITS_PER_LONG && !test_bit(fd, fdt->open_fds))
			goto fastreturn;

because AFAIU this should work in exactly the same cases as your code?

								Honza

> > +			goto fastreturn;
> > +		}
> >  		fd = find_next_fd(fdt, fd);
> > +	}
> > +
> > +	if (unlikely(fd >= fdt->max_fds)) {
> > +		error = expand_files(files, fd);
> > +		if (error < 0)
> > +			goto out;
> > +		/*
> > +		 * If we needed to expand the fs array we
> > +		 * might have blocked - try again.
> > +		 */
> > +		if (error)
> > +			goto repeat;
> > +	}
> >  
> > +fastreturn:
> >  	/*
> >  	 * N.B. For clone tasks sharing a files structure, this test
> >  	 * will limit the total number of files that can be opened.
> >  	 */
> > -	error = -EMFILE;
> > -	if (fd >= end)
> > +	if (unlikely(fd >= end))
> >  		goto out;
> >  
> > -	error = expand_files(files, fd);
> > -	if (error < 0)
> > -		goto out;
> > -
> > -	/*
> > -	 * If we needed to expand the fs array we
> > -	 * might have blocked - try again.
> > -	 */
> > -	if (error)
> > -		goto repeat;
> > -
> >  	if (start <= files->next_fd)
> >  		files->next_fd = fd + 1;
> >  
> > -- 
> > 2.43.0
> > 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
>
Ma, Yu June 25, 2024, 3:33 p.m. UTC | #3
On 6/25/2024 8:53 PM, Jan Kara wrote:
> On Tue 25-06-24 13:52:57, Jan Kara wrote:
>> On Sat 22-06-24 11:49:02, Yu Ma wrote:
>>> There is available fd in the lower 64 bits of open_fds bitmap for most cases
>>> when we look for an available fd slot. Skip 2-levels searching via
>>> find_next_zero_bit() for this common fast path.
>>>
>>> Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
>>> free slot is available there, as:
>>> (1) The fd allocation algorithm would always allocate fd from small to large.
>>> Lower bits in open_fds bitmap would be used much more frequently than higher
>>> bits.
>>> (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
>>> it would never be shrunk. The search size increases but there are few open fds
>>> available here.
>>> (3) find_next_zero_bit() itself has a fast path inside to speed up searching
>>> when size<=64.
>>>
>>> Besides, "!start" is added to fast path condition to ensure the allocated fd is
>>> greater than start (i.e. >=0), given alloc_fd() is only called in two scenarios:
>>> (1) Allocating a new fd (the most common usage scenario) via
>>> get_unused_fd_flags() to find fd start from bit 0 in fdt (i.e. start==0).
>>> (2) Duplicating a fd (less common usage) via dup_fd() to find a fd start from
>>> old_fd's index in fdt, which is only called by syscall fcntl.
>>>
>>> With the fast path added in alloc_fd(), pts/blogbench-1.1.0 read is improved
>>> by 17% and write by 9% on Intel ICX 160 cores configuration with v6.10-rc4.
>>>
>>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>>> Signed-off-by: Yu Ma <yu.ma@intel.com>
>>> ---
>>>   fs/file.c | 35 +++++++++++++++++++++--------------
>>>   1 file changed, 21 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/fs/file.c b/fs/file.c
>>> index a3b72aa64f11..50e900a47107 100644
>>> --- a/fs/file.c
>>> +++ b/fs/file.c
>>> @@ -515,28 +515,35 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
>>>   	if (fd < files->next_fd)
>>>   		fd = files->next_fd;
>>>   
>>> -	if (fd < fdt->max_fds)
>>> +	error = -EMFILE;
>>> +	if (likely(fd < fdt->max_fds)) {
>>> +		if (~fdt->open_fds[0] && !start) {
>>> +			fd = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, fd);
>> So I don't think this is quite correct. If files->next_fd is set, we could
>> end up calling find_next_zero_bit() starting from quite high offset causing
>> a regression? Also because we don't expand in this case, we could cause access
>> beyond end of fdtable?
> OK, I've misunderstood the next_fd logic. next_fd is the lowest 0-bit in
> the open_fds bitmap so if next_fd is big, the ~fdt->open_fds[0] should
> be false. As such the above condition could be rewritten as:
>
> 		if (!start && files->next_fd < BITS_PER_LONG)
>
> to avoid loading the first bitmap long if we know it is full? Or we could
> maybe go as far as:
>
> 		if (!start && fd < BITS_PER_LONG && !test_bit(fd, fdt->open_fds))
> 			goto fastreturn;
>
> because AFAIU this should work in exactly the same cases as your code?
>
> 								Honza

Thanks Honza for the good concern and suggestions. However, both of the
above conditions are not enough to ensure that there is an available fd in
the first 64 bits of open_fds. next_fd only means there is no available fd
before next_fd. Imagine fds 0 to 66 are already occupied and fd 3 is then
returned: next_fd would be set to 3 by the fd recycling logic (i.e. in
__put_unused_fd()). The next time alloc_fd() is called, it returns fd 3 to
the caller and sets next_fd = 4. The time after that, next_fd == 4, but fd 4
is actually already occupied, so find_next_zero_bit() is needed to find the
real 0 bit anyway. The condition should either be as it is in the patch, or
if (!start && !test_bit(0, fdt->full_fds_bits)); the latter also has a
bitmap loading cost. But another point is that a bit in full_fds_bits
represents 64 bits in open_fds, so regardless of whether fd > 64,
full_fds_bits would have to be loaded anyway; maybe we can modify the
condition to use full_fds_bits?
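To make the recycling argument above concrete, here is a minimal user-space
model (hypothetical code, not the kernel implementation) of a single 64-bit
open_fds word with the next_fd hint:

```c
#include <assert.h>

/* Hypothetical single-word model of the fd bitmap plus the next_fd hint.
 * It shows that next_fd is only a lower bound for the first free fd:
 * after a recycled fd is handed out, next_fd may point at an occupied
 * slot, so a zero-bit scan is still required. */
static unsigned long open_fds;   /* one 64-bit word of the bitmap */
static unsigned int next_fd;     /* lower bound for the first free fd */

static int model_alloc_fd(void)
{
	unsigned int fd = next_fd;

	/* single-word stand-in for find_next_zero_bit() */
	while (fd < 64 && (open_fds & (1UL << fd)))
		fd++;
	if (fd >= 64)
		return -1;       /* would need fdtable expansion */
	open_fds |= 1UL << fd;
	next_fd = fd + 1;
	return (int)fd;
}

static void model_put_fd(unsigned int fd)
{
	open_fds &= ~(1UL << fd);
	if (fd < next_fd)
		next_fd = fd;    /* mirrors the __put_unused_fd() logic */
}
```

With fds 0..9 allocated, releasing fd 3 sets next_fd = 3; the next
allocation returns 3 and bumps next_fd to 4, which is still occupied, so the
allocation after that has to scan forward to 10.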

>>> +			goto fastreturn;
>>> +		}
>>>   		fd = find_next_fd(fdt, fd);
>>> +	}
>>> +
>>> +	if (unlikely(fd >= fdt->max_fds)) {
>>> +		error = expand_files(files, fd);
>>> +		if (error < 0)
>>> +			goto out;
>>> +		/*
>>> +		 * If we needed to expand the fs array we
>>> +		 * might have blocked - try again.
>>> +		 */
>>> +		if (error)
>>> +			goto repeat;
>>> +	}
>>>   
>>> +fastreturn:
>>>   	/*
>>>   	 * N.B. For clone tasks sharing a files structure, this test
>>>   	 * will limit the total number of files that can be opened.
>>>   	 */
>>> -	error = -EMFILE;
>>> -	if (fd >= end)
>>> +	if (unlikely(fd >= end))
>>>   		goto out;
>>>   
>>> -	error = expand_files(files, fd);
>>> -	if (error < 0)
>>> -		goto out;
>>> -
>>> -	/*
>>> -	 * If we needed to expand the fs array we
>>> -	 * might have blocked - try again.
>>> -	 */
>>> -	if (error)
>>> -		goto repeat;
>>> -
>>>   	if (start <= files->next_fd)
>>>   		files->next_fd = fd + 1;
>>>   
>>> -- 
>>> 2.43.0
>>>
>> -- 
>> Jan Kara <jack@suse.com>
>> SUSE Labs, CR
>>
Jan Kara June 26, 2024, 11:54 a.m. UTC | #4
On Tue 25-06-24 23:33:34, Ma, Yu wrote:
> On 6/25/2024 8:53 PM, Jan Kara wrote:
> > On Tue 25-06-24 13:52:57, Jan Kara wrote:
> > > On Sat 22-06-24 11:49:02, Yu Ma wrote:
> > > > There is available fd in the lower 64 bits of open_fds bitmap for most cases
> > > > when we look for an available fd slot. Skip 2-levels searching via
> > > > find_next_zero_bit() for this common fast path.
> > > > 
> > > > Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
> > > > free slot is available there, as:
> > > > (1) The fd allocation algorithm would always allocate fd from small to large.
> > > > Lower bits in open_fds bitmap would be used much more frequently than higher
> > > > bits.
> > > > (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
> > > > it would never be shrunk. The search size increases but there are few open fds
> > > > available here.
> > > > (3) find_next_zero_bit() itself has a fast path inside to speed up searching
> > > > when size<=64.
> > > > 
> > > > Besides, "!start" is added to fast path condition to ensure the allocated fd is
> > > > greater than start (i.e. >=0), given alloc_fd() is only called in two scenarios:
> > > > (1) Allocating a new fd (the most common usage scenario) via
> > > > get_unused_fd_flags() to find fd start from bit 0 in fdt (i.e. start==0).
> > > > (2) Duplicating a fd (less common usage) via dup_fd() to find a fd start from
> > > > old_fd's index in fdt, which is only called by syscall fcntl.
> > > > 
> > > > With the fast path added in alloc_fd(), pts/blogbench-1.1.0 read is improved
> > > > by 17% and write by 9% on Intel ICX 160 cores configuration with v6.10-rc4.
> > > > 
> > > > Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> > > > Signed-off-by: Yu Ma <yu.ma@intel.com>
> > > > ---
> > > >   fs/file.c | 35 +++++++++++++++++++++--------------
> > > >   1 file changed, 21 insertions(+), 14 deletions(-)
> > > > 
> > > > diff --git a/fs/file.c b/fs/file.c
> > > > index a3b72aa64f11..50e900a47107 100644
> > > > --- a/fs/file.c
> > > > +++ b/fs/file.c
> > > > @@ -515,28 +515,35 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> > > >   	if (fd < files->next_fd)
> > > >   		fd = files->next_fd;
> > > > -	if (fd < fdt->max_fds)
> > > > +	error = -EMFILE;
> > > > +	if (likely(fd < fdt->max_fds)) {
> > > > +		if (~fdt->open_fds[0] && !start) {
> > > > +			fd = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, fd);
> > > So I don't think this is quite correct. If files->next_fd is set, we could
> > > end up calling find_next_zero_bit() starting from quite high offset causing
> > > a regression? Also because we don't expand in this case, we could cause access
> > > beyond end of fdtable?
> > OK, I've misunderstood the next_fd logic. next_fd is the lowest 0-bit in
> > the open_fds bitmap so if next_fd is big, the ~fdt->open_fds[0] should
> > be false. As such the above condition could be rewritten as:
> > 
> > 		if (!start && files->next_fd < BITS_PER_LONG)
> > 
> > to avoid loading the first bitmap long if we know it is full? Or we could
> > maybe go as far as:
> > 
> > 		if (!start && fd < BITS_PER_LONG && !test_bit(fd, fdt->open_fds))
> > 			goto fastreturn;
> > 
> > because AFAIU this should work in exactly the same cases as your code?
> > 
> > 								Honza
> 
> Thanks Honza for the good concern and suggestions here, while both above
> conditions are not enough to ensure that there is available fd in the first
> 64 bits of open_fds. As next_fd just means there is no available fd before
> next_fd, just imagine that fd from 0 to 66 are already occupied, now fd=3 is
> returned back, then next_fd would be set as 3 per fd recycling logic (i.e.
> in __put_unused_fd()), next time when alloc_fd() being called, it would
> return fd=3 to the caller and set next_fd=4. Then next time when alloc_fd()
> being called again, next_fd==4, but actually it's already been occupied. So
> find_next_zero_bit() is needed to find the real 0 bit anyway.

Indeed, thanks for correcting me! next_fd is just a lower bound for the
first free fd.

> The conditions
> should either be like it is in patch or if (!start && !test_bit(0,
> fdt->full_fds_bits)), the latter should also have the bitmap loading cost,
> but another point is that a bit in full_fds_bits represents 64 bits in
> open_fds, no matter fd >64 or not, full_fds_bits should be loaded any way,
> maybe we can modify the condition to use full_fds_bits ?

So maybe I'm wrong but I think the biggest benefit of your code compared to
plain find_next_fd() is exactly in that we don't have to load full_fds_bits
into cache. So I'm afraid that using full_fds_bits in the condition would
destroy your performance gains. Thinking about this with a fresh head how
about putting implementing your optimization like:

--- a/fs/file.c
+++ b/fs/file.c
@@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
        unsigned int maxbit = maxfd / BITS_PER_LONG;
        unsigned int bitbit = start / BITS_PER_LONG;
 
+       /*
+        * Optimistically search the first long of the open_fds bitmap. It
+        * saves us from loading full_fds_bits into cache in the common case
+        * and because BITS_PER_LONG > start >= files->next_fd, we have quite
+        * a good chance there's a bit free in there.
+        */
+       if (start < BITS_PER_LONG) {
+               unsigned int bit;
+
+               bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
+               if (bit < BITS_PER_LONG)
+                       return bit;
+       }
+
        bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
        if (bitbit >= maxfd)
                return maxfd;

Plus your optimizations with likely / unlikely. This way the code flow in
alloc_fd() stays more readable, we avoid loading the first open_fds long
into cache if it is full, and we should get all the performance benefits?

								Honza
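A user-space sketch of this shape of find_next_fd() (a hypothetical model
with fixed sizes, not the actual kernel code) may help illustrate how the
optimistic first-word check avoids loading full_fds_bits:

```c
#include <assert.h>

#define BITS_PER_LONG 64
#define MAX_FDS 256

/* Hypothetical model: open_fds is the per-fd bitmap, full_fds_bits has
 * one bit per open_fds word that is completely full. */
static unsigned long open_fds[MAX_FDS / BITS_PER_LONG];
static unsigned long full_fds_bits[1];

/* bit-by-bit stand-in for find_next_zero_bit() */
static unsigned int fnzb(const unsigned long *map, unsigned int size,
			 unsigned int from)
{
	for (; from < size; from++)
		if (!(map[from / BITS_PER_LONG] & (1UL << (from % BITS_PER_LONG))))
			break;
	return from;
}

static void model_occupy(unsigned int fd)
{
	open_fds[fd / BITS_PER_LONG] |= 1UL << (fd % BITS_PER_LONG);
	if (open_fds[fd / BITS_PER_LONG] == ~0UL)
		full_fds_bits[0] |= 1UL << (fd / BITS_PER_LONG);
}

static unsigned int model_find_next_fd(unsigned int start)
{
	unsigned int maxbit = MAX_FDS / BITS_PER_LONG;
	unsigned int bitbit = start / BITS_PER_LONG;

	/* Optimistic fast path: scan only the first word and skip
	 * touching full_fds_bits when a free bit is found there. */
	if (start < BITS_PER_LONG) {
		unsigned int bit = fnzb(open_fds, BITS_PER_LONG, start);
		if (bit < BITS_PER_LONG)
			return bit;
	}

	/* Slow path: first not-full word, then a zero bit inside it. */
	bitbit = fnzb(full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
	if (bitbit >= MAX_FDS)
		return MAX_FDS;
	if (bitbit > start)
		start = bitbit;
	return fnzb(open_fds, MAX_FDS, start);
}
```

When the first word is full, the fast-path scan fails and the lookup falls
through to the two-level path, matching the flow of the proposed diff.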

 
> > > > +			goto fastreturn;
> > > > +		}
> > > >   		fd = find_next_fd(fdt, fd);
> > > > +	}
> > > > +
> > > > +	if (unlikely(fd >= fdt->max_fds)) {
> > > > +		error = expand_files(files, fd);
> > > > +		if (error < 0)
> > > > +			goto out;
> > > > +		/*
> > > > +		 * If we needed to expand the fs array we
> > > > +		 * might have blocked - try again.
> > > > +		 */
> > > > +		if (error)
> > > > +			goto repeat;
> > > > +	}
> > > > +fastreturn:
> > > >   	/*
> > > >   	 * N.B. For clone tasks sharing a files structure, this test
> > > >   	 * will limit the total number of files that can be opened.
> > > >   	 */
> > > > -	error = -EMFILE;
> > > > -	if (fd >= end)
> > > > +	if (unlikely(fd >= end))
> > > >   		goto out;
> > > > -	error = expand_files(files, fd);
> > > > -	if (error < 0)
> > > > -		goto out;
> > > > -
> > > > -	/*
> > > > -	 * If we needed to expand the fs array we
> > > > -	 * might have blocked - try again.
> > > > -	 */
> > > > -	if (error)
> > > > -		goto repeat;
> > > > -
> > > >   	if (start <= files->next_fd)
> > > >   		files->next_fd = fd + 1;
> > > > -- 
> > > > 2.43.0
> > > > 
> > > -- 
> > > Jan Kara <jack@suse.com>
> > > SUSE Labs, CR
> > > 
>
Tim Chen June 26, 2024, 4:43 p.m. UTC | #5
On Wed, 2024-06-26 at 13:54 +0200, Jan Kara wrote:
> 
> 
> Indeed, thanks for correcting me! next_fd is just a lower bound for the
> first free fd.
> 
> > The conditions
> > should either be like it is in patch or if (!start && !test_bit(0,
> > fdt->full_fds_bits)), the latter should also have the bitmap loading cost,
> > but another point is that a bit in full_fds_bits represents 64 bits in
> > open_fds, no matter fd >64 or not, full_fds_bits should be loaded any way,
> > maybe we can modify the condition to use full_fds_bits ?
> 
> So maybe I'm wrong but I think the biggest benefit of your code compared to
> plain find_next_fd() is exactly in that we don't have to load full_fds_bits
> into cache. So I'm afraid that using full_fds_bits in the condition would
> destroy your performance gains. Thinking about this with a fresh head how
> about putting implementing your optimization like:
> 
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
>         unsigned int maxbit = maxfd / BITS_PER_LONG;
>         unsigned int bitbit = start / BITS_PER_LONG;
>  
> +       /*
> +        * Optimistically search the first long of the open_fds bitmap. It
> +        * saves us from loading full_fds_bits into cache in the common case
> +        * and because BITS_PER_LONG > start >= files->next_fd, we have quite
> +        * a good chance there's a bit free in there.
> +        */
> +       if (start < BITS_PER_LONG) {
> +               unsigned int bit;
> +
> +               bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);

Say start is 31 (< BITS_PER_LONG)
bit found here could be 32 and greater than start.  Do we care if we return bit > start?

Tim

> +               if (bit < BITS_PER_LONG)
> +                       return bit;
> +       }
> +
>         bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
>         if (bitbit >= maxfd)
>                 return maxfd;
> 
> Plus your optimizations with likely / unlikely. This way the code flow in
> alloc_fd() stays more readable, we avoid loading the first open_fds long
> into cache if it is full, and we should get all the performance benefits?
> 
> 								Honza
> 
>  
> > > > > +			goto fastreturn;
> > > > > +		}
> > > > >   		fd = find_next_fd(fdt, fd);
> > > > > +	}
> > > > > +
> > > > > +	if (unlikely(fd >= fdt->max_fds)) {
> > > > > +		error = expand_files(files, fd);
> > > > > +		if (error < 0)
> > > > > +			goto out;
> > > > > +		/*
> > > > > +		 * If we needed to expand the fs array we
> > > > > +		 * might have blocked - try again.
> > > > > +		 */
> > > > > +		if (error)
> > > > > +			goto repeat;
> > > > > +	}
> > > > > +fastreturn:
> > > > >   	/*
> > > > >   	 * N.B. For clone tasks sharing a files structure, this test
> > > > >   	 * will limit the total number of files that can be opened.
> > > > >   	 */
> > > > > -	error = -EMFILE;
> > > > > -	if (fd >= end)
> > > > > +	if (unlikely(fd >= end))
> > > > >   		goto out;
> > > > > -	error = expand_files(files, fd);
> > > > > -	if (error < 0)
> > > > > -		goto out;
> > > > > -
> > > > > -	/*
> > > > > -	 * If we needed to expand the fs array we
> > > > > -	 * might have blocked - try again.
> > > > > -	 */
> > > > > -	if (error)
> > > > > -		goto repeat;
> > > > > -
> > > > >   	if (start <= files->next_fd)
> > > > >   		files->next_fd = fd + 1;
> > > > > -- 
> > > > > 2.43.0
> > > > > 
> > > > -- 
> > > > Jan Kara <jack@suse.com>
> > > > SUSE Labs, CR
> > > > 
> >
Tim Chen June 26, 2024, 4:52 p.m. UTC | #6
On Wed, 2024-06-26 at 09:43 -0700, Tim Chen wrote:
> On Wed, 2024-06-26 at 13:54 +0200, Jan Kara wrote:
> > 
> > 
> > Indeed, thanks for correcting me! next_fd is just a lower bound for the
> > first free fd.
> > 
> > > The conditions
> > > should either be like it is in patch or if (!start && !test_bit(0,
> > > fdt->full_fds_bits)), the latter should also have the bitmap loading cost,
> > > but another point is that a bit in full_fds_bits represents 64 bits in
> > > open_fds, no matter fd >64 or not, full_fds_bits should be loaded any way,
> > > maybe we can modify the condition to use full_fds_bits ?
> > 
> > So maybe I'm wrong but I think the biggest benefit of your code compared to
> > plain find_next_fd() is exactly in that we don't have to load full_fds_bits
> > into cache. So I'm afraid that using full_fds_bits in the condition would
> > destroy your performance gains. Thinking about this with a fresh head how
> > about putting implementing your optimization like:
> > 
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> >         unsigned int maxbit = maxfd / BITS_PER_LONG;
> >         unsigned int bitbit = start / BITS_PER_LONG;
> >  
> > +       /*
> > +        * Optimistically search the first long of the open_fds bitmap. It
> > +        * saves us from loading full_fds_bits into cache in the common case
> > +        * and because BITS_PER_LONG > start >= files->next_fd, we have quite
> > +        * a good chance there's a bit free in there.
> > +        */
> > +       if (start < BITS_PER_LONG) {
> > +               unsigned int bit;
> > +
> > +               bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> 
> Say start is 31 (< BITS_PER_LONG)
> bit found here could be 32 and greater than start.  Do we care if we return bit > start?

Sorry, I meant to say that we could find a bit like 30 that is less than
start, not the other way round.

Tim
>
Mateusz Guzik June 26, 2024, 7:13 p.m. UTC | #7
On Wed, Jun 26, 2024 at 1:54 PM Jan Kara <jack@suse.cz> wrote:
> So maybe I'm wrong but I think the biggest benefit of your code compared to
> plain find_next_fd() is exactly in that we don't have to load full_fds_bits
> into cache. So I'm afraid that using full_fds_bits in the condition would
> destroy your performance gains. Thinking about this with a fresh head how
> about putting implementing your optimization like:
>
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
>         unsigned int maxbit = maxfd / BITS_PER_LONG;
>         unsigned int bitbit = start / BITS_PER_LONG;
>
> +       /*
> +        * Optimistically search the first long of the open_fds bitmap. It
> +        * saves us from loading full_fds_bits into cache in the common case
> +        * and because BITS_PER_LONG > start >= files->next_fd, we have quite
> +        * a good chance there's a bit free in there.
> +        */
> +       if (start < BITS_PER_LONG) {
> +               unsigned int bit;
> +
> +               bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> +               if (bit < BITS_PER_LONG)
> +                       return bit;
> +       }
> +
>         bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
>         if (bitbit >= maxfd)
>                 return maxfd;
>
> Plus your optimizations with likely / unlikely. This way the code flow in
> alloc_fd() stays more readable, we avoid loading the first open_fds long
> into cache if it is full, and we should get all the performance benefits?
>

Huh.

So when I read the patch previously I assumed this is testing the bit
word for the map containing next_fd (whatever it is), avoiding looking
at the higher level bitmap and inlining the op (instead of calling the
fully fledged func for bit scans).

I did not register that this in fact only checks the beginning of the
range of the entire thing. So apologies from my end: based on my
feedback some work was done, and now I'm going to ask for it to be
redone further.

blogbench spawns 100 or so workers, so say the total fd count hovers just
above 100, and say this lines up with half or more of the cases in practice
for that benchmark.

Even so, that's a benchmark-specific optimization. A busy web server
can have literally tens of thousands of fds open (and this is a pretty
mundane case), making the 0-63 range not particularly interesting.

That aside, I think the patchset is in the wrong order -- the first patch
tries to not look at the higher level bitmap, while the second reduces
stores made there. This makes it quite unclear how much it is worth to
reduce looking there if the atomics are conditional.

So here is what I propose in terms of the patches:
1. NULL check removal, sprinkling of likely/unlikely and expand_files
call avoidance; no measurements done vs stock kernel for some effort
saving, just denote in the commit message there is less work under the
lock and treat it as baseline
2. conditional higher level bitmap clear as submitted; benchmarked against 1
3. open_fds check within the range containing fd, avoiding higher
level bitmap if a free slot is found. this should not result in any
func calls if successful; benchmarked against the above

Optionally the bitmap routines can grow variants which always inline
and are used here. If so that would probably land between 1 and 2 on
the list.

You noted you know about blogbench bugs and have them fixed. Would be
good to post a link to a pull request or some other spot for a
reference.

It'll be best if the vfs folks comment on what they want here.
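For reference, a hedged sketch of the word-level variant described above,
i.e. testing the open_fds word that contains the candidate fd rather than
only bits 0-63 (hypothetical helper name and shape, not code from this
thread):

```c
#include <assert.h>

#define BITS_PER_LONG 64

/* Hypothetical word-level fast path: look for a zero bit only in the
 * open_fds word containing fd, so the check also helps when thousands
 * of fds are open. Returns max_fds when the word is full, signalling
 * a fall-back to the full two-level lookup. */
static unsigned int word_fast_path(const unsigned long *open_fds,
				   unsigned int max_fds, unsigned int fd)
{
	unsigned int word = fd / BITS_PER_LONG;
	unsigned long w;

	if (fd >= max_fds)
		return max_fds;
	/* Treat bits below fd as occupied, then scan this word only. */
	w = open_fds[word] | ((1UL << (fd % BITS_PER_LONG)) - 1);
	if (~w)
		return word * BITS_PER_LONG + (unsigned int)__builtin_ctzl(~w);
	return max_fds;          /* word full: take the slow path */
}
```

No function call and no full_fds_bits access is needed on success; the
masking plus count-trailing-zeros is a handful of instructions.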
Jan Kara June 27, 2024, 12:09 p.m. UTC | #8
On Wed 26-06-24 09:52:50, Tim Chen wrote:
> On Wed, 2024-06-26 at 09:43 -0700, Tim Chen wrote:
> > On Wed, 2024-06-26 at 13:54 +0200, Jan Kara wrote:
> > > 
> > > 
> > > Indeed, thanks for correcting me! next_fd is just a lower bound for the
> > > first free fd.
> > > 
> > > > The conditions
> > > > should either be like it is in patch or if (!start && !test_bit(0,
> > > > fdt->full_fds_bits)), the latter should also have the bitmap loading cost,
> > > > but another point is that a bit in full_fds_bits represents 64 bits in
> > > > open_fds, no matter fd >64 or not, full_fds_bits should be loaded any way,
> > > > maybe we can modify the condition to use full_fds_bits ?
> > > 
> > > So maybe I'm wrong but I think the biggest benefit of your code compared to
> > > plain find_next_fd() is exactly in that we don't have to load full_fds_bits
> > > into cache. So I'm afraid that using full_fds_bits in the condition would
> > > destroy your performance gains. Thinking about this with a fresh head how
> > > about putting implementing your optimization like:
> > > 
> > > --- a/fs/file.c
> > > +++ b/fs/file.c
> > > @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> > >         unsigned int maxbit = maxfd / BITS_PER_LONG;
> > >         unsigned int bitbit = start / BITS_PER_LONG;
> > >  
> > > +       /*
> > > +        * Optimistically search the first long of the open_fds bitmap. It
> > > +        * saves us from loading full_fds_bits into cache in the common case
> > > +        * and because BITS_PER_LONG > start >= files->next_fd, we have quite
> > > +        * a good chance there's a bit free in there.
> > > +        */
> > > +       if (start < BITS_PER_LONG) {
> > > +               unsigned int bit;
> > > +
> > > +               bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> > 
> > Say start is 31 (< BITS_PER_LONG)
> > bit found here could be 32 and greater than start.  Do we care if we return bit > start?
> 
> Sorry, I mean to say that we could find a bit like 30 that is less than
> start instead of the other way round. 

Well, I propose calling find_next_zero_bit() with offset set to 'start' so
it cannot possibly happen that the returned bit number is smaller than
start... But maybe I'm missing something?

									Honza
Mateusz Guzik June 27, 2024, 12:20 p.m. UTC | #9
On Thu, Jun 27, 2024 at 2:09 PM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 26-06-24 09:52:50, Tim Chen wrote:
> > On Wed, 2024-06-26 at 09:43 -0700, Tim Chen wrote:
> > > On Wed, 2024-06-26 at 13:54 +0200, Jan Kara wrote:
> > > >
> > > >
> > > > Indeed, thanks for correcting me! next_fd is just a lower bound for the
> > > > first free fd.
> > > >
> > > > > The conditions
> > > > > should either be like it is in patch or if (!start && !test_bit(0,
> > > > > fdt->full_fds_bits)), the latter should also have the bitmap loading cost,
> > > > > but another point is that a bit in full_fds_bits represents 64 bits in
> > > > > open_fds, no matter fd >64 or not, full_fds_bits should be loaded any way,
> > > > > maybe we can modify the condition to use full_fds_bits ?
> > > >
> > > > So maybe I'm wrong but I think the biggest benefit of your code compared to
> > > > plain find_next_fd() is exactly in that we don't have to load full_fds_bits
> > > > into cache. So I'm afraid that using full_fds_bits in the condition would
> > > > destroy your performance gains. Thinking about this with a fresh head how
> > > > about putting implementing your optimization like:
> > > >
> > > > --- a/fs/file.c
> > > > +++ b/fs/file.c
> > > > @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> > > >         unsigned int maxbit = maxfd / BITS_PER_LONG;
> > > >         unsigned int bitbit = start / BITS_PER_LONG;
> > > >
> > > > +       /*
> > > > +        * Optimistically search the first long of the open_fds bitmap. It
> > > > +        * saves us from loading full_fds_bits into cache in the common case
> > > > +        * and because BITS_PER_LONG > start >= files->next_fd, we have quite
> > > > +        * a good chance there's a bit free in there.
> > > > +        */
> > > > +       if (start < BITS_PER_LONG) {
> > > > +               unsigned int bit;
> > > > +
> > > > +               bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> > >
> > > Say start is 31 (< BITS_PER_LONG)
> > > bit found here could be 32 and greater than start.  Do we care if we return bit > start?
> >
> > Sorry, I mean to say that we could find a bit like 30 that is less than
> > start instead of the other way round.
>
> Well, I propose calling find_next_zero_bit() with offset set to 'start' so
> it cannot possibly happen that the returned bit number is smaller than
> start... But maybe I'm missing something?
>

You gate it with " if (start < BITS_PER_LONG)" which only covers the
small initial range, while I'm arguing this should work for any fd.
Jan Kara June 27, 2024, 2:03 p.m. UTC | #10
On Wed 26-06-24 21:13:07, Mateusz Guzik wrote:
> On Wed, Jun 26, 2024 at 1:54 PM Jan Kara <jack@suse.cz> wrote:
> > So maybe I'm wrong but I think the biggest benefit of your code compared to
> > plain find_next_fd() is exactly in that we don't have to load full_fds_bits
> > into cache. So I'm afraid that using full_fds_bits in the condition would
> > destroy your performance gains. Thinking about this with a fresh head how
> > about putting implementing your optimization like:
> >
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> >         unsigned int maxbit = maxfd / BITS_PER_LONG;
> >         unsigned int bitbit = start / BITS_PER_LONG;
> >
> > +       /*
> > +        * Optimistically search the first long of the open_fds bitmap. It
> > +        * saves us from loading full_fds_bits into cache in the common case
> > +        * and because BITS_PER_LONG > start >= files->next_fd, we have quite
> > +        * a good chance there's a bit free in there.
> > +        */
> > +       if (start < BITS_PER_LONG) {
> > +               unsigned int bit;
> > +
> > +               bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> > +               if (bit < BITS_PER_LONG)
> > +                       return bit;
> > +       }
> > +
> >         bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
> >         if (bitbit >= maxfd)
> >                 return maxfd;
> >
> > Plus your optimizations with likely / unlikely. This way the code flow in
> > alloc_fd() stays more readable, we avoid loading the first open_fds long
> > into cache if it is full, and we should get all the performance benefits?
> >
> 
> Huh.
> 
> So when I read the patch previously I assumed this is testing the bit
> word for the map containing next_fd (whatever it is), avoiding looking
> at the higher level bitmap and inlining the op (instead of calling the
> fully fledged func for bit scans).
> 
> I did not mentally register this is in fact only checking for the
> beginning of the range of the entire thing. So apologies from my end
> as based on my feedback some work was done and I'm going to ask to
> further redo it.

Well, just confirming the fact that the way the code was written was
somewhat confusing ;)

> blogbench spawns 100 or so workers, say total fd count hovers just
> above 100. say this lines up with about half of more cases in practice
> for that benchmark.
> 
> Even so, that's a benchmark-specific optimization. A busy web server
> can have literally tens of thousands of fds open (and this is a pretty
> mundane case), making the 0-63 range not particularly interesting.

I agree this optimization helps only processes with a low number of open fds.
On the other hand that is usually the majority of the processes on the
system, so the optimization makes sense to me. That being said, your idea of
searching the word containing next_fd makes sense as well...

> That aside I think the patchset is in the wrong order -- first patch
> tries to not look at the higher level bitmap, while second reduces
> stores made there. This makes it quite unclear how much is it worth to
> reduce looking there if atomics are conditional.
> 
> So here is what I propose in terms of the patches:
> 1. NULL check removal, sprinkling of likely/unlikely and expand_files
> call avoidance; no measurements done vs stock kernel for some effort
> saving, just denote in the commit message there is less work under the
> lock and treat it as baseline
> 2. conditional higher level bitmap clear as submitted; benchmarked against 1
> 3. open_fds check within the range containing fd, avoiding higher
> level bitmap if a free slot is found. this should not result in any
> func calls if successful; benchmarked against the above

Yeah, I guess this ordering is the most obvious -> the least obvious so it
makes sense to me.

								Honza
Christian Brauner June 27, 2024, 3:33 p.m. UTC | #11
On Wed, Jun 26, 2024 at 09:13:07PM GMT, Mateusz Guzik wrote:
> On Wed, Jun 26, 2024 at 1:54 PM Jan Kara <jack@suse.cz> wrote:
> > So maybe I'm wrong but I think the biggest benefit of your code compared to
> > plain find_next_fd() is exactly in that we don't have to load full_fds_bits
> > into cache. So I'm afraid that using full_fds_bits in the condition would
> > destroy your performance gains. Thinking about this with a fresh head how
> > about putting implementing your optimization like:
> >
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> >         unsigned int maxbit = maxfd / BITS_PER_LONG;
> >         unsigned int bitbit = start / BITS_PER_LONG;
> >
> > +       /*
> > +        * Optimistically search the first long of the open_fds bitmap. It
> > +        * saves us from loading full_fds_bits into cache in the common case
> > +        * and because BITS_PER_LONG > start >= files->next_fd, we have quite
> > +        * a good chance there's a bit free in there.
> > +        */
> > +       if (start < BITS_PER_LONG) {
> > +               unsigned int bit;
> > +
> > +               bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> > +               if (bit < BITS_PER_LONG)
> > +                       return bit;
> > +       }
> > +
> >         bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
> >         if (bitbit >= maxfd)
> >                 return maxfd;
> >
> > Plus your optimizations with likely / unlikely. This way the code flow in
> > alloc_fd() stays more readable, we avoid loading the first open_fds long
> > into cache if it is full, and we should get all the performance benefits?
> >
> 
> Huh.
> 
> So when I read the patch previously I assumed this is testing the bit
> word for the map containing next_fd (whatever it is), avoiding looking
> at the higher level bitmap and inlining the op (instead of calling the
> fully fledged func for bit scans).
> 
> I did not mentally register this is in fact only checking for the
> beginning of the range of the entire thing. So apologies from my end
> as based on my feedback some work was done and I'm going to ask to
> further redo it.
> 
> blogbench spawns 100 or so workers, say total fd count hovers just
> above 100. say this lines up with about half of more cases in practice
> for that benchmark.
> 
> Even so, that's a benchmark-specific optimization. A busy web server
> can have literally tens of thousands of fds open (and this is a pretty
> mundane case), making the 0-63 range not particularly interesting.
> 
> That aside I think the patchset is in the wrong order -- first patch
> tries to not look at the higher level bitmap, while second reduces
> stores made there. This makes it quite unclear how much is it worth to
> reduce looking there if atomics are conditional.
> 
> So here is what I propose in terms of the patches:
> 1. NULL check removal, sprinkling of likely/unlikely and expand_files
> call avoidance; no measurements done vs stock kernel for some effort
> saving, just denote in the commit message there is less work under the
> lock and treat it as baseline
> 2. conditional higher level bitmap clear as submitted; benchmarked against 1
> 3. open_fds check within the range containing fd, avoiding higher
> level bitmap if a free slot is found. this should not result in any
> func calls if successful; benchmarked against the above
> 
> Optionally the bitmap routines can grow variants which always inline
> and are used here. If so that would probably land between 1 and 2 on
> the list.
> 
> You noted you know about blogbench bugs and have them fixed. Would be
> good to post a link to a pull request or some other spot for a
> reference.
> 
> I'll be best if the vfs folk comment on what they want here.

Optimizing only the < BITS_PER_LONG case seems less desirable than making it
work for arbitrary next_fd. Imho, it'll also be easier to follow if
everything follows the same logic.
Tim Chen June 27, 2024, 4:21 p.m. UTC | #12
On Thu, 2024-06-27 at 14:09 +0200, Jan Kara wrote:
> On Wed 26-06-24 09:52:50, Tim Chen wrote:
> > On Wed, 2024-06-26 at 09:43 -0700, Tim Chen wrote:
> > > On Wed, 2024-06-26 at 13:54 +0200, Jan Kara wrote:
> > > > 
> > > > 
> > > > Indeed, thanks for correcting me! next_fd is just a lower bound for the
> > > > first free fd.
> > > > 
> > > > > The conditions
> > > > > should either be like it is in patch or if (!start && !test_bit(0,
> > > > > fdt->full_fds_bits)), the latter should also have the bitmap loading cost,
> > > > > but another point is that a bit in full_fds_bits represents 64 bits in
> > > > > open_fds, no matter fd >64 or not, full_fds_bits should be loaded any way,
> > > > > maybe we can modify the condition to use full_fds_bits ?
> > > > 
> > > > So maybe I'm wrong but I think the biggest benefit of your code compared to
> > > > plain find_next_fd() is exactly in that we don't have to load full_fds_bits
> > > > into cache. So I'm afraid that using full_fds_bits in the condition would
> > > > destroy your performance gains. Thinking about this with a fresh head how
> > > > about putting implementing your optimization like:
> > > > 
> > > > --- a/fs/file.c
> > > > +++ b/fs/file.c
> > > > @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
> > > >         unsigned int maxbit = maxfd / BITS_PER_LONG;
> > > >         unsigned int bitbit = start / BITS_PER_LONG;
> > > >  
> > > > +       /*
> > > > +        * Optimistically search the first long of the open_fds bitmap. It
> > > > +        * saves us from loading full_fds_bits into cache in the common case
> > > > +        * and because BITS_PER_LONG > start >= files->next_fd, we have quite
> > > > +        * a good chance there's a bit free in there.
> > > > +        */
> > > > +       if (start < BITS_PER_LONG) {
> > > > +               unsigned int bit;
> > > > +
> > > > +               bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
> > > 
> > > Say start is 31 (< BITS_PER_LONG)
> > > bit found here could be 32 and greater than start.  Do we care if we return bit > start?
> > 
> > Sorry, I mean to say that we could find a bit like 30 that is less than
> > start instead of the other way round. 
> 
> Well, I propose calling find_next_zero_bit() with offset set to 'start' so
> it cannot possibly happen that the returned bit number is smaller than
> start... But maybe I'm missing something?

Yes, you're right. The search begins at the start bit, so it is okay.

Tim

> 
> 									Honza
Ma, Yu June 27, 2024, 6:27 p.m. UTC | #13
On 6/27/2024 11:33 PM, Christian Brauner wrote:
> On Wed, Jun 26, 2024 at 09:13:07PM GMT, Mateusz Guzik wrote:
>> On Wed, Jun 26, 2024 at 1:54 PM Jan Kara <jack@suse.cz> wrote:
>>> So maybe I'm wrong but I think the biggest benefit of your code compared to
>>> plain find_next_fd() is exactly in that we don't have to load full_fds_bits
>>> into cache. So I'm afraid that using full_fds_bits in the condition would
>>> destroy your performance gains. Thinking about this with a fresh head how
>>> about putting implementing your optimization like:
>>>
>>> --- a/fs/file.c
>>> +++ b/fs/file.c
>>> @@ -490,6 +490,20 @@ static unsigned int find_next_fd(struct fdtable *fdt, unsigned int start)
>>>          unsigned int maxbit = maxfd / BITS_PER_LONG;
>>>          unsigned int bitbit = start / BITS_PER_LONG;
>>>
>>> +       /*
>>> +        * Optimistically search the first long of the open_fds bitmap. It
>>> +        * saves us from loading full_fds_bits into cache in the common case
>>> +        * and because BITS_PER_LONG > start >= files->next_fd, we have quite
>>> +        * a good chance there's a bit free in there.
>>> +        */
>>> +       if (start < BITS_PER_LONG) {
>>> +               unsigned int bit;
>>> +
>>> +               bit = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, start);
>>> +               if (bit < BITS_PER_LONG)
>>> +                       return bit;
>>> +       }
>>> +
>>>          bitbit = find_next_zero_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
>>>          if (bitbit >= maxfd)
>>>                  return maxfd;
>>>
>>> Plus your optimizations with likely / unlikely. This way the code flow in
>>> alloc_fd() stays more readable, we avoid loading the first open_fds long
>>> into cache if it is full, and we should get all the performance benefits?
>>>
>> Huh.
>>
>> So when I read the patch previously I assumed this is testing the bit
>> word for the map containing next_fd (whatever it is), avoiding looking
>> at the higher level bitmap and inlining the op (instead of calling the
>> fully fledged func for bit scans).
>>
>> I did not mentally register this is in fact only checking for the
>> beginning of the range of the entire thing. So apologies from my end
>> as based on my feedback some work was done and I'm going to ask to
>> further redo it.
>>
>> blogbench spawns 100 or so workers, say total fd count hovers just
>> above 100. say this lines up with about half of more cases in practice
>> for that benchmark.
>>
>> Even so, that's a benchmark-specific optimization. A busy web server
>> can have literally tens of thousands of fds open (and this is a pretty
>> mundane case), making the 0-63 range not particularly interesting.
>>
>> That aside I think the patchset is in the wrong order -- first patch
>> tries to not look at the higher level bitmap, while second reduces
>> stores made there. This makes it quite unclear how much is it worth to
>> reduce looking there if atomics are conditional.
>>
>> So here is what I propose in terms of the patches:
>> 1. NULL check removal, sprinkling of likely/unlikely and expand_files
>> call avoidance; no measurements done vs stock kernel for some effort
>> saving, just denote in the commit message there is less work under the
>> lock and treat it as baseline
>> 2. conditional higher level bitmap clear as submitted; benchmarked against 1
>> 3. open_fds check within the range containing fd, avoiding higher
>> level bitmap if a free slot is found. this should not result in any
>> func calls if successful; benchmarked against the above
>>
>> Optionally the bitmap routines can grow variants which always inline
>> and are used here. If so that would probably land between 1 and 2 on
>> the list.
>>
>> You noted you know about blogbench bugs and have them fixed. Would be
>> good to post a link to a pull request or some other spot for a
>> reference.
>>
>> I'll be best if the vfs folk comment on what they want here.
> Optimizing only the < BIT_PER_LONG seems less desirable then making it
> work for arbitrary next_fd. Imho, it'll also be easier to follow if
> everything follows the same logic.

Sorry that this message is a bit long, and thanks for your time reviewing it.

First, sincere thanks to all for the lively discussion and expert suggestions
to make the patch set better. At least, we have already reached agreement on
removing the sanity check and adding the conditional clear (p.s. I'll change
the BUG_ON to a WARN_ON in fd_install() as agreed). I fully agree with
Guzik's suggestion to reorder the patches. Since the remaining discussion
centers on the fast path, I suggest we submit patches 1 and 2 (after
reordering) for upstreaming first (with data re-measured on the latest kernel
version), and then focus the discussion on the fast path.

For the fast path idea, here is a summary of why I still think it is
valuable, for further discussion:

1. The original intention of the fast path is to reduce function calls and
avoid unnecessary loads/stores on the members sharing the same cacheline
(such as file_lock, next_fd and the 3 bitmaps; BTW, we tried adding
__cacheline_aligned_in_smp for next_fd and the fd array, with no improvement
observed). Especially, yes, especially, because all these operations are
inside the critical section of file_lock.

2. For the fast path implementation, the essential and simple point is to
directly return an available bit if there is a free bit in [0-63]. I'd
emphasize that it not only improves the case of a low number of open fds
(even though that is the majority case on a system, as Honza agreed), but
also the cases where lots of fds are opened/closed frequently by short-lived
tasks (per the algorithm, lower bits are prioritized for allocation after
being recycled). This covers not only blogbench, a synthetic benchmark, but
also the realistic scenario described in f3f86e33dc3d ("vfs: Fix pathological
performance case for __alloc_fd()"), which literally introduced the 2-level
bitmap searching algorithm into the VFS as we see it now. We may ask Eric for
help to see whether we can get some data on it. Besides, when lots of fds are
allocated and held for a long time, the lock contention would be much lower
than in the scenarios we're talking about here, so the impact of the change
would also be much smaller.

3. Now let's talk about the extra cost of the fast path based on the patch we
submitted. To be honest, before the fast path, we first noticed that
alloc_fd() is only called in two scenarios, as mentioned in the commit
message: (1) called by __get_unused_fd_flags() to find an available fd
starting from bit 0, which is the most common usage. That means start==0 for
alloc_fd(), and with this premise the alloc_fd() logic can be simpler: the
two branches comparing against next_fd can be dropped. (2) Duplicating an fd
via dup_fd() to find an fd starting from old_fd, which means "start" is not
zero, but this is only called by fcntl. So the first thought that came to
mind was: why sacrifice the performance of the overwhelmingly common case for
this specific dup_fd? So we revised __get_unused_fd_flags() to not call
alloc_fd() directly, but to use the same logic as alloc_fd() with the
"start"-related branches omitted. Based on that version, we then found that a
fast path would likely be more efficient than the stock 2-level bitmap
search, for the reasons stated in item 2 and in the commit message of this
thread. Leaving the benefit aside, the extra cost is one conditional branch;
with the two "start"-related branches removed it is still profitable, not to
mention the cost can be alleviated by the branch predictor. However, in that
draft version the code of __get_unused_fd_flags() duplicated a lot of
alloc_fd(), so we changed to the current version for concise code. What I
want to say is that there is room to make it faster at lower cost than stock.
As for using open_fds[0] in the fast-path condition, we think it's OK since
all the bitmaps are almost always on the same cacheline, and we finally need
to write open_fds to allocate the bit anyway.

4. With the patches ordered as suggested by Guzik, we re-measured the data on
the latest kernel 6.10-rc5: removing the sanity check and adding
likely/unlikely gives a 6% gain for read and 2% for write. Combined with the
conditional clear, that becomes a 14% gain for read and 8% for write. With
the fast path, there might be another ~15% gain for read (we have not
re-measured this one yet due to time constraints; will follow up soon).
Mateusz Guzik June 27, 2024, 7:59 p.m. UTC | #14
On Thu, Jun 27, 2024 at 8:27 PM Ma, Yu <yu.ma@intel.com> wrote:
> 2. For fast path implementation, the essential and simple point is to
> directly return an available bit if there is free bit in [0-63]. I'd
> emphasize that it does not only improve low number of open fds (even it
> is the majority case on system as Honza agreed), but also improve the
> cases that lots of fds open/close frequently with short task (as per the
> algorithm, lower bits will be prioritized to allocate after being
> recycled). Not only blogbench, a synthetic benchmark, but also the
> realistic scenario as claimed in f3f86e33dc3d("vfs: Fix pathological
> performance case for __alloc_fd()"), which literally introduced this
> 2-levels bitmap searching algorithm to vfs as we see now.

I don't understand how using next_fd instead is supposed to be inferior.

Maybe I should clarify that by API contract the kernel must return the
lowest free fd it can find. To that end it maintains the next_fd field
as a hint to hopefully avoid some of the search work.

In the stock kernel the first thing done in alloc_fd is setting it as
a starting point:
        fdt = files_fdtable(files);
        fd = start;
        if (fd < files->next_fd)
                fd = files->next_fd;

that is all the calls which come here with 0 start their search from
next_fd position.

Suppose you implemented the patch as suggested by me and next_fd falls in
the range 0-63. Then you get the benefit of the lower level bitmap check
just like in the patch you submitted, but without having to first branch on
whether you happen to be in that range.

Suppose next_fd is somewhere higher up, say 80. With your general approach
the optimization won't be done at all, or it will be attempted on the 0-63
range when it is an invariant that it finds no free fds.

With what I'm suggesting the general idea of taking a peek at the lower
level bitmap can be applied across the entire fd space. Some manual mucking
will be needed to make sure this never pulls more than one cacheline; the
easiest way out I see would be to align next_fd to BITS_PER_LONG for bitmap
search purposes.

Outside the scope of this patchset, but definitely worth considering, is the
observation that this still pulls an entire cacheline worth of the bitmap
(assuming it grew). If one could assume that the size is always a multiple
of 64 bytes (which it is after the first expansion), the 64-byte scan could
be entirely inlined -- there are quite a few fds in this range we may as
well scan in hopes of avoiding a look at the higher level bitmap; after all,
we already paid for fetching it. This would take the optimization to its
logical conclusion.

Perhaps it would be ok to special-case the lower bitmap to start with
64 bytes so that there would be no need to branch on it.
Jan Kara June 28, 2024, 9:12 a.m. UTC | #15
On Thu 27-06-24 21:59:12, Mateusz Guzik wrote:
> On Thu, Jun 27, 2024 at 8:27 PM Ma, Yu <yu.ma@intel.com> wrote:
> > 2. For fast path implementation, the essential and simple point is to
> > directly return an available bit if there is free bit in [0-63]. I'd
> > emphasize that it does not only improve low number of open fds (even it
> > is the majority case on system as Honza agreed), but also improve the
> > cases that lots of fds open/close frequently with short task (as per the
> > algorithm, lower bits will be prioritized to allocate after being
> > recycled). Not only blogbench, a synthetic benchmark, but also the
> > realistic scenario as claimed in f3f86e33dc3d("vfs: Fix pathological
> > performance case for __alloc_fd()"), which literally introduced this
> > 2-levels bitmap searching algorithm to vfs as we see now.
> 
> I don't understand how using next_fd instead is supposed to be inferior.
> 
> Maybe I should clarify that by API contract the kernel must return the
> lowest free fd it can find. To that end it maintains the next_fd field
> as a hint to hopefully avoid some of the search work.
> 
> In the stock kernel the first thing done in alloc_fd is setting it as
> a starting point:
>         fdt = files_fdtable(files);
>         fd = start;
>         if (fd < files->next_fd)
>                 fd = files->next_fd;
> 
> that is all the calls which come here with 0 start their search from
> next_fd position.

Yup.

> Suppose you implemented the patch as suggested by me and next_fd fits
> the range of 0-63. Then you get the benefit of lower level bitmap
> check just like in the patch you submitted, but without having to
> first branch on whether you happen to be in that range.
> 
> Suppose next_fd is somewhere higher up, say 80. With your general
> approach the optimization wont be done whatsoever or it will be
> attempted at the 0-63 range when it is an invariant it finds no free
> fds.
> 
> With what I'm suggesting the general idea of taking a peek at the
> lower level bitmap can be applied across the entire fd space. Some
> manual mucking will be needed to make sure this never pulls more than
> one cacheline, easiest way out I see would be to align next_fd to
> BITS_PER_LONG for the bitmap search purposes.

Well, all you need to do is to call:

	bit = find_next_zero_bit(&fdt->open_fds[start / BITS_PER_LONG],
				 BITS_PER_LONG, start & (BITS_PER_LONG-1));
	if (bit < BITS_PER_LONG)
		return bit + (start & ~(BITS_PER_LONG - 1));


in find_next_fd(). Not sure if this is what you meant by aligning next_fd
to BITS_PER_LONG...

								Honza
Patch

diff --git a/fs/file.c b/fs/file.c
index a3b72aa64f11..50e900a47107 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -515,28 +515,35 @@  static int alloc_fd(unsigned start, unsigned end, unsigned flags)
 	if (fd < files->next_fd)
 		fd = files->next_fd;
 
-	if (fd < fdt->max_fds)
+	error = -EMFILE;
+	if (likely(fd < fdt->max_fds)) {
+		if (~fdt->open_fds[0] && !start) {
+			fd = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, fd);
+			goto fastreturn;
+		}
 		fd = find_next_fd(fdt, fd);
+	}
+
+	if (unlikely(fd >= fdt->max_fds)) {
+		error = expand_files(files, fd);
+		if (error < 0)
+			goto out;
+		/*
+		 * If we needed to expand the fs array we
+		 * might have blocked - try again.
+		 */
+		if (error)
+			goto repeat;
+	}
 
+fastreturn:
 	/*
 	 * N.B. For clone tasks sharing a files structure, this test
 	 * will limit the total number of files that can be opened.
 	 */
-	error = -EMFILE;
-	if (fd >= end)
+	if (unlikely(fd >= end))
 		goto out;
 
-	error = expand_files(files, fd);
-	if (error < 0)
-		goto out;
-
-	/*
-	 * If we needed to expand the fs array we
-	 * might have blocked - try again.
-	 */
-	if (error)
-		goto repeat;
-
 	if (start <= files->next_fd)
 		files->next_fd = fd + 1;