[0/2] btrfs: reduce div64 calls for __btrfs_map_block() and its variants

Message ID cover.1675586554.git.wqu@suse.com

Message

Qu Wenruo Feb. 5, 2023, 8:53 a.m. UTC
Div64 is much slower than 32-bit division, and has only gotten faster
on the most recent CPUs; that's why we have dedicated div64* helpers.

One user of div64 is __btrfs_map_block() and its variants, where
@stripe_nr is a u64, so all later divisions have to go through the
div64 helpers.
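
To illustrate, a rough sketch; div_u64() and div_u64_rem() are the real
helpers from <linux/math64.h>, the wrapper functions are made up:

#include <linux/math64.h>

/* 64-bit dividend, 32-bit divisor: the shape of the mapping math. */
static u64 offset_to_stripe_nr(u64 offset, u32 stripe_len)
{
        /* A plain "offset / stripe_len" needs libgcc's __udivdi3 on 32-bit. */
        return div_u64(offset, stripe_len);
}

/* Variant that also returns the offset inside the stripe. */
static u64 offset_to_stripe_nr_rem(u64 offset, u32 stripe_len, u32 *rem)
{
        return div_u64_rem(offset, stripe_len, rem);
}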

But the truth is, with the current chunk size limit (10G) and the
fixed stripe length (64K), we can have at most 10G / 64K = 160K
stripes in a chunk, which already fits comfortably in a u32.

So this patchset reduces div64 calls by:

- Remove map_lookup::stripe_len first
  With that, calculating @stripe_nr no longer needs div64: a simple
  right shift is enough, and the result can be truncated to u32.

  This is a prerequisite for the second patch; without a fixed stripe
  length, we would have to rely on div64.

- Reduce the width of the various @stripe_nr variables to u32

- Use regular division and modulo for the calculations
  Now we no longer have to fear that we missed converting some
  division to the div64 helpers (see the sketch after this list).
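
As a sketch of the end result (names here are illustrative, not the
series'; with the stripe length fixed at 64K the shift is 16):

/*
 * Illustrative only: stripe math with div64 gone.  A 10G chunk with
 * 64K stripes has at most 160K of them, so u32 is wide enough.
 */
#define STRIPE_LEN_SHIFT        16      /* 64K == 1 << 16 */

static void map_stripe(u64 offset, u32 num_stripes,
                       u32 *stripe_nr, u32 *stripe_index)
{
        u32 nr = offset >> STRIPE_LEN_SHIFT;    /* shift instead of div64 */

        *stripe_index = nr % num_stripes;       /* regular 32-bit modulo */
        *stripe_nr = nr / num_stripes;          /* regular 32-bit division */
}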
  

Qu Wenruo (2):
  btrfs: remove map_lookup->stripe_len
  btrfs: reduce div64 calls by limiting the number of stripes of a chunk
    to u32

 fs/btrfs/block-group.c            |  18 ++---
 fs/btrfs/scrub.c                  |  43 ++++++------
 fs/btrfs/tests/extent-map-tests.c |   1 -
 fs/btrfs/tree-checker.c           |  14 ++++
 fs/btrfs/volumes.c                | 110 +++++++++++++++---------------
 fs/btrfs/volumes.h                |   7 +-
 6 files changed, 104 insertions(+), 89 deletions(-)

Comments

Christoph Hellwig Feb. 6, 2023, 6:38 a.m. UTC | #1
On Sun, Feb 05, 2023 at 04:53:40PM +0800, Qu Wenruo wrote:
> Div64 is much slower than 32-bit division, and has only gotten faster
> on the most recent CPUs; that's why we have dedicated div64* helpers.

I think that's a little too simple.  All 64-bit CPUs can do 64-bit
divisions of course.  32-bit CPUs usually can't do it in a single
instruction, and the helpers exist to avoid the libgcc calls.
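
E.g. with a non-constant divisor (a sketch, nothing from the series):

#include <linux/math64.h>

u64 bad(u64 len, u32 align)
{
        return len / align;     /* 32-bit build: link error, wants __udivdi3 */
}

u64 good(u64 len, u32 align)
{
        return div_u64(len, align);     /* builds and works everywhere */
}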

That should not stand against any cleanups and optimizations, of course.
Qu Wenruo Feb. 6, 2023, 6:58 a.m. UTC | #2
On 2023/2/6 14:38, Christoph Hellwig wrote:
> On Sun, Feb 05, 2023 at 04:53:40PM +0800, Qu Wenruo wrote:
>> Div64 is much slower than 32-bit division, and has only gotten faster
>> on the most recent CPUs; that's why we have dedicated div64* helpers.
> 
> I think that's a little too simple.  All 64-bit CPUs can do 64-bit
> divisions of course.  32-bit CPUs usually can't do it inside a single
> instruction, and the helpers exist to avoid the libgcc calls.

Right, I focused too much on the perf part, but the perf numbers can
indeed sound scary:

   For example, DIVQ on Skylake has a latency of 42-95 cycles [1] (and
   a reciprocal throughput of 24-90) for 64-bit inputs.

I'm not sure what the best way to benchmark such a thing is.
Maybe just run such divisions with random numbers at module load time
to verify the perf?
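
Something like this completely untested sketch (all names made up):

#include <linux/module.h>
#include <linux/ktime.h>
#include <linux/math64.h>
#include <linux/random.h>

#define BENCH_LOOPS     16384

static u64 dividends[BENCH_LOOPS];
static u32 divisors[BENCH_LOOPS];

static int __init div_bench_init(void)
{
        u64 sum = 0;
        u64 start;
        int i;

        /* Pre-generate operands so the RNG stays out of the timed loops. */
        for (i = 0; i < BENCH_LOOPS; i++) {
                dividends[i] = get_random_u64();
                divisors[i] = get_random_u32() | 1;     /* no div-by-zero */
        }

        start = ktime_get_ns();
        for (i = 0; i < BENCH_LOOPS; i++)
                sum += div64_u64(dividends[i], divisors[i]);
        pr_info("div64_u64:  %llu ns (sum %llu)\n",
                ktime_get_ns() - start, sum);

        /* Not the same math, just timing the 32-bit divider. */
        start = ktime_get_ns();
        for (i = 0; i < BENCH_LOOPS; i++)
                sum += (u32)dividends[i] / divisors[i];
        pr_info("32-bit div: %llu ns (sum %llu)\n",
                ktime_get_ns() - start, sum);

        return 0;
}
module_init(div_bench_init);

MODULE_LICENSE("GPL");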

Thanks,
Qu

> 
> That should not stand against any cleanups and optimizations, of course.
>
Christoph Hellwig Feb. 6, 2023, 7:02 a.m. UTC | #3
On Mon, Feb 06, 2023 at 02:58:55PM +0800, Qu Wenruo wrote:
> Right, I focused too much on the perf part, but the perf numbers can
> indeed sound scary:
> 
>   For example, DIVQ on Skylake has a latency of 42-95 cycles [1] (and
>   a reciprocal throughput of 24-90) for 64-bit inputs.
> 
> I'm not sure what the best way to benchmark such a thing is.
> Maybe just run such divisions with random numbers at module load time
> to verify the perf?

Honestly, getting rid of the ugly division calls is probably reason
enough, as the series looks like a nice cleanup.  I just had to
nitpick on the sentence.