Message ID: 150fdf65c8e4cc4dba71e020ce0859bcf636a5ff.1685449706.git.ojaswin@linux.ibm.com
State:      Handled Elsewhere, archived
Series:     multiblock allocator improvements
On Tue 30-05-23 18:03:49, Ojaswin Mujoo wrote:
> CR1_5 aims to optimize allocations which can't be satisfied in CR1. The
> fact that we couldn't find a group in CR1 suggests that it would be
> difficult to find a continuous extent to completely satisfy our
> allocations. So before falling to the slower CR2, in CR1.5 we
> proactively trim the preallocations so we can find a group with
> (free / fragments) big enough. This speeds up our allocation at the
> cost of slightly reduced preallocation.
>
> The patch also adds a new sysfs tunable:
>
> * /sys/fs/ext4/<partition>/mb_cr1_5_max_trim_order
>
> This controls how much CR1.5 can trim a request before falling to CR2.
> For example, for a request of order 7 and max trim order 2, CR1.5 can
> trim this up to order 5.
>
> Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>
> ext4 squash

Why is this here?

> +/*
> + * We couldn't find a group in CR1 so try to find the highest free fragment
> + * order we have and proactively trim the goal request length to that order to
> + * find a suitable group faster.
> + *
> + * This optimizes allocation speed at the cost of slightly reduced
> + * preallocations. However, we make sure that we don't trim the request too
> + * much and fall to CR2 in that case.
> + */
> +static void ext4_mb_choose_next_group_cr1_5(struct ext4_allocation_context *ac,
> +		enum criteria *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> +{
> +	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> +	struct ext4_group_info *grp = NULL;
> +	int i, order, min_order;
> +	unsigned long num_stripe_clusters = 0;
> +
> +	if (unlikely(ac->ac_flags & EXT4_MB_CR1_5_OPTIMIZED)) {
> +		if (sbi->s_mb_stats)
> +			atomic_inc(&sbi->s_bal_cr1_5_bad_suggestions);
> +	}
> +
> +	/*
> +	 * mb_avg_fragment_size_order() returns order in a way that makes
> +	 * retrieving back the length using (1 << order) inaccurate. Hence, use
> +	 * fls() instead since we need to know the actual length while modifying
> +	 * goal length.
> +	 */
> +	order = fls(ac->ac_g_ex.fe_len);
> +	min_order = order - sbi->s_mb_cr1_5_max_trim_order;
> +	if (min_order < 0)
> +		min_order = 0;
> +
> +	if (1 << min_order < ac->ac_o_ex.fe_len)
> +		min_order = fls(ac->ac_o_ex.fe_len) + 1;
> +
> +	if (sbi->s_stripe > 0) {
> +		/*
> +		 * We are assuming that stripe size is always a multiple of
> +		 * cluster ratio otherwise __ext4_fill_super exits early.
> +		 */
> +		num_stripe_clusters = EXT4_NUM_B2C(sbi, sbi->s_stripe);
> +		if (1 << min_order < num_stripe_clusters)
> +			min_order = fls(num_stripe_clusters);
> +	}
> +
> +	for (i = order; i >= min_order; i--) {
> +		int frag_order;
> +		/*
> +		 * Scale down goal len to make sure we find something
> +		 * in the free fragments list. Basically, reduce
> +		 * preallocations.
> +		 */
> +		ac->ac_g_ex.fe_len = 1 << i;

I smell some off-by-one issues here. Look, fls(1) == 1, so (1 << fls(n)) > n.
Hence this loop will actually *grow* the goal allocation length. Also I'm
not sure why you have the +1 in min_order = fls(ac->ac_o_ex.fe_len) + 1.

> +
> +		if (num_stripe_clusters > 0) {
> +			/*
> +			 * Try to round up the adjusted goal to stripe size
				       ^^^ goal length?
> +			 * (in cluster units) multiple for efficiency.
> +			 *
> +			 * XXX: Is s->stripe always a power of 2? In that case
> +			 * we can use the faster round_up() variant.
> +			 */

I don't think s->stripe has to be a power of 2, e.g. when you have three
data disks in a RAID config.
Otherwise the patch looks good to me.

								Honza
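[Editor's note: for readers following the off-by-one discussion outside the
kernel tree, here is a minimal userspace sketch — mine, not from the thread —
that reproduces the behaviour Jan describes. The local fls() mirrors the
kernel helper's semantics: fls(0) == 0, fls(1) == 1, fls(8) == 4.]

#include <stdio.h>

/* Userspace stand-in for the kernel's fls(): position of the highest set bit. */
static int fls(unsigned int x)
{
	int r = 0;

	while (x) {
		x >>= 1;
		r++;
	}
	return r;
}

int main(void)
{
	unsigned int goal_len = 8;	/* an order-3 goal request */
	int order = fls(goal_len);	/* 4, not 3 */

	/* The loop's first iteration would set the goal to 1 << order ... */
	printf("goal %u, first \"trimmed\" goal %u\n", goal_len, 1u << order);
	/* ... which prints 16: the "trim" actually grew the request. */
	return 0;
}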
Jan, thanks for the comments to Ojaswin's patch series. Since I had
already landed his patch series in my tree and have been testing it,
I've fixed the obvious issues you've raised in a fixup patch
(attached).

There is one issue which I have not fixed:

On Wed, Jun 07, 2023 at 12:21:03PM +0200, Jan Kara wrote:
> > +	for (i = order; i >= min_order; i--) {
> > +		int frag_order;
> > +		/*
> > +		 * Scale down goal len to make sure we find something
> > +		 * in the free fragments list. Basically, reduce
> > +		 * preallocations.
> > +		 */
> > +		ac->ac_g_ex.fe_len = 1 << i;
>
> I smell some off-by-one issues here. Look, fls(1) == 1, so (1 << fls(n)) > n.
> Hence this loop will actually *grow* the goal allocation length. Also I'm
> not sure why you have the +1 in min_order = fls(ac->ac_o_ex.fe_len) + 1.

Ojaswin, could you take a look at this? Thanks!!

					- Ted

commit 182d2d90a180838789ed5a19e08c333043d1617a
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Thu Jun 8 10:39:35 2023 -0400

    ext4: clean up mballoc criteria comments

    Line wrap and slightly clarify the comments describing mballoc's
    criteria.

    Define EXT4_MB_NUM_CRS as part of the enum, so that it will
    automatically get updated when a criteria is added or removed.

    Also fix a potential uninitialized use of the 'cr' variable if
    CONFIG_EXT4_DEBUG is enabled.

    Signed-off-by: Theodore Ts'o <tytso@mit.edu>

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 6a1f013d23f7..45a531446ea2 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -128,47 +128,52 @@ enum SHIFT_DIRECTION {
 };
 
 /*
- * Number of criterias defined. For each criteria, mballoc has slightly
- * different way of finding the required blocks nad usually, higher the
- * criteria the slower the allocation. We start at lower criterias and keep
- * falling back to higher ones if we are not able to find any blocks.
- */
-#define EXT4_MB_NUM_CRS 5
-/*
- * All possible allocation criterias for mballoc. Lower are faster.
+ * For each criteria, mballoc has slightly different way of finding
+ * the required blocks nad usually, higher the criteria the slower the
+ * allocation. We start at lower criterias and keep falling back to
+ * higher ones if we are not able to find any blocks. Lower (earlier)
+ * criteria are faster.
  */
 enum criteria {
 	/*
-	 * Used when number of blocks needed is a power of 2. This doesn't
-	 * trigger any disk IO except prefetch and is the fastest criteria.
+	 * Used when number of blocks needed is a power of 2. This
+	 * doesn't trigger any disk IO except prefetch and is the
+	 * fastest criteria.
 	 */
 	CR_POWER2_ALIGNED,
 
 	/*
-	 * Tries to lookup in-memory data structures to find the most suitable
-	 * group that satisfies goal request. No disk IO except block prefetch.
+	 * Tries to lookup in-memory data structures to find the most
+	 * suitable group that satisfies goal request. No disk IO
+	 * except block prefetch.
 	 */
 	CR_GOAL_LEN_FAST,
 
 	/*
-	 * Same as CR_GOAL_LEN_FAST but is allowed to reduce the goal length to
-	 * the best available length for faster allocation.
+	 * Same as CR_GOAL_LEN_FAST but is allowed to reduce the goal
+	 * length to the best available length for faster allocation.
 	 */
 	CR_BEST_AVAIL_LEN,
 
 	/*
-	 * Reads each block group sequentially, performing disk IO if necessary, to
-	 * find find_suitable block group. Tries to allocate goal length but might trim
-	 * the request if nothing is found after enough tries.
+	 * Reads each block group sequentially, performing disk IO if
+	 * necessary, to find find_suitable block group. Tries to
+	 * allocate goal length but might trim the request if nothing
+	 * is found after enough tries.
 	 */
 	CR_GOAL_LEN_SLOW,
 
 	/*
-	 * Finds the first free set of blocks and allocates those. This is only
-	 * used in rare cases when CR_GOAL_LEN_SLOW also fails to allocate
-	 * anything.
+	 * Finds the first free set of blocks and allocates
+	 * those. This is only used in rare cases when
+	 * CR_GOAL_LEN_SLOW also fails to allocate anything.
 	 */
 	CR_ANY_FREE,
+
+	/*
+	 * Number of criterias defined.
+	 */
+	EXT4_MB_NUM_CRS
 };
 
 /* criteria below which we use fast block scanning and avoid unnecessary IO */
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 8a6896d4e9b0..2f9f5dc720cc 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2759,7 +2759,7 @@ static noinline_for_stack int
 ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 {
 	ext4_group_t prefetch_grp = 0, ngroups, group, i;
-	enum criteria cr, new_cr;
+	enum criteria new_cr, cr = CR_GOAL_LEN_FAST;
 	int err = 0, first_err = 0;
 	unsigned int nr = 0, prefetch_ios = 0;
 	struct ext4_sb_info *sbi;
@@ -2816,12 +2816,13 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 		spin_unlock(&sbi->s_md_lock);
 	}
 
-	/* Let's just scan groups to find more-less suitable blocks */
-	cr = ac->ac_2order ? CR_POWER2_ALIGNED : CR_GOAL_LEN_FAST;
 	/*
-	 * cr == CR_POWER2_ALIGNED try to get exact allocation,
-	 * cr == CR_ANY_FREE try to get anything
+	 * Let's just scan groups to find more-less suitable blocks. We
+	 * start with CR_GOAL_LEN_FAST, unless it is power of 2
+	 * aligned, in which case let's do that faster approach first.
 	 */
+	if (ac->ac_2order)
+		cr = CR_POWER2_ALIGNED;
 repeat:
 	for (; cr < EXT4_MB_NUM_CRS && ac->ac_status == AC_STATUS_CONTINUE; cr++) {
 		ac->ac_criteria = cr;
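[Editor's note: the EXT4_MB_NUM_CRS change in the fixup follows a common C
idiom. A small self-contained sketch of it — illustrative only, the EX_-prefixed
names are made up and not from ext4:]

/* Making the count the last enumerator means it tracks the enum for free. */
enum ex_criteria {
	EX_CR_POWER2_ALIGNED,
	EX_CR_GOAL_LEN_FAST,
	EX_CR_BEST_AVAIL_LEN,
	EX_CR_GOAL_LEN_SLOW,
	EX_CR_ANY_FREE,

	/* Always equals the number of real entries above. */
	EX_NUM_CRS
};

/* Arrays sized by the sentinel can never go stale: */
static unsigned long ex_hits[EX_NUM_CRS];

[With a separate #define, as in the original code, the count silently drifts
when a new criteria such as CR1_5 is slotted into the middle of the enum.]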
On Thu, Jun 08, 2023 at 10:45:05AM -0400, Theodore Ts'o wrote:
> Jan, thanks for the comments to Ojaswin's patch series. Since I had
> already landed his patch series in my tree and have been testing it,
> I've fixed the obvious issues you've raised in a fixup patch
> (attached).
>
> There is one issue which I have not fixed:
>
> On Wed, Jun 07, 2023 at 12:21:03PM +0200, Jan Kara wrote:
> > > +	for (i = order; i >= min_order; i--) {
> > > +		int frag_order;
> > > +		/*
> > > +		 * Scale down goal len to make sure we find something
> > > +		 * in the free fragments list. Basically, reduce
> > > +		 * preallocations.
> > > +		 */
> > > +		ac->ac_g_ex.fe_len = 1 << i;
> >
> > I smell some off-by-one issues here. Look, fls(1) == 1, so (1 << fls(n)) > n.
> > Hence this loop will actually *grow* the goal allocation length. Also I'm
> > not sure why you have the +1 in min_order = fls(ac->ac_o_ex.fe_len) + 1.
>
> Ojaswin, could you take a look at this? Thanks!!
>
> 					- Ted
>
> commit 182d2d90a180838789ed5a19e08c333043d1617a
> Author: Theodore Ts'o <tytso@mit.edu>
> Date:   Thu Jun 8 10:39:35 2023 -0400
>
>     ext4: clean up mballoc criteria comments
>
>     Line wrap and slightly clarify the comments describing mballoc's
>     criteria.
>
>     Define EXT4_MB_NUM_CRS as part of the enum, so that it will
>     automatically get updated when a criteria is added or removed.
>
>     Also fix a potential uninitialized use of the 'cr' variable if
>     CONFIG_EXT4_DEBUG is enabled.
>
>     Signed-off-by: Theodore Ts'o <tytso@mit.edu>

Hi Ted,

Patch looks good, thanks for doing this. I've sent the fix for the off
by one issue here:

https://lore.kernel.org/linux-ext4/20230609103403.112807-1-ojaswin@linux.ibm.com/T/#u

Jan, thanks for the review. I've addressed the bug for now. Since I'm
on vacation for the next one and a half weeks I might not be able to
address the other cleanups. I'll get them done once I'm back.

Regards,
ojaswin
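[Editor's note: the linked fix is not reproduced in this thread. As a general
observation — not necessarily the shape of Ojaswin's actual patch, see the lore
link above for that — the usual way to avoid the growth Jan spotted is to start
from the order of the largest power of two that does not exceed the length,
which with the kernel's fls() is fls(n) - 1:]

/*
 * Sketch only. For any n >= 1, 1 << (fls(n) - 1) <= n, so a trimming
 * loop that starts at fls(n) - 1 can only shrink the goal length:
 *
 *	n = 8  -> fls(n) - 1 == 3, 1 << 3 == 8   (unchanged)
 *	n = 12 -> fls(n) - 1 == 3, 1 << 3 == 8   (trimmed)
 *
 * whereas starting at fls(n) turns n = 8 into 16.
 */
static inline int highest_order_le(unsigned int n)
{
	return fls(n) - 1;	/* caller must ensure n != 0 */
}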
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index eae981ab2539..942e97026a60 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -133,13 +133,14 @@ enum SHIFT_DIRECTION {
  * criteria the slower the allocation. We start at lower criterias and keep
  * falling back to higher ones if we are not able to find any blocks.
  */
-#define EXT4_MB_NUM_CRS 4
+#define EXT4_MB_NUM_CRS 5
 /*
  * All possible allocation criterias for mballoc
  */
 enum criteria {
 	CR0,
 	CR1,
+	CR1_5,
 	CR2,
 	CR3,
 };
@@ -185,6 +186,9 @@ enum criteria {
 #define EXT4_MB_CR0_OPTIMIZED		0x8000
 /* Avg fragment size rb tree lookup succeeded at least once for cr = 1 */
 #define EXT4_MB_CR1_OPTIMIZED		0x00010000
+/* Avg fragment size rb tree lookup succeeded at least once for cr = 1.5 */
+#define EXT4_MB_CR1_5_OPTIMIZED		0x00020000
+
 struct ext4_allocation_request {
 	/* target inode for block we're allocating */
 	struct inode *inode;
@@ -1547,6 +1551,7 @@ struct ext4_sb_info {
 	unsigned long s_mb_last_start;
 	unsigned int s_mb_prefetch;
 	unsigned int s_mb_prefetch_limit;
+	unsigned int s_mb_cr1_5_max_trim_order;
 
 	/* stats for buddy allocator */
 	atomic_t s_bal_reqs;	/* number of reqs with len > 1 */
@@ -1561,6 +1566,7 @@ struct ext4_sb_info {
 	atomic_t s_bal_2orders;	/* 2^order hits */
 	atomic_t s_bal_cr0_bad_suggestions;
 	atomic_t s_bal_cr1_bad_suggestions;
+	atomic_t s_bal_cr1_5_bad_suggestions;
 	atomic64_t s_bal_cX_groups_considered[EXT4_MB_NUM_CRS];
 	atomic64_t s_bal_cX_hits[EXT4_MB_NUM_CRS];
 	atomic64_t s_bal_cX_failed[EXT4_MB_NUM_CRS];	/* cX loop didn't find blocks */
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index f59e1e0e01b1..0cf037489e97 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -166,6 +166,14 @@
  * equal to request size using our average fragment size group lists (data
  * structure 2) in O(1) time.
  *
+ * At CR1.5 (aka CR1_5), we aim to optimize allocations which can't be satisfied
+ * in CR1. The fact that we couldn't find a group in CR1 suggests that there is
+ * no BG that has average fragment size > goal length. So before falling to the
+ * slower CR2, in CR1.5 we proactively trim goal length and then use the same
+ * fragment lists as CR1 to find a BG with a big enough average fragment size.
+ * This increases the chances of finding a suitable block group in O(1) time and
+ * results in faster allocation at the cost of reduced size of allocation.
+ *
  * If "mb_optimize_scan" mount option is not set, mballoc traverses groups in
  * linear order which requires O(N) search time for each CR0 and CR1 phase.
  *
@@ -963,6 +971,91 @@ static void ext4_mb_choose_next_group_cr1(struct ext4_allocation_context *ac,
 		*group = grp->bb_group;
 		ac->ac_flags |= EXT4_MB_CR1_OPTIMIZED;
 	} else {
+		*new_cr = CR1_5;
+	}
+}
+
+/*
+ * We couldn't find a group in CR1 so try to find the highest free fragment
+ * order we have and proactively trim the goal request length to that order to
+ * find a suitable group faster.
+ *
+ * This optimizes allocation speed at the cost of slightly reduced
+ * preallocations. However, we make sure that we don't trim the request too
+ * much and fall to CR2 in that case.
+ */
+static void ext4_mb_choose_next_group_cr1_5(struct ext4_allocation_context *ac,
+		enum criteria *new_cr, ext4_group_t *group, ext4_group_t ngroups)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+	struct ext4_group_info *grp = NULL;
+	int i, order, min_order;
+	unsigned long num_stripe_clusters = 0;
+
+	if (unlikely(ac->ac_flags & EXT4_MB_CR1_5_OPTIMIZED)) {
+		if (sbi->s_mb_stats)
+			atomic_inc(&sbi->s_bal_cr1_5_bad_suggestions);
+	}
+
+	/*
+	 * mb_avg_fragment_size_order() returns order in a way that makes
+	 * retrieving back the length using (1 << order) inaccurate. Hence, use
+	 * fls() instead since we need to know the actual length while modifying
+	 * goal length.
+	 */
+	order = fls(ac->ac_g_ex.fe_len);
+	min_order = order - sbi->s_mb_cr1_5_max_trim_order;
+	if (min_order < 0)
+		min_order = 0;
+
+	if (1 << min_order < ac->ac_o_ex.fe_len)
+		min_order = fls(ac->ac_o_ex.fe_len) + 1;
+
+	if (sbi->s_stripe > 0) {
+		/*
+		 * We are assuming that stripe size is always a multiple of
+		 * cluster ratio otherwise __ext4_fill_super exits early.
+		 */
+		num_stripe_clusters = EXT4_NUM_B2C(sbi, sbi->s_stripe);
+		if (1 << min_order < num_stripe_clusters)
+			min_order = fls(num_stripe_clusters);
+	}
+
+	for (i = order; i >= min_order; i--) {
+		int frag_order;
+		/*
+		 * Scale down goal len to make sure we find something
+		 * in the free fragments list. Basically, reduce
+		 * preallocations.
+		 */
+		ac->ac_g_ex.fe_len = 1 << i;
+
+		if (num_stripe_clusters > 0) {
+			/*
+			 * Try to round up the adjusted goal to stripe size
+			 * (in cluster units) multiple for efficiency.
+			 *
+			 * XXX: Is s->stripe always a power of 2? In that case
+			 * we can use the faster round_up() variant.
+			 */
+			ac->ac_g_ex.fe_len = roundup(ac->ac_g_ex.fe_len,
+						     num_stripe_clusters);
+		}
+
+		frag_order = mb_avg_fragment_size_order(ac->ac_sb,
+							ac->ac_g_ex.fe_len);
+
+		grp = ext4_mb_find_good_group_avg_frag_lists(ac, frag_order);
+		if (grp)
+			break;
+	}
+
+	if (grp) {
+		*group = grp->bb_group;
+		ac->ac_flags |= EXT4_MB_CR1_5_OPTIMIZED;
+	} else {
+		/* Reset goal length to original goal length before falling into CR2 */
+		ac->ac_g_ex.fe_len = ac->ac_orig_goal_len;
 		*new_cr = CR2;
 	}
 }
@@ -1029,6 +1122,8 @@ static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
 		ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
 	} else if (*new_cr == CR1) {
 		ext4_mb_choose_next_group_cr1(ac, new_cr, group, ngroups);
+	} else if (*new_cr == CR1_5) {
+		ext4_mb_choose_next_group_cr1_5(ac, new_cr, group, ngroups);
 	} else {
 		/*
 		 * TODO: For CR=2, we can arrange groups in an rb tree sorted by
@@ -2352,7 +2447,7 @@ void ext4_mb_complex_scan_group(struct ext4_allocation_context *ac,
 
 	if (ac->ac_criteria < CR2) {
 		/*
-		 * In CR1, we are sure that this group will
+		 * In CR1 and CR1_5, we are sure that this group will
 		 * have a large enough continuous free extent, so skip
 		 * over the smaller free extents
 		 */
@@ -2484,6 +2579,7 @@ static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
 		return true;
 
 	case CR1:
+	case CR1_5:
 		if ((free / fragments) >= ac->ac_g_ex.fe_len)
 			return true;
 		break;
@@ -2748,7 +2844,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 			 * spend a lot of time loading imperfect groups
 			 */
 			if ((prefetch_grp == group) &&
-			     (cr > CR1 ||
+			     (cr > CR1_5 ||
 			      prefetch_ios < sbi->s_mb_prefetch_limit)) {
 				nr = sbi->s_mb_prefetch;
 				if (ext4_has_feature_flex_bg(sb)) {
@@ -2788,7 +2884,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 			ac->ac_groups_scanned++;
 			if (cr == CR0)
 				ext4_mb_simple_scan_group(ac, &e4b);
-			else if (cr == CR1 && sbi->s_stripe &&
+			else if ((cr == CR1 || cr == CR1_5) && sbi->s_stripe &&
 				 !(ac->ac_g_ex.fe_len % EXT4_B2C(sbi, sbi->s_stripe)))
 				ext4_mb_scan_aligned(ac, &e4b);
@@ -2804,6 +2900,11 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 		/* Processed all groups and haven't found blocks */
 		if (sbi->s_mb_stats && i == ngroups)
 			atomic64_inc(&sbi->s_bal_cX_failed[cr]);
+
+		if (i == ngroups && ac->ac_criteria == CR1_5)
+			/* Reset goal length to original goal length before
+			 * falling into CR2 */
+			ac->ac_g_ex.fe_len = ac->ac_orig_goal_len;
 	}
 
 	if (ac->ac_b_ex.fe_len > 0 && ac->ac_status != AC_STATUS_FOUND &&
@@ -2973,6 +3074,16 @@ int ext4_seq_mb_stats_show(struct seq_file *seq, void *offset)
 	seq_printf(seq, "\t\tbad_suggestions: %u\n",
 		   atomic_read(&sbi->s_bal_cr1_bad_suggestions));
 
+	seq_puts(seq, "\tcr1.5_stats:\n");
+	seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[CR1_5]));
+	seq_printf(seq, "\t\tgroups_considered: %llu\n",
+		   atomic64_read(&sbi->s_bal_cX_groups_considered[CR1_5]));
+	seq_printf(seq, "\t\textents_scanned: %u\n", atomic_read(&sbi->s_bal_cX_ex_scanned[CR1_5]));
+	seq_printf(seq, "\t\tuseless_loops: %llu\n",
+		   atomic64_read(&sbi->s_bal_cX_failed[CR1_5]));
+	seq_printf(seq, "\t\tbad_suggestions: %u\n",
+		   atomic_read(&sbi->s_bal_cr1_5_bad_suggestions));
+
 	seq_puts(seq, "\tcr2_stats:\n");
 	seq_printf(seq, "\t\thits: %llu\n", atomic64_read(&sbi->s_bal_cX_hits[CR2]));
 	seq_printf(seq, "\t\tgroups_considered: %llu\n",
@@ -3490,6 +3601,8 @@ int ext4_mb_init(struct super_block *sb)
 	sbi->s_mb_stats = MB_DEFAULT_STATS;
 	sbi->s_mb_stream_request = MB_DEFAULT_STREAM_THRESHOLD;
 	sbi->s_mb_order2_reqs = MB_DEFAULT_ORDER2_REQS;
+	sbi->s_mb_cr1_5_max_trim_order = MB_DEFAULT_CR1_5_TRIM_ORDER;
+
 	/*
 	 * The default group preallocation is 512, which for 4k block
 	 * sizes translates to 2 megabytes.  However for bigalloc file
@@ -4402,6 +4515,7 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
 	 * placement or satisfy big request as is */
 	ac->ac_g_ex.fe_logical = start;
 	ac->ac_g_ex.fe_len = EXT4_NUM_B2C(sbi, size);
+	ac->ac_orig_goal_len = ac->ac_g_ex.fe_len;
 
 	/* define goal start in order to merge */
 	if (ar->pright && (ar->lright == (start + size)) &&
@@ -4445,8 +4559,10 @@ static void ext4_mb_collect_stats(struct ext4_allocation_context *ac)
 		if (ac->ac_g_ex.fe_start == ac->ac_b_ex.fe_start &&
 		    ac->ac_g_ex.fe_group == ac->ac_b_ex.fe_group)
 			atomic_inc(&sbi->s_bal_goals);
-		if (ac->ac_f_ex.fe_len == ac->ac_g_ex.fe_len)
+		/* did we allocate as much as normalizer originally wanted? */
+		if (ac->ac_f_ex.fe_len == ac->ac_orig_goal_len)
 			atomic_inc(&sbi->s_bal_len_goals);
+
 		if (ac->ac_found > sbi->s_mb_max_to_scan)
 			atomic_inc(&sbi->s_bal_breaks);
 	}
@@ -4931,7 +5047,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
 
 	pa = ac->ac_pa;
 
-	if (ac->ac_b_ex.fe_len < ac->ac_g_ex.fe_len) {
+	if (ac->ac_b_ex.fe_len < ac->ac_orig_goal_len) {
 		int new_bex_start;
 		int new_bex_end;
 
@@ -4946,14 +5062,14 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
 		 * fragmentation in check while ensuring logical range of best
 		 * extent doesn't overflow out of goal extent:
 		 *
-		 * 1. Check if best ex can be kept at end of goal and still
-		 *    cover original start
+		 * 1. Check if best ex can be kept at end of goal (before
+		 *    cr_best_avail trimmed it) and still cover original start
 		 * 2. Else, check if best ex can be kept at start of goal and
 		 *    still cover original start
 		 * 3. Else, keep the best ex at start of original request.
 		 */
 		new_bex_end = ac->ac_g_ex.fe_logical +
-			EXT4_C2B(sbi, ac->ac_g_ex.fe_len);
+			EXT4_C2B(sbi, ac->ac_orig_goal_len);
 		new_bex_start = new_bex_end - EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
 		if (ac->ac_o_ex.fe_logical >= new_bex_start)
 			goto adjust_bex;
@@ -4974,7 +5090,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
 		BUG_ON(ac->ac_o_ex.fe_logical < ac->ac_b_ex.fe_logical);
 		BUG_ON(ac->ac_o_ex.fe_len > ac->ac_b_ex.fe_len);
 		BUG_ON(new_bex_end > (ac->ac_g_ex.fe_logical +
-				      EXT4_C2B(sbi, ac->ac_g_ex.fe_len)));
+				      EXT4_C2B(sbi, ac->ac_orig_goal_len)));
 	}
 
 	pa->pa_lstart = ac->ac_b_ex.fe_logical;
@@ -5594,6 +5710,7 @@ ext4_mb_initialize_context(struct ext4_allocation_context *ac,
 	ac->ac_o_ex.fe_start = block;
 	ac->ac_o_ex.fe_len = len;
 	ac->ac_g_ex = ac->ac_o_ex;
+	ac->ac_orig_goal_len = ac->ac_g_ex.fe_len;
 	ac->ac_flags = ar->flags;
 
 	/* we have to define context: we'll work with a file or
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index acfdc204e15d..bddc0335c261 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -85,6 +85,13 @@
  */
 #define MB_DEFAULT_LINEAR_SCAN_THRESHOLD	16
 
+/*
+ * The maximum order up to which CR1.5 can trim a particular allocation request.
+ * For example, if we have an order 7 request and max trim order of 3, CR1.5 can
+ * trim this up to order 4.
+ */
+#define MB_DEFAULT_CR1_5_TRIM_ORDER	3
+
 /*
  * Number of valid buddy orders
  */
@@ -179,6 +186,12 @@ struct ext4_allocation_context {
 	/* copy of the best found extent taken before preallocation efforts */
 	struct ext4_free_extent ac_f_ex;
 
+	/*
+	 * goal len can change in CR1.5, so save the original len. This is
+	 * used while adjusting the PA window and for accounting.
+	 */
+	ext4_grpblk_t	ac_orig_goal_len;
+
 	__u32 ac_groups_considered;
 	__u32 ac_flags;		/* allocation hints */
 	__u16 ac_groups_scanned;
diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index 3042bc605bbf..4a5c08c8dddb 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -223,6 +223,7 @@ EXT4_RW_ATTR_SBI_UI(warning_ratelimit_interval_ms, s_warning_ratelimit_state.interval);
 EXT4_RW_ATTR_SBI_UI(warning_ratelimit_burst, s_warning_ratelimit_state.burst);
 EXT4_RW_ATTR_SBI_UI(msg_ratelimit_interval_ms, s_msg_ratelimit_state.interval);
 EXT4_RW_ATTR_SBI_UI(msg_ratelimit_burst, s_msg_ratelimit_state.burst);
+EXT4_RW_ATTR_SBI_UI(mb_cr1_5_max_trim_order, s_mb_cr1_5_max_trim_order);
 #ifdef CONFIG_EXT4_DEBUG
 EXT4_RW_ATTR_SBI_UL(simulate_fail, s_simulate_fail);
 #endif
@@ -273,6 +274,7 @@ static struct attribute *ext4_attrs[] = {
 	ATTR_LIST(warning_ratelimit_burst),
 	ATTR_LIST(msg_ratelimit_interval_ms),
 	ATTR_LIST(msg_ratelimit_burst),
+	ATTR_LIST(mb_cr1_5_max_trim_order),
 	ATTR_LIST(errors_count),
 	ATTR_LIST(warning_count),
 	ATTR_LIST(msg_count),
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index f062147ca32b..7ea9b4fcb21f 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -122,6 +122,7 @@ TRACE_DEFINE_ENUM(EXT4_FC_REASON_MAX);
 
 TRACE_DEFINE_ENUM(CR0);
 TRACE_DEFINE_ENUM(CR1);
+TRACE_DEFINE_ENUM(CR1_5);
 TRACE_DEFINE_ENUM(CR2);
 TRACE_DEFINE_ENUM(CR3);
 
@@ -129,6 +130,7 @@ TRACE_DEFINE_ENUM(CR3);
 	__print_symbolic(cr,	\
 		{ CR0, "CR0" },	\
 		{ CR1, "CR1" },	\
+		{ CR1_5, "CR1.5" },	\
 		{ CR2, "CR2" },	\
 		{ CR3, "CR3" })
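[Editor's note: on the XXX comment that Jan answers above — the patch is right
to use roundup(). A sketch of the two helpers, assuming the standard
definitions from include/linux/math.h (both shown in simplified, same-type
form), shows why round_up() needs a power-of-two multiple:]

/* Works for any y -- the patch's case, where a stripe may be e.g. 48 clusters. */
#define roundup(x, y)	((((x) + ((y) - 1)) / (y)) * (y))

/* Bit trick: only valid when y is a power of two. */
#define round_up(x, y)	((((x) - 1) | ((y) - 1)) + 1)

/*
 * With a 3-data-disk RAID stripe of 48 clusters, as in Jan's example:
 *	roundup(50, 48)  == 96   (correct)
 *	round_up(50, 48) == 64   (wrong: 48 is not a power of two)
 */

[The division in roundup() is slower than the mask arithmetic in round_up(),
which is presumably why the XXX comment asks the question, but correctness for
arbitrary stripe sizes has to win here.]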