diff mbox series

[RFC] badblocks: Improvement badblocks_set() for handling multiple ranges

Message ID 20201203171535.67715-1-colyli@suse.de (mailing list archive)
State New, archived
Headers show
Series [RFC] badblocks: Improvement badblocks_set() for handling multiple ranges | expand

Commit Message

Coly Li Dec. 3, 2020, 5:15 p.m. UTC
Recently I received a bug report that current badblocks code does not
properly handle multiple ranges. For example,
	badblocks_set(bb, 32, 1, true);
	badblocks_set(bb, 34, 1, true);
	badblocks_set(bb, 36, 1, true);
	badblocks_set(bb, 32, 12, true);
Then indeed badblocks_show() reports,
	32 3
	36 1
But the expected bad blocks table should be,
	32 12
Obviously only the first 2 ranges are merged and badblocks_set() returns
and ignores the rest setting range.

This behavior is improper, if the caller of badblocks_set() wants to set
a range of blocks into bad blocks table, all of the blocks in the range
should be handled even the previous part encountering failure.

The desired way to set bad blocks range by badblocks_set() is,
- Set as many as blocks in the setting range into bad blocks table.
- Merge the bad blocks ranges and occupy as less as slots in the bad
  blocks table.
- Fast.

Indeed the above proposal is complicated, especially with the following
restrictions,
- The setting bad blocks range can be ackknowledged or not acknowledged.
- The bad blocks table size is limited.
- Memory allocation should be avoided.

This patch is an initial effort to improve badblocks_set() for setting
bad blocks range when it covers multiple already set bad ranges in the
bad blocks table, and to do it as fast as possible.

The basic idea of the patch is to categorize all possible bad blocks
range setting combinationsinto to much less simplified and more less
special conditions. Inside badblocks_set() there is an implicit loop
composed by jumping between labels 're_insert' and 'update_sectors'. No
matter how large the setting bad blocks range is, in every loop just a
minimized range from the head is handled by a pre-defined behavior from
one of the categorized conditions. The logic is simple and code flow is
manageable.

This patch is unfinished yet, it only improves badblocks_set() and not
touch badblocks_clear() and badblocks_show() yet. I post it earlier
because this patch will be large (more then 1000 lines of change), I
want more people to give me comments earlier before I go too far away.

The code logic is tested as user space programmer, this patch passes
compiling but not tested in kernel mode yet. Right now it is only for
RFC purpose. I will post tested patch in further versions.

Thank you in advance for any review or comments on this patch.

Signed-off-by: Coly Li <colyli@suse.de>
---
 block/badblocks.c         | 1041 ++++++++++++++++++++++++++++++-------
 include/linux/badblocks.h |   33 ++
 2 files changed, 881 insertions(+), 193 deletions(-)

Comments

Dan Williams Dec. 18, 2020, 3:25 a.m. UTC | #1
[ add Neil, original gooodguy who wrote badblocks ]


On Thu, Dec 3, 2020 at 9:16 AM Coly Li <colyli@suse.de> wrote:
>
> Recently I received a bug report that current badblocks code does not
> properly handle multiple ranges. For example,
>         badblocks_set(bb, 32, 1, true);
>         badblocks_set(bb, 34, 1, true);
>         badblocks_set(bb, 36, 1, true);
>         badblocks_set(bb, 32, 12, true);
> Then indeed badblocks_show() reports,
>         32 3
>         36 1
> But the expected bad blocks table should be,
>         32 12
> Obviously only the first 2 ranges are merged and badblocks_set() returns
> and ignores the rest setting range.
>
> This behavior is improper, if the caller of badblocks_set() wants to set
> a range of blocks into bad blocks table, all of the blocks in the range
> should be handled even the previous part encountering failure.
>
> The desired way to set bad blocks range by badblocks_set() is,
> - Set as many as blocks in the setting range into bad blocks table.
> - Merge the bad blocks ranges and occupy as less as slots in the bad
>   blocks table.
> - Fast.
>
> Indeed the above proposal is complicated, especially with the following
> restrictions,
> - The setting bad blocks range can be ackknowledged or not acknowledged.

s/ackknowledged/acknowledged/

I'd run checkpatch --codespell for future versions...

> - The bad blocks table size is limited.
> - Memory allocation should be avoided.
>
> This patch is an initial effort to improve badblocks_set() for setting
> bad blocks range when it covers multiple already set bad ranges in the
> bad blocks table, and to do it as fast as possible.
>
> The basic idea of the patch is to categorize all possible bad blocks
> range setting combinationsinto to much less simplified and more less
> special conditions. Inside badblocks_set() there is an implicit loop
> composed by jumping between labels 're_insert' and 'update_sectors'. No
> matter how large the setting bad blocks range is, in every loop just a
> minimized range from the head is handled by a pre-defined behavior from
> one of the categorized conditions. The logic is simple and code flow is
> manageable.
>
> This patch is unfinished yet, it only improves badblocks_set() and not
> touch badblocks_clear() and badblocks_show() yet. I post it earlier
> because this patch will be large (more then 1000 lines of change), I
> want more people to give me comments earlier before I go too far away.
>

I wonder if this isn't indication that the base data structure should
be replaced... but I have not had a chance to devote deeper thought to
this.


> The code logic is tested as user space programmer, this patch passes
> compiling but not tested in kernel mode yet. Right now it is only for
> RFC purpose. I will post tested patch in further versions.
>
> Thank you in advance for any review or comments on this patch.
>
> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>  block/badblocks.c         | 1041 ++++++++++++++++++++++++++++++-------
>  include/linux/badblocks.h |   33 ++
>  2 files changed, 881 insertions(+), 193 deletions(-)
>
> diff --git a/block/badblocks.c b/block/badblocks.c
> index d39056630d9c..04ccae95777d 100644
> --- a/block/badblocks.c
> +++ b/block/badblocks.c
> @@ -5,6 +5,8 @@
>   * - Heavily based on MD badblocks code from Neil Brown
>   *
>   * Copyright (c) 2015, Intel Corporation.
> + *
> + * Improvement for handling multiple ranges by Coly Li <colyli@suse.de>
>   */
>
>  #include <linux/badblocks.h>
> @@ -16,114 +18,612 @@
>  #include <linux/types.h>
>  #include <linux/slab.h>
>
> -/**
> - * badblocks_check() - check a given range for bad sectors
> - * @bb:                the badblocks structure that holds all badblock information
> - * @s:         sector (start) at which to check for badblocks
> - * @sectors:   number of sectors to check for badblocks
> - * @first_bad: pointer to store location of the first badblock
> - * @bad_sectors: pointer to store number of badblocks after @first_bad
> +/*
> + * The purpose of badblocks set/clear is to manage bad blocks ranges which are
> + * identified by LBA addresses.
>   *
> - * We can record which blocks on each device are 'bad' and so just
> - * fail those blocks, or that stripe, rather than the whole device.
> - * Entries in the bad-block table are 64bits wide.  This comprises:
> - * Length of bad-range, in sectors: 0-511 for lengths 1-512
> - * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
> - *  A 'shift' can be set so that larger blocks are tracked and
> - *  consequently larger devices can be covered.
> - * 'Acknowledged' flag - 1 bit. - the most significant bit.
> + * When the caller of badblocks_set() wants to set a range of bad blocks, the
> + * setting range can be acked or unacked. And the setting range may merge,
> + * overwrite, skip the overlaypped already set range, depends on who they are
> + * overlapped or adjacent, and the acknowledgment type of the ranges. It can be
> + * more complicated when the setting range covers multiple already set bad block
> + * ranges, with restritctions of maximum length of each bad range and the bad
> + * table space limitation.
>   *
> - * Locking of the bad-block table uses a seqlock so badblocks_check
> - * might need to retry if it is very unlucky.
> - * We will sometimes want to check for bad blocks in a bi_end_io function,
> - * so we use the write_seqlock_irq variant.
> + * It is difficut and unnecessary to take care of all the possible situations,
> + * for setting a large range of bad blocks, we can handle it by dividing the
> + * large range into smaller ones when encounter overlap, max range length or
> + * bad table full conditions. Every time only a smaller piece of the bad range
> + * is handled with a limited number of conditions how it is interacted with
> + * possible overlapped or adjacent already set bad block ranges. Then the hard
> + * complicated problem can be much simpler to habndle in proper way.
>   *
> - * When looking for a bad block we specify a range and want to
> - * know if any block in the range is bad.  So we binary-search
> - * to the last range that starts at-or-before the given endpoint,
> - * (or "before the sector after the target range")
> - * then see if it ends after the given start.
> + * When setting a range of bad blocks to the bad table, the simplified situations
> + * to be considered are, (The already set bad blocks ranges are naming with
> + *  prefix E, and the setting bad blocks range is naming with prefix S)
> + *
> + * 1) A setting range is not overlapped or adjacent to any other already set bad
> + *    block range.
> + *                         +--------+
> + *                         |    S   |
> + *                         +--------+
> + *        +-------------+               +-------------+
> + *        |      E1     |               |      E2     |
> + *        +-------------+               +-------------+
> + *    For this situation if the bad blocks table is not full, just allocate a
> + *    free slot from the bad blocks table to mark the setting range S. The
> + *    result is,
> + *        +-------------+  +--------+   +-------------+
> + *        |      E1     |  |    S   |   |      E2     |
> + *        +-------------+  +--------+   +-------------+
> + * 2) A setting range starts exactly at a start LBA of an already set bad blocks
> + *    range.
> + * 2.1) The setting range size < already set range size
> + *        +--------+
> + *        |    S   |
> + *        +--------+
> + *        +-------------+
> + *        |      E      |
> + *        +-------------+
> + * 2.1.1) If S and E are both acked or unacked range, the setting range S can
> + *    be merged into existing bad range E. The result is,
> + *        +-------------+
> + *        |      S      |
> + *        +-------------+
> + * 2.1.2) If S is uncked setting and E is acked, the setting will be dinied, and
> + *    the result is,
> + *        +-------------+
> + *        |      E      |
> + *        +-------------+
> + * 2.1.3) If S is acked setting and E is unacked, range S can overwirte on E.
> + *    An extra slot from the bad blocks table will be allocated for S, and head
> + *    of E will move to end of the inserted range E. The result is,
> + *        +--------+----+
> + *        |    S   | E  |
> + *        +--------+----+
> + * 2.2) The setting range size == already set range size
> + * 2.2.1) If S and E are both acked or unacked range, the setting range S can
> + *    be merged into existing bad range E. The result is,
> + *        +-------------+
> + *        |      S      |
> + *        +-------------+
> + * 2.2.2) If S is uncked setting and E is acked, the setting will be dinied, and
> + *    the result is,
> + *        +-------------+
> + *        |      E      |
> + *        +-------------+
> + * 2.2.3) If S is acked setting and E is unacked, range S can overwirte all of
> +      bad blocks range E. The result is,
> + *        +-------------+
> + *        |      S      |
> + *        +-------------+
> + * 2.3) The setting range size > already set range size
> + *        +-------------------+
> + *        |          S        |
> + *        +-------------------+
> + *        +-------------+
> + *        |      E      |
> + *        +-------------+
> + *    For such situation, the setting range S can be treated as two parts, the
> + *    first part (S1) is as same size as the already set range E, the second
> + *    part (S2) is thre rest of setting range.
> + *        +-------------+-----+        +-------------+       +-----+
> + *        |    S1       | S2  |        |     S1      |       | S2  |
> + *        +-------------+-----+  ===>  +-------------+       +-----+
> + *        +-------------+              +-------------+
> + *        |      E      |              |      E      |
> + *        +-------------+              +-------------+
> + *    Now we only focus on how to handle the setting range S1 and already set
> + *    range E, which are already explained in 1.2), for the rest S2 it will be
> + *    handled later in next loop.
> + * 3) A setting range starts before the start LBA of an already set bad blocks
> + *    range.
> + *        +-------------+
> + *        |      S      |
> + *        +-------------+
> + *             +-------------+
> + *             |      E      |
> + *             +-------------+
> + *    For this situation, the setting range S can be divided into two parts, the
> + *    first (S1) ends at the start LBA of already set range E, the second part
> + *    (S2) starts exactly at a start LBA of the already set range E.
> + *        +----+---------+             +----+      +---------+
> + *        | S1 |    S2   |             | S1 |      |    S2   |
> + *        +----+---------+      ===>   +----+      +---------+
> + *             +-------------+                     +-------------+
> + *             |      E      |                     |      E      |
> + *             +-------------+                     +-------------+
> + *    Now only the first part S1 should be handled in this loop, which is in
> + *    similar condition as 1). The rest part S2 has exact same start LBA address
> + *    of the already set range E, they will be handled in next loop in one of
> + *    situations in 2).
> + * 4) A setting range starts after the start LBA of an already set bad blocks
> + *    range.
> + * 4.1) If the setting range S exactly matches the tail part of already set bad
> + *    blocks range E, like the folowing chart shows,
> + *            +---------+
> + *            |   S     |
> + *            +---------+
> + *        +-------------+
> + *        |      E      |
> + *        +-------------+
> + * 4.1.1) If range S and E have same ackknowledg value (both acked or unacked),
> + *    they will be merged into one, the result is,
> + *        +-------------+
> + *        |      S      |
> + *        +-------------+
> + * 4.1.2) If range E is acked and the setting range S is unacked, the setting
> + *    request of S will be rejected, the result is,
> + *        +-------------+
> + *        |      E      |
> + *        +-------------+
> + * 4.1.3) If range E is unacked, and the setting range S is acked, then S may
> + *    overwrite the overlapped range of E, the result is,
> + *        +---+---------+
> + *        | E |    S    |
> + *        +---+---------+
> + * 4.2) If the setting range S stays in middle of an already set range E, like
> + *    the following chart shows,
> + *             +----+
> + *             | S  |
> + *             +----+
> + *        +--------------+
> + *        |       E      |
> + *        +--------------+
> + * 4.2.1) If range S and E have same ackknowledg value (both acked or unacked),
> + *    they will be merged into one, the result is,
> + *        +--------------+
> + *        |       S      |
> + *        +--------------+
> + * 4.2.2) If range E is acked and the setting range S is unacked, the setting
> + *    request of S will be rejected, the result is also,
> + *        +--------------+
> + *        |       E      |
> + *        +--------------+
> + * 4.2.3) If range E is unacked, and the setting range S is acked, then S will
> + *    inserted into middle of E and split previous range E into twp parts (E1
> + *    and E2), the result is,
> + *        +----+----+----+
> + *        | E1 |  S | E2 |
> + *        +----+----+----+
> + * 4.3) If the setting bad blocks range S is overlapped with an already set bad
> + *    blocks range E. The range S starts after the start LBA of range E, and
> + *    ends after the end LBA of range E, as the following chart shows,
> + *            +-------------------+
> + *            |          S        |
> + *            +-------------------+
> + *        +-------------+
> + *        |      E      |
> + *        +-------------+
> + *    For this situation the range S can be divided into two parts, the first
> + *    part (S1) ends at end range E, and the second part (S2) has rest range of
> + *    origin S.
> + *            +---------+---------+            +---------+      +---------+
> + *            |    S1   |    S2   |            |    S1   |      |    S2   |
> + *            +---------+---------+  ===>      +---------+      +---------+
> + *        +-------------+                  +-------------+
> + *        |      E      |                  |      E      |
> + *        +-------------+                  +-------------+
> + *     Now in this loop the setting range S1 and already set range E can be
> + *     handled as the situations 4), the rest range S2 will be handled in next
> + *     loop and ignored in this loop.
> + * 5) A setting bad blocks range S is adjacent to one or more already set bad
> + *    blocks range(s), and they are all acked or unacked range.
> + * 5.1) Front merge: If the already set bad blocks range E is before setting
> + *    range S and they are adjacent,
> + *                +------+
> + *                |  S   |
> + *                +------+
> + *        +-------+
> + *        |   E   |
> + *        +-------+
> + * 5.1.1) When total size of range S and E <= BB_MAX_LEN, and their acknowledge
> + *    values are same, the setting range S can front merges into range E. The
> + *    result is,
> + *        +--------------+
> + *        |       S      |
> + *        +--------------+
> + * 5.1.2) Otherwise these two ranges cannot merge, just insert the setting
> + *    range S right after already set range E into the bad blocks table. The
> + *    result is,
> + *        +--------+------+
> + *        |   E    |   S  |
> + *        +--------+------+
> + * 6) Special cases which above conditions cannot handle
> + * 6.1) Multiple already set ranges may merge into less ones in a full bad table
> + *        +-------------------------------------------------------+
> + *        |                           S                           |
> + *        +-------------------------------------------------------+
> + *        |<----- BB_MAX_LEN ----->|
> + *                                 +-----+     +-----+   +-----+
> + *                                 | E1  |     | E2  |   | E3  |
> + *                                 +-----+     +-----+   +-----+
> + *     In the above example, when the bad blocks table is full, inserting the
> + *     first part of setting range S will fail because no more available slot
> + *     can be allocated from bad blocks table. In this situation a proper
> + *     setting method should be go though all the setting bad blocks range and
> + *     look for chance to merge already set ranges into less ones. When there
> + *     is available slot from bad blocks table, re-try again to handle more
> + *     setting bad blocks ranges as many as possible.
> + *        +------------------------+
> + *        |          S3            |
> + *        +------------------------+
> + *        |<----- BB_MAX_LEN ----->|
> + *                                 +-----+-----+-----+---+-----+--+
> + *                                 |       S1        |     S2     |
> + *                                 +-----+-----+-----+---+-----+--+
> + *     The above chart shows although the first part (S3) cannot be inserted due
> + *     to no-space in bad blocks table, but the following E1, E2 and E3 ranges
> + *     can be merged with rest part of S into less range S1 and S2. Now there is
> + *     1 free slot in bad blocks table.
> + *        +------------------------+-----+-----+-----+---+-----+--+
> + *        |           S3           |       S1        |     S2     |
> + *        +------------------------+-----+-----+-----+---+-----+--+
> + *     Since the bad blocks table is not full anymore, re-try again for the
> + *     origin setting range S. Now the setting range S3 can be inserted into the
> + *     bad blocks table with previous freed slot from multiple ranges merge.
> + * 6.2) Front merge after overwrite
> + *    In the following example, in bad blocks table, E1 is an acked bad blocks
> + *    range and E2 is an unacked bad blocks range, therefore they are not able
> + *    to merge into a larger range. The setting bad blocks range S is acked,
> + *    therefore part of E2 can be overwritten by S.
> + *                      +--------+
> + *                      |    S   |                             acknowledged
> + *                      +--------+                         S:       1
> + *              +-------+-------------+                   E1:       1
> + *              |   E1  |    E2       |                   E2:       0
> + *              +-------+-------------+
> + *     With previosu simplified routines, after overwiting part of E2 with S,
> + *     the bad blocks table should be (E3 is remaining part of E2 which is not
> + *     overwritten by S),
> + *                                                             acknowledged
> + *              +-------+--------+----+                    S:       1
> + *              |   E1  |    S   | E3 |                   E1:       1
> + *              +-------+--------+----+                   E3:       0
> + *     The above result is correct but not perfect. Range E1 and S in the bad
> + *     blocks table are all acked, merging them into a larger one range may
> + *     occupy less bad blocks table space and make badblocks_check() faster.
> + *     Therefore in such situation, after overwiting range S, the previous range
> + *     E1 should be checked for possible front combination. Then the ideal
> + *     result can be,
> + *              +----------------+----+                        acknowledged
> + *              |       E1       | E3 |                   E1:       1
> + *              +----------------+----+                   E3:       0
> + * 6.3) Behind merge: If the already set bad blocks range E is behind the setting
> + *    range S and they are adjacent. Normally we don't need to care about this
> + *    because front merge handles this while going though range S from head to
> + *    tail, except for the tail part of range S. When the setting range S are
> + *    fully handled, all the above simplified routine doesn't check whether the
> + *    tail LBA of range S is adjacent to the next already set range and not able
> + *    to them if they are mergeable.
> + *        +------+
> + *        |  S   |
> + *        +------+
> + *               +-------+
> + *               |   E   |
> + *               +-------+
> + *    For the above special stiuation, when the setting range S are all handled
> + *    and the loop ends, an extra check is necessary for whether next already
> + *    set range E is right after S and mergeable.
> + * 6.2.1) When total size of range E and S <= BB_MAX_LEN, and their acknowledge
> + *    values are same, the setting range S can behind merges into range E. The
> + *    result is,
> + *        +--------------+
> + *        |       S      |
> + *        +--------------+
> + * 6.2.2) Otherwise these two ranges cannot merge, just insert the setting range
> + *     S infront of the already set range E in the bad blocks table. The result
> + *     is,
> + *        +------+-------+
> + *        |  S   |   E   |
> + *        +------+-------+
> + *
> + * All the above 5 simplified situations and 3 special cases may cover 99%+ of
> + * the bad block range setting conditions. Maybe there is some rare corner case
> + * is not considered and optimized, it won't hurt if badblocks_set() fails due
> + * to no space, or some ranges are not merged to save bad blocks table space.
> + *
> + * Inside badblocks_set() each loop starts by jumping to re_insert label, every
> + * time for the new loop prev_badblocks() is called to find an already set range
> + * which starts before or at current setting range. Since the setting bad blocks
> + * range is handled from head to tail, most of the cases it is unnecessary to do
> + * the binary search inside prev_badblocks(), it is possible to provide a hint
> + * to prev_badblocks() for a fast path, then the expensive binary search can be
> + * avoided. In my test with the hint to prev_badblocks(), except for the first
> + * loop, all rested calls to prev_badblocks() can go into the fast path and
> + * return correct bad blocks table index immediately.
>   *
> - * Return:
> - *  0: there are no known bad blocks in the range
> - *  1: there are known bad block which are all acknowledged
> - * -1: there are bad blocks which have not yet been acknowledged in metadata.
> - * plus the start/length of the first bad section we overlap.
>   */
> -int badblocks_check(struct badblocks *bb, sector_t s, int sectors,
> -                       sector_t *first_bad, int *bad_sectors)
> +
> +static int prev_by_hint(struct badblocks *bb, sector_t s, int hint)
>  {
> -       int hi;
> -       int lo;
>         u64 *p = bb->page;
> -       int rv;
> -       sector_t target = s + sectors;
> -       unsigned seq;
> +       int ret = -1;
> +       int hint_end = hint + 2;
>
> -       if (bb->shift > 0) {
> -               /* round the start down, and the end up */
> -               s >>= bb->shift;
> -               target += (1<<bb->shift) - 1;
> -               target >>= bb->shift;
> -               sectors = target - s;
> +       while ((hint < hint_end) && ((hint + 1) <= bb->count) &&
> +              (BB_OFFSET(p[hint]) <= s)) {
> +               if ((hint + 1) == bb->count || BB_OFFSET(p[hint + 1]) > s) {
> +                       ret = hint;
> +                       break;
> +               }
> +               hint++;
> +       }
> +
> +       return ret;
> +}
> +
> +/* find the range starts at-or-before bad->start */
> +static int prev_badblocks(struct badblocks *bb, struct bad_context *bad,
> +                         int hint)
> +{
> +       u64 *p;
> +       int lo, hi;
> +       sector_t s = bad->start;
> +       int ret = -1;
> +
> +       if (!bb->count)
> +               goto out;
> +
> +       if (hint >= 0) {
> +               ret = prev_by_hint(bb, s, hint);
> +               if (ret >= 0)
> +                       goto out;
>         }
> -       /* 'target' is now the first block after the bad range */
>
> -retry:
> -       seq = read_seqbegin(&bb->lock);
>         lo = 0;
> -       rv = 0;
>         hi = bb->count;
> +       p = bb->page;
>
> -       /* Binary search between lo and hi for 'target'
> -        * i.e. for the last range that starts before 'target'
> -        */
> -       /* INVARIANT: ranges before 'lo' and at-or-after 'hi'
> -        * are known not to be the last range before target.
> -        * VARIANT: hi-lo is the number of possible
> -        * ranges, and decreases until it reaches 1
> -        */
>         while (hi - lo > 1) {
> -               int mid = (lo + hi) / 2;
> +               int mid = (lo + hi)/2;
>                 sector_t a = BB_OFFSET(p[mid]);
>
> -               if (a < target)
> -                       /* This could still be the one, earlier ranges
> -                        * could not.
> -                        */
> +               if (a <= s)
>                         lo = mid;
>                 else
> -                       /* This and later ranges are definitely out. */
>                         hi = mid;
>         }
> -       /* 'lo' might be the last that started before target, but 'hi' isn't */
> -       if (hi > lo) {
> -               /* need to check all range that end after 's' to see if
> -                * any are unacknowledged.
> +
> +       if (BB_OFFSET(p[lo]) <= s)
> +               ret = lo;
> +out:
> +       return ret;
> +}
> +
> +static int can_merge_behind(struct badblocks *bb, struct bad_context *bad,
> +                           int behind)
> +{
> +       u64 *p = bb->page;
> +       sector_t s = bad->start;
> +       sector_t sectors = bad->len;
> +       int ack = bad->ack;
> +
> +       if ((s <= BB_OFFSET(p[behind])) &&
> +           ((s + sectors) >= BB_OFFSET(p[behind])) &&
> +           ((BB_END(p[behind]) - s) <= BB_MAX_LEN) &&
> +           BB_ACK(p[behind]) == ack)
> +               return true;
> +       return false;
> +}
> +
> +static int behind_merge(struct badblocks *bb, struct bad_context *bad,
> +                       int behind)
> +{
> +       u64 *p = bb->page;
> +       sector_t s = bad->start;
> +       sector_t sectors = bad->len;
> +       int ack = bad->ack;
> +       int merged = 0;
> +
> +       WARN_ON(s > BB_OFFSET(p[behind]));
> +       WARN_ON((s + sectors) < BB_OFFSET(p[behind]));
> +
> +       if (s < BB_OFFSET(p[behind])) {
> +               WARN_ON((BB_LEN(p[behind]) + merged) >= BB_MAX_LEN);
> +
> +               merged = min_t(sector_t, sectors, BB_OFFSET(p[behind]) - s);
> +               p[behind] =  BB_MAKE(s, BB_LEN(p[behind]) + merged, ack);
> +       } else {
> +               merged = min_t(sector_t, sectors, BB_LEN(p[behind]));
> +       }
> +
> +       WARN_ON(merged == 0);
> +
> +       return merged;
> +}
> +
> +static int can_merge_front(struct badblocks *bb, int prev,
> +                          struct bad_context *bad)
> +{
> +       u64 *p = bb->page;
> +       sector_t s = bad->start;
> +       int ack = bad->ack;
> +
> +       if (BB_ACK(p[prev]) == ack &&
> +           (s < BB_END(p[prev]) ||
> +            (s == BB_END(p[prev]) && (BB_LEN(p[prev]) < BB_MAX_LEN))))
> +               return true;
> +       return false;
> +}
> +
> +static int front_merge(struct badblocks *bb, int prev, struct bad_context *bad)
> +{
> +       int sectors = bad->len;
> +       int s = bad->start;
> +       int ack = bad->ack;
> +       u64 *p = bb->page;
> +       int merged = 0;
> +
> +       WARN_ON(s > BB_END(p[prev]));
> +
> +       if (s < BB_END(p[prev])) {
> +               merged = min_t(sector_t, sectors, BB_END(p[prev]) - s);
> +       } else {
> +               merged = min_t(sector_t, sectors, BB_MAX_LEN - BB_LEN(p[prev]));
> +               if ((prev + 1) < bb->count &&
> +                   merged > (BB_OFFSET(p[prev + 1]) - BB_END(p[prev]))) {
> +                       merged = BB_OFFSET(p[prev + 1]) - BB_END(p[prev]);
> +               }
> +
> +               p[prev] = BB_MAKE(BB_OFFSET(p[prev]), BB_LEN(p[prev]) + merged, ack);
> +       }
> +
> +       return merged;
> +}
> +
> +static int can_combine_front(struct badblocks *bb, int prev,
> +                            struct bad_context *bad)
> +{
> +       u64 *p = bb->page;
> +
> +       if ((BB_OFFSET(p[prev]) == bad->start) && (prev > 0) &&
> +           (BB_LEN(p[prev - 1]) + BB_LEN(p[prev]) <= BB_MAX_LEN) &&
> +           (BB_ACK(p[prev - 1]) == BB_ACK(p[prev])))
> +               return true;
> +       return false;
> +}
> +
> +static void front_combine(struct badblocks *bb, int prev)
> +{
> +       u64 *p = bb->page;
> +
> +       p[prev - 1] = BB_MAKE(BB_OFFSET(p[prev - 1]),
> +                             BB_LEN(p[prev - 1]) + BB_LEN(p[prev]),
> +                             BB_ACK(p[prev]));
> +       if ((prev + 1) < bb->count)
> +               memmove(p + prev, p + prev + 1, (bb->count - prev - 1) * 8);
> +}
> +
> +static int overlap_front(struct badblocks *bb, int front,
> +                         struct bad_context *bad)
> +{
> +       u64 *p = bb->page;
> +
> +       if (bad->start >= BB_OFFSET(p[front]) &&
> +           bad->start < BB_END(p[front]))
> +               return true;
> +       return false;
> +}
> +
> +static int can_front_overwrite(struct badblocks *bb, int prev,
> +                              struct bad_context *bad, int *extra)
> +{
> +       u64 *p = bb->page;
> +       int len;
> +
> +       WARN_ON(!overlap_front(bb, prev, bad));
> +
> +       if (BB_ACK(p[prev]) >= bad->ack)
> +               return false;
> +
> +       if (BB_END(p[prev]) <= (bad->start + bad->len)) {
> +               len = BB_END(p[prev]) - bad->start;
> +               if (BB_OFFSET(p[prev]) == bad->start)
> +                       *extra = 0;
> +               else
> +                       *extra = 1;
> +
> +               bad->len = len;
> +       } else {
> +               if (BB_OFFSET(p[prev]) == bad->start)
> +                       *extra = 1;
> +               else
> +               /*
> +                * prev range will be split into two, beside the overwritten
> +                * one, an extra slot needed from bad table.
>                  */
> -               while (lo >= 0 &&
> -                      BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > s) {
> -                       if (BB_OFFSET(p[lo]) < target) {
> -                               /* starts before the end, and finishes after
> -                                * the start, so they must overlap
> -                                */
> -                               if (rv != -1 && BB_ACK(p[lo]))
> -                                       rv = 1;
> -                               else
> -                                       rv = -1;
> -                               *first_bad = BB_OFFSET(p[lo]);
> -                               *bad_sectors = BB_LEN(p[lo]);
> -                       }
> -                       lo--;
> +                       *extra = 2;
> +       }
> +
> +       if ((bb->count + (*extra)) >= MAX_BADBLOCKS)
> +               return false;
> +
> +       return true;
> +}
> +
> +static int front_overwrite(struct badblocks *bb, int prev,
> +                          struct bad_context *bad, int extra)
> +{
> +       u64 *p = bb->page;
> +       int n = extra;
> +       sector_t orig_end = BB_END(p[prev]);
> +       int orig_ack = BB_ACK(p[prev]);
> +
> +       switch (extra) {
> +       case 0:
> +               p[prev] = BB_MAKE(BB_OFFSET(p[prev]), BB_LEN(p[prev]),
> +                                 bad->ack);
> +               break;
> +       case 1:
> +               if (BB_OFFSET(p[prev]) == bad->start) {
> +                       p[prev] = BB_MAKE(BB_OFFSET(p[prev]),
> +                                         bad->len, bad->ack);
> +                       memmove(p + prev + 2, p + prev + 1,
> +                               (bb->count - prev - 1) * 8);
> +                       p[prev + 1] = BB_MAKE(bad->start + bad->len,
> +                                             orig_end - BB_END(p[prev]),
> +                                             orig_ack);
> +               } else {
> +                       p[prev] = BB_MAKE(BB_OFFSET(p[prev]),
> +                                         BB_END(p[prev]) - bad->start,
> +                                         BB_ACK(p[prev]));
> +                       memmove(p + prev + 1 + n, p + prev + 1,
> +                               (bb->count - prev - 1) * 8);
> +                       p[prev + 1] = BB_MAKE(bad->start, bad->len, bad->ack);
>                 }
> +               break;
> +       case 2:
> +               p[prev] = BB_MAKE(BB_OFFSET(p[prev]),
> +                                 BB_END(p[prev]) - bad->start,
> +                                 BB_ACK(p[prev]));
> +               memmove(p + prev + 1 + n, p + prev + 1,
> +                       (bb->count - prev - 1) * 8);
> +               p[prev + 1] = BB_MAKE(bad->start, bad->len, bad->ack);
> +               p[prev + 2] = BB_MAKE(BB_END(p[prev + 1]),
> +                                     orig_end - BB_END(p[prev + 1]),
> +                                     BB_ACK(p[prev]));
> +               break;
> +       default:
> +               break;
>         }
>
> -       if (read_seqretry(&bb->lock, seq))
> -               goto retry;
> +       return bad->len;
> +}
>
> -       return rv;
> +static int overlap_behind(struct badblocks *bb, struct bad_context *bad,
> +                         int behind)
> +{
> +       u64 *p = bb->page;
> +
> +       if (bad->start < BB_OFFSET(p[behind]) &&
> +           (bad->start + bad->len) > BB_OFFSET(p[behind]))
> +               return true;
> +
> +       if (bad->start >= BB_OFFSET(p[behind]) &&
> +           bad->start < BB_END(p[behind]))
> +               return true;
> +
> +       return false;
> +}
> +
> +static int insert_at(struct badblocks *bb, int at, struct bad_context *bad)
> +{
> +       u64 *p = bb->page;
> +       int sectors = bad->len;
> +       int s = bad->start;
> +       int ack = bad->ack;
> +       int len;
> +
> +       WARN_ON(badblocks_full(bb));
> +
> +       len = min_t(sector_t, sectors, BB_MAX_LEN);
> +       if (at < bb->count)
> +               memmove(p + at + 1, p + at, (bb->count - at) * 8);
> +       p[at] = BB_MAKE(s, len, ack);
> +
> +       return len;
>  }
> -EXPORT_SYMBOL_GPL(badblocks_check);
>
>  static void badblocks_update_acked(struct badblocks *bb)
>  {
> @@ -164,7 +664,10 @@ int badblocks_set(struct badblocks *bb, sector_t s, int sectors,
>                         int acknowledged)
>  {
>         u64 *p;
> -       int lo, hi;
> +       struct bad_context bad;
> +       int prev = -1, hint = -1;
> +       int len = 0, added = 0;
> +       int retried = 0, space_desired = 0;
>         int rv = 0;
>         unsigned long flags;
>
> @@ -172,144 +675,187 @@ int badblocks_set(struct badblocks *bb, sector_t s, int sectors,
>                 /* badblocks are disabled */
>                 return 1;
>
> +       if (sectors <= 0)
> +               /* Invalid sectors number */
> +               return 1;
> +
>         if (bb->shift) {
>                 /* round the start down, and the end up */
>                 sector_t next = s + sectors;
>
> -               s >>= bb->shift;
> -               next += (1<<bb->shift) - 1;
> -               next >>= bb->shift;
> +               rounddown(s, bb->shift);
> +               roundup(next, bb->shift);
>                 sectors = next - s;
>         }
>
>         write_seqlock_irqsave(&bb->lock, flags);
>
> +       bad.orig_start = s;
> +       bad.orig_len = sectors;
> +       bad.ack = acknowledged;
>         p = bb->page;
> -       lo = 0;
> -       hi = bb->count;
> -       /* Find the last range that starts at-or-before 's' */
> -       while (hi - lo > 1) {
> -               int mid = (lo + hi) / 2;
> -               sector_t a = BB_OFFSET(p[mid]);
>
> -               if (a <= s)
> -                       lo = mid;
> -               else
> -                       hi = mid;
> +re_insert:
> +       bad.start = s;
> +       bad.len = sectors;
> +       len = 0;
> +
> +       if (badblocks_empty(bb)) {
> +               len = insert_at(bb, 0, &bad);
> +               bb->count++;
> +               added++;
> +               goto update_sectors;
>         }
> -       if (hi > lo && BB_OFFSET(p[lo]) > s)
> -               hi = lo;
>
> -       if (hi > lo) {
> -               /* we found a range that might merge with the start
> -                * of our new range
> -                */
> -               sector_t a = BB_OFFSET(p[lo]);
> -               sector_t e = a + BB_LEN(p[lo]);
> -               int ack = BB_ACK(p[lo]);
> -
> -               if (e >= s) {
> -                       /* Yes, we can merge with a previous range */
> -                       if (s == a && s + sectors >= e)
> -                               /* new range covers old */
> -                               ack = acknowledged;
> -                       else
> -                               ack = ack && acknowledged;
> -
> -                       if (e < s + sectors)
> -                               e = s + sectors;
> -                       if (e - a <= BB_MAX_LEN) {
> -                               p[lo] = BB_MAKE(a, e-a, ack);
> -                               s = e;
> +       prev = prev_badblocks(bb, &bad, hint);
> +
> +       /* start before all badblocks */
> +       if (prev < 0) {
> +               if (!badblocks_full(bb)) {
> +                       /* insert on the first */
> +                       if (bad.len > (BB_OFFSET(p[0]) - bad.start))
> +                               bad.len = BB_OFFSET(p[0]) - bad.start;
> +                       len = insert_at(bb, 0, &bad);
> +                       bb->count++;
> +                       added++;
> +                       hint = 0;
> +                       goto update_sectors;
> +               }
> +
> +               /* No sapce, try to merge */
> +               if (overlap_behind(bb, &bad, 0)) {
> +                       if (can_merge_behind(bb, &bad, 0)) {
> +                               len = behind_merge(bb, &bad, 0);
> +                               added++;
>                         } else {
> -                               /* does not all fit in one range,
> -                                * make p[lo] maximal
> -                                */
> -                               if (BB_LEN(p[lo]) != BB_MAX_LEN)
> -                                       p[lo] = BB_MAKE(a, BB_MAX_LEN, ack);
> -                               s = a + BB_MAX_LEN;
> +                               len = min_t(sector_t, BB_OFFSET(p[0]) - s, sectors);
> +                               space_desired = 1;
>                         }
> -                       sectors = e - s;
> +                       hint = 0;
> +                       goto update_sectors;
>                 }
> +
> +               /* no table space and give up */
> +               goto out;
>         }
> -       if (sectors && hi < bb->count) {
> -               /* 'hi' points to the first range that starts after 's'.
> -                * Maybe we can merge with the start of that range
> -                */
> -               sector_t a = BB_OFFSET(p[hi]);
> -               sector_t e = a + BB_LEN(p[hi]);
> -               int ack = BB_ACK(p[hi]);
> -
> -               if (a <= s + sectors) {
> -                       /* merging is possible */
> -                       if (e <= s + sectors) {
> -                               /* full overlap */
> -                               e = s + sectors;
> -                               ack = acknowledged;
> -                       } else
> -                               ack = ack && acknowledged;
> -
> -                       a = s;
> -                       if (e - a <= BB_MAX_LEN) {
> -                               p[hi] = BB_MAKE(a, e-a, ack);
> -                               s = e;
> -                       } else {
> -                               p[hi] = BB_MAKE(a, BB_MAX_LEN, ack);
> -                               s = a + BB_MAX_LEN;
> +
> +       /* in case p[prev-1] can be merged with p[prev] */
> +       if (can_combine_front(bb, prev, &bad)) {
> +               front_combine(bb, prev);
> +               bb->count--;
> +               added++;
> +               hint = prev - 1;
> +               goto update_sectors;
> +       }
> +
> +       if (overlap_front(bb, prev, &bad)) {
> +               if (can_merge_front(bb, prev, &bad)) {
> +                       len = front_merge(bb, prev, &bad);
> +                       added++;
> +                       hint = prev - 1;
> +               } else {
> +                       int extra = 0;
> +
> +                       if (!can_front_overwrite(bb, prev, &bad, &extra)) {
> +                               len = min_t(sector_t, BB_END(p[prev]) - s, sectors);
> +                               hint = prev;
> +                               goto update_sectors;
> +                       }
> +
> +                       len = front_overwrite(bb, prev, &bad, extra);
> +                       added++;
> +                       bb->count += extra;
> +                       hint = prev;
> +
> +                       if (prev > 0 && can_combine_front(bb, prev, &bad)) {
> +                               front_combine(bb, prev);
> +                               bb->count--;
> +                               hint = prev - 1;
>                         }
> -                       sectors = e - s;
> -                       lo = hi;
> -                       hi++;
>                 }
> +               goto update_sectors;
> +       }
> +
> +       if (can_merge_front(bb, prev, &bad)) {
> +               len = front_merge(bb, prev, &bad);
> +               added++;
> +               hint = prev;
> +               goto update_sectors;
>         }
> -       if (sectors == 0 && hi < bb->count) {
> -               /* we might be able to combine lo and hi */
> -               /* Note: 's' is at the end of 'lo' */
> -               sector_t a = BB_OFFSET(p[hi]);
> -               int lolen = BB_LEN(p[lo]);
> -               int hilen = BB_LEN(p[hi]);
> -               int newlen = lolen + hilen - (s - a);
> -
> -               if (s >= a && newlen < BB_MAX_LEN) {
> -                       /* yes, we can combine them */
> -                       int ack = BB_ACK(p[lo]) && BB_ACK(p[hi]);
> -
> -                       p[lo] = BB_MAKE(BB_OFFSET(p[lo]), newlen, ack);
> -                       memmove(p + hi, p + hi + 1,
> -                               (bb->count - hi - 1) * 8);
> -                       bb->count--;
> +
> +       /* if no space in table, still try to merge in the covered range */
> +       if (badblocks_full(bb)) {
> +               /* skip the cannot-merge range */
> +               if (((prev + 1) < bb->count) &&
> +                   overlap_behind(bb, &bad, prev + 1) &&
> +                   ((s + sectors) >= BB_END(p[prev + 1]))) {
> +                       len = BB_END(p[prev + 1]) - s;
> +                       hint = prev + 1;
> +                       goto update_sectors;
>                 }
> +
> +               /* no retry any more */
> +               len = sectors;
> +               space_desired = 1;
> +               hint = -1;
> +               goto update_sectors;
>         }
> -       while (sectors) {
> -               /* didn't merge (it all).
> -                * Need to add a range just before 'hi'
> -                */
> -               if (bb->count >= MAX_BADBLOCKS) {
> -                       /* No room for more */
> -                       rv = 1;
> -                       break;
> -               } else {
> -                       int this_sectors = sectors;
>
> -                       memmove(p + hi + 1, p + hi,
> -                               (bb->count - hi) * 8);
> -                       bb->count++;
> +       /* cannot merge and there is space in bad table */
> +       if (overlap_behind(bb, &bad, prev + 1))
> +               bad.len = min_t(sector_t, bad.len, BB_OFFSET(p[prev + 1]) - bad.start);
>
> -                       if (this_sectors > BB_MAX_LEN)
> -                               this_sectors = BB_MAX_LEN;
> -                       p[hi] = BB_MAKE(s, this_sectors, acknowledged);
> -                       sectors -= this_sectors;
> -                       s += this_sectors;
> -               }
> +       len = insert_at(bb, prev + 1, &bad);
> +       bb->count++;
> +       added++;
> +       hint = prev + 1;
> +
> +update_sectors:
> +       s += len;
> +       sectors -= len;
> +
> +       if (sectors > 0)
> +               goto re_insert;
> +
> +       WARN_ON(sectors < 0);
> +
> +       /* Check whether the following already set range can be merged */
> +       if ((prev + 1) < bb->count &&
> +           BB_END(p[prev]) == BB_OFFSET(p[prev + 1]) &&
> +           (BB_LEN(p[prev]) + BB_LEN(p[prev + 1])) <= BB_MAX_LEN &&
> +           BB_ACK(p[prev]) == BB_ACK(p[prev + 1])) {
> +               p[prev] = BB_MAKE(BB_OFFSET(p[prev]),
> +                                 BB_LEN(p[prev]) + BB_LEN(p[prev + 1]),
> +                                 BB_ACK(p[prev]));
> +
> +               if ((prev + 2) < bb->count)
> +                       memmove(p + prev + 1, p + prev + 2,
> +                               (bb->count -  (prev + 2)) * 8);
> +               bb->count--;
> +       }
> +
> +       if (space_desired && !badblocks_full(bb)) {
> +               s = bad.orig_start;
> +               sectors = bad.orig_len;
> +               if (retried++ < 3)
> +                       goto re_insert;
> +       }
> +
> +out:
> +       if (added) {
> +               set_changed(bb);
> +
> +               if (!acknowledged)
> +                       bb->unacked_exist = 1;
> +               else
> +                       badblocks_update_acked(bb);
>         }
>
> -       bb->changed = 1;
> -       if (!acknowledged)
> -               bb->unacked_exist = 1;
> -       else
> -               badblocks_update_acked(bb);
>         write_sequnlock_irqrestore(&bb->lock, flags);
>
> +       if (!added)
> +               rv = 1;
> +
>         return rv;
>  }
>  EXPORT_SYMBOL_GPL(badblocks_set);
> @@ -423,6 +969,115 @@ int badblocks_clear(struct badblocks *bb, sector_t s, int sectors)
>  }
>  EXPORT_SYMBOL_GPL(badblocks_clear);
>
> +/**
> + * badblocks_check() - check a given range for bad sectors
> + * @bb:                the badblocks structure that holds all badblock information
> + * @s:         sector (start) at which to check for badblocks
> + * @sectors:   number of sectors to check for badblocks
> + * @first_bad: pointer to store location of the first badblock
> + * @bad_sectors: pointer to store number of badblocks after @first_bad
> + *
> + * We can record which blocks on each device are 'bad' and so just
> + * fail those blocks, or that stripe, rather than the whole device.
> + * Entries in the bad-block table are 64bits wide.  This comprises:
> + * Length of bad-range, in sectors: 0-511 for lengths 1-512
> + * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
> + *  A 'shift' can be set so that larger blocks are tracked and
> + *  consequently larger devices can be covered.
> + * 'Acknowledged' flag - 1 bit. - the most significant bit.
> + *
> + * Locking of the bad-block table uses a seqlock so badblocks_check
> + * might need to retry if it is very unlucky.
> + * We will sometimes want to check for bad blocks in a bi_end_io function,
> + * so we use the write_seqlock_irq variant.
> + *
> + * When looking for a bad block we specify a range and want to
> + * know if any block in the range is bad.  So we binary-search
> + * to the last range that starts at-or-before the given endpoint,
> + * (or "before the sector after the target range")
> + * then see if it ends after the given start.
> + *
> + * Return:
> + *  0: there are no known bad blocks in the range
> + *  1: there are known bad block which are all acknowledged
> + * -1: there are bad blocks which have not yet been acknowledged in metadata.
> + * plus the start/length of the first bad section we overlap.
> + */
> +int badblocks_check(struct badblocks *bb, sector_t s, int sectors,
> +                       sector_t *first_bad, int *bad_sectors)
> +{
> +       int hi;
> +       int lo;
> +       u64 *p = bb->page;
> +       int rv;
> +       sector_t target = s + sectors;
> +       unsigned seq;
> +
> +       if (bb->shift > 0) {
> +               /* round the start down, and the end up */
> +               s >>= bb->shift;
> +               target += (1<<bb->shift) - 1;
> +               target >>= bb->shift;
> +               sectors = target - s;
> +       }
> +       /* 'target' is now the first block after the bad range */
> +
> +retry:
> +       seq = read_seqbegin(&bb->lock);
> +       lo = 0;
> +       rv = 0;
> +       hi = bb->count;
> +
> +       /* Binary search between lo and hi for 'target'
> +        * i.e. for the last range that starts before 'target'
> +        */
> +       /* INVARIANT: ranges before 'lo' and at-or-after 'hi'
> +        * are known not to be the last range before target.
> +        * VARIANT: hi-lo is the number of possible
> +        * ranges, and decreases until it reaches 1
> +        */
> +       while (hi - lo > 1) {
> +               int mid = (lo + hi) / 2;
> +               sector_t a = BB_OFFSET(p[mid]);
> +
> +               if (a < target)
> +                       /* This could still be the one, earlier ranges
> +                        * could not.
> +                        */
> +                       lo = mid;
> +               else
> +                       /* This and later ranges are definitely out. */
> +                       hi = mid;
> +       }
> +       /* 'lo' might be the last that started before target, but 'hi' isn't */
> +       if (hi > lo) {
> +               /* need to check all range that end after 's' to see if
> +                * any are unacknowledged.
> +                */
> +               while (lo >= 0 &&
> +                      BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > s) {
> +                       if (BB_OFFSET(p[lo]) < target) {
> +                               /* starts before the end, and finishes after
> +                                * the start, so they must overlap
> +                                */
> +                               if (rv != -1 && BB_ACK(p[lo]))
> +                                       rv = 1;
> +                               else
> +                                       rv = -1;
> +                               *first_bad = BB_OFFSET(p[lo]);
> +                               *bad_sectors = BB_LEN(p[lo]);
> +                       }
> +                       lo--;
> +               }
> +       }
> +
> +       if (read_seqretry(&bb->lock, seq))
> +               goto retry;
> +
> +       return rv;
> +}
> +EXPORT_SYMBOL_GPL(badblocks_check);
> +
>  /**
>   * ack_all_badblocks() - Acknowledge all bad blocks in a list.
>   * @bb:                the badblocks structure that holds all badblock information
> diff --git a/include/linux/badblocks.h b/include/linux/badblocks.h
> index 2426276b9bd3..b4bd997a53a4 100644
> --- a/include/linux/badblocks.h
> +++ b/include/linux/badblocks.h
> @@ -15,6 +15,7 @@
>  #define BB_OFFSET(x)   (((x) & BB_OFFSET_MASK) >> 9)
>  #define BB_LEN(x)      (((x) & BB_LEN_MASK) + 1)
>  #define BB_ACK(x)      (!!((x) & BB_ACK_MASK))
> +#define BB_END(x)      (BB_OFFSET(x) + BB_LEN(x))
>  #define BB_MAKE(a, l, ack) (((a)<<9) | ((l)-1) | ((u64)(!!(ack)) << 63))
>
>  /* Bad block numbers are stored sorted in a single page.
> @@ -41,6 +42,14 @@ struct badblocks {
>         sector_t size;          /* in sectors */
>  };
>
> +struct bad_context {
> +       sector_t        start;
> +       sector_t        len;
> +       int             ack;
> +       sector_t        orig_start;
> +       sector_t        orig_len;
> +};
> +
>  int badblocks_check(struct badblocks *bb, sector_t s, int sectors,
>                    sector_t *first_bad, int *bad_sectors);
>  int badblocks_set(struct badblocks *bb, sector_t s, int sectors,
> @@ -54,6 +63,7 @@ int badblocks_init(struct badblocks *bb, int enable);
>  void badblocks_exit(struct badblocks *bb);
>  struct device;
>  int devm_init_badblocks(struct device *dev, struct badblocks *bb);
> +
>  static inline void devm_exit_badblocks(struct device *dev, struct badblocks *bb)
>  {
>         if (bb->dev != dev) {
> @@ -63,4 +73,27 @@ static inline void devm_exit_badblocks(struct device *dev, struct badblocks *bb)
>         }
>         badblocks_exit(bb);
>  }
> +
> +static inline int badblocks_full(struct badblocks *bb)
> +{
> +       return (bb->count >= MAX_BADBLOCKS);
> +}
> +
> +static inline int badblocks_empty(struct badblocks *bb)
> +{
> +       return (bb->count == 0);
> +}
> +
> +static inline void set_changed(struct badblocks *bb)
> +{
> +       if (bb->changed != 1)
> +               bb->changed = 1;
> +}
> +
> +static inline void clear_changed(struct badblocks *bb)
> +{
> +       if (bb->changed != 0)
> +               bb->changed = 0;
> +}
> +
>  #endif
> --
> 2.26.2
>
Wols Lists Dec. 19, 2020, 8:02 p.m. UTC | #2
On 03/12/2020 17:15, Coly Li wrote:
> This patch is an initial effort to improve badblocks_set() for setting
> bad blocks range when it covers multiple already set bad ranges in the
> bad blocks table, and to do it as fast as possible.

Is this your patch, or submitted as part of the bug report?

"Heavily based on MD badblocks code from Neil Brown"

How much has this code got to do with the mdraid subsystem? Because 
badblocks in mdraid has an appalling reputation, with many people 
wanting to just rip it out.

If this code is separate from the mdraid implementation, any chance you 
can work with it, and fix that at the same time? Or make it redundant! I 
don't quite see why mdraid should need a badblocks list given modern 
disk drives.

And it's on my to-do list (if I can find the time!!!) to integrate 
dm-integrity into mdraid, at which point md badblocks should be irrelevant.

Hope I'm not being a shower of cold water, and if you want to fix all 
this, good on you, but to the extent that this is relevant to 
linux-raid, I think a lot of people will be asking "What's the point?"

Cheers,
Wol
Coly Li Dec. 20, 2020, 9:46 a.m. UTC | #3
On 12/20/20 4:02 AM, antlists wrote:
> On 03/12/2020 17:15, Coly Li wrote:
>> This patch is an initial effort to improve badblocks_set() for setting
>> bad blocks range when it covers multiple already set bad ranges in the
>> bad blocks table, and to do it as fast as possible.
> 
> Is this your patch, or submitted as part of the bug report?

This is not finished yet. The final version should go into upstream as a
fix for current badblocks routines.

> 
> "Heavily based on MD badblocks code from Neil Brown"
> 
> How much has this code got to do with the mdraid subsystem? Because
> badblocks in mdraid has an appalling reputation, with many people
> wanting to just rip it out.

This is in-memory data structure management which is almost irrelevant
to md raid or other on-disk layout.


> 
> If this code is separate from the mdraid implementation, any chance you
> can work with it, and fix that at the same time? Or make it redundant! I

This is 100% separated from md raid, as well as current badblocks code,
it is just about combine or split some [start, length] extent in a
table. The purpose of this patch is to fixing some reported issue from
users and our customers.

> don't quite see why mdraid should need a badblocks list given modern
> disk drives.
> 

For me the motivation is just people report bugs and I fix it. If there
is new code to replace it in upstream, then I just continue to maintain
the new code for our users and customers.

> And it's on my to-do list (if I can find the time!!!) to integrate
> dm-integrity into mdraid, at which point md badblocks should be irrelevant.
> 
> Hope I'm not being a shower of cold water, and if you want to fix all
> this, good on you, but to the extent that this is relevant to
> linux-raid, I think a lot of people will be asking "What's the point?"

Currently blocks/badblocks.c is used by md raid and nvdimm code, and the
badblocks table is irrelevant to any of these two subsystems.

If there will be better code for similar or better functionality, it
should be cool. For me, if the reporting bug is fixed, no difference in
my view :-)

Thanks.

Coly Li
Coly Li Dec. 20, 2020, 9:50 a.m. UTC | #4
On 12/18/20 11:25 AM, Dan Williams wrote:
> [ add Neil, original gooodguy who wrote badblocks ]
> 
> 
> On Thu, Dec 3, 2020 at 9:16 AM Coly Li <colyli@suse.de> wrote:
>>
>> Recently I received a bug report that current badblocks code does not
>> properly handle multiple ranges. For example,
>>         badblocks_set(bb, 32, 1, true);
>>         badblocks_set(bb, 34, 1, true);
>>         badblocks_set(bb, 36, 1, true);
>>         badblocks_set(bb, 32, 12, true);
>> Then indeed badblocks_show() reports,
>>         32 3
>>         36 1
>> But the expected bad blocks table should be,
>>         32 12
>> Obviously only the first 2 ranges are merged and badblocks_set() returns
>> and ignores the rest setting range.
>>
>> This behavior is improper, if the caller of badblocks_set() wants to set
>> a range of blocks into bad blocks table, all of the blocks in the range
>> should be handled even the previous part encountering failure.
>>
>> The desired way to set bad blocks range by badblocks_set() is,
>> - Set as many as blocks in the setting range into bad blocks table.
>> - Merge the bad blocks ranges and occupy as less as slots in the bad
>>   blocks table.
>> - Fast.
>>
>> Indeed the above proposal is complicated, especially with the following
>> restrictions,
>> - The setting bad blocks range can be ackknowledged or not acknowledged.


Hi Dan,

> 
> s/ackknowledged/acknowledged/
> 
> I'd run checkpatch --codespell for future versions...

Thanks for the hint. I will do it next time.


> 
>> - The bad blocks table size is limited.
>> - Memory allocation should be avoided.
>>
>> This patch is an initial effort to improve badblocks_set() for setting
>> bad blocks range when it covers multiple already set bad ranges in the
>> bad blocks table, and to do it as fast as possible.
>>
>> The basic idea of the patch is to categorize all possible bad blocks
>> range setting combinationsinto to much less simplified and more less
>> special conditions. Inside badblocks_set() there is an implicit loop
>> composed by jumping between labels 're_insert' and 'update_sectors'. No
>> matter how large the setting bad blocks range is, in every loop just a
>> minimized range from the head is handled by a pre-defined behavior from
>> one of the categorized conditions. The logic is simple and code flow is
>> manageable.
>>
>> This patch is unfinished yet, it only improves badblocks_set() and not
>> touch badblocks_clear() and badblocks_show() yet. I post it earlier
>> because this patch will be large (more then 1000 lines of change), I
>> want more people to give me comments earlier before I go too far away.
>>
> 
> I wonder if this isn't indication that the base data structure should
> be replaced... but I have not had a chance to devote deeper thought to
> this.
> 

No existing data structure changed. Even the in-memory badblocks table I
don't change it at all. I just fix the report issue by handle more
corner cases, on-disk and in-memory stuffs are untouched and consistent.


Coly Li

> 
>> The code logic is tested as user space programmer, this patch passes
>> compiling but not tested in kernel mode yet. Right now it is only for
>> RFC purpose. I will post tested patch in further versions.
>>
>> Thank you in advance for any review or comments on this patch.
>>

[snipped]
Wols Lists Dec. 22, 2020, 11:12 a.m. UTC | #5
On 20/12/2020 09:46, Coly Li wrote:
> Currently blocks/badblocks.c is used by md raid and nvdimm code, and the
> badblocks table is irrelevant to any of these two subsystems.

Good to know.
> 
> If there will be better code for similar or better functionality, it
> should be cool. For me, if the reporting bug is fixed, no difference in
> my view:-)
> 
Hopefully that will improve the badblocks handling in md. Sounds like 
that could in part be the problems we've been seeing.

If I integrate dm-integrity into md, badblocks should be mutually 
exclusive with it, but because dm-integrity is both a performance and 
disk-space hit, people might well not want to enable it.

Cheers,
Wol
diff mbox series

Patch

diff --git a/block/badblocks.c b/block/badblocks.c
index d39056630d9c..04ccae95777d 100644
--- a/block/badblocks.c
+++ b/block/badblocks.c
@@ -5,6 +5,8 @@ 
  * - Heavily based on MD badblocks code from Neil Brown
  *
  * Copyright (c) 2015, Intel Corporation.
+ *
+ * Improvement for handling multiple ranges by Coly Li <colyli@suse.de>
  */
 
 #include <linux/badblocks.h>
@@ -16,114 +18,612 @@ 
 #include <linux/types.h>
 #include <linux/slab.h>
 
-/**
- * badblocks_check() - check a given range for bad sectors
- * @bb:		the badblocks structure that holds all badblock information
- * @s:		sector (start) at which to check for badblocks
- * @sectors:	number of sectors to check for badblocks
- * @first_bad:	pointer to store location of the first badblock
- * @bad_sectors: pointer to store number of badblocks after @first_bad
+/*
+ * The purpose of badblocks set/clear is to manage bad blocks ranges which are
+ * identified by LBA addresses.
  *
- * We can record which blocks on each device are 'bad' and so just
- * fail those blocks, or that stripe, rather than the whole device.
- * Entries in the bad-block table are 64bits wide.  This comprises:
- * Length of bad-range, in sectors: 0-511 for lengths 1-512
- * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
- *  A 'shift' can be set so that larger blocks are tracked and
- *  consequently larger devices can be covered.
- * 'Acknowledged' flag - 1 bit. - the most significant bit.
+ * When the caller of badblocks_set() wants to set a range of bad blocks, the
+ * setting range can be acked or unacked. And the setting range may merge,
+ * overwrite, skip the overlaypped already set range, depends on who they are
+ * overlapped or adjacent, and the acknowledgment type of the ranges. It can be
+ * more complicated when the setting range covers multiple already set bad block
+ * ranges, with restritctions of maximum length of each bad range and the bad
+ * table space limitation.
  *
- * Locking of the bad-block table uses a seqlock so badblocks_check
- * might need to retry if it is very unlucky.
- * We will sometimes want to check for bad blocks in a bi_end_io function,
- * so we use the write_seqlock_irq variant.
+ * It is difficut and unnecessary to take care of all the possible situations,
+ * for setting a large range of bad blocks, we can handle it by dividing the
+ * large range into smaller ones when encounter overlap, max range length or
+ * bad table full conditions. Every time only a smaller piece of the bad range
+ * is handled with a limited number of conditions how it is interacted with
+ * possible overlapped or adjacent already set bad block ranges. Then the hard
+ * complicated problem can be much simpler to habndle in proper way.
  *
- * When looking for a bad block we specify a range and want to
- * know if any block in the range is bad.  So we binary-search
- * to the last range that starts at-or-before the given endpoint,
- * (or "before the sector after the target range")
- * then see if it ends after the given start.
+ * When setting a range of bad blocks to the bad table, the simplified situations
+ * to be considered are, (The already set bad blocks ranges are naming with
+ *  prefix E, and the setting bad blocks range is naming with prefix S)
+ *
+ * 1) A setting range is not overlapped or adjacent to any other already set bad
+ *    block range.
+ *                         +--------+
+ *                         |    S   |
+ *                         +--------+
+ *        +-------------+               +-------------+
+ *        |      E1     |               |      E2     |
+ *        +-------------+               +-------------+
+ *    For this situation if the bad blocks table is not full, just allocate a
+ *    free slot from the bad blocks table to mark the setting range S. The
+ *    result is,
+ *        +-------------+  +--------+   +-------------+
+ *        |      E1     |  |    S   |   |      E2     |
+ *        +-------------+  +--------+   +-------------+
+ * 2) A setting range starts exactly at a start LBA of an already set bad blocks
+ *    range.
+ * 2.1) The setting range size < already set range size
+ *        +--------+
+ *        |    S   |
+ *        +--------+
+ *        +-------------+
+ *        |      E      |
+ *        +-------------+
+ * 2.1.1) If S and E are both acked or unacked range, the setting range S can
+ *    be merged into existing bad range E. The result is,
+ *        +-------------+
+ *        |      S      |
+ *        +-------------+
+ * 2.1.2) If S is uncked setting and E is acked, the setting will be dinied, and
+ *    the result is,
+ *        +-------------+
+ *        |      E      |
+ *        +-------------+
+ * 2.1.3) If S is acked setting and E is unacked, range S can overwirte on E.
+ *    An extra slot from the bad blocks table will be allocated for S, and head
+ *    of E will move to end of the inserted range E. The result is,
+ *        +--------+----+
+ *        |    S   | E  |
+ *        +--------+----+
+ * 2.2) The setting range size == already set range size
+ * 2.2.1) If S and E are both acked or unacked range, the setting range S can
+ *    be merged into existing bad range E. The result is,
+ *        +-------------+
+ *        |      S      |
+ *        +-------------+
+ * 2.2.2) If S is uncked setting and E is acked, the setting will be dinied, and
+ *    the result is,
+ *        +-------------+
+ *        |      E      |
+ *        +-------------+
+ * 2.2.3) If S is acked setting and E is unacked, range S can overwirte all of
+      bad blocks range E. The result is,
+ *        +-------------+
+ *        |      S      |
+ *        +-------------+
+ * 2.3) The setting range size > already set range size
+ *        +-------------------+
+ *        |          S        |
+ *        +-------------------+
+ *        +-------------+
+ *        |      E      |
+ *        +-------------+
+ *    For such situation, the setting range S can be treated as two parts, the
+ *    first part (S1) is as same size as the already set range E, the second
+ *    part (S2) is thre rest of setting range.
+ *        +-------------+-----+        +-------------+       +-----+
+ *        |    S1       | S2  |        |     S1      |       | S2  |
+ *        +-------------+-----+  ===>  +-------------+       +-----+
+ *        +-------------+              +-------------+
+ *        |      E      |              |      E      |
+ *        +-------------+              +-------------+
+ *    Now we only focus on how to handle the setting range S1 and already set
+ *    range E, which are already explained in 1.2), for the rest S2 it will be
+ *    handled later in next loop.
+ * 3) A setting range starts before the start LBA of an already set bad blocks
+ *    range.
+ *        +-------------+
+ *        |      S      |
+ *        +-------------+
+ *             +-------------+
+ *             |      E      |
+ *             +-------------+
+ *    For this situation, the setting range S can be divided into two parts, the
+ *    first (S1) ends at the start LBA of already set range E, the second part
+ *    (S2) starts exactly at a start LBA of the already set range E.
+ *        +----+---------+             +----+      +---------+
+ *        | S1 |    S2   |             | S1 |      |    S2   |
+ *        +----+---------+      ===>   +----+      +---------+
+ *             +-------------+                     +-------------+
+ *             |      E      |                     |      E      |
+ *             +-------------+                     +-------------+
+ *    Now only the first part S1 should be handled in this loop, which is in
+ *    similar condition as 1). The rest part S2 has exact same start LBA address
+ *    of the already set range E, they will be handled in next loop in one of
+ *    situations in 2).
+ * 4) A setting range starts after the start LBA of an already set bad blocks
+ *    range.
+ * 4.1) If the setting range S exactly matches the tail part of already set bad
+ *    blocks range E, like the folowing chart shows,
+ *            +---------+
+ *            |   S     |
+ *            +---------+
+ *        +-------------+
+ *        |      E      |
+ *        +-------------+
+ * 4.1.1) If range S and E have same ackknowledg value (both acked or unacked),
+ *    they will be merged into one, the result is,
+ *        +-------------+
+ *        |      S      |
+ *        +-------------+
+ * 4.1.2) If range E is acked and the setting range S is unacked, the setting
+ *    request of S will be rejected, the result is,
+ *        +-------------+
+ *        |      E      |
+ *        +-------------+
+ * 4.1.3) If range E is unacked, and the setting range S is acked, then S may
+ *    overwrite the overlapped range of E, the result is,
+ *        +---+---------+
+ *        | E |    S    |
+ *        +---+---------+
+ * 4.2) If the setting range S stays in middle of an already set range E, like
+ *    the following chart shows,
+ *             +----+
+ *             | S  |
+ *             +----+
+ *        +--------------+
+ *        |       E      |
+ *        +--------------+
+ * 4.2.1) If range S and E have same ackknowledg value (both acked or unacked),
+ *    they will be merged into one, the result is,
+ *        +--------------+
+ *        |       S      |
+ *        +--------------+
+ * 4.2.2) If range E is acked and the setting range S is unacked, the setting
+ *    request of S will be rejected, the result is also,
+ *        +--------------+
+ *        |       E      |
+ *        +--------------+
+ * 4.2.3) If range E is unacked, and the setting range S is acked, then S will
+ *    inserted into middle of E and split previous range E into twp parts (E1
+ *    and E2), the result is,
+ *        +----+----+----+
+ *        | E1 |  S | E2 |
+ *        +----+----+----+
+ * 4.3) If the setting bad blocks range S is overlapped with an already set bad
+ *    blocks range E. The range S starts after the start LBA of range E, and
+ *    ends after the end LBA of range E, as the following chart shows,
+ *            +-------------------+
+ *            |          S        |
+ *            +-------------------+
+ *        +-------------+
+ *        |      E      |
+ *        +-------------+
+ *    For this situation the range S can be divided into two parts, the first
+ *    part (S1) ends at end range E, and the second part (S2) has rest range of
+ *    origin S.
+ *            +---------+---------+            +---------+      +---------+
+ *            |    S1   |    S2   |            |    S1   |      |    S2   |
+ *            +---------+---------+  ===>      +---------+      +---------+
+ *        +-------------+                  +-------------+
+ *        |      E      |                  |      E      |
+ *        +-------------+                  +-------------+
+ *     Now in this loop the setting range S1 and already set range E can be
+ *     handled as the situations 4), the rest range S2 will be handled in next
+ *     loop and ignored in this loop.
+ * 5) A setting bad blocks range S is adjacent to one or more already set bad
+ *    blocks range(s), and they are all acked or unacked range.
+ * 5.1) Front merge: If the already set bad blocks range E is before setting
+ *    range S and they are adjacent,
+ *                +------+
+ *                |  S   |
+ *                +------+
+ *        +-------+
+ *        |   E   |
+ *        +-------+
+ * 5.1.1) When total size of range S and E <= BB_MAX_LEN, and their acknowledge
+ *    values are same, the setting range S can front merges into range E. The
+ *    result is,
+ *        +--------------+
+ *        |       S      |
+ *        +--------------+
+ * 5.1.2) Otherwise these two ranges cannot merge, just insert the setting
+ *    range S right after already set range E into the bad blocks table. The
+ *    result is,
+ *        +--------+------+
+ *        |   E    |   S  |
+ *        +--------+------+
+ * 6) Special cases which above conditions cannot handle
+ * 6.1) Multiple already set ranges may merge into less ones in a full bad table
+ *        +-------------------------------------------------------+
+ *        |                           S                           |
+ *        +-------------------------------------------------------+
+ *        |<----- BB_MAX_LEN ----->|
+ *                                 +-----+     +-----+   +-----+
+ *                                 | E1  |     | E2  |   | E3  |
+ *                                 +-----+     +-----+   +-----+
+ *     In the above example, when the bad blocks table is full, inserting the
+ *     first part of setting range S will fail because no more available slot
+ *     can be allocated from bad blocks table. In this situation a proper
+ *     setting method should be go though all the setting bad blocks range and
+ *     look for chance to merge already set ranges into less ones. When there
+ *     is available slot from bad blocks table, re-try again to handle more
+ *     setting bad blocks ranges as many as possible.
+ *        +------------------------+
+ *        |          S3            |
+ *        +------------------------+
+ *        |<----- BB_MAX_LEN ----->|
+ *                                 +-----+-----+-----+---+-----+--+
+ *                                 |       S1        |     S2     |
+ *                                 +-----+-----+-----+---+-----+--+
+ *     The above chart shows although the first part (S3) cannot be inserted due
+ *     to no-space in bad blocks table, but the following E1, E2 and E3 ranges
+ *     can be merged with rest part of S into less range S1 and S2. Now there is
+ *     1 free slot in bad blocks table.
+ *        +------------------------+-----+-----+-----+---+-----+--+
+ *        |           S3           |       S1        |     S2     |
+ *        +------------------------+-----+-----+-----+---+-----+--+
+ *     Since the bad blocks table is not full anymore, re-try again for the
+ *     origin setting range S. Now the setting range S3 can be inserted into the
+ *     bad blocks table with previous freed slot from multiple ranges merge.
+ * 6.2) Front merge after overwrite
+ *    In the following example, in bad blocks table, E1 is an acked bad blocks
+ *    range and E2 is an unacked bad blocks range, therefore they are not able
+ *    to merge into a larger range. The setting bad blocks range S is acked,
+ *    therefore part of E2 can be overwritten by S.
+ *                      +--------+
+ *                      |    S   |                             acknowledged
+ *                      +--------+                         S:       1
+ *              +-------+-------------+                   E1:       1
+ *              |   E1  |    E2       |                   E2:       0
+ *              +-------+-------------+
+ *     With previosu simplified routines, after overwiting part of E2 with S,
+ *     the bad blocks table should be (E3 is remaining part of E2 which is not
+ *     overwritten by S),
+ *                                                             acknowledged
+ *              +-------+--------+----+                    S:       1
+ *              |   E1  |    S   | E3 |                   E1:       1
+ *              +-------+--------+----+                   E3:       0
+ *     The above result is correct but not perfect. Range E1 and S in the bad
+ *     blocks table are all acked, merging them into a larger one range may
+ *     occupy less bad blocks table space and make badblocks_check() faster.
+ *     Therefore in such situation, after overwiting range S, the previous range
+ *     E1 should be checked for possible front combination. Then the ideal
+ *     result can be,
+ *              +----------------+----+                        acknowledged
+ *              |       E1       | E3 |                   E1:       1
+ *              +----------------+----+                   E3:       0
+ * 6.3) Behind merge: If the already set bad blocks range E is behind the setting
+ *    range S and they are adjacent. Normally we don't need to care about this
+ *    because front merge handles this while going though range S from head to
+ *    tail, except for the tail part of range S. When the setting range S are
+ *    fully handled, all the above simplified routine doesn't check whether the
+ *    tail LBA of range S is adjacent to the next already set range and not able
+ *    to them if they are mergeable.
+ *        +------+
+ *        |  S   |
+ *        +------+
+ *               +-------+
+ *               |   E   |
+ *               +-------+
+ *    For the above special stiuation, when the setting range S are all handled
+ *    and the loop ends, an extra check is necessary for whether next already
+ *    set range E is right after S and mergeable.
+ * 6.2.1) When total size of range E and S <= BB_MAX_LEN, and their acknowledge
+ *    values are same, the setting range S can behind merges into range E. The
+ *    result is,
+ *        +--------------+
+ *        |       S      |
+ *        +--------------+
+ * 6.2.2) Otherwise these two ranges cannot merge, just insert the setting range
+ *     S infront of the already set range E in the bad blocks table. The result
+ *     is,
+ *        +------+-------+
+ *        |  S   |   E   |
+ *        +------+-------+
+ *
+ * All the above 5 simplified situations and 3 special cases may cover 99%+ of
+ * the bad block range setting conditions. Maybe there is some rare corner case
+ * is not considered and optimized, it won't hurt if badblocks_set() fails due
+ * to no space, or some ranges are not merged to save bad blocks table space.
+ *
+ * Inside badblocks_set() each loop starts by jumping to re_insert label, every
+ * time for the new loop prev_badblocks() is called to find an already set range
+ * which starts before or at current setting range. Since the setting bad blocks
+ * range is handled from head to tail, most of the cases it is unnecessary to do
+ * the binary search inside prev_badblocks(), it is possible to provide a hint
+ * to prev_badblocks() for a fast path, then the expensive binary search can be
+ * avoided. In my test with the hint to prev_badblocks(), except for the first
+ * loop, all rested calls to prev_badblocks() can go into the fast path and
+ * return correct bad blocks table index immediately.
  *
- * Return:
- *  0: there are no known bad blocks in the range
- *  1: there are known bad block which are all acknowledged
- * -1: there are bad blocks which have not yet been acknowledged in metadata.
- * plus the start/length of the first bad section we overlap.
  */
-int badblocks_check(struct badblocks *bb, sector_t s, int sectors,
-			sector_t *first_bad, int *bad_sectors)
+
+static int prev_by_hint(struct badblocks *bb, sector_t s, int hint)
 {
-	int hi;
-	int lo;
 	u64 *p = bb->page;
-	int rv;
-	sector_t target = s + sectors;
-	unsigned seq;
+	int ret = -1;
+	int hint_end = hint + 2;
 
-	if (bb->shift > 0) {
-		/* round the start down, and the end up */
-		s >>= bb->shift;
-		target += (1<<bb->shift) - 1;
-		target >>= bb->shift;
-		sectors = target - s;
+	while ((hint < hint_end) && ((hint + 1) <= bb->count) &&
+	       (BB_OFFSET(p[hint]) <= s)) {
+		if ((hint + 1) == bb->count || BB_OFFSET(p[hint + 1]) > s) {
+			ret = hint;
+			break;
+		}
+		hint++;
+	}
+
+	return ret;
+}
+
+/* find the range starts at-or-before bad->start */
+static int prev_badblocks(struct badblocks *bb, struct bad_context *bad,
+			  int hint)
+{
+	u64 *p;
+	int lo, hi;
+	sector_t s = bad->start;
+	int ret = -1;
+
+	if (!bb->count)
+		goto out;
+
+	if (hint >= 0) {
+		ret = prev_by_hint(bb, s, hint);
+		if (ret >= 0)
+			goto out;
 	}
-	/* 'target' is now the first block after the bad range */
 
-retry:
-	seq = read_seqbegin(&bb->lock);
 	lo = 0;
-	rv = 0;
 	hi = bb->count;
+	p = bb->page;
 
-	/* Binary search between lo and hi for 'target'
-	 * i.e. for the last range that starts before 'target'
-	 */
-	/* INVARIANT: ranges before 'lo' and at-or-after 'hi'
-	 * are known not to be the last range before target.
-	 * VARIANT: hi-lo is the number of possible
-	 * ranges, and decreases until it reaches 1
-	 */
 	while (hi - lo > 1) {
-		int mid = (lo + hi) / 2;
+		int mid = (lo + hi)/2;
 		sector_t a = BB_OFFSET(p[mid]);
 
-		if (a < target)
-			/* This could still be the one, earlier ranges
-			 * could not.
-			 */
+		if (a <= s)
 			lo = mid;
 		else
-			/* This and later ranges are definitely out. */
 			hi = mid;
 	}
-	/* 'lo' might be the last that started before target, but 'hi' isn't */
-	if (hi > lo) {
-		/* need to check all range that end after 's' to see if
-		 * any are unacknowledged.
+
+	if (BB_OFFSET(p[lo]) <= s)
+		ret = lo;
+out:
+	return ret;
+}
+
+static int can_merge_behind(struct badblocks *bb, struct bad_context *bad,
+			    int behind)
+{
+	u64 *p = bb->page;
+	sector_t s = bad->start;
+	sector_t sectors = bad->len;
+	int ack = bad->ack;
+
+	if ((s <= BB_OFFSET(p[behind])) &&
+	    ((s + sectors) >= BB_OFFSET(p[behind])) &&
+	    ((BB_END(p[behind]) - s) <= BB_MAX_LEN) &&
+	    BB_ACK(p[behind]) == ack)
+		return true;
+	return false;
+}
+
+static int behind_merge(struct badblocks *bb, struct bad_context *bad,
+			int behind)
+{
+	u64 *p = bb->page;
+	sector_t s = bad->start;
+	sector_t sectors = bad->len;
+	int ack = bad->ack;
+	int merged = 0;
+
+	WARN_ON(s > BB_OFFSET(p[behind]));
+	WARN_ON((s + sectors) < BB_OFFSET(p[behind]));
+
+	if (s < BB_OFFSET(p[behind])) {
+		WARN_ON((BB_LEN(p[behind]) + merged) >= BB_MAX_LEN);
+
+		merged = min_t(sector_t, sectors, BB_OFFSET(p[behind]) - s);
+		p[behind] =  BB_MAKE(s, BB_LEN(p[behind]) + merged, ack);
+	} else {
+		merged = min_t(sector_t, sectors, BB_LEN(p[behind]));
+	}
+
+	WARN_ON(merged == 0);
+
+	return merged;
+}
+
+static int can_merge_front(struct badblocks *bb, int prev,
+			   struct bad_context *bad)
+{
+	u64 *p = bb->page;
+	sector_t s = bad->start;
+	int ack = bad->ack;
+
+	if (BB_ACK(p[prev]) == ack &&
+	    (s < BB_END(p[prev]) ||
+	     (s == BB_END(p[prev]) && (BB_LEN(p[prev]) < BB_MAX_LEN))))
+		return true;
+	return false;
+}
+
+static int front_merge(struct badblocks *bb, int prev, struct bad_context *bad)
+{
+	int sectors = bad->len;
+	int s = bad->start;
+	int ack = bad->ack;
+	u64 *p = bb->page;
+	int merged = 0;
+
+	WARN_ON(s > BB_END(p[prev]));
+
+	if (s < BB_END(p[prev])) {
+		merged = min_t(sector_t, sectors, BB_END(p[prev]) - s);
+	} else {
+		merged = min_t(sector_t, sectors, BB_MAX_LEN - BB_LEN(p[prev]));
+		if ((prev + 1) < bb->count &&
+		    merged > (BB_OFFSET(p[prev + 1]) - BB_END(p[prev]))) {
+			merged = BB_OFFSET(p[prev + 1]) - BB_END(p[prev]);
+		}
+
+		p[prev] = BB_MAKE(BB_OFFSET(p[prev]), BB_LEN(p[prev]) + merged, ack);
+	}
+
+	return merged;
+}
+
+static int can_combine_front(struct badblocks *bb, int prev,
+			     struct bad_context *bad)
+{
+	u64 *p = bb->page;
+
+	if ((BB_OFFSET(p[prev]) == bad->start) && (prev > 0) &&
+	    (BB_LEN(p[prev - 1]) + BB_LEN(p[prev]) <= BB_MAX_LEN) &&
+	    (BB_ACK(p[prev - 1]) == BB_ACK(p[prev])))
+		return true;
+	return false;
+}
+
+static void front_combine(struct badblocks *bb, int prev)
+{
+	u64 *p = bb->page;
+
+	p[prev - 1] = BB_MAKE(BB_OFFSET(p[prev - 1]),
+			      BB_LEN(p[prev - 1]) + BB_LEN(p[prev]),
+			      BB_ACK(p[prev]));
+	if ((prev + 1) < bb->count)
+		memmove(p + prev, p + prev + 1, (bb->count - prev - 1) * 8);
+}
+
+static int overlap_front(struct badblocks *bb, int front,
+			  struct bad_context *bad)
+{
+	u64 *p = bb->page;
+
+	if (bad->start >= BB_OFFSET(p[front]) &&
+	    bad->start < BB_END(p[front]))
+		return true;
+	return false;
+}
+
+static int can_front_overwrite(struct badblocks *bb, int prev,
+			       struct bad_context *bad, int *extra)
+{
+	u64 *p = bb->page;
+	int len;
+
+	WARN_ON(!overlap_front(bb, prev, bad));
+
+	if (BB_ACK(p[prev]) >= bad->ack)
+		return false;
+
+	if (BB_END(p[prev]) <= (bad->start + bad->len)) {
+		len = BB_END(p[prev]) - bad->start;
+		if (BB_OFFSET(p[prev]) == bad->start)
+			*extra = 0;
+		else
+			*extra = 1;
+
+		bad->len = len;
+	} else {
+		if (BB_OFFSET(p[prev]) == bad->start)
+			*extra = 1;
+		else
+		/*
+		 * prev range will be split into two, beside the overwritten
+		 * one, an extra slot needed from bad table.
 		 */
-		while (lo >= 0 &&
-		       BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > s) {
-			if (BB_OFFSET(p[lo]) < target) {
-				/* starts before the end, and finishes after
-				 * the start, so they must overlap
-				 */
-				if (rv != -1 && BB_ACK(p[lo]))
-					rv = 1;
-				else
-					rv = -1;
-				*first_bad = BB_OFFSET(p[lo]);
-				*bad_sectors = BB_LEN(p[lo]);
-			}
-			lo--;
+			*extra = 2;
+	}
+
+	if ((bb->count + (*extra)) >= MAX_BADBLOCKS)
+		return false;
+
+	return true;
+}
+
+static int front_overwrite(struct badblocks *bb, int prev,
+			   struct bad_context *bad, int extra)
+{
+	u64 *p = bb->page;
+	int n = extra;
+	sector_t orig_end = BB_END(p[prev]);
+	int orig_ack = BB_ACK(p[prev]);
+
+	switch (extra) {
+	case 0:
+		p[prev] = BB_MAKE(BB_OFFSET(p[prev]), BB_LEN(p[prev]),
+				  bad->ack);
+		break;
+	case 1:
+		if (BB_OFFSET(p[prev]) == bad->start) {
+			p[prev] = BB_MAKE(BB_OFFSET(p[prev]),
+					  bad->len, bad->ack);
+			memmove(p + prev + 2, p + prev + 1,
+				(bb->count - prev - 1) * 8);
+			p[prev + 1] = BB_MAKE(bad->start + bad->len,
+					      orig_end - BB_END(p[prev]),
+					      orig_ack);
+		} else {
+			p[prev] = BB_MAKE(BB_OFFSET(p[prev]),
+					  BB_END(p[prev]) - bad->start,
+					  BB_ACK(p[prev]));
+			memmove(p + prev + 1 + n, p + prev + 1,
+				(bb->count - prev - 1) * 8);
+			p[prev + 1] = BB_MAKE(bad->start, bad->len, bad->ack);
 		}
+		break;
+	case 2:
+		p[prev] = BB_MAKE(BB_OFFSET(p[prev]),
+				  BB_END(p[prev]) - bad->start,
+				  BB_ACK(p[prev]));
+		memmove(p + prev + 1 + n, p + prev + 1,
+			(bb->count - prev - 1) * 8);
+		p[prev + 1] = BB_MAKE(bad->start, bad->len, bad->ack);
+		p[prev + 2] = BB_MAKE(BB_END(p[prev + 1]),
+				      orig_end - BB_END(p[prev + 1]),
+				      BB_ACK(p[prev]));
+		break;
+	default:
+		break;
 	}
 
-	if (read_seqretry(&bb->lock, seq))
-		goto retry;
+	return bad->len;
+}
 
-	return rv;
+static int overlap_behind(struct badblocks *bb, struct bad_context *bad,
+			  int behind)
+{
+	u64 *p = bb->page;
+
+	if (bad->start < BB_OFFSET(p[behind]) &&
+	    (bad->start + bad->len) > BB_OFFSET(p[behind]))
+		return true;
+
+	if (bad->start >= BB_OFFSET(p[behind]) &&
+	    bad->start < BB_END(p[behind]))
+		return true;
+
+	return false;
+}
+
+static int insert_at(struct badblocks *bb, int at, struct bad_context *bad)
+{
+	u64 *p = bb->page;
+	int sectors = bad->len;
+	int s = bad->start;
+	int ack = bad->ack;
+	int len;
+
+	WARN_ON(badblocks_full(bb));
+
+	len = min_t(sector_t, sectors, BB_MAX_LEN);
+	if (at < bb->count)
+		memmove(p + at + 1, p + at, (bb->count - at) * 8);
+	p[at] = BB_MAKE(s, len, ack);
+
+	return len;
 }
-EXPORT_SYMBOL_GPL(badblocks_check);
 
 static void badblocks_update_acked(struct badblocks *bb)
 {
@@ -164,7 +664,10 @@  int badblocks_set(struct badblocks *bb, sector_t s, int sectors,
 			int acknowledged)
 {
 	u64 *p;
-	int lo, hi;
+	struct bad_context bad;
+	int prev = -1, hint = -1;
+	int len = 0, added = 0;
+	int retried = 0, space_desired = 0;
 	int rv = 0;
 	unsigned long flags;
 
@@ -172,144 +675,187 @@  int badblocks_set(struct badblocks *bb, sector_t s, int sectors,
 		/* badblocks are disabled */
 		return 1;
 
+	if (sectors <= 0)
+		/* Invalid sectors number */
+		return 1;
+
 	if (bb->shift) {
 		/* round the start down, and the end up */
 		sector_t next = s + sectors;
 
-		s >>= bb->shift;
-		next += (1<<bb->shift) - 1;
-		next >>= bb->shift;
+		rounddown(s, bb->shift);
+		roundup(next, bb->shift);
 		sectors = next - s;
 	}
 
 	write_seqlock_irqsave(&bb->lock, flags);
 
+	bad.orig_start = s;
+	bad.orig_len = sectors;
+	bad.ack = acknowledged;
 	p = bb->page;
-	lo = 0;
-	hi = bb->count;
-	/* Find the last range that starts at-or-before 's' */
-	while (hi - lo > 1) {
-		int mid = (lo + hi) / 2;
-		sector_t a = BB_OFFSET(p[mid]);
 
-		if (a <= s)
-			lo = mid;
-		else
-			hi = mid;
+re_insert:
+	bad.start = s;
+	bad.len = sectors;
+	len = 0;
+
+	if (badblocks_empty(bb)) {
+		len = insert_at(bb, 0, &bad);
+		bb->count++;
+		added++;
+		goto update_sectors;
 	}
-	if (hi > lo && BB_OFFSET(p[lo]) > s)
-		hi = lo;
 
-	if (hi > lo) {
-		/* we found a range that might merge with the start
-		 * of our new range
-		 */
-		sector_t a = BB_OFFSET(p[lo]);
-		sector_t e = a + BB_LEN(p[lo]);
-		int ack = BB_ACK(p[lo]);
-
-		if (e >= s) {
-			/* Yes, we can merge with a previous range */
-			if (s == a && s + sectors >= e)
-				/* new range covers old */
-				ack = acknowledged;
-			else
-				ack = ack && acknowledged;
-
-			if (e < s + sectors)
-				e = s + sectors;
-			if (e - a <= BB_MAX_LEN) {
-				p[lo] = BB_MAKE(a, e-a, ack);
-				s = e;
+	prev = prev_badblocks(bb, &bad, hint);
+
+	/* start before all badblocks */
+	if (prev < 0) {
+		if (!badblocks_full(bb)) {
+			/* insert on the first */
+			if (bad.len > (BB_OFFSET(p[0]) - bad.start))
+				bad.len = BB_OFFSET(p[0]) - bad.start;
+			len = insert_at(bb, 0, &bad);
+			bb->count++;
+			added++;
+			hint = 0;
+			goto update_sectors;
+		}
+
+		/* No sapce, try to merge */
+		if (overlap_behind(bb, &bad, 0)) {
+			if (can_merge_behind(bb, &bad, 0)) {
+				len = behind_merge(bb, &bad, 0);
+				added++;
 			} else {
-				/* does not all fit in one range,
-				 * make p[lo] maximal
-				 */
-				if (BB_LEN(p[lo]) != BB_MAX_LEN)
-					p[lo] = BB_MAKE(a, BB_MAX_LEN, ack);
-				s = a + BB_MAX_LEN;
+				len = min_t(sector_t, BB_OFFSET(p[0]) - s, sectors);
+				space_desired = 1;
 			}
-			sectors = e - s;
+			hint = 0;
+			goto update_sectors;
 		}
+
+		/* no table space and give up */
+		goto out;
 	}
-	if (sectors && hi < bb->count) {
-		/* 'hi' points to the first range that starts after 's'.
-		 * Maybe we can merge with the start of that range
-		 */
-		sector_t a = BB_OFFSET(p[hi]);
-		sector_t e = a + BB_LEN(p[hi]);
-		int ack = BB_ACK(p[hi]);
-
-		if (a <= s + sectors) {
-			/* merging is possible */
-			if (e <= s + sectors) {
-				/* full overlap */
-				e = s + sectors;
-				ack = acknowledged;
-			} else
-				ack = ack && acknowledged;
-
-			a = s;
-			if (e - a <= BB_MAX_LEN) {
-				p[hi] = BB_MAKE(a, e-a, ack);
-				s = e;
-			} else {
-				p[hi] = BB_MAKE(a, BB_MAX_LEN, ack);
-				s = a + BB_MAX_LEN;
+
+	/* in case p[prev-1] can be merged with p[prev] */
+	if (can_combine_front(bb, prev, &bad)) {
+		front_combine(bb, prev);
+		bb->count--;
+		added++;
+		hint = prev - 1;
+		goto update_sectors;
+	}
+
+	if (overlap_front(bb, prev, &bad)) {
+		if (can_merge_front(bb, prev, &bad)) {
+			len = front_merge(bb, prev, &bad);
+			added++;
+			hint = prev - 1;
+		} else {
+			int extra = 0;
+
+			if (!can_front_overwrite(bb, prev, &bad, &extra)) {
+				len = min_t(sector_t, BB_END(p[prev]) - s, sectors);
+				hint = prev;
+				goto update_sectors;
+			}
+
+			len = front_overwrite(bb, prev, &bad, extra);
+			added++;
+			bb->count += extra;
+			hint = prev;
+
+			if (prev > 0 && can_combine_front(bb, prev, &bad)) {
+				front_combine(bb, prev);
+				bb->count--;
+				hint = prev - 1;
 			}
-			sectors = e - s;
-			lo = hi;
-			hi++;
 		}
+		goto update_sectors;
+	}
+
+	if (can_merge_front(bb, prev, &bad)) {
+		len = front_merge(bb, prev, &bad);
+		added++;
+		hint = prev;
+		goto update_sectors;
 	}
-	if (sectors == 0 && hi < bb->count) {
-		/* we might be able to combine lo and hi */
-		/* Note: 's' is at the end of 'lo' */
-		sector_t a = BB_OFFSET(p[hi]);
-		int lolen = BB_LEN(p[lo]);
-		int hilen = BB_LEN(p[hi]);
-		int newlen = lolen + hilen - (s - a);
-
-		if (s >= a && newlen < BB_MAX_LEN) {
-			/* yes, we can combine them */
-			int ack = BB_ACK(p[lo]) && BB_ACK(p[hi]);
-
-			p[lo] = BB_MAKE(BB_OFFSET(p[lo]), newlen, ack);
-			memmove(p + hi, p + hi + 1,
-				(bb->count - hi - 1) * 8);
-			bb->count--;
+
+	/* if no space in table, still try to merge in the covered range */
+	if (badblocks_full(bb)) {
+		/* skip the cannot-merge range */
+		if (((prev + 1) < bb->count) &&
+		    overlap_behind(bb, &bad, prev + 1) &&
+		    ((s + sectors) >= BB_END(p[prev + 1]))) {
+			len = BB_END(p[prev + 1]) - s;
+			hint = prev + 1;
+			goto update_sectors;
 		}
+
+		/* no retry any more */
+		len = sectors;
+		space_desired = 1;
+		hint = -1;
+		goto update_sectors;
 	}
-	while (sectors) {
-		/* didn't merge (it all).
-		 * Need to add a range just before 'hi'
-		 */
-		if (bb->count >= MAX_BADBLOCKS) {
-			/* No room for more */
-			rv = 1;
-			break;
-		} else {
-			int this_sectors = sectors;
 
-			memmove(p + hi + 1, p + hi,
-				(bb->count - hi) * 8);
-			bb->count++;
+	/* cannot merge and there is space in bad table */
+	if (overlap_behind(bb, &bad, prev + 1))
+		bad.len = min_t(sector_t, bad.len, BB_OFFSET(p[prev + 1]) - bad.start);
 
-			if (this_sectors > BB_MAX_LEN)
-				this_sectors = BB_MAX_LEN;
-			p[hi] = BB_MAKE(s, this_sectors, acknowledged);
-			sectors -= this_sectors;
-			s += this_sectors;
-		}
+	len = insert_at(bb, prev + 1, &bad);
+	bb->count++;
+	added++;
+	hint = prev + 1;
+
+update_sectors:
+	s += len;
+	sectors -= len;
+
+	if (sectors > 0)
+		goto re_insert;
+
+	WARN_ON(sectors < 0);
+
+	/* Check whether the following already set range can be merged */
+	if ((prev + 1) < bb->count &&
+	    BB_END(p[prev]) == BB_OFFSET(p[prev + 1]) &&
+	    (BB_LEN(p[prev]) + BB_LEN(p[prev + 1])) <= BB_MAX_LEN &&
+	    BB_ACK(p[prev]) == BB_ACK(p[prev + 1])) {
+		p[prev] = BB_MAKE(BB_OFFSET(p[prev]),
+				  BB_LEN(p[prev]) + BB_LEN(p[prev + 1]),
+				  BB_ACK(p[prev]));
+
+		if ((prev + 2) < bb->count)
+			memmove(p + prev + 1, p + prev + 2,
+				(bb->count -  (prev + 2)) * 8);
+		bb->count--;
+	}
+
+	if (space_desired && !badblocks_full(bb)) {
+		s = bad.orig_start;
+		sectors = bad.orig_len;
+		if (retried++ < 3)
+			goto re_insert;
+	}
+
+out:
+	if (added) {
+		set_changed(bb);
+
+		if (!acknowledged)
+			bb->unacked_exist = 1;
+		else
+			badblocks_update_acked(bb);
 	}
 
-	bb->changed = 1;
-	if (!acknowledged)
-		bb->unacked_exist = 1;
-	else
-		badblocks_update_acked(bb);
 	write_sequnlock_irqrestore(&bb->lock, flags);
 
+	if (!added)
+		rv = 1;
+
 	return rv;
 }
 EXPORT_SYMBOL_GPL(badblocks_set);
@@ -423,6 +969,115 @@  int badblocks_clear(struct badblocks *bb, sector_t s, int sectors)
 }
 EXPORT_SYMBOL_GPL(badblocks_clear);
 
+/**
+ * badblocks_check() - check a given range for bad sectors
+ * @bb:		the badblocks structure that holds all badblock information
+ * @s:		sector (start) at which to check for badblocks
+ * @sectors:	number of sectors to check for badblocks
+ * @first_bad:	pointer to store location of the first badblock
+ * @bad_sectors: pointer to store number of badblocks after @first_bad
+ *
+ * We can record which blocks on each device are 'bad' and so just
+ * fail those blocks, or that stripe, rather than the whole device.
+ * Entries in the bad-block table are 64bits wide.  This comprises:
+ * Length of bad-range, in sectors: 0-511 for lengths 1-512
+ * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
+ *  A 'shift' can be set so that larger blocks are tracked and
+ *  consequently larger devices can be covered.
+ * 'Acknowledged' flag - 1 bit. - the most significant bit.
+ *
+ * Locking of the bad-block table uses a seqlock so badblocks_check
+ * might need to retry if it is very unlucky.
+ * We will sometimes want to check for bad blocks in a bi_end_io function,
+ * so we use the write_seqlock_irq variant.
+ *
+ * When looking for a bad block we specify a range and want to
+ * know if any block in the range is bad.  So we binary-search
+ * to the last range that starts at-or-before the given endpoint,
+ * (or "before the sector after the target range")
+ * then see if it ends after the given start.
+ *
+ * Return:
+ *  0: there are no known bad blocks in the range
+ *  1: there are known bad block which are all acknowledged
+ * -1: there are bad blocks which have not yet been acknowledged in metadata.
+ * plus the start/length of the first bad section we overlap.
+ */
+int badblocks_check(struct badblocks *bb, sector_t s, int sectors,
+			sector_t *first_bad, int *bad_sectors)
+{
+	int hi;
+	int lo;
+	u64 *p = bb->page;
+	int rv;
+	sector_t target = s + sectors;
+	unsigned seq;
+
+	if (bb->shift > 0) {
+		/* round the start down, and the end up */
+		s >>= bb->shift;
+		target += (1<<bb->shift) - 1;
+		target >>= bb->shift;
+		sectors = target - s;
+	}
+	/* 'target' is now the first block after the bad range */
+
+retry:
+	seq = read_seqbegin(&bb->lock);
+	lo = 0;
+	rv = 0;
+	hi = bb->count;
+
+	/* Binary search between lo and hi for 'target'
+	 * i.e. for the last range that starts before 'target'
+	 */
+	/* INVARIANT: ranges before 'lo' and at-or-after 'hi'
+	 * are known not to be the last range before target.
+	 * VARIANT: hi-lo is the number of possible
+	 * ranges, and decreases until it reaches 1
+	 */
+	while (hi - lo > 1) {
+		int mid = (lo + hi) / 2;
+		sector_t a = BB_OFFSET(p[mid]);
+
+		if (a < target)
+			/* This could still be the one, earlier ranges
+			 * could not.
+			 */
+			lo = mid;
+		else
+			/* This and later ranges are definitely out. */
+			hi = mid;
+	}
+	/* 'lo' might be the last that started before target, but 'hi' isn't */
+	if (hi > lo) {
+		/* need to check all range that end after 's' to see if
+		 * any are unacknowledged.
+		 */
+		while (lo >= 0 &&
+		       BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > s) {
+			if (BB_OFFSET(p[lo]) < target) {
+				/* starts before the end, and finishes after
+				 * the start, so they must overlap
+				 */
+				if (rv != -1 && BB_ACK(p[lo]))
+					rv = 1;
+				else
+					rv = -1;
+				*first_bad = BB_OFFSET(p[lo]);
+				*bad_sectors = BB_LEN(p[lo]);
+			}
+			lo--;
+		}
+	}
+
+	if (read_seqretry(&bb->lock, seq))
+		goto retry;
+
+	return rv;
+}
+EXPORT_SYMBOL_GPL(badblocks_check);
+
 /**
  * ack_all_badblocks() - Acknowledge all bad blocks in a list.
  * @bb:		the badblocks structure that holds all badblock information
diff --git a/include/linux/badblocks.h b/include/linux/badblocks.h
index 2426276b9bd3..b4bd997a53a4 100644
--- a/include/linux/badblocks.h
+++ b/include/linux/badblocks.h
@@ -15,6 +15,7 @@ 
 #define BB_OFFSET(x)	(((x) & BB_OFFSET_MASK) >> 9)
 #define BB_LEN(x)	(((x) & BB_LEN_MASK) + 1)
 #define BB_ACK(x)	(!!((x) & BB_ACK_MASK))
+#define BB_END(x)	(BB_OFFSET(x) + BB_LEN(x))
 #define BB_MAKE(a, l, ack) (((a)<<9) | ((l)-1) | ((u64)(!!(ack)) << 63))
 
 /* Bad block numbers are stored sorted in a single page.
@@ -41,6 +42,14 @@  struct badblocks {
 	sector_t size;		/* in sectors */
 };
 
+struct bad_context {
+	sector_t	start;
+	sector_t	len;
+	int		ack;
+	sector_t	orig_start;
+	sector_t	orig_len;
+};
+
 int badblocks_check(struct badblocks *bb, sector_t s, int sectors,
 		   sector_t *first_bad, int *bad_sectors);
 int badblocks_set(struct badblocks *bb, sector_t s, int sectors,
@@ -54,6 +63,7 @@  int badblocks_init(struct badblocks *bb, int enable);
 void badblocks_exit(struct badblocks *bb);
 struct device;
 int devm_init_badblocks(struct device *dev, struct badblocks *bb);
+
 static inline void devm_exit_badblocks(struct device *dev, struct badblocks *bb)
 {
 	if (bb->dev != dev) {
@@ -63,4 +73,27 @@  static inline void devm_exit_badblocks(struct device *dev, struct badblocks *bb)
 	}
 	badblocks_exit(bb);
 }
+
+static inline int badblocks_full(struct badblocks *bb)
+{
+	return (bb->count >= MAX_BADBLOCKS);
+}
+
+static inline int badblocks_empty(struct badblocks *bb)
+{
+	return (bb->count == 0);
+}
+
+static inline void set_changed(struct badblocks *bb)
+{
+	if (bb->changed != 1)
+		bb->changed = 1;
+}
+
+static inline void clear_changed(struct badblocks *bb)
+{
+	if (bb->changed != 0)
+		bb->changed = 0;
+}
+
 #endif