mbox series

[v3,0/7] badblocks improvement for multiple bad block ranges

Message ID 20210913163643.10233-1-colyli@suse.de (mailing list archive)
Headers show
Series badblocks improvement for multiple bad block ranges | expand

Message

Coly Li Sept. 13, 2021, 4:36 p.m. UTC
This is the second effort to improve badblocks code APIs to handle
multiple ranges in bad block table.

There are 2 changes from previous version,
- Fixes 2 bugs in front_overwrite() which are detected by the user
  space testing code.
- Provide the user space testing code in last patch.

There is NO in-memory or on-disk format change in the whole series, all
existing API and data structures are consistent. This series just only
improve the code algorithm to handle more corner cases, the interfaces
are same and consistency to all existing callers (md raid and nvdimm
drivers).

The original motivation of the change is from the requirement from our
customer, that current badblocks routines don't handle multiple ranges.
For example if the bad block setting range covers multiple ranges from
bad block table, only the first two bad block ranges merged and rested
ranges are intact. The expected behavior should be all the covered
ranges to be handled.

All the patches are tested by modified user space code and the code
logic works as expected. The modified user space testing code is
provided in last patch. The testing code detects 2 defects in helper
front_overwrite() and fixed in this version.

The whole change is divided into 6 patches to make the code review more
clear and easier. If people prefer, I'd like to post a single large
patch finally after the code review accomplished.

This version is seriously tested, and so far no more defect observed.


Coly Li

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: NeilBrown <neilb@suse.de>
Cc: Richard Fan <richard.fan@suse.com>
Cc: Vishal L Verma <vishal.l.verma@intel.com>
---
Changelog:
v3: add tester Richard Fan <richard.fan@suse.com>
v2: the improved version, and with testing code.
v1: the first completed version.


Coly Li (6):
  badblocks: add more helper structure and routines in badblocks.h
  badblocks: add helper routines for badblock ranges handling
  badblocks: improvement badblocks_set() for multiple ranges handling
  badblocks: improve badblocks_clear() for multiple ranges handling
  badblocks: improve badblocks_check() for multiple ranges handling
  badblocks: switch to the improved badblock handling code
Coly Li (1):
  test: user space code to test badblocks APIs

 block/badblocks.c         | 1599 ++++++++++++++++++++++++++++++-------
 include/linux/badblocks.h |   32 +
 2 files changed, 1340 insertions(+), 291 deletions(-)

Comments

Coly Li Sept. 23, 2021, 5:59 a.m. UTC | #1
Hi all the kernel gurus, and folks in mailing lists,

This is a question about exporting 4KB+ text information via sysfs 
interface. I need advice on how to handle the problem.

Recently I work on the bad blocks API (block/badblocks.c) improvement, 
there is a sysfs file to export the bad block ranges for me raid. E.g 
for a md raid1 device, file
     /sys/block/md0/md/rd0/bad_blocks
may contain the following text content,
     64 32
    128 8
The above lines mean there are two bad block ranges, one starts at LBA 
64, length 32 sectors, another one starts at LBA 128 and length 8 
sectors. All the content is generated from the internal bad block 
records with 512 elements. In my testing the worst case only 185 from 
512 records can be displayed via the sysfs file if the LBA string is 
very long, e.g.the following content,
   17668164135030776 512
   17668164135029776 512
   17668164135028776 512
   17668164135027776 512
   ... ...
The bad block ranges stored in internal bad blocks array are correct, 
but the output message is truncated. This is the problem I encountered.

I don't see sysfs has seq_file support (correct me if I am wrong), and I 
know it is improper to transfer 4KB+ text via sysfs interface, but the 
code is here already for long time.

There are 2 ideas to fix showing up in my brain,
1) Do not fix the problem
     Normally it is rare that a storage media has 100+ bad block ranges, 
maybe in real world all the existing bad blocks information won't exceed 
the page size limitation of sysfs file.
2) Add seq_file support to sysfs interface if there is no

It is probably there is other better solution to fix. So I do want to 
get hint/advice from you.

Thanks in advance for any comment :-)

Coly Li

On 9/14/21 12:36 AM, Coly Li wrote:
> This is the second effort to improve badblocks code APIs to handle
> multiple ranges in bad block table.
>
> There are 2 changes from previous version,
> - Fixes 2 bugs in front_overwrite() which are detected by the user
>    space testing code.
> - Provide the user space testing code in last patch.
>
> There is NO in-memory or on-disk format change in the whole series, all
> existing API and data structures are consistent. This series just only
> improve the code algorithm to handle more corner cases, the interfaces
> are same and consistency to all existing callers (md raid and nvdimm
> drivers).
>
> The original motivation of the change is from the requirement from our
> customer, that current badblocks routines don't handle multiple ranges.
> For example if the bad block setting range covers multiple ranges from
> bad block table, only the first two bad block ranges merged and rested
> ranges are intact. The expected behavior should be all the covered
> ranges to be handled.
>
> All the patches are tested by modified user space code and the code
> logic works as expected. The modified user space testing code is
> provided in last patch. The testing code detects 2 defects in helper
> front_overwrite() and fixed in this version.
>
> The whole change is divided into 6 patches to make the code review more
> clear and easier. If people prefer, I'd like to post a single large
> patch finally after the code review accomplished.
>
> This version is seriously tested, and so far no more defect observed.
>
>
> Coly Li
>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: NeilBrown <neilb@suse.de>
> Cc: Richard Fan <richard.fan@suse.com>
> Cc: Vishal L Verma <vishal.l.verma@intel.com>
> ---
> Changelog:
> v3: add tester Richard Fan <richard.fan@suse.com>
> v2: the improved version, and with testing code.
> v1: the first completed version.
>
>
> Coly Li (6):
>    badblocks: add more helper structure and routines in badblocks.h
>    badblocks: add helper routines for badblock ranges handling
>    badblocks: improvement badblocks_set() for multiple ranges handling
>    badblocks: improve badblocks_clear() for multiple ranges handling
>    badblocks: improve badblocks_check() for multiple ranges handling
>    badblocks: switch to the improved badblock handling code
> Coly Li (1):
>    test: user space code to test badblocks APIs
>
>   block/badblocks.c         | 1599 ++++++++++++++++++++++++++++++-------
>   include/linux/badblocks.h |   32 +
>   2 files changed, 1340 insertions(+), 291 deletions(-)
>
Greg Kroah-Hartman Sept. 23, 2021, 6:08 a.m. UTC | #2
On Thu, Sep 23, 2021 at 01:59:28PM +0800, Coly Li wrote:
> Hi all the kernel gurus, and folks in mailing lists,
> 
> This is a question about exporting 4KB+ text information via sysfs
> interface. I need advice on how to handle the problem.

Please do not do that.  Seriously, that is not what sysfs is for, and is
an abuse of it.

sysfs is for "one value per file" and should never even get close to a
4kb limit.  If it does, you are doing something really really wrong and
should just remove that sysfs file from the system and redesign your
api.

> Recently I work on the bad blocks API (block/badblocks.c) improvement, there
> is a sysfs file to export the bad block ranges for me raid. E.g for a md
> raid1 device, file
>     /sys/block/md0/md/rd0/bad_blocks
> may contain the following text content,
>     64 32
>    128 8

Ick, again, that's not ok at all.  sysfs files should never have to be
parsed like this.

thanks,

greg k-h
Coly Li Sept. 23, 2021, 6:14 a.m. UTC | #3
On 9/23/21 2:08 PM, Greg Kroah-Hartman wrote:
> On Thu, Sep 23, 2021 at 01:59:28PM +0800, Coly Li wrote:
>> Hi all the kernel gurus, and folks in mailing lists,
>>
>> This is a question about exporting 4KB+ text information via sysfs
>> interface. I need advice on how to handle the problem.

Hi Greg,

This is the code in mainline kernel for quite long time.

> Please do not do that.  Seriously, that is not what sysfs is for, and is
> an abuse of it.
>
> sysfs is for "one value per file" and should never even get close to a
> 4kb limit.  If it does, you are doing something really really wrong and
> should just remove that sysfs file from the system and redesign your
> api.

I understand this. And what I addressed is the problem I need to fix.

The code is there for almost 10 years, I just find it during my work on 
bad blocks API fixing.


>
>> Recently I work on the bad blocks API (block/badblocks.c) improvement, there
>> is a sysfs file to export the bad block ranges for me raid. E.g for a md
>> raid1 device, file
>>      /sys/block/md0/md/rd0/bad_blocks
>> may contain the following text content,
>>      64 32
>>     128 8
> Ick, again, that's not ok at all.  sysfs files should never have to be
> parsed like this.

I cannot agree more with you. What I am asking for was ---- how to fix it ?

Thanks.

Coly Li
Greg Kroah-Hartman Sept. 23, 2021, 6:47 a.m. UTC | #4
On Thu, Sep 23, 2021 at 02:14:12PM +0800, Coly Li wrote:
> On 9/23/21 2:08 PM, Greg Kroah-Hartman wrote:
> > On Thu, Sep 23, 2021 at 01:59:28PM +0800, Coly Li wrote:
> > > Hi all the kernel gurus, and folks in mailing lists,
> > > 
> > > This is a question about exporting 4KB+ text information via sysfs
> > > interface. I need advice on how to handle the problem.
> 
> Hi Greg,
> 
> This is the code in mainline kernel for quite long time.

{sigh}

What tools rely on this?  If none, just don't add new stuff to the file
and work to create a new api instead.

> > Please do not do that.  Seriously, that is not what sysfs is for, and is
> > an abuse of it.
> > 
> > sysfs is for "one value per file" and should never even get close to a
> > 4kb limit.  If it does, you are doing something really really wrong and
> > should just remove that sysfs file from the system and redesign your
> > api.
> 
> I understand this. And what I addressed is the problem I need to fix.
> 
> The code is there for almost 10 years, I just find it during my work on bad
> blocks API fixing.
> 
> 
> > > Recently I work on the bad blocks API (block/badblocks.c) improvement, there
> > > is a sysfs file to export the bad block ranges for me raid. E.g for a md
> > > raid1 device, file
> > >      /sys/block/md0/md/rd0/bad_blocks
> > > may contain the following text content,
> > >      64 32
> > >     128 8
> > Ick, again, that's not ok at all.  sysfs files should never have to be
> > parsed like this.
> 
> I cannot agree more with you. What I am asking for was ---- how to fix it ?

Best solution, come up with a new api.

Worst solution, you are stuck with the existing file and I can show you
the "way out" of dealing with files larger than 4kb in sysfs that a
number of other apis are being forced to do as they grow over time.

But ideally, just drop ths api and make a new one please.

thanks,

greg k-h
Coly Li Sept. 23, 2021, 7:13 a.m. UTC | #5
On 9/23/21 2:47 PM, Greg Kroah-Hartman wrote:
> On Thu, Sep 23, 2021 at 02:14:12PM +0800, Coly Li wrote:
>> On 9/23/21 2:08 PM, Greg Kroah-Hartman wrote:
>>> On Thu, Sep 23, 2021 at 01:59:28PM +0800, Coly Li wrote:
>>>> Hi all the kernel gurus, and folks in mailing lists,
>>>>
>>>> This is a question about exporting 4KB+ text information via sysfs
>>>> interface. I need advice on how to handle the problem.
>> Hi Greg,
>>
>> This is the code in mainline kernel for quite long time.
> {sigh}
>
> What tools rely on this?  If none, just don't add new stuff to the file
> and work to create a new api instead.

At least I know mdadm uses this sysfs interface for md raid component 
disks monitoring. It has been in mdadm for around 5 years.

Yes you are right, let it be for existing sysfs interface to avoid 
breaking things.

>>> Please do not do that.  Seriously, that is not what sysfs is for, and is
>>> an abuse of it.
>>>
>>> sysfs is for "one value per file" and should never even get close to a
>>> 4kb limit.  If it does, you are doing something really really wrong and
>>> should just remove that sysfs file from the system and redesign your
>>> api.
>> I understand this. And what I addressed is the problem I need to fix.
>>
>> The code is there for almost 10 years, I just find it during my work on bad
>> blocks API fixing.
>>
>>
>>>> Recently I work on the bad blocks API (block/badblocks.c) improvement, there
>>>> is a sysfs file to export the bad block ranges for me raid. E.g for a md
>>>> raid1 device, file
>>>>       /sys/block/md0/md/rd0/bad_blocks
>>>> may contain the following text content,
>>>>       64 32
>>>>      128 8
>>> Ick, again, that's not ok at all.  sysfs files should never have to be
>>> parsed like this.
>> I cannot agree more with you. What I am asking for was ---- how to fix it ?
> Best solution, come up with a new api.
>
> Worst solution, you are stuck with the existing file and I can show you
> the "way out" of dealing with files larger than 4kb in sysfs that a
> number of other apis are being forced to do as they grow over time.

Now I am sure you are very probably not willing to accept the patches, 
even I know how to do that :-)

>
> But ideally, just drop ths api and make a new one please.

OK, then I leave the existing things as what they are, avoid to make 
them worse.

Thanks for your response.

Coly Li
Hannes Reinecke Sept. 23, 2021, 9:40 a.m. UTC | #6
On 9/23/21 7:59 AM, Coly Li wrote:
> Hi all the kernel gurus, and folks in mailing lists,
> 
> This is a question about exporting 4KB+ text information via sysfs
> interface. I need advice on how to handle the problem.
> 
> Recently I work on the bad blocks API (block/badblocks.c) improvement,
> there is a sysfs file to export the bad block ranges for me raid. E.g
> for a md raid1 device, file
>     /sys/block/md0/md/rd0/bad_blocks
> may contain the following text content,
>     64 32
>    128 8
> The above lines mean there are two bad block ranges, one starts at LBA
> 64, length 32 sectors, another one starts at LBA 128 and length 8
> sectors. All the content is generated from the internal bad block
> records with 512 elements. In my testing the worst case only 185 from
> 512 records can be displayed via the sysfs file if the LBA string is
> very long, e.g.the following content,
>   17668164135030776 512
>   17668164135029776 512
>   17668164135028776 512
>   17668164135027776 512
>   ... ...
> The bad block ranges stored in internal bad blocks array are correct,
> but the output message is truncated. This is the problem I encountered.
> 
> I don't see sysfs has seq_file support (correct me if I am wrong), and I
> know it is improper to transfer 4KB+ text via sysfs interface, but the
> code is here already for long time.
> 
> There are 2 ideas to fix showing up in my brain,
> 1) Do not fix the problem
>     Normally it is rare that a storage media has 100+ bad block ranges,
> maybe in real world all the existing bad blocks information won't exceed
> the page size limitation of sysfs file.
> 2) Add seq_file support to sysfs interface if there is no
> 
> It is probably there is other better solution to fix. So I do want to
> get hint/advice from you.
> 
> Thanks in advance for any comment :-)
> 
> Coly Li
> 
> On 9/14/21 12:36 AM, Coly Li wrote:
>> This is the second effort to improve badblocks code APIs to handle
>> multiple ranges in bad block table.
>>
>> There are 2 changes from previous version,
>> - Fixes 2 bugs in front_overwrite() which are detected by the user
>>    space testing code.
>> - Provide the user space testing code in last patch.
>>
>> There is NO in-memory or on-disk format change in the whole series, all
>> existing API and data structures are consistent. This series just only
>> improve the code algorithm to handle more corner cases, the interfaces
>> are same and consistency to all existing callers (md raid and nvdimm
>> drivers).
>>
>> The original motivation of the change is from the requirement from our
>> customer, that current badblocks routines don't handle multiple ranges.
>> For example if the bad block setting range covers multiple ranges from
>> bad block table, only the first two bad block ranges merged and rested
>> ranges are intact. The expected behavior should be all the covered
>> ranges to be handled.
>>
>> All the patches are tested by modified user space code and the code
>> logic works as expected. The modified user space testing code is
>> provided in last patch. The testing code detects 2 defects in helper
>> front_overwrite() and fixed in this version.
>>
>> The whole change is divided into 6 patches to make the code review more
>> clear and easier. If people prefer, I'd like to post a single large
>> patch finally after the code review accomplished.
>>
>> This version is seriously tested, and so far no more defect observed.
>>
>>
>> Coly Li
>>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Hannes Reinecke <hare@suse.de>
>> Cc: Jens Axboe <axboe@kernel.dk>
>> Cc: NeilBrown <neilb@suse.de>
>> Cc: Richard Fan <richard.fan@suse.com>
>> Cc: Vishal L Verma <vishal.l.verma@intel.com>
>> ---
>> Changelog:
>> v3: add tester Richard Fan <richard.fan@suse.com>
>> v2: the improved version, and with testing code.
>> v1: the first completed version.
>>
>>
>> Coly Li (6):
>>    badblocks: add more helper structure and routines in badblocks.h
>>    badblocks: add helper routines for badblock ranges handling
>>    badblocks: improvement badblocks_set() for multiple ranges handling
>>    badblocks: improve badblocks_clear() for multiple ranges handling
>>    badblocks: improve badblocks_check() for multiple ranges handling
>>    badblocks: switch to the improved badblock handling code
>> Coly Li (1):
>>    test: user space code to test badblocks APIs
>>
>>   block/badblocks.c         | 1599 ++++++++++++++++++++++++++++++-------
>>   include/linux/badblocks.h |   32 +
>>   2 files changed, 1340 insertions(+), 291 deletions(-)
>>
> 

Please have a look at the patchset 'start switching sysfs attributes to
expose the seq_file' from Christoph Hellwig on linux-block; that seems
to be the approach you are looking for.


Cheers,

Hannes
Greg Kroah-Hartman Sept. 23, 2021, 9:57 a.m. UTC | #7
On Thu, Sep 23, 2021 at 11:40:30AM +0200, Hannes Reinecke wrote:
> On 9/23/21 7:59 AM, Coly Li wrote:
> > Hi all the kernel gurus, and folks in mailing lists,
> > 
> > This is a question about exporting 4KB+ text information via sysfs
> > interface. I need advice on how to handle the problem.
> > 
> > Recently I work on the bad blocks API (block/badblocks.c) improvement,
> > there is a sysfs file to export the bad block ranges for me raid. E.g
> > for a md raid1 device, file
> >     /sys/block/md0/md/rd0/bad_blocks
> > may contain the following text content,
> >     64 32
> >    128 8
> > The above lines mean there are two bad block ranges, one starts at LBA
> > 64, length 32 sectors, another one starts at LBA 128 and length 8
> > sectors. All the content is generated from the internal bad block
> > records with 512 elements. In my testing the worst case only 185 from
> > 512 records can be displayed via the sysfs file if the LBA string is
> > very long, e.g.the following content,
> >   17668164135030776 512
> >   17668164135029776 512
> >   17668164135028776 512
> >   17668164135027776 512
> >   ... ...
> > The bad block ranges stored in internal bad blocks array are correct,
> > but the output message is truncated. This is the problem I encountered.
> > 
> > I don't see sysfs has seq_file support (correct me if I am wrong), and I
> > know it is improper to transfer 4KB+ text via sysfs interface, but the
> > code is here already for long time.
> > 
> > There are 2 ideas to fix showing up in my brain,
> > 1) Do not fix the problem
> >     Normally it is rare that a storage media has 100+ bad block ranges,
> > maybe in real world all the existing bad blocks information won't exceed
> > the page size limitation of sysfs file.
> > 2) Add seq_file support to sysfs interface if there is no
> > 
> > It is probably there is other better solution to fix. So I do want to
> > get hint/advice from you.
> > 
> > Thanks in advance for any comment :-)
> > 
> > Coly Li
> > 
> > On 9/14/21 12:36 AM, Coly Li wrote:
> >> This is the second effort to improve badblocks code APIs to handle
> >> multiple ranges in bad block table.
> >>
> >> There are 2 changes from previous version,
> >> - Fixes 2 bugs in front_overwrite() which are detected by the user
> >>    space testing code.
> >> - Provide the user space testing code in last patch.
> >>
> >> There is NO in-memory or on-disk format change in the whole series, all
> >> existing API and data structures are consistent. This series just only
> >> improve the code algorithm to handle more corner cases, the interfaces
> >> are same and consistency to all existing callers (md raid and nvdimm
> >> drivers).
> >>
> >> The original motivation of the change is from the requirement from our
> >> customer, that current badblocks routines don't handle multiple ranges.
> >> For example if the bad block setting range covers multiple ranges from
> >> bad block table, only the first two bad block ranges merged and rested
> >> ranges are intact. The expected behavior should be all the covered
> >> ranges to be handled.
> >>
> >> All the patches are tested by modified user space code and the code
> >> logic works as expected. The modified user space testing code is
> >> provided in last patch. The testing code detects 2 defects in helper
> >> front_overwrite() and fixed in this version.
> >>
> >> The whole change is divided into 6 patches to make the code review more
> >> clear and easier. If people prefer, I'd like to post a single large
> >> patch finally after the code review accomplished.
> >>
> >> This version is seriously tested, and so far no more defect observed.
> >>
> >>
> >> Coly Li
> >>
> >> Cc: Dan Williams <dan.j.williams@intel.com>
> >> Cc: Hannes Reinecke <hare@suse.de>
> >> Cc: Jens Axboe <axboe@kernel.dk>
> >> Cc: NeilBrown <neilb@suse.de>
> >> Cc: Richard Fan <richard.fan@suse.com>
> >> Cc: Vishal L Verma <vishal.l.verma@intel.com>
> >> ---
> >> Changelog:
> >> v3: add tester Richard Fan <richard.fan@suse.com>
> >> v2: the improved version, and with testing code.
> >> v1: the first completed version.
> >>
> >>
> >> Coly Li (6):
> >>    badblocks: add more helper structure and routines in badblocks.h
> >>    badblocks: add helper routines for badblock ranges handling
> >>    badblocks: improvement badblocks_set() for multiple ranges handling
> >>    badblocks: improve badblocks_clear() for multiple ranges handling
> >>    badblocks: improve badblocks_check() for multiple ranges handling
> >>    badblocks: switch to the improved badblock handling code
> >> Coly Li (1):
> >>    test: user space code to test badblocks APIs
> >>
> >>   block/badblocks.c         | 1599 ++++++++++++++++++++++++++++++-------
> >>   include/linux/badblocks.h |   32 +
> >>   2 files changed, 1340 insertions(+), 291 deletions(-)
> >>
> > 
> 
> Please have a look at the patchset 'start switching sysfs attributes to
> expose the seq_file' from Christoph Hellwig on linux-block; that seems
> to be the approach you are looking for.

No, I rejected the seq_file api for sysfs files, as it encourages abuse
like this, sorry.

greg k-h
NeilBrown Sept. 23, 2021, 10:09 a.m. UTC | #8
On Thu, 23 Sep 2021, Coly Li wrote:
> Hi all the kernel gurus, and folks in mailing lists,
> 
> This is a question about exporting 4KB+ text information via sysfs 
> interface. I need advice on how to handle the problem.

Why do you think there is a problem?
As documented in Documentation/admin-guide/md.rst, the truncation at 1
page is expected and by design.

The "unacknowledge-bad-blocks" file is the important one that is needed
for correct behaviour.  Being able to read a single block is sufficient,
though being able to read more than one could provide better performance
in some cases.

The "bad-blocks" file primarily exist to provide visibility into the
state of the system - useful during development.  It can be written to
to add bad blocks.  I never *needs* to be read from.

The authoritative source of information about the set of bad blocks is
the on-disk data the can be and should be read directly...

Except that mdadm does.  That was a mistake.  check_for_cleared_bb() is
wrong.  I wonder why it was added.  The commit message doesn't give any
justification.

NeilBrown


> 
> Recently I work on the bad blocks API (block/badblocks.c) improvement, 
> there is a sysfs file to export the bad block ranges for me raid. E.g 
> for a md raid1 device, file
>      /sys/block/md0/md/rd0/bad_blocks
> may contain the following text content,
>      64 32
>     128 8
> The above lines mean there are two bad block ranges, one starts at LBA 
> 64, length 32 sectors, another one starts at LBA 128 and length 8 
> sectors. All the content is generated from the internal bad block 
> records with 512 elements. In my testing the worst case only 185 from 
> 512 records can be displayed via the sysfs file if the LBA string is 
> very long, e.g.the following content,
>    17668164135030776 512
>    17668164135029776 512
>    17668164135028776 512
>    17668164135027776 512
>    ... ...
> The bad block ranges stored in internal bad blocks array are correct, 
> but the output message is truncated. This is the problem I encountered.
> 
> I don't see sysfs has seq_file support (correct me if I am wrong), and I 
> know it is improper to transfer 4KB+ text via sysfs interface, but the 
> code is here already for long time.
> 
> There are 2 ideas to fix showing up in my brain,
> 1) Do not fix the problem
>      Normally it is rare that a storage media has 100+ bad block ranges, 
> maybe in real world all the existing bad blocks information won't exceed 
> the page size limitation of sysfs file.
> 2) Add seq_file support to sysfs interface if there is no
> 
> It is probably there is other better solution to fix. So I do want to 
> get hint/advice from you.
> 
> Thanks in advance for any comment :-)
> 
> Coly Li
> 
> On 9/14/21 12:36 AM, Coly Li wrote:
> > This is the second effort to improve badblocks code APIs to handle
> > multiple ranges in bad block table.
> >
> > There are 2 changes from previous version,
> > - Fixes 2 bugs in front_overwrite() which are detected by the user
> >    space testing code.
> > - Provide the user space testing code in last patch.
> >
> > There is NO in-memory or on-disk format change in the whole series, all
> > existing API and data structures are consistent. This series just only
> > improve the code algorithm to handle more corner cases, the interfaces
> > are same and consistency to all existing callers (md raid and nvdimm
> > drivers).
> >
> > The original motivation of the change is from the requirement from our
> > customer, that current badblocks routines don't handle multiple ranges.
> > For example if the bad block setting range covers multiple ranges from
> > bad block table, only the first two bad block ranges merged and rested
> > ranges are intact. The expected behavior should be all the covered
> > ranges to be handled.
> >
> > All the patches are tested by modified user space code and the code
> > logic works as expected. The modified user space testing code is
> > provided in last patch. The testing code detects 2 defects in helper
> > front_overwrite() and fixed in this version.
> >
> > The whole change is divided into 6 patches to make the code review more
> > clear and easier. If people prefer, I'd like to post a single large
> > patch finally after the code review accomplished.
> >
> > This version is seriously tested, and so far no more defect observed.
> >
> >
> > Coly Li
> >
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: Hannes Reinecke <hare@suse.de>
> > Cc: Jens Axboe <axboe@kernel.dk>
> > Cc: NeilBrown <neilb@suse.de>
> > Cc: Richard Fan <richard.fan@suse.com>
> > Cc: Vishal L Verma <vishal.l.verma@intel.com>
> > ---
> > Changelog:
> > v3: add tester Richard Fan <richard.fan@suse.com>
> > v2: the improved version, and with testing code.
> > v1: the first completed version.
> >
> >
> > Coly Li (6):
> >    badblocks: add more helper structure and routines in badblocks.h
> >    badblocks: add helper routines for badblock ranges handling
> >    badblocks: improvement badblocks_set() for multiple ranges handling
> >    badblocks: improve badblocks_clear() for multiple ranges handling
> >    badblocks: improve badblocks_check() for multiple ranges handling
> >    badblocks: switch to the improved badblock handling code
> > Coly Li (1):
> >    test: user space code to test badblocks APIs
> >
> >   block/badblocks.c         | 1599 ++++++++++++++++++++++++++++++-------
> >   include/linux/badblocks.h |   32 +
> >   2 files changed, 1340 insertions(+), 291 deletions(-)
> >
> 
>
Greg Kroah-Hartman Sept. 23, 2021, 10:39 a.m. UTC | #9
On Thu, Sep 23, 2021 at 08:09:21PM +1000, NeilBrown wrote:
> On Thu, 23 Sep 2021, Coly Li wrote:
> > Hi all the kernel gurus, and folks in mailing lists,
> > 
> > This is a question about exporting 4KB+ text information via sysfs 
> > interface. I need advice on how to handle the problem.
> 
> Why do you think there is a problem?
> As documented in Documentation/admin-guide/md.rst, the truncation at 1
> page is expected and by design.

Ah, shouldn't that info also be in Documentation/ABI/ so that others can
easily find it?

thanks,

greg k-h
Coly Li Sept. 23, 2021, 12:55 p.m. UTC | #10
On 9/23/21 6:09 PM, NeilBrown wrote:
> On Thu, 23 Sep 2021, Coly Li wrote:
>> Hi all the kernel gurus, and folks in mailing lists,
>>
>> This is a question about exporting 4KB+ text information via sysfs
>> interface. I need advice on how to handle the problem.

Hi Neil,

> Why do you think there is a problem?
> As documented in Documentation/admin-guide/md.rst, the truncation at 1
> page is expected and by design.

Oh, thanks for letting me know this. Yes this is as-designed, so I will 
not worry more about this.

>
> The "unacknowledge-bad-blocks" file is the important one that is needed
> for correct behaviour.  Being able to read a single block is sufficient,
> though being able to read more than one could provide better performance
> in some cases.
>
> The "bad-blocks" file primarily exist to provide visibility into the
> state of the system - useful during development.  It can be written to
> to add bad blocks.  I never *needs* to be read from.

Thanks for the hint.

>
> The authoritative source of information about the set of bad blocks is
> the on-disk data the can be and should be read directly...

The reply is informative. It is more clear for me.

Coly Li

>
> Except that mdadm does.  That was a mistake.  check_for_cleared_bb() is
> wrong.  I wonder why it was added.  The commit message doesn't give any
> justification.
>
> NeilBrown
>
>
>> Recently I work on the bad blocks API (block/badblocks.c) improvement,
>> there is a sysfs file to export the bad block ranges for me raid. E.g
>> for a md raid1 device, file
>>       /sys/block/md0/md/rd0/bad_blocks
>> may contain the following text content,
>>       64 32
>>      128 8
>> The above lines mean there are two bad block ranges, one starts at LBA
>> 64, length 32 sectors, another one starts at LBA 128 and length 8
>> sectors. All the content is generated from the internal bad block
>> records with 512 elements. In my testing the worst case only 185 from
>> 512 records can be displayed via the sysfs file if the LBA string is
>> very long, e.g.the following content,
>>     17668164135030776 512
>>     17668164135029776 512
>>     17668164135028776 512
>>     17668164135027776 512
>>     ... ...
>> The bad block ranges stored in internal bad blocks array are correct,
>> but the output message is truncated. This is the problem I encountered.
>>
>> I don't see sysfs has seq_file support (correct me if I am wrong), and I
>> know it is improper to transfer 4KB+ text via sysfs interface, but the
>> code is here already for long time.
>>
>> There are 2 ideas to fix showing up in my brain,
>> 1) Do not fix the problem
>>       Normally it is rare that a storage media has 100+ bad block ranges,
>> maybe in real world all the existing bad blocks information won't exceed
>> the page size limitation of sysfs file.
>> 2) Add seq_file support to sysfs interface if there is no
>>
>> It is probably there is other better solution to fix. So I do want to
>> get hint/advice from you.
>>
>> Thanks in advance for any comment :-)
>>
>> Coly Li
>>
>> On 9/14/21 12:36 AM, Coly Li wrote:
>>> This is the second effort to improve badblocks code APIs to handle
>>> multiple ranges in bad block table.
>>>
>>> There are 2 changes from previous version,
>>> - Fixes 2 bugs in front_overwrite() which are detected by the user
>>>     space testing code.
>>> - Provide the user space testing code in last patch.
>>>
>>> There is NO in-memory or on-disk format change in the whole series, all
>>> existing API and data structures are consistent. This series just only
>>> improve the code algorithm to handle more corner cases, the interfaces
>>> are same and consistency to all existing callers (md raid and nvdimm
>>> drivers).
>>>
>>> The original motivation of the change is from the requirement from our
>>> customer, that current badblocks routines don't handle multiple ranges.
>>> For example if the bad block setting range covers multiple ranges from
>>> bad block table, only the first two bad block ranges merged and rested
>>> ranges are intact. The expected behavior should be all the covered
>>> ranges to be handled.
>>>
>>> All the patches are tested by modified user space code and the code
>>> logic works as expected. The modified user space testing code is
>>> provided in last patch. The testing code detects 2 defects in helper
>>> front_overwrite() and fixed in this version.
>>>
>>> The whole change is divided into 6 patches to make the code review more
>>> clear and easier. If people prefer, I'd like to post a single large
>>> patch finally after the code review accomplished.
>>>
>>> This version is seriously tested, and so far no more defect observed.
>>>
>>>
>>> Coly Li
>>>
>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>> Cc: Hannes Reinecke <hare@suse.de>
>>> Cc: Jens Axboe <axboe@kernel.dk>
>>> Cc: NeilBrown <neilb@suse.de>
>>> Cc: Richard Fan <richard.fan@suse.com>
>>> Cc: Vishal L Verma <vishal.l.verma@intel.com>
>>> ---
>>> Changelog:
>>> v3: add tester Richard Fan <richard.fan@suse.com>
>>> v2: the improved version, and with testing code.
>>> v1: the first completed version.
>>>
>>>
>>> Coly Li (6):
>>>     badblocks: add more helper structure and routines in badblocks.h
>>>     badblocks: add helper routines for badblock ranges handling
>>>     badblocks: improvement badblocks_set() for multiple ranges handling
>>>     badblocks: improve badblocks_clear() for multiple ranges handling
>>>     badblocks: improve badblocks_check() for multiple ranges handling
>>>     badblocks: switch to the improved badblock handling code
>>> Coly Li (1):
>>>     test: user space code to test badblocks APIs
>>>
>>>    block/badblocks.c         | 1599 ++++++++++++++++++++++++++++++-------
>>>    include/linux/badblocks.h |   32 +
>>>    2 files changed, 1340 insertions(+), 291 deletions(-)
>>>
>>