diff mbox

[RESEND] gpmi-nand: Handle ECC Errors in erased pages

Message ID 1456059126-25469-1-git-send-email-mpa@pengutronix.de (mailing list archive)
State New, archived
Headers show

Commit Message

Markus Pargmann Feb. 21, 2016, 12:52 p.m. UTC
ECC is only calculated for written pages. As erased pages are not
actively written the ECC is always invalid. For this purpose the
Hardware BCH unit is able to check for erased pages and does not raise
an ECC error in this case. This behaviour can be influenced using the
BCH_MODE register which sets the number of allowed bitflips in an erased
page. Unfortunately the unit is not capable of fixing the bitflips in
memory.

To avoid complete software checks for erased pages, we can simply check
buffers with uncorrectable ECC errors because we know that any erased
page with errors is uncorrectable by the BCH unit.

This patch adds the generic nand_check_erased_ecc_chunk() to gpmi-nand
to correct erased pages. To have the valid data in the buffer before
using them, this patch moves the read_page_swap_end() call before the
ECC status checking for-loop.

Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
---
 drivers/mtd/nand/gpmi-nand/gpmi-nand.c | 49 ++++++++++++++++++++++++++++++----
 1 file changed, 44 insertions(+), 5 deletions(-)

Comments

Boris BREZILLON Feb. 24, 2016, 1:57 p.m. UTC | #1
Hi Markus,

On Sun, 21 Feb 2016 13:52:06 +0100
Markus Pargmann <mpa@pengutronix.de> wrote:

> ECC is only calculated for written pages. As erased pages are not
> actively written the ECC is always invalid. For this purpose the
> Hardware BCH unit is able to check for erased pages and does not raise
> an ECC error in this case. This behaviour can be influenced using the
> BCH_MODE register which sets the number of allowed bitflips in an erased
> page. Unfortunately the unit is not capable of fixing the bitflips in
> memory.
> 
> To avoid complete software checks for erased pages, we can simply check
> buffers with uncorrectable ECC errors because we know that any erased
> page with errors is uncorrectable by the BCH unit.
> 
> This patch adds the generic nand_check_erased_ecc_chunk() to gpmi-nand
> to correct erased pages. To have the valid data in the buffer before
> using them, this patch moves the read_page_swap_end() call before the
> ECC status checking for-loop.
> 
> Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
> ---
>  drivers/mtd/nand/gpmi-nand/gpmi-nand.c | 49 ++++++++++++++++++++++++++++++----
>  1 file changed, 44 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/mtd/nand/gpmi-nand/gpmi-nand.c b/drivers/mtd/nand/gpmi-nand/gpmi-nand.c
> index 235ddcb58f39..ce5a21252102 100644
> --- a/drivers/mtd/nand/gpmi-nand/gpmi-nand.c
> +++ b/drivers/mtd/nand/gpmi-nand/gpmi-nand.c
> @@ -1035,14 +1035,58 @@ static int gpmi_ecc_read_page(struct mtd_info *mtd, struct nand_chip *chip,
>  	/* Loop over status bytes, accumulating ECC status. */
>  	status = auxiliary_virt + nfc_geo->auxiliary_status_offset;
>  
> +	read_page_swap_end(this, buf, nfc_geo->payload_size,
> +			   this->payload_virt, this->payload_phys,
> +			   nfc_geo->payload_size,
> +			   payload_virt, payload_phys);
> +
>  	for (i = 0; i < nfc_geo->ecc_chunk_count; i++, status++) {
>  		if ((*status == STATUS_GOOD) || (*status == STATUS_ERASED))
>  			continue;
>  
>  		if (*status == STATUS_UNCORRECTABLE) {
> +			int flips;
> +
> +			/*
> +			 * The ECC hardware has an uncorrectable ECC status
> +			 * code in case we have bitflips in an erased page. As
> +			 * nothing was written into this subpage the ECC is
> +			 * obviously wrong and we can not trust it. We assume
> +			 * at this point that we are reading an erased page and
> +			 * try to correct the bitflips in buffer up to
> +			 * ecc_strength bitflips. If this is a page with random
> +			 * data, we exceed this number of bitflips and have a
> +			 * ECC failure. Otherwise we use the corrected buffer.
> +			 */
> +			if (i == 0) {
> +				/* The first block includes metadata */
> +				flips = nand_check_erased_ecc_chunk(
> +						buf + i * nfc_geo->ecc_chunk_size,
> +						nfc_geo->ecc_chunk_size,
> +						NULL, 0,
> +						auxiliary_virt,
> +						nfc_geo->metadata_size,
> +						nfc_geo->ecc_strength);
> +			} else {
> +				flips = nand_check_erased_ecc_chunk(
> +						buf + i * nfc_geo->ecc_chunk_size,
> +						nfc_geo->ecc_chunk_size,
> +						NULL, 0,
> +						NULL, 0,
> +						nfc_geo->ecc_strength);
> +			}

You're not checking ECC bytes. I know it's a bit more complicated to
implement, but as Brian stated several times, you should always check
ECC bytes along with data bytes when testing for an erased chunk.

Indeed, it might appear that the user really programmed ff on the page,
and in this case you don't want to consider the page as erased.
In this case, the ECC bytes will be different from ff, and you'll
be able to identify it by checking those ECC bytes.

Best Regards,

Boris
Markus Pargmann April 11, 2016, 6:34 a.m. UTC | #2
Hi Boris,

On Wednesday 24 February 2016 14:57:50 Boris Brezillon wrote:
> Hi Markus,
> 
> On Sun, 21 Feb 2016 13:52:06 +0100
> Markus Pargmann <mpa@pengutronix.de> wrote:
> 
> > ECC is only calculated for written pages. As erased pages are not
> > actively written the ECC is always invalid. For this purpose the
> > Hardware BCH unit is able to check for erased pages and does not raise
> > an ECC error in this case. This behaviour can be influenced using the
> > BCH_MODE register which sets the number of allowed bitflips in an erased
> > page. Unfortunately the unit is not capable of fixing the bitflips in
> > memory.
> > 
> > To avoid complete software checks for erased pages, we can simply check
> > buffers with uncorrectable ECC errors because we know that any erased
> > page with errors is uncorrectable by the BCH unit.
> > 
> > This patch adds the generic nand_check_erased_ecc_chunk() to gpmi-nand
> > to correct erased pages. To have the valid data in the buffer before
> > using them, this patch moves the read_page_swap_end() call before the
> > ECC status checking for-loop.
> > 
> > Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
> > ---
> >  drivers/mtd/nand/gpmi-nand/gpmi-nand.c | 49 ++++++++++++++++++++++++++++++----
> >  1 file changed, 44 insertions(+), 5 deletions(-)
> > 
> > diff --git a/drivers/mtd/nand/gpmi-nand/gpmi-nand.c b/drivers/mtd/nand/gpmi-nand/gpmi-nand.c
> > index 235ddcb58f39..ce5a21252102 100644
> > --- a/drivers/mtd/nand/gpmi-nand/gpmi-nand.c
> > +++ b/drivers/mtd/nand/gpmi-nand/gpmi-nand.c
> > @@ -1035,14 +1035,58 @@ static int gpmi_ecc_read_page(struct mtd_info *mtd, struct nand_chip *chip,
> >  	/* Loop over status bytes, accumulating ECC status. */
> >  	status = auxiliary_virt + nfc_geo->auxiliary_status_offset;
> >  
> > +	read_page_swap_end(this, buf, nfc_geo->payload_size,
> > +			   this->payload_virt, this->payload_phys,
> > +			   nfc_geo->payload_size,
> > +			   payload_virt, payload_phys);
> > +
> >  	for (i = 0; i < nfc_geo->ecc_chunk_count; i++, status++) {
> >  		if ((*status == STATUS_GOOD) || (*status == STATUS_ERASED))
> >  			continue;
> >  
> >  		if (*status == STATUS_UNCORRECTABLE) {
> > +			int flips;
> > +
> > +			/*
> > +			 * The ECC hardware has an uncorrectable ECC status
> > +			 * code in case we have bitflips in an erased page. As
> > +			 * nothing was written into this subpage the ECC is
> > +			 * obviously wrong and we can not trust it. We assume
> > +			 * at this point that we are reading an erased page and
> > +			 * try to correct the bitflips in buffer up to
> > +			 * ecc_strength bitflips. If this is a page with random
> > +			 * data, we exceed this number of bitflips and have a
> > +			 * ECC failure. Otherwise we use the corrected buffer.
> > +			 */
> > +			if (i == 0) {
> > +				/* The first block includes metadata */
> > +				flips = nand_check_erased_ecc_chunk(
> > +						buf + i * nfc_geo->ecc_chunk_size,
> > +						nfc_geo->ecc_chunk_size,
> > +						NULL, 0,
> > +						auxiliary_virt,
> > +						nfc_geo->metadata_size,
> > +						nfc_geo->ecc_strength);
> > +			} else {
> > +				flips = nand_check_erased_ecc_chunk(
> > +						buf + i * nfc_geo->ecc_chunk_size,
> > +						nfc_geo->ecc_chunk_size,
> > +						NULL, 0,
> > +						NULL, 0,
> > +						nfc_geo->ecc_strength);
> > +			}
> 
> You're not checking ECC bytes. I know it's a bit more complicated to
> implement, but as Brian stated several times, you should always check
> ECC bytes along with data bytes when testing for an erased chunk.
> 
> Indeed, it might appear that the user really programmed ff on the page,
> and in this case you don't want to consider the page as erased.
> In this case, the ECC bytes will be different from ff, and you'll
> be able to identify it by checking those ECC bytes.

Thanks for the feedback. Talking with a coworker about this we may have found a
better approach to this that is less complicated to implement. The hardware
unit allows us to set a bitflip threshold for erased pages. The ECC unit
creates an ECC error only if the number of bitflips exceeds this threshold, but
it does not correct these. So the idea is to change the patch so that we set
pages, that are signaled by the ECC as erased, to 0xff completely without
checking. So the ECC will do all the work and we completely trust in its
abilities to do it correctly.

I will send a modified patch as soon as I have some time for this.

Best Regards,

Markus
Boris BREZILLON April 11, 2016, 7:29 a.m. UTC | #3
On Mon, 11 Apr 2016 08:34:35 +0200
Markus Pargmann <mpa@pengutronix.de> wrote:

> Hi Boris,
> 
> On Wednesday 24 February 2016 14:57:50 Boris Brezillon wrote:
> > Hi Markus,
> > 
> > On Sun, 21 Feb 2016 13:52:06 +0100
> > Markus Pargmann <mpa@pengutronix.de> wrote:
> > 
> > > ECC is only calculated for written pages. As erased pages are not
> > > actively written the ECC is always invalid. For this purpose the
> > > Hardware BCH unit is able to check for erased pages and does not raise
> > > an ECC error in this case. This behaviour can be influenced using the
> > > BCH_MODE register which sets the number of allowed bitflips in an erased
> > > page. Unfortunately the unit is not capable of fixing the bitflips in
> > > memory.
> > > 
> > > To avoid complete software checks for erased pages, we can simply check
> > > buffers with uncorrectable ECC errors because we know that any erased
> > > page with errors is uncorrectable by the BCH unit.
> > > 
> > > This patch adds the generic nand_check_erased_ecc_chunk() to gpmi-nand
> > > to correct erased pages. To have the valid data in the buffer before
> > > using them, this patch moves the read_page_swap_end() call before the
> > > ECC status checking for-loop.
> > > 
> > > Signed-off-by: Markus Pargmann <mpa@pengutronix.de>
> > > ---
> > >  drivers/mtd/nand/gpmi-nand/gpmi-nand.c | 49 ++++++++++++++++++++++++++++++----
> > >  1 file changed, 44 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/drivers/mtd/nand/gpmi-nand/gpmi-nand.c b/drivers/mtd/nand/gpmi-nand/gpmi-nand.c
> > > index 235ddcb58f39..ce5a21252102 100644
> > > --- a/drivers/mtd/nand/gpmi-nand/gpmi-nand.c
> > > +++ b/drivers/mtd/nand/gpmi-nand/gpmi-nand.c
> > > @@ -1035,14 +1035,58 @@ static int gpmi_ecc_read_page(struct mtd_info *mtd, struct nand_chip *chip,
> > >  	/* Loop over status bytes, accumulating ECC status. */
> > >  	status = auxiliary_virt + nfc_geo->auxiliary_status_offset;
> > >  
> > > +	read_page_swap_end(this, buf, nfc_geo->payload_size,
> > > +			   this->payload_virt, this->payload_phys,
> > > +			   nfc_geo->payload_size,
> > > +			   payload_virt, payload_phys);
> > > +
> > >  	for (i = 0; i < nfc_geo->ecc_chunk_count; i++, status++) {
> > >  		if ((*status == STATUS_GOOD) || (*status == STATUS_ERASED))
> > >  			continue;
> > >  
> > >  		if (*status == STATUS_UNCORRECTABLE) {
> > > +			int flips;
> > > +
> > > +			/*
> > > +			 * The ECC hardware has an uncorrectable ECC status
> > > +			 * code in case we have bitflips in an erased page. As
> > > +			 * nothing was written into this subpage the ECC is
> > > +			 * obviously wrong and we can not trust it. We assume
> > > +			 * at this point that we are reading an erased page and
> > > +			 * try to correct the bitflips in buffer up to
> > > +			 * ecc_strength bitflips. If this is a page with random
> > > +			 * data, we exceed this number of bitflips and have a
> > > +			 * ECC failure. Otherwise we use the corrected buffer.
> > > +			 */
> > > +			if (i == 0) {
> > > +				/* The first block includes metadata */
> > > +				flips = nand_check_erased_ecc_chunk(
> > > +						buf + i * nfc_geo->ecc_chunk_size,
> > > +						nfc_geo->ecc_chunk_size,
> > > +						NULL, 0,
> > > +						auxiliary_virt,
> > > +						nfc_geo->metadata_size,
> > > +						nfc_geo->ecc_strength);
> > > +			} else {
> > > +				flips = nand_check_erased_ecc_chunk(
> > > +						buf + i * nfc_geo->ecc_chunk_size,
> > > +						nfc_geo->ecc_chunk_size,
> > > +						NULL, 0,
> > > +						NULL, 0,
> > > +						nfc_geo->ecc_strength);
> > > +			}
> > 
> > You're not checking ECC bytes. I know it's a bit more complicated to
> > implement, but as Brian stated several times, you should always check
> > ECC bytes along with data bytes when testing for an erased chunk.
> > 
> > Indeed, it might appear that the user really programmed ff on the page,
> > and in this case you don't want to consider the page as erased.
> > In this case, the ECC bytes will be different from ff, and you'll
> > be able to identify it by checking those ECC bytes.
> 
> Thanks for the feedback. Talking with a coworker about this we may have found a
> better approach to this that is less complicated to implement. The hardware
> unit allows us to set a bitflip threshold for erased pages. The ECC unit
> creates an ECC error only if the number of bitflips exceeds this threshold, but
> it does not correct these. So the idea is to change the patch so that we set
> pages, that are signaled by the ECC as erased, to 0xff completely without
> checking. So the ECC will do all the work and we completely trust in its
> abilities to do it correctly.

Sounds good.
Han Xu April 12, 2016, 10:39 p.m. UTC | #4

Boris BREZILLON April 12, 2016, 10:51 p.m. UTC | #5
On Tue, 12 Apr 2016 22:39:08 +0000
Han Xu <han.xu@nxp.com> wrote:

> > Thanks for the feedback. Talking with a coworker about this we may have found a
> > better approach to this that is less complicated to implement. The hardware
> > unit allows us to set a bitflip threshold for erased pages. The ECC unit
> > creates an ECC error only if the number of bitflips exceeds this threshold, but
> > it does not correct these. So the idea is to change the patch so that we set
> > pages, that are signaled by the ECC as erased, to 0xff completely without
> > checking. So the ECC will do all the work and we completely trust in its
> > abilities to do it correctly.
> 
> Sounds good.
> 
> some new platforms with new gpmi controller could check the count of 0 bits in page,
> refer to my patch https://patchwork.ozlabs.org/patch/587124/
> 
> But for all legacy platforms, IMO, considering bitflip is rare case, set threshold to 0 and
> only check the uncorrectable branch and then correct data sounds better. Setting threshold
> and correcting all erased page may highly impact the performance.

Indeed, bitflips in erased pages is not so common, and penalizing the
likely case (erased pages without any bitflips) doesn't look like a good
idea in the end.

You can still implement this check in software. You can have a look at
nand_check_erased_ecc_chunk() [1] if you need an example, but you'll
have to adapt it because your controller does not guarantees that ECC
bits for a given chunk are byte aligned :-/

[1]http://lxr.free-electrons.com/source/drivers/mtd/nand/nand_base.c#L1200
Markus Pargmann April 15, 2016, 7:55 a.m. UTC | #6
On Wednesday 13 April 2016 00:51:55 Boris Brezillon wrote:
> On Tue, 12 Apr 2016 22:39:08 +0000
> Han Xu <han.xu@nxp.com> wrote:
> 
> > > Thanks for the feedback. Talking with a coworker about this we may have found a
> > > better approach to this that is less complicated to implement. The hardware
> > > unit allows us to set a bitflip threshold for erased pages. The ECC unit
> > > creates an ECC error only if the number of bitflips exceeds this threshold, but
> > > it does not correct these. So the idea is to change the patch so that we set
> > > pages, that are signaled by the ECC as erased, to 0xff completely without
> > > checking. So the ECC will do all the work and we completely trust in its
> > > abilities to do it correctly.
> > 
> > Sounds good.
> > 
> > some new platforms with new gpmi controller could check the count of 0 bits in page,
> > refer to my patch https://patchwork.ozlabs.org/patch/587124/
> > 
> > But for all legacy platforms, IMO, considering bitflip is rare case, set threshold to 0 and
> > only check the uncorrectable branch and then correct data sounds better. Setting threshold
> > and correcting all erased page may highly impact the performance.
> 
> Indeed, bitflips in erased pages is not so common, and penalizing the
> likely case (erased pages without any bitflips) doesn't look like a good
> idea in the end.

Are erased pages really read that often? I am not sure how UBI handles
this, does it read every page before writing?

> 
> You can still implement this check in software. You can have a look at
> nand_check_erased_ecc_chunk() [1] if you need an example, but you'll
> have to adapt it because your controller does not guarantees that ECC
> bits for a given chunk are byte aligned :-/

Yes I used this function in the patch. The issue is that I am not quite
sure yet where to find the raw ECC data (without rereading the page).
The reference manual is not extremely clear about that, ecc data may be
in the 'auxilliary data' but I am not sure that it really is available
somewhere.

Best Regards,

Markus
Boris BREZILLON April 15, 2016, 8:35 a.m. UTC | #7
Hi Markus,

On Fri, 15 Apr 2016 09:55:45 +0200
Markus Pargmann <mpa@pengutronix.de> wrote:

> On Wednesday 13 April 2016 00:51:55 Boris Brezillon wrote:
> > On Tue, 12 Apr 2016 22:39:08 +0000
> > Han Xu <han.xu@nxp.com> wrote:
> > 
> > > > Thanks for the feedback. Talking with a coworker about this we may have found a
> > > > better approach to this that is less complicated to implement. The hardware
> > > > unit allows us to set a bitflip threshold for erased pages. The ECC unit
> > > > creates an ECC error only if the number of bitflips exceeds this threshold, but
> > > > it does not correct these. So the idea is to change the patch so that we set
> > > > pages, that are signaled by the ECC as erased, to 0xff completely without
> > > > checking. So the ECC will do all the work and we completely trust in its
> > > > abilities to do it correctly.
> > > 
> > > Sounds good.
> > > 
> > > some new platforms with new gpmi controller could check the count of 0 bits in page,
> > > refer to my patch https://patchwork.ozlabs.org/patch/587124/
> > > 
> > > But for all legacy platforms, IMO, considering bitflip is rare case, set threshold to 0 and
> > > only check the uncorrectable branch and then correct data sounds better. Setting threshold
> > > and correcting all erased page may highly impact the performance.
> > 
> > Indeed, bitflips in erased pages is not so common, and penalizing the
> > likely case (erased pages without any bitflips) doesn't look like a good
> > idea in the end.
> 
> Are erased pages really read that often?

Yes, it's not unusual to have those "empty pages?" checks (added Artem
and Richard to get a confirmation). AFAIR, UBIFS check for empty pages
in its journal heads after an unclean unmount (which happens quite
often) to make sure there's no corruption.

> I am not sure how UBI handles
> this, does it read every page before writing?

Nope, or maybe it does when you activate some extra checks.

> 
> > 
> > You can still implement this check in software. You can have a look at
> > nand_check_erased_ecc_chunk() [1] if you need an example, but you'll
> > have to adapt it because your controller does not guarantees that ECC
> > bits for a given chunk are byte aligned :-/
> 
> Yes I used this function in the patch. The issue is that I am not quite
> sure yet where to find the raw ECC data (without rereading the page).
> The reference manual is not extremely clear about that, ecc data may be
> in the 'auxilliary data' but I am not sure that it really is available
> somewhere.

AFAIR (and I'm not sure since it was a long time ago), you don't have
direct access to ECC bytes with the GPMI engine. If that's the case,
you'll have to read the ECC bytes manually (moving the page pointer
using ->cmdfunc(NAND_CMD_RNDOUT, column, -1)), which is a pain with
this engine, because ECC bytes are not guaranteed to be byte aligned
(see gpmi ->read_page_raw() implementation).
Once you've retrieved ECC bytes (or bits in this case), for each ECC
chunk, you can use the nand_check_erased_ecc_chunk() function (just make
sure you're padding the last ECC byte of each chunk with ones so that
bitflips cannot be reported on this section).

Best Regards,

Boris
Markus Pargmann April 15, 2016, 9:35 a.m. UTC | #8
Hi Boris,

On Friday 15 April 2016 10:35:08 Boris Brezillon wrote:
> Hi Markus,
> 
> On Fri, 15 Apr 2016 09:55:45 +0200
> Markus Pargmann <mpa@pengutronix.de> wrote:
> 
> > On Wednesday 13 April 2016 00:51:55 Boris Brezillon wrote:
> > > On Tue, 12 Apr 2016 22:39:08 +0000
> > > Han Xu <han.xu@nxp.com> wrote:
> > > 
> > > > > Thanks for the feedback. Talking with a coworker about this we may have found a
> > > > > better approach to this that is less complicated to implement. The hardware
> > > > > unit allows us to set a bitflip threshold for erased pages. The ECC unit
> > > > > creates an ECC error only if the number of bitflips exceeds this threshold, but
> > > > > it does not correct these. So the idea is to change the patch so that we set
> > > > > pages, that are signaled by the ECC as erased, to 0xff completely without
> > > > > checking. So the ECC will do all the work and we completely trust in its
> > > > > abilities to do it correctly.
> > > > 
> > > > Sounds good.
> > > > 
> > > > some new platforms with new gpmi controller could check the count of 0 bits in page,
> > > > refer to my patch https://patchwork.ozlabs.org/patch/587124/
> > > > 
> > > > But for all legacy platforms, IMO, considering bitflip is rare case, set threshold to 0 and
> > > > only check the uncorrectable branch and then correct data sounds better. Setting threshold
> > > > and correcting all erased page may highly impact the performance.
> > > 
> > > Indeed, bitflips in erased pages is not so common, and penalizing the
> > > likely case (erased pages without any bitflips) doesn't look like a good
> > > idea in the end.
> > 
> > Are erased pages really read that often?
> 
> Yes, it's not unusual to have those "empty pages?" checks (added Artem
> and Richard to get a confirmation). AFAIR, UBIFS check for empty pages
> in its journal heads after an unclean unmount (which happens quite
> often) to make sure there's no corruption.
> 
> > I am not sure how UBI handles
> > this, does it read every page before writing?
> 
> Nope, or maybe it does when you activate some extra checks.
> 
> > 
> > > 
> > > You can still implement this check in software. You can have a look at
> > > nand_check_erased_ecc_chunk() [1] if you need an example, but you'll
> > > have to adapt it because your controller does not guarantees that ECC
> > > bits for a given chunk are byte aligned :-/
> > 
> > Yes I used this function in the patch. The issue is that I am not quite
> > sure yet where to find the raw ECC data (without rereading the page).
> > The reference manual is not extremely clear about that, ecc data may be
> > in the 'auxilliary data' but I am not sure that it really is available
> > somewhere.
> 
> AFAIR (and I'm not sure since it was a long time ago), you don't have
> direct access to ECC bytes with the GPMI engine. If that's the case,
> you'll have to read the ECC bytes manually (moving the page pointer
> using ->cmdfunc(NAND_CMD_RNDOUT, column, -1)), which is a pain with
> this engine, because ECC bytes are not guaranteed to be byte aligned
> (see gpmi ->read_page_raw() implementation).
> Once you've retrieved ECC bytes (or bits in this case), for each ECC
> chunk, you can use the nand_check_erased_ecc_chunk() function (just make
> sure you're padding the last ECC byte of each chunk with ones so that
> bitflips cannot be reported on this section).

Thanks for the information. So I understand that this approach is the
preferred one to avoid any performance issues for normal operation.

I actually won't be able to fix this patch accordingly for some time. If
anyone else needs this earlier, feel free to implement it.

Best Regards,

Markus
Boris BREZILLON April 15, 2016, 9:39 a.m. UTC | #9
Hi Markus,

On Fri, 15 Apr 2016 11:35:07 +0200
Markus Pargmann <mpa@pengutronix.de> wrote:

> Hi Boris,
> 
> On Friday 15 April 2016 10:35:08 Boris Brezillon wrote:
> > Hi Markus,
> > 
> > On Fri, 15 Apr 2016 09:55:45 +0200
> > Markus Pargmann <mpa@pengutronix.de> wrote:
> > 
> > > On Wednesday 13 April 2016 00:51:55 Boris Brezillon wrote:
> > > > On Tue, 12 Apr 2016 22:39:08 +0000
> > > > Han Xu <han.xu@nxp.com> wrote:
> > > > 
> > > > > > Thanks for the feedback. Talking with a coworker about this we may have found a
> > > > > > better approach to this that is less complicated to implement. The hardware
> > > > > > unit allows us to set a bitflip threshold for erased pages. The ECC unit
> > > > > > creates an ECC error only if the number of bitflips exceeds this threshold, but
> > > > > > it does not correct these. So the idea is to change the patch so that we set
> > > > > > pages, that are signaled by the ECC as erased, to 0xff completely without
> > > > > > checking. So the ECC will do all the work and we completely trust in its
> > > > > > abilities to do it correctly.
> > > > > 
> > > > > Sounds good.
> > > > > 
> > > > > some new platforms with new gpmi controller could check the count of 0 bits in page,
> > > > > refer to my patch https://patchwork.ozlabs.org/patch/587124/
> > > > > 
> > > > > But for all legacy platforms, IMO, considering bitflip is rare case, set threshold to 0 and
> > > > > only check the uncorrectable branch and then correct data sounds better. Setting threshold
> > > > > and correcting all erased page may highly impact the performance.
> > > > 
> > > > Indeed, bitflips in erased pages is not so common, and penalizing the
> > > > likely case (erased pages without any bitflips) doesn't look like a good
> > > > idea in the end.
> > > 
> > > Are erased pages really read that often?
> > 
> > Yes, it's not unusual to have those "empty pages?" checks (added Artem
> > and Richard to get a confirmation). AFAIR, UBIFS check for empty pages
> > in its journal heads after an unclean unmount (which happens quite
> > often) to make sure there's no corruption.
> > 
> > > I am not sure how UBI handles
> > > this, does it read every page before writing?
> > 
> > Nope, or maybe it does when you activate some extra checks.
> > 
> > > 
> > > > 
> > > > You can still implement this check in software. You can have a look at
> > > > nand_check_erased_ecc_chunk() [1] if you need an example, but you'll
> > > > have to adapt it because your controller does not guarantees that ECC
> > > > bits for a given chunk are byte aligned :-/
> > > 
> > > Yes I used this function in the patch. The issue is that I am not quite
> > > sure yet where to find the raw ECC data (without rereading the page).
> > > The reference manual is not extremely clear about that, ecc data may be
> > > in the 'auxilliary data' but I am not sure that it really is available
> > > somewhere.
> > 
> > AFAIR (and I'm not sure since it was a long time ago), you don't have
> > direct access to ECC bytes with the GPMI engine. If that's the case,
> > you'll have to read the ECC bytes manually (moving the page pointer
> > using ->cmdfunc(NAND_CMD_RNDOUT, column, -1)), which is a pain with
> > this engine, because ECC bytes are not guaranteed to be byte aligned
> > (see gpmi ->read_page_raw() implementation).
> > Once you've retrieved ECC bytes (or bits in this case), for each ECC
> > chunk, you can use the nand_check_erased_ecc_chunk() function (just make
> > sure you're padding the last ECC byte of each chunk with ones so that
> > bitflips cannot be reported on this section).
> 
> Thanks for the information. So I understand that this approach is the
> preferred one to avoid any performance issues for normal operation.
> 
> I actually won't be able to fix this patch accordingly for some time. If
> anyone else needs this earlier, feel free to implement it.

I just did [1] (it applies on top of your patch), but maybe you
can test it (I don't have any imx platforms right now) ;).

If these changes work, feel free to squash them into your previous
patch.

Thanks,

Boris

[1]http://code.bulix.org/bq6yyh-96549
Markus Pargmann April 15, 2016, 12:03 p.m. UTC | #10
Hi Boris,

On Friday 15 April 2016 11:39:06 Boris Brezillon wrote:
> Hi Markus,
> 
> On Fri, 15 Apr 2016 11:35:07 +0200
> Markus Pargmann <mpa@pengutronix.de> wrote:
> 
> > Hi Boris,
> > 
> > On Friday 15 April 2016 10:35:08 Boris Brezillon wrote:
> > > Hi Markus,
> > > 
> > > On Fri, 15 Apr 2016 09:55:45 +0200
> > > Markus Pargmann <mpa@pengutronix.de> wrote:
> > > 
> > > > On Wednesday 13 April 2016 00:51:55 Boris Brezillon wrote:
> > > > > On Tue, 12 Apr 2016 22:39:08 +0000
> > > > > Han Xu <han.xu@nxp.com> wrote:
> > > > > 
> > > > > > > Thanks for the feedback. Talking with a coworker about this we may have found a
> > > > > > > better approach to this that is less complicated to implement. The hardware
> > > > > > > unit allows us to set a bitflip threshold for erased pages. The ECC unit
> > > > > > > creates an ECC error only if the number of bitflips exceeds this threshold, but
> > > > > > > it does not correct these. So the idea is to change the patch so that we set
> > > > > > > pages, that are signaled by the ECC as erased, to 0xff completely without
> > > > > > > checking. So the ECC will do all the work and we completely trust in its
> > > > > > > abilities to do it correctly.
> > > > > > 
> > > > > > Sounds good.
> > > > > > 
> > > > > > some new platforms with new gpmi controller could check the count of 0 bits in page,
> > > > > > refer to my patch https://patchwork.ozlabs.org/patch/587124/
> > > > > > 
> > > > > > But for all legacy platforms, IMO, considering bitflip is rare case, set threshold to 0 and
> > > > > > only check the uncorrectable branch and then correct data sounds better. Setting threshold
> > > > > > and correcting all erased page may highly impact the performance.
> > > > > 
> > > > > Indeed, bitflips in erased pages is not so common, and penalizing the
> > > > > likely case (erased pages without any bitflips) doesn't look like a good
> > > > > idea in the end.
> > > > 
> > > > Are erased pages really read that often?
> > > 
> > > Yes, it's not unusual to have those "empty pages?" checks (added Artem
> > > and Richard to get a confirmation). AFAIR, UBIFS check for empty pages
> > > in its journal heads after an unclean unmount (which happens quite
> > > often) to make sure there's no corruption.
> > > 
> > > > I am not sure how UBI handles
> > > > this, does it read every page before writing?
> > > 
> > > Nope, or maybe it does when you activate some extra checks.
> > > 
> > > > 
> > > > > 
> > > > > You can still implement this check in software. You can have a look at
> > > > > nand_check_erased_ecc_chunk() [1] if you need an example, but you'll
> > > > > have to adapt it because your controller does not guarantees that ECC
> > > > > bits for a given chunk are byte aligned :-/
> > > > 
> > > > Yes I used this function in the patch. The issue is that I am not quite
> > > > sure yet where to find the raw ECC data (without rereading the page).
> > > > The reference manual is not extremely clear about that, ecc data may be
> > > > in the 'auxilliary data' but I am not sure that it really is available
> > > > somewhere.
> > > 
> > > AFAIR (and I'm not sure since it was a long time ago), you don't have
> > > direct access to ECC bytes with the GPMI engine. If that's the case,
> > > you'll have to read the ECC bytes manually (moving the page pointer
> > > using ->cmdfunc(NAND_CMD_RNDOUT, column, -1)), which is a pain with
> > > this engine, because ECC bytes are not guaranteed to be byte aligned
> > > (see gpmi ->read_page_raw() implementation).
> > > Once you've retrieved ECC bytes (or bits in this case), for each ECC
> > > chunk, you can use the nand_check_erased_ecc_chunk() function (just make
> > > sure you're padding the last ECC byte of each chunk with ones so that
> > > bitflips cannot be reported on this section).
> > 
> > Thanks for the information. So I understand that this approach is the
> > preferred one to avoid any performance issues for normal operation.
> > 
> > I actually won't be able to fix this patch accordingly for some time. If
> > anyone else needs this earlier, feel free to implement it.
> 
> I just did [1] (it applies on top of your patch), but maybe you
> can test it (I don't have any imx platforms right now) ;).

Great, thank you :). I just tested the patch and it works for me. The
erased page bitflips are still detected and fixed. I will send a new
version then.

Best Regards,

Markus

> 
> If these changes work, feel free to squash them into your previous
> patch.
> 
> Thanks,
> 
> Boris
> 
> [1]http://code.bulix.org/bq6yyh-96549
> 
>
Han Xu April 15, 2016, 3:33 p.m. UTC | #11

Boris BREZILLON April 15, 2016, 3:40 p.m. UTC | #12
On Fri, 15 Apr 2016 15:33:07 +0000
Han Xu <han.xu@nxp.com> wrote:

> >
> > I just did [1] (it applies on top of your patch), but maybe you
> > can test it (I don't have any imx platforms right now) ;).
> 
> Great, thank you :). I just tested the patch and it works for me. The
> erased page bitflips are still detected and fixed. I will send a new
> version then.
> 
> Hi Markus,
> 
> Could you please share how to verify the patch, in other words, how to reproduce the
> UBIFS corruption issue consistently. Thanks.

I should really post a new version of the nandflipbits tool [1].
That's clearly the easiest solution to verify that your bitflip
correction is reliable.


[1]http://lists.infradead.org/pipermail/linux-mtd/2014-November/056634.html
Markus Pargmann April 18, 2016, 10:07 a.m. UTC | #13
On Friday 15 April 2016 17:40:31 Boris Brezillon wrote:
> On Fri, 15 Apr 2016 15:33:07 +0000
> Han Xu <han.xu@nxp.com> wrote:
> 
> > >
> > > I just did [1] (it applies on top of your patch), but maybe you
> > > can test it (I don't have any imx platforms right now) ;).
> > 
> > Great, thank you :). I just tested the patch and it works for me. The
> > erased page bitflips are still detected and fixed. I will send a new
> > version then.
> > 
> > Hi Markus,
> > 
> > Could you please share how to verify the patch, in other words, how to reproduce the
> > UBIFS corruption issue consistently. Thanks.

I used a simple bashscript with 'nandwrite --noecc' that writes a number
of zeroes in every sub-page. For written pages the ECC will handle the
flips. For erased pages we can test the algorithm that handles the
erased page bitflips.

> 
> I should really post a new version of the nandflipbits tool [1].
> That's clearly the easiest solution to verify that your bitflip
> correction is reliable.

Seems really useful.

Best Regards,

Markus

> 
> 
> [1]http://lists.infradead.org/pipermail/linux-mtd/2014-November/056634.html
> 
>
diff mbox

Patch

diff --git a/drivers/mtd/nand/gpmi-nand/gpmi-nand.c b/drivers/mtd/nand/gpmi-nand/gpmi-nand.c
index 235ddcb58f39..ce5a21252102 100644
--- a/drivers/mtd/nand/gpmi-nand/gpmi-nand.c
+++ b/drivers/mtd/nand/gpmi-nand/gpmi-nand.c
@@ -1035,14 +1035,58 @@  static int gpmi_ecc_read_page(struct mtd_info *mtd, struct nand_chip *chip,
 	/* Loop over status bytes, accumulating ECC status. */
 	status = auxiliary_virt + nfc_geo->auxiliary_status_offset;
 
+	read_page_swap_end(this, buf, nfc_geo->payload_size,
+			   this->payload_virt, this->payload_phys,
+			   nfc_geo->payload_size,
+			   payload_virt, payload_phys);
+
 	for (i = 0; i < nfc_geo->ecc_chunk_count; i++, status++) {
 		if ((*status == STATUS_GOOD) || (*status == STATUS_ERASED))
 			continue;
 
 		if (*status == STATUS_UNCORRECTABLE) {
+			int flips;
+
+			/*
+			 * The ECC hardware has an uncorrectable ECC status
+			 * code in case we have bitflips in an erased page. As
+			 * nothing was written into this subpage the ECC is
+			 * obviously wrong and we can not trust it. We assume
+			 * at this point that we are reading an erased page and
+			 * try to correct the bitflips in buffer up to
+			 * ecc_strength bitflips. If this is a page with random
+			 * data, we exceed this number of bitflips and have a
+			 * ECC failure. Otherwise we use the corrected buffer.
+			 */
+			if (i == 0) {
+				/* The first block includes metadata */
+				flips = nand_check_erased_ecc_chunk(
+						buf + i * nfc_geo->ecc_chunk_size,
+						nfc_geo->ecc_chunk_size,
+						NULL, 0,
+						auxiliary_virt,
+						nfc_geo->metadata_size,
+						nfc_geo->ecc_strength);
+			} else {
+				flips = nand_check_erased_ecc_chunk(
+						buf + i * nfc_geo->ecc_chunk_size,
+						nfc_geo->ecc_chunk_size,
+						NULL, 0,
+						NULL, 0,
+						nfc_geo->ecc_strength);
+			}
+
+			if (flips > 0) {
+				max_bitflips = max_t(unsigned int, max_bitflips,
+						     flips);
+				mtd->ecc_stats.corrected += flips;
+				continue;
+			}
+
 			mtd->ecc_stats.failed++;
 			continue;
 		}
+
 		mtd->ecc_stats.corrected += *status;
 		max_bitflips = max_t(unsigned int, max_bitflips, *status);
 	}
@@ -1062,11 +1106,6 @@  static int gpmi_ecc_read_page(struct mtd_info *mtd, struct nand_chip *chip,
 		chip->oob_poi[0] = ((uint8_t *) auxiliary_virt)[0];
 	}
 
-	read_page_swap_end(this, buf, nfc_geo->payload_size,
-			this->payload_virt, this->payload_phys,
-			nfc_geo->payload_size,
-			payload_virt, payload_phys);
-
 	return max_bitflips;
 }