
[RFC] mmc: core: Disable REQ_FUA if the eMMC supports an internal cache

Message ID 20230302144330.274947-1-ulf.hansson@linaro.org (mailing list archive)
State New, archived
Series [RFC] mmc: core: Disable REQ_FUA if the eMMC supports an internal cache

Commit Message

Ulf Hansson March 2, 2023, 2:43 p.m. UTC
REQ_FUA is in general supported for eMMC cards, where it translates into
so-called "reliable writes". To support these write operations, CMD23
(MMC_CAP_CMD23) needs to be supported by the mmc host too, which is common
but not always the case.

For some eMMC devices, it has been reported that reliable writes are quite
costly, leading to performance degradation.

As a way to improve the situation, let's avoid announcing REQ_FUA support
if the eMMC supports an internal cache, as that allows us to rely solely on
flush requests (REQ_OP_FLUSH) instead, which seem to be a lot cheaper.
Note that those mmc hosts that lack CMD23 support are already using this
type of configuration, whatever that could mean.

Reported-by: Wenchao Chen <wenchao.chen666@gmail.com>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
---

Note that I haven't been able to test this patch myself, but am relying on
Wenchao and others to help out. Sharing some performance numbers before and
after the patch would be nice.

Moreover, what is not clear to me (hence the RFC) is whether relying solely on
flush requests is sufficient and, as such, whether it's a good idea after all.
Comments are highly appreciated in this regard.

Kind regards
Ulf Hansson

---
 drivers/mmc/core/block.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)
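
As a rough sketch of the idea described in the commit message (not
necessarily the actual diff), the decision in mmc_blk_alloc_req() in
drivers/mmc/core/block.c could look something like the following, assuming
the existing cache_enabled/fua_enabled flags that are fed into
blk_queue_write_cache():

	/*
	 * Sketch only: announce REQ_FUA, i.e. use reliable writes, only when
	 * the eMMC does not provide an internal cache that we can flush via
	 * REQ_OP_FLUSH instead.
	 */
	if (md->flags & MMC_BLK_CMD23 &&
	    ((card->ext_csd.rel_param & EXT_CSD_WR_REL_PARAM_EN) ||
	     card->ext_csd.rel_sectors) &&
	    !mmc_cache_enabled(card->host)) {
		md->flags |= MMC_BLK_REL_WR;
		fua_enabled = true;
		cache_enabled = true;
	}

	if (mmc_cache_enabled(card->host))
		cache_enabled = true;

	/* Tell the block layer which of cache flush and FUA we support. */
	blk_queue_write_cache(md->queue.queue, cache_enabled, fua_enabled);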

Comments

Bean Huo March 2, 2023, 3:02 p.m. UTC | #1
On 02.03.23 3:43 PM, Ulf Hansson wrote:
> REQ_FUA is in general supported for eMMC cards, which translates into so
> called "reliable writes". To support these write operations, the CMD23
> (MMC_CAP_CMD23), needs to be supported by the mmc host too, which is common
> but not always the case.
>
> For some eMMC devices, it has been reported that reliable writes are quite
> costly, leading to performance degradations.
>
> In a way to improve the situation, let's avoid announcing REQ_FUA support
> if the eMMC supports an internal cache, as that allows us to rely solely on
> flush-requests (REQ_OP_FLUSH) instead, which seems to be a lot cheaper.
> Note that, those mmc hosts that lacks CMD23 support are already using this
> type of configuration, whatever that could mean.
>
> Reported-by: Wenchao Chen<wenchao.chen666@gmail.com>
> Signed-off-by: Ulf Hansson<ulf.hansson@linaro.org>
Acked-by: Bean Huo <beanhuo@micron.com>
Christian Loehle March 2, 2023, 3:07 p.m. UTC | #2
> Subject: [RFC PATCH] mmc: core: Disable REQ_FUA if the eMMC supports an internal cache

Hey Uffe,
I may have the setup to play with this.
(Powerfailing eMMC devices)
> 
> REQ_FUA is in general supported for eMMC cards, which translates into so called "reliable writes". To support these write operations, the CMD23 (MMC_CAP_CMD23), needs to be supported by the mmc host too, which is common but not always the case.
> 
> For some eMMC devices, it has been reported that reliable writes are quite costly, leading to performance degradations.
> 
> In a way to improve the situation, let's avoid announcing REQ_FUA support if the eMMC supports an internal cache, as that allows us to rely solely on flush-requests (REQ_OP_FLUSH) instead, which seems to be a lot cheaper.
> Note that, those mmc hosts that lacks CMD23 support are already using this type of configuration, whatever that could mean.

Just note that reliable write is strictly weaker than turning cache off/flushing,
if card loses power during cache off/flush programming / busy, sector-wise atomicity is not mandated by the spec.
(And that is assuming cache off/flush is actually respected by the card as intended by the spec, should some cards be checked?)
Maybe some FS people can also chime in?

Regards,
Christian

Hyperstone GmbH | Reichenaustr. 39a  | 78467 Konstanz
Managing Director: Dr. Jan Peter Berns.
Commercial register of local courts: Freiburg HRB381782
Avri Altman March 3, 2023, 9:39 a.m. UTC | #3
> On 02.03.23 3:43 PM, Ulf Hansson wrote:
> > REQ_FUA is in general supported for eMMC cards, which translates into
> > so called "reliable writes". To support these write operations, the
> > CMD23 (MMC_CAP_CMD23), needs to be supported by the mmc host too,
> > which is common but not always the case.
> >
> > For some eMMC devices, it has been reported that reliable writes are
> > quite costly, leading to performance degradations.
> >
> > In a way to improve the situation, let's avoid announcing REQ_FUA
> > support if the eMMC supports an internal cache, as that allows us to
> > rely solely on flush-requests (REQ_OP_FLUSH) instead, which seems to be a
> lot cheaper.
> > Note that, those mmc hosts that lacks CMD23 support are already using
> > this type of configuration, whatever that could mean.
> >
> > Reported-by: Wenchao Chen<wenchao.chen666@gmail.com>
> > Signed-off-by: Ulf Hansson<ulf.hansson@linaro.org>
> Acked-by: Bean Huo <beanhuo@micron.com>
Acked-by: Avri Altman <avri.altman@wdc.com>

Another option might be allowing to report "broken_fua",
should the platform owner choose to, much like scsi does per sdev.

Thanks,
Avri
Ulf Hansson March 3, 2023, 11:03 a.m. UTC | #4
On Fri, 3 Mar 2023 at 10:39, Avri Altman <Avri.Altman@wdc.com> wrote:
>
> > On 02.03.23 3:43 PM, Ulf Hansson wrote:
> > > REQ_FUA is in general supported for eMMC cards, which translates into
> > > so called "reliable writes". To support these write operations, the
> > > CMD23 (MMC_CAP_CMD23), needs to be supported by the mmc host too,
> > > which is common but not always the case.
> > >
> > > For some eMMC devices, it has been reported that reliable writes are
> > > quite costly, leading to performance degradations.
> > >
> > > In a way to improve the situation, let's avoid announcing REQ_FUA
> > > support if the eMMC supports an internal cache, as that allows us to
> > > rely solely on flush-requests (REQ_OP_FLUSH) instead, which seems to be a
> > lot cheaper.
> > > Note that, those mmc hosts that lacks CMD23 support are already using
> > > this type of configuration, whatever that could mean.
> > >
> > > Reported-by: Wenchao Chen<wenchao.chen666@gmail.com>
> > > Signed-off-by: Ulf Hansson<ulf.hansson@linaro.org>
> > Acked-by: Bean Huo <beanhuo@micron.com>
> Acked-by: Avri Altman <avri.altman@wdc.com>

Thanks!

>
> Another option might be, allowing to report "broken_fua",
> should the platform owner chooses to, much like scsi does per sdev.

Are you suggesting a static or dynamic configuration option?

For mmc, we also have the card quirks that may be used to configure
the support for FUA, based upon what would work best for the card. Is
that what you were thinking of?

>
> Thanks,
> Avri

Kind regards
Uffe
Avri Altman March 3, 2023, 11:37 a.m. UTC | #5
> On Fri, 3 Mar 2023 at 10:39, Avri Altman <Avri.Altman@wdc.com> wrote:
> >
> > > On 02.03.23 3:43 PM, Ulf Hansson wrote:
> > > > REQ_FUA is in general supported for eMMC cards, which translates
> > > > into so called "reliable writes". To support these write
> > > > operations, the
> > > > CMD23 (MMC_CAP_CMD23), needs to be supported by the mmc host
> too,
> > > > which is common but not always the case.
> > > >
> > > > For some eMMC devices, it has been reported that reliable writes
> > > > are quite costly, leading to performance degradations.
> > > >
> > > > In a way to improve the situation, let's avoid announcing REQ_FUA
> > > > support if the eMMC supports an internal cache, as that allows us
> > > > to rely solely on flush-requests (REQ_OP_FLUSH) instead, which
> > > > seems to be a
> > > lot cheaper.
> > > > Note that, those mmc hosts that lacks CMD23 support are already
> > > > using this type of configuration, whatever that could mean.
> > > >
> > > > Reported-by: Wenchao Chen<wenchao.chen666@gmail.com>
> > > > Signed-off-by: Ulf Hansson<ulf.hansson@linaro.org>
> > > Acked-by: Bean Huo <beanhuo@micron.com>
> > Acked-by: Avri Altman <avri.altman@wdc.com>
> 
> Thanks!
> 
> >
> > Another option might be, allowing to report "broken_fua", should the
> > platform owner chooses to, much like scsi does per sdev.
> 
> Are you suggesting a static or dynamic configuration option?
Static

> 
> For mmc, we also have the card quirks that may be used to configure the
> support for FUA, based upon what would work best for the card. Is that
> what you were thinking of?
I was thinking of allowing the platform owner the flexibility to decide whether to use it or not,
more like SDHCI_QUIRK_xxx.

But I am totally fine with your current suggestions.

Thanks,
Avri

> 
> >
> > Thanks,
> > Avri
> 
> Kind regards
> Uffe
Christian Loehle March 3, 2023, 11:40 a.m. UTC | #6
>> 
>> REQ_FUA is in general supported for eMMC cards, which translates into so called "reliable writes". To support these write operations, the CMD23 (MMC_CAP_CMD23), needs to be supported by the mmc host too, which is common but not always the case.
>> 
>> For some eMMC devices, it has been reported that reliable writes are quite costly, leading to performance degradations.
>> 
>> In a way to improve the situation, let's avoid announcing REQ_FUA support if the eMMC supports an internal cache, as that allows us to rely solely on flush-requests (REQ_OP_FLUSH) instead, which seems to be a lot cheaper.
>> Note that, those mmc hosts that lacks CMD23 support are already using this type of configuration, whatever that could mean.
> 
> Just note that reliable write is strictly weaker than turning cache off/flushing, if card loses power during cache off/flush programming / busy, sector-wise atomicity is not mandated by the spec.
> (And that is assuming cache off/flush is actually respected by the card as intended by the spec, should some cards be checked?) Maybe some FS people can also chime in?

Nevermind, the sector-wise atomicity should not matter on 5.1 cards or if the block length isn't being played with, which it isn't in our case.
If reliable write is implemented only according to spec, I don't see why the cache flushing should be less expensive, which would only make sense if
a) < sector chunks are committed to flash
b) reliable write is implemented much stricter than the spec, ensuring atomicity for the entire write.

I guess the cards which increase performance do b)? Or something else?
Anyway, regarding FUA I don't have any concerns about reliability with cache flush.
I can add some performance comparisons with some eMMCs I have around though.

Regards,
Christian

Hyperstone GmbH | Reichenaustr. 39a  | 78467 Konstanz
Managing Director: Dr. Jan Peter Berns.
Commercial register of local courts: Freiburg HRB381782
Ulf Hansson March 3, 2023, 12:01 p.m. UTC | #7
On Fri, 3 Mar 2023 at 12:40, Christian Löhle <CLoehle@hyperstone.com> wrote:
>
>
> >>
> >> REQ_FUA is in general supported for eMMC cards, which translates into so called "reliable writes". To support these write operations, the CMD23 (MMC_CAP_CMD23), needs to be supported by the mmc host too, which is common but not always the case.
> >>
> >> For some eMMC devices, it has been reported that reliable writes are quite costly, leading to performance degradations.
> >>
> >> In a way to improve the situation, let's avoid announcing REQ_FUA support if the eMMC supports an internal cache, as that allows us to rely solely on flush-requests (REQ_OP_FLUSH) instead, which seems to be a lot cheaper.
> >> Note that, those mmc hosts that lacks CMD23 support are already using this type of configuration, whatever that could mean.
> >
> > Just note that reliable write is strictly weaker than turning cache off/flushing, if card loses power during cache off/flush programming / busy, sector-wise atomicity is not mandated by the spec.
> > (And that is assuming cache off/flush is actually respected by the card as intended by the spec, should some cards be checked?) Maybe some FS people can also chime in?
>
> Nevermind, the sector-wise atomicity should not matter on 5.1 cards or if the block length isn't being played with, which it isn't in our case.
> If reliable write is implemented only according to spec, I don't see why the cache flushing should be less expensive, which would only make sense if
> a) < sector chunks are committed to flash
> b) reliable write is implemented much stricter than the spec, ensuring atomicity for the entire write.

Right, I agree!

Note 1) Reliable write was introduced way before cache management in
the eMMC spec. So, if the support for reliable write turns out to have a
stricter implementation than needed, I would not be surprised.

Note 2) In the eMMC v5.1 spec, the cache flushing support has been
extended to allow an explicit barrier operation. Perhaps, we should
let that option take precedence over a regular flush+barrier, for
REQ_OP_FLUSH!?

>
> I guess the cards which increase performance do b)? Or something else?

That's the tricky part to know, as it's the internals of the eMMC.

Although, it seems like both Avri (WDC) and Bean (Micron) would be
happy to proceed with $subject patch, which makes me more comfortable
to move forward.

> Anyway regarding FUA i don't have any concerns regarding reliability with cache flush.
> I can add some performance comparisons with some eMMCs I have around though.

That would be great, thanks a lot for helping out with testing!

Kind regards
Uffe
Adrian Hunter March 3, 2023, 2:41 p.m. UTC | #8
On 3/03/23 14:01, Ulf Hansson wrote:
> On Fri, 3 Mar 2023 at 12:40, Christian Löhle <CLoehle@hyperstone.com> wrote:
>>
>>
>>>>
>>>> REQ_FUA is in general supported for eMMC cards, which translates into so called "reliable writes". To support these write operations, the CMD23 (MMC_CAP_CMD23), needs to be supported by the mmc host too, which is common but not always the case.
>>>>
>>>> For some eMMC devices, it has been reported that reliable writes are quite costly, leading to performance degradations.
>>>>
>>>> In a way to improve the situation, let's avoid announcing REQ_FUA support if the eMMC supports an internal cache, as that allows us to rely solely on flush-requests (REQ_OP_FLUSH) instead, which seems to be a lot cheaper.
>>>> Note that, those mmc hosts that lacks CMD23 support are already using this type of configuration, whatever that could mean.
>>>
>>> Just note that reliable write is strictly weaker than turning cache off/flushing, if card loses power during cache off/flush programming / busy, sector-wise atomicity is not mandated by the spec.
>>> (And that is assuming cache off/flush is actually respected by the card as intended by the spec, should some cards be checked?) Maybe some FS people can also chime in?
>>
>> Nevermind, the sector-wise atomicity should not matter on 5.1 cards or if the block length isn't being played with, which it isn't in our case.
>> If reliable write is implemented only according to spec, I don't see why the cache flushing should be less expensive, which would only make sense if
>> a) < sector chunks are committed to flash
>> b) reliable write is implemented much stricter than the spec, ensuring atomicity for the entire write.
> 
> Right, I agree!
> 
> Note 1) Reliable write was introduced way before cache management in
> the eMMC spec. So, if the support for reliable write would have a
> stricter implementation than needed, I would not be surprised.

I am not sure when you say stricter than needed.  Historically
file systems assumed that sectors are updated atomically i.e.
there is never a sector with a mixture of old and new data.
The eMMC spec does not guarantee that, except for reliable
write.

File systems may use REQ_FUA for important information, like the
superblock or a journal commit record, so using reliable write
for REQ_FUA would seem to give better protection against file system
corruption than a cache flush which could leave a sector
half-written.

On the other hand, sudden power loss is probably rare in battery
powered systems because they are designed to monitor the battery
power and shutdown when it gets too low.

And file systems can use checksums to detect half-written updates.

And there is anyway no protection for other (non REQ_FUA) writes a
file system might do and expect not to tear sectors.

And you are more likely to smash the screen than bounce the battery
out and cause an unrecoverable file system error.

Nevertheless, the commit message of this patch reads like the change
is an optimization, whereas it seems more like a policy change.
The commit message should perhaps say something like:
"The consensus is that the benefit of improved performance by not
using reliable-write for REQ_FUA is much greater than any potential
benefit that reliable-write might provide to avoid file system
corruption in the event of sudden power loss."

As for allowing for the policy to be overridden, perhaps an mmc_core
module option?

> 
> Note 2) In the eMMC v5.1 spec, the cache flushing support has been
> extended to allow an explicit barrier operation. Perhaps, we should
> let that option take precedence over a regular flush+barrier, for
> REQ_OP_FLUSH!?
> 
>>
>> I guess the cards which increase performance do b)? Or something else?
> 
> That's the tricky part to know, as it's the internals of the eMMC.

It is the natural conclusion though.  The eMMC probably does not
update mapping information with every write, instead if power is
lost, it scans the updated areas at the next initialization. (The
power-off notify feature would commit the mapping information to
media to avoid that).  So a reliable write might have to:
1. write information to record that the old mapping
should be used, not what might be discovered by scanning
2. do the actual write
3. write mapping information to record the new mapping

> 
> Although, it seems like both Avri (WDC) and Bean (Micron) would be
> happy to proceed with $subject patch, which makes me more comfortable
> to move forward.
> 
>> Anyway regarding FUA i don't have any concerns regarding reliability with cache flush.
>> I can add some performance comparisons with some eMMCs I have around though.
> 
> That would be great, thanks a lot for helping out with testing!
> 
> Kind regards
> Uffe
Ulf Hansson March 6, 2023, 4:09 p.m. UTC | #9
On Fri, 3 Mar 2023 at 15:41, Adrian Hunter <adrian.hunter@intel.com> wrote:
>
> On 3/03/23 14:01, Ulf Hansson wrote:
> > On Fri, 3 Mar 2023 at 12:40, Christian Löhle <CLoehle@hyperstone.com> wrote:
> >>
> >>
> >>>>
> >>>> REQ_FUA is in general supported for eMMC cards, which translates into so called "reliable writes". To support these write operations, the CMD23 (MMC_CAP_CMD23), needs to be supported by the mmc host too, which is common but not always the case.
> >>>>
> >>>> For some eMMC devices, it has been reported that reliable writes are quite costly, leading to performance degradations.
> >>>>
> >>>> In a way to improve the situation, let's avoid announcing REQ_FUA support if the eMMC supports an internal cache, as that allows us to rely solely on flush-requests (REQ_OP_FLUSH) instead, which seems to be a lot cheaper.
> >>>> Note that, those mmc hosts that lacks CMD23 support are already using this type of configuration, whatever that could mean.
> >>>
> >>> Just note that reliable write is strictly weaker than turning cache off/flushing, if card loses power during cache off/flush programming / busy, sector-wise atomicity is not mandated by the spec.
> >>> (And that is assuming cache off/flush is actually respected by the card as intended by the spec, should some cards be checked?) Maybe some FS people can also chime in?
> >>
> >> Nevermind, the sector-wise atomicity should not matter on 5.1 cards or if the block length isn't being played with, which it isn't in our case.
> >> If reliable write is implemented only according to spec, I don't see why the cache flushing should be less expensive, which would only make sense if
> >> a) < sector chunks are committed to flash
> >> b) reliable write is implemented much stricter than the spec, ensuring atomicity for the entire write.
> >
> > Right, I agree!
> >
> > Note 1) Reliable write was introduced way before cache management in
> > the eMMC spec. So, if the support for reliable write would have a
> > stricter implementation than needed, I would not be surprised.
>
> I am not sure when you say stricter than needed.  Historically
> file systems assumed that sectors are updated atomically i.e.
> there is never a sector with a mixture of old and new data.
> The eMMC spec does not guarantee that, except for reliable
> write.

Yes, I agree. With stricter, I was merely thinking of whether the eMMC
makes the entire write request (multiple sectors) atomic, or whether the
guarantee is only on a per-sector basis.

According to the eMMC spec, that seems to be implementation specific.
One option could be heavier than the other, I guess.

>
> File systems may use REQ_FUA for important information, like the
> superblock or a journal commit record, so using reliable write
> for REQ_FUA would seem to give better protection against file system
> corruption than a cache flush which could leave a sector
> half-written.

Yes, I agree. If we should fully conform to what is stated in the eMMC
spec, we should probably keep the current path to support REQ_FUA.

>
> On the other hand, sudden power loss is probably rare in battery
> powered systems because they are designed to monitor the battery
> power and shutdown when it gets too low.
>
> And file systems can use checksums to detect half-written updates.
>
> And there is anyway no protection for other (non REQ_FUA) writes a
> file system might do and expect not to tear sectors.
>
> And you are more likely to smash the screen than bounce the battery
> out and cause an unrecoverable file system error.

Right, these are good arguments for why $subject patch perhaps makes
sense to move forward with anyway.

Moreover, it seems like some eMMC vendors don't really have any
concerns with us moving away from reliable writes, to instead use only
"cache flushing". I guess it can indicate that the regular writes may
already be treated in an atomic kind of way, but who knows.

>
> Nevertheless, the commit message of this patch reads like the change
> is an optimization, whereas it seems more like a policy change.
> The commit message should perhaps say something like:
> "The consensus is that the benefit of improved performance by not
> using reliable-write for REQ_FUA is much greater than any potential
> benefit that reliable-write might provide to avoid file system
> corruption in the event of sudden power loss."

I agree. I will improve it along the lines of what you suggest.

>
> As for allowing for the policy to be overridden, perhaps an mmc_core
> module option?

Even if I am not very fond of module parameters, this seems like a
reasonable thing to use for this case.

I was also looking at using a card quirk.
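
Just to illustrate what I have in mind (hypothetical names, not a concrete
proposal), a module parameter combined with a card quirk override could look
roughly like this in drivers/mmc/core/block.c; MMC_QUIRK_USE_FUA is made up
here:

/* Sketch only: allow the REQ_FUA policy to be overridden. */
static bool mmc_blk_use_fua;
module_param_named(use_fua, mmc_blk_use_fua, bool, 0444);
MODULE_PARM_DESC(use_fua,
		 "Use reliable writes for REQ_FUA even if the eMMC has a cache");

static bool mmc_blk_fua_enabled(struct mmc_card *card)
{
	/* Hypothetical quirk for cards known to prefer reliable writes. */
	if (card->quirks & MMC_QUIRK_USE_FUA)
		return true;

	return mmc_blk_use_fua || !mmc_cache_enabled(card->host);
}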

>
> >
> > Note 2) In the eMMC v5.1 spec, the cache flushing support has been
> > extended to allow an explicit barrier operation. Perhaps, we should
> > let that option take precedence over a regular flush+barrier, for
> > REQ_OP_FLUSH!?
> >
> >>
> >> I guess the cards which increase performance do b)? Or something else?
> >
> > That's the tricky part to know, as it's the internals of the eMMC.
>
> It is the natural conclusion though.  The eMMC probably does not
> update mapping information with every write, instead if power is
> lost, it scans the updated areas at the next initialization. (The
> power-off notify feature would commit the mapping information to
> media to avoid that).  So a reliable write might have to:
> 1. write information to record that the old mapping
> should be used, not what might be discovered by scanning
> 2. do the actual write
> 3. write mapping information to record the new mapping

Yes. And depending on the eMMC device, some may be more clever than
others for how to actually deal with this.

>
> >
> > Although, it seems like both Avri (WDC) and Bean (Micron) would be
> > happy to proceed with $subject patch, which makes me more comfortable
> > to move forward.
> >
> >> Anyway regarding FUA i don't have any concerns regarding reliability with cache flush.
> >> I can add some performance comparisons with some eMMCs I have around though.
> >
> > That would be great, thanks a lot for helping out with testing!
> >

Kind regards
Uffe
Adrian Hunter March 7, 2023, 1:15 p.m. UTC | #10
On 6/03/23 18:09, Ulf Hansson wrote:
> On Fri, 3 Mar 2023 at 15:41, Adrian Hunter <adrian.hunter@intel.com> wrote:
>>
>> On 3/03/23 14:01, Ulf Hansson wrote:
>>> On Fri, 3 Mar 2023 at 12:40, Christian Löhle <CLoehle@hyperstone.com> wrote:
>>>>
>>>>
>>>>>>
>>>>>> REQ_FUA is in general supported for eMMC cards, which translates into so called "reliable writes". To support these write operations, the CMD23 (MMC_CAP_CMD23), needs to be supported by the mmc host too, which is common but not always the case.
>>>>>>
>>>>>> For some eMMC devices, it has been reported that reliable writes are quite costly, leading to performance degradations.
>>>>>>
>>>>>> In a way to improve the situation, let's avoid announcing REQ_FUA support if the eMMC supports an internal cache, as that allows us to rely solely on flush-requests (REQ_OP_FLUSH) instead, which seems to be a lot cheaper.
>>>>>> Note that, those mmc hosts that lacks CMD23 support are already using this type of configuration, whatever that could mean.
>>>>>
>>>>> Just note that reliable write is strictly weaker than turning cache off/flushing, if card loses power during cache off/flush programming / busy, sector-wise atomicity is not mandated by the spec.
>>>>> (And that is assuming cache off/flush is actually respected by the card as intended by the spec, should some cards be checked?) Maybe some FS people can also chime in?
>>>>
>>>> Nevermind, the sector-wise atomicity should not matter on 5.1 cards or if the block length isn't being played with, which it isn't in our case.
>>>> If reliable write is implemented only according to spec, I don't see why the cache flushing should be less expensive, which would only make sense if
>>>> a) < sector chunks are committed to flash
>>>> b) reliable write is implemented much stricter than the spec, ensuring atomicity for the entire write.
>>>
>>> Right, I agree!
>>>
>>> Note 1) Reliable write was introduced way before cache management in
>>> the eMMC spec. So, if the support for reliable write would have a
>>> stricter implementation than needed, I would not be surprised.
>>
>> I am not sure when you say stricter than needed.  Historically
>> file systems assumed that sectors are updated atomically i.e.
>> there is never a sector with a mixture of old and new data.
>> The eMMC spec does not guarantee that, except for reliable
>> write.
> 
> Yes, I agree. With stricker, I was merely thinking of whether the eMMC
> makes the entire write request (multiple sectors) being atomic, or the
> guarantee is only per sector basis.
> 
> According to the eMMC spec, that seems to be implementation specific.
> One option could be heavier than the other, I guess.
> 
>>
>> File systems may use REQ_FUA for important information, like the
>> superblock or a journal commit record, so using reliable write
>> for REQ_FUA would seem to give better protection against file system
>> corruption than a cache flush which could leave a sector
>> half-written.
> 
> Yes, I agree. If we should fully conform to what is stated in the eMMC
> spec, we should probably keep the current path to support REQ_FUA.
> 
>>
>> On the other hand, sudden power loss is probably rare in battery
>> powered systems because they are designed to monitor the battery
>> power and shutdown when it gets too low.
>>
>> And file systems can use checksums to detect half-written updates.
>>
>> And there is anyway no protection for other (non REQ_FUA) writes a
>> file system might do and expect not to tear sectors.
>>
>> And you are more likely to smash the screen than bounce the battery
>> out and cause an unrecoverable file system error.
> 
> Right, these are good arguments to why $subject patch perhaps makes
> sense to move forward with anyway.

Yes

> 
> Moreover, it seems like some eMMC vendors don't really have any
> concerns with us moving away from reliable writes, to instead use only
> "cache flushing". I guess it can indicate that the regular writes may
> already be treated in an atomic kind of way, but who knows.

Indeed

> 
>>
>> Nevertheless, the commit message of this patch reads like the change
>> is an optimization, whereas it seems more like a policy change.
>> The commit message should perhaps say something like:
>> "The consensus is that the benefit of improved performance by not
>> using reliable-write for REQ_FUA is much greater than any potential
>> benefit that reliable-write might provide to avoid file system
>> corruption in the event of sudden power loss."
> 
> I agree. I will improve it along the lines of what you suggest.
> 
>>
>> As for allowing for the policy to be overridden, perhaps an mmc_core
>> module option?
> 
> Even if I am not very fond of module parameters, this seems like a
> reasonable thing to use for this case.
> 
> I was also looking at using a card quirk.

Yes that makes sense

> 
>>
>>>
>>> Note 2) In the eMMC v5.1 spec, the cache flushing support has been
>>> extended to allow an explicit barrier operation. Perhaps, we should
>>> let that option take precedence over a regular flush+barrier, for
>>> REQ_OP_FLUSH!?
>>>
>>>>
>>>> I guess the cards which increase performance do b)? Or something else?
>>>
>>> That's the tricky part to know, as it's the internals of the eMMC.
>>
>> It is the natural conclusion though.  The eMMC probably does not
>> update mapping information with every write, instead if power is
>> lost, it scans the updated areas at the next initialization. (The
>> power-off notify feature would commit the mapping information to
>> media to avoid that).  So a reliable write might have to:
>> 1. write information to record that the old mapping
>> should be used, not what might be discovered by scanning
>> 2. do the actual write
>> 3. write mapping information to record the new mapping
> 
> Yes. And depending on the eMMC device, some may be more clever than
> others for how to actually deal with this.
> 
>>
>>>
>>> Although, it seems like both Avri (WDC) and Bean (Micron) would be
>>> happy to proceed with $subject patch, which makes me more comfortable
>>> to move forward.
>>>
>>>> Anyway regarding FUA i don't have any concerns regarding reliability with cache flush.
>>>> I can add some performance comparisons with some eMMCs I have around though.
>>>
>>> That would be great, thanks a lot for helping out with testing!
>>>
> 
> Kind regards
> Uffe
Christian Loehle March 10, 2023, 1:43 p.m. UTC | #11
I have benchmarked the FUA/Cache behavior a bit.
I don't have an actual filesystem benchmark that does what I wanted and is easy to port to the target so I used:

# call with
# for loop in {1..3}; do sudo dd if=/dev/urandom bs=1M of=/dev/mmcblk2; done; for loop in {1..5}; do time ./filesystembenchmark.sh; umount /mnt; done
mkfs.ext4 -F /dev/mmcblk2
mount /dev/mmcblk2 /mnt
for i in {1..3}
do
cp -r linux-6.2.2 /mnt/$i
done
for i in {1..3}
do
rm -r /mnt/$i
done
for i in {1..3}
do
cp -r linux-6.2.2 /mnt/$i
done


I found a couple of DUTs that I can link, I also tested one industrial card.

DUT1: blue PCB Foresee eMMC
https://pine64.com/product/32gb-emmc-module/
DUT2: green PCB SiliconGo eMMC
Couldn't find that one online anymore unfortunately
DUT3: orange hardkernel PCB 8GB
https://www.hardkernel.com/shop/8gb-emmc-module-c2-android/
DUT4: orange hardkernel PCB white dot
https://rlx.sk/en/odroid/3198-16gb-emmc-50-module-xu3-android-for-odroid-xu3.html
DUT5: Industrial card


The test issued 461 DO_REL_WR during one of the iterations for DUT5

DUT1:
Cache, no FUA:
13:04.49
13:13.82
13:30.59
13:28:13
13:20:64
FUA:
13:30.32
13:36.26
13:10.86
13:32.52
13:48.59

DUT2:
FUA:
8:11.24
7:47.73
7:48.00
7:48.18
7:47.38
Cache, no FUA:
8:10.30
7:48.97
7:48.47
7:47.93
7:44.18

DUT3:
Cache, no FUA:
7:02.82
6:58.94
7:03.20
7:00.27
7:00.88
FUA:
7:05.43
7:03.44
7:04.82
7:03.26
7:04.74

DUT4:
FUA:
7:23.92
7:20.15
7:20.52
7:19.10
7:20.71
Cache, no FUA:
7:20.23
7:20.48
7:19.94
7:18.90
7:19.88

DUT5:
Cache, no FUA:
7:19.36
7:02.11
7:01.53
7:01.35
7:00.37
Cache, no FUA CQE:
7:17.55
7:00.73
6:59.25
6:58.44
6:58.60
FUA:
7:15.10
6:58.99
6:58.94
6:59.17
6:60.00
FUA CQE:
7:11.03
6:58.04
6:56.89
6:56.43
6:56:28

If anyone has any comments or disagrees with the benchmark, or has a specific eMMC to test, let me know.

Regards,
Christian

Hyperstone GmbH | Reichenaustr. 39a  | 78467 Konstanz
Managing Director: Dr. Jan Peter Berns.
Commercial register of local courts: Freiburg HRB381782
Ulf Hansson March 10, 2023, 2:53 p.m. UTC | #12
On Fri, 10 Mar 2023 at 14:43, Christian Löhle <CLoehle@hyperstone.com> wrote:
>
> I have benchmarked the FUA/Cache behavior a bit.
> I don't have an actual filesystem benchmark that does what I wanted and is easy to port to the target so I used:
>
> # call with
> # for loop in {1..3}; do sudo dd if=/dev/urandom bs=1M of=/dev/mmcblk2; done; for loop in {1..5}; do time ./filesystembenchmark.sh; umount /mnt; done
> mkfs.ext4 -F /dev/mmcblk2
> mount /dev/mmcblk2 /mnt
> for i in {1..3}
> do
> cp -r linux-6.2.2 /mnt/$i
> done
> for i in {1..3}
> do
> rm -r /mnt/$i
> done
> for i in {1..3}
> do
> cp -r linux-6.2.2 /mnt/$i
> done
>
>
> I found a couple of DUTs that I can link, I also tested one industrial card.
>
> DUT1: blue PCB Foresee eMMC
> https://pine64.com/product/32gb-emmc-module/
> DUT2: green PCB SiliconGo eMMC
> Couldn't find that one online anymore unfortunately
> DUT3: orange hardkernel PCB 8GB
> https://www.hardkernel.com/shop/8gb-emmc-module-c2-android/
> DUT4: orange hardkernel PCB white dot
> https://rlx.sk/en/odroid/3198-16gb-emmc-50-module-xu3-android-for-odroid-xu3.html
> DUT5: Industrial card

Thanks a lot for helping out with testing! Much appreciated!

>
>
> The test issued 461 DO_REL_WR during one of the iterations for DUT5
>
> DUT1:
> Cache, no FUA:
> 13:04.49
> 13:13.82
> 13:30.59
> 13:28:13
> 13:20:64
> FUA:
> 13:30.32
> 13:36.26
> 13:10.86
> 13:32.52
> 13:48.59
>
> DUT2:
> FUA:
> 8:11.24
> 7:47.73
> 7:48.00
> 7:48.18
> 7:47.38
> Cache, no FUA:
> 8:10.30
> 7:48.97
> 7:48.47
> 7:47.93
> 7:44.18
>
> DUT3:
> Cache, no FUA:
> 7:02.82
> 6:58.94
> 7:03.20
> 7:00.27
> 7:00.88
> FUA:
> 7:05.43
> 7:03.44
> 7:04.82
> 7:03.26
> 7:04.74
>
> DUT4:
> FUA:
> 7:23.92
> 7:20.15
> 7:20.52
> 7:19.10
> 7:20.71
> Cache, no FUA:
> 7:20.23
> 7:20.48
> 7:19.94
> 7:18.90
> 7:19.88

Without going into the details of the above, it seems like for DUT1,
DUT2, DUT3 and DUT4 there are good reasons why we should move forward
with $subject patch.

Do you agree?

>
> Cache, no FUA:
> 7:19.36
> 7:02.11
> 7:01.53
> 7:01.35
> 7:00.37
> Cache, no FUA CQE:
> 7:17.55
> 7:00.73
> 6:59.25
> 6:58.44
> 6:58.60
> FUA:
> 7:15.10
> 6:58.99
> 6:58.94
> 6:59.17
> 6:60.00
> FUA CQE:
> 7:11.03
> 6:58.04
> 6:56.89
> 6:56.43
> 6:56:28
>
> If anyone has any comments or disagrees with the benchmark, or has a specific eMMC to test, let me know.

If I understand correctly, for DUT5, it seems like using FUA may be
slightly better than just cache-flushing, right?

For CQE, it seems like FUA could be slightly even better, at least for
DUT5.  Do you know if REQ_OP_FLUSH translates into MMC_ISSUE_DCMD or
MMC_ISSUE_SYNC for your case? See mmc_cqe_issue_type().
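
For reference, the flush handling in mmc_cqe_issue_type() looks roughly like
this (paraphrased from drivers/mmc/core/queue.c; details may differ between
kernel versions):

static enum mmc_issue_type mmc_cqe_issue_type(struct mmc_host *host,
					      struct request *req)
{
	switch (req_op(req)) {
	case REQ_OP_DRV_IN:
	case REQ_OP_DRV_OUT:
	case REQ_OP_DISCARD:
	case REQ_OP_SECURE_ERASE:
		return MMC_ISSUE_SYNC;
	case REQ_OP_FLUSH:
		/* DCMD only if the host advertises MMC_CAP2_CQE_DCMD */
		return mmc_cqe_can_dcmd(host) ? MMC_ISSUE_DCMD :
						MMC_ISSUE_SYNC;
	default:
		return MMC_ISSUE_ASYNC;
	}
}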

When it comes to CQE, maybe Adrian have some additional thoughts
around this? Perhaps we should keep using REQ_FUA, if we have CQE?

Kind regards
Uffe
Christian Loehle March 10, 2023, 5:06 p.m. UTC | #13
>>
>> I have benchmarked the FUA/Cache behavior a bit.
>> I don't have an actual filesystem benchmark that does what I wanted and is easy to port to the target so I used:
>>
>> # call with
>> # for loop in {1..3}; do sudo dd if=/dev/urandom bs=1M 
>> of=/dev/mmcblk2; done; for loop in {1..5}; do time 
>> ./filesystembenchmark.sh; umount /mnt; done
>> mkfs.ext4 -F /dev/mmcblk2
>> mount /dev/mmcblk2 /mnt
>> for i in {1..3}
>> do
>> cp -r linux-6.2.2 /mnt/$i
>> done
>> for i in {1..3}
>> do
>> rm -r /mnt/$i
>> done
>> for i in {1..3}
>> do
>> cp -r linux-6.2.2 /mnt/$i
>> done
>>
>>
>> I found a couple of DUTs that I can link, I also tested one industrial card.
>>
>> DUT1: blue PCB Foresee eMMC
>> https://pine64.com/product/32gb-emmc-module/
>> DUT2: green PCB SiliconGo eMMC
>> Couldn't find that one online anymore unfortunately
>> DUT3: orange hardkernel PCB 8GB
>> https://www.hardkernel.com/shop/8gb-emmc-module-c2-android/
>> DUT4: orange hardkernel PCB white dot
>> https://rlx.sk/en/odroid/3198-16gb-emmc-50-module-xu3-android-for-odro
>> id-xu3.html
>> DUT5: Industrial card
>
> Thanks a lot for helping out with testing! Much appreciated!

No problem, glad to be of help.

>
>>
>>
>> The test issued 461 DO_REL_WR during one of the iterations for DUT5
>>
>> DUT1:
>> Cache, no FUA:
>> 13:04.49
>> 13:13.82
>> 13:30.59
>> 13:28:13
>> 13:20:64
>> FUA:
>> 13:30.32
>> 13:36.26
>> 13:10.86
>> 13:32.52
>> 13:48.59
>>
>> DUT2:
>> FUA:
>> 8:11.24
>> 7:47.73
>> 7:48.00
>> 7:48.18
>> 7:47.38
>> Cache, no FUA:
>> 8:10.30
>> 7:48.97
>> 7:48.47
>> 7:47.93
>> 7:44.18
>>
>> DUT3:
>> Cache, no FUA:
>> 7:02.82
>> 6:58.94
>> 7:03.20
>> 7:00.27
>> 7:00.88
>> FUA:
>> 7:05.43
>> 7:03.44
>> 7:04.82
>> 7:03.26
>> 7:04.74
>>
>> DUT4:
>> FUA:
>> 7:23.92
>> 7:20.15
>> 7:20.52
>> 7:19.10
>> 7:20.71
>> Cache, no FUA:
>> 7:20.23
>> 7:20.48
>> 7:19.94
>> 7:18.90
>> 7:19.88
>
> Without going into the details of the above, it seems like for DUT1, DUT2, DUT3 and DUT4 there a good reasons to why we should move forward with $subject patch.
>
> Do you agree?

That is a good question; that's why I just posted the data without further comment from my side.
I was honestly expecting the difference to be much higher, given the original patch.
If this is representative of most cards, you would require quite an unusual workload to actually notice the difference IMO.
If there are cards where the difference is much more significant, then of course a quirk would be nicer.
On the other hand, I don't see why not, and any improvement is a good one?

>
>>
>> Cache, no FUA:
>> 7:19.36
>> 7:02.11
>> 7:01.53
>> 7:01.35
>> 7:00.37
>> Cache, no FUA CQE:
>> 7:17.55
>> 7:00.73
>> 6:59.25
>> 6:58.44
>> 6:58.60
>> FUA:
>> 7:15.10
>> 6:58.99
>> 6:58.94
>> 6:59.17
>> 6:60.00
>> FUA CQE:
>> 7:11.03
>> 6:58.04
>> 6:56.89
>> 6:56.43
>> 6:56:28
>>
>> If anyone has any comments or disagrees with the benchmark, or has a specific eMMC to test, let me know.
>
> If I understand correctly, for DUT5, it seems like using FUA may be slightly better than just cache-flushing, right?

That is correct. I specifically tested with this card because, under the assumption that reliable write comes without much additional cost, the DCMD would be slightly worse for performance and SYNC a bit worse.

>
> For CQE, it seems like FUA could be slightly even better, at least for DUT5.  Do you know if REQ_OP_FLUSH translates into MMC_ISSUE_DCMD or MMC_ISSUE_SYNC for your case? See mmc_cqe_issue_type().
It is SYNC (this is sdhci-of-arasan on rk3399, no DCMD), but even SYNC is not too bad here it seems; it could of course be worse if the workload were less sequential.

>
> When it comes to CQE, maybe Adrian have some additional thoughts around this? Perhaps we should keep using REQ_FUA, if we have CQE?
Sure, I'm also interested in Adrian's take on this.

Regards,
Christian

Hyperstone GmbH | Reichenaustr. 39a  | 78467 Konstanz
Managing Director: Dr. Jan Peter Berns.
Commercial register of local courts: Freiburg HRB381782
Avri Altman March 11, 2023, 8:30 a.m. UTC | #14
Hi,
> I have benchmarked the FUA/Cache behavior a bit.
> I don't have an actual filesystem benchmark that does what I wanted and is
> easy to port to the target so I used:
I guess I'm a bit late here - sorry about that.
If you happen to have traces of the below runs - I'll be happy to provide further analysis.

You might also want to look into fio, which allows you to write to a raw block device
and has various sync options - so you are much more in control of the payload size
and how many reliable-writes / sync-caches were issued - if any.
  
Thanks,
Avri

> # call with
> # for loop in {1..3}; do sudo dd if=/dev/urandom bs=1M of=/dev/mmcblk2;
> done; for loop in {1..5}; do time ./filesystembenchmark.sh; umount /mnt;
> done
> mkfs.ext4 -F /dev/mmcblk2
> mount /dev/mmcblk2 /mnt
> for i in {1..3}
> do
> cp -r linux-6.2.2 /mnt/$i
> done
> for i in {1..3}
> do
> rm -r /mnt/$i
> done
> for i in {1..3}
> do
> cp -r linux-6.2.2 /mnt/$i
> done
> 
> 
> I found a couple of DUTs that I can link, I also tested one industrial card.
> 
> DUT1: blue PCB Foresee eMMC
> https://pine64.com/product/32gb-emmc-module/
> DUT2: green PCB SiliconGo eMMC
> Couldn't find that one online anymore unfortunately
> DUT3: orange hardkernel PCB 8GB
> https://www.hardkernel.com/shop/8gb-emmc-module-c2-android/
> DUT4: orange hardkernel PCB white dot
> https://rlx.sk/en/odroid/3198-16gb-emmc-50-module-xu3-android-for-
> odroid-xu3.html
> DUT5: Industrial card
> 
> 
> The test issued 461 DO_REL_WR during one of the iterations for DUT5
> 
> DUT1:
> Cache, no FUA:
> 13:04.49
> 13:13.82
> 13:30.59
> 13:28:13
> 13:20:64
> FUA:
> 13:30.32
> 13:36.26
> 13:10.86
> 13:32.52
> 13:48.59
> 
> DUT2:
> FUA:
> 8:11.24
> 7:47.73
> 7:48.00
> 7:48.18
> 7:47.38
> Cache, no FUA:
> 8:10.30
> 7:48.97
> 7:48.47
> 7:47.93
> 7:44.18
> 
> DUT3:
> Cache, no FUA:
> 7:02.82
> 6:58.94
> 7:03.20
> 7:00.27
> 7:00.88
> FUA:
> 7:05.43
> 7:03.44
> 7:04.82
> 7:03.26
> 7:04.74
> 
> DUT4:
> FUA:
> 7:23.92
> 7:20.15
> 7:20.52
> 7:19.10
> 7:20.71
> Cache, no FUA:
> 7:20.23
> 7:20.48
> 7:19.94
> 7:18.90
> 7:19.88
> 
> Cache, no FUA:
> 7:19.36
> 7:02.11
> 7:01.53
> 7:01.35
> 7:00.37
> Cache, no FUA CQE:
> 7:17.55
> 7:00.73
> 6:59.25
> 6:58.44
> 6:58.60
> FUA:
> 7:15.10
> 6:58.99
> 6:58.94
> 6:59.17
> 6:60.00
> FUA CQE:
> 7:11.03
> 6:58.04
> 6:56.89
> 6:56.43
> 6:56:28
> 
> If anyone has any comments or disagrees with the benchmark, or has a
> specific eMMC to test, let me know.
> 
> Regards,
> Christian
> 
> Hyperstone GmbH | Reichenaustr. 39a  | 78467 Konstanz Managing Director:
> Dr. Jan Peter Berns.
> Commercial register of local courts: Freiburg HRB381782
Adrian Hunter March 13, 2023, 4:56 p.m. UTC | #15
On 10/03/23 19:06, Christian Löhle wrote:
> 
>>>
>>> I have benchmarked the FUA/Cache behavior a bit.
>>> I don't have an actual filesystem benchmark that does what I wanted and is easy to port to the target so I used:
>>>
>>> # call with
>>> # for loop in {1..3}; do sudo dd if=/dev/urandom bs=1M 
>>> of=/dev/mmcblk2; done; for loop in {1..5}; do time 
>>> ./filesystembenchmark.sh; umount /mnt; done
>>> mkfs.ext4 -F /dev/mmcblk2
>>> mount /dev/mmcblk2 /mnt
>>> for i in {1..3}
>>> do
>>> cp -r linux-6.2.2 /mnt/$i
>>> done
>>> for i in {1..3}
>>> do
>>> rm -r /mnt/$i
>>> done
>>> for i in {1..3}
>>> do
>>> cp -r linux-6.2.2 /mnt/$i
>>> done
>>>
>>>
>>> I found a couple of DUTs that I can link, I also tested one industrial card.
>>>
>>> DUT1: blue PCB Foresee eMMC
>>> https://pine64.com/product/32gb-emmc-module/
>>> DUT2: green PCB SiliconGo eMMC
>>> Couldn't find that one online anymore unfortunately
>>> DUT3: orange hardkernel PCB 8GB
>>> https://www.hardkernel.com/shop/8gb-emmc-module-c2-android/
>>> DUT4: orange hardkernel PCB white dot
>>> https://rlx.sk/en/odroid/3198-16gb-emmc-50-module-xu3-android-for-odro
>>> id-xu3.html
>>> DUT5: Industrial card
>>
>> Thanks a lot for helping out with testing! Much appreciated!
> 
> No problem, glad to be of help.
> 
>>
>>>
>>>
>>> The test issued 461 DO_REL_WR during one of the iterations for DUT5
>>>
>>> DUT1:
>>> Cache, no FUA:
>>> 13:04.49
>>> 13:13.82
>>> 13:30.59
>>> 13:28:13
>>> 13:20:64
>>> FUA:
>>> 13:30.32
>>> 13:36.26
>>> 13:10.86
>>> 13:32.52
>>> 13:48.59
>>>
>>> DUT2:
>>> FUA:
>>> 8:11.24
>>> 7:47.73
>>> 7:48.00
>>> 7:48.18
>>> 7:47.38
>>> Cache, no FUA:
>>> 8:10.30
>>> 7:48.97
>>> 7:48.47
>>> 7:47.93
>>> 7:44.18
>>>
>>> DUT3:
>>> Cache, no FUA:
>>> 7:02.82
>>> 6:58.94
>>> 7:03.20
>>> 7:00.27
>>> 7:00.88
>>> FUA:
>>> 7:05.43
>>> 7:03.44
>>> 7:04.82
>>> 7:03.26
>>> 7:04.74
>>>
>>> DUT4:
>>> FUA:
>>> 7:23.92
>>> 7:20.15
>>> 7:20.52
>>> 7:19.10
>>> 7:20.71
>>> Cache, no FUA:
>>> 7:20.23
>>> 7:20.48
>>> 7:19.94
>>> 7:18.90
>>> 7:19.88
>>
>> Without going into the details of the above, it seems like for DUT1, DUT2, DUT3 and DUT4 there a good reasons to why we should move forward with $subject patch.
>>
>> Do you agree?
> 
> That is a good question, that's why I just posted the data without further comment from my side.
> I was honestly expecting the difference to be much higher, given the original patch.
> If this is representative for most cards, you would require quite an unusual workload to actually notice the difference IMO.
> If there are cards where the difference is much more significant then of course a quirk would be nicer.
> On the other side I don't see why not and any improvement is a good one?
> 
>>
>>>
>>> Cache, no FUA:
>>> 7:19.36
>>> 7:02.11
>>> 7:01.53
>>> 7:01.35
>>> 7:00.37
>>> Cache, no FUA CQE:
>>> 7:17.55
>>> 7:00.73
>>> 6:59.25
>>> 6:58.44
>>> 6:58.60
>>> FUA:
>>> 7:15.10
>>> 6:58.99
>>> 6:58.94
>>> 6:59.17
>>> 6:60.00
>>> FUA CQE:
>>> 7:11.03
>>> 6:58.04
>>> 6:56.89
>>> 6:56.43
>>> 6:56:28
>>>
>>> If anyone has any comments or disagrees with the benchmark, or has a specific eMMC to test, let me know.
>>
>> If I understand correctly, for DUT5, it seems like using FUA may be slightly better than just cache-flushing, right?
> 
> That is correct, I specifically tested with this card as under the assumption that reliable write is without much additional cost, the DCMD would be slightly worse for performance and SYNC a bit worse.
> 
>>
>> For CQE, it seems like FUA could be slightly even better, at least for DUT5.  Do you know if REQ_OP_FLUSH translates into MMC_ISSUE_DCMD or MMC_ISSUE_SYNC for your case? See mmc_cqe_issue_type().
> It is SYNC (this is sdhci-of-arasan on rk3399, no DCMD), but even SYNC is not too bad here it seems, could of course be worse if the workload was less sequential.
> 
>>
>> When it comes to CQE, maybe Adrian have some additional thoughts around this? Perhaps we should keep using REQ_FUA, if we have CQE?
> Sure, I'm also interested in Adrian's take on this.

Testing an arbitrary system and looking only at individual I/Os,
which may not be representative of any use-case, resulted in
FUA always winning, see below.

All values are approximate and in microseconds.

		With FUA		Without FUA

With CQE	Reliable Write	350	Write	125
					Flush	300
		Total		350		425

Without CQE	Reliable Write	350	Write	125
		CMD13		100	CMD13	100
					Flush	300
					CMD13	100
		Total		450		625

FYI the test I was doing was:

  # cat test.sh
	#!/bin/sh

	echo "hi" > /mnt/mmc/hi.txt

	sync


  # perf record --no-bpf-event -e mmc:* -a -- ./test.sh
  # perf script --ns --deltatime


The conclusion in this case would seem to be that CQE
makes the case for removing FUA less bad.

Perhaps CQE is more common in newer eMMCs which in turn
have better FUA implementations.
Ulf Hansson March 14, 2023, 7:56 a.m. UTC | #16
On Mon, 13 Mar 2023 at 17:56, Adrian Hunter <adrian.hunter@intel.com> wrote:
>
> On 10/03/23 19:06, Christian Löhle wrote:
> >
> >>>
> >>> I have benchmarked the FUA/Cache behavior a bit.
> >>> I don't have an actual filesystem benchmark that does what I wanted and is easy to port to the target so I used:
> >>>
> >>> # call with
> >>> # for loop in {1..3}; do sudo dd if=/dev/urandom bs=1M
> >>> of=/dev/mmcblk2; done; for loop in {1..5}; do time
> >>> ./filesystembenchmark.sh; umount /mnt; done
> >>> mkfs.ext4 -F /dev/mmcblk2
> >>> mount /dev/mmcblk2 /mnt
> >>> for i in {1..3}
> >>> do
> >>> cp -r linux-6.2.2 /mnt/$i
> >>> done
> >>> for i in {1..3}
> >>> do
> >>> rm -r /mnt/$i
> >>> done
> >>> for i in {1..3}
> >>> do
> >>> cp -r linux-6.2.2 /mnt/$i
> >>> done
> >>>
> >>>
> >>> I found a couple of DUTs that I can link, I also tested one industrial card.
> >>>
> >>> DUT1: blue PCB Foresee eMMC
> >>> https://pine64.com/product/32gb-emmc-module/
> >>> DUT2: green PCB SiliconGo eMMC
> >>> Couldn't find that one online anymore unfortunately
> >>> DUT3: orange hardkernel PCB 8GB
> >>> https://www.hardkernel.com/shop/8gb-emmc-module-c2-android/
> >>> DUT4: orange hardkernel PCB white dot
> >>> https://rlx.sk/en/odroid/3198-16gb-emmc-50-module-xu3-android-for-odro
> >>> id-xu3.html
> >>> DUT5: Industrial card
> >>
> >> Thanks a lot for helping out with testing! Much appreciated!
> >
> > No problem, glad to be of help.
> >
> >>
> >>>
> >>>
> >>> The test issued 461 DO_REL_WR during one of the iterations for DUT5
> >>>
> >>> DUT1:
> >>> Cache, no FUA:
> >>> 13:04.49
> >>> 13:13.82
> >>> 13:30.59
> >>> 13:28:13
> >>> 13:20:64
> >>> FUA:
> >>> 13:30.32
> >>> 13:36.26
> >>> 13:10.86
> >>> 13:32.52
> >>> 13:48.59
> >>>
> >>> DUT2:
> >>> FUA:
> >>> 8:11.24
> >>> 7:47.73
> >>> 7:48.00
> >>> 7:48.18
> >>> 7:47.38
> >>> Cache, no FUA:
> >>> 8:10.30
> >>> 7:48.97
> >>> 7:48.47
> >>> 7:47.93
> >>> 7:44.18
> >>>
> >>> DUT3:
> >>> Cache, no FUA:
> >>> 7:02.82
> >>> 6:58.94
> >>> 7:03.20
> >>> 7:00.27
> >>> 7:00.88
> >>> FUA:
> >>> 7:05.43
> >>> 7:03.44
> >>> 7:04.82
> >>> 7:03.26
> >>> 7:04.74
> >>>
> >>> DUT4:
> >>> FUA:
> >>> 7:23.92
> >>> 7:20.15
> >>> 7:20.52
> >>> 7:19.10
> >>> 7:20.71
> >>> Cache, no FUA:
> >>> 7:20.23
> >>> 7:20.48
> >>> 7:19.94
> >>> 7:18.90
> >>> 7:19.88
> >>
> >> Without going into the details of the above, it seems like for DUT1, DUT2, DUT3 and DUT4 there a good reasons to why we should move forward with $subject patch.
> >>
> >> Do you agree?
> >
> > That is a good question, that's why I just posted the data without further comment from my side.
> > I was honestly expecting the difference to be much higher, given the original patch.
> > If this is representative for most cards, you would require quite an unusual workload to actually notice the difference IMO.
> > If there are cards where the difference is much more significant then of course a quirk would be nicer.
> > On the other side I don't see why not and any improvement is a good one?
> >
> >>
> >>>
> >>> Cache, no FUA:
> >>> 7:19.36
> >>> 7:02.11
> >>> 7:01.53
> >>> 7:01.35
> >>> 7:00.37
> >>> Cache, no FUA CQE:
> >>> 7:17.55
> >>> 7:00.73
> >>> 6:59.25
> >>> 6:58.44
> >>> 6:58.60
> >>> FUA:
> >>> 7:15.10
> >>> 6:58.99
> >>> 6:58.94
> >>> 6:59.17
> >>> 6:60.00
> >>> FUA CQE:
> >>> 7:11.03
> >>> 6:58.04
> >>> 6:56.89
> >>> 6:56.43
> >>> 6:56:28
> >>>
> >>> If anyone has any comments or disagrees with the benchmark, or has a specific eMMC to test, let me know.
> >>
> >> If I understand correctly, for DUT5, it seems like using FUA may be slightly better than just cache-flushing, right?
> >
> > That is correct, I specifically tested with this card as under the assumption that reliable write is without much additional cost, the DCMD would be slightly worse for performance and SYNC a bit worse.
> >
> >>
> >> For CQE, it seems like FUA could be slightly even better, at least for DUT5.  Do you know if REQ_OP_FLUSH translates into MMC_ISSUE_DCMD or MMC_ISSUE_SYNC for your case? See mmc_cqe_issue_type().
> > It is SYNC (this is sdhci-of-arasan on rk3399, no DCMD), but even SYNC is not too bad here it seems, could of course be worse if the workload was less sequential.
> >
> >>
> >> When it comes to CQE, maybe Adrian have some additional thoughts around this? Perhaps we should keep using REQ_FUA, if we have CQE?
> > Sure, I'm also interested in Adrian's take on this.
>
> Testing an arbitrary system and looking only at individual I/Os,
> which may not be representative of any use-case, resulted in
> FUA always winning, see below.
>
> All values are approximate and in microseconds.
>
>                 With FUA                Without FUA
>
> With CQE        Reliable Write  350     Write   125
>                                         Flush   300
>                 Total           350             425
>
> Without CQE     Reliable Write  350     Write   125
>                 CMD13           100     CMD13   100
>                                         Flush   300
>                                         CMD13   100
>                 Total           450             625
>
> FYI the test I was doing was:
>
>   # cat test.sh
>         #!/bin/sh
>
>         echo "hi" > /mnt/mmc/hi.txt
>
>         sync
>
>
>   # perf record --no-bpf-event -e mmc:* -a -- ./test.sh
>   # perf script --ns --deltatime
>
>
> The conclusion in this case would seem to be that CQE
> makes the case for removing FUA less bad.
>
> Perhaps CQE is more common in newer eMMCs which in turn
> have better FUA implementations.

Very interesting data - and thanks for helping out with tests!

A few observations and thoughts from the above.

1)
A "normal" use case would probably include additional writes (regular
writes) and I guess that could impact the flushing behavior. Maybe the
flushing becomes less heavy, if the device internally/occasionally
needs to flush its cache anyway? Or - maybe it doesn't matter at all,
because the reliable writes are triggering the cache to be flushed
too.

2)
Assuming that a reliable write is triggering the internal cache to be
flushed too, we need fewer commands to be sent/acked to
the eMMC - compared to not using FUA. This means less protocol
overhead when using FUA - and perhaps that's what your test is
actually telling us?

Kind regards
Uffe
Adrian Hunter March 14, 2023, 8:57 a.m. UTC | #17
On 14/03/23 09:56, Ulf Hansson wrote:
> On Mon, 13 Mar 2023 at 17:56, Adrian Hunter <adrian.hunter@intel.com> wrote:
>>
>> On 10/03/23 19:06, Christian Löhle wrote:
>>>
>>>>>
>>>>> I have benchmarked the FUA/Cache behavior a bit.
>>>>> I don't have an actual filesystem benchmark that does what I wanted and is easy to port to the target so I used:
>>>>>
>>>>> # call with
>>>>> # for loop in {1..3}; do sudo dd if=/dev/urandom bs=1M
>>>>> of=/dev/mmcblk2; done; for loop in {1..5}; do time
>>>>> ./filesystembenchmark.sh; umount /mnt; done
>>>>> mkfs.ext4 -F /dev/mmcblk2
>>>>> mount /dev/mmcblk2 /mnt
>>>>> for i in {1..3}
>>>>> do
>>>>> cp -r linux-6.2.2 /mnt/$i
>>>>> done
>>>>> for i in {1..3}
>>>>> do
>>>>> rm -r /mnt/$i
>>>>> done
>>>>> for i in {1..3}
>>>>> do
>>>>> cp -r linux-6.2.2 /mnt/$i
>>>>> done
>>>>>
>>>>>
>>>>> I found a couple of DUTs that I can link, I also tested one industrial card.
>>>>>
>>>>> DUT1: blue PCB Foresee eMMC
>>>>> https://pine64.com/product/32gb-emmc-module/
>>>>> DUT2: green PCB SiliconGo eMMC
>>>>> Couldn't find that one online anymore unfortunately
>>>>> DUT3: orange hardkernel PCB 8GB
>>>>> https://www.hardkernel.com/shop/8gb-emmc-module-c2-android/
>>>>> DUT4: orange hardkernel PCB white dot
>>>>> https://rlx.sk/en/odroid/3198-16gb-emmc-50-module-xu3-android-for-odro
>>>>> id-xu3.html
>>>>> DUT5: Industrial card
>>>>
>>>> Thanks a lot for helping out with testing! Much appreciated!
>>>
>>> No problem, glad to be of help.
>>>
>>>>
>>>>>
>>>>>
>>>>> The test issued 461 DO_REL_WR during one of the iterations for DUT5
>>>>>
>>>>> DUT1:
>>>>> Cache, no FUA:
>>>>> 13:04.49
>>>>> 13:13.82
>>>>> 13:30.59
>>>>> 13:28:13
>>>>> 13:20:64
>>>>> FUA:
>>>>> 13:30.32
>>>>> 13:36.26
>>>>> 13:10.86
>>>>> 13:32.52
>>>>> 13:48.59
>>>>>
>>>>> DUT2:
>>>>> FUA:
>>>>> 8:11.24
>>>>> 7:47.73
>>>>> 7:48.00
>>>>> 7:48.18
>>>>> 7:47.38
>>>>> Cache, no FUA:
>>>>> 8:10.30
>>>>> 7:48.97
>>>>> 7:48.47
>>>>> 7:47.93
>>>>> 7:44.18
>>>>>
>>>>> DUT3:
>>>>> Cache, no FUA:
>>>>> 7:02.82
>>>>> 6:58.94
>>>>> 7:03.20
>>>>> 7:00.27
>>>>> 7:00.88
>>>>> FUA:
>>>>> 7:05.43
>>>>> 7:03.44
>>>>> 7:04.82
>>>>> 7:03.26
>>>>> 7:04.74
>>>>>
>>>>> DUT4:
>>>>> FUA:
>>>>> 7:23.92
>>>>> 7:20.15
>>>>> 7:20.52
>>>>> 7:19.10
>>>>> 7:20.71
>>>>> Cache, no FUA:
>>>>> 7:20.23
>>>>> 7:20.48
>>>>> 7:19.94
>>>>> 7:18.90
>>>>> 7:19.88
>>>>
>>>> Without going into the details of the above, it seems like for DUT1, DUT2, DUT3 and DUT4 there a good reasons to why we should move forward with $subject patch.
>>>>
>>>> Do you agree?
>>>
>>> That is a good question, that's why I just posted the data without further comment from my side.
>>> I was honestly expecting the difference to be much higher, given the original patch.
>>> If this is representative for most cards, you would require quite an unusual workload to actually notice the difference IMO.
>>> If there are cards where the difference is much more significant then of course a quirk would be nicer.
>>> On the other side I don't see why not and any improvement is a good one?
>>>
>>>>
>>>>>
>>>>> Cache, no FUA:
>>>>> 7:19.36
>>>>> 7:02.11
>>>>> 7:01.53
>>>>> 7:01.35
>>>>> 7:00.37
>>>>> Cache, no FUA CQE:
>>>>> 7:17.55
>>>>> 7:00.73
>>>>> 6:59.25
>>>>> 6:58.44
>>>>> 6:58.60
>>>>> FUA:
>>>>> 7:15.10
>>>>> 6:58.99
>>>>> 6:58.94
>>>>> 6:59.17
>>>>> 6:60.00
>>>>> FUA CQE:
>>>>> 7:11.03
>>>>> 6:58.04
>>>>> 6:56.89
>>>>> 6:56.43
>>>>> 6:56:28
>>>>>
>>>>> If anyone has any comments or disagrees with the benchmark, or has a specific eMMC to test, let me know.
>>>>
>>>> If I understand correctly, for DUT5, it seems like using FUA may be slightly better than just cache-flushing, right?
>>>
>>> That is correct, I specifically tested with this card as under the assumption that reliable write is without much additional cost, the DCMD would be slightly worse for performance and SYNC a bit worse.
>>>
>>>>
>>>> For CQE, it seems like FUA could be slightly even better, at least for DUT5.  Do you know if REQ_OP_FLUSH translates into MMC_ISSUE_DCMD or MMC_ISSUE_SYNC for your case? See mmc_cqe_issue_type().
>>> It is SYNC (this is sdhci-of-arasan on rk3399, no DCMD), but even SYNC is not too bad here it seems, could of course be worse if the workload was less sequential.
>>>
>>>>
>>>> When it comes to CQE, maybe Adrian have some additional thoughts around this? Perhaps we should keep using REQ_FUA, if we have CQE?
>>> Sure, I'm also interested in Adrian's take on this.
>>
>> Testing an arbitrary system and looking only at individual I/Os,
>> which may not be representative of any use-case, resulted in
>> FUA always winning, see below.
>>
>> All values are approximate and in microseconds.
>>
>>                 With FUA                Without FUA
>>
>> With CQE        Reliable Write  350     Write   125
>>                                         Flush   300
>>                 Total           350             425
>>
>> Without CQE     Reliable Write  350     Write   125
>>                 CMD13           100     CMD13   100
>>                                         Flush   300
>>                                         CMD13   100
>>                 Total           450             625
>>
>> FYI the test I was doing was:
>>
>>   # cat test.sh
>>         #!/bin/sh
>>
>>         echo "hi" > /mnt/mmc/hi.txt
>>
>>         sync
>>
>>
>>   # perf record --no-bpf-event -e mmc:* -a -- ./test.sh
>>   # perf script --ns --deltatime
>>
>>
>> The conclusion in this case would seem to be that CQE
>> makes the case for removing FUA less bad.
>>
>> Perhaps CQE is more common in newer eMMCs which in turn
>> have better FUA implementations.
> 
> Very interesting data - and thanks for helping out with tests!
> 
> A few observations and thoughts from the above.
> 
> 1)
> A "normal" use case would probably include additional writes (regular
> writes) and I guess that could impact the flushing behavior. Maybe the
> flushing becomes less heavy, if the device internally/occasionally
> needs to flush its cache anyway? Or - maybe it doesn't matter at all,
> because the reliable writes are triggering the cache to be flushed
> too.

The sync is presumably causing an EXT4 journal commit which
seems to use REQ_PREFLUSH and REQ_FUA. That is:
	Flush (the journal to media)
	Write (the commit record) (FUA)
So it does a flush anyway.  The no-FUA case is:
	Flush (the journal to media)
	Write (the commit record)
	Flush (the commit record)
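
For reference, the extra trailing flush in the no-FUA case comes from
the block layer's flush machinery. The sketch below paraphrases
blk_flush_policy() in block/blk-flush.c (not copied verbatim):

/*
 * Paraphrase of blk_flush_policy(): with a volatile write cache but
 * no FUA advertised via blk_queue_write_cache(), a REQ_FUA write is
 * emulated as the data write followed by a post-flush.
 */
static unsigned int flush_policy_sketch(unsigned long fflags, struct request *rq)
{
        unsigned int policy = 0;

        if (blk_rq_sectors(rq))
                policy |= REQ_FSEQ_DATA;

        if (fflags & (1UL << QUEUE_FLAG_WC)) {
                if (rq->cmd_flags & REQ_PREFLUSH)
                        policy |= REQ_FSEQ_PREFLUSH;
                if (!(fflags & (1UL << QUEUE_FLAG_FUA)) &&
                    (rq->cmd_flags & REQ_FUA))
                        policy |= REQ_FSEQ_POSTFLUSH;
        }
        return policy;
}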

> 
> 2)
> Assuming that a reliable write is triggering the internal cache to be
> flushed too, then we need less number of commands to be sent/acked to
> the eMMC - compared to not using FUA. This means less protocol
> overhead when using FUA - and perhaps that's what your tests is
> actually telling us?

There is definitely less protocol overhead because the no-FUA
case has to do an extra CMD6 (flush) and CMD13.

Note also, in this case auto-CMD23 is being used, which is why
it is not listed.

Using an older system (no CQE but also auto-CMD23) resulted
in a win for no-FUA:

		With FUA		Without FUA

		Reliable Write	1200	Write	 850
		CMD13		 100	CMD13	 100
					Flush	 120
					CMD13	  65
		Total		1300		1135
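
For context on where the Flush + CMD13 pair in these tables comes
from: a REQ_OP_FLUSH ends up in mmc_flush_cache(), which (roughly
paraphrased from drivers/mmc/core/mmc_ops.c, error handling trimmed)
sets the FLUSH_CACHE byte of the EXT_CSD via CMD6 and then waits for
the card to leave busy - the CMD13 polling seen above on hosts
without hardware busy detection:

/* Rough paraphrase of mmc_flush_cache(); not the exact upstream code. */
static int mmc_flush_cache_sketch(struct mmc_host *host)
{
        struct mmc_card *card = host->card;

        if (!mmc_cache_enabled(host))
                return 0;

        /* CMD6: EXT_CSD[FLUSH_CACHE] = 1, then wait for busy to clear
         * (CMD13 polling when the host cannot observe the busy line).
         */
        return mmc_switch(card, EXT_CSD_CMD_SET_NORMAL,
                          EXT_CSD_FLUSH_CACHE, 1, CACHE_FLUSH_TIMEOUT_MS);
}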
Ulf Hansson March 16, 2023, 12:12 p.m. UTC | #18
On Tue, 14 Mar 2023 at 09:58, Adrian Hunter <adrian.hunter@intel.com> wrote:
>
> On 14/03/23 09:56, Ulf Hansson wrote:
> > On Mon, 13 Mar 2023 at 17:56, Adrian Hunter <adrian.hunter@intel.com> wrote:
> >>
> >> On 10/03/23 19:06, Christian Löhle wrote:
> >>>
> >>>>>
> >>>>> I have benchmarked the FUA/Cache behavior a bit.
> >>>>> I don't have an actual filesystem benchmark that does what I wanted and is easy to port to the target so I used:
> >>>>>
> >>>>> # call with
> >>>>> # for loop in {1..3}; do sudo dd if=/dev/urandom bs=1M
> >>>>> of=/dev/mmcblk2; done; for loop in {1..5}; do time
> >>>>> ./filesystembenchmark.sh; umount /mnt; done
> >>>>> mkfs.ext4 -F /dev/mmcblk2
> >>>>> mount /dev/mmcblk2 /mnt
> >>>>> for i in {1..3}
> >>>>> do
> >>>>> cp -r linux-6.2.2 /mnt/$i
> >>>>> done
> >>>>> for i in {1..3}
> >>>>> do
> >>>>> rm -r /mnt/$i
> >>>>> done
> >>>>> for i in {1..3}
> >>>>> do
> >>>>> cp -r linux-6.2.2 /mnt/$i
> >>>>> done
> >>>>>
> >>>>>
> >>>>> I found a couple of DUTs that I can link, I also tested one industrial card.
> >>>>>
> >>>>> DUT1: blue PCB Foresee eMMC
> >>>>> https://pine64.com/product/32gb-emmc-module/
> >>>>> DUT2: green PCB SiliconGo eMMC
> >>>>> Couldn't find that one online anymore unfortunately
> >>>>> DUT3: orange hardkernel PCB 8GB
> >>>>> https://www.hardkernel.com/shop/8gb-emmc-module-c2-android/
> >>>>> DUT4: orange hardkernel PCB white dot
> >>>>> https://rlx.sk/en/odroid/3198-16gb-emmc-50-module-xu3-android-for-odro
> >>>>> id-xu3.html
> >>>>> DUT5: Industrial card
> >>>>
> >>>> Thanks a lot for helping out with testing! Much appreciated!
> >>>
> >>> No problem, glad to be of help.
> >>>
> >>>>
> >>>>>
> >>>>>
> >>>>> The test issued 461 DO_REL_WR during one of the iterations for DUT5
> >>>>>
> >>>>> DUT1:
> >>>>> Cache, no FUA:
> >>>>> 13:04.49
> >>>>> 13:13.82
> >>>>> 13:30.59
> >>>>> 13:28:13
> >>>>> 13:20:64
> >>>>> FUA:
> >>>>> 13:30.32
> >>>>> 13:36.26
> >>>>> 13:10.86
> >>>>> 13:32.52
> >>>>> 13:48.59
> >>>>>
> >>>>> DUT2:
> >>>>> FUA:
> >>>>> 8:11.24
> >>>>> 7:47.73
> >>>>> 7:48.00
> >>>>> 7:48.18
> >>>>> 7:47.38
> >>>>> Cache, no FUA:
> >>>>> 8:10.30
> >>>>> 7:48.97
> >>>>> 7:48.47
> >>>>> 7:47.93
> >>>>> 7:44.18
> >>>>>
> >>>>> DUT3:
> >>>>> Cache, no FUA:
> >>>>> 7:02.82
> >>>>> 6:58.94
> >>>>> 7:03.20
> >>>>> 7:00.27
> >>>>> 7:00.88
> >>>>> FUA:
> >>>>> 7:05.43
> >>>>> 7:03.44
> >>>>> 7:04.82
> >>>>> 7:03.26
> >>>>> 7:04.74
> >>>>>
> >>>>> DUT4:
> >>>>> FUA:
> >>>>> 7:23.92
> >>>>> 7:20.15
> >>>>> 7:20.52
> >>>>> 7:19.10
> >>>>> 7:20.71
> >>>>> Cache, no FUA:
> >>>>> 7:20.23
> >>>>> 7:20.48
> >>>>> 7:19.94
> >>>>> 7:18.90
> >>>>> 7:19.88
> >>>>
> >>>> Without going into the details of the above, it seems like for DUT1, DUT2, DUT3 and DUT4 there a good reasons to why we should move forward with $subject patch.
> >>>>
> >>>> Do you agree?
> >>>
> >>> That is a good question, that's why I just posted the data without further comment from my side.
> >>> I was honestly expecting the difference to be much higher, given the original patch.
> >>> If this is representative for most cards, you would require quite an unusual workload to actually notice the difference IMO.
> >>> If there are cards where the difference is much more significant then of course a quirk would be nicer.
> >>> On the other side I don't see why not and any improvement is a good one?
> >>>
> >>>>
> >>>>>
> >>>>> Cache, no FUA:
> >>>>> 7:19.36
> >>>>> 7:02.11
> >>>>> 7:01.53
> >>>>> 7:01.35
> >>>>> 7:00.37
> >>>>> Cache, no FUA CQE:
> >>>>> 7:17.55
> >>>>> 7:00.73
> >>>>> 6:59.25
> >>>>> 6:58.44
> >>>>> 6:58.60
> >>>>> FUA:
> >>>>> 7:15.10
> >>>>> 6:58.99
> >>>>> 6:58.94
> >>>>> 6:59.17
> >>>>> 6:60.00
> >>>>> FUA CQE:
> >>>>> 7:11.03
> >>>>> 6:58.04
> >>>>> 6:56.89
> >>>>> 6:56.43
> >>>>> 6:56:28
> >>>>>
> >>>>> If anyone has any comments or disagrees with the benchmark, or has a specific eMMC to test, let me know.
> >>>>
> >>>> If I understand correctly, for DUT5, it seems like using FUA may be slightly better than just cache-flushing, right?
> >>>
> >>> That is correct, I specifically tested with this card as under the assumption that reliable write is without much additional cost, the DCMD would be slightly worse for performance and SYNC a bit worse.
> >>>
> >>>>
> >>>> For CQE, it seems like FUA could be slightly even better, at least for DUT5.  Do you know if REQ_OP_FLUSH translates into MMC_ISSUE_DCMD or MMC_ISSUE_SYNC for your case? See mmc_cqe_issue_type().
> >>> It is SYNC (this is sdhci-of-arasan on rk3399, no DCMD), but even SYNC is not too bad here it seems, could of course be worse if the workload was less sequential.
> >>>
> >>>>
> >>>> When it comes to CQE, maybe Adrian have some additional thoughts around this? Perhaps we should keep using REQ_FUA, if we have CQE?
> >>> Sure, I'm also interested in Adrian's take on this.
> >>
> >> Testing an arbitrary system and looking only at individual I/Os,
> >> which may not be representative of any use-case, resulted in
> >> FUA always winning, see below.
> >>
> >> All values are approximate and in microseconds.
> >>
> >>                 With FUA                Without FUA
> >>
> >> With CQE        Reliable Write  350     Write   125
> >>                                         Flush   300
> >>                 Total           350             425
> >>
> >> Without CQE     Reliable Write  350     Write   125
> >>                 CMD13           100     CMD13   100
> >>                                         Flush   300
> >>                                         CMD13   100
> >>                 Total           450             625
> >>
> >> FYI the test I was doing was:
> >>
> >>   # cat test.sh
> >>         #!/bin/sh
> >>
> >>         echo "hi" > /mnt/mmc/hi.txt
> >>
> >>         sync
> >>
> >>
> >>   # perf record --no-bpf-event -e mmc:* -a -- ./test.sh
> >>   # perf script --ns --deltatime
> >>
> >>
> >> The conclusion in this case would seem to be that CQE
> >> makes the case for removing FUA less bad.
> >>
> >> Perhaps CQE is more common in newer eMMCs which in turn
> >> have better FUA implementations.
> >
> > Very interesting data - and thanks for helping out with tests!
> >
> > A few observations and thoughts from the above.
> >
> > 1)
> > A "normal" use case would probably include additional writes (regular
> > writes) and I guess that could impact the flushing behavior. Maybe the
> > flushing becomes less heavy, if the device internally/occasionally
> > needs to flush its cache anyway? Or - maybe it doesn't matter at all,
> > because the reliable writes are triggering the cache to be flushed
> > too.
>
> The sync is presumably causing an EXT4 journal commit which
> seems to use REQ_PREFLUSH and REQ_FUA. That is:
>         Flush (the journal to media)
>         Write (the commit record) (FUA)
> So it does a flush anyway.  The no-FUA case is:
>         Flush (the journal to media)
>         Write (the commit record)
>         Flush (the commit record)
>
> >
> > 2)
> > Assuming that a reliable write is triggering the internal cache to be
> > flushed too, then we need less number of commands to be sent/acked to
> > the eMMC - compared to not using FUA. This means less protocol
> > overhead when using FUA - and perhaps that's what your tests is
> > actually telling us?
>
> There is definitely less protocol overhead because the no-FUA
> case has to do an extra CMD6 (flush) and CMD13.
>
> Note also, in this case auto-CMD23 is being used, which is why
> is is not listed.
>
> Using an older system (no CQE but also auto-CMD23), resulted
> in a win for no-FUA:
>
>                 With FUA                Without FUA
>
>                 Reliable Write  1200    Write    850
>                 CMD13            100    CMD13    100
>                                         Flush    120
>                                         CMD13     65
>                 Total           1300            1135
>
>

Alright, so it seems like just checking whether the cache control
feature is available isn't sufficient when deciding to avoid FUA.

That said, in the next version I am going to add a card quirk, which
needs to be set too, to avoid FUA. Then we can see which cards it
should actually be set for.
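
To make that concrete, the quirk-gated variant could look roughly
like the sketch below. Note that MMC_QUIRK_AVOID_REL_WR and the
helper are placeholder names for illustration, not the actual
follow-up patch:

/* Illustrative only: the quirk bit and helper name are placeholders. */
#define MMC_QUIRK_AVOID_REL_WR  (1 << 30)       /* hypothetical quirk bit */

static bool mmc_blk_avoid_fua(struct mmc_card *card)
{
        /*
         * Only drop REQ_FUA when the card has a flushable internal
         * cache *and* has been flagged as having costly reliable
         * writes.
         */
        return mmc_cache_enabled(card->host) &&
               (card->quirks & MMC_QUIRK_AVOID_REL_WR);
}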

What eMMC cards did you use in your tests above?

Kind regards
Uffe
Adrian Hunter March 16, 2023, 12:44 p.m. UTC | #19
On 16/03/23 14:12, Ulf Hansson wrote:
> On Tue, 14 Mar 2023 at 09:58, Adrian Hunter <adrian.hunter@intel.com> wrote:
>>
>> On 14/03/23 09:56, Ulf Hansson wrote:
>>> On Mon, 13 Mar 2023 at 17:56, Adrian Hunter <adrian.hunter@intel.com> wrote:
>>>>
>>>> On 10/03/23 19:06, Christian Löhle wrote:
>>>>>
>>>>>>>
>>>>>>> I have benchmarked the FUA/Cache behavior a bit.
>>>>>>> I don't have an actual filesystem benchmark that does what I wanted and is easy to port to the target so I used:
>>>>>>>
>>>>>>> # call with
>>>>>>> # for loop in {1..3}; do sudo dd if=/dev/urandom bs=1M
>>>>>>> of=/dev/mmcblk2; done; for loop in {1..5}; do time
>>>>>>> ./filesystembenchmark.sh; umount /mnt; done
>>>>>>> mkfs.ext4 -F /dev/mmcblk2
>>>>>>> mount /dev/mmcblk2 /mnt
>>>>>>> for i in {1..3}
>>>>>>> do
>>>>>>> cp -r linux-6.2.2 /mnt/$i
>>>>>>> done
>>>>>>> for i in {1..3}
>>>>>>> do
>>>>>>> rm -r /mnt/$i
>>>>>>> done
>>>>>>> for i in {1..3}
>>>>>>> do
>>>>>>> cp -r linux-6.2.2 /mnt/$i
>>>>>>> done
>>>>>>>
>>>>>>>
>>>>>>> I found a couple of DUTs that I can link, I also tested one industrial card.
>>>>>>>
>>>>>>> DUT1: blue PCB Foresee eMMC
>>>>>>> https://pine64.com/product/32gb-emmc-module/
>>>>>>> DUT2: green PCB SiliconGo eMMC
>>>>>>> Couldn't find that one online anymore unfortunately
>>>>>>> DUT3: orange hardkernel PCB 8GB
>>>>>>> https://www.hardkernel.com/shop/8gb-emmc-module-c2-android/
>>>>>>> DUT4: orange hardkernel PCB white dot
>>>>>>> https://rlx.sk/en/odroid/3198-16gb-emmc-50-module-xu3-android-for-odro
>>>>>>> id-xu3.html
>>>>>>> DUT5: Industrial card
>>>>>>
>>>>>> Thanks a lot for helping out with testing! Much appreciated!
>>>>>
>>>>> No problem, glad to be of help.
>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> The test issued 461 DO_REL_WR during one of the iterations for DUT5
>>>>>>>
>>>>>>> DUT1:
>>>>>>> Cache, no FUA:
>>>>>>> 13:04.49
>>>>>>> 13:13.82
>>>>>>> 13:30.59
>>>>>>> 13:28:13
>>>>>>> 13:20:64
>>>>>>> FUA:
>>>>>>> 13:30.32
>>>>>>> 13:36.26
>>>>>>> 13:10.86
>>>>>>> 13:32.52
>>>>>>> 13:48.59
>>>>>>>
>>>>>>> DUT2:
>>>>>>> FUA:
>>>>>>> 8:11.24
>>>>>>> 7:47.73
>>>>>>> 7:48.00
>>>>>>> 7:48.18
>>>>>>> 7:47.38
>>>>>>> Cache, no FUA:
>>>>>>> 8:10.30
>>>>>>> 7:48.97
>>>>>>> 7:48.47
>>>>>>> 7:47.93
>>>>>>> 7:44.18
>>>>>>>
>>>>>>> DUT3:
>>>>>>> Cache, no FUA:
>>>>>>> 7:02.82
>>>>>>> 6:58.94
>>>>>>> 7:03.20
>>>>>>> 7:00.27
>>>>>>> 7:00.88
>>>>>>> FUA:
>>>>>>> 7:05.43
>>>>>>> 7:03.44
>>>>>>> 7:04.82
>>>>>>> 7:03.26
>>>>>>> 7:04.74
>>>>>>>
>>>>>>> DUT4:
>>>>>>> FUA:
>>>>>>> 7:23.92
>>>>>>> 7:20.15
>>>>>>> 7:20.52
>>>>>>> 7:19.10
>>>>>>> 7:20.71
>>>>>>> Cache, no FUA:
>>>>>>> 7:20.23
>>>>>>> 7:20.48
>>>>>>> 7:19.94
>>>>>>> 7:18.90
>>>>>>> 7:19.88
>>>>>>
>>>>>> Without going into the details of the above, it seems like for DUT1, DUT2, DUT3 and DUT4 there a good reasons to why we should move forward with $subject patch.
>>>>>>
>>>>>> Do you agree?
>>>>>
>>>>> That is a good question, that's why I just posted the data without further comment from my side.
>>>>> I was honestly expecting the difference to be much higher, given the original patch.
>>>>> If this is representative for most cards, you would require quite an unusual workload to actually notice the difference IMO.
>>>>> If there are cards where the difference is much more significant then of course a quirk would be nicer.
>>>>> On the other side I don't see why not and any improvement is a good one?
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Cache, no FUA:
>>>>>>> 7:19.36
>>>>>>> 7:02.11
>>>>>>> 7:01.53
>>>>>>> 7:01.35
>>>>>>> 7:00.37
>>>>>>> Cache, no FUA CQE:
>>>>>>> 7:17.55
>>>>>>> 7:00.73
>>>>>>> 6:59.25
>>>>>>> 6:58.44
>>>>>>> 6:58.60
>>>>>>> FUA:
>>>>>>> 7:15.10
>>>>>>> 6:58.99
>>>>>>> 6:58.94
>>>>>>> 6:59.17
>>>>>>> 6:60.00
>>>>>>> FUA CQE:
>>>>>>> 7:11.03
>>>>>>> 6:58.04
>>>>>>> 6:56.89
>>>>>>> 6:56.43
>>>>>>> 6:56:28
>>>>>>>
>>>>>>> If anyone has any comments or disagrees with the benchmark, or has a specific eMMC to test, let me know.
>>>>>>
>>>>>> If I understand correctly, for DUT5, it seems like using FUA may be slightly better than just cache-flushing, right?
>>>>>
>>>>> That is correct, I specifically tested with this card as under the assumption that reliable write is without much additional cost, the DCMD would be slightly worse for performance and SYNC a bit worse.
>>>>>
>>>>>>
>>>>>> For CQE, it seems like FUA could be slightly even better, at least for DUT5.  Do you know if REQ_OP_FLUSH translates into MMC_ISSUE_DCMD or MMC_ISSUE_SYNC for your case? See mmc_cqe_issue_type().
>>>>> It is SYNC (this is sdhci-of-arasan on rk3399, no DCMD), but even SYNC is not too bad here it seems, could of course be worse if the workload was less sequential.
>>>>>
>>>>>>
>>>>>> When it comes to CQE, maybe Adrian have some additional thoughts around this? Perhaps we should keep using REQ_FUA, if we have CQE?
>>>>> Sure, I'm also interested in Adrian's take on this.
>>>>
>>>> Testing an arbitrary system and looking only at individual I/Os,
>>>> which may not be representative of any use-case, resulted in
>>>> FUA always winning, see below.
>>>>
>>>> All values are approximate and in microseconds.
>>>>
>>>>                 With FUA                Without FUA
>>>>
>>>> With CQE        Reliable Write  350     Write   125
>>>>                                         Flush   300
>>>>                 Total           350             425
>>>>
>>>> Without CQE     Reliable Write  350     Write   125
>>>>                 CMD13           100     CMD13   100
>>>>                                         Flush   300
>>>>                                         CMD13   100
>>>>                 Total           450             625
>>>>
>>>> FYI the test I was doing was:
>>>>
>>>>   # cat test.sh
>>>>         #!/bin/sh
>>>>
>>>>         echo "hi" > /mnt/mmc/hi.txt
>>>>
>>>>         sync
>>>>
>>>>
>>>>   # perf record --no-bpf-event -e mmc:* -a -- ./test.sh
>>>>   # perf script --ns --deltatime
>>>>
>>>>
>>>> The conclusion in this case would seem to be that CQE
>>>> makes the case for removing FUA less bad.
>>>>
>>>> Perhaps CQE is more common in newer eMMCs which in turn
>>>> have better FUA implementations.
>>>
>>> Very interesting data - and thanks for helping out with tests!
>>>
>>> A few observations and thoughts from the above.
>>>
>>> 1)
>>> A "normal" use case would probably include additional writes (regular
>>> writes) and I guess that could impact the flushing behavior. Maybe the
>>> flushing becomes less heavy, if the device internally/occasionally
>>> needs to flush its cache anyway? Or - maybe it doesn't matter at all,
>>> because the reliable writes are triggering the cache to be flushed
>>> too.
>>
>> The sync is presumably causing an EXT4 journal commit which
>> seems to use REQ_PREFLUSH and REQ_FUA. That is:
>>         Flush (the journal to media)
>>         Write (the commit record) (FUA)
>> So it does a flush anyway.  The no-FUA case is:
>>         Flush (the journal to media)
>>         Write (the commit record)
>>         Flush (the commit record)
>>
>>>
>>> 2)
>>> Assuming that a reliable write is triggering the internal cache to be
>>> flushed too, then we need less number of commands to be sent/acked to
>>> the eMMC - compared to not using FUA. This means less protocol
>>> overhead when using FUA - and perhaps that's what your tests is
>>> actually telling us?
>>
>> There is definitely less protocol overhead because the no-FUA
>> case has to do an extra CMD6 (flush) and CMD13.
>>
>> Note also, in this case auto-CMD23 is being used, which is why
>> is is not listed.
>>
>> Using an older system (no CQE but also auto-CMD23), resulted
>> in a win for no-FUA:
>>
>>                 With FUA                Without FUA
>>
>>                 Reliable Write  1200    Write    850
>>                 CMD13            100    CMD13    100
>>                                         Flush    120
>>                                         CMD13     65
>>                 Total           1300            1135
>>
>>
> 
> Alright, so it seems like just checking whether the cache control
> feature is available, isn't sufficient when deciding to avoid FUA.
> 
> That said, in the next version I am going to add a card quirk, which
> needs to be set too, to avoid FUA. Then we can see for what cards it
> should actually be set for.
> 
> What eMMC cards did you use in your tests above?

I don't know the part numbers:

	Manu ID 	Name	Date		Machine
Newer:	0x11 (Toshiba)	064G30	11/2019		Jasper Lake
Older:	0x45		SEM128	07/2014		Lenovo Thinkpad 10
diff mbox series

Patch

diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 672ab90c4b2d..2a49531bf023 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -2490,15 +2490,20 @@  static struct mmc_blk_data *mmc_blk_alloc_req(struct mmc_card *card,
 			md->flags |= MMC_BLK_CMD23;
 	}
 
-	if (md->flags & MMC_BLK_CMD23 &&
-	    ((card->ext_csd.rel_param & EXT_CSD_WR_REL_PARAM_EN) ||
-	     card->ext_csd.rel_sectors)) {
+	/*
+	 * REQ_FUA is supported through eMMC reliable writes, which has been
+	 * reported to be quite costly for some eMMCs. Therefore, let's rely
+	 * on flush requests (REQ_OP_FLUSH), if an internal cache is supported.
+	 */
+	if (mmc_cache_enabled(card->host)) {
+		cache_enabled  = true;
+	} else if (md->flags & MMC_BLK_CMD23 &&
+		  (card->ext_csd.rel_param & EXT_CSD_WR_REL_PARAM_EN ||
+		   card->ext_csd.rel_sectors)) {
 		md->flags |= MMC_BLK_REL_WR;
 		fua_enabled = true;
 		cache_enabled = true;
 	}
-	if (mmc_cache_enabled(card->host))
-		cache_enabled  = true;
 
 	blk_queue_write_cache(md->queue.queue, cache_enabled, fua_enabled);