[RFC,0/6] mmc: Add clock scaling support for mmc driver

Message ID 1571668177-3766-1-git-send-email-rampraka@codeaurora.org (mailing list archive)

Message

Ram Prakash Gupta Oct. 21, 2019, 2:29 p.m. UTC
This change adds devfreq-based clock scaling support to MMC.
It is applicable to both eMMC and SD cards.
For some workloads, such as video playback, it isn't necessary
for these cards to run at high speed. Running at a lower
frequency in such cases can still meet the deadlines for data
transfers.

Scaling down the clock frequency dynamically saves power,
not only because the bus runs at a lower frequency, but also
because it allows scaling down the system core voltage, where
supported. Provide ondemand clock scaling support similar
to the cpufreq ondemand governor, with two thresholds,
up_threshold and down_threshold, to decide whether to increase
the frequency or scale it down as per load.
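
To illustrate the intended behaviour, here is a minimal sketch of the
two-threshold decision (illustrative only, not the actual series code;
the structure, field names and example thresholds are made up):

/* Load is the percentage of the devfreq polling window in which
 * the bus was busy. */
struct mmc_clk_scaling {
	unsigned long	busy_time_us;
	unsigned long	total_time_us;
	unsigned int	up_threshold;	/* e.g. 35 (%) */
	unsigned int	down_threshold;	/* e.g. 5 (%) */
	unsigned long	curr_freq;
	unsigned long	min_freq;
	unsigned long	max_freq;
};

static unsigned long mmc_scaling_next_freq(struct mmc_clk_scaling *scaling)
{
	unsigned int load;

	if (!scaling->total_time_us)
		return scaling->curr_freq;

	load = (100 * scaling->busy_time_us) / scaling->total_time_us;

	if (load > scaling->up_threshold)
		return scaling->max_freq;	/* bursty I/O: scale up */
	if (load < scaling->down_threshold)
		return scaling->min_freq;	/* light I/O: scale down */
	return scaling->curr_freq;
}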


Ram Prakash Gupta (6):
  mmc: core: Parse clk scaling dt entries
  mmc: core: Add core scaling support in driver
  mmc: core: Initialize clk scaling for mmc and SDCard
  mmc: core: Add debugfs entries for scaling support
  mmc: sdhci-msm: Add capability in platform host
  dt-bindings: mmc: sdhci-msm: Add clk scaling dt parameters

 .../devicetree/bindings/mmc/sdhci-msm.txt          |  19 +
 drivers/mmc/core/block.c                           |  19 +-
 drivers/mmc/core/core.c                            | 777 +++++++++++++++++++++
 drivers/mmc/core/core.h                            |  17 +
 drivers/mmc/core/debugfs.c                         |  90 +++
 drivers/mmc/core/host.c                            | 226 ++++++
 drivers/mmc/core/mmc.c                             | 246 ++++++-
 drivers/mmc/core/queue.c                           |   2 +
 drivers/mmc/core/sd.c                              |  84 ++-
 drivers/mmc/host/sdhci-msm.c                       |   2 +
 include/linux/mmc/card.h                           |   7 +
 include/linux/mmc/host.h                           |  66 ++
 12 files changed, 1550 insertions(+), 5 deletions(-)

Comments

Ulf Hansson Oct. 22, 2019, 8:40 a.m. UTC | #1
On Mon, 21 Oct 2019 at 16:30, Ram Prakash Gupta <rampraka@codeaurora.org> wrote:
>
> This change adds devfreq-based clock scaling support to MMC.
> It is applicable to both eMMC and SD cards.
> For some workloads, such as video playback, it isn't necessary
> for these cards to run at high speed. Running at a lower
> frequency in such cases can still meet the deadlines for data
> transfers.
>
> Scaling down the clock frequency dynamically saves power,
> not only because the bus runs at a lower frequency, but also
> because it allows scaling down the system core voltage, where
> supported. Provide ondemand clock scaling support similar
> to the cpufreq ondemand governor, with two thresholds,
> up_threshold and down_threshold, to decide whether to increase
> the frequency or scale it down as per load.

This sounds simple, but what the series is doing is far more
complicated than just scaling the bus clock, as it also re-negotiates the
bus speed mode.

Each time the triggering point for scaling up/down is hit, a
series of commands needs to be sent to the card, including running the
tuning procedure. The point is, for sure, this doesn't come for free,
both from a latency point of view, but also from an energy cost point
of view. So, whether this really improves the behaviour seems very
use-case sensitive, right!?

Overall, when it comes to use cases, we have very limited knowledge
about them at the mmc block layer level. I think it should remain like
that. If at any place at all, this information is better maintained by
the generic block layer and potentially the configured I/O scheduler.

This brings me to a question about the tests you have run. Can you
share some information and data about that?

>
>
> Ram Prakash Gupta (6):
>   mmc: core: Parse clk scaling dt entries
>   mmc: core: Add core scaling support in driver
>   mmc: core: Initialize clk scaling for mmc and SDCard
>   mmc: core: Add debugfs entries for scaling support
>   mmc: sdhci-msm: Add capability in platform host
>   dt-bindings: mmc: sdhci-msm: Add clk scaling dt parameters
>
>  .../devicetree/bindings/mmc/sdhci-msm.txt          |  19 +

I noticed that the DT patch was put last in the series, but the
parsing is implemented in the first patch. Please flip this around. If
you want to implement DT parsing of new bindings, please make sure to
discuss the new bindings first.

>  drivers/mmc/core/block.c                           |  19 +-
>  drivers/mmc/core/core.c                            | 777 +++++++++++++++++++++
>  drivers/mmc/core/core.h                            |  17 +
>  drivers/mmc/core/debugfs.c                         |  90 +++
>  drivers/mmc/core/host.c                            | 226 ++++++
>  drivers/mmc/core/mmc.c                             | 246 ++++++-
>  drivers/mmc/core/queue.c                           |   2 +
>  drivers/mmc/core/sd.c                              |  84 ++-
>  drivers/mmc/host/sdhci-msm.c                       |   2 +
>  include/linux/mmc/card.h                           |   7 +
>  include/linux/mmc/host.h                           |  66 ++
>  12 files changed, 1550 insertions(+), 5 deletions(-)

This is a lot of new code in the mmc core, which I would then need to
maintain, of course. I have to admit, I am a bit concerned about that,
so you have to convince me that there are good reasons for me to apply
this.

As I stated above, I think the approach looks quite questionable in
general. And even if you can share measurements showing that it improves
the behaviour, I suspect (without a deeper code review) that some of the
code better belongs in the common block device layer.

Kind regards
Uffe
Doug Anderson Nov. 15, 2019, 5:39 p.m. UTC | #2
Hi,

On Fri, Nov 15, 2019 at 3:13 AM Ram Prakash Gupta
<rampraka@codeaurora.org> wrote:
>
> Each time the triggering point for scaling up/down is hit, a
> series of commands needs to be sent to the card, including running the
> tuning procedure. The point is, for sure, this doesn't come for free,
> both from a latency point of view, but also from an energy cost point
> of view. So, whether this really improves the behaviour seems very
> use-case sensitive, right!?
>
> With clock scaling support, the device mode would be switched between a low
> speed mode (HS50 or DDR52) and a high speed mode (HS400 with enhanced strobe).

I haven't read the patches, but just from this description it feels
like the wrong way to go.  From my understanding if you're running at
HS400 then you can run the card at any speed between 0 and 200 MHz.  I
see no reason why you'd want to switch modes.  Just stay at HS400 and
slow down the clock, right?  Then you can keep the "Enhanced Strobe"
which makes everything more reliable.
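
Roughly what I mean, as an untested sketch (mmc_set_clock() is the
existing core helper; the wrapper itself is hypothetical):

/* Hypothetical: scale only the bus clock, staying in the current
 * speed mode, so no CMD6 mode switch or re-tune is needed. */
static void mmc_scale_bus_clock(struct mmc_host *host, unsigned int hz)
{
	if (hz > host->f_max)
		hz = host->f_max;	/* respect the mode's ceiling */
	mmc_set_clock(host, hz);
}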

If you're running on a system that doesn't have enhanced strobe you
_should_ still be able to switch clock speeds without switching modes
since the spec has just a _maximum_ clock frequency for each mode, so
HS200, DDR50, etc should all be able to run at slower speeds without
an official mode switch.  You also shouldn't have to re-tune.  Tuning
is nothing magical, it's just trying to find the delay between sending
a pulse on the clock line and when you read the data sent by the other
side.  Assuming your tuning delay is natively represented in "phase
offset", you can convert that back to an actual delay and then back to
"phase offset" for the new clock.

To further argue against switching modes, I would also note that for
SD cards switching to a slower speed mode may result in an increase in
IO voltage, which seems like it could be bad for power consumption?


> And from an energy point of view, this feature only helps in saving energy
> and does not add any energy penalty. And we have observed a considerable
> amount of power saving (data shared below) when playing 1080p video
> with the clock scaling feature.

I am slightly baffled about why this would save energy unless it
allows you to lower the voltage to the controller.  I know you _can_
sometimes lower the voltage to the controller on Qualcomm parts, but
you are arguing that it is useful even on systems that can't lower the
voltage.  That feels slightly questionable.  I would expect that
racing to idle (with the right tuning parameters) would be a better
way to go.

As a specific example, let's imagine we want to transfer 1000 MB of
data and we have 20 seconds with which to do it.  We can achieve this
by transferring at 50 MB/s for the whole 20 seconds.  Alternatively, we
could transfer at 400 MB/s for 2.5 seconds and then fully power gate the
eMMC for the next 17.5 seconds.

In that example, I'd ask which is more power efficient.  Presumably
the 2nd.  This is the whole "race to idle" concept as I understand it.

The "race to idle" is talked about a lot in the context of CPU
frequency decisions.  Presumably you'll point out that "race to idle"
is NOT the right thing to do for choosing a CPU frequency.  As I
understand it, this is primarily true because we need to raise the CPU
voltage to run at faster speeds.  This would lead me to believe that
the only case you'd want to do frequency scaling like this is if it
allows you to lower the voltage provided to the eMMC controller.  As
you said, for Qualcomm it _does_ allow you to do this, but most other
SoCs don't.  ...so unless there's a flaw in my logic (always
possible!) this patch series should be amended to say it's only useful
if it allows you to down-volt the controller.

Just to think a little bit more about my logic: of course, you might
argue that we can't just do a 1000 MB data transfer.  We can break
that down into two cases:

a) For writing, presumably the data is produced over time and you
don't want to buffer the whole 1000 MB and delay 17.5 seconds before
you start writing.  ...but presumably you could do _some_ buffering
and then break things into chunks where you ungate the clock to the
card, write a chunk out, and then re-gate the clock.  There will
obviously be some overhead with each clock gate/ungate, but
(hopefully) not too much.  ...and there will be time when data is in
RAM and not on the disk so you'd worry about power failure, but if you
want to get data on the disk ASAP why are you scaling the clock (and
thus delaying the data from getting to the disk) at all?  Maybe some
math?  How long does it take to ungate/gate the clocks?  1 ms?  It's
just flipping a bit, right?  ...and does assuming we can allocate a 40
MB write buffer seem sane?  So we eat 1 ms to ungate, 100 ms to
transfer 40 MB, 1 ms to gate.  Compared to transferring at 50 MB/sec
(w/ no gating), we'd transfer the same 40 MB in 800 ms.  So we either
keep the clock on at 50 MHz for 800 ms or we keep it on at 200 MHz for
102 ms and gate it for 698 ms.

b) If you're reading data then hopefully the system has some sort of
readahead going on.  In the "video playback" case the system should
have no problem predicting that if you've just read bytes
1,000,000,000 - 2,000,000,000 of a file over the last 10 seconds that
you're likely to keep reading the same file.  Presumably it wouldn't
be totally insane to read 40 MB ahead of time and then we can do the
same math as a).  If 40 MB is too much for readahead, then shrink it
and redo the math.  Even with much smaller numbers the "race to idle"
wins because gating / ungating clocks is fast.  ...or do you know some
reason why gating / ungating clocks needs to be slow?  If so, how
slow?
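
To sanity-check the math in a) as a standalone calculation (the 1 ms
gate/ungate cost is my assumption from above):

#include <stdio.h>

int main(void)
{
	const double chunk_mb = 40.0;			/* write buffer */
	const double gate_ms = 1.0, ungate_ms = 1.0;	/* assumed cost */

	/* 50 MB/s with no gating vs 400 MB/s with gating */
	double slow_on_ms = chunk_mb / 50.0 * 1000.0;
	double fast_on_ms = ungate_ms + chunk_mb / 400.0 * 1000.0 + gate_ms;

	printf("50 MHz: clock on for %.0f ms\n", slow_on_ms);	  /* 800 */
	printf("200 MHz: clock on for %.0f ms, gated for %.0f ms\n",
	       fast_on_ms, slow_on_ms - fast_on_ms);		  /* 102, 698 */
	return 0;
}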


> The test cases used are 1080p and 4K video playback. Please find below
> the test case information and power impact data. In both of the video
> playback cases, a considerable amount of power savings can be observed
> with the clock scaling feature.
>
> Use case                            Delta at battery (mA)  Power impact %
> 30 fps at HD 1080p decode 20 Mbps   10                     11%
> 30 fps at UHD 8b H.264 42 Mbps      20.93                  19%

Numbers like this are exactly what is needed to justify your patch
series.  ...but I'd be super curious how it would compare to:

1) Tuning the runtime PM auto-suspend delay.  If you have your
auto-suspend delay set wrong (like 500 ms) then all the math above is
totally wrong.  We'll keep clocking at 400 MHz needlessly even though
there is no data to transfer.  If autosuspend is just gating clocks
then it feels like you could set it to 1 ms, or 10 ms.  NOTE: if
autosuspend for you is something more major (fully turning off power
rails to the eMMC) then maybe you need another level where you just
turn off the clocks.  Seems like we could find some way to make that
work.
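
In code terms this is just the standard runtime PM knobs (the 10 ms
value below is illustrative):

#include <linux/pm_runtime.h>

/* In the host driver's probe: if runtime suspend only gates clocks,
 * a short autosuspend delay stops the clock soon after the last I/O. */
static void host_setup_autosuspend(struct device *dev)
{
	pm_runtime_set_autosuspend_delay(dev, 10);	/* ms, illustrative */
	pm_runtime_use_autosuspend(dev);
	pm_runtime_enable(dev);
}

/* On each completed request, re-arm the autosuspend timer. */
static void host_request_done(struct device *dev)
{
	pm_runtime_mark_last_busy(dev);
	pm_runtime_put_autosuspend(dev);
}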

2) Tuning any readahead mechanism in your system.  If your system
somehow does zero readahead then obviously all my arguments don't work
for reads.  ...but why would you not have readahead?

3) Tuning any write buffering in your system.  Same argument as #2.

4) Making sure that when the MMC card clock is gated that you request
the lowest voltage level for the controller (and set the controller's
bus clock to the lowest level since it's not doing anything).


I would also be very interested to know how much of those savings are
achieved if you keep the voltage to the MMC controller the same.  In
other words do something that arbitrarily keeps the MMC controller
requesting the same voltage level from the rest of the system and then
do your power measurements.  How much do your savings change?


I will also note that aggressive clock gating is exactly what the
dw_mmc controller does automatically (except for SDIO, but that's a
different story).  Seeing that the controller itself can stop the
clock in dw_mmc gives further credence that gating / ungating the
clock is a very lightweight operation and 1 ms is probably an
over-estimation of how long it takes.


I guess one very last note is that I spent most of the time above
arguing that the clock scaling concept is probably not useful for any
SoCs where you can't adjust the voltage provided to the MMC
controller.  That doesn't necessarily mean that your patch series is
useful for SoCs where you can.  Specifically you'd need to do math to
see how much more power the MMC controller takes at the higher
voltage.  Then you can calculate a "perf per watt".  If the watts to
transfer data at 400 MB/s aren't 8 times more than the watts to
transfer at 50 MB/s then that's a ding against your idea.  You also
don't have a dedicated voltage rail, right?  So you'd have to see what
percentage of the time the MMC controller was the only thing in the
system that was requesting the higher voltage.  If it happened that
something else in the system was keeping the voltage higher anyway
then it would be better for you to run faster and race to idle.
Really I guess the case you're most worried about is if the MMC
controller is causing other components in the system to be at a higher
voltage point (and thus take up more power)...


Hope that all makes sense.  I can read the patches themselves if
needed--everything from above is just based on your cover letter and
discussion with Ulf.  ;-)


-Doug
Ram Prakash Gupta Nov. 29, 2019, 7:34 a.m. UTC | #3
Hi Ulf,

There seems to be a settings issue with my Thunderbird application.
Sorry for the spam; please ignore my last responses, as unsupported
characters were added.

Typing my response again from the browser and re-sending.

Thanks,
Ram

On 2019-10-22 14:10, Ulf Hansson wrote:
> On Mon, 21 Oct 2019 at 16:30, Ram Prakash Gupta 
> <rampraka@codeaurora.org> wrote:
>> 
>> This change adds devfreq-based clock scaling support to MMC.
>> It is applicable to both eMMC and SD cards.
>> For some workloads, such as video playback, it isn't necessary
>> for these cards to run at high speed. Running at a lower
>> frequency in such cases can still meet the deadlines for data
>> transfers.
>> 
>> Scaling down the clock frequency dynamically saves power,
>> not only because the bus runs at a lower frequency, but also
>> because it allows scaling down the system core voltage, where
>> supported. Provide ondemand clock scaling support similar
>> to the cpufreq ondemand governor, with two thresholds,
>> up_threshold and down_threshold, to decide whether to increase
>> the frequency or scale it down as per load.
> 
> This sounds simple, but what the series is doing is far more
> complicated than just scaling the bus clock, as it also re-negotiates the
> bus speed mode.
> 
> Each time the triggering point for scaling up/down is hit, a
> series of commands needs to be sent to the card, including running the
> tuning procedure. The point is, for sure, this doesn't come for free,
> both from a latency point of view, but also from an energy cost point
> of view. So, whether this really improves the behaviour seems very
> use-case sensitive, right!?

Switching modes would incur some latency for sending the mode-switch
commands, but tuning is not needed, as most eMMC devices used nowadays
come with enhanced strobe support; tuning would not add any latency
since it is not required in HS400 enhanced strobe mode.

This feature is implemented for the video playback case, where data
transfer requests are sparse and the feature helps save power.

And when there is a burst of data transfer requests, the load will remain
_high_, so there won't be any switching, and hence it won't affect any
existing use cases from a latency point of view.

Also, if the hardware supports switching the clock frequency without
changing the mode, I will change the code accordingly. I have sought
input from the hardware team on this.

From the collected data, I see this feature helping to save power,
and no energy penalty is observed. Please share anything specific that
I am missing. The power saving from this feature is considerable.
Please find the data below.

Use case                            Delta at battery (mA)  Power impact %
30 fps at HD 1080p decode 20 Mbps   10                     11%
30 fps at UHD 8b H.264 42 Mbps      20.93                  19%

> 
> Overall, when it comes to use cases, we have very limited knowledge
> about them at the mmc block layer level. I think it should remain like
> that. If at any place at all, this information is better maintained by
> the generic block layer and potentially the configured I/O scheduler.

I think the generic block layer does not have knowledge of the use case
behind a data transfer request. And the devfreq framework has been used
to implement this feature, which would be the same in any layer.

Also, mobile platforms come mostly with eMMC and UFS as storage media,
and clock scaling is already implemented in the upstream UFS driver using
the devfreq framework. Along similar lines, this feature is implemented
for the mmc driver. So I believe the clock scaling implementation is
better placed in the mmc driver rather than in the generic block layer.
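
For reference, the devfreq hookup is along these lines (sketch only; the
mmc_devfreq_* callbacks and the host->devfreq field are placeholders,
but devfreq_add_device() and struct devfreq_simple_ondemand_data are the
framework's real API, as used by the UFS driver):

#include <linux/devfreq.h>

static struct devfreq_simple_ondemand_data mmc_ondemand_data = {
	.upthreshold		= 35,	/* %, illustrative */
	.downdifferential	= 5,	/* %, illustrative */
};

static struct devfreq_dev_profile mmc_devfreq_profile = {
	.polling_ms	= 100,				/* illustrative */
	.target		= mmc_devfreq_target,		/* placeholder */
	.get_dev_status	= mmc_devfreq_get_status,	/* placeholder */
};

/* Registration, mirroring what the UFS driver does for clock scaling: */
host->devfreq = devfreq_add_device(mmc_dev(host), &mmc_devfreq_profile,
				   DEVFREQ_GOV_SIMPLE_ONDEMAND,
				   &mmc_ondemand_data);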

> 
> This brings me to a question about the tests you have run. Can you
> share some information and data about that?

The test cases used are 1080p and 4K video playback, as this feature
is implemented specifically for the video playback use case.
> 
>> 
>> 
>> Ram Prakash Gupta (6):
>>   mmc: core: Parse clk scaling dt entries
>>   mmc: core: Add core scaling support in driver
>>   mmc: core: Initialize clk scaling for mmc and SDCard
>>   mmc: core: Add debugfs entries for scaling support
>>   mmc: sdhci-msm: Add capability in platform host
>>   dt-bindings: mmc: sdhci-msm: Add clk scaling dt parameters
>> 
>>  .../devicetree/bindings/mmc/sdhci-msm.txt          |  19 +
> 
> I noticed that the DT patch was put last in the series, but the
> parsing is implemented in the first patch. Please flip this around. If
> you want to implement DT parsing of new bindings, please make sure to
> discuss the new bindings first.

I will update this in the next post.

> 
>>  drivers/mmc/core/block.c                           |  19 +-
>>  drivers/mmc/core/core.c                            | 777 +++++++++++++++++++++
>>  drivers/mmc/core/core.h                            |  17 +
>>  drivers/mmc/core/debugfs.c                         |  90 +++
>>  drivers/mmc/core/host.c                            | 226 ++++++
>>  drivers/mmc/core/mmc.c                             | 246 ++++++-
>>  drivers/mmc/core/queue.c                           |   2 +
>>  drivers/mmc/core/sd.c                              |  84 ++-
>>  drivers/mmc/host/sdhci-msm.c                       |   2 +
>>  include/linux/mmc/card.h                           |   7 +
>>  include/linux/mmc/host.h                           |  66 ++
>>  12 files changed, 1550 insertions(+), 5 deletions(-)
> 
> This is a lot of new code in the mmc core, which I would then need to
> maintain, of course. I have to admit, I am a bit concerned about that,
> so you have to convince me that there are good reasons for me to apply
> this.
> 
> As I stated above, I think the approach looks quite questionable in
> general. And even if you can share measurements showing that it improves
> the behaviour, I suspect (without a deeper code review) that some of the
> code better belongs in the common block device layer.

From the collected power data, I see this as a good reason to have this
feature in the mmc driver, as the number is quite considerable.

For the approach, it would be helpful if you share your inputs regarding
it. And as you have stated, this can be further discussed after a
review from you.

> 
> Kind regards
> Uffe
Ram Prakash Gupta Nov. 29, 2019, 8:10 a.m. UTC | #4
Hi,

On 2019-11-15 23:09, Doug Anderson wrote:
> Hi,
> 
> On Fri, Nov 15, 2019 at 3:13 AM Ram Prakash Gupta
> <rampraka@codeaurora.org> wrote:
>> 
>> Each time the triggering point for scaling up/down is hit, a
>> series of commands needs to be sent to the card, including running the
>> tuning procedure. The point is, for sure, this doesn't come for free,
>> both from a latency point of view, but also from an energy cost point
>> of view. So, whether this really improves the behaviour seems very
>> use-case sensitive, right!?
>> 
>> With clock scaling support, the device mode would be switched between a
>> low speed mode (HS50 or DDR52) and a high speed mode (HS400 with
>> enhanced strobe).
> 
> I haven't read the patches, but just from this description it feels
> like the wrong way to go.  From my understanding if you're running at
> HS400 then you can run the card at any speed between 0 and 200 MHz.  I
> see no reason why you'd want to switch modes.  Just stay at HS400 and
> slow down the clock, right?  Then you can keep the "Enhanced Strobe"
> which makes everything more reliable.
> 
> If you're running on a system that doesn't have enhanced strobe you
> _should_ still be able to switch clock speeds without switching modes
> since the spec has just a _maximum_ clock frequency for each mode, so
> HS200, DDR50, etc should all be able to run at slower speeds without
> an official mode switch.  You also shouldn't have to re-tune.  Tuning
> is nothing magical, it's just trying to find the delay between sending
> a pulse on the clock line and when you read the data sent by the other
> side.  Assuming your tuning delay is natively represented in "phase
> offset", you can convert that back to an actual delay and then back to
> "phase offset" for the new clock.

Thanks for the suggestion. I have sought input from the hardware team on
this and will get back.

> 
> To further argue against switching modes, I would also note that for
> SD cards switching to a slower speed mode may result in an increase in
> IO voltage, which seems like it could be bad for power consumption?
> 

For SD card clock scaling, the driver is not switching any mode, so this
should not affect any use case.

> 
>> And from an energy point of view, this feature only helps in saving
>> energy and does not add any energy penalty. And we have observed a
>> considerable amount of power saving (data shared below) when playing
>> 1080p video with the clock scaling feature.
> 
> I am slightly baffled about why this would save energy unless it
> allows you to lower the voltage to the controller.  I know you _can_
> sometimes lower the voltage to the controller on Qualcomm parts, but
> you are arguing that it is useful even on systems that can't lower the
> voltage.  That feels slightly questionable.  I would expect that
> racing to idle (with the right tuning parameters) would be a better
> way to go.

Sorry if my explanation was misleading before. The MMC driver is not
changing the card/controller voltage, but lowering the clock frequency
of the card and controller brings down the _bus_ and _system_ voltage
corners of the device, which helps in saving power consumption.

> 
> As a specific example, let's imagine we want to transfer 1000 MB of
> data and we have 20 seconds with which to do it.  We can achieve this
> by transferring at 50 MB/s for the whole 20 seconds.  Alternatively, we
> could transfer at 400 MB/s for 2.5 seconds and then fully power gate the
> eMMC for the next 17.5 seconds.
> 
> In that example, I'd ask which is more power efficient.  Presumably
> the 2nd.  This is the whole "race to idle" concept as I understand it.
> 
> The "race to idle" is talked about a lot in the context of CPU
> frequency decisions.  Presumably you'll point out that "race to idle"
> is NOT the right thing to do for choosing a CPU frequency.  As I
> understand it, this is primarily true because we need to raise the CPU
> voltage to run at faster speeds.  This would lead me to believe that
> the only case you'd want to do frequency scaling like this is if it
> allows you to lower the voltage provided to the eMMC controller.  As
> you said, for Qualcomm it _does_ allow you to do this, but most other
> SoCs don't.  ...so unless there's a flaw in my logic (always
> possible!) this patch series should be amended to say it's only useful
> if it allows you to down-volt the controller.
> 
> Just to think a little bit more about my logic: of course, you might
> argue that we can't just do a 1000 MB data transfer.  We can break
> that down into two cases:
> 
> a) For writing, presumably the data is produced over time and you
> don't want to buffer the whole 1000 MB and delay 17.5 seconds before
> you start writing.  ...but presumably you could do _some_ buffering
> and then break things into chunks where you ungate the clock to the
> card, write a chunk out, and then re-gate the clock.  There will
> obviously be some overhead with each clock gate/ungate, but
> (hopefully) not too much.  ...and there will be time when data is in
> RAM and not on the disk so you'd worry about power failure, but if you
> want to get data on the disk ASAP why are you scaling the clock (and
> thus delaying the data from getting to the disk) at all?  Maybe some
> math?  How long does it take to ungate/gate the clocks?  1 ms?  It's
> just flipping a bit, right?  ...and does assuming we can allocate a 40
> MB write buffer seem sane?  So we eat 1 ms to ungate, 100 ms to
> transfer 40 MB, 1 ms to gate.  Compared to transferring at 50 MB/sec
> (w/ no gating), we'd transfer the same 40 MB in 800 ms.  So we either
> keep the clock on at 50 MHz for 800 ms or we keep it on at 200 MHz for
> 102 ms and gate it for 698 ms.
> 

"race to idle" helps but this feature was implemented with focus on 
video
playback case, where data transfer request to mmc driver spans over 
entire
playback time of video. In this case, running device in low speed mode 
helps.

> b) If you're reading data then hopefully the system has some sort of
> readahead going on.  In the "video playback" case the system should
> have no problem predicting that if you've just read bytes
> 1,000,000,000 - 2,000,000,000 of a file over the last 10 seconds that
> you're likely to keep reading the same file.  Presumably it wouldn't
> be totally insane to read 40 MB ahead of time and then we can do the
> same math as a).  If 40 MB is too much for readahead, then shrink it
> and redo the math.  Even with much smaller numbers the "race to idle"
> wins because gating / ungating clocks is fast.  ...or do you know some
> reason why gating / ungating clocks needs to be slow?  If so, how
> slow?
> 

I have performed one experiment, increasing the readahead size, but that
is not helping. And I don't observe much difference in the data request
pattern generated in the video playback case.

> 
>> The test cases used are 1080p and 4K video playback. Please find below
>> the test case information and power impact data. In both of the video
>> playback cases, a considerable amount of power savings can be observed
>> with the clock scaling feature.
>> 
>> Use case                            Delta at battery (mA)  Power impact %
>> 30 fps at HD 1080p decode 20 Mbps   10                     11%
>> 30 fps at UHD 8b H.264 42 Mbps      20.93                  19%
> 
> Numbers like this are exactly what is needed to justify your patch
> series.  ...but I'd be super curious how it would compare to:
> 
> 1) Tuning the runtime PM auto-suspend delay.  If you have your
> auto-suspend delay set wrong (like 500 ms) then all the math above is
> totally wrong.  We'll keep clocking at 400 MHz needlessly even though
> there is no data to transfer.  If autosuspend is just gating clocks
> then it feels like you could set it to 1 ms, or 10 ms.  NOTE: if
> autosuspend for you is something more major (fully turning off power
> rails to the eMMC) then maybe you need another level where you just
> turn off the clocks.  Seems like we could find some way to make that
> work.

Gating / ungating can be fine-tuned to help bring down power consumption
too. I will share power numbers with tuned parameters in my next
communication.

> 
> 2) Tuning any readahead mechanism in your system.  If your system
> somehow does zero readahead then obviously all my arguments don't work
> for reads.  ...but why would you not have readahead?
> 
> 3) Tuning any write buffering in your system.  Same argument as #2.

This feature is specific to the video playback use case from the storage
device. I am not sure which buffering can be tuned. Can you point out any
buffering used?

> 
> 4) Making sure that when the MMC card clock is gated that you request
> the lowest voltage level for the controller (and set the controller's
> bus clock to the lowest level since it's not doing anything).
> 
> 
> I would also be very interested to know how much of those savings are
> achieved if you keep the voltage to the MMC controller the same.  In
> other words do something that arbitrarily keeps the MMC controller
> requesting the same voltage level from the rest of the system and then
> do your power measurements.  How much do your savings change?
> 
> 
> I will also note that aggressive clock gating is exactly what the
> dw_mmc controller does automatically (except for SDIO, but that's a
> different story).  Seeing that the controller itself can stop the
> clock in dw_mmc gives further credence that gating / ungating the
> clock is a very lightweight operation and 1 ms is probably an
> over-estimation of how long it takes.
> 
> 
> I guess one very last note is that I spent most of the time above
> arguing that the clock scaling concept is probably not useful for any
> SoCs where you can't adjust the voltage provided to the MMC
> controller.  That doesn't necessarily mean that your patch series is
> useful for SoCs where you can.  Specifically you'd need to do math to
> see how much more power the MMC controller takes at the higher
> voltage.  Then you can calculate a "perf per watt".  If the watts to
> transfer data at 400 MB/s aren't 8 times more than the watts to
> transfer at 50 MB/s then that's a ding against your idea.  You also
> don't have a dedicated voltage rail, right?  So you'd have to see what
> percentage of the time the MMC controller was the only thing in the
> system that was requesting the higher voltage.  If it happened that
> something else in the system was keeping the voltage higher anyway
> then it would be better for you to run faster and race to idle.
> Really I guess the case you're most worried about is if the MMC
> controller is causing other components in the system to be at a higher
> voltage point (and thus take up more power)...
> 

The rail is not dedicated to the eMMC device. So during video playback it
is the mmc votes alone which keep the device in higher voltage corners. So
for three hours of video playback, the power consumption due to the mmc
vote would be quite considerable.

> 
> Hope that all makes sense.  I can read the patches themselves if
> needed--everything from above is just based on your cover letter and
> discussion with Ulf.  ;-)
> 
> 
> -Doug

Thanks,
Ram
Doug Anderson Dec. 4, 2019, 2:09 a.m. UTC | #5
Hi,

On Fri, Nov 29, 2019 at 12:10 AM <rampraka@codeaurora.org> wrote:
>
> > I am slightly baffled about why this would save energy unless it
> > allows you to lower the voltage to the controller.  I know you _can_
> > sometimes lower the voltage to the controller on Qualcomm parts, but
> > you are arguing that it is useful even on systems that can't lower the
> > voltage.  That feels slightly questionable.  I would expect that
> > racing to idle (with the right tuning parameters) would be a better
> > way to go.
>
> Sorry if my explanation was misleading before. The MMC driver is not
> changing the card/controller voltage, but lowering the clock frequency
> of the card and controller brings down the _bus_ and _system_ voltage
> corners of the device, which helps in saving power consumption.

Yes, I understood.  You are able to lower the voltage supplied to the
controller itself, not to the card.  ...but that is not something that
all SoCs can do.  I think here you are arguing that your feature is
useful to everyone--even if they can't ever lower the voltage of the
controller.  I am questioning that.


> > As a specific example, let's imagine we want to transfer 1000 MB of
> > data and we have 20 seconds with which to do it.  We can achieve this
> > by transferring at 50 MB/s for the whole 20 seconds.  Alternatively, we
> > could transfer at 400 MB/s for 2.5 seconds and then fully power gate the
> > eMMC for the next 17.5 seconds.
> >
> > In that example, I'd ask which is more power efficient.  Presumably
> > the 2nd.  This is the whole "race to idle" concept as I understand it.
> >
> > The "race to idle" is talked about a lot in the context of CPU
> > frequency decisions.  Presumably you'll point out that "race to idle"
> > is NOT the right thing to do for choosing a CPU frequency.  As I
> > understand it, this is primarily true because we need to raise the CPU
> > voltage to run at faster speeds.  This would lead me to believe that
> > the only case you'd want to do frequency scaling like this is if it
> > allows you to lower the voltage provided to the eMMC controller.  As
> > you said, for Qualcomm it _does_ allow you to do this, but most other
> > SoCs don't.  ...so unless there's a flaw in my logic (always
> > possible!) this patch series should be amended to say it's only useful
> > if it allows you to down-volt the controller.
> >
> > Just to think a little bit more about my logic: of course, you might
> > argue that we can't just do a 1000 MB data transfer.  We can break
> > that down into two cases:
> >
> > a) For writing, presumably the data is produced over time and you
> > don't want to buffer the whole 1000 MB and delay 17.5 seconds before
> > you start writing.  ...but presumably you could do _some_ buffering
> > and then break things into chunks where you ungate the clock to the
> > card, write a chunk out, and then re-gate the clock.  There will
> > obviously be some overhead with each clock gate/ungate, but
> > (hopefully) not too much.  ...and there will be time when data is in
> > RAM and not on the disk so you'd worry about power failure, but if you
> > want to get data on the disk ASAP why are you scaling the clock (and
> > thus delaying the data from getting to the disk) at all?  Maybe some
> > math?  How long does it take to ungate/gate the clocks?  1 ms?  It's
> > just flipping a bit, right?  ...and does assuming we can allocate a 40
> > MB write buffer seem sane?  So we eat 1 ms to ungate, 100 ms to
> > transfer 40 MB, 1 ms to gate.  Compared to transferring at 50 MB/sec
> > (w/ no gating), we'd transfer the same 40 MB in 800 ms.  So we either
> > keep the clock on at 50 MHz for 800 ms or we keep it on at 200 MHz for
> > 102 ms and gate it for 698 ms.
> >
>
> "race to idle" helps but this feature was implemented with focus on
> video
> playback case, where data transfer request to mmc driver spans over
> entire
> playback time of video. In this case, running device in low speed mode
> helps.

It doesn't matter if you've got a long playback.  A properly tuned
"race to idle" should still be better unless you are seriously saving
a lot of power by never bringing the voltage up.  The above example is
with a fixed size transfer, but just imagine the same thing over and
over again and you've got the video playback case.


> > b) If you're reading data then hopefully the system has some sort of
> > readahead going on.  In the "video playback" case the system should
> > have no problem predicting that if you've just read bytes
> > 1,000,000,000 - 2,000,000,000 of a file over the last 10 seconds that
> > you're likely to keep reading the same file.  Presumably it wouldn't
> > be totally insane to read 40 MB ahead of time and then we can do the
> > same math as a).  If 40 MB is too much for readahead, then shrink it
> > and redo the math.  Even with much smaller numbers the "race to idle"
> > wins because gating / ungating clocks is fast.  ...or do you know some
> > reason why gating / ungating clocks needs to be slow?  If so, how
> > slow?
> >
>
> I have performed one experiment, increasing the readahead size, but that
> is not helping. And I don't observe much difference in the data request
> pattern generated in the video playback case.

You have to make sure that you are proactively gating the card clock
in conjunction.  If you aren't gating the card clock quickly enough
(and aren't lowering the controller voltage when the card clock is
gated) then you won't notice the savings.


> >> The test cases used are 1080p and 4K video playback. Please find below
> >> the test case information and power impact data. In both of the video
> >> playback cases, a considerable amount of power savings can be observed
> >> with the clock scaling feature.
> >>
> >> Use case                            Delta at battery (mA)  Power impact %
> >> 30 fps at HD 1080p decode 20 Mbps   10                     11%
> >> 30 fps at UHD 8b H.264 42 Mbps      20.93                  19%
> >
> > Numbers like this are exactly what is needed to justify your patch
> > series.  ...but I'd be super curious how it would compare to:
> >
> > 1) Tuning the runtime PM auto-suspend delay.  If you have your
> > auto-suspend delay set wrong (like 500 ms) then all the math above is
> > totally wrong.  We'll keep clocking at 400 MHz needlessly even though
> > there is no data to transfer.  If autosuspend is just gating clocks
> > then it feels like you could set it to 1 ms, or 10 ms.  NOTE: if
> > autosuspend for you is something more major (fully turning off power
> > rails to the eMMC) then maybe you need another level where you just
> > turn off the clocks.  Seems like we could find some way to make that
> > work.
>
> Gating / ungating can be fine-tuned to help bring down power consumption
> too. I will share power numbers with tuned parameters in my next
> communication.

Thanks, I think this is critical for the discussion.  Please make sure
that when you are gating you code it up in a way that you can remove
the vote for the higher controller voltage.


> > 2) Tuning any readahead mechanism in your system.  If your system
> > somehow does zero readahead then obviously all my arguments don't work
> > for reads.  ...but why would you not have readahead?
> >
> > 3) Tuning any write buffering in your system.  Same argument as #2.
>
> This feature is specific to the video playback use case from the storage
> device. I am not sure which buffering can be tuned. Can you point out any
> buffering used?

If you're only testing video playback then I guess you don't care so
much here.  I was assuming you cared about video record too, but maybe
that's not such an important use case for saving power.


> > 4) Making sure that when the MMC card clock is gated that you request
> > the lowest voltage level for the controller (and set the controller's
> > bus clock to the lowest level since it's not doing anything).
> >
> >
> > I would also be very interested to know how much of those savings are
> > achieved if you keep the voltage to the MMC controller the same.  In
> > other words do something that arbitrarily keeps the MMC controller
> > requesting the same voltage level from the rest of the system and then
> > do your power measurements.  How much do your savings change?
> >
> >
> > I will also note that aggressive clock gating is exactly what the
> > dw_mmc controller does automatically (except for SDIO, but that's a
> > different story).  Seeing that the controller itself can stop the
> > clock in dw_mmc gives further credence that gating / ungating the
> > clock is a very lightweight operation and 1 ms is probably an
> > over-estimation of how long it takes.
> >
> >
> > I guess one very last note is that I spent most of the time above
> > arguing that the clock scaling concept is probably not useful for any
> > SoCs where you can't adjust the voltage provided to the MMC
> > controller.  That doesn't necessarily mean that your patch series is
> > useful for SoCs where you can.  Specifically you'd need to do math to
> > see how much more power the MMC controller takes at the higher
> > voltage.  Then you can calculate a "perf per watt".  If the watts to
> > transfer data at 400 MB/s aren't 8 times more than the watts to
> > transfer at 50 MB/s then that's a ding against your idea.  You also
> > don't have a dedicated voltage rail, right?  So you'd have to see what
> > percentage of the time the MMC controller was the only thing in the
> > system that was requesting the higher voltage.  If it happened that
> > something else in the system was keeping the voltage higher anyway
> > then it would be better for you to run faster and race to idle.
> > Really I guess the case you're most worried about is if the MMC
> > controller is causing other components in the system to be at a higher
> > voltage point (and thus take up more power)...
> >
>
> The rail is not dedicated to the eMMC device. So during video playback it
> is the mmc votes alone which keep the device in higher voltage corners. So
> for three hours of video playback, the power consumption due to the mmc
> vote would be quite considerable.

You should be able to remove your vote whenever the card clock is
gated.  Maybe this is what you're missing?
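
As a concrete (hypothetical, untested) sketch of what I mean, assuming
the corner vote goes through a genpd performance state; struct my_host
and its fields are placeholders:

#include <linux/clk.h>
#include <linux/pm_domain.h>

/* Drop the power-domain corner vote whenever the card clock is gated. */
static void my_host_gate_card_clk(struct my_host *host)
{
	clk_disable_unprepare(host->card_clk);
	dev_pm_genpd_set_performance_state(host->dev, 0);
}

static int my_host_ungate_card_clk(struct my_host *host)
{
	dev_pm_genpd_set_performance_state(host->dev, host->active_corner);
	return clk_prepare_enable(host->card_clk);
}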

-Doug
Ulf Hansson Dec. 18, 2019, 3:11 p.m. UTC | #6
On Fri, 29 Nov 2019 at 08:34, <rampraka@codeaurora.org> wrote:
>
> Hi Ulf,
>
> There seems to be a settings issue with my Thunderbird application.
> Sorry for the spam; please ignore my last responses, as unsupported
> characters were added.
>
> Typing my response again from the browser and re-sending.
>
> Thanks,
> Ram
>
> On 2019-10-22 14:10, Ulf Hansson wrote:
> > On Mon, 21 Oct 2019 at 16:30, Ram Prakash Gupta
> > <rampraka@codeaurora.org> wrote:
> >>
> >> This change adds devfreq-based clock scaling support to MMC.
> >> It is applicable to both eMMC and SD cards.
> >> For some workloads, such as video playback, it isn't necessary
> >> for these cards to run at high speed. Running at a lower
> >> frequency in such cases can still meet the deadlines for data
> >> transfers.
> >>
> >> Scaling down the clock frequency dynamically saves power,
> >> not only because the bus runs at a lower frequency, but also
> >> because it allows scaling down the system core voltage, where
> >> supported. Provide ondemand clock scaling support similar
> >> to the cpufreq ondemand governor, with two thresholds,
> >> up_threshold and down_threshold, to decide whether to increase
> >> the frequency or scale it down as per load.
> >
> > This sounds simple, but what the series is doing is far more
> > complicated than just scaling the bus clock, as it also re-negotiates the
> > bus speed mode.
> >
> > Each time the triggering point for scaling up/down is hit, a
> > series of commands needs to be sent to the card, including running the
> > tuning procedure. The point is, for sure, this doesn't come for free,
> > both from a latency point of view, but also from an energy cost point
> > of view. So, whether this really improves the behaviour seems very
> > use-case sensitive, right!?
>
> Switching modes would incur some latency for sending the mode-switch
> commands, but tuning is not needed, as most eMMC devices used nowadays
> come with enhanced strobe support; tuning would not add any latency
> since it is not required in HS400 enhanced strobe mode.
>
> This feature is implemented for the video playback case, where data
> transfer requests are sparse and the feature helps save power.
>
> And when there is a burst of data transfer requests, the load will remain
> _high_, so there won't be any switching, and hence it won't affect any
> existing use cases from a latency point of view.
>
> Also, if the hardware supports switching the clock frequency without
> changing the mode, I will change the code accordingly. I have sought
> input from the hardware team on this.
>
> From the collected data, I see this feature helping to save power,
> and no energy penalty is observed. Please share anything specific that
> I am missing. The power saving from this feature is considerable.
> Please find the data below.
>
> Use case                            Delta at battery (mA)  Power impact %
> 30 fps at HD 1080p decode 20 Mbps   10                     11%
> 30 fps at UHD 8b H.264 42 Mbps      20.93                  19%
>
> >
> > Overall, when it comes to use cases, we have very limited knowledge
> > about them at the mmc block layer level. I think it should remain like
> > that. If at any place at all, this information is better maintained by
> > the generic block layer and potentially the configured I/O scheduler.
>
> I think the generic block layer does not have knowledge of the use case
> behind a data transfer request. And the devfreq framework has been used
> to implement this feature, which would be the same in any layer.

Just to be clear, my main concern here is not the use of the devfreq framework.

It's rather whether switching speed modes is really worth it, which
Doug pointed out as well.

>
> Also, mobile platforms come mostly with eMMC and UFS as storage media,
> and clock scaling is already implemented in the upstream UFS driver using
> the devfreq framework. Along similar lines, this feature is implemented
> for the mmc driver. So I believe the clock scaling implementation is
> better placed in the mmc driver rather than in the generic block layer.

At some point you want to trigger the clock to be scaled up/down,
probably because of reaching some bandwidth threshold. This part seems
like a generic block thing and not an mmc specific thing.

Exactly how that would affect this approach is hard to say, but my
point is, that I don't want to sprinkle the mmc subsystem with code
that may belong in upper common layers.

>
> >
> > This brings me to a question about the tests you have run. Can you
> > share some information and data about that?
>
> The test cases used are 1080p and 4K video playback, as this feature
> is implemented specifically for the video playback use case.

Right.

It seems to me that you are optimizing for only one particular use
case. How do you make sure not to increase energy consumption for
other use cases that would gain from running at the highest speed mode
and "race to idle"?

I think you need to take a step back and think more generally about this
approach. More importantly, you need to provide more data to prove
your approach, also according to the suggestions from Doug.

> >
> >>
> >>
> >> Ram Prakash Gupta (6):
> >>   mmc: core: Parse clk scaling dt entries
> >>   mmc: core: Add core scaling support in driver
> >>   mmc: core: Initialize clk scaling for mmc and SDCard
> >>   mmc: core: Add debugfs entries for scaling support
> >>   mmc: sdhci-msm: Add capability in platform host
> >>   dt-bindings: mmc: sdhci-msm: Add clk scaling dt parameters
> >>
> >>  .../devicetree/bindings/mmc/sdhci-msm.txt          |  19 +
> >
> > I noticed that the DT patch was put last in the series, but the
> > parsing is implemented in the first patch. Please flip this around. If
> > you want to implement DT parsing of new bindings, please make sure to
> > discuss the new bindings first.
>
> I will update this in the next post.
>
> >
> >>  drivers/mmc/core/block.c                           |  19 +-
> >>  drivers/mmc/core/core.c                            | 777 +++++++++++++++++++++
> >>  drivers/mmc/core/core.h                            |  17 +
> >>  drivers/mmc/core/debugfs.c                         |  90 +++
> >>  drivers/mmc/core/host.c                            | 226 ++++++
> >>  drivers/mmc/core/mmc.c                             | 246 ++++++-
> >>  drivers/mmc/core/queue.c                           |   2 +
> >>  drivers/mmc/core/sd.c                              |  84 ++-
> >>  drivers/mmc/host/sdhci-msm.c                       |   2 +
> >>  include/linux/mmc/card.h                           |   7 +
> >>  include/linux/mmc/host.h                           |  66 ++
> >>  12 files changed, 1550 insertions(+), 5 deletions(-)
> >
> > This is a lot of new code in the mmc core, which I would then need to
> > maintain, of course. I have to admit, I am a bit concerned about that,
> > so you have to convince me that there are good reasons for me to apply
> > this.
> >
> > As I stated above, I think the approach looks quite questionable in
> > general. And even if you can share measurements showing that it improves
> > the behaviour, I suspect (without a deeper code review) that some of the
> > code better belongs in the common block device layer.
>
> From the collected power data, I see this as a good reason to have this
> feature in the mmc driver, as the number is quite considerable.
>
> For the approach, it would be helpful if you share your inputs regarding
> it. And as you have stated, this can be further discussed after a
> review from you.

Besides my comments above, Doug has provided you with valuable
feedback. Please follow up on that as well, if you still intend to
pursue this approach.

Kind regards
Uffe