Message ID: 1571668177-3766-1-git-send-email-rampraka@codeaurora.org (mailing list archive)
Series: mmc: Add clock scaling support for mmc driver
On Mon, 21 Oct 2019 at 16:30, Ram Prakash Gupta <rampraka@codeaurora.org> wrote:
>
> This change adds the use of devfreq based clock scaling to MMC.
> This is applicable for eMMC and SDCard.
> For some workloads, such as video playback, it isn't necessary
> for these cards to run at high speed. Running at a lower
> frequency in such cases can still meet the deadlines for data
> transfers.
>
> Scaling down the clock frequency dynamically has power savings
> not only because the bus is running at a lower frequency but also
> has the advantage of scaling down the system core voltage, if
> supported. Provide an ondemand clock scaling support similar
> to the cpufreq ondemand governor, having two thresholds,
> up_threshold and down_threshold, to decide whether to increase
> the frequency or scale it down respectively as per load.

This sounds simple, but what the series is doing is far more complicated than just scaling the bus clock, as it also re-negotiates the bus speed mode.

Each time the triggering point for scaling up/down is hit, a series of commands needs to be sent to the card, including running the tuning procedure. The point is, for sure, this doesn't come for free, both from a latency point of view and from an energy cost point of view. So whether this really improves the behaviour seems very use case sensitive, right!?

Overall, when it comes to use cases, we have very limited knowledge about them at the mmc block layer level. I think it should remain like that. If at any place at all, this information is better maintained by the generic block layer and potentially the configured I/O scheduler.

This brings me to a question about the tests you have run. Can you share some information and data about that?
>
> Ram Prakash Gupta (6):
>   mmc: core: Parse clk scaling dt entries
>   mmc: core: Add core scaling support in driver
>   mmc: core: Initialize clk scaling for mmc and SDCard
>   mmc: core: Add debugfs entries for scaling support
>   mmc: sdhci-msm: Add capability in platfrom host
>   dt-bindings: mmc: sdhci-msm: Add clk scaling dt parameters
>
>  .../devicetree/bindings/mmc/sdhci-msm.txt |  19 +

I noticed that the DT patch was put last in the series, but the parsing is implemented in the first patch. Please flip this around. If you want to implement DT parsing of new bindings, please make sure to discuss the new bindings first.

>  drivers/mmc/core/block.c     |  19 +-
>  drivers/mmc/core/core.c      | 777 +++++++++++++++++++++
>  drivers/mmc/core/core.h      |  17 +
>  drivers/mmc/core/debugfs.c   |  90 +++
>  drivers/mmc/core/host.c      | 226 ++++++
>  drivers/mmc/core/mmc.c       | 246 ++++++-
>  drivers/mmc/core/queue.c     |   2 +
>  drivers/mmc/core/sd.c        |  84 ++-
>  drivers/mmc/host/sdhci-msm.c |   2 +
>  include/linux/mmc/card.h     |   7 +
>  include/linux/mmc/host.h     |  66 ++
>  12 files changed, 1550 insertions(+), 5 deletions(-)

This is a lot of new code in the mmc core, which I would then need to maintain, of course. I have to admit, I am a bit concerned about that, so you have to convince me that there are good reasons for me to apply this.

As I stated above, I think the approach looks quite questionable in general. And even if you can share measurements showing that it improves the behaviour, I suspect (without a deeper code review) that some of the code better belongs in the common block device layer.

Kind regards
Uffe
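The two-threshold ondemand policy the cover letter describes can be modelled in a few lines. The thresholds and the frequency table below are illustrative assumptions for the sketch, not values taken from the patches (the series itself builds on the kernel's devfreq framework):

```python
# Toy model of the two-threshold ondemand policy from the cover letter.
# FREQS_HZ and the default thresholds are illustrative assumptions.
FREQS_HZ = [50_000_000, 200_000_000]  # hypothetical low/high bus clocks

def next_freq(cur_freq, load_pct, up_threshold=75, down_threshold=30):
    """Pick the next bus clock for the observed load percentage."""
    if load_pct > up_threshold:
        return FREQS_HZ[-1]   # heavy load: scale up to the highest clock
    if load_pct < down_threshold:
        return FREQS_HZ[0]    # light load: scale down to the lowest clock
    return cur_freq           # in between: hold the current clock

print(next_freq(50_000_000, 90))   # scales up to 200000000
print(next_freq(200_000_000, 10))  # scales down to 50000000
print(next_freq(200_000_000, 50))  # holds at 200000000
```

Note that the series layers on top of such a decision not just a clock change but a bus speed mode re-negotiation, which is exactly the extra cost being questioned above.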
Hi,

On Fri, Nov 15, 2019 at 3:13 AM Ram Prakash Gupta <rampraka@codeaurora.org> wrote:
>
> Each time the triggering point for scaling up/down is hit, then a
> series of commands needs to be sent to the card, including running the
> tuning procedure. The point is, for sure, this doesn't come for free,
> both from a latency point of view, but also from an energy cost point
> of view. So, whether this really improves the behaviour, seems like
> very use case sensitive, right!?
>
> With clock scaling support device mode would be switched between low
> speed (hs50 or ddr52) and high speed mode (hs400 enhanced strobe).

I haven't read the patches, but just from this description it feels like the wrong way to go. From my understanding, if you're running at HS400 then you can run the card at any speed between 0 and 200 MHz. I see no reason why you'd want to switch modes. Just stay at HS400 and slow down the clock, right? Then you can keep the "Enhanced Strobe", which makes everything more reliable.

If you're running on a system that doesn't have enhanced strobe you _should_ still be able to switch clock speeds without switching modes, since the spec has just a _maximum_ clock frequency for each mode, so HS200, DDR50, etc. should all be able to run at slower speeds without an official mode switch. You also shouldn't have to re-tune. Tuning is nothing magical; it's just trying to find the delay between sending a pulse on the clock line and when you read the data sent by the other side. Assuming your tuning delay is natively represented in "phase offset", you can convert that back to an actual delay and then back to "phase offset" for the new clock.

To further argue against switching modes, I would also note that for SD cards switching to a slower speed mode may result in an increase in IO voltage, which seems like it could be bad for power consumption?

> And from energy point of view, this feature is only helping in saving
> energy and not adding any energy penalty.
> And we have observed a considerable amount
> of power saving (data shared in mid) when playing 1080p video playback
> with clock scaling feature support.

I am slightly baffled about why this would save energy unless it allows you to lower the voltage to the controller. I know you _can_ sometimes lower the voltage to the controller on Qualcomm parts, but you are arguing that it is useful even on systems that can't lower the voltage. That feels slightly questionable. I would expect that racing to idle (with the right tuning parameters) would be a better way to go.

As a specific example, let's imagine we want to transfer 1000 MB of data and we have 20 seconds with which to do it. We can achieve this by transferring 50 MB/s for the whole 20 seconds. Alternatively, we could transfer at 400 MB/s for 2.5 seconds and then fully power gate the eMMC for the next 17.5 seconds.

In that example, I'd wonder which is more power efficient. Presumably the 2nd. This is the whole "race to idle" concept as I understand it.

The "race to idle" is talked about a lot in the context of CPU frequency decisions. Presumably you'll point out that "race to idle" is NOT the right thing to do for choosing a CPU frequency. As I understand it, this is primarily true because we need to raise the CPU voltage to run at faster speeds. This would lead me to believe that the only case you'd want to do frequency scaling like this is if it allows you to lower the voltage provided to the eMMC controller. As you said, for Qualcomm it _does_ allow you to do this, but most other SoCs don't. ...so unless there's a flaw in my logic (always possible!) this patch series should be amended to say it's only useful if it allows you to down-volt the controller.

Just to think a little bit more about my logic: of course, you might argue that we can't just do a 1000 MB data transfer.
We can break that down into two cases:

a) For writing, presumably the data is produced over time and you don't want to buffer the whole 1000 MB and delay 17.5 seconds before you start writing. ...but presumably you could do _some_ buffering and then break things into chunks where you ungate the clock to the card, write a chunk out, and then re-gate the clock. There will obviously be some overhead with each clock gate/ungate, but (hopefully) not too much. ...and there will be time when data is in RAM and not on the disk so you'd worry about power failure, but if you want to get data on the disk ASAP why are you scaling the clock (and thus delaying the data from getting to the disk) at all? Maybe some math? How long does it take to ungate/gate the clocks? 1 ms? It's just flipping a bit, right? ...and does assuming we can allocate a 40 MB write buffer seem sane? So we eat 1 ms to ungate, 100 ms to transfer 40 MB, 1 ms to gate. Compared to transferring at 50 MB/sec (w/ no gating), we'd transfer the same 40 MB in 800 ms. So we either keep the clock on at 50 MHz for 800 ms or we keep it on at 200 MHz for 102 ms and gate it for 698 ms.

b) If you're reading data then hopefully the system has some sort of readahead going on. In the "video playback" case the system should have no problem predicting that if you've just read bytes 1,000,000,000 - 2,000,000,000 of a file over the last 10 seconds that you're likely to keep reading the same file. Presumably it wouldn't be totally insane to read 40 MB ahead of time and then we can do the same math as a). If 40 MB is too much for readahead, then shrink it and redo the math. Even with much smaller numbers the "race to idle" wins because gating / ungating clocks is fast. ...or do you know some reason why gating / ungating clocks needs to be slow? If so, how slow?

> Test case used are 1080p and 4k video playback use case. Please find
> below test case information and power impact data.
> In both the below video playback
> cases, a considerable amount of power savings can be observed with the
> clock scaling feature.
>
> Use cases                           Delta at battery (mA)   Power impact %
> 30 fps at HD 1080p decode 20 Mbps   10 mA                   11%
> 30 fps at UHD 8b H.264 42 Mbps      20.93 mA                19%

Numbers like this are exactly what is needed to justify your patch series. ...but I'd be super curious how it would compare to:

1) Tuning the runtime PM auto-suspend delay. If you have your auto-suspend delay set wrong (like 500 ms) then all the math above is totally wrong. We'll keep clocking at 400 MHz needlessly even though there is no data to transfer. If autosuspend is just gating clocks then it feels like you could set it to 1 ms, or 10 ms. NOTE: if autosuspend for you is something more major (fully turning off power rails to the eMMC) then maybe you need another level where you just turn off the clocks. Seems like we could find some way to make that work.

2) Tuning any readahead mechanism in your system. If your system somehow does zero readahead then obviously all my arguments don't work for reads. ...but why would you not have readahead?

3) Tuning any write buffering in your system. Same argument as #2.

4) Making sure that when the MMC card clock is gated you request the lowest voltage level for the controller (and set the controller's bus clock to the lowest level since it's not doing anything).

I would also be very interested to know how much of those savings are achieved if you keep the voltage to the MMC controller the same. In other words, do something that arbitrarily keeps the MMC controller requesting the same voltage level from the rest of the system and then do your power measurements. How much do your savings change?

I will also note that aggressive clock gating is exactly what the dw_mmc controller does automatically (except for SDIO, but that's a different story).
Seeing that the controller itself can stop the clock in dw_mmc gives further credence that gating / ungating the clock is a very lightweight operation and 1 ms is probably an over-estimation of how long it takes.

I guess one very last note is that I spent most of the time above arguing that the clock scaling concept is probably not useful for any SoCs where you can't adjust the voltage provided to the MMC controller. That doesn't necessarily mean that your patch series is useful for SoCs where you can. Specifically, you'd need to do math to see how much more power the MMC controller takes at the higher voltage. Then you can calculate a "perf per watt". If the watts to transfer data at 400 MB/s aren't 8 times more than the watts to transfer at 50 MB/s then that's a ding against your idea. You also don't have a dedicated voltage rail, right? So you'd have to see what percentage of the time the MMC controller was the only thing in the system that was requesting the higher voltage. If it happened that something else in the system was keeping the voltage higher anyway then it would be better for you to run faster and race to idle. Really, I guess the case you're most worried about is if the MMC controller is causing other components in the system to be at a higher voltage point (and thus take up more power)...

Hope that all makes sense. I can read the patches themselves if needed--everything from above is just based on your cover letter and discussion with Ulf. ;-)

-Doug
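The 40 MB chunk example above reduces to simple arithmetic. A sketch using the numbers assumed in the discussion (400 MB/s burst vs. a scaled-down 50 MB/s, and an assumed 1 ms per gate/ungate, which is a guess rather than a measured figure):

```python
# Back-of-the-envelope for the 40 MB chunk example in the thread.
# The 1 ms gate/ungate cost is an assumption, not a measurement.
CHUNK_MB = 40
GATE_MS = 1.0  # assumed cost of one gate or one ungate operation

def clock_on_ms(rate_mb_s, gated=False):
    """Milliseconds the bus clock stays running to move one chunk."""
    transfer_ms = CHUNK_MB / rate_mb_s * 1000.0
    return transfer_ms + (2 * GATE_MS if gated else 0.0)

fast = clock_on_ms(400, gated=True)  # race to idle: ungate, burst, re-gate
slow = clock_on_ms(50)               # scaled-down clock, never gated
print(fast, slow)  # 102.0 800.0 -- the clock runs ~8x longer when scaled down
```

So per chunk the bus clock is active for roughly 102 ms versus 800 ms, which is the core of the race-to-idle argument whenever the gating overhead stays small.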
Hi Ulf,

There seems to be some settings issue with my Thunderbird application. Sorry for the spam; please ignore my last responses, as unsupported characters got added. Typing my response again from the browser and re-sending.

Thanks,
Ram

On 2019-10-22 14:10, Ulf Hansson wrote:
> On Mon, 21 Oct 2019 at 16:30, Ram Prakash Gupta
> <rampraka@codeaurora.org> wrote:
>>
>> This change adds the use of devfreq based clock scaling to MMC.
>> This is applicable for eMMC and SDCard.
>> For some workloads, such as video playback, it isn't necessary
>> for these cards to run at high speed. Running at lower
>> frequency, in such cases can still meet the deadlines for data
>> transfers.
>>
>> Scaling down the clock frequency dynamically has power savings
>> not only because the bus is running at lower frequency but also
>> has an advantage of scaling down the system core voltage, if
>> supported. Provide an ondemand clock scaling support similar
>> to the cpufreq ondemand governor having two thresholds,
>> up_threshold and down_threshold to decide whether to increase
>> the frequency or scale it down respectively as per load.
>
> This sounds simple, but what the series is doing is far more
> complicated than just scaling the bus clock, as it also re-negotiates
> the bus speed mode.
>
> Each time the triggering point for scaling up/down is hit, then a
> series of commands needs to be sent to the card, including running the
> tuning procedure. The point is, for sure, this doesn't come for free,
> both from a latency point of view, but also from an energy cost point
> of view. So, whether this really improves the behaviour, seems like
> very use case sensitive, right!?

Switching modes would incur some latency for sending the commands to switch modes, but tuning is not needed: most of the eMMC devices used nowadays come with enhanced strobe support, so tuning would not add any latency, as it is not required in hs400 enhanced strobe mode.
This feature is implemented for the video playback case, where data transfer requests are sparse and the feature helps save power. And when there is a burst of data transfer requests, the load will remain _high_, so there won't be any switching, and hence it won't affect any existing use cases from a latency point of view.

Also, if the hardware supports switching the clock frequency without changing the mode, I will make that change in the code. I have sought input from the hardware team on this.

From the collected data, I see this feature is helping in saving power consumption, and no energy penalty is observed. Please let me know if I am missing anything specific.

The power saving using this feature is quite good and considerable. Please find the data below.

Use Case                            Delta at Battery   Power Impact
30 fps at HD 1080p decode 20 Mbps   10 mA              11%
30 fps at UHD 8b H.264 42 Mbps      20.93 mA           19%

>
> Overall, when it comes to use cases, we have very limited knowledge
> about them at the mmc block layer level. I think it should remain like
> that. If at any place at all, this information is better maintained by
> the generic block layer and potentially the configured I/O scheduler.

I think the generic block layer does not have knowledge of the use case behind data transfer requests. And the devfreq framework has been used to implement this feature, which would be the same in any layer. Also, mobile platforms come mostly with eMMC and UFS as storage media, and clock scaling is already implemented in the upstream UFS driver using the devfreq framework. Along similar lines, this feature is implemented for the mmc driver. So I believe the clk scaling implementation is better placed in the mmc driver than in the generic block layer.

>
> This brings me to a question about the tests you have run. Can you
> share some information and data about that?

The test cases used are 1080p and 4K video playback, as this feature is implemented specifically for the video playback use case.
>>
>> Ram Prakash Gupta (6):
>>   mmc: core: Parse clk scaling dt entries
>>   mmc: core: Add core scaling support in driver
>>   mmc: core: Initialize clk scaling for mmc and SDCard
>>   mmc: core: Add debugfs entries for scaling support
>>   mmc: sdhci-msm: Add capability in platfrom host
>>   dt-bindings: mmc: sdhci-msm: Add clk scaling dt parameters
>>
>>  .../devicetree/bindings/mmc/sdhci-msm.txt |  19 +
>
> I noticed that the DT patch was put last in the series, but the
> parsing is implemented in the first patch. Please flip this around. If
> you want to implement DT parsing of new bindings, please make sure to
> discuss the new bindings first.

I will update this in the next post.

>
>>  drivers/mmc/core/block.c     |  19 +-
>>  drivers/mmc/core/core.c      | 777 +++++++++++++++++++++
>>  drivers/mmc/core/core.h      |  17 +
>>  drivers/mmc/core/debugfs.c   |  90 +++
>>  drivers/mmc/core/host.c      | 226 ++++++
>>  drivers/mmc/core/mmc.c       | 246 ++++++-
>>  drivers/mmc/core/queue.c     |   2 +
>>  drivers/mmc/core/sd.c        |  84 ++-
>>  drivers/mmc/host/sdhci-msm.c |   2 +
>>  include/linux/mmc/card.h     |   7 +
>>  include/linux/mmc/host.h     |  66 ++
>>  12 files changed, 1550 insertions(+), 5 deletions(-)
>
> This is a lot of new code in the mmc core, which I would then need to
> maintain, of course. I have to admit, I am a bit concerned about that,
> so you have to convince me that there are good reasons for me to apply
> this.
>
> As I stated above, I think the approach looks quite questionable in
> general. And even if you can share measurements showing that it
> improves the behaviour, I suspect (without a deeper code review) that
> some of the code better belongs in the common block device layer.

From the collected power data, I see a good reason to have this feature in the mmc driver, as the numbers are quite considerable. Regarding the approach, it would be helpful if you could share your inputs, and as you have stated, this can be further discussed after a review from you.

>
> Kind regards
> Uffe
Hi,

On 2019-11-15 23:09, Doug Anderson wrote:
> Hi,
>
> On Fri, Nov 15, 2019 at 3:13 AM Ram Prakash Gupta
> <rampraka@codeaurora.org> wrote:
>>
>> Each time the triggering point for scaling up/down is hit, then a
>> series of commands needs to be sent to the card, including running the
>> tuning procedure. The point is, for sure, this doesn't come for free,
>> both from a latency point of view, but also from an energy cost point
>> of view. So, whether this really improves the behaviour, seems like
>> very use case sensitive, right!?
>>
>> With clock scaling support device mode would be switched between low
>> speed (hs50 or ddr52) and high speed mode (hs400 enhanced strobe).
>
> I haven't read the patches, but just from this description it feels
> like the wrong way to go. From my understanding if you're running at
> HS400 then you can run the card at any speed between 0 and 200 MHz. I
> see no reason why you'd want to switch modes. Just stay at HS400 and
> slow down the clock, right? Then you can keep the "Enhanced Strobe"
> which makes everything more reliable.
>
> If you're running on a system that doesn't have enhanced strobe you
> _should_ still be able to switch clock speeds without switching modes
> since the spec has just a _maximum_ clock frequency for each mode, so
> HS200, DDR50, etc should all be able to run at slower speeds without
> an official mode switch. You also shouldn't have to re-tune. Tuning
> is nothing magical, it's just trying to find the delay between sending
> a pulse on the clock line and when you read the data sent by the other
> side. Assuming your tuning delay is natively represented in "phase
> offset", you can convert that back to an actual delay and then back to
> "phase offset" for the new clock.

Thanks for the suggestion. I have sought input from the hardware team on this and will get back.
>
> To further argue against switching modes, I would also note that for
> SD cards switching to a slower speed mode may result in an increase in
> IO voltage, which seems like it could be bad for power consumption?

For SD card clock scaling, the driver is not switching any mode, so this should not affect any use case.

>> And from energy point of view, this feature is only helping in saving
>> energy and not adding any energy penalty. And we have observed a
>> considerable amount of power saving (data shared in mid) when playing
>> 1080p video playback with clock scaling feature support.
>
> I am slightly baffled about why this would save energy unless it
> allows you to lower the voltage to the controller. I know you _can_
> sometimes lower the voltage to the controller on Qualcomm parts, but
> you are arguing that it is useful even on systems that can't lower the
> voltage. That feels slightly questionable. I would expect that
> racing to idle (with the right tuning parameters) would be a better
> way to go.

Sorry if my explanation was misleading before. The MMC driver is not changing the card/controller voltage, but lowering the clock frequency of the card and controller brings down the _bus_ and _system_ voltage corners of the device, which helps in saving power consumption.

> As a specific example, let's imagine we want to transfer 1000 MB of
> data and we have 20 seconds with which to do it. We can achieve this
> by transferring 50 MB/s for the whole 20 seconds. Alternatively, we
> could transfer at 400 MB/s for 2.5 seconds and then fully power gate
> the eMMC for the next 17.5 seconds.
>
> In that example, I'd wonder which is more power efficient. Presumably
> the 2nd. This is the whole "race to idle" concept as I understand it.
>
> The "race to idle" is talked about a lot in the context of CPU
> frequency decisions. Presumably you'll point out that "race to idle"
> is NOT the right thing to do for choosing a CPU frequency.
As I > understand it, this is primarily true because we need to raise the CPU > voltage to run at faster speeds. This would lead me to believe that > the only case you'd want to do frequency scaling like this is if it > allows you to lower the voltage provided to the eMMC controller. As > you said, for Qualcomm it _does_ allow you to do this, but most other > SoCs don't. ...so unless there's a flaw in my logic (always > possible!) this patch series should be amended to say it's only useful > if it allows you to down-volt the controller. > > Just to think a little bit more about my logic: of course, you might > argue that we can't just do a 1000 MB data transfer. We can break > that down into two cases: > > a) For writing, presumably the data is produced over time and you > don't want to buffer the whole 1000 MB and delay 17.5 seconds before > you start writing. ...but presumably you could do _some_ buffering > and then break things into chunks where you ungate the clock to the > card, write a chunk out, and then re-gate the clock. There will > obviously be some overhead with each clock gate/ungate, but > (hopefully) not too much. ...and there will be time when data is in > RAM and not on the disk so you'd worry about power failure, but if you > want to get data on the disk ASAP why are you scaling the clock (and > thus delaying the data from getting to the disk) at all? Maybe some > math? How long does it take to ungate/gate the clocks. 1 ms? It's > just flipping a bit, right? ...and does assuming we can allocate a 40 > MB write buffer seem sane? So we eat 1 ms to ungate, 100 ms to > transfer 40 MB, 1 ms to gate. Compared to transferring at 50 MB/sec > (w/ no gating), we'd transfer the same 40 MB in 800 ms. So we either > keep the clock on at 50 MHz for 800 ms or we keep it on at 200 MHz for > 102 ms and gate it for 698 ms. 
>

"race to idle" helps, but this feature was implemented with a focus on the video playback case, where data transfer requests to the mmc driver span the entire playback time of the video. In this case, running the device in a low speed mode helps.

> b) If you're reading data then hopefully the system has some sort of
> readahead going on. In the "video playback" case the system should
> have no problem predicting that if you've just read bytes
> 1,000,000,000 - 2,000,000,000 of a file over the last 10 seconds that
> you're likely to keep reading the same file. Presumably it wouldn't
> be totally insane to read 40 MB ahead of time and then we can do the
> same math as a). If 40 MB is too much for readahead, then shrink it
> and redo the math. Even with much smaller numbers the "race to idle"
> wins because gating / ungating clocks is fast. ...or do you know some
> reason why gating / ungating clocks needs to be slow? If so, how
> slow?

I have performed one experiment by increasing the readahead size, but that is not helping, and I don't observe much difference in the data request pattern generated in the video playback case.

>> Test case used are 1080p and 4k video playback use case. Please find
>> below test case information and power impact data. In both the below
>> video playback cases, a considerable amount of power savings can be
>> observed with clock scaling feature.
>>
>> Use cases                           Delta at battery (mA)   Power impact %
>> 30 fps at HD 1080p decode 20 Mbps   10 mA                   11%
>> 30 fps at UHD 8b H.264 42 Mbps      20.93 mA                19%
>
> Numbers like this are exactly what is needed to justify your patch
> series. ...but I'd be super curious how it would compare to:
>
> 1) Tuning the runtime PM auto-suspend delay. If you have your
> auto-suspend delay set wrong (like 500 ms) then all the math above is
> totally wrong. We'll keep clocking at 400 MHz needlessly even though
> there is no data to transfer. If autosuspend is just gating clocks
> then it feels like you could set it to 1 ms, or 10 ms.
> NOTE: if
> autosuspend for you is something more major (fully turning off power
> rails to the eMMC) then maybe you need another level where you just
> turn off the clocks. Seems like we could find some way to make that
> work.

Gating/ungating can be fine-tuned to help bring down power consumption too. I will share power numbers with the tuned parameters in the next communication.

> 2) Tuning any readahead mechanism in your system. If your system
> somehow does zero readahead then obviously all my arguments don't work
> for reads. ...but why would you not have readahead?
>
> 3) Tuning any write buffering in your system. Same argument as #2.

This feature is specific to the video playback use case from a storage device. I am not sure which buffering can be tuned; can you point out any buffering used?

> 4) Making sure that when the MMC card clock is gated that you request
> the lowest voltage level for the controller (and set the controller's
> bus clock to the lowest level since it's not doing anything).
>
> I would also be very interested to know how much of those savings are
> achieved if you keep the voltage to the MMC controller the same. In
> other words do something that arbitrarily keeps the MMC controller
> requesting the same voltage level from the rest of the system and then
> do your power measurements. How much do your savings change?
>
> I will also note that aggressive clock gating is exactly what the
> dw_mmc controller does automatically (except for SDIO, but that's a
> different story). Seeing that the controller itself can stop the
> clock in dw_mmc gives further credence that gating / ungating the
> clock is a very lightweight operation and 1 ms is probably an
> over-estimation of how long it takes.
>
> I guess one very last note is that I spent most of the time above
> arguing that the clock scaling concept is probably not useful for any
> SoCs where you can't adjust the voltage provided to the MMC
> controller.
> That doesn't necessarily mean that your patch series is
> useful for SoCs where you can. Specifically you'd need to do math to
> see how much more power the MMC controller takes at the higher
> voltage. Then you can calculate a "perf per watt". If the watts to
> transfer data at 400 MB/s aren't 8 times more than the watts to
> transfer at 50 MB/s then that's a ding against your idea. You also
> don't have a dedicated voltage rail, right? So you'd have to see what
> percentage of the time the MMC controller was the only thing in the
> system that was requesting the higher voltage. If it happened that
> something else in the system was keeping the voltage higher anyway
> then it would be better for you to run faster and race to idle.
> Really I guess the case you're most worried about is if the MMC
> controller is causing other components in the system to be at a higher
> voltage point (and thus take up more power)...

The rail is not dedicated to the eMMC device, so during video playback it is only the mmc vote that keeps the device at higher voltage corners. For three hours of video playback, the power consumption due to the mmc vote would be quite considerable.

> Hope that all makes sense. I can read the patches themselves if
> needed--everything from above is just based on your cover letter and
> discussion with Ulf. ;-)
>
> -Doug

Thanks,
Ram
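The "perf per watt" break-even described above follows directly from energy = power x time. In this sketch the power figures are made-up placeholders to show the 8x threshold, not measurements from any platform:

```python
# Energy to move a fixed payload, ignoring idle power: E = P * t.
def transfer_energy(power_mw, payload_mb, rate_mb_s):
    """Millijoules spent moving payload_mb at rate_mb_s (mW * s = mJ)."""
    return power_mw * (payload_mb / rate_mb_s)

def race_to_idle_wins(p_fast_mw, p_slow_mw, fast=400, slow=50):
    """True if bursting at `fast` MB/s beats crawling at `slow` MB/s."""
    payload = 1000  # MB; the payload size cancels out of the comparison
    return (transfer_energy(p_fast_mw, payload, fast)
            < transfer_energy(p_slow_mw, payload, slow))

# With an 8x speed ratio, fast wins as long as it costs < 8x the power.
print(race_to_idle_wins(700, 100))  # True: 7x the power at 8x the speed wins
print(race_to_idle_wins(900, 100))  # False: 9x the power at 8x the speed loses
```

This is the arithmetic behind the "aren't 8 times more" test: the speed ratio sets the power ratio at which race-to-idle stops paying off.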
Hi, On Fri, Nov 29, 2019 at 12:10 AM <rampraka@codeaurora.org> wrote: > > > I am slightly baffled about why this would save energy unless it > > allows you to lower the voltage to the controller. I know you _can_ > > sometimes lower the voltage to the controller on Qualcomm parts, but > > you are arguing that it is useful even on systems that can't lower the > > voltage. That feels slightly questionable. I would expect that > > racing to idle (with the right tuning parameters) would be a better > > way to go. > > Sorry, if my explanation was misleading before. MMC driver is not > changing > card/controller voltage but by lowering clock frequency of card and > controller brings down _bus_ and _system_ voltage corners of device > which > helps in saving power consumption. Yes, I understood. You are able to lower the voltage supplied to the controller itself, not to the card. ...but that is not something that all SoCs can do. I think here you are arguing that your feature is useful to everyone--even if they can't ever lower the voltage of the controller. I am questioning that. > > As a specific example, let's imagine we want to transfer 1000 MB of > > data and we have 20 seconds with which to do it. We can achieve this > > by transferring 50 MB/s for the whole 20 seconds. Alternatively, we > > could transfer at 400 MB/s 2.5 seconds and then fully power gate the > > eMMC for the next 17.5 seconds. > > > > In that example, I'd wonder ask is more power efficient. Presumably > > the 2nd. This is the whole "race to idle" concept as I understand it. > > > > The "race to idle" is talked about a lot in the context of CPU > > frequency decisions. Presumably you'll point out that "race to idle" > > is NOT the right thing to do for choosing a CPU frequency. As I > > understand it, this is primarily true because we need to raise the CPU > > voltage to run at faster speeds. 
This would lead me to believe that > > the only case you'd want to do frequency scaling like this is if it > > allows you to lower the voltage provided to the eMMC controller. As > > you said, for Qualcomm it _does_ allow you to do this, but most other > > SoCs don't. ...so unless there's a flaw in my logic (always > > possible!) this patch series should be amended to say it's only useful > > if it allows you to down-volt the controller. > > > > Just to think a little bit more about my logic: of course, you might > > argue that we can't just do a 1000 MB data transfer. We can break > > that down into two cases: > > > > a) For writing, presumably the data is produced over time and you > > don't want to buffer the whole 1000 MB and delay 17.5 seconds before > > you start writing. ...but presumably you could do _some_ buffering > > and then break things into chunks where you ungate the clock to the > > card, write a chunk out, and then re-gate the clock. There will > > obviously be some overhead with each clock gate/ungate, but > > (hopefully) not too much. ...and there will be time when data is in > > RAM and not on the disk so you'd worry about power failure, but if you > > want to get data on the disk ASAP why are you scaling the clock (and > > thus delaying the data from getting to the disk) at all? Maybe some > > math? How long does it take to ungate/gate the clocks. 1 ms? It's > > just flipping a bit, right? ...and does assuming we can allocate a 40 > > MB write buffer seem sane? So we eat 1 ms to ungate, 100 ms to > > transfer 40 MB, 1 ms to gate. Compared to transferring at 50 MB/sec > > (w/ no gating), we'd transfer the same 40 MB in 800 ms. So we either > > keep the clock on at 50 MHz for 800 ms or we keep it on at 200 MHz for > > 102 ms and gate it for 698 ms. > > > > "race to idle" helps but this feature was implemented with focus on > video > playback case, where data transfer request to mmc driver spans over > entire > playback time of video. 
> In this case, running the device in a low-speed mode helps.

It doesn't matter if you've got a long playback. A properly tuned
"race to idle" should still be better unless you are seriously saving a
lot of power by never bringing the voltage up. The above example is
with a fixed size transfer, but just imagine the same thing over and
over again and you've got the video playback case.

> > b) If you're reading data then hopefully the system has some sort of
> > readahead going on. In the "video playback" case the system should
> > have no problem predicting that if you've just read bytes
> > 1,000,000,000 - 2,000,000,000 of a file over the last 10 seconds
> > that you're likely to keep reading the same file. Presumably it
> > wouldn't be totally insane to read 40 MB ahead of time and then we
> > can do the same math as a). If 40 MB is too much for readahead, then
> > shrink it and redo the math. Even with much smaller numbers the
> > "race to idle" wins because gating / ungating clocks is fast. ...or
> > do you know some reason why gating / ungating clocks needs to be
> > slow? If so, how slow?
>
> I have performed one experiment by increasing the readahead size, but
> that is not helping. And I don't observe much difference in the data
> request pattern generated in the video playback case.

You have to make sure that you are proactively gating the card clock in
conjunction. If you aren't gating the card clock quickly enough (and
aren't lowering the controller voltage when the card clock is gated)
then you won't notice the savings.

> >> Test cases used are 1080p and 4k video playback. Please find below
> >> the test case information and power impact data. In both of the
> >> video playback cases below, a considerable amount of power savings
> >> can be observed with the clock scaling feature.
> >>
> >> Use case                            Delta at battery (mA)   Power impact %
> >> 30 fps at HD 1080p decode 20 Mbps   10 mA                   11%
> >> 30 fps at UHD 8b H.264 42 Mbps      20.93 mA                19%
>
> > Numbers like this are exactly what is needed to justify your patch
> > series. ...but I'd be super curious how it would compare to:
> >
> > 1) Tuning the runtime PM auto-suspend delay. If you have your
> > auto-suspend delay set wrong (like 500 ms) then all the math above
> > is totally wrong. We'll keep clocking at 400 MHz needlessly even
> > though there is no data to transfer. If autosuspend is just gating
> > clocks then it feels like you could set it to 1 ms, or 10 ms. NOTE:
> > if autosuspend for you is something more major (fully turning off
> > power rails to the eMMC) then maybe you need another level where you
> > just turn off the clocks. Seems like we could find some way to make
> > that work.
>
> Gating / ungating can be fine-tuned to help bring down power
> consumption too. I will share power numbers with tuned parameters in
> the next communication.

Thanks, I think this is critical for the discussion. Please make sure
that when you are gating, you code it up in a way that you can remove
the vote for the higher controller voltage.

> > 2) Tuning any readahead mechanism in your system. If your system
> > somehow does zero readahead then obviously all my arguments don't
> > work for reads. ...but why would you not have readahead?
> >
> > 3) Tuning any write buffering in your system. Same argument as #2.
>
> This feature is specific to the video playback use case from a storage
> device. Not sure which buffering can be tuned. Can you point out any
> buffering used?

If you're only testing video playback then I guess you don't care so
much here. I was assuming you cared about video record too, but maybe
that's not such an important use case for saving power.
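Doug's point about the autosuspend delay can be made concrete with a
little arithmetic. The sketch below models the total clock-on time for
bursty playback traffic under two delay settings; all numbers (burst
size, burst rate, delay values) are illustrative assumptions, not
measurements from this series.

```python
# Back-of-the-envelope model: the card clock stays on for each data
# burst plus the runtime PM autosuspend delay that follows it.
# Illustrative numbers only.

def clock_on_time_ms(bursts, burst_ms, autosuspend_delay_ms):
    """Total time the clock is ungated across all bursts."""
    return bursts * (burst_ms + autosuspend_delay_ms)

# Suppose playback issues one 100 ms read burst per second for 20 s.
bursts = 20
burst_ms = 100

long_delay = clock_on_time_ms(bursts, burst_ms, 500)   # 500 ms delay
short_delay = clock_on_time_ms(bursts, burst_ms, 10)   # 10 ms delay

print(long_delay)   # 12000 ms of clock-on time in a 20 s window
print(short_delay)  # 2200 ms of clock-on time in the same window
```

Under these assumed numbers, a mistuned 500 ms delay keeps the clock
ungated more than five times as long as a 10 ms delay, which is why the
math above is so sensitive to this parameter.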
> > 4) Making sure that when the MMC card clock is gated you request
> > the lowest voltage level for the controller (and set the
> > controller's bus clock to the lowest level since it's not doing
> > anything).
> >
> > I would also be very interested to know how much of those savings
> > are achieved if you keep the voltage to the MMC controller the same.
> > In other words, do something that arbitrarily keeps the MMC
> > controller requesting the same voltage level from the rest of the
> > system and then do your power measurements. How much do your savings
> > change?
> >
> > I will also note that aggressive clock gating is exactly what the
> > dw_mmc controller does automatically (except for SDIO, but that's a
> > different story). Seeing that the controller itself can stop the
> > clock in dw_mmc gives further credence that gating / ungating the
> > clock is a very lightweight operation and 1 ms is probably an
> > over-estimation of how long it takes.
> >
> > I guess one very last note is that I spent most of the time above
> > arguing that the clock scaling concept is probably not useful for
> > any SoCs where you can't adjust the voltage provided to the MMC
> > controller. That doesn't necessarily mean that your patch series is
> > useful for SoCs where you can. Specifically you'd need to do math to
> > see how much more power the MMC controller takes at the higher
> > voltage. Then you can calculate a "perf per watt". If the watts to
> > transfer data at 400 MB/s aren't 8 times more than the watts to
> > transfer at 50 MB/s then that's a ding against your idea. You also
> > don't have a dedicated voltage rail, right? So you'd have to see
> > what percentage of the time the MMC controller was the only thing in
> > the system that was requesting the higher voltage. If it happened
> > that something else in the system was keeping the voltage higher
> > anyway then it would be better for you to run faster and race to
> > idle.
> > Really I guess the case you're most worried about is if the MMC
> > controller is causing other components in the system to be at a
> > higher voltage point (and thus take up more power)...
>
> The rail is not dedicated to the eMMC device. So during video playback
> it is the mmc votes alone that keep the device in the higher voltage
> corners. So over three hours of video playback, the power consumption
> due to the mmc vote would be quite considerable.

You should be able to remove your vote whenever the card clock is
gated. Maybe this is what you're missing?

-Doug
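The chunked-write arithmetic from Doug's case a) can be written out
directly. The sketch below uses the illustrative numbers from the email
(1 ms to gate or ungate, 40 MB chunks, 400 MB/s vs 50 MB/s) and is a
model of the argument, not a measurement of any real hardware.

```python
# Compare clock-on time per 40 MB chunk: race-to-idle (ungate, burst
# at full rate, re-gate) vs staying ungated at a scaled-down rate.
# All numbers are the illustrative ones from the email thread.

CHUNK_MB = 40
GATE_MS = 1  # assumed cost of one gate or ungate operation

def fast_gated_ms(chunk_mb, rate_mb_per_s):
    """Ungate, transfer the chunk at full speed, gate again."""
    return GATE_MS + chunk_mb / rate_mb_per_s * 1000 + GATE_MS

def slow_ungated_ms(chunk_mb, rate_mb_per_s):
    """Keep the clock running and transfer at the scaled-down rate."""
    return chunk_mb / rate_mb_per_s * 1000

print(fast_gated_ms(CHUNK_MB, 400))   # 102.0 ms of clock-on time
print(slow_ungated_ms(CHUNK_MB, 50))  # 800.0 ms of clock-on time
```

This reproduces the 102 ms vs 800 ms comparison in the email: per
chunk, racing to idle leaves the clock gated for roughly 698 ms that
the scaled-down approach spends with the clock running.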
On Fri, 29 Nov 2019 at 08:34, <rampraka@codeaurora.org> wrote:
>
> Hi Ulf,
>
> Seems there was some settings issue with my Thunderbird application.
> Sorry for the spam; please ignore my last responses, as unsupported
> characters got added.
>
> Typing my response again from the browser and re-sending.
>
> Thanks,
> Ram
>
> On 2019-10-22 14:10, Ulf Hansson wrote:
> > On Mon, 21 Oct 2019 at 16:30, Ram Prakash Gupta
> > <rampraka@codeaurora.org> wrote:
> >>
> >> This change adds the use of devfreq based clock scaling to MMC.
> >> This is applicable for eMMC and SDCard.
> >> For some workloads, such as video playback, it isn't necessary
> >> for these cards to run at high speed. Running at a lower
> >> frequency in such cases can still meet the deadlines for data
> >> transfers.
> >>
> >> Scaling down the clock frequency dynamically has power savings
> >> not only because the bus is running at a lower frequency but also
> >> has the advantage of scaling down the system core voltage, if
> >> supported. Provide ondemand clock scaling support similar
> >> to the cpufreq ondemand governor, having two thresholds,
> >> up_threshold and down_threshold, to decide whether to increase
> >> the frequency or scale it down respectively as per load.
> >
> > This sounds simple, but what the series is doing is far more
> > complicated than scaling the bus clock, as it also re-negotiates the
> > bus speed mode.
> >
> > Each time the triggering point for scaling up/down is hit, a series
> > of commands needs to be sent to the card, including running the
> > tuning procedure. The point is, for sure, this doesn't come for
> > free, both from a latency point of view, but also from an energy
> > cost point of view. So, whether this really improves the behaviour
> > seems very use case sensitive, right!?
>
> Switching modes would incur some latency for sending commands to
> switch modes, but tuning is not needed, as most of the eMMC devices
> used nowadays come with enhanced strobe support, so tuning would not
> add any latency as it is not required in HS400 enhanced strobe mode.
>
> This feature is implemented for the video playback case, where data
> transfer requests are infrequent and where this feature helps with
> saving power consumption.
>
> And when there is a burst of data transfer requests, the load will
> remain _high_, so there won't be any switching, and hence it won't
> affect any existing use cases from a latency point of view.
>
> Also, if the hw supports switching the clk frequency without changing
> mode, I will make the change in code. For this I have sought input
> from the hw team.
>
> From the collected data, I see this feature is helping in saving power
> consumption. And no energy penalty is observed. Please share if I am
> missing anything specific. The power saving using this feature is
> quite good and considerable. Please find the data below.
>
> Use case                            Delta at battery   Power impact
> 30 fps at HD 1080p decode 20 Mbps   10 mA              11%
> 30 fps at UHD 8b H.264 42 Mbps      20.93 mA           19%
>
> > Overall, when it comes to use cases, we have very limited knowledge
> > about them at the mmc block layer level. I think it should remain
> > like that. If at any place at all, this information is better
> > maintained by the generic block layer and potentially the configured
> > I/O scheduler.
>
> I think the generic block layer does not have knowledge of the use
> case behind a data transfer request. And the devfreq framework has
> been used to implement this feature, which should be the same in any
> layer.

Just to be clear, my main concern here is not using the devfreq
framework. It's rather whether switching speed modes is really worth
it, which Doug pointed out as well.

> Also, mobile platforms come mostly with eMMC and UFS as storage media.
> And clock scaling is already implemented in the upstream UFS driver
> using the devfreq framework. Along similar lines, this feature is
> implemented for the mmc driver. So I believe the clk scaling
> implementation is better placed in the mmc driver rather than in the
> generic block layer.

At some point you want to trigger the clock to be scaled up/down,
probably because of reaching some bandwidth threshold. This part seems
like a generic block thing and not an mmc specific thing.

Exactly how that would affect this approach is hard to say, but my
point is that I don't want to sprinkle the mmc subsystem with code
that may belong in upper common layers.

> >
> > This brings me to a question about the tests you have run. Can you
> > share some information and data about that?
>
> Test cases used are 1080p and 4k video playback, as this feature is
> implemented specifically for the video playback use case.

Right. It seems to me that you are optimizing for only one particular
use case. How do you make sure not to increase energy consumption for
other use cases that would gain from running at the highest speed mode
and "race to idle"?

I think you need to take a step back and think more generally about
this approach. More importantly, you need to provide more data to
prove your approach, also according to the suggestions from Doug.

> >
> >>
> >> Ram Prakash Gupta (6):
> >>   mmc: core: Parse clk scaling dt entries
> >>   mmc: core: Add core scaling support in driver
> >>   mmc: core: Initialize clk scaling for mmc and SDCard
> >>   mmc: core: Add debugfs entries for scaling support
> >>   mmc: sdhci-msm: Add capability in platform host
> >>   dt-bindings: mmc: sdhci-msm: Add clk scaling dt parameters
> >>
> >>  .../devicetree/bindings/mmc/sdhci-msm.txt |  19 +
> >
> > I noticed that the DT patch was put last in the series, but the
> > parsing is implemented in the first patch. Please flip this around.
> > If you want to implement DT parsing of new bindings, please make
> > sure to discuss the new bindings first.
>
> I will update this in the next post.
>
> >>  drivers/mmc/core/block.c     |  19 +-
> >>  drivers/mmc/core/core.c      | 777 +++++++++++++++++++++
> >>  drivers/mmc/core/core.h      |  17 +
> >>  drivers/mmc/core/debugfs.c   |  90 +++
> >>  drivers/mmc/core/host.c      | 226 ++++++
> >>  drivers/mmc/core/mmc.c       | 246 ++++++-
> >>  drivers/mmc/core/queue.c     |   2 +
> >>  drivers/mmc/core/sd.c        |  84 ++-
> >>  drivers/mmc/host/sdhci-msm.c |   2 +
> >>  include/linux/mmc/card.h     |   7 +
> >>  include/linux/mmc/host.h     |  66 ++
> >>  12 files changed, 1550 insertions(+), 5 deletions(-)
> >
> > This is a lot of new code in the mmc core, which I would then need
> > to maintain, of course. I have to admit, I am a bit concerned about
> > that, so you have to convince me that there are good reasons for me
> > to apply this.
> >
> > As I stated above, I think the approach looks quite questionable in
> > general. And even if you can share measurements that it improves the
> > behaviour, I suspect (without a deeper code review) that some of the
> > code better belongs in the common block device layer.
>
> From the collected power data, I see this as a good reason to have
> this feature in the mmc driver, as the number is quite considerable.
>
> For the approach, it would be helpful if you share your inputs
> regarding it. And as you have stated, this can be further discussed
> after a review from you.

Besides my comments above, Doug has provided you with valuable
feedback. Please follow up on that as well, if you still intend to
pursue this approach.

Kind regards
Uffe
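For readers following the thread, the ondemand scheme the cover letter
describes (two thresholds, up_threshold and down_threshold, deciding
whether to raise or lower the frequency based on load) can be sketched
in a few lines. The threshold values and frequency table below are
hypothetical placeholders, not taken from the patches.

```python
# Minimal sketch of an ondemand-style scaling decision: compare the
# measured bus load against up/down thresholds and step through a
# fixed frequency table. Hypothetical values, for illustration only.

UP_THRESHOLD = 80     # % load above which we scale up
DOWN_THRESHOLD = 20   # % load below which we scale down
FREQS_HZ = [50_000_000, 100_000_000, 200_000_000]

def next_freq(load_pct, cur_freq_hz):
    """Pick the next bus frequency from the current load sample."""
    i = FREQS_HZ.index(cur_freq_hz)
    if load_pct > UP_THRESHOLD and i < len(FREQS_HZ) - 1:
        return FREQS_HZ[i + 1]   # busy: step up one level
    if load_pct < DOWN_THRESHOLD and i > 0:
        return FREQS_HZ[i - 1]   # idle: step down one level
    return cur_freq_hz           # between thresholds: hold

print(next_freq(90, 50_000_000))   # 100000000: scale up
print(next_freq(10, 200_000_000))  # 100000000: scale down
print(next_freq(50, 100_000_000))  # 100000000: hold
```

Note that this only captures the frequency decision; as the thread
points out, the costly part in the actual series is that each step may
also re-negotiate the bus speed mode with the card.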