Message ID | 20230527132747.83196-1-frank@oltmanns.dev (mailing list archive) |
---|---|
Headers | show |
Series | clk: sunxi-ng: Optimize rate selection for NKM clocks | expand |
Hi Frank, On Sat, May 27, 2023 at 03:27:44PM +0200, Frank Oltmanns wrote: > I would like to bring your attention to the current process of setting > the rate of an NKM clock. As it stands, when setting the rate of an > NKM clock, the rate nearest but less than or equal to the requested > rate is found, instead of the nearest rate. Yeah, it's actually pretty common, see clk_mux_determine_rate_flags() for example. Some devices require that we don't overshoot, while some prefer to have the closest rate. Both are fine, and it's a bit context specific which one we should favour. If we were to do anything, it would be to support both and let the clock driver select which behaviour it wants. > Moreover, ccu_nkm_find_best() is called multiple times (footnote [1]) > when setting a rate, each time iterating over all combinations of n, > k, and m. Yeah, that's expected as well. > In response to this, I propose the following refinements to optimize the NKM > clock setting: > a. when finding the best rate use the nearest rate, even if it is greater than > the requested rate (PATCH 1) > b. utilize binary search to find the best rate by going through a > precalculated, ordered list of all meaningful combinations of n, k, and m > (PATCH 2) One thing you haven't really addressed is why we would be doing this? Is there some clocks that require a more precise clock and don't? Is the factor calculation a bottleneck for some workloads? Clocks in general are very regression-prone, so I'd rather be a bit conservative there, and "if it ain't broke, don't fix it". Maxime
Hi Maxime, On 2023-05-31 at 15:48:43 +0200, Maxime Ripard <mripard@kernel.org> wrote: > [[PGP Signed Part:Undecided]] > Hi Frank, > > On Sat, May 27, 2023 at 03:27:44PM +0200, Frank Oltmanns wrote: >> I would like to bring your attention to the current process of setting >> the rate of an NKM clock. As it stands, when setting the rate of an >> NKM clock, the rate nearest but less than or equal to the requested >> rate is found, instead of the nearest rate. > > Yeah, it's actually pretty common, see clk_mux_determine_rate_flags() > for example. Some devices require that we don't overshoot, while some > prefer to have the closest rate. > > Both are fine, and it's a bit context specific which one we should > favour. If we were to do anything, it would be to support both and let > the clock driver select which behaviour it wants. > Ok, understood. Thank you for the explanation! Now, I'm wondering if anyone would be using such a flag, if I added it. > >> Moreover, ccu_nkm_find_best() is called multiple times (footnote [1]) >> when setting a rate, each time iterating over all combinations of n, >> k, and m. > > Yeah, that's expected as well. I'm wondering though, if iterating over all combinations is set in stone, or if some kind of optimization would be in order. > >> In response to this, I propose the following refinements to optimize the NKM >> clock setting: >> a. when finding the best rate use the nearest rate, even if it is greater than >> the requested rate (PATCH 1) >> b. utilize binary search to find the best rate by going through a >> precalculated, ordered list of all meaningful combinations of n, k, and m >> (PATCH 2) > > One thing you haven't really addressed is why we would be doing this? Is > there some clocks that require a more precise clock and don't? Is the > factor calculation a bottleneck for some workloads? Background ========== I'm a pinephone user (ccu-sun50i-a64). I'm using U-Boot which sets the pll-video0 to 294 MHz on boot. The phone's panel requires DCLK to run at 108 MHz to get a nice 60 Hz vertical refresh rate. The clock structure is this: clock clock type -------------------------------------- pll-video0 ccu_nm pll-mipi ccu_nkm tcon0 ccu_mux tcon-data-clock sun4i_dclk The divider between tcon0 and tcon-data-clock is fixed at 4. So, I need pll-mipi to run at 432 MHz to get the desired vertical refresh rate. When pll-vdeo0 is at 294 MHz this is that rate cannot be matched exactly with any combination. The best we can get is 431.2 MHz (n=11, k=2, m=15). The pinephone has some "vendor" patches (megi kernel) that a. add HDMI b. allow re-setting pll-mipi's rate when pll-video0 changes Re: Who needs a more precise clock? =================================== When plugging in HDMI, pll-video's rate is set to 297 MHz, which - in the vendor kernel, not mainline - triggers recalculation of pll-mipi (trying to set it to 431.2 MHz). It ends up with a rate of 424.285714 MHz, because this is the nearest, but less than 431.2 MHz (n=5, k=2, m=7). The nearest rate would be 432 MHz. So, while analyzing the whole situation that I described above, I found out that the NKM clocks are not set to the closest rate and wondered why that is. Hence my request for comments. Now, one could argue that pll-video0 should be set to 297MHz at boot or that pll-mipi should try to set the *requested* rate instead of the previous rate when the pll-video0 changes. And I think that both are valid or even better approaches than my proposal in this RFC to address this specific problem and I'll probably sent patches to discuss this as well. Re: Why speed up factor calculation? ==================================== I'm not aware that the current implementation of calculating n, k, and m poses a bottleneck in any situation. Again, while going through the code, I wondered why not save a few CPU cycles by precalculating the meaningful combinations. In my opinion, it does not have any side effects, so we might as well do it. (There is of course the side effect of using a higher rate, but this is unrelated to precalculation as I could as well employ a rate comparison that only allows lower rates, or only optionally higher rates.) > Clocks in general are very regression-prone, so I'd rather be a bit > conservative there, and "if it ain't broke, don't fix it". Sure, I get that. As I stated in my cover letter: "The motivation for these proposed changes lies in the current behavior of rate selection for NKM clocks, which doesn't observe the CLK_SET_RATE_PARENT flag. I.e. it does not select a different rate for the parent clock to find the optimal rate." I thought that this required this optimization to be implemented, but by now, I'm no longer sure. I'll probably continue investigating different paths for CLK_SET_RATE_PARENT for NKM clocks and follow up with new findings. Thanks, Frank > > Maxime > > [[End of PGP Signed Part]]
Dne četrtek, 01. junij 2023 ob 07:16:45 CEST je Frank Oltmanns napisal(a): > Hi Maxime, > > On 2023-05-31 at 15:48:43 +0200, Maxime Ripard <mripard@kernel.org> wrote: > > [[PGP Signed Part:Undecided]] > > Hi Frank, > > > > On Sat, May 27, 2023 at 03:27:44PM +0200, Frank Oltmanns wrote: > >> I would like to bring your attention to the current process of setting > >> the rate of an NKM clock. As it stands, when setting the rate of an > >> NKM clock, the rate nearest but less than or equal to the requested > >> rate is found, instead of the nearest rate. > > > > Yeah, it's actually pretty common, see clk_mux_determine_rate_flags() > > for example. Some devices require that we don't overshoot, while some > > prefer to have the closest rate. > > > > Both are fine, and it's a bit context specific which one we should > > favour. If we were to do anything, it would be to support both and let > > the clock driver select which behaviour it wants. > > > > Ok, understood. Thank you for the explanation! Now, I'm wondering if > anyone would be using such a flag, if I added it. > > > > >> Moreover, ccu_nkm_find_best() is called multiple times (footnote [1]) > >> when setting a rate, each time iterating over all combinations of n, > >> k, and m. > > > > Yeah, that's expected as well. > > I'm wondering though, if iterating over all combinations is set in > stone, or if some kind of optimization would be in order. > > > > >> In response to this, I propose the following refinements to optimize the NKM > >> clock setting: > >> a. when finding the best rate use the nearest rate, even if it is greater than > >> the requested rate (PATCH 1) > >> b. utilize binary search to find the best rate by going through a > >> precalculated, ordered list of all meaningful combinations of n, k, and m > >> (PATCH 2) > > > > One thing you haven't really addressed is why we would be doing this? Is > > there some clocks that require a more precise clock and don't? Is the > > factor calculation a bottleneck for some workloads? > > Background > ========== > I'm a pinephone user (ccu-sun50i-a64). I'm using U-Boot which sets the > pll-video0 to 294 MHz on boot. The phone's panel requires DCLK to run at > 108 MHz to get a nice 60 Hz vertical refresh rate. The clock structure > is this: > > clock clock type > -------------------------------------- > pll-video0 ccu_nm > pll-mipi ccu_nkm > tcon0 ccu_mux > tcon-data-clock sun4i_dclk > > The divider between tcon0 and tcon-data-clock is fixed at 4. So, I need > pll-mipi to run at 432 MHz to get the desired vertical refresh rate. > When pll-vdeo0 is at 294 MHz this is that rate cannot be matched exactly > with any combination. The best we can get is 431.2 MHz (n=11, k=2, > m=15). > > The pinephone has some "vendor" patches (megi kernel) that > a. add HDMI > b. allow re-setting pll-mipi's rate when pll-video0 changes > > Re: Who needs a more precise clock? > =================================== > When plugging in HDMI, pll-video's rate is set to 297 MHz, which - in > the vendor kernel, not mainline - triggers recalculation of pll-mipi > (trying to set it to 431.2 MHz). It ends up with a rate of 424.285714 > MHz, because this is the nearest, but less than 431.2 MHz (n=5, k=2, > m=7). The nearest rate would be 432 MHz. > > So, while analyzing the whole situation that I described above, I found > out that the NKM clocks are not set to the closest rate and wondered why > that is. Hence my request for comments. > > Now, one could argue that pll-video0 should be set to 297MHz at boot or > that pll-mipi should try to set the *requested* rate instead of the > previous rate when the pll-video0 changes. And I think that both are > valid or even better approaches than my proposal in this RFC to address > this specific problem and I'll probably sent patches to discuss this as > well. Ideally, clock rate setting code should be immune on "initial" values, set by bootloader or default values. If it's not, then it should be improved in the way that it is. > > > Re: Why speed up factor calculation? > ==================================== > I'm not aware that the current implementation of calculating n, k, and m > poses a bottleneck in any situation. Again, while going through the > code, I wondered why not save a few CPU cycles by precalculating the > meaningful combinations. In my opinion, it does not have any side > effects, so we might as well do it. (There is of course the side effect > of using a higher rate, but this is unrelated to precalculation as I > could as well employ a rate comparison that only allows lower rates, or > only optionally higher rates.) > > > Clocks in general are very regression-prone, so I'd rather be a bit > > conservative there, and "if it ain't broke, don't fix it". > > Sure, I get that. > > As I stated in my cover letter: > "The motivation for these proposed changes lies in the current behavior > of rate selection for NKM clocks, which doesn't observe the > CLK_SET_RATE_PARENT flag. I.e. it does not select a different rate for > the parent clock to find the optimal rate." > > I thought that this required this optimization to be implemented, but by > now, I'm no longer sure. I'll probably continue investigating different > paths for CLK_SET_RATE_PARENT for NKM clocks and follow up with new > findings. Let's leave out any optimizations that are not apparently needed. Most clock rates are set only once at boot and others, like video clocks, not that often, so a suboptimal code speed doesn't hurt currently. Best regards, Jernej > > Thanks, > Frank > > > > > Maxime > > > > [[End of PGP Signed Part]] >
Hi, On Thu, Jun 01, 2023 at 07:16:45AM +0200, Frank Oltmanns wrote: > On 2023-05-31 at 15:48:43 +0200, Maxime Ripard <mripard@kernel.org> wrote: > > [[PGP Signed Part:Undecided]] > > Hi Frank, > > > > On Sat, May 27, 2023 at 03:27:44PM +0200, Frank Oltmanns wrote: > >> I would like to bring your attention to the current process of setting > >> the rate of an NKM clock. As it stands, when setting the rate of an > >> NKM clock, the rate nearest but less than or equal to the requested > >> rate is found, instead of the nearest rate. > > > > Yeah, it's actually pretty common, see clk_mux_determine_rate_flags() > > for example. Some devices require that we don't overshoot, while some > > prefer to have the closest rate. > > > > Both are fine, and it's a bit context specific which one we should > > favour. If we were to do anything, it would be to support both and let > > the clock driver select which behaviour it wants. > > > > Ok, understood. Thank you for the explanation! Now, I'm wondering if > anyone would be using such a flag, if I added it. I guess that's another thing :) If no-one is going to use it, why should we do it in the first place? But most likely the display and audio clocks are usually fairly ok with overshooting a bit, and a closest rate is usually better. > >> Moreover, ccu_nkm_find_best() is called multiple times (footnote [1]) > >> when setting a rate, each time iterating over all combinations of n, > >> k, and m. > > > > Yeah, that's expected as well. > > I'm wondering though, if iterating over all combinations is set in > stone, or if some kind of optimization would be in order. The thing with optimization is that you need to optimize for something. So you need to show that this code is suboptimal (by whatever metric you want to optimize for), and that your code is more optimal that it used to be. It means identifying a problem, writing benchmarks, and showing that the new code performs better there. For example, your code might very well be faster, but it will increase the kernel image (and thus the RAM usage). One is not more optimal than the other in absolute, they both are, using a different metric. And then you have the more social factors that play a huge part too: readibility, maintainability, etc. > >> In response to this, I propose the following refinements to optimize the NKM > >> clock setting: > >> a. when finding the best rate use the nearest rate, even if it is greater than > >> the requested rate (PATCH 1) > >> b. utilize binary search to find the best rate by going through a > >> precalculated, ordered list of all meaningful combinations of n, k, and m > >> (PATCH 2) > > > > One thing you haven't really addressed is why we would be doing this? Is > > there some clocks that require a more precise clock and don't? Is the > > factor calculation a bottleneck for some workloads? > > Background > ========== > I'm a pinephone user (ccu-sun50i-a64). I'm using U-Boot which sets the > pll-video0 to 294 MHz on boot. The phone's panel requires DCLK to run at > 108 MHz to get a nice 60 Hz vertical refresh rate. The clock structure > is this: > > clock clock type > -------------------------------------- > pll-video0 ccu_nm > pll-mipi ccu_nkm > tcon0 ccu_mux > tcon-data-clock sun4i_dclk > > The divider between tcon0 and tcon-data-clock is fixed at 4. So, I need > pll-mipi to run at 432 MHz to get the desired vertical refresh rate. > When pll-vdeo0 is at 294 MHz this is that rate cannot be matched exactly > with any combination. The best we can get is 431.2 MHz (n=11, k=2, > m=15). > > The pinephone has some "vendor" patches (megi kernel) that > a. add HDMI > b. allow re-setting pll-mipi's rate when pll-video0 changes > > Re: Who needs a more precise clock? > =================================== > When plugging in HDMI, pll-video's rate is set to 297 MHz, which - in > the vendor kernel, not mainline - triggers recalculation of pll-mipi > (trying to set it to 431.2 MHz). It ends up with a rate of 424.285714 > MHz, because this is the nearest, but less than 431.2 MHz (n=5, k=2, > m=7). The nearest rate would be 432 MHz. > > So, while analyzing the whole situation that I described above, I found > out that the NKM clocks are not set to the closest rate and wondered why > that is. Hence my request for comments. It all makes sense, but I'm not sure why it requires a complete rewrite of the factor calculation algo? > Now, one could argue that pll-video0 should be set to 297MHz at boot Not really no, we should strive to be as immune as possible from the boot state. > or that pll-mipi should try to set the *requested* rate instead of the > previous rate when the pll-video0 changes. It's not clear to me what is the distinction you make here between the requested rate and the previous rate? Do you mean that you have two clk_set_rate in sequence, with the first one on pll-mipi (or one of its child clocks), and the second one on pll-video0 (but on a different sub-tree than pll-mipi) and thus pll-mipi has its rate changed by the second, but doesn't match the previous rate enforced? If so, that's kind of expected: clk_set_rate doesn't provide any warranty on for how long the rate is going to stay there. There's two way to prevent that. Either we call clk_set_rate_exclusive (on the first) instead, but it will effectively lock a clock subtree and prevent any rate change which is a pretty big constraint. Or you add a notifier to adjust to the parent clock rate change and still provide the same output rate, with a different set of parameters. Maxime
On Thu, Jun 01, 2023 at 09:41:30PM +0200, Jernej Škrabec wrote: > Dne četrtek, 01. junij 2023 ob 07:16:45 CEST je Frank Oltmanns napisal(a): > > Re: Why speed up factor calculation? > > ==================================== > > I'm not aware that the current implementation of calculating n, k, and m > > poses a bottleneck in any situation. Again, while going through the > > code, I wondered why not save a few CPU cycles by precalculating the > > meaningful combinations. In my opinion, it does not have any side > > effects, so we might as well do it. (There is of course the side effect > > of using a higher rate, but this is unrelated to precalculation as I > > could as well employ a rate comparison that only allows lower rates, or > > only optionally higher rates.) > > > > > Clocks in general are very regression-prone, so I'd rather be a bit > > > conservative there, and "if it ain't broke, don't fix it". > > > > Sure, I get that. > > > > As I stated in my cover letter: > > "The motivation for these proposed changes lies in the current behavior > > of rate selection for NKM clocks, which doesn't observe the > > CLK_SET_RATE_PARENT flag. I.e. it does not select a different rate for > > the parent clock to find the optimal rate." > > > > I thought that this required this optimization to be implemented, but by > > now, I'm no longer sure. I'll probably continue investigating different > > paths for CLK_SET_RATE_PARENT for NKM clocks and follow up with new > > findings. > > Let's leave out any optimizations that are not apparently needed. Most clock > rates are set only once at boot and others, like video clocks, not that often, > so a suboptimal code speed doesn't hurt currently. I'm not even sure we can make that assumption for video clocks. We might for a panel, but for a more "dynamic" output like HDMI all bets are off and depending on the monitor, the user settings and the userspace stack we can definitely expect the video clock to change quite frequently. Maxime
Hi Maxime, On 2023-06-02 at 09:31:59 +0200, Maxime Ripard <mripard@kernel.org> wrote: > [[PGP Signed Part:Undecided]] > Hi, > > On Thu, Jun 01, 2023 at 07:16:45AM +0200, Frank Oltmanns wrote: >> On 2023-05-31 at 15:48:43 +0200, Maxime Ripard <mripard@kernel.org> wrote: >> > [[PGP Signed Part:Undecided]] >> > Hi Frank, >> > >> > On Sat, May 27, 2023 at 03:27:44PM +0200, Frank Oltmanns wrote: >> >> I would like to bring your attention to the current process of setting >> >> the rate of an NKM clock. As it stands, when setting the rate of an >> >> NKM clock, the rate nearest but less than or equal to the requested >> >> rate is found, instead of the nearest rate. >> > >> > Yeah, it's actually pretty common, see clk_mux_determine_rate_flags() >> > for example. Some devices require that we don't overshoot, while some >> > prefer to have the closest rate. >> > >> > Both are fine, and it's a bit context specific which one we should >> > favour. If we were to do anything, it would be to support both and let >> > the clock driver select which behaviour it wants. >> > >> >> Ok, understood. Thank you for the explanation! Now, I'm wondering if >> anyone would be using such a flag, if I added it. > > I guess that's another thing :) If no-one is going to use it, why should > we do it in the first place? > > But most likely the display and audio clocks are usually fairly ok with > overshooting a bit, and a closest rate is usually better. Ok, I dived a bit deeper into this, but, as far as I can tell, the closest rate is not used anywhere in the sunxi-ng ccu driver. So, when extending, e.g., the NM or NKM clock to support, one must also extend at least the mux clocks, because they expect rates less than the requested rate. That seems to be quite the undertaking for only a small gain in precision. >> >> Moreover, ccu_nkm_find_best() is called multiple times (footnote [1]) >> >> when setting a rate, each time iterating over all combinations of n, >> >> k, and m. >> > >> > Yeah, that's expected as well. >> >> I'm wondering though, if iterating over all combinations is set in >> stone, or if some kind of optimization would be in order. > > The thing with optimization is that you need to optimize for something. > So you need to show that this code is suboptimal (by whatever metric you > want to optimize for), and that your code is more optimal that it used > to be. > > It means identifying a problem, writing benchmarks, and showing that the > new code performs better there. > > For example, your code might very well be faster, but it will increase > the kernel image (and thus the RAM usage). One is not more optimal than > the other in absolute, they both are, using a different metric. Sure, I get that. I'll submit a patchset that adds the functionality to NKM clocks to set the rate of their parents. With the new patchset, the time for, e.g. setting DCLK increases from ~0.5 ms to a whopping 30 - 37 ms. Those times were taken unscientifically on my pinephone, i.e. kernel logging and a couple of re-boots. But I think that still might give an idea of why I thought about the need to increase performance. The reason for this massive increase is, that the patch iterates over all combinations of NKM for pll-mipi, and for each combination it iterates over all combinations of NM for pll-video0. Nevertheless, following your and Jernej's advice, I'll submit the patchset first and then we can discuss if speed optimizations are needed and what cost is acceptable. > And then you have the more social factors that play a huge part too: > readibility, maintainability, etc. > >> >> In response to this, I propose the following refinements to optimize the NKM >> >> clock setting: >> >> a. when finding the best rate use the nearest rate, even if it is greater than >> >> the requested rate (PATCH 1) >> >> b. utilize binary search to find the best rate by going through a >> >> precalculated, ordered list of all meaningful combinations of n, k, and m >> >> (PATCH 2) >> > >> > One thing you haven't really addressed is why we would be doing this? Is >> > there some clocks that require a more precise clock and don't? Is the >> > factor calculation a bottleneck for some workloads? >> >> Background >> ========== >> I'm a pinephone user (ccu-sun50i-a64). I'm using U-Boot which sets the >> pll-video0 to 294 MHz on boot. The phone's panel requires DCLK to run at >> 108 MHz to get a nice 60 Hz vertical refresh rate. The clock structure >> is this: >> >> clock clock type >> -------------------------------------- >> pll-video0 ccu_nm >> pll-mipi ccu_nkm >> tcon0 ccu_mux >> tcon-data-clock sun4i_dclk >> >> The divider between tcon0 and tcon-data-clock is fixed at 4. So, I need >> pll-mipi to run at 432 MHz to get the desired vertical refresh rate. >> When pll-vdeo0 is at 294 MHz this is that rate cannot be matched exactly >> with any combination. The best we can get is 431.2 MHz (n=11, k=2, >> m=15). >> >> The pinephone has some "vendor" patches (megi kernel) that >> a. add HDMI >> b. allow re-setting pll-mipi's rate when pll-video0 changes >> >> Re: Who needs a more precise clock? >> =================================== >> When plugging in HDMI, pll-video's rate is set to 297 MHz, which - in >> the vendor kernel, not mainline - triggers recalculation of pll-mipi >> (trying to set it to 431.2 MHz). It ends up with a rate of 424.285714 >> MHz, because this is the nearest, but less than 431.2 MHz (n=5, k=2, >> m=7). The nearest rate would be 432 MHz. >> >> So, while analyzing the whole situation that I described above, I found >> out that the NKM clocks are not set to the closest rate and wondered why >> that is. Hence my request for comments. > > It all makes sense, but I'm not sure why it requires a complete rewrite > of the factor calculation algo? You are absolutely right! It does not! Here, I was talking about the reasons why a more precise clock might be desirable (PATCH 1), which touches very few lines. I tried to explain the reasons for the algo change further below in the "Re: Why speed up factor calculation?" part of my mail that was cut off in your response. > >> Now, one could argue that pll-video0 should be set to 297MHz at boot > > Not really no, we should strive to be as immune as possible from the > boot state. Agreed. >> or that pll-mipi should try to set the *requested* rate instead of the >> previous rate when the pll-video0 changes. > > It's not clear to me what is the distinction you make here between the > requested rate and the previous rate? This is quite a de-tour from the original discussion, so I'm sorry for the confusion. By requested rate I mean the rate that the user (DCLK) requested. But this is not necessarily the rate that the clock is using in the end, because of its parent's rate. So, when the pll-video0 changes rate from 294 MHz to 297MHz (upon plugging in HDMI), pll-mipi does not know any longer what the requested rate (let's say 432MHz) was. It only knows the rate it had before the rate change of pll-video0 (431.2MHz in that case). Now with the new parent rate, it tries to find a new rate that's close to (but less than) 431.2 MHz instead of 432 MHz. And since with the new clock rate of 297 MHz for pll-video0, it cannot set to 431.2 MHz, so it rounds down again (to 424.285714 MHz). That means it rounded down twice and is now quite far from the originally requested 432 MHz. So, whenever the parent rate changes, we always round down and move further away from the rate that the user originally requested for our clock. This could be mitigated by storing the requested rate (i.e. 432 MHz instead of 431.2 MHz). However, I don't know if that's possible. I'm simply stating my observations. No call for action is implied in that statement. > Do you mean that you have two clk_set_rate in sequence, with the first > one on pll-mipi (or one of its child clocks), and the second one on > pll-video0 (but on a different sub-tree than pll-mipi) and thus pll-mipi > has its rate changed by the second, but doesn't match the previous rate > enforced? > > If so, that's kind of expected: clk_set_rate doesn't provide any > warranty on for how long the rate is going to stay there. There's two > way to prevent that. Either we call clk_set_rate_exclusive (on the > first) instead, but it will effectively lock a clock subtree and prevent > any rate change which is a pretty big constraint. This is actually the current constraint in mainline. sun40i_tcon locks pll-video0's tree. > Or you add a notifier > to adjust to the parent clock rate change and still provide the same > output rate, with a different set of parameters. This is what the "vendor" kernel (megi kernel) tries to do. It patches the locking away and uses a notifier to react to pll-video0's rate changes. With the caveat that it does not save the requested rate, which I tried to explain above. Anyhow, thank you for this detailed discussion! I really appreciate it! Let me send my new patchset and see what everyone thinks. Thanks, Frank > > Maxime > > [[End of PGP Signed Part]]
Hi Jernej, hi Maxime, On 2023-06-02 at 09:34:03 +0200, Maxime Ripard <mripard@kernel.org> wrote: > [[PGP Signed Part:Undecided]] > On Thu, Jun 01, 2023 at 09:41:30PM +0200, Jernej Škrabec wrote: >> Dne četrtek, 01. junij 2023 ob 07:16:45 CEST je Frank Oltmanns napisal(a): >> > Re: Why speed up factor calculation? >> > ==================================== >> > I'm not aware that the current implementation of calculating n, k, and m >> > poses a bottleneck in any situation. Again, while going through the >> > code, I wondered why not save a few CPU cycles by precalculating the >> > meaningful combinations. In my opinion, it does not have any side >> > effects, so we might as well do it. (There is of course the side effect >> > of using a higher rate, but this is unrelated to precalculation as I >> > could as well employ a rate comparison that only allows lower rates, or >> > only optionally higher rates.) >> > >> > > Clocks in general are very regression-prone, so I'd rather be a bit >> > > conservative there, and "if it ain't broke, don't fix it". >> > >> > Sure, I get that. >> > >> > As I stated in my cover letter: >> > "The motivation for these proposed changes lies in the current behavior >> > of rate selection for NKM clocks, which doesn't observe the >> > CLK_SET_RATE_PARENT flag. I.e. it does not select a different rate for >> > the parent clock to find the optimal rate." >> > >> > I thought that this required this optimization to be implemented, but by >> > now, I'm no longer sure. I'll probably continue investigating different >> > paths for CLK_SET_RATE_PARENT for NKM clocks and follow up with new >> > findings. >> >> Let's leave out any optimizations that are not apparently needed. Most clock >> rates are set only once at boot and others, like video clocks, not that often, >> so a suboptimal code speed doesn't hurt currently. > > I'm not even sure we can make that assumption for video clocks. We might > for a panel, but for a more "dynamic" output like HDMI all bets are off > and depending on the monitor, the user settings and the userspace stack > we can definitely expect the video clock to change quite frequently. Thank you both for your valuable feedback! The goal I head in mind was adjusting pll-video0's clock when setting DCLK on Allwinner A64. And you're both right, I got sidetracked by premature optimizations. As I wrote elsewhere in this thread, I will submit a patchset for the original goal and we can discuss potential needs for optimization there. Thanks, Frank > > Maxime > > [[End of PGP Signed Part]]
On Mon, Jun 05, 2023 at 12:31:41PM +0200, Frank Oltmanns wrote: > Hi Maxime, > > On 2023-06-02 at 09:31:59 +0200, Maxime Ripard <mripard@kernel.org> wrote: > > [[PGP Signed Part:Undecided]] > > Hi, > > > > On Thu, Jun 01, 2023 at 07:16:45AM +0200, Frank Oltmanns wrote: > >> On 2023-05-31 at 15:48:43 +0200, Maxime Ripard <mripard@kernel.org> wrote: > >> > [[PGP Signed Part:Undecided]] > >> > Hi Frank, > >> > > >> > On Sat, May 27, 2023 at 03:27:44PM +0200, Frank Oltmanns wrote: > >> >> I would like to bring your attention to the current process of setting > >> >> the rate of an NKM clock. As it stands, when setting the rate of an > >> >> NKM clock, the rate nearest but less than or equal to the requested > >> >> rate is found, instead of the nearest rate. > >> > > >> > Yeah, it's actually pretty common, see clk_mux_determine_rate_flags() > >> > for example. Some devices require that we don't overshoot, while some > >> > prefer to have the closest rate. > >> > > >> > Both are fine, and it's a bit context specific which one we should > >> > favour. If we were to do anything, it would be to support both and let > >> > the clock driver select which behaviour it wants. > >> > > >> > >> Ok, understood. Thank you for the explanation! Now, I'm wondering if > >> anyone would be using such a flag, if I added it. > > > > I guess that's another thing :) If no-one is going to use it, why should > > we do it in the first place? > > > > But most likely the display and audio clocks are usually fairly ok with > > overshooting a bit, and a closest rate is usually better. > > Ok, I dived a bit deeper into this, but, as far as I can tell, the > closest rate is not used anywhere in the sunxi-ng ccu driver. So, when > extending, e.g., the NM or NKM clock to support, one must also extend at > least the mux clocks, because they expect rates less than the requested > rate. That seems to be quite the undertaking for only a small gain in > precision. mux clocks are using __clk_mux_determine_rate which should have the behaviour you want when CLK_MUX_ROUND_CLOSEST is set. > >> >> Moreover, ccu_nkm_find_best() is called multiple times (footnote [1]) > >> >> when setting a rate, each time iterating over all combinations of n, > >> >> k, and m. > >> > > >> > Yeah, that's expected as well. > >> > >> I'm wondering though, if iterating over all combinations is set in > >> stone, or if some kind of optimization would be in order. > > > > The thing with optimization is that you need to optimize for something. > > So you need to show that this code is suboptimal (by whatever metric you > > want to optimize for), and that your code is more optimal that it used > > to be. > > > > It means identifying a problem, writing benchmarks, and showing that the > > new code performs better there. > > > > For example, your code might very well be faster, but it will increase > > the kernel image (and thus the RAM usage). One is not more optimal than > > the other in absolute, they both are, using a different metric. > > Sure, I get that. I'll submit a patchset that adds the functionality to > NKM clocks to set the rate of their parents. > > With the new patchset, the time for, e.g. setting DCLK increases from > ~0.5 ms to a whopping 30 - 37 ms. Those times were taken > unscientifically on my pinephone, i.e. kernel logging and a couple of > re-boots. But I think that still might give an idea of why I thought > about the need to increase performance. > > The reason for this massive increase is, that the patch iterates over > all combinations of NKM for pll-mipi, and for each combination it > iterates over all combinations of NM for pll-video0. > > Nevertheless, following your and Jernej's advice, I'll submit the > patchset first and then we can discuss if speed optimizations are needed > and what cost is acceptable. Honestly, for 40ms, it will be a hard sell :) > >> or that pll-mipi should try to set the *requested* rate instead of the > >> previous rate when the pll-video0 changes. > > > > It's not clear to me what is the distinction you make here between the > > requested rate and the previous rate? > > This is quite a de-tour from the original discussion, so I'm sorry for > the confusion. > > By requested rate I mean the rate that the user (DCLK) requested. But > this is not necessarily the rate that the clock is using in the end, > because of its parent's rate. > > So, when the pll-video0 changes rate from 294 MHz to 297MHz (upon > plugging in HDMI), pll-mipi does not know any longer what the requested > rate (let's say 432MHz) was. It does, it's struct clk_core's req_rate. It doesn't look like it's available to clk_hw users, but given the rest of your explanation, I guess you have a compelling use case to make it available. Maxime
Hi Maxime, On 2023-06-06 at 16:02:58 +0200, Maxime Ripard <mripard@kernel.org> wrote: > [[PGP Signed Part:Undecided]] > On Mon, Jun 05, 2023 at 12:31:41PM +0200, Frank Oltmanns wrote: >> Hi Maxime, >> >> On 2023-06-02 at 09:31:59 +0200, Maxime Ripard <mripard@kernel.org> wrote: >> > [[PGP Signed Part:Undecided]] >> > Hi, >> > >> > On Thu, Jun 01, 2023 at 07:16:45AM +0200, Frank Oltmanns wrote: >> >> On 2023-05-31 at 15:48:43 +0200, Maxime Ripard <mripard@kernel.org> wrote: >> >> > [[PGP Signed Part:Undecided]] >> >> > Hi Frank, >> >> > >> >> > On Sat, May 27, 2023 at 03:27:44PM +0200, Frank Oltmanns wrote: >> >> >> I would like to bring your attention to the current process of setting >> >> >> the rate of an NKM clock. As it stands, when setting the rate of an >> >> >> NKM clock, the rate nearest but less than or equal to the requested >> >> >> rate is found, instead of the nearest rate. >> >> > >> >> > Yeah, it's actually pretty common, see clk_mux_determine_rate_flags() >> >> > for example. Some devices require that we don't overshoot, while some >> >> > prefer to have the closest rate. >> >> > >> >> > Both are fine, and it's a bit context specific which one we should >> >> > favour. If we were to do anything, it would be to support both and let >> >> > the clock driver select which behaviour it wants. >> >> > >> >> >> >> Ok, understood. Thank you for the explanation! Now, I'm wondering if >> >> anyone would be using such a flag, if I added it. >> > >> > I guess that's another thing :) If no-one is going to use it, why should >> > we do it in the first place? >> > >> > But most likely the display and audio clocks are usually fairly ok with >> > overshooting a bit, and a closest rate is usually better. >> >> Ok, I dived a bit deeper into this, but, as far as I can tell, the >> closest rate is not used anywhere in the sunxi-ng ccu driver. So, when >> extending, e.g., the NM or NKM clock to support, one must also extend at >> least the mux clocks, because they expect rates less than the requested >> rate. That seems to be quite the undertaking for only a small gain in >> precision. > > mux clocks are using __clk_mux_determine_rate which should have the > behaviour you want when CLK_MUX_ROUND_CLOSEST is set. https://elixir.bootlin.com/linux/v6.3.6/source/drivers/clk/sunxi-ng/ccu-sun50i-a64.c#L539 is one example of a mux clock combined with a divider that is a bit more complex. I didn't look too deeply, but it seemed to me, that it would require two separate flags: One for the mux component and one for the div component. Maybe I'm mistaken, but it seems to me that the concept of having selected rates always be equal to or less than requested rates, seems to be deeply ingrained in the sunxi-ng driver. I'm afraid that I might miss some parts, therefore I abandoned that idea for now (especially since I have only one board for testing). >> >> >> Moreover, ccu_nkm_find_best() is called multiple times (footnote [1]) >> >> >> when setting a rate, each time iterating over all combinations of n, >> >> >> k, and m. >> >> > >> >> > Yeah, that's expected as well. >> >> >> >> I'm wondering though, if iterating over all combinations is set in >> >> stone, or if some kind of optimization would be in order. >> > >> > The thing with optimization is that you need to optimize for something. >> > So you need to show that this code is suboptimal (by whatever metric you >> > want to optimize for), and that your code is more optimal that it used >> > to be. >> > >> > It means identifying a problem, writing benchmarks, and showing that the >> > new code performs better there. >> > >> > For example, your code might very well be faster, but it will increase >> > the kernel image (and thus the RAM usage). One is not more optimal than >> > the other in absolute, they both are, using a different metric. >> >> Sure, I get that. I'll submit a patchset that adds the functionality to >> NKM clocks to set the rate of their parents. >> >> With the new patchset, the time for, e.g. setting DCLK increases from >> ~0.5 ms to a whopping 30 - 37 ms. Those times were taken >> unscientifically on my pinephone, i.e. kernel logging and a couple of >> re-boots. But I think that still might give an idea of why I thought >> about the need to increase performance. >> >> The reason for this massive increase is, that the patch iterates over >> all combinations of NKM for pll-mipi, and for each combination it >> iterates over all combinations of NM for pll-video0. >> >> Nevertheless, following your and Jernej's advice, I'll submit the >> patchset first and then we can discuss if speed optimizations are needed >> and what cost is acceptable. > > Honestly, for 40ms, it will be a hard sell :) I'm not sure what part you think is the "hard-sell": a. the patch itself because 30 to 40 ms is way too much b. the optimization, because 30 to 40 ms isn't all that much. I honestly don't know. BTW, this is the patchset in case you missed it: https://lore.kernel.org/lkml/20230605190745.366882-1-frank@oltmanns.dev/ Note, that I have a patch in the works, which is similar to the one in this thread, but for ccu_nm. Doing a binary search for finding the parent rate of pll-mipi, i.e., pll-video0, reduces the time from ~30 ms to less than 2 ms. If combined with only iterating through meaningful nkm combinations for pll-mipi, this should bring the time under 1 ms again. > >> >> or that pll-mipi should try to set the *requested* rate instead of the >> >> previous rate when the pll-video0 changes. >> > >> > It's not clear to me what is the distinction you make here between the >> > requested rate and the previous rate? >> >> This is quite a de-tour from the original discussion, so I'm sorry for >> the confusion. >> >> By requested rate I mean the rate that the user (DCLK) requested. But >> this is not necessarily the rate that the clock is using in the end, >> because of its parent's rate. >> >> So, when the pll-video0 changes rate from 294 MHz to 297MHz (upon >> plugging in HDMI), pll-mipi does not know any longer what the requested >> rate (let's say 432MHz) was. > > It does, it's struct clk_core's req_rate. It doesn't look like it's > available to clk_hw users, but given the rest of your explanation, I > guess you have a compelling use case to make it available. Oh, thank you for making me aware of that! I'll surely look into it. Thanks, Frank > > Maxime > > [[End of PGP Signed Part]]
On Tue, Jun 06, 2023 at 10:40:34PM +0200, Frank Oltmanns wrote: > Hi Maxime, > > On 2023-06-06 at 16:02:58 +0200, Maxime Ripard <mripard@kernel.org> wrote: > > [[PGP Signed Part:Undecided]] > > On Mon, Jun 05, 2023 at 12:31:41PM +0200, Frank Oltmanns wrote: > >> Hi Maxime, > >> > >> On 2023-06-02 at 09:31:59 +0200, Maxime Ripard <mripard@kernel.org> wrote: > >> > [[PGP Signed Part:Undecided]] > >> > Hi, > >> > > >> > On Thu, Jun 01, 2023 at 07:16:45AM +0200, Frank Oltmanns wrote: > >> >> On 2023-05-31 at 15:48:43 +0200, Maxime Ripard <mripard@kernel.org> wrote: > >> >> > [[PGP Signed Part:Undecided]] > >> >> > Hi Frank, > >> >> > > >> >> > On Sat, May 27, 2023 at 03:27:44PM +0200, Frank Oltmanns wrote: > >> >> >> I would like to bring your attention to the current process of setting > >> >> >> the rate of an NKM clock. As it stands, when setting the rate of an > >> >> >> NKM clock, the rate nearest but less than or equal to the requested > >> >> >> rate is found, instead of the nearest rate. > >> >> > > >> >> > Yeah, it's actually pretty common, see clk_mux_determine_rate_flags() > >> >> > for example. Some devices require that we don't overshoot, while some > >> >> > prefer to have the closest rate. > >> >> > > >> >> > Both are fine, and it's a bit context specific which one we should > >> >> > favour. If we were to do anything, it would be to support both and let > >> >> > the clock driver select which behaviour it wants. > >> >> > > >> >> > >> >> Ok, understood. Thank you for the explanation! Now, I'm wondering if > >> >> anyone would be using such a flag, if I added it. > >> > > >> > I guess that's another thing :) If no-one is going to use it, why should > >> > we do it in the first place? > >> > > >> > But most likely the display and audio clocks are usually fairly ok with > >> > overshooting a bit, and a closest rate is usually better. > >> > >> Ok, I dived a bit deeper into this, but, as far as I can tell, the > >> closest rate is not used anywhere in the sunxi-ng ccu driver. So, when > >> extending, e.g., the NM or NKM clock to support, one must also extend at > >> least the mux clocks, because they expect rates less than the requested > >> rate. That seems to be quite the undertaking for only a small gain in > >> precision. > > > > mux clocks are using __clk_mux_determine_rate which should have the > > behaviour you want when CLK_MUX_ROUND_CLOSEST is set. > > https://elixir.bootlin.com/linux/v6.3.6/source/drivers/clk/sunxi-ng/ccu-sun50i-a64.c#L539 > is one example of a mux clock combined with a divider that is a bit more > complex. I didn't look too deeply, but it seemed to me, that it would > require two separate flags: One for the mux component and one for the > div component. Maybe I'm mistaken, but it seems to me that the concept > of having selected rates always be equal to or less than requested > rates, seems to be deeply ingrained in the sunxi-ng driver. I'm afraid > that I might miss some parts, therefore I abandoned that idea for now > (especially since I have only one board for testing). All the implementations (afaik) of the "composite" clocks we have that combine a mux and dividers eventually use ccu_mux_helper_determine_rate and divider_round_rate_parent. The last one can use CLK_DIVIDER_ROUND_CLOSEST, and you covered the first one in your patch. Which one did you leave out? > >> >> >> Moreover, ccu_nkm_find_best() is called multiple times (footnote [1]) > >> >> >> when setting a rate, each time iterating over all combinations of n, > >> >> >> k, and m. > >> >> > > >> >> > Yeah, that's expected as well. > >> >> > >> >> I'm wondering though, if iterating over all combinations is set in > >> >> stone, or if some kind of optimization would be in order. > >> > > >> > The thing with optimization is that you need to optimize for something. > >> > So you need to show that this code is suboptimal (by whatever metric you > >> > want to optimize for), and that your code is more optimal that it used > >> > to be. > >> > > >> > It means identifying a problem, writing benchmarks, and showing that the > >> > new code performs better there. > >> > > >> > For example, your code might very well be faster, but it will increase > >> > the kernel image (and thus the RAM usage). One is not more optimal than > >> > the other in absolute, they both are, using a different metric. > >> > >> Sure, I get that. I'll submit a patchset that adds the functionality to > >> NKM clocks to set the rate of their parents. > >> > >> With the new patchset, the time for, e.g. setting DCLK increases from > >> ~0.5 ms to a whopping 30 - 37 ms. Those times were taken > >> unscientifically on my pinephone, i.e. kernel logging and a couple of > >> re-boots. But I think that still might give an idea of why I thought > >> about the need to increase performance. > >> > >> The reason for this massive increase is, that the patch iterates over > >> all combinations of NKM for pll-mipi, and for each combination it > >> iterates over all combinations of NM for pll-video0. > >> > >> Nevertheless, following your and Jernej's advice, I'll submit the > >> patchset first and then we can discuss if speed optimizations are needed > >> and what cost is acceptable. > > > > Honestly, for 40ms, it will be a hard sell :) > > I'm not sure what part you think is the "hard-sell": > a. the patch itself because 30 to 40 ms is way too much > b. the optimization, because 30 to 40 ms isn't all that much. > I honestly don't know. > > BTW, this is the patchset in case you missed it: > https://lore.kernel.org/lkml/20230605190745.366882-1-frank@oltmanns.dev/ > > Note, that I have a patch in the works, which is similar to the one in > this thread, but for ccu_nm. Doing a binary search for finding the > parent rate of pll-mipi, i.e., pll-video0, reduces the time from ~30 ms > to less than 2 ms. If combined with only iterating through meaningful > nkm combinations for pll-mipi, this should bring the time under 1 ms > again. My point is that: * it's regression prone, so quite risky * but if the endgoal is to gain 40ms once, at boot, without any use case to back that up, there's basically no reward. a high risk, low reward patch is what's hard to sell. Maxime