[v2,11/11] interconnect: Add devfreq support

Message ID 20190614041733.120807-12-saravanak@google.com (mailing list archive)
State Not Applicable, archived
Series Introduce Bandwidth OPPs & interconnect devfreq driver

Commit Message

Saravana Kannan June 14, 2019, 4:17 a.m. UTC
Add icc_create_devfreq() and icc_remove_devfreq() to create and remove
devfreq devices for interconnect paths. A driver can create/remove devfreq
devices for the interconnects its device needs by calling these APIs.
This allows various devfreq governors to work with interconnect paths, so
the device driver itself doesn't have to actively manage the bandwidth
votes for the interconnects.
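
For example, a consumer driver could do something like the sketch below
(a hypothetical GPU driver; the "gpu-mem" path name and gpu_priv structure
are made up for illustration):

  #include <linux/interconnect.h>
  #include <linux/platform_device.h>
  #include <linux/slab.h>

  struct gpu_priv {
  	struct icc_devfreq *icc_df;
  };

  static int gpu_probe(struct platform_device *pdev)
  {
  	struct gpu_priv *priv;

  	priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
  	if (!priv)
  		return -ENOMEM;
  	platform_set_drvdata(pdev, priv);

  	/*
  	 * Create a devfreq device for the "gpu-mem" interconnect path.
  	 * A NULL third argument selects the "performance" governor, so
  	 * the driver itself never has to call icc_set_bw().
  	 */
  	priv->icc_df = icc_create_devfreq(&pdev->dev, "gpu-mem", NULL);
  	return PTR_ERR_OR_ZERO(priv->icc_df);
  }

  static int gpu_remove(struct platform_device *pdev)
  {
  	struct gpu_priv *priv = platform_get_drvdata(pdev);

  	icc_remove_devfreq(priv->icc_df);
  	return 0;
  }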

Signed-off-by: Saravana Kannan <saravanak@google.com>
---
 drivers/interconnect/Makefile      |   2 +-
 drivers/interconnect/icc-devfreq.c | 156 +++++++++++++++++++++++++++++
 include/linux/interconnect.h       |  14 +++
 3 files changed, 171 insertions(+), 1 deletion(-)
 create mode 100644 drivers/interconnect/icc-devfreq.c

Comments

Georgi Djakov June 17, 2019, 3:43 p.m. UTC | #1
Hi Saravana,

On 6/14/19 07:17, Saravana Kannan wrote:
> Add a icc_create_devfreq() and icc_remove_devfreq() to create and remove
> devfreq devices for interconnect paths. A driver can create/remove devfreq
> devices for the interconnects needed for its device by calling these APIs.
> This would allow various devfreq governors to work with interconnect paths
> and the device driver itself doesn't have to actively manage the bandwidth
> votes for the interconnects.

Thanks for the patches, but creating devfreq devices for each interconnect path
seems odd to me - at least for consumers that already use a governor. So for DDR
scaling for example, are you suggesting that we add a devfreq device from the
cpufreq driver in order to scale the interconnect between CPU<->DDR? Also, if
the GPU is already using devfreq, should we add a devfreq device for each
interconnect path? What would be the benefit in this case - using different
governors for bandwidth scaling, maybe?

Thanks,
Georgi
Saravana Kannan June 17, 2019, 9:18 p.m. UTC | #2
On Mon, Jun 17, 2019 at 8:44 AM Georgi Djakov <georgi.djakov@linaro.org> wrote:
>
> Hi Saravana,
>
> On 6/14/19 07:17, Saravana Kannan wrote:
> > Add a icc_create_devfreq() and icc_remove_devfreq() to create and remove
> > devfreq devices for interconnect paths. A driver can create/remove devfreq
> > devices for the interconnects needed for its device by calling these APIs.
> > This would allow various devfreq governors to work with interconnect paths
> > and the device driver itself doesn't have to actively manage the bandwidth
> > votes for the interconnects.
>
> Thanks for the patches, but creating devfreq devices for each interconnect path
> seems odd to me - at least for consumers that already use a governor.

Each governor instance always handles one "frequency" (more like
performance) domain at a time. So if a consumer is already using a
governor to scale its hardware block, then using another governor to
scale the interconnect performance points is the right way to go about
it. In fact, that's exactly what the devfreq passive governor's
documentation says it's meant for. It's also what cpufreq does for
each cluster/CPU frequency domain.
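
Concretely, with this series that would look roughly like the sketch
below (a hypothetical GPU driver; "gpu-mem" is an assumed
interconnect-names entry):

  /* The GPU's own devfreq, scaled by e.g. simple_ondemand. */
  gpu->df = devm_devfreq_add_device(dev, &gpu->profile,
  				    "simple_ondemand", NULL);

  /*
   * A second devfreq device for the GPU<->DDR path. Passing gpu->df
   * makes it use the passive governor and follow the GPU's frequency
   * decisions, while still exposing its own min_freq/max_freq knobs
   * to userspace.
   */
  gpu->icc_df = icc_create_devfreq(dev, "gpu-mem", gpu->df);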

> So for DDR
> scaling for example, are you suggesting that we add a devfreq device from the
> cpufreq driver in order to scale the interconnect between CPU<->DDR?

Yes, in general. CPUs are a special case, though, because they don't
go through devfreq, so the passive governor as it stands today won't work.
CPU<->DDR scaling might need a separate governor (unlikely) or some
changes to the passive governor that I'm happy to work on once we
settle this for general devices like GPUs, etc. But the DT format for
CPUs will be identical to that of GPUs or any other device.

> Also if the
> GPU is already using devfreq, should we add a devfreq per each interconnect
> path? What would be the benefit in this case - using different governors for
> bandwidth scaling maybe?

When saying "separate/different governors" in this email, I mean both
different instances of the same governor logic with different tunables
AND entirely different algorithms/governor logic.

The heuristics to use for each interconnect path might be (more like,
will be) different based on hardware characteristics (e.g., which voltage
domains the interconnect sits on) and what interconnect
information is available (e.g., just busy time vs. bandwidth counters vs. no
information at all) -- so having separate governors for each interconnect
path makes a lot of sense. It also allows userspace to control the
policy for scaling each of those paths based on product use cases.

For example, when the GPU is just doing simple UI rendering, userspace
can use the max_freq sysfs file of the devfreq device to disallow high
bandwidth OPPs on the GPU<->DDR path, but those higher OPPs could be
allowed by userspace when the GPU is used for games. Having a devfreq
device for each interconnect path also makes it easy to debug
performance issues -- you can independently change the votes for each
path to figure out what is causing the bottleneck, etc.

Adding a devfreq device for interconnect voting takes only a few lines
of code and gives all these features "for free".

This doesn't mean all users of the interconnect framework NEED to use
devfreq for interconnects. They might simply set their votes based on
use-case calculations (e.g., a display driver deriving bandwidth from
the display resolution). But if they are trying to use any kind of
algorithm/heuristics, writing it as a devfreq governor should be
encouraged.
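
For reference, the overall shape of such a governor (just a skeleton with
a made-up heuristic; real governors live in drivers/devfreq/ and use its
private governor.h):

  #include <linux/devfreq.h>
  #include "governor.h"	/* private to drivers/devfreq */

  static int icc_heuristic_get_target_freq(struct devfreq *df,
  					 unsigned long *freq)
  {
  	int err = devfreq_update_stats(df);

  	if (err)
  		return err;
  	/* Made-up policy: vote for double the last measured level. */
  	*freq = df->last_status.current_frequency * 2;
  	return 0;
  }

  static int icc_heuristic_event_handler(struct devfreq *df,
  				       unsigned int event, void *data)
  {
  	/* A real governor would start/stop its monitoring here. */
  	return 0;
  }

  static struct devfreq_governor icc_heuristic_governor = {
  	.name			= "icc_heuristic",
  	.get_target_freq	= icc_heuristic_get_target_freq,
  	.event_handler		= icc_heuristic_event_handler,
  };

It would then be registered once with devfreq_add_governor() at module
init.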

Also want to point out that BW OPPs also work for drivers that don't
use devfreq at all. The interconnect-opp-table just lists the
meaningful OPP levels for the path, and the device driver can pick one
entry from the table based on the use case.
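
Roughly like this (a sketch: dev_pm_opp_get_bw(),
dev_pm_opp_find_peak_bw_floor() and dev_pm_opp_add_opp_table() are the
helpers added earlier in this series; the "display-mem" name and display
numbers are made up):

  struct icc_path *path = of_icc_get(dev, "display-mem");
  struct opp_table *tbl = icc_get_opp_table(path);
  struct dev_pm_opp *opp;
  unsigned long peak_bw, avg_bw;

  /* Make the path's BW OPP table usable from this device. */
  dev_pm_opp_add_opp_table(dev, tbl);

  /* Bandwidth for a 1080p@60 scanout at 4 bytes/pixel. */
  peak_bw = kBps_to_icc(1920 * 1080 * 4 * 60 / 1024);

  /* Snap the request to a meaningful level from the BW OPP table. */
  opp = dev_pm_opp_find_peak_bw_floor(dev, &peak_bw);
  if (!IS_ERR(opp)) {
  	peak_bw = dev_pm_opp_get_bw(opp, &avg_bw);
  	dev_pm_opp_put(opp);
  	icc_set_bw(path, avg_bw, peak_bw);
  }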

Thanks,
Saravana
Sibi Sankar July 16, 2019, 6:12 p.m. UTC | #3
Hey Saravana,

On 6/18/19 2:48 AM, Saravana Kannan wrote:
> On Mon, Jun 17, 2019 at 8:44 AM Georgi Djakov <georgi.djakov@linaro.org> wrote:
>>
>> Hi Saravana,
>>
>> On 6/14/19 07:17, Saravana Kannan wrote:
>>> Add a icc_create_devfreq() and icc_remove_devfreq() to create and remove
>>> devfreq devices for interconnect paths. A driver can create/remove devfreq
>>> devices for the interconnects needed for its device by calling these APIs.
>>> This would allow various devfreq governors to work with interconnect paths
>>> and the device driver itself doesn't have to actively manage the bandwidth
>>> votes for the interconnects.
>>
>> Thanks for the patches, but creating devfreq devices for each interconnect path
>> seems odd to me - at least for consumers that already use a governor.
> 
> Each governor instance always handles one "frequency" (more like
> performance) domain at a time. So if a consumer is already using a
> governor to scale the hardware block, then using another governor to
> scale the interconnect performance points is the right way to go about
> it. In fact, that's exactly what devfreq passive governor's
> documentation even says it's meant for. That's also what cpufreq does
> for each cluster/CPU frequency domain too.
> 
>> So for DDR
>> scaling for example, are you suggesting that we add a devfreq device from the
>> cpufreq driver in order to scale the interconnect between CPU<->DDR?
> 
> Yes in general. Although, CPUs are a special case because CPUs don't
> go through devfreq. So passive governor as it stands today won't work.
> CPU<->DDR scaling might need a separate governor (unlikely) or some
> changes to the passive governor that I'm happy to work on once we
> settle this for general devices like GPU, etc. But the DT format for
> CPUs will be identical to GPUs or any other device.

Using icc_create_devfreq() from the cpufreq-hw driver on the SDM845 SoC
to scale CPU<->DDR would cause a circular dependency, i.e. with the
addition of a cpufreq notifier to the passive governor as in
https://patchwork.kernel.org/patch/11046147/, devm_devfreq_add_device()
would require the cpufreq transition notifier registration and
cpufreq_cpu_get() to go through. Please add your thoughts on addressing this.

> 
>> Also if the
>> GPU is already using devfreq, should we add a devfreq per each interconnect
>> path? What would be the benefit in this case - using different governors for
>> bandwidth scaling maybe?
> 
> When saying "separate/different governors" in this email, I mean both
> different instance of the same governor logic with different tunables
> AND actually different algorithms/governor logic entirely.
> 
> The heuristics to use for each interconnect path might be (more like,
> will be) different based on hardware characteristics (Eg: what voltage
> domains the interconnect is sitting on) and what interconnect
> information is available (Eg: Just busy time vs bandwidth count vs no
> information etc) -- so having separate governors for each interconnect
> path makes a lot of sense. It also allows userspace to control the
> policy for scaling each of those paths based on product use cases.
> 
> For example, when the GPU is just doing simple UI rendering, userspace
> can use the max_freq sysfs file for the devfreq device to disallow high
> bandwidth OPPs on the GPU<->DDR path, but those higher OPPs could be
> allowed by userspace when the GPU is used for games. Having devfreq
> device for each interconnect path also make it easy to debug
> performance issues -- you can independently change the votes for each
> path to figure out what is causing the bottleneck, etc.
> 
> Adding a devfreq device for interconnect voting with a few lines gives
> all these features "for free".
> 
> This doesn't mean all users of interconnect framework NEED to use
> devfreq for interconnect. They might do it simply based on
> calculations based on the use case (Eg: display driver from display
> resolution). But if they are trying to use any kind of
> algorithm/heuristics, writing it as a devfreq governor should be
> encouraged.
> 
> Also want to point out that BW OPPs also work for drivers that don't
> use devfreq at all. The interconnect-opp-table just lists the
> meaningful OPP leveld for the path and the device driver can pick one
> entry from the table based on the use case.
> 
> Thanks,
> Saravana
> 
> 
>
Saravana Kannan July 16, 2019, 7:17 p.m. UTC | #4
On Tue, Jul 16, 2019 at 11:13 AM Sibi Sankar <sibis@codeaurora.org> wrote:
>
> Hey Saravana,
>
> On 6/18/19 2:48 AM, Saravana Kannan wrote:
> > On Mon, Jun 17, 2019 at 8:44 AM Georgi Djakov <georgi.djakov@linaro.org> wrote:
> >>
> >> Hi Saravana,
> >>
> >> On 6/14/19 07:17, Saravana Kannan wrote:
> >>> Add a icc_create_devfreq() and icc_remove_devfreq() to create and remove
> >>> devfreq devices for interconnect paths. A driver can create/remove devfreq
> >>> devices for the interconnects needed for its device by calling these APIs.
> >>> This would allow various devfreq governors to work with interconnect paths
> >>> and the device driver itself doesn't have to actively manage the bandwidth
> >>> votes for the interconnects.
> >>
> >> Thanks for the patches, but creating devfreq devices for each interconnect path
> >> seems odd to me - at least for consumers that already use a governor.
> >
> > Each governor instance always handles one "frequency" (more like
> > performance) domain at a time. So if a consumer is already using a
> > governor to scale the hardware block, then using another governor to
> > scale the interconnect performance points is the right way to go about
> > it. In fact, that's exactly what devfreq passive governor's
> > documentation even says it's meant for. That's also what cpufreq does
> > for each cluster/CPU frequency domain too.
> >
> >> So for DDR
> >> scaling for example, are you suggesting that we add a devfreq device from the
> >> cpufreq driver in order to scale the interconnect between CPU<->DDR?
> >
> > Yes in general. Although, CPUs are a special case because CPUs don't
> > go through devfreq. So passive governor as it stands today won't work.
> > CPU<->DDR scaling might need a separate governor (unlikely) or some
> > changes to the passive governor that I'm happy to work on once we
> > settle this for general devices like GPU, etc. But the DT format for
> > CPUs will be identical to GPUs or any other device.
>
> using icc_create_devfreq from the cpufreq-hw driver on SDM845 SoC
> to scale CPU<->DDR would cause a circular dependency. (i.e) with
> the addition of cpufreq notifier to the passive governor as in
> https://patchwork.kernel.org/patch/11046147/ devm_devfreq_add_device
> would require the cpufreq transistion notifier register and cpu
> freq_cpu_get to go through. Please add your thought on addressing this.

This is an old series, so I'm not going to dive into this much.

But to answer your question, I wrote the cpufreq_map governor a long
time ago, so I'm not surprised that you are finding issues with it -- it
needs a rewrite anyway.

-Saravana

> >
> >> Also if the
> >> GPU is already using devfreq, should we add a devfreq per each interconnect
> >> path? What would be the benefit in this case - using different governors for
> >> bandwidth scaling maybe?
> >
> > When saying "separate/different governors" in this email, I mean both
> > different instance of the same governor logic with different tunables
> > AND actually different algorithms/governor logic entirely.
> >
> > The heuristics to use for each interconnect path might be (more like,
> > will be) different based on hardware characteristics (Eg: what voltage
> > domains the interconnect is sitting on) and what interconnect
> > information is available (Eg: Just busy time vs bandwidth count vs no
> > information etc) -- so having separate governors for each interconnect
> > path makes a lot of sense. It also allows userspace to control the
> > policy for scaling each of those paths based on product use cases.
> >
> > For example, when the GPU is just doing simple UI rendering, userspace
> > can use the max_freq sysfs file for the devfreq device to disallow high
> > bandwidth OPPs on the GPU<->DDR path, but those higher OPPs could be
> > allowed by userspace when the GPU is used for games. Having devfreq
> > device for each interconnect path also make it easy to debug
> > performance issues -- you can independently change the votes for each
> > path to figure out what is causing the bottleneck, etc.
> >
> > Adding a devfreq device for interconnect voting with a few lines gives
> > all these features "for free".
> >
> > This doesn't mean all users of interconnect framework NEED to use
> > devfreq for interconnect. They might do it simply based on
> > calculations based on the use case (Eg: display driver from display
> > resolution). But if they are trying to use any kind of
> > algorithm/heuristics, writing it as a devfreq governor should be
> > encouraged.
> >
> > Also want to point out that BW OPPs also work for drivers that don't
> > use devfreq at all. The interconnect-opp-table just lists the
> > meaningful OPP leveld for the path and the device driver can pick one
> > entry from the table based on the use case.
> >
> > Thanks,
> > Saravana
> >
> >
> >
>
> --
> Qualcomm Innovation Center, Inc.
> Qualcomm Innovation Center, Inc, is a member of Code Aurora Forum,
> a Linux Foundation Collaborative Project

Patch

diff --git a/drivers/interconnect/Makefile b/drivers/interconnect/Makefile
index 28f2ab0824d5..ddfb65b7fa55 100644
--- a/drivers/interconnect/Makefile
+++ b/drivers/interconnect/Makefile
@@ -1,6 +1,6 @@ 
 # SPDX-License-Identifier: GPL-2.0
 
-icc-core-objs				:= core.o
+icc-core-objs				:= core.o icc-devfreq.o
 
 obj-$(CONFIG_INTERCONNECT)		+= icc-core.o
 obj-$(CONFIG_INTERCONNECT_QCOM)		+= qcom/
diff --git a/drivers/interconnect/icc-devfreq.c b/drivers/interconnect/icc-devfreq.c
new file mode 100644
index 000000000000..608716fbe612
--- /dev/null
+++ b/drivers/interconnect/icc-devfreq.c
@@ -0,0 +1,156 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * A devfreq device driver for interconnect paths
+ *
+ * Copyright (C) 2019 Google, Inc
+ */
+
+#include <linux/devfreq.h>
+#include <linux/module.h>
+#include <linux/platform_device.h>
+#include <linux/pm_opp.h>
+#include <linux/slab.h>
+#include <linux/interconnect.h>
+#include <linux/limits.h>
+
+struct icc_devfreq {
+	struct device icc_dev;
+	struct icc_path *path;
+	struct devfreq_dev_profile dp;
+	struct devfreq *df;
+	unsigned long peak_bw;
+	unsigned long avg_bw;
+#if IS_ENABLED(CONFIG_DEVFREQ_GOV_PASSIVE)
+	struct devfreq_passive_data p_data;
+#endif
+};
+
+static int icc_target(struct device *dev, unsigned long *freq,
+				u32 flags)
+{
+	struct icc_devfreq *d = dev_get_drvdata(dev);
+	struct dev_pm_opp *opp;
+	unsigned long peak_bw, avg_bw;
+
+	opp = devfreq_recommended_opp(dev, freq, flags);
+	if (IS_ERR(opp)) {
+		dev_err(dev, "Failed to find OPP for peak bandwidth %lu\n", *freq);
+		return PTR_ERR(opp);
+	}
+	peak_bw = dev_pm_opp_get_bw(opp, &avg_bw);
+	dev_pm_opp_put(opp);
+
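+	/* Update the cached votes only if the interconnect request succeeds. */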
+	if (!icc_set_bw(d->path, avg_bw, peak_bw)) {
+		*freq = peak_bw;
+		d->peak_bw = peak_bw;
+		d->avg_bw = avg_bw;
+	}
+
+	return 0;
+}
+
+static int icc_get_dev_status(struct device *dev,
+					struct devfreq_dev_status *stat)
+{
+	struct icc_devfreq *d = dev_get_drvdata(dev);
+
+	stat->current_frequency = d->peak_bw;
+	return 0;
+}
+
+#define icc_dev_to_data(DEV)	container_of((DEV), struct icc_devfreq, icc_dev)
+static void icc_dev_release(struct device *dev)
+{
+	struct icc_devfreq *d = icc_dev_to_data(dev);
+
+	kfree(d);
+}
+
+struct icc_devfreq *icc_create_devfreq(struct device *dev, char *name,
+				       struct devfreq *devf)
+{
+	struct icc_devfreq *d;
+	struct icc_path *path;
+	struct opp_table *opp_table;
+	struct dev_pm_opp *opp;
+	void *p_data = NULL;
+	unsigned long peak_bw = U32_MAX, avg_bw = U32_MAX;
+	int err;
+
+	if (!name)
+		return ERR_PTR(-EINVAL);
+
+	path = of_icc_get(dev, name);
+	if (IS_ERR(path))
+		return ERR_CAST(path);
+
+	opp_table = icc_get_opp_table(path);
+	if (!opp_table) {
+		err = -EINVAL;
+		goto out_path;
+	}
+
+	d = kzalloc(sizeof(*d), GFP_KERNEL);
+	if (!d) {
+		err = -ENOMEM;
+		goto out_path;
+	}
+	d->path = path;
+
+	d->icc_dev.parent = dev;
+	d->icc_dev.release = icc_dev_release;
+	dev_set_name(&d->icc_dev, "%s-icc-%s", dev_name(dev), name);
+	err = device_register(&d->icc_dev);
+	if (err) {
+		put_device(&d->icc_dev);
+		goto out_path;
+	}
+
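+	/* Associate the path's BW OPP table with the new icc device. */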
+	dev_pm_opp_add_opp_table(&d->icc_dev, opp_table);
+	opp = dev_pm_opp_find_peak_bw_floor(&d->icc_dev, &peak_bw);
+	peak_bw = dev_pm_opp_get_bw(opp, &avg_bw);
+	dev_pm_opp_put(opp);
+
+	if (!icc_set_bw(d->path, avg_bw, peak_bw)) {
+		d->peak_bw = peak_bw;
+		d->avg_bw = avg_bw;
+	}
+
+#if IS_ENABLED(CONFIG_DEVFREQ_GOV_PASSIVE)
+	if (devf) {
+		d->p_data.parent = devf;
+		p_data = &d->p_data;
+	}
+#endif
+	d->dp.initial_freq = peak_bw;
+	d->dp.polling_ms = 0;
+	d->dp.target = icc_target;
+	d->dp.get_dev_status = icc_get_dev_status;
+	d->df = devm_devfreq_add_device(&d->icc_dev,
+					&d->dp,
+					p_data ? "passive" : "performance",
+					p_data);
+	if (IS_ERR(d->df)) {
+		err = PTR_ERR(d->df);
+		goto out_dev;
+	}
+
+	return d;
+out_dev:
+	put_device(&d->icc_dev);
+out_path:
+	icc_put(path);
+	return ERR_PTR(err);
+}
+EXPORT_SYMBOL_GPL(icc_create_devfreq);
+
+void icc_remove_devfreq(struct icc_devfreq *d)
+{
+	icc_put(d->path);
+	put_device(&d->icc_dev);
+}
+EXPORT_SYMBOL_GPL(icc_remove_devfreq);
+
+MODULE_LICENSE("GPL v2");
+MODULE_DESCRIPTION("interconnect devfreq device driver");
+MODULE_AUTHOR("Saravana Kannan <saravanak@google.com>");
diff --git a/include/linux/interconnect.h b/include/linux/interconnect.h
index 0c0bc55f0e89..ec5879a79301 100644
--- a/include/linux/interconnect.h
+++ b/include/linux/interconnect.h
@@ -10,6 +10,7 @@ 
 #include <linux/mutex.h>
 #include <linux/types.h>
 #include <linux/pm_opp.h>
+#include <linux/devfreq.h>
 
 /* macros for converting to icc units */
 #define Bps_to_icc(x)	((x) / 1000)
@@ -23,6 +24,7 @@ 
 
 struct icc_path;
 struct device;
+struct icc_devfreq;
 
 #if IS_ENABLED(CONFIG_INTERCONNECT)
 
@@ -32,6 +34,9 @@  struct icc_path *of_icc_get(struct device *dev, const char *name);
 struct opp_table *icc_get_opp_table(struct icc_path *path);
 void icc_put(struct icc_path *path);
 int icc_set_bw(struct icc_path *path, u32 avg_bw, u32 peak_bw);
+struct icc_devfreq *icc_create_devfreq(struct device *dev, char *name,
+				       struct devfreq *devf);
+void icc_remove_devfreq(struct icc_devfreq *d);
 
 #else
 
@@ -61,6 +66,15 @@  static inline int icc_set_bw(struct icc_path *path, u32 avg_bw, u32 peak_bw)
 	return 0;
 }
 
+static inline struct icc_devfreq *icc_create_devfreq(struct device *dev,
+						     char *name,
+						     struct devfreq *devf)
+{
+	return NULL;
+}
+
+static inline void icc_remove_devfreq(struct icc_devfreq *d) {}
+
 #endif /* CONFIG_INTERCONNECT */
 
 #endif /* __LINUX_INTERCONNECT_H */