diff mbox

[RFC] PM: Introduce generic DVFS framework with device-specific OPPs

Message ID 1303807401-16350-1-git-send-email-myungjoo.ham@samsung.com (mailing list archive)
State RFC, archived
Headers show

Commit Message

MyungJoo Ham April 26, 2011, 8:43 a.m. UTC
With OPPs, a device may have multiple operable frequency and voltage
sets. However, there can be multiple possible operable sets and a system
will need to choose one from them. In order to reduce the power
consumption (by reducing frequency and voltage) without affecting the
performance too much, a DVFS scheme may be used.

This patch introduces the DVFS capability to non-CPU devices with OPPs.

In this RTC version, we do not have specific gorvernors or device
support, yet. However, we plan to test and use DVFS scheme on
DRAM/BUS/GPU for our test boards (Exynos4-NURI), which have DVFS
capability and have seperated DVFS-like drivers without any common code.

The generic DVFS may appear quite similar with /drivers/cpufreq.
However, CPUFREQ does not allow to have multiple devices registered and
is not suitable to have multiple heterogenous devices with different
(but simple) governors.

Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
---
 drivers/base/power/Makefile |    1 +
 drivers/base/power/dvfs.c   |  401 +++++++++++++++++++++++++++++++++++++++++++
 drivers/base/power/opp.c    |    7 +
 include/linux/dvfs.h        |   69 ++++++++
 kernel/power/Kconfig        |    9 +
 5 files changed, 487 insertions(+), 0 deletions(-)
 create mode 100644 drivers/base/power/dvfs.c
 create mode 100644 include/linux/dvfs.h

Comments

Rafael Wysocki April 26, 2011, 11:09 a.m. UTC | #1
Hi,

To start with let me say I don't have any fundamental objections at this point,
although I'm not 100% sure I understand correctly how this feature is supposed
to be used.  My understanding is that if a device is known to have multiple
OPPs, its subsystem and driver may call dvfs_add_device() for it, which will
cause it to be periodically monitored and put into the "optimum" OPP using
the infrastructure introduced by your patch (on the basis of the given
usage profile and with the help of the given governor).  Is that correct?

That said I have some specific comments below.

On Tuesday, April 26, 2011, MyungJoo Ham wrote:
> With OPPs, a device may have multiple operable frequency and voltage
> sets. However, there can be multiple possible operable sets and a system
> will need to choose one from them. In order to reduce the power
> consumption (by reducing frequency and voltage) without affecting the
> performance too much, a DVFS scheme may be used.
> 
> This patch introduces the DVFS capability to non-CPU devices with OPPs.

Well, I'd like the DVFS acronym to be expanded at least once somewhere,
possibly in the comments describing the new files and necessarily in the
Kconfig option description.

> In this RTC version, we do not have specific gorvernors or device

RFC I guess?

> support, yet. However, we plan to test and use DVFS scheme on
> DRAM/BUS/GPU for our test boards (Exynos4-NURI), which have DVFS
> capability and have seperated DVFS-like drivers without any common code.
> 
> The generic DVFS may appear quite similar with /drivers/cpufreq.
> However, CPUFREQ does not allow to have multiple devices registered and
> is not suitable to have multiple heterogenous devices with different
> (but simple) governors.
> 
> Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com>
> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
> ---
>  drivers/base/power/Makefile |    1 +
>  drivers/base/power/dvfs.c   |  401 +++++++++++++++++++++++++++++++++++++++++++
>  drivers/base/power/opp.c    |    7 +
>  include/linux/dvfs.h        |   69 ++++++++
>  kernel/power/Kconfig        |    9 +
>  5 files changed, 487 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/base/power/dvfs.c
>  create mode 100644 include/linux/dvfs.h
> 
> diff --git a/drivers/base/power/Makefile b/drivers/base/power/Makefile
> index 118c1b9..4d92d7f 100644
> --- a/drivers/base/power/Makefile
> +++ b/drivers/base/power/Makefile
> @@ -3,6 +3,7 @@ obj-$(CONFIG_PM_SLEEP)	+= main.o wakeup.o
>  obj-$(CONFIG_PM_RUNTIME)	+= runtime.o
>  obj-$(CONFIG_PM_TRACE_RTC)	+= trace.o
>  obj-$(CONFIG_PM_OPP)	+= opp.o
> +obj-$(CONFIG_PM_GENERIC_DVFS)	+= dvfs.o
>  
>  ccflags-$(CONFIG_DEBUG_DRIVER) := -DDEBUG
>  ccflags-$(CONFIG_PM_VERBOSE)   += -DDEBUG
> diff --git a/drivers/base/power/dvfs.c b/drivers/base/power/dvfs.c
> new file mode 100644
> index 0000000..867ad70
> --- /dev/null
> +++ b/drivers/base/power/dvfs.c
> @@ -0,0 +1,401 @@
> +/*
> + * Generic DVFS Interface for Non-CPU Devices
> + *	Based on OPP.
> + *
> + * Copyright (C) 2011 Samsung Electronics
> + *	MyungJoo Ham <myungjoo.ham@samsung.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/errno.h>
> +#include <linux/err.h>
> +#include <linux/init.h>
> +#include <linux/slab.h>
> +#include <linux/opp.h>
> +#include <linux/dvfs.h>
> +#include <linux/workqueue.h>
> +#include <linux/platform_device.h>
> +#include <linux/list.h>
> +
> +#define DVFS_INTERVAL	20

You should state in a comment what's the unit of this and explain the choice
of the value.

> +/**
> + * struct dvfs - Device DVFS structure
> + * @node	list node - contains the devices with DVFS that have been
> + *		registered.
> + * @dev		device pointer
> + * @profile	device-specific dvfs profile
> + * @governor	method how to choose frequency based on the usage.
> + * @previous_freq	previously configured frequency value.
> + * @next_polling	the number of remaining "dvfs_monitor" executions to
> + *			reevaluate frequency/voltage of the device. Set by
> + *			profile's polling_ms interval.
> + * @tickle	positive if DVFS-tickling is activated for the device.
> + *		at each executino of dvfs_monitor, tickle is decremented.
> + *		User may tickle a device-dvfs in order to set maximum
> + *		frequency instaneously with some guaranteed duration.
> + *
> + * This structure stores the DVFS information for a give device.
> + */
> +struct dvfs {
> +	struct list_head node;
> +
> +	struct device *dev;
> +	struct dvfs_device_profile *profile;
> +	struct dvfs_governor *governor;
> +
> +	unsigned long previous_freq;
> +	unsigned int next_polling;
> +	unsigned int tickle;
> +};
> +

I'd move the structure definition to the header file you introduce below.
Even though it's supposed to be internal, it's better to have it defined
next to the other structures for the benefit of the people who will
try to understand your code in the future.

> +/*
> + * dvfs_work periodically (given by DVFS_INTERVAL) monitors every
> + * registered device.
> + */
> +static struct delayed_work dvfs_work;
> +/* The list of all device-dvfs */
> +static LIST_HEAD(dvfs_list);
> +/* Exclusive access to dvfs_list and its elements */
> +static DEFINE_MUTEX(dvfs_list_lock);
> +
> +/**
> + * find_device_dvfs() - find dvfs struct using device pointer
> + * @dev:	device pointer used to lookup device DVFS.
> + *
> + * Search the list of device DVFSs and return the matched device's DVFS info.
> + */
> +static struct dvfs *find_device_dvfs(struct device *dev)
> +{
> +	struct dvfs *tmp_dvfs, *dvfs = ERR_PTR(-ENODEV);
> +
> +	if (unlikely(IS_ERR_OR_NULL(dev))) {
> +		pr_err("%s: Invalid parameters\n", __func__);
> +		return ERR_PTR(-EINVAL);
> +	}
> +
> +	list_for_each_entry(tmp_dvfs, &dvfs_list, node) {
> +		if (tmp_dvfs->dev == dev) {
> +			dvfs = tmp_dvfs;
> +			break;
> +		}
> +	}
> +
> +	return dvfs;
> +}
> +
> +/**
> + * dvfs_next_polling() - the number of dvfs-monitor iterations to satisfy
> + *		the given polling interval. The interval is rounded by
> + *		dvfs-monitor poling interval (DVFS_INTERVAL) with ceiling
> + *		function.
> + * @ms:		the length of polling interval in milli-second
> + */
> +static unsigned int dvfs_next_polling(int ms)
> +{
> +	unsigned int ret;
> +
> +	if (ms <= 0)
> +		return 0;
> +
> +	ret = ms / DVFS_INTERVAL;
> +	if (ms % DVFS_INTERVAL)
> +		ret++;
> +
> +	return ret;
> +}

It seems you could use DIV_ROUND_UP instead of this whole function.

> +
> +/**
> + * dvfs_do() - Check the usage profile of a given device and configure
> + *		frequency and voltage accordingly
> + * @dvfs:	DVFS info of the given device
> + */
> +static int dvfs_do(struct dvfs *dvfs)
> +{
> +	struct dvfs_usage_profile profile;
> +	struct opp *opp;
> +	unsigned long freq;
> +	int err;
> +
> +	err = dvfs->profile->get_usage_profile(dvfs->dev, &profile);
> +	if (err) {
> +		dev_err(dvfs->dev, "%s: Cannot get usage profile (%d)\n",
> +			__func__, err);
> +		return err;
> +	}
> +
> +	err = dvfs->governor->get_target_freq(&profile, dvfs->governor, &freq);

It may be better to define ->get_target_freq() so that it takes devfs as an arg.
In that case you'd need to pass fewer pointers and the governor would be
able to access all of its data anyway.

> +	if (err) {
> +		dev_err(dvfs->dev, "%s: Cannot get target frequency (%d)\n",
> +			__func__, err);
> +		return err;
> +	}
> +
> +	opp = opp_find_freq_ceil(dvfs->dev, &freq);
> +	if (opp == ERR_PTR(-ENODEV))
> +		opp = opp_find_freq_floor(dvfs->dev, &freq);
> +
> +	if (IS_ERR(opp)) {
> +		dev_err(dvfs->dev, "%s: Cannot find opp with %luHz.\n",
> +				__func__, freq);
> +		return PTR_ERR(opp);
> +	}
> +
> +	freq = opp_get_freq(opp);
> +	if (dvfs->previous_freq != freq) {
> +		err = dvfs->profile->target(dvfs->dev, freq,
> +					    opp_get_voltage(opp));
> +		dvfs->previous_freq = freq;
> +	}
> +
> +	if (err)
> +		dev_err(dvfs->dev, "%s: Cannot set %luHz/%luuV\n", __func__,
> +			opp_get_freq(opp), opp_get_voltage(opp));
> +	return err;
> +}

I'd generally consider using fewer error messages.  While they may be good
for debugging, they have a potential of spamming the kernel's message buffer
in case of a problem on a production system.  Also they don't really make
the code cleaner, so to speak. :-)

> +
> +/**
> + * dvfs_monitor() - Regularly run dvfs_do() and support device DVFS tickle.
> + * @work: the work struct used to run dvfs_monitor periodically.
> + */
> +static void dvfs_monitor(struct work_struct *work)
> +{
> +	struct dvfs *dvfs;
> +
> +	mutex_lock(&dvfs_list_lock);
> +
> +	list_for_each_entry(dvfs, &dvfs_list, node) {
> +		if (dvfs->next_polling == 0)
> +			continue;
> +		if (dvfs->tickle) {
> +			dvfs->tickle--;
> +			continue;
> +		}
> +		if (dvfs->next_polling == 1) {
> +			dvfs_do(dvfs);
> +			dvfs->next_polling = dvfs_next_polling(
> +						dvfs->profile->polling_ms);
> +		} else {
> +			dvfs->next_polling--;
> +		}
> +	}
> +
> +	schedule_delayed_work(&dvfs_work, msecs_to_jiffies(DVFS_INTERVAL));
> +	mutex_unlock(&dvfs_list_lock);
> +}
> +
> +/**
> + * dvfs_add_device() - Add dvfs feature to the device
> + * @dev:	the device to add dvfs feature.
> + * @profile:	device-specific profile to run dvfs.
> + * @governor:	the policy to choose frequency.
> + */
> +int dvfs_add_device(struct device *dev, struct dvfs_device_profile *profile,
> +		    struct dvfs_governor *governor)
> +{
> +	struct dvfs *new_dvfs, *dvfs;
> +	struct cpufreq_frequency_table *table;
> +	int err;
> +
> +	if (!dev || !profile || !governor) {
> +		dev_err(dev, "%s: Invalid parameters.\n", __func__);
> +		return -EINVAL;
> +	}
> +
> +	err = opp_init_cpufreq_table(dev, &table);

I'd change the name opp_init_cpufreq_table() to something more suitable,
since it's no longer used for the CPU frequencies alone.  Something like
opp_init_frequency_table() might work.

> +	if (err) {
> +		dev_err(dev, "%s: Device OPP cannot provide DVFS table (%d)\n",
> +			__func__, err);
> +		return err;
> +	}
> +
> +	new_dvfs = kzalloc(sizeof(struct dvfs), GFP_KERNEL);
> +	if (!new_dvfs) {
> +		dev_err(dev, "%s: Unable to create DVFS for the device\n",
> +			__func__);
> +		return -ENOMEM;
> +	}
> +
> +	mutex_lock(&dvfs_list_lock);
> +
> +	dvfs = find_device_dvfs(dev);
> +	if (!IS_ERR(dvfs)) {
> +		dev_err(dev, "%s: Unable to create DVFS for the device. "
> +			"It already has one.\n", __func__);
> +		mutex_unlock(&dvfs_list_lock);
> +		kfree(new_dvfs);
> +		return -EINVAL;
> +	}

It seems that this should be checked before allocating the new object
(you can use kzalloc() under the mutex).  I'd even run it before initializing
the device's frequency table.

> +
> +	new_dvfs->dev = dev;
> +	new_dvfs->profile = profile;
> +	new_dvfs->governor = governor;
> +	new_dvfs->next_polling = dvfs_next_polling(profile->polling_ms);
> +	new_dvfs->previous_freq = 0;
> +
> +	list_add(&new_dvfs->node, &dvfs_list);
> +
> +	mutex_unlock(&dvfs_list_lock);
> +
> +	return 0;
> +}
> +
> +/**
> + * dvfs_remove_device() - Remove DVFS feature from a device.
> + * @device:	the device to remove dvfs feature.
> + */
> +int dvfs_remove_device(struct device *dev)
> +{
> +	struct dvfs *dvfs;
> +
> +	if (!dev)
> +		return -EINVAL;
> +
> +	mutex_lock(&dvfs_list_lock);
> +	dvfs = find_device_dvfs(dev);
> +	if (IS_ERR(dvfs)) {
> +		dev_err(dev, "%s: Unable to find DVFS entry for the device.\n",
> +			__func__);
> +		mutex_unlock(&dvfs_list_lock);
> +		return -EINVAL;
> +	}
> +
> +	list_del(&dvfs->node);
> +
> +	kfree(dvfs);
> +
> +	mutex_unlock(&dvfs_list_lock);
> +
> +	return 0;
> +}
> +
> +/**
> + * dvfs_update() - Notify that the device OPP has been changed.
> + * @dev:	the device whose OPP has been changed.
> + * @may_not_exist:	do not print error message even if the device
> + *			does not have dvfs entry.
> + */
> +int dvfs_update(struct device *dev, bool may_not_exist)
> +{
> +	struct dvfs *dvfs;
> +	int err = 0;
> +
> +	mutex_lock(&dvfs_list_lock);
> +	dvfs = find_device_dvfs(dev);
> +	if (!IS_ERR(dvfs)) {
> +		if (dvfs->tickle) {
> +			unsigned long freq = dvfs->profile->max_freq;
> +			struct opp *opp = opp_find_freq_floor(dvfs->dev, &freq);
> +
> +			freq = opp_get_freq(opp);
> +			if (dvfs->previous_freq != freq) {
> +				err = dvfs->profile->target(dvfs->dev, freq,
> +						opp_get_voltage(opp));
> +				dvfs->previous_freq = freq;
> +			}
> +		} else {
> +			err = dvfs_do(dvfs);
> +		}
> +	}
> +	mutex_unlock(&dvfs_list_lock);
> +
> +	if (IS_ERR(dvfs)) {
> +		if (may_not_exist && PTR_ERR(dvfs) == -EINVAL)
> +			return 0;
> +
> +		dev_err(dev, "%s: Device DVFS not found (%ld)\n", __func__,
> +			PTR_ERR(dvfs));
> +		return PTR_ERR(dvfs);
> +	}
> +
> +	return err;
> +}
> +
> +/**
> + * dvfs_tickle_device() - Guarantee maximum operation speed for a while
> + *			instaneously.
> + * @dev:	the device to be tickled.
> + * @duration_ms:	the duration of tickle effect.
> + *
> + * Tickle sets the device at the maximum frequency instaneously and
> + * the maximul frequency is guaranteed to be used for the given duration.
          ^^^^^^^
I guess that should be "maximum".

> + * For faster user reponse time, an input event may tickle a related device
> + * so that the input event does not need to wait for the DVFS to react with
> + * normal interval.
> + */

Can you explain how this function is supposed to be used, please?

> +int dvfs_tickle_device(struct device *dev, unsigned long duration_ms)
> +{
> +	struct dvfs *dvfs;
> +	struct opp *opp;
> +	unsigned long freq;
> +	int err = 0;
> +
> +	mutex_lock(&dvfs_list_lock);
> +	dvfs = find_device_dvfs(dev);
> +	if (!IS_ERR(dvfs)) {
> +		freq = dvfs->profile->max_freq;
> +		opp = opp_find_freq_floor(dvfs->dev, &freq);
> +		freq = opp_get_freq(opp);
> +		if (dvfs->previous_freq != freq) {
> +			err = dvfs->profile->target(dvfs->dev, freq,
> +						    opp_get_voltage(opp));
> +			dvfs->previous_freq = freq;
> +		}
> +		if (err)
> +			dev_err(dev, "%s: Cannot set frequency.\n", __func__);
> +		else
> +			dvfs->tickle = dvfs_next_polling(duration_ms);
> +	}
> +	mutex_unlock(&dvfs_list_lock);
> +
> +	if (IS_ERR(dvfs)) {
> +		dev_err(dev, "%s: Cannot find dvfs.\n", __func__);
> +		err = PTR_ERR(dvfs);
> +	}
> +
> +	return err;
> +}
> +
> +#ifdef CONFIG_PM_SLEEP
> +static int dvfs_pause(struct device *dev)
> +{
> +	cancel_delayed_work_sync(&dvfs_work);
> +	return 0;
> +}
> +
> +static void dvfs_continue(struct device *dev)
> +{
> +	schedule_delayed_work(&dvfs_work, msecs_to_jiffies(DVFS_INTERVAL));
> +}
> +#else
> +#define dvfs_pause	NULL
> +#define dvfs_continue	NULL
> +#endif
> +static const struct dev_pm_ops dvfs_pm = {
> +	.prepare	= dvfs_pause,
> +	.complete	= dvfs_continue,
> +};
> +static struct platform_driver dvfs_pm_drv = {
> +	.driver = {
> +		.name	= "generic_dvfs_pm",
> +		.pm	= &dvfs_pm,
> +	},
> +};
> +static struct platform_device dvfs_pm_dev = {
> +	.name = "generic_dvfs_pm",
> +};

Please don't do that.  You can simply use the freezable workqueue introduced
by commit 4149efb (workqueue: add system_freezeable_wq) and then you won't
have to worry about system suspend.

> +
> +static int __init dvfs_init(void)
> +{
> +	platform_driver_register(&dvfs_pm_drv);
> +	platform_device_register(&dvfs_pm_dev);
> +
> +	INIT_DELAYED_WORK_DEFERRABLE(&dvfs_work, dvfs_monitor);
> +	schedule_delayed_work(&dvfs_work, msecs_to_jiffies(DVFS_INTERVAL));
> +
> +	return 0;
> +}
> +late_initcall(dvfs_init);
> diff --git a/drivers/base/power/opp.c b/drivers/base/power/opp.c
> index 56a6899..df74655 100644
> --- a/drivers/base/power/opp.c
> +++ b/drivers/base/power/opp.c
> @@ -21,6 +21,7 @@
>  #include <linux/rculist.h>
>  #include <linux/rcupdate.h>
>  #include <linux/opp.h>
> +#include <linux/dvfs.h>
>  
>  /*
>   * Internal data structure organization with the OPP layer library is as
> @@ -428,6 +429,9 @@ int opp_add(struct device *dev, unsigned long freq, unsigned long u_volt)
>  	list_add_rcu(&new_opp->node, head);
>  	mutex_unlock(&dev_opp_list_lock);
>  
> +	/* Notify generic dvfs for the change */
> +	dvfs_update(dev, true);
> +
>  	return 0;
>  }
>  
> @@ -512,6 +516,9 @@ unlock:
>  	mutex_unlock(&dev_opp_list_lock);
>  out:
>  	kfree(new_opp);
> +
> +	/* Notify generic dvfs for the change */
> +	dvfs_update(dev, true);
>  	return r;
>  }
>  
> diff --git a/include/linux/dvfs.h b/include/linux/dvfs.h
> new file mode 100644
> index 0000000..de3e53d
> --- /dev/null
> +++ b/include/linux/dvfs.h
> @@ -0,0 +1,69 @@
> +/*
> + * Generic DVFS Interface for Non-CPU Devices
> + *
> + * Copyright (C) 2011 Samsung Electronics
> + *	MyungJoo Ham <myungjoo.ham@samsung.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef __LINUX_DVFS_H__
> +#define __LINUX_DVFS_H__
> +
> +struct dvfs;
> +
> +struct dvfs_usage_profile {
> +	/* both since the last measure */
> +	unsigned long total_time;
> +	unsigned long busy_time;
> +};
> +
> +struct dvfs_device_profile {
> +	unsigned long max_freq; /* may be larger than the actual value */
> +	int (*target)(struct device *dev, unsigned long freq,
> +		      unsigned long u_volt);
> +	int (*get_usage_profile)(struct device *dev,
> +				 struct dvfs_usage_profile *profile);
> +	int polling_ms;	/* 0 for at opp change only */
> +};
> +
> +struct dvfs_governor {
> +	void *data; /* private data for get_target_freq */
> +	int (*get_target_freq)(struct dvfs_usage_profile *profile,
> +			       struct dvfs_governor *this, unsigned long *freq);
> +};
> +
> +#if defined(CONFIG_PM_GENERIC_DVFS)
> +extern int dvfs_add_device(struct device *ev,
> +			   struct dvfs_device_profile *profile,
> +			   struct dvfs_governor *governor);
> +extern int dvfs_remove_device(struct device *dev);
> +extern int dvfs_update(struct device *dev, bool may_not_exist);
> +extern int dvfs_tickle_device(struct device *dev, unsigned long duration_ms);
> +#else /* !CONFIG_PM_GENERIC_DVFS */
> +static int dvfs_add_device(struct device *ev,
> +			   struct dvfs_device_profile *profile,
> +			   struct dvfs_governor *governor)
> +{
> +	return 0;
> +}
> +
> +static int dvfs_remove_device(struct device *dev)
> +{
> +	return 0;
> +}
> +
> +static int dvfs_update(struct device *dev, bool may_not_exist)
> +{
> +	return 0;
> +}
> +
> +static int dvfs_tickle_device(struct device *dev, unsigned long duration_ms)
> +{
> +	return 0;
> +}
> +#endif /* CONFIG_PM_GENERIC_DVFS */
> +
> +#endif /* __LINUX_DVFS_H__ */
> diff --git a/kernel/power/Kconfig b/kernel/power/Kconfig
> index 4603f08..7d191ec 100644
> --- a/kernel/power/Kconfig
> +++ b/kernel/power/Kconfig
> @@ -225,3 +225,12 @@ config PM_OPP
>  	  representing individual voltage domains and provides SOC
>  	  implementations a ready to use framework to manage OPPs.
>  	  For more information, read <file:Documentation/power/opp.txt>
> +
> +config PM_GENERIC_DVFS
> +	bool "Generic DVFS framework"
> +	depends on PM_OPP
> +	help
> +	  With OPP support, a device may have a list of frequencies and
> +	  voltages available. Generic DVFS framework provides a method to
> +	  control frequency/voltage of a device with OPP with given profile
> +	  and governor per device.
> 

Thanks,
Rafael
Greg Kroah-Hartman April 26, 2011, 1:22 p.m. UTC | #2
On Tue, Apr 26, 2011 at 05:43:21PM +0900, MyungJoo Ham wrote:
> With OPPs, a device may have multiple operable frequency and voltage
> sets. However, there can be multiple possible operable sets and a system
> will need to choose one from them. In order to reduce the power
> consumption (by reducing frequency and voltage) without affecting the
> performance too much, a DVFS scheme may be used.

What is "DVFS"?  A new filesystem?

confused,

greg k-h
JieJing.Zhang April 26, 2011, 2:45 p.m. UTC | #3
Hi Greg,

I think DVFS here means “Dynamic voltage and frequency scaling”,  refering
the first google's result:

a dynamic voltage and frequency scaling (DVFS) *technique that minimizes the
total system energy
consumption for performing a task while satisfying a given execution time
constraint. We first show that in order to
guarantee minimum energy for task execution by using DVFS it is essential to
divide the system power into active and
standby power components. Next, we present a new DVFS technique, which
considers not only the active power, but also
the standby component of the system power.

It's widely used in embeded Soc.

Best regards,
Zhang Jiejing


2011/4/26 Greg KH <gregkh@suse.de>

> On Tue, Apr 26, 2011 at 05:43:21PM +0900, MyungJoo Ham wrote:
> > With OPPs, a device may have multiple operable frequency and voltage
> > sets. However, there can be multiple possible operable sets and a system
> > will need to choose one from them. In order to reduce the power
> > consumption (by reducing frequency and voltage) without affecting the
> > performance too much, a DVFS scheme may be used.
>
> What is "DVFS"?  A new filesystem?
>
> confused,
>
> greg k-h
> _______________________________________________
> linux-pm mailing list
> linux-pm@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/linux-pm
>
Mark Brown April 26, 2011, 3:07 p.m. UTC | #4
On Tue, Apr 26, 2011 at 05:43:21PM +0900, MyungJoo Ham wrote:

> +static int __init dvfs_init(void)
> +{
> +	platform_driver_register(&dvfs_pm_drv);
> +	platform_device_register(&dvfs_pm_dev);
> +
> +	INIT_DELAYED_WORK_DEFERRABLE(&dvfs_work, dvfs_monitor);
> +	schedule_delayed_work(&dvfs_work, msecs_to_jiffies(DVFS_INTERVAL));

This will unconditionally start the timer which polls the devices at
whatever rate.  I'd expect to start and stop the poll depending on if we
have devices and/or governor type code (which I don't see anything for
here, though it's hard to know what criteria a generic governor would
use) which require it.
Pavel Machek April 26, 2011, 7:45 p.m. UTC | #5
On Tue 2011-04-26 06:22:24, Greg KH wrote:
> On Tue, Apr 26, 2011 at 05:43:21PM +0900, MyungJoo Ham wrote:
> > With OPPs, a device may have multiple operable frequency and voltage
> > sets. However, there can be multiple possible operable sets and a system
> > will need to choose one from them. In order to reduce the power
> > consumption (by reducing frequency and voltage) without affecting the
> > performance too much, a DVFS scheme may be used.
> 
> What is "DVFS"?  A new filesystem?

Dynamic voltage & frequency scaling ... very probably. But yes, the
naming sucks.
									Pavel
Greg Kroah-Hartman April 26, 2011, 8:54 p.m. UTC | #6
On Tue, Apr 26, 2011 at 10:45:18PM +0800, Jiejing.Zhang  wrote:
> Hi Greg,
> 
> I think DVFS here means “Dynamic voltage and frequency scaling”,  refering the
> first google's result:

Ok, if so, please spell it out.

Also, don't use 'dvfs' as a prefix in the kernel, too many other people
will think it is a filesystem.  How about "dynvolt"?

Or anything else, just no 'fs' in the name otherwise confusion will
reign.

> a dynamic voltage and frequency scaling (DVFS) *technique that minimizes the
> total system energy consumption for performing a task while satisfying
> a given execution time constraint. We first show that in order to
> guarantee minimum energy for task execution by using DVFS it is
> essential to divide the system power into active and standby power
> components. Next, we present a new DVFS technique, which considers not
> only the active power, but also the standby component of the system
> power.

Please put this in the changelog entry, you need to explain it to those
of us who have never seen it.

thanks,

greg k-h
MyungJoo Ham April 27, 2011, 5:22 a.m. UTC | #7
Hello,

2011/4/26 Rafael J. Wysocki <rjw@sisk.pl>:
> Hi,
>
> To start with let me say I don't have any fundamental objections at this point,
> although I'm not 100% sure I understand correctly how this feature is supposed
> to be used.  My understanding is that if a device is known to have multiple
> OPPs, its subsystem and driver may call dvfs_add_device() for it, which will
> cause it to be periodically monitored and put into the "optimum" OPP using
> the infrastructure introduced by your patch (on the basis of the given
> usage profile and with the help of the given governor).  Is that correct?

Yes, that is correct.

>
> That said I have some specific comments below.
>
> On Tuesday, April 26, 2011, MyungJoo Ham wrote:
>> With OPPs, a device may have multiple operable frequency and voltage
>> sets. However, there can be multiple possible operable sets and a system
>> will need to choose one from them. In order to reduce the power
>> consumption (by reducing frequency and voltage) without affecting the
>> performance too much, a DVFS scheme may be used.
>>
>> This patch introduces the DVFS capability to non-CPU devices with OPPs.
>
> Well, I'd like the DVFS acronym to be expanded at least once somewhere,
> possibly in the comments describing the new files and necessarily in the
> Kconfig option description.

Sure, I will do so.

>
>> In this RTC version, we do not have specific gorvernors or device
>
> RFC I guess?

Uh oh.. yes, it should've been typed "RFC".

>
[]
>> +
>> +#define DVFS_INTERVAL        20
>
> You should state in a comment what's the unit of this and explain the choice
> of the value.

Ok. I'll add a comment noting that it's in msec.

>
>> +/**
>> + * struct dvfs - Device DVFS structure
>> + * @node     list node - contains the devices with DVFS that have been
>> + *           registered.
>> + * @dev              device pointer
>> + * @profile  device-specific dvfs profile
>> + * @governor method how to choose frequency based on the usage.
>> + * @previous_freq    previously configured frequency value.
>> + * @next_polling     the number of remaining "dvfs_monitor" executions to
>> + *                   reevaluate frequency/voltage of the device. Set by
>> + *                   profile's polling_ms interval.
>> + * @tickle   positive if DVFS-tickling is activated for the device.
>> + *           at each executino of dvfs_monitor, tickle is decremented.
>> + *           User may tickle a device-dvfs in order to set maximum
>> + *           frequency instaneously with some guaranteed duration.
>> + *
>> + * This structure stores the DVFS information for a give device.
>> + */
>> +struct dvfs {
>> +     struct list_head node;
>> +
>> +     struct device *dev;
>> +     struct dvfs_device_profile *profile;
>> +     struct dvfs_governor *governor;
>> +
>> +     unsigned long previous_freq;
>> +     unsigned int next_polling;
>> +     unsigned int tickle;
>> +};
>> +
>
> I'd move the structure definition to the header file you introduce below.
> Even though it's supposed to be internal, it's better to have it defined
> next to the other structures for the benefit of the people who will
> try to understand your code in the future.

Ok. I'll do that.

>
[]
>> +static unsigned int dvfs_next_polling(int ms)
>> +{
>> +     unsigned int ret;
>> +
>> +     if (ms <= 0)
>> +             return 0;
>> +
>> +     ret = ms / DVFS_INTERVAL;
>> +     if (ms % DVFS_INTERVAL)
>> +             ret++;
>> +
>> +     return ret;
>> +}
>
> It seems you could use DIV_ROUND_UP instead of this whole function.

Thanks.

>
>> +
>> +/**
>> + * dvfs_do() - Check the usage profile of a given device and configure
>> + *           frequency and voltage accordingly
>> + * @dvfs:    DVFS info of the given device
>> + */
>> +static int dvfs_do(struct dvfs *dvfs)
>> +{
>> +     struct dvfs_usage_profile profile;
>> +     struct opp *opp;
>> +     unsigned long freq;
>> +     int err;
>> +
>> +     err = dvfs->profile->get_usage_profile(dvfs->dev, &profile);
>> +     if (err) {
>> +             dev_err(dvfs->dev, "%s: Cannot get usage profile (%d)\n",
>> +                     __func__, err);
>> +             return err;
>> +     }
>> +
>> +     err = dvfs->governor->get_target_freq(&profile, dvfs->governor, &freq);
>
> It may be better to define ->get_target_freq() so that it takes devfs as an arg.
> In that case you'd need to pass fewer pointers and the governor would be
> able to access all of its data anyway.

Oh Yes, I need to revise governor design so that the governor can take
more information about the device and dvfs options. I've been testing
with some test-case governors and it made some issues. :)

>
[]
>
> I'd generally consider using fewer error messages.  While they may be good
> for debugging, they have a potential of spamming the kernel's message buffer
> in case of a problem on a production system.  Also they don't really make
> the code cleaner, so to speak. :-)

Ok, I'll keep that in mind in the next revision.

>
[]
>> +     err = opp_init_cpufreq_table(dev, &table);
>
> I'd change the name opp_init_cpufreq_table() to something more suitable,
> since it's no longer used for the CPU frequencies alone.  Something like
> opp_init_frequency_table() might work.

Got it. Anyway, I'm going to remove the dependency on
cpufreq_frequency_table or anything similar as well.

>
[]
>> +
>> +     dvfs = find_device_dvfs(dev);
>> +     if (!IS_ERR(dvfs)) {
>> +             dev_err(dev, "%s: Unable to create DVFS for the device. "
>> +                     "It already has one.\n", __func__);
>> +             mutex_unlock(&dvfs_list_lock);
>> +             kfree(new_dvfs);
>> +             return -EINVAL;
>> +     }
>
> It seems that this should be checked before allocating the new object
> (you can use kzalloc() under the mutex).  I'd even run it before initializing
> the device's frequency table.

Fine. That seems more economic.

>
[]
>> +
>> +/**
>> + * dvfs_tickle_device() - Guarantee maximum operation speed for a while
>> + *                   instaneously.
>> + * @dev:     the device to be tickled.
>> + * @duration_ms:     the duration of tickle effect.
>> + *
>> + * Tickle sets the device at the maximum frequency instaneously and
>> + * the maximul frequency is guaranteed to be used for the given duration.
>          ^^^^^^^
> I guess that should be "maximum".

Yes. :)

>
>> + * For faster user reponse time, an input event may tickle a related device
>> + * so that the input event does not need to wait for the DVFS to react with
>> + * normal interval.
>> + */
>
> Can you explain how this function is supposed to be used, please?

Ok, I'll show you a usage example

The system: a touchscreen based computer with DVFS'able GPU.
Use case: a user scrolls the screen with fingers suddenly. (think of
an iPhone or Android device's home screen filled wit icons)
While there is no touchscreen input (thus, the screen does not move
fast), we do not use GPU too much; thus it is scaled to run at slower
rate (let's say 100MHz).
If a user suddenly touches screen and starts moving, the screen should
follow it; thus, requires much output from GPU.
With normal DVFS-GPU, it would take some time to speed it up because
it should monitor the GPU usage and react afterwards.

With the API given, "tickle" may be included in the user input event.
So that GPU can be scaled to maximum frequency immediately.

If the monitoring frequency is 50ms, DVFS will likely to react with
about 75ms of delay.
For example, if the touch event has occurred at 25ms after the last
DVFS monitoring event, at the next DVFS monitoring event, the measured
load will be about 50%, which usually won't require DVFS to speed up.
Then, DVFS will react at the next monitoring event; thus, it took 50ms
+ 25ms.

75ms is enough to give some impression to users. These days, many
mobile devices require 60Hz-based UI, which is about 16ms.

Anyway, this happens with drivers/cpufreq also. We have been testing
"tickling" associated with drivers/cpufreq/cpufreq.c. This has been
reduced user response time significantly removing the need for tuning
the threshold values.

>
>> +int dvfs_tickle_device(struct device *dev, unsigned long duration_ms)
>> +{
>> +     struct dvfs *dvfs;
>> +     struct opp *opp;
>> +     unsigned long freq;
>> +     int err = 0;
>> +
>> +     mutex_lock(&dvfs_list_lock);
>> +     dvfs = find_device_dvfs(dev);
>> +     if (!IS_ERR(dvfs)) {
>> +             freq = dvfs->profile->max_freq;
>> +             opp = opp_find_freq_floor(dvfs->dev, &freq);
>> +             freq = opp_get_freq(opp);
>> +             if (dvfs->previous_freq != freq) {
>> +                     err = dvfs->profile->target(dvfs->dev, freq,
>> +                                                 opp_get_voltage(opp));
>> +                     dvfs->previous_freq = freq;
>> +             }
>> +             if (err)
>> +                     dev_err(dev, "%s: Cannot set frequency.\n", __func__);
>> +             else
>> +                     dvfs->tickle = dvfs_next_polling(duration_ms);
>> +     }
>> +     mutex_unlock(&dvfs_list_lock);
>> +
>> +     if (IS_ERR(dvfs)) {
>> +             dev_err(dev, "%s: Cannot find dvfs.\n", __func__);
>> +             err = PTR_ERR(dvfs);
>> +     }
>> +
>> +     return err;
>> +}
>> +
>> +#ifdef CONFIG_PM_SLEEP
>> +static int dvfs_pause(struct device *dev)
>> +{
>> +     cancel_delayed_work_sync(&dvfs_work);
>> +     return 0;
>> +}
>> +
>> +static void dvfs_continue(struct device *dev)
>> +{
>> +     schedule_delayed_work(&dvfs_work, msecs_to_jiffies(DVFS_INTERVAL));
>> +}
>> +#else
>> +#define dvfs_pause   NULL
>> +#define dvfs_continue        NULL
>> +#endif
>> +static const struct dev_pm_ops dvfs_pm = {
>> +     .prepare        = dvfs_pause,
>> +     .complete       = dvfs_continue,
>> +};
>> +static struct platform_driver dvfs_pm_drv = {
>> +     .driver = {
>> +             .name   = "generic_dvfs_pm",
>> +             .pm     = &dvfs_pm,
>> +     },
>> +};
>> +static struct platform_device dvfs_pm_dev = {
>> +     .name = "generic_dvfs_pm",
>> +};
>
> Please don't do that.  You can simply use the freezable workqueue introduced
> by commit 4149efb (workqueue: add system_freezeable_wq) and then you won't
> have to worry about system suspend.

Cool! Thanks.

>
>
> Thanks,
> Rafael
>

Thank you for your comments.


Cheers!

- MyungJoo
MyungJoo Ham April 27, 2011, 5:24 a.m. UTC | #8
On Wed, Apr 27, 2011 at 12:07 AM, Mark Brown <broonie@sirena.org.uk> wrote:
> On Tue, Apr 26, 2011 at 05:43:21PM +0900, MyungJoo Ham wrote:
>
>> +static int __init dvfs_init(void)
>> +{
>> +     platform_driver_register(&dvfs_pm_drv);
>> +     platform_device_register(&dvfs_pm_dev);
>> +
>> +     INIT_DELAYED_WORK_DEFERRABLE(&dvfs_work, dvfs_monitor);
>> +     schedule_delayed_work(&dvfs_work, msecs_to_jiffies(DVFS_INTERVAL));
>
> This will unconditionally start the timer which polls the devices at
> whatever rate.  I'd expect to start and stop the poll depending on if we
> have devices and/or governor type code (which I don't see anything for
> here, though it's hard to know what criteria a generic governor would
> use) which require it.
> _______________________________________________
> linux-pm mailing list
> linux-pm@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/linux-pm
>

Thanks.

I will let it poll only when there are dvfs-devices with polling
interval specified.


Cheers!

- MyungJoo
MyungJoo Ham April 27, 2011, 5:33 a.m. UTC | #9
On Wed, Apr 27, 2011 at 5:54 AM, Greg KH <gregkh@suse.de> wrote:
> On Tue, Apr 26, 2011 at 10:45:18PM +0800, Jiejing.Zhang  wrote:
>> Hi Greg,
>>
>> I think DVFS here means “Dynamic voltage and frequency scaling”,  refering the
>> first google's result:
>
> Ok, if so, please spell it out.

Yes, will do so.

>
> Also, don't use 'dvfs' as a prefix in the kernel, too many other people
> will think it is a filesystem.  How about "dynvolt"?
>
> Or anything else, just no 'fs' in the name otherwise confusion will
> reign.

Um... ok... I'll avoid using fs as postfix and think about the name again.
"oppfreq", "devfreq", "dvfscale", "oppscale", "devscale", "ddp
(dynamic device power)", or "dfvs" (switched frequency and voltage to
avoid confusion with file systems).

>
>> a dynamic voltage and frequency scaling (DVFS) *technique that minimizes the
>> total system energy consumption for performing a task while satisfying
>> a given execution time constraint. We first show that in order to
>> guarantee minimum energy for task execution by using DVFS it is
>> essential to divide the system power into active and standby power
>> components. Next, we present a new DVFS technique, which considers not
>> only the active power, but also the standby component of the system
>> power.
>
> Please put this in the changelog entry, you need to explain it to those
> of us who have never seen it.

Sure. I'll do that.

>
> thanks,
>
> greg k-h
> _______________________________________________
> linux-pm mailing list
> linux-pm@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/linux-pm


Cheers!

- MyungJoo
Mark Brown April 27, 2011, 1:36 p.m. UTC | #10
On Tue, Apr 26, 2011 at 01:54:30PM -0700, Greg KH wrote:
> On Tue, Apr 26, 2011 at 10:45:18PM +0800, Jiejing.Zhang  wrote:

> > I think DVFS here means ???Dynamic voltage and frequency scaling???,?? refering the
> > first google's result:

> Ok, if so, please spell it out.

> Also, don't use 'dvfs' as a prefix in the kernel, too many other people
> will think it is a filesystem.  How about "dynvolt"?

> Or anything else, just no 'fs' in the name otherwise confusion will
> reign.

This really is the standard technical term for this, it's vanishingly
rare for it to be a source of confusion in context.  We should at least
mention the actual term in the documentation even if we invent our own
name for it.
Rafael Wysocki April 27, 2011, 1:46 p.m. UTC | #11
On Wednesday, April 27, 2011, MyungJoo Ham wrote:
> Hello,
> 
> 2011/4/26 Rafael J. Wysocki <rjw@sisk.pl>:
> > Hi,
> >
...
> >> +
> >> +/**
> >> + * dvfs_tickle_device() - Guarantee maximum operation speed for a while
> >> + *                   instaneously.
> >> + * @dev:     the device to be tickled.
> >> + * @duration_ms:     the duration of tickle effect.
> >> + *
> >> + * Tickle sets the device at the maximum frequency instaneously and
> >> + * the maximul frequency is guaranteed to be used for the given duration.
> >          ^^^^^^^
> > I guess that should be "maximum".
> 
> Yes. :)
> 
> >
> >> + * For faster user reponse time, an input event may tickle a related device
> >> + * so that the input event does not need to wait for the DVFS to react with
> >> + * normal interval.
> >> + */
> >
> > Can you explain how this function is supposed to be used, please?
> 
> Ok, I'll show you a usage example
> 
> The system: a touchscreen based computer with DVFS'able GPU.
> Use case: a user scrolls the screen with fingers suddenly. (think of
> an iPhone or Android device's home screen filled wit icons)
> While there is no touchscreen input (thus, the screen does not move
> fast), we do not use GPU too much; thus it is scaled to run at slower
> rate (let's say 100MHz).
> If a user suddenly touches screen and starts moving, the screen should
> follow it; thus, requires much output from GPU.
> With normal DVFS-GPU, it would take some time to speed it up because
> it should monitor the GPU usage and react afterwards.
> 
> With the API given, "tickle" may be included in the user input event.
> So that GPU can be scaled to maximum frequency immediately.

So in this case the input subsystem would be calling dvfs_tickle_device()
(or whatever it is finally called) for the GPU, right?

I'm afraid you'd need a user space interface for that and let the
application (or library) handling the touchpad input control the "tickle".
Otherwise it would be difficult to arrange things correctly (the association
between the input subsystem and the GPU is not obvious at the kernel level).

Ideally, though, the working OPP for the GPU should depend on the number of
processing requests per a unit of time, pretty much like for a CPU.

> If the monitoring frequency is 50ms, DVFS will likely to react with
> about 75ms of delay.
> For example, if the touch event has occurred at 25ms after the last
> DVFS monitoring event, at the next DVFS monitoring event, the measured
> load will be about 50%, which usually won't require DVFS to speed up.
> Then, DVFS will react at the next monitoring event; thus, it took 50ms
> + 25ms.
> 
> 75ms is enough to give some impression to users. These days, many
> mobile devices require 60Hz-based UI, which is about 16ms.
> 
> Anyway, this happens with drivers/cpufreq also. We have been testing
> "tickling" associated with drivers/cpufreq/cpufreq.c. This has been
> reduced user response time significantly removing the need for tuning
> the threshold values.

I think I understand the problem, but I'm not sure if there's a clean way
to trigger the "tickle" from the kernel level.

Thanks,
Rafael
Rafael Wysocki April 27, 2011, 1:49 p.m. UTC | #12
On Wednesday, April 27, 2011, MyungJoo Ham wrote:
> On Wed, Apr 27, 2011 at 5:54 AM, Greg KH <gregkh@suse.de> wrote:
> > On Tue, Apr 26, 2011 at 10:45:18PM +0800, Jiejing.Zhang  wrote:
> >> Hi Greg,
> >>
> >> I think DVFS here means “Dynamic voltage and frequency scaling”,  refering the
> >> first google's result:
> >
> > Ok, if so, please spell it out.
> 
> Yes, will do so.
> 
> >
> > Also, don't use 'dvfs' as a prefix in the kernel, too many other people
> > will think it is a filesystem.  How about "dynvolt"?
> >
> > Or anything else, just no 'fs' in the name otherwise confusion will
> > reign.
> 
> Um... ok... I'll avoid using fs as postfix and think about the name again.
> "oppfreq", "devfreq", "dvfscale", "oppscale", "devscale", "ddp
> (dynamic device power)", or "dfvs" (switched frequency and voltage to
> avoid confusion with file systems).

I think devfreq is the best one of the above.  At least it will make everyone
think that the feature is analogous to cpufreq, which is correct.

Thanks,
Rafael
Colin Cross April 27, 2011, 5:49 p.m. UTC | #13
On Tue, Apr 26, 2011 at 10:22 PM, MyungJoo Ham <myungjoo.ham@samsung.com> wrote:
> Hello,
>
> 2011/4/26 Rafael J. Wysocki <rjw@sisk.pl>:
>> Hi,
>>
>> To start with let me say I don't have any fundamental objections at this point,
>> although I'm not 100% sure I understand correctly how this feature is supposed
>> to be used.  My understanding is that if a device is known to have multiple
>> OPPs, its subsystem and driver may call dvfs_add_device() for it, which will
>> cause it to be periodically monitored and put into the "optimum" OPP using
>> the infrastructure introduced by your patch (on the basis of the given
>> usage profile and with the help of the given governor).  Is that correct?
>
> Yes, that is correct.

I'm a little confused about the design for this, and OPP as well.  OPP
matches a struct device * and a frequency to a voltage, which is not a
generically useful pairing, as far as I can tell.  On Tegra, it is
quite possible for a single device to have multiple clocks that each
have different voltage requirements, for example the display block can
have an interface clock as well as a pixel clock.  Simplifying this to
dev + freq = voltage seems very OMAP specific, and will be difficult
or impossible to adapt to Tegra.

Moreover, from a silicon perspective, there is always a simple link
from a single frequency to a minimum voltage for a given circuit.
There is no need to group them into OPPs, which seem to have a group
of clocks and their frequencies that map to a single voltage.  That is
an artifact of the way TI specifies voltages.

I don't think DVFS is even the right place for any sort of governor.
DVFS is very simple - to increase to a specific clock speed, the
voltage must be immediately be raised, with minimum or no delay, to a
specified value that is specific to that clock.  When the frequency is
lowered, the voltage should be decreased.  There is a tiny bit of
policy to determine when to delay dropping the voltage in case the
frequency will immediately be raised again, but nowhere near the
complexity of what is shown here.

I proposed in a different thread on LKML that DVFS be handled within
the generic clock implementation.  Platforms would register a
regulator and a table of voltages for each struct clock that required
DVFS, and the voltages would be changed on normal clk_* requests.
This maintains compatibility with existing clk_* calls.

There is a place for a GPU, etc., frequency governor, but it is a
completely separate issue from DVFS, and should not be mixed in.  I
could have a GPU that is not voltage scalable, but could still benefit
from lowering the frequency when it is not in use.  A devfreq
interface sounds perfect for this, as long as it only ends up calling
clk_* functions, and those functions handle getting the voltage
correct.
Nishanth Menon April 27, 2011, 6:07 p.m. UTC | #14
On Wed, Apr 27, 2011 at 12:49, Colin Cross <ccross@google.com> wrote:
+l-o

> I'm a little confused about the design for this, and OPP as well.  OPP
> matches a struct device * and a frequency to a voltage, which is not a
> generically useful pairing, as far as I can tell.  On Tegra, it is
> quite possible for a single device to have multiple clocks that each
> have different voltage requirements, for example the display block can
> have an interface clock as well as a pixel clock.  Simplifying this to
> dev + freq = voltage seems very OMAP specific, and will be difficult
> or impossible to adapt to Tegra.
We have the same requirements as well(iclk,fclk,pixclk etc..)! We
group them under voltage domains in OMAP ;). if your issue was a
ability to have a single freq to a OPP, it is upto SoC to do the
proper mapping. Concept of an OPP still remains consistent - which is
for a voltage, there is only so much freq you can drive that specific
module to.

> Moreover, from a silicon perspective, there is always a simple link
> from a single frequency to a minimum voltage for a given circuit.
> There is no need to group them into OPPs, which seem to have a group
> of clocks and their frequencies that map to a single voltage.  That is
> an artifact of the way TI specifies voltages.
>
> I don't think DVFS is even the right place for any sort of governor.
> DVFS is very simple - to increase to a specific clock speed, the
> voltage must be immediately be raised, with minimum or no delay, to a
> specified value that is specific to that clock.  When the frequency is
> lowered, the voltage should be decreased.  There is a tiny bit of
> policy to determine when to delay dropping the voltage in case the
> frequency will immediately be raised again, but nowhere near the
> complexity of what is shown here.
>
> I proposed in a different thread on LKML that DVFS be handled within
> the generic clock implementation.  Platforms would register a
> regulator and a table of voltages for each struct clock that required
> DVFS, and the voltages would be changed on normal clk_* requests.
> This maintains compatibility with existing clk_* calls.

It is upto SoC frameworks to implement the transitions. E.g. lets look
at scalability: How'd the mechanism proposed work with temperature
variances: Example: I dont want to hit 1.5GHz if temp >70C - wont it
be an SoC specific hack I'd need to introduce?

All OPP framework does is store that maps, and leaves it to users to
choose regulators, clock framework variances, SoC temperature sensors
or what ever mechanisms they choose to allow through a transition.

> There is a place for a GPU, etc., frequency governor, but it is a
> completely separate issue from DVFS, and should not be mixed in.  I
> could have a GPU that is not voltage scalable, but could still benefit
> from lowering the frequency when it is not in use.  A devfreq
> interface sounds perfect for this, as long as it only ends up calling
> clk_* functions, and those functions handle getting the voltage
> correct.

Regards,
Nishanth Menon
PS:
https://lists.linux-foundation.org/pipermail/linux-pm/2011-April/031113.html
for start of thread
Colin Cross April 27, 2011, 6:29 p.m. UTC | #15
On Wed, Apr 27, 2011 at 11:07 AM, Menon, Nishanth <nm@ti.com> wrote:
> On Wed, Apr 27, 2011 at 12:49, Colin Cross <ccross@google.com> wrote:
> +l-o
>
>> I'm a little confused about the design for this, and OPP as well.  OPP
>> matches a struct device * and a frequency to a voltage, which is not a
>> generically useful pairing, as far as I can tell.  On Tegra, it is
>> quite possible for a single device to have multiple clocks that each
>> have different voltage requirements, for example the display block can
>> have an interface clock as well as a pixel clock.  Simplifying this to
>> dev + freq = voltage seems very OMAP specific, and will be difficult
>> or impossible to adapt to Tegra.
> We have the same requirements as well(iclk,fclk,pixclk etc..)! We
> group them under voltage domains in OMAP ;). if your issue was a
> ability to have a single freq to a OPP, it is upto SoC to do the
> proper mapping. Concept of an OPP still remains consistent - which is
> for a voltage, there is only so much freq you can drive that specific
> module to.
No, that is still wrong.  You don't drive a module at a frequency, you
drive a clock.  You can't map struct device * 1-1 to a clock.  Look at
omap2_set_init_voltage:
static int __init omap2_set_init_voltage(char *vdd_name, char *clk_name,
						struct device *dev) {

	clk =  clk_get(NULL, clk_name);
	freq = clk->rate;
	opp = opp_find_freq_ceil(dev, &freq);
	...
}

Now what happens if I have a dev with two frequencies,

>> Moreover, from a silicon perspective, there is always a simple link
>> from a single frequency to a minimum voltage for a given circuit.
>> There is no need to group them into OPPs, which seem to have a group
>> of clocks and their frequencies that map to a single voltage.  That is
>> an artifact of the way TI specifies voltages.
>>
>> I don't think DVFS is even the right place for any sort of governor.
>> DVFS is very simple - to increase to a specific clock speed, the
>> voltage must be immediately be raised, with minimum or no delay, to a
>> specified value that is specific to that clock.  When the frequency is
>> lowered, the voltage should be decreased.  There is a tiny bit of
>> policy to determine when to delay dropping the voltage in case the
>> frequency will immediately be raised again, but nowhere near the
>> complexity of what is shown here.
>>
>> I proposed in a different thread on LKML that DVFS be handled within
>> the generic clock implementation.  Platforms would register a
>> regulator and a table of voltages for each struct clock that required
>> DVFS, and the voltages would be changed on normal clk_* requests.
>> This maintains compatibility with existing clk_* calls.
>
> It is upto SoC frameworks to implement the transitions. E.g. lets look
> at scalability: How'd the mechanism proposed work with temperature
> variances: Example: I dont want to hit 1.5GHz if temp >70C - wont it
> be an SoC specific hack I'd need to introduce?
>
> All OPP framework does is store that maps, and leaves it to users to
> choose regulators, clock framework variances, SoC temperature sensors
> or what ever mechanisms they choose to allow through a transition.
>
>> There is a place for a GPU, etc., frequency governor, but it is a
>> completely separate issue from DVFS, and should not be mixed in.  I
>> could have a GPU that is not voltage scalable, but could still benefit
>> from lowering the frequency when it is not in use.  A devfreq
>> interface sounds perfect for this, as long as it only ends up calling
>> clk_* functions, and those functions handle getting the voltage
>> correct.
>
> Regards,
> Nishanth Menon
> PS:
> https://lists.linux-foundation.org/pipermail/linux-pm/2011-April/031113.html
> for start of thread
>
Mark Brown April 27, 2011, 6:34 p.m. UTC | #16
On Wed, Apr 27, 2011 at 10:49:47AM -0700, Colin Cross wrote:

> Moreover, from a silicon perspective, there is always a simple link
> from a single frequency to a minimum voltage for a given circuit.
> There is no need to group them into OPPs, which seem to have a group
> of clocks and their frequencies that map to a single voltage.  That is
> an artifact of the way TI specifies voltages.

This isn't just a TI thing, other CPU (and general big digital) vendors
spec things similarly.  It's just that TI have more code in mainline
than most.  My understanding is that when people do this the set of
operating points aren't purely about DVFS, they're also about specifying
the relationships between the various system clock rates and potentially
other things which are supported, especially in the blocks that mediate
between multiple domains.  The voltages are just one of the parameters
here, and with multiple supplies the voltages aren't always orthogonal
to each other.

> I proposed in a different thread on LKML that DVFS be handled within
> the generic clock implementation.  Platforms would register a
> regulator and a table of voltages for each struct clock that required
> DVFS, and the voltages would be changed on normal clk_* requests.
> This maintains compatibility with existing clk_* calls.

This comes back to the thing we were discussing in the thread there
about there being n:m constraints between settings.  I've not looked at
this in any detail but it may be that for the systems which spec things
as operating points we want to root the core clocking in an operating
point block which deals with stuff like this.
Colin Cross April 27, 2011, 6:37 p.m. UTC | #17
(sorry, missent the earlier one)

On Wed, Apr 27, 2011 at 11:07 AM, Menon, Nishanth <nm@ti.com> wrote:
> On Wed, Apr 27, 2011 at 12:49, Colin Cross <ccross@google.com> wrote:
> +l-o
>
>> I'm a little confused about the design for this, and OPP as well.  OPP
>> matches a struct device * and a frequency to a voltage, which is not a
>> generically useful pairing, as far as I can tell.  On Tegra, it is
>> quite possible for a single device to have multiple clocks that each
>> have different voltage requirements, for example the display block can
>> have an interface clock as well as a pixel clock.  Simplifying this to
>> dev + freq = voltage seems very OMAP specific, and will be difficult
>> or impossible to adapt to Tegra.
> We have the same requirements as well(iclk,fclk,pixclk etc..)! We
> group them under voltage domains in OMAP ;). if your issue was a
> ability to have a single freq to a OPP, it is upto SoC to do the
> proper mapping. Concept of an OPP still remains consistent - which is
> for a voltage, there is only so much freq you can drive that specific
> module to.
No, that is still wrong.  You don't drive a module at a frequency, you
drive a clock.  You can't map struct device * 1-1 to a clock.  Look at
omap2_set_init_voltage:
static int __init omap2_set_init_voltage(char *vdd_name, char *clk_name,
                                               struct device *dev) {
        ...
        clk =  clk_get(NULL, clk_name);
        freq = clk->rate;
        opp = opp_find_freq_ceil(dev, &freq);
        ...
}

What happens if I have a dev with two frequencies?  I can only pass a
dev into opp.  It makes infinitely more sense to pass in a clock:
opp_find_freq_ceil(clk, &freq).

> It is upto SoC frameworks to implement the transitions. E.g. lets look
> at scalability: How'd the mechanism proposed work with temperature
> variances: Example: I dont want to hit 1.5GHz if temp >70C - wont it
> be an SoC specific hack I'd need to introduce?
No, because you're putting it in the wrong place, that is a policy
decision.  Handle it in the clock framework, or handle it in the
device driver.  That's a bad example either way - what happens if you
are already at 1.5GHz when the temperature crosses 70C?  You need an
interrupt that tells you the temperature is too high, and than needs
to affect a policy decision at a much higher level than dvfs.

>
> All OPP framework does is store that maps, and leaves it to users to
> choose regulators, clock framework variances, SoC temperature sensors
> or what ever mechanisms they choose to allow through a transition.
I understand its just a map, but its a map between two things that
don't have a direct mapping in many SoCs.  I think if you changed
every usage of struct dev * in opp to struct clk *, it would make much
more sense.  There is already a mapping from struct dev * to struct
clk *, its called clk_get, and it takes a second parameter to allow
devices to have multiple clocks.
Colin Cross April 27, 2011, 6:45 p.m. UTC | #18
On Wed, Apr 27, 2011 at 11:34 AM, Mark Brown <broonie@sirena.org.uk> wrote:
> On Wed, Apr 27, 2011 at 10:49:47AM -0700, Colin Cross wrote:
>
>> Moreover, from a silicon perspective, there is always a simple link
>> from a single frequency to a minimum voltage for a given circuit.
>> There is no need to group them into OPPs, which seem to have a group
>> of clocks and their frequencies that map to a single voltage.  That is
>> an artifact of the way TI specifies voltages.
>
> This isn't just a TI thing, other CPU (and general big digital) vendors
> spec things similarly.  It's just that TI have more code in mainline
> than most.  My understanding is that when people do this the set of
> operating points aren't purely about DVFS, they're also about specifying
> the relationships between the various system clock rates and potentially
> other things which are supported, especially in the blocks that mediate
> between multiple domains.  The voltages are just one of the parameters
> here, and with multiple supplies the voltages aren't always orthogonal
> to each other.
Then TI (and other vendors) are providing one table that specifies two
things - the relationship between frequency and voltage for each
clock, and the relationship between multiple clocks for optimal
performance.  Mixing those in the kernel breaks platforms (like Tegra)
that do not have any relationship between multiple clocks.

Can you point me to a platform that implements keeping multiple clocks
changing between OPPs together?  As far as I can tell, OMAP4 always
does dev -> clk, clk -> freq, dev + freq -> voltage, and never looks
at multiple clocks.

>> I proposed in a different thread on LKML that DVFS be handled within
>> the generic clock implementation.  Platforms would register a
>> regulator and a table of voltages for each struct clock that required
>> DVFS, and the voltages would be changed on normal clk_* requests.
>> This maintains compatibility with existing clk_* calls.
>
> This comes back to the thing we were discussing in the thread there
> about there being n:m constraints between settings.  I've not looked at
> this in any detail but it may be that for the systems which spec things
> as operating points we want to root the core clocking in an operating
> point block which deals with stuff like this.
>

On some platforms there may be a need for some global clock governor,
which can keep multiple clocks in sync, but I haven't seen one for
which that is truly required.
Nishanth Menon April 27, 2011, 6:48 p.m. UTC | #19
On Wed, Apr 27, 2011 at 13:29, Colin Cross <ccross@google.com> wrote:
> On Wed, Apr 27, 2011 at 11:07 AM, Menon, Nishanth <nm@ti.com> wrote:
>> On Wed, Apr 27, 2011 at 12:49, Colin Cross <ccross@google.com> wrote:
>> +l-o
>>
>>> I'm a little confused about the design for this, and OPP as well.  OPP
>>> matches a struct device * and a frequency to a voltage, which is not a
>>> generically useful pairing, as far as I can tell.  On Tegra, it is
>>> quite possible for a single device to have multiple clocks that each
>>> have different voltage requirements, for example the display block can
>>> have an interface clock as well as a pixel clock.  Simplifying this to
>>> dev + freq = voltage seems very OMAP specific, and will be difficult
>>> or impossible to adapt to Tegra.
>> We have the same requirements as well(iclk,fclk,pixclk etc..)! We
>> group them under voltage domains in OMAP ;). if your issue was a
>> ability to have a single freq to a OPP, it is upto SoC to do the
>> proper mapping. Concept of an OPP still remains consistent - which is
>> for a voltage, there is only so much freq you can drive that specific
>> module to.
> No, that is still wrong.  You don't drive a module at a frequency, you
> drive a clock.  You can't map struct device * 1-1 to a clock.  Look at
Agreed, module runs on clocks - Lets say n clocks provide a module
it's functionality.

> omap2_set_init_voltage:
> static int __init omap2_set_init_voltage(char *vdd_name, char *clk_name,
>                                                struct device *dev) {
>
>        clk =  clk_get(NULL, clk_name);
>        freq = clk->rate;
>        opp = opp_find_freq_ceil(dev, &freq);
>        ...
> }
>
> Now what happens if I have a dev with two frequencies,
we do have it - it depends on what the OPP table represents. we do
have modules which have both interface and functional clocks on OMAP
as well. for a module(represented by struct device *) which has n
clocks, choose the scheme of representation of clock that depends on
voltage for the module.
in the example you provided "the display block can have an interface
clock as well as a pixel clock" - I suppose you mean:
{.pclk = x, .iclk = y, .v = z}
The question I'd ask is this : for a voltage z, is the dependency on
pclk or iclk? I can expect a dependency of pclk to iclk requirement
(considering pixel clock drives an external display for example). the
table reduces to just
{.iclk = y, .v = z} and a different table that has divisor for .iclk
to pclk which is SoC based.

OPP table is just a storage and retrieval mechanism, it is upto SoC
frameworks to choose the most adequate of solutions - e.g. OMAP has
omap_device, hwmod and a clock framework for more intricate control to
work in conjunction with cpuidle frameworks as well.

There is cross domain dependency which OMAP (yet to be pushed to
mainline) has - example: when OMAP4's MPUs are at a certain OPP, L3
(OMAP's SoC bus) needs to be at least a certain OPP - these are
framework which may be very custom to OMAP itself.

---
Regards,
Nishanth Menon
Colin Cross April 27, 2011, 6:54 p.m. UTC | #20
On Wed, Apr 27, 2011 at 11:48 AM, Menon, Nishanth <nm@ti.com> wrote:
> On Wed, Apr 27, 2011 at 13:29, Colin Cross <ccross@google.com> wrote:
>> On Wed, Apr 27, 2011 at 11:07 AM, Menon, Nishanth <nm@ti.com> wrote:
>>> On Wed, Apr 27, 2011 at 12:49, Colin Cross <ccross@google.com> wrote:
>>> +l-o
>>>
>>>> I'm a little confused about the design for this, and OPP as well.  OPP
>>>> matches a struct device * and a frequency to a voltage, which is not a
>>>> generically useful pairing, as far as I can tell.  On Tegra, it is
>>>> quite possible for a single device to have multiple clocks that each
>>>> have different voltage requirements, for example the display block can
>>>> have an interface clock as well as a pixel clock.  Simplifying this to
>>>> dev + freq = voltage seems very OMAP specific, and will be difficult
>>>> or impossible to adapt to Tegra.
>>> We have the same requirements as well(iclk,fclk,pixclk etc..)! We
>>> group them under voltage domains in OMAP ;). if your issue was a
>>> ability to have a single freq to a OPP, it is upto SoC to do the
>>> proper mapping. Concept of an OPP still remains consistent - which is
>>> for a voltage, there is only so much freq you can drive that specific
>>> module to.
>> No, that is still wrong.  You don't drive a module at a frequency, you
>> drive a clock.  You can't map struct device * 1-1 to a clock.  Look at
> Agreed, module runs on clocks - Lets say n clocks provide a module
> it's functionality.
>
>> omap2_set_init_voltage:
>> static int __init omap2_set_init_voltage(char *vdd_name, char *clk_name,
>>                                                struct device *dev) {
>>
>>        clk =  clk_get(NULL, clk_name);
>>        freq = clk->rate;
>>        opp = opp_find_freq_ceil(dev, &freq);
>>        ...
>> }
>>
>> Now what happens if I have a dev with two frequencies,
> we do have it - it depends on what the OPP table represents. we do
> have modules which have both interface and functional clocks on OMAP
> as well. for a module(represented by struct device *) which has n
> clocks, choose the scheme of representation of clock that depends on
> voltage for the module.
> in the example you provided "the display block can have an interface
> clock as well as a pixel clock" - I suppose you mean:
> {.pclk = x, .iclk = y, .v = z}
> The question I'd ask is this : for a voltage z, is the dependency on
> pclk or iclk? I can expect a dependency of pclk to iclk requirement
> (considering pixel clock drives an external display for example). the
> table reduces to just
> {.iclk = y, .v = z} and a different table that has divisor for .iclk
> to pclk which is SoC based.

No, there can be voltage requirements on both, and the higher voltage
requirement of the two must be used.

> OPP table is just a storage and retrieval mechanism, it is upto SoC
> frameworks to choose the most adequate of solutions - e.g. OMAP has
> omap_device, hwmod and a clock framework for more intricate control to
> work in conjunction with cpuidle frameworks as well.
>
> There is cross domain dependency which OMAP (yet to be pushed to
> mainline) has - example: when OMAP4's MPUs are at a certain OPP, L3
> (OMAP's SoC bus) needs to be at least a certain OPP - these are
> framework which may be very custom to OMAP itself.
>
> ---
> Regards,
> Nishanth Menon
>
Mark Brown April 27, 2011, 6:58 p.m. UTC | #21
On Wed, Apr 27, 2011 at 11:45:18AM -0700, Colin Cross wrote:

> Then TI (and other vendors) are providing one table that specifies two
> things - the relationship between frequency and voltage for each
> clock, and the relationship between multiple clocks for optimal

I imagine you'll also see some of the constraints coming from other
sources (thermal was mentioned, for example).  

> performance.  Mixing those in the kernel breaks platforms (like Tegra)
> that do not have any relationship between multiple clocks.

I'm not sure why this is a problem, presumably you can either define a
lot of simple OPP domains or not use the relevant infrastructure?  If we
can't cope with that sort of scenario then we'd also be unable to cope
with two distinct chips simultaneously using the infrastructure which
seems like an obvious use case.

> Can you point me to a platform that implements keeping multiple clocks
> changing between OPPs together?  As far as I can tell, OMAP4 always
> does dev -> clk, clk -> freq, dev + freq -> voltage, and never looks
> at multiple clocks.

I don't know that I've got anything I can publish, sorry.

> > This comes back to the thing we were discussing in the thread there
> > about there being n:m constraints between settings. ?I've not looked at
> > this in any detail but it may be that for the systems which spec things
> > as operating points we want to root the core clocking in an operating
> > point block which deals with stuff like this.

> On some platforms there may be a need for some global clock governor,
> which can keep multiple clocks in sync, but I haven't seen one for
> which that is truly required.

I gather that with a lot of this stuff there's a big difference between
what works and what people are willing to sign off on over all process
variations, possible timing conditions and so on.
Thomas Gleixner April 27, 2011, 7 p.m. UTC | #22
On Wed, 27 Apr 2011, Menon, Nishanth wrote:
> On Wed, Apr 27, 2011 at 12:49, Colin Cross <ccross@google.com> wrote:
> > I proposed in a different thread on LKML that DVFS be handled within
> > the generic clock implementation.  Platforms would register a
> > regulator and a table of voltages for each struct clock that required
> > DVFS, and the voltages would be changed on normal clk_* requests.
> > This maintains compatibility with existing clk_* calls.
> 
> It is upto SoC frameworks to implement the transitions. E.g. lets look
> at scalability: How'd the mechanism proposed work with temperature
> variances: Example: I dont want to hit 1.5GHz if temp >70C - wont it
> be an SoC specific hack I'd need to introduce?

Why is limiting the max core frequency depending on temperature a SoC
specific problem ?

Everyone wants to do that. x86 does it in hardware / SMM, other
architectures want the kernel to take care of it.

So the decision is simple. Something wants to set core freq to 1.5
GHz, so it calls clk_set_rate() and there we consult the DVFS code
first to validate that setting. If it can be set, fine, then DVFS will
set the voltages _before_ we change the frequency or it will simply
veto the change because one of the preliminaries for such a change is
not given.

Please stop thinking that your SoC is sooo special. It's NOT.

The HW concepts are quite similar all over the place, they are just
named differently and use different IP blocks with slightly different
functionality, but the problems are not unique to a particular SoC at
all.

> All OPP framework does is store that maps, and leaves it to users to
> choose regulators, clock framework variances, SoC temperature sensors
> or what ever mechanisms they choose to allow through a transition.

That's how it's implemented, but that does not say that the design is
correct and usable for more than the usecase it was modeled after.
 
We are looking into a common clock framework, which abstracts out the
duplicated functionality of the various implementations and reduces
them to the real thing: hardware drivers. So we really need to look
into that DVFS problem as well, simply because it is tightly coupled
and not a complete separate entity.

And looking at the struct clk disaster we really don't want another
incarnation in terms of DVFS where we end up with the same decision
functions in various SoCs over and over.

Thanks,

	tglx
Thomas Gleixner April 27, 2011, 7:26 p.m. UTC | #23
On Wed, 27 Apr 2011, Menon, Nishanth wrote:

> OPP table is just a storage and retrieval mechanism, it is upto SoC
> frameworks to choose the most adequate of solutions - e.g. OMAP has
> omap_device, hwmod and a clock framework for more intricate control to
> work in conjunction with cpuidle frameworks as well.

Can you please stop thinking about OMAP for a minute?

A clock framework is nothing SoC specific. A framework is an
abstraction of common HW functionality, which implements general
functionality and relies on the HW specific part to configure it and
to provide access to the hardware itself.

clocks are ordered as trees in HW, simply because you cannot have a
clock consumer be driven by more than one active clock at the same
time. A clock consumer may select a different clock producer, but that
merily changes the tree structure nothing else. So why should every
SoC implement it's own (different buggy) version of tree handling and
call it framework?

Yes, I know you might argue that some devices need two clocks enabled
to be functional. That's correct, but coupling those clocks at the
framework level is the wrong thing to do. If a device needs both an
interface clock and a separate interconnect clock to work, then it
needs to enable both clocks and become a consumer of them.
 
> There is cross domain dependency which OMAP (yet to be pushed to
> mainline) has - example: when OMAP4's MPUs are at a certain OPP, L3
> (OMAP's SoC bus) needs to be at least a certain OPP - these are
> framework which may be very custom to OMAP itself.

Wrong again. That's not a framework when you hack SoC specific
decision functions into it. It's the OMAP internal hackery to make
stuff work, but that's far from a framework.

What you are describing is a restriction which can be expressed in
tables or rules which are fed into a general framework.

Look at generic irqs, generic timekeeping, generic clockevents and
tons of other real frameworks in the kernel. They abstract out
concepts and provide generic interfaces rather than claiming that the
problem is unique to a particular piece of silicon.

Forget OMAP implementation details for a while, sit back and look at
the big picture.

Thanks,

	tglx
Colin Cross April 27, 2011, 8:48 p.m. UTC | #24
On Wed, Apr 27, 2011 at 12:26 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> Forget OMAP implementation details for a while, sit back and look at
> the big picture.

Here's my proposal for DVFS:
- DVFS is implemented in drivers/clk/dvfs.c, and is called by the
common clock implementation to adjust the voltages, if necessary, on
regular clk_* calls.
- Platform code provides mappings in the form (clk, regulator, max
frequency, min voltage) to the dvfs code.
- Everything that is in OPP today gets converted to helper functions
inside the dvfs implementation, and is never called from SoC code
(except to pass tables at init), or from drivers.
- OPP can be recreated in the future as a upper level policy manager
for clocks that need to move together, if that is ever necessary.  It
would not know anything about voltages.
- A few common policy implementations need to be added to the common
clock implementation, like temperature limits.

For Tegra:
- DVFS continues to be accessed by calling clk_* functions

For OMAP:
- DVFS is triggered by hwmod through clk_* functions.  Any cross-arch
driver can continue to call clk_* functions.

OPP currently has opp_enable and opp_disable functions.  I don't
understand why these are needed, they are only used at init time to
determine available voltages, which could be handled by never passing
unavailable voltages to the dvfs implementation.
MyungJoo Ham April 28, 2011, 5:59 a.m. UTC | #25
On Thu, Apr 28, 2011 at 3:37 AM, Colin Cross <ccross@google.com> wrote:
> (sorry, missent the earlier one)
>
> On Wed, Apr 27, 2011 at 11:07 AM, Menon, Nishanth <nm@ti.com> wrote:
>> On Wed, Apr 27, 2011 at 12:49, Colin Cross <ccross@google.com> wrote:
>> +l-o
>>
>>> I'm a little confused about the design for this, and OPP as well.  OPP
>>> matches a struct device * and a frequency to a voltage, which is not a
>>> generically useful pairing, as far as I can tell.  On Tegra, it is
>>> quite possible for a single device to have multiple clocks that each
>>> have different voltage requirements, for example the display block can
>>> have an interface clock as well as a pixel clock.  Simplifying this to
>>> dev + freq = voltage seems very OMAP specific, and will be difficult
>>> or impossible to adapt to Tegra.
>> We have the same requirements as well(iclk,fclk,pixclk etc..)! We
>> group them under voltage domains in OMAP ;). if your issue was a
>> ability to have a single freq to a OPP, it is upto SoC to do the
>> proper mapping. Concept of an OPP still remains consistent - which is
>> for a voltage, there is only so much freq you can drive that specific
>> module to.
> No, that is still wrong.  You don't drive a module at a frequency, you
> drive a clock.  You can't map struct device * 1-1 to a clock.  Look at
> omap2_set_init_voltage:
> static int __init omap2_set_init_voltage(char *vdd_name, char *clk_name,
>                                                struct device *dev) {
>        ...
>         clk =  clk_get(NULL, clk_name);
>         freq = clk->rate;
>         opp = opp_find_freq_ceil(dev, &freq);
>         ...
> }
>
> What happens if I have a dev with two frequencies?  I can only pass a
> dev into opp.  It makes infinitely more sense to pass in a clock:
> opp_find_freq_ceil(clk, &freq).

What one instance of DVFS (devfreq) controls are clocks and
regulators. (a device may have multiple regulators as well as multiple
clocks)
What one instance of DVFS (devfreq) monitors (device load and/or
temperature) is a device that uses the clocks and regulators.

If we focus on the things that are controlled by DVFS, connecting DVFS
with clock seems fine; however, DVFS's decision is based on the status
of the device and the decision (monitoring result) configures a set of
clocks and regulators. The clocks are not configured independently
from others if the clocks are used by a DVFS-capable device. The
frequency/voltage pair (OPP in this patch) associated with a device
becomes a representative value of a specific configuration that
configures the set of clocks and regulators.

This is quite similar with CPUFREQ. CPUFREQ provides a single
frequency value as a result of monitoring; however the machine's
cpufreq driver may set multiple clocks and multiple voltage regulators
based on the representative value (which is usually the core clock)
although the cpufreq driver may need to control many more clocks with
different frequencies.

With multiple clocks of a device, if there is a clock that is required
to be set independently from the "representative" clock with DVFS, it
means that the DVFS monitoring result (load/temperature) is not a
scalar value but a vector (multi-dimensional value). That implies that
we need to monitor different and independent values, which in turn,
implies that we need separated devices. Note that the DVFS monitor
result from load and temperature combined is not a multi-dimensional
value because the temperature limits "maximum possible frequency or
voltage" and the load gives "preferred lower bound of frequency" that
can be overridden by the limit set by temperature.

Therefore, having one DVFS per clock where multiple clocks are
attached to a device will create multiple monitors that monitor the
same object(device behavior) with same metrics (load and temperature).

Besides, the reason I've started with "target" callback, not clk and
regulator names or pointers is that a device may have multiple clks
and regulators and the OPP may only show the representative
clock/regulators as CPUFREQ does. Especially when the order of
transitions of those multiple clocks and regulators matter (if they
are in a single device, it sometimes does), running a DVFS per clock,
not per device, will be bothersome if not disasterous.

>
>> It is upto SoC frameworks to implement the transitions. E.g. lets look
>> at scalability: How'd the mechanism proposed work with temperature
>> variances: Example: I dont want to hit 1.5GHz if temp >70C - wont it
>> be an SoC specific hack I'd need to introduce?
> No, because you're putting it in the wrong place, that is a policy
> decision.  Handle it in the clock framework, or handle it in the
> device driver.  That's a bad example either way - what happens if you
> are already at 1.5GHz when the temperature crosses 70C?  You need an
> interrupt that tells you the temperature is too high, and than needs
> to affect a policy decision at a much higher level than dvfs.
>
>>
>> All OPP framework does is store that maps, and leaves it to users to
>> choose regulators, clock framework variances, SoC temperature sensors
>> or what ever mechanisms they choose to allow through a transition.
> I understand its just a map, but its a map between two things that
> don't have a direct mapping in many SoCs.  I think if you changed
> every usage of struct dev * in opp to struct clk *, it would make much
> more sense.  There is already a mapping from struct dev * to struct
> clk *, its called clk_get, and it takes a second parameter to allow
> devices to have multiple clocks.

Anyway, I agree that we need either one of the followings (at least in
the future)
1) OPP mapped with clock
or
2) multiple OPPs mapped with a dev, each OPP with a clock of the dev.

Still, DVFS(devfreq) would be better linked directly with a device
although we will need to tell which OPP of the device or which clock's
OPP should be linked as representative clock OPP.

> _______________________________________________
> linux-pm mailing list
> linux-pm@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/linux-pm
>

Cheers!
- MyungJoo
MyungJoo Ham April 28, 2011, 6:12 a.m. UTC | #26
On Thu, Apr 28, 2011 at 5:48 AM, Colin Cross <ccross@google.com> wrote:
> On Wed, Apr 27, 2011 at 12:26 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> Forget OMAP implementation details for a while, sit back and look at
>> the big picture.
>
> Here's my proposal for DVFS:
> - DVFS is implemented in drivers/clk/dvfs.c, and is called by the
> common clock implementation to adjust the voltages, if necessary, on
> regular clk_* calls.
> - Platform code provides mappings in the form (clk, regulator, max
> frequency, min voltage) to the dvfs code.
> - Everything that is in OPP today gets converted to helper functions
> inside the dvfs implementation, and is never called from SoC code
> (except to pass tables at init), or from drivers.
> - OPP can be recreated in the future as a upper level policy manager
> for clocks that need to move together, if that is ever necessary.  It
> would not know anything about voltages.
> - A few common policy implementations need to be added to the common
> clock implementation, like temperature limits.

I hope that my previous reply answered this.

>
> For Tegra:
> - DVFS continues to be accessed by calling clk_* functions
>
> For OMAP:
> - DVFS is triggered by hwmod through clk_* functions.  Any cross-arch
> driver can continue to call clk_* functions.
>
> OPP currently has opp_enable and opp_disable functions.  I don't
> understand why these are needed, they are only used at init time to
> determine available voltages, which could be handled by never passing
> unavailable voltages to the dvfs implementation.

We need them in runtime.

A device "a" may want to guarantee that a device "b" to be at least
"200MHz" or faster while it does some operations. Then, "a" will
opp_disable("b", 100MHz and others); and opp_enable("b", them) later
on. We have similar issues with multimedia blocks (MFC, Camera, FB,
GPU) and CPU/Memory Bus. Ondemand governor of CPUFREQ has some delay
on catching up a workload (1.5x the sampling rate in average, <2.0x
the sampling rate in worst cases), which may incur flickering/tearing
issues with multimedia streams. On the other hand, a general thermal
monitor or battery manager might want to limit energy usage by
disabling top performance clocks if it is too hot or the battery level
is low.

> _______________________________________________
> linux-pm mailing list
> linux-pm@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/linux-pm
>
Colin Cross April 28, 2011, 6:43 a.m. UTC | #27
On Wed, Apr 27, 2011 at 10:59 PM, MyungJoo Ham <myungjoo.ham@samsung.com> wrote:
> What one instance of DVFS (devfreq) controls are clocks and
> regulators. (a device may have multiple regulators as well as multiple
> clocks)
> What one instance of DVFS (devfreq) monitors (device load and/or
> temperature) is a device that uses the clocks and regulators.
>
> If we focus on the things that are controlled by DVFS, connecting DVFS
> with clock seems fine; however, DVFS's decision is based on the status
> of the device and the decision (monitoring result) configures a set of
> clocks and regulators. The clocks are not configured independently
> from others if the clocks are used by a DVFS-capable device. The
> frequency/voltage pair (OPP in this patch) associated with a device
> becomes a representative value of a specific configuration that
> configures the set of clocks and regulators.
>
> This is quite similar with CPUFREQ. CPUFREQ provides a single
> frequency value as a result of monitoring; however the machine's
> cpufreq driver may set multiple clocks and multiple voltage regulators
> based on the representative value (which is usually the core clock)
> although the cpufreq driver may need to control many more clocks with
> different frequencies.
>
> With multiple clocks of a device, if there is a clock that is required
> to be set independently from the "representative" clock with DVFS, it
> means that the DVFS monitoring result (load/temperature) is not a
> scalar value but a vector (multi-dimensional value). That implies that
> we need to monitor different and independent values, which in turn,
> implies that we need separated devices. Note that the DVFS monitor
> result from load and temperature combined is not a multi-dimensional
> value because the temperature limits "maximum possible frequency or
> voltage" and the load gives "preferred lower bound of frequency" that
> can be overridden by the limit set by temperature.
>
> Therefore, having one DVFS per clock where multiple clocks are
> attached to a device will create multiple monitors that monitor the
> same object(device behavior) with same metrics (load and temperature).
>
> Besides, the reason I've started with "target" callback, not clk and
> regulator names or pointers is that a device may have multiple clks
> and regulators and the OPP may only show the representative
> clock/regulators as CPUFREQ does. Especially when the order of
> transitions of those multiple clocks and regulators matter (if they
> are in a single device, it sometimes does), running a DVFS per clock,
> not per device, will be bothersome if not disasterous.

I understand the need for some sort of governor that can use device
state to determine the necessary clock frequencies.  Where I disagree
is the connection to voltages.  The governor should ONLY determine the
frequencies desired, and the voltage required to meet those
frequencies should be determined by the clock framework, based only on
the clock and the frequency.
Colin Cross April 28, 2011, 6:44 a.m. UTC | #28
On Wed, Apr 27, 2011 at 11:12 PM, MyungJoo Ham <myungjoo.ham@samsung.com> wrote:
> On Thu, Apr 28, 2011 at 5:48 AM, Colin Cross <ccross@google.com> wrote:
>> OPP currently has opp_enable and opp_disable functions.  I don't
>> understand why these are needed, they are only used at init time to
>> determine available voltages, which could be handled by never passing
>> unavailable voltages to the dvfs implementation.
>
> We need them in runtime.
>
> A device "a" may want to guarantee that a device "b" to be at least
> "200MHz" or faster while it does some operations. Then, "a" will
> opp_disable("b", 100MHz and others); and opp_enable("b", them) later
> on. We have similar issues with multimedia blocks (MFC, Camera, FB,
> GPU) and CPU/Memory Bus. Ondemand governor of CPUFREQ has some delay
> on catching up a workload (1.5x the sampling rate in average, <2.0x
> the sampling rate in worst cases), which may incur flickering/tearing
> issues with multimedia streams. On the other hand, a general thermal
> monitor or battery manager might want to limit energy usage by
> disabling top performance clocks if it is too hot or the battery level
> is low.

That sounds like a very strange api, when what you really mean is
clk_set_min_rate or clk_set_max_rate.
MyungJoo Ham April 28, 2011, 6:50 a.m. UTC | #29
On Thu, Apr 28, 2011 at 3:44 PM, Colin Cross <ccross@google.com> wrote:
> On Wed, Apr 27, 2011 at 11:12 PM, MyungJoo Ham <myungjoo.ham@samsung.com> wrote:
>> On Thu, Apr 28, 2011 at 5:48 AM, Colin Cross <ccross@google.com> wrote:
>>> OPP currently has opp_enable and opp_disable functions.  I don't
>>> understand why these are needed, they are only used at init time to
>>> determine available voltages, which could be handled by never passing
>>> unavailable voltages to the dvfs implementation.
>>
>> We need them in runtime.
>>
>> A device "a" may want to guarantee that a device "b" to be at least
>> "200MHz" or faster while it does some operations. Then, "a" will
>> opp_disable("b", 100MHz and others); and opp_enable("b", them) later
>> on. We have similar issues with multimedia blocks (MFC, Camera, FB,
>> GPU) and CPU/Memory Bus. Ondemand governor of CPUFREQ has some delay
>> on catching up a workload (1.5x the sampling rate in average, <2.0x
>> the sampling rate in worst cases), which may incur flickering/tearing
>> issues with multimedia streams. On the other hand, a general thermal
>> monitor or battery manager might want to limit energy usage by
>> disabling top performance clocks if it is too hot or the battery level
>> is low.
>
> That sounds like a very strange api, when what you really mean is
> clk_set_min_rate or clk_set_max_rate.

Essentially, that's what needed.
However, with clk_set_min/max_rate, don't we need to let another
device to be consumer of other devices' clocks? Not just introducing a
device to other devices?

> _______________________________________________
> linux-pm mailing list
> linux-pm@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/linux-pm
>
MyungJoo Ham April 28, 2011, 6:54 a.m. UTC | #30
On Wed, Apr 27, 2011 at 10:46 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> On Wednesday, April 27, 2011, MyungJoo Ham wrote:
>> Hello,
>>
>> 2011/4/26 Rafael J. Wysocki <rjw@sisk.pl>:
>> > Hi,
>> >
> ...
>> >> +
>> >> +/**
>> >> + * dvfs_tickle_device() - Guarantee maximum operation speed for a while
>> >> + *                   instaneously.
>> >> + * @dev:     the device to be tickled.
>> >> + * @duration_ms:     the duration of tickle effect.
>> >> + *
>> >> + * Tickle sets the device at the maximum frequency instaneously and
>> >> + * the maximul frequency is guaranteed to be used for the given duration.
>> >          ^^^^^^^
>> > I guess that should be "maximum".
>>
>> Yes. :)
>>
>> >
>> >> + * For faster user reponse time, an input event may tickle a related device
>> >> + * so that the input event does not need to wait for the DVFS to react with
>> >> + * normal interval.
>> >> + */
>> >
>> > Can you explain how this function is supposed to be used, please?
>>
>> Ok, I'll show you a usage example
>>
>> The system: a touchscreen based computer with DVFS'able GPU.
>> Use case: a user scrolls the screen with fingers suddenly. (think of
>> an iPhone or Android device's home screen filled wit icons)
>> While there is no touchscreen input (thus, the screen does not move
>> fast), we do not use GPU too much; thus it is scaled to run at slower
>> rate (let's say 100MHz).
>> If a user suddenly touches screen and starts moving, the screen should
>> follow it; thus, requires much output from GPU.
>> With normal DVFS-GPU, it would take some time to speed it up because
>> it should monitor the GPU usage and react afterwards.
>>
>> With the API given, "tickle" may be included in the user input event.
>> So that GPU can be scaled to maximum frequency immediately.
>
> So in this case the input subsystem would be calling dvfs_tickle_device()
> (or whatever it is finally called) for the GPU, right?

Yes, that's correct. The input subsystem should somehow "tickle" a
DVFS device (GPU in this case). With CPUFREQ-Tickle (never upstreamed
anyway), we have been doing this by adding tickle as an "input event"
at board file (i.e., arch/arm/mach-*/mach-*.c).

>
> I'm afraid you'd need a user space interface for that and let the
> application (or library) handling the touchpad input control the "tickle".
> Otherwise it would be difficult to arrange things correctly (the association
> between the input subsystem and the GPU is not obvious at the kernel level).
>

Creating a user space interface to connect an input event (or any
other user space event) with a DVFS (like
/sys/devices/system/cpu/cpux/cpufreq/...) will provide the flexibility
for user space to tickle although it might not be as efficient as the
approach in the previous paragraph in the metric of how much response
time is reduced. We may do that by allowing something like this: "echo
0 > /sys/class/devfreq/devname.0/tickle" or even "echo0 >
/sys/class/devfreq/tickle_all"

> Ideally, though, the working OPP for the GPU should depend on the number of
> processing requests per a unit of time, pretty much like for a CPU.

As long as we measure workload (number of requests or operations per
unit time) periodically, we have the issue mentioned that can be
mitigated by tickling. Ideally, if we can count each operation at
device driver, we do not need tickling because we can monitor the
workload with per-operation basis not per-sampling-time basis. If it
is a device with .. like .. 1000 operations/sec, it might be feasible.
However, GPUs may have like over 100,000,000 operations/sec and we'd
better just monitor them periodically. Besides, devices including many
GPUs allow DMA and user-addressable memory without device driver
intervention for normal operations once the device is configured.
Thus, device driver wouldn't know anything about per-operation basis;
we need to stick with periodic monitoring anyway (and with delayed
DVFS issue) unless a GPU can invoke an interrupt of "I'm overloaded,
let me work faster"

>
>> If the monitoring frequency is 50ms, DVFS will likely to react with
>> about 75ms of delay.
>> For example, if the touch event has occurred at 25ms after the last
>> DVFS monitoring event, at the next DVFS monitoring event, the measured
>> load will be about 50%, which usually won't require DVFS to speed up.
>> Then, DVFS will react at the next monitoring event; thus, it took 50ms
>> + 25ms.
>>
>> 75ms is enough to give some impression to users. These days, many
>> mobile devices require 60Hz-based UI, which is about 16ms.
>>
>> Anyway, this happens with drivers/cpufreq also. We have been testing
>> "tickling" associated with drivers/cpufreq/cpufreq.c. This has been
>> reduced user response time significantly removing the need for tuning
>> the threshold values.
>
> I think I understand the problem, but I'm not sure if there's a clean way
> to trigger the "tickle" from the kernel level.

I'm considering the followings (and they are not mutually exclusive).
How about them?
1. (the way that we've been using) Add a tickle call to input events
at board file.
2. provide sysfs interface that triggers tickling along with devfreq sysfs.

>
> Thanks,
> Rafael
> _______________________________________________
> linux-pm mailing list
> linux-pm@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/linux-pm
>


Cheers!
- MyungJoo
Colin Cross April 28, 2011, 7:06 a.m. UTC | #31
On Wed, Apr 27, 2011 at 11:50 PM, MyungJoo Ham <myungjoo.ham@samsung.com> wrote:
> On Thu, Apr 28, 2011 at 3:44 PM, Colin Cross <ccross@google.com> wrote:
>> On Wed, Apr 27, 2011 at 11:12 PM, MyungJoo Ham <myungjoo.ham@samsung.com> wrote:
>>> On Thu, Apr 28, 2011 at 5:48 AM, Colin Cross <ccross@google.com> wrote:
>>>> OPP currently has opp_enable and opp_disable functions.  I don't
>>>> understand why these are needed, they are only used at init time to
>>>> determine available voltages, which could be handled by never passing
>>>> unavailable voltages to the dvfs implementation.
>>>
>>> We need them in runtime.
>>>
>>> A device "a" may want to guarantee that a device "b" to be at least
>>> "200MHz" or faster while it does some operations. Then, "a" will
>>> opp_disable("b", 100MHz and others); and opp_enable("b", them) later
>>> on. We have similar issues with multimedia blocks (MFC, Camera, FB,
>>> GPU) and CPU/Memory Bus. Ondemand governor of CPUFREQ has some delay
>>> on catching up a workload (1.5x the sampling rate in average, <2.0x
>>> the sampling rate in worst cases), which may incur flickering/tearing
>>> issues with multimedia streams. On the other hand, a general thermal
>>> monitor or battery manager might want to limit energy usage by
>>> disabling top performance clocks if it is too hot or the battery level
>>> is low.
>>
>> That sounds like a very strange api, when what you really mean is
>> clk_set_min_rate or clk_set_max_rate.
>
> Essentially, that's what needed.
> However, with clk_set_min/max_rate, don't we need to let another
> device to be consumer of other devices' clocks? Not just introducing a
> device to other devices?

Yes, but that's effectively what you're doing through a backwards api
anyways.  The question is, for these complicated clock scenarios where
the final frequency of a clock depends on so many factors, should that
control go through the clock framework, or through some sort of global
clock governor (which is where OPP would reappear).
MyungJoo Ham April 28, 2011, 7:16 a.m. UTC | #32
On Thu, Apr 28, 2011 at 3:43 PM, Colin Cross <ccross@google.com> wrote:
> I understand the need for some sort of governor that can use device
> state to determine the necessary clock frequencies.  Where I disagree
> is the connection to voltages.  The governor should ONLY determine the
> frequencies desired, and the voltage required to meet those
> frequencies should be determined by the clock framework, based only on
> the clock and the frequency.

Yes, as long as AVS(Adaptive Voltage Scaling) is not involved, devfreq
does not need to care about voltages and let device driver (such as
the target callback or its callee) take care of voltages. Besides, my
impression on AVS is that AVS wouldn't be depending on software DVFS
scheme, at least with some AVS test on S5PC110. So, I'd say that it's
safe to let devfreq framework handle frequency only and let target
callback handle anything else except for choosing representative clock
frequency.

However, if we are going to detach devfreq from OPP, we only need to
provide frequency list at init and { an interface to control max/min
freq or an interface to lookup max/min freq of corresponding
representative clock. }

> _______________________________________________
> linux-pm mailing list
> linux-pm@lists.linux-foundation.org
> https://lists.linux-foundation.org/mailman/listinfo/linux-pm
>

ps. In our AVS test, the device drivers had nothing to do with voltage
scaling except for initializing devices. The H/W did everything about
voltage scaling dynamically.

Thanks,

MyungJoo.
MyungJoo Ham April 28, 2011, 7:34 a.m. UTC | #33
On Thu, Apr 28, 2011 at 4:06 PM, Colin Cross <ccross@google.com> wrote:
> On Wed, Apr 27, 2011 at 11:50 PM, MyungJoo Ham <myungjoo.ham@samsung.com> wrote:
>> On Thu, Apr 28, 2011 at 3:44 PM, Colin Cross <ccross@google.com> wrote:
>>> On Wed, Apr 27, 2011 at 11:12 PM, MyungJoo Ham <myungjoo.ham@samsung.com> wrote:
>>>> On Thu, Apr 28, 2011 at 5:48 AM, Colin Cross <ccross@google.com> wrote:
>>>>> OPP currently has opp_enable and opp_disable functions.  I don't
>>>>> understand why these are needed, they are only used at init time to
>>>>> determine available voltages, which could be handled by never passing
>>>>> unavailable voltages to the dvfs implementation.
>>>>
>>>> We need them in runtime.
>>>>
>>>> A device "a" may want to guarantee that a device "b" to be at least
>>>> "200MHz" or faster while it does some operations. Then, "a" will
>>>> opp_disable("b", 100MHz and others); and opp_enable("b", them) later
>>>> on. We have similar issues with multimedia blocks (MFC, Camera, FB,
>>>> GPU) and CPU/Memory Bus. Ondemand governor of CPUFREQ has some delay
>>>> on catching up a workload (1.5x the sampling rate in average, <2.0x
>>>> the sampling rate in worst cases), which may incur flickering/tearing
>>>> issues with multimedia streams. On the other hand, a general thermal
>>>> monitor or battery manager might want to limit energy usage by
>>>> disabling top performance clocks if it is too hot or the battery level
>>>> is low.
>>>
>>> That sounds like a very strange api, when what you really mean is
>>> clk_set_min_rate or clk_set_max_rate.
>>
>> Essentially, that's what needed.
>> However, with clk_set_min/max_rate, don't we need to let another
>> device to be consumer of other devices' clocks? Not just introducing a
>> device to other devices?
>
> Yes, but that's effectively what you're doing through a backwards api
> anyways.  The question is, for these complicated clock scenarios where
> the final frequency of a clock depends on so many factors, should that
> control go through the clock framework, or through some sort of global
> clock governor (which is where OPP would reappear).
>

In the use cases of runtime clock setting by devfreq or other devices
mentioned above, we are controlling the device's performance with the
representative clock of the device, not a specific clock among the
clocks that the device has. For a device "A" with clock "a1" and "a2",
another device "B" would not control both "a1" and "a2" directly to
get the guaranteed performance from "A". Besides, "B" should not do so
if there are specific orders, delays, and other controls for "A" to
properly change performance.

Therefore, my answer is that it would be preferred to control through
some wrapper/interface/or anything that is connected to the device of
the controlled clocks (and let the device's callback or something
control its clocks), not to control through clock framework directly.
In this version of devfreq+OPP, these are handled by the "target"
callback.


Cheers!
- MyungJoo
Mark Brown April 28, 2011, 12:19 p.m. UTC | #34
On Wed, Apr 27, 2011 at 01:48:52PM -0700, Colin Cross wrote:

> OPP currently has opp_enable and opp_disable functions.  I don't
> understand why these are needed, they are only used at init time to
> determine available voltages, which could be handled by never passing
> unavailable voltages to the dvfs implementation.

I queried this when OPP was originally added.  The motivation which was
given (which seemed fairly reasonable) was to reduce the number of data
tables for similar parts and board designs.  That did seem like
something which it was reasonable to factor out in some way, though
possibly with a different mechanism.
Rafael Wysocki April 28, 2011, 7:19 p.m. UTC | #35
On Thursday, April 28, 2011, MyungJoo Ham wrote:
> On Wed, Apr 27, 2011 at 10:46 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> > On Wednesday, April 27, 2011, MyungJoo Ham wrote:
> >> Hello,
...
> >> Anyway, this happens with drivers/cpufreq also. We have been testing
> >> "tickling" associated with drivers/cpufreq/cpufreq.c. This has been
> >> reduced user response time significantly removing the need for tuning
> >> the threshold values.
> >
> > I think I understand the problem, but I'm not sure if there's a clean way
> > to trigger the "tickle" from the kernel level.
> 
> I'm considering the followings (and they are not mutually exclusive).
> How about them?
> 1. (the way that we've been using) Add a tickle call to input events
> at board file.

Well, first, the input events may not be the only user of that feature and
the GPU may not be the only target.  So, there should be a way to specify
multiple targets for each user and vice versa while keeping the drivers
portable between platforms.  That seems to be challenging. :-)

Second, even if the input events were the only user and the GPU were the only
possible target, it probably won't be necessary to "tickle" the GPU every
time the touchscreen is used, only when the user is "scrolling", which input
events really can't tell.  The only entity having an idea about what the
_user_ is doing is the application being used at the moment.

> 2. provide sysfs interface that triggers tickling along with devfreq sysfs.

That may be a better approach IMO, but making applications use it also will
be challenging.

Thanks,
Rafael
diff mbox

Patch

diff --git a/drivers/base/power/Makefile b/drivers/base/power/Makefile
index 118c1b9..4d92d7f 100644
--- a/drivers/base/power/Makefile
+++ b/drivers/base/power/Makefile
@@ -3,6 +3,7 @@  obj-$(CONFIG_PM_SLEEP)	+= main.o wakeup.o
 obj-$(CONFIG_PM_RUNTIME)	+= runtime.o
 obj-$(CONFIG_PM_TRACE_RTC)	+= trace.o
 obj-$(CONFIG_PM_OPP)	+= opp.o
+obj-$(CONFIG_PM_GENERIC_DVFS)	+= dvfs.o
 
 ccflags-$(CONFIG_DEBUG_DRIVER) := -DDEBUG
 ccflags-$(CONFIG_PM_VERBOSE)   += -DDEBUG
diff --git a/drivers/base/power/dvfs.c b/drivers/base/power/dvfs.c
new file mode 100644
index 0000000..867ad70
--- /dev/null
+++ b/drivers/base/power/dvfs.c
@@ -0,0 +1,401 @@ 
+/*
+ * Generic DVFS Interface for Non-CPU Devices
+ *	Based on OPP.
+ *
+ * Copyright (C) 2011 Samsung Electronics
+ *	MyungJoo Ham <myungjoo.ham@samsung.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/err.h>
+#include <linux/init.h>
+#include <linux/slab.h>
+#include <linux/opp.h>
+#include <linux/dvfs.h>
+#include <linux/workqueue.h>
+#include <linux/platform_device.h>
+#include <linux/list.h>
+
+#define DVFS_INTERVAL	20
+/**
+ * struct dvfs - Device DVFS structure
+ * @node	list node - contains the devices with DVFS that have been
+ *		registered.
+ * @dev		device pointer
+ * @profile	device-specific dvfs profile
+ * @governor	method how to choose frequency based on the usage.
+ * @previous_freq	previously configured frequency value.
+ * @next_polling	the number of remaining "dvfs_monitor" executions to
+ *			reevaluate frequency/voltage of the device. Set by
+ *			profile's polling_ms interval.
+ * @tickle	positive if DVFS-tickling is activated for the device.
+ *		at each executino of dvfs_monitor, tickle is decremented.
+ *		User may tickle a device-dvfs in order to set maximum
+ *		frequency instaneously with some guaranteed duration.
+ *
+ * This structure stores the DVFS information for a give device.
+ */
+struct dvfs {
+	struct list_head node;
+
+	struct device *dev;
+	struct dvfs_device_profile *profile;
+	struct dvfs_governor *governor;
+
+	unsigned long previous_freq;
+	unsigned int next_polling;
+	unsigned int tickle;
+};
+
+/*
+ * dvfs_work periodically (given by DVFS_INTERVAL) monitors every
+ * registered device.
+ */
+static struct delayed_work dvfs_work;
+/* The list of all device-dvfs */
+static LIST_HEAD(dvfs_list);
+/* Exclusive access to dvfs_list and its elements */
+static DEFINE_MUTEX(dvfs_list_lock);
+
+/**
+ * find_device_dvfs() - find dvfs struct using device pointer
+ * @dev:	device pointer used to lookup device DVFS.
+ *
+ * Search the list of device DVFSs and return the matched device's DVFS info.
+ */
+static struct dvfs *find_device_dvfs(struct device *dev)
+{
+	struct dvfs *tmp_dvfs, *dvfs = ERR_PTR(-ENODEV);
+
+	if (unlikely(IS_ERR_OR_NULL(dev))) {
+		pr_err("%s: Invalid parameters\n", __func__);
+		return ERR_PTR(-EINVAL);
+	}
+
+	list_for_each_entry(tmp_dvfs, &dvfs_list, node) {
+		if (tmp_dvfs->dev == dev) {
+			dvfs = tmp_dvfs;
+			break;
+		}
+	}
+
+	return dvfs;
+}
+
+/**
+ * dvfs_next_polling() - the number of dvfs-monitor iterations to satisfy
+ *		the given polling interval. The interval is rounded by
+ *		dvfs-monitor poling interval (DVFS_INTERVAL) with ceiling
+ *		function.
+ * @ms:		the length of polling interval in milli-second
+ */
+static unsigned int dvfs_next_polling(int ms)
+{
+	unsigned int ret;
+
+	if (ms <= 0)
+		return 0;
+
+	ret = ms / DVFS_INTERVAL;
+	if (ms % DVFS_INTERVAL)
+		ret++;
+
+	return ret;
+}
+
+/**
+ * dvfs_do() - Check the usage profile of a given device and configure
+ *		frequency and voltage accordingly
+ * @dvfs:	DVFS info of the given device
+ */
+static int dvfs_do(struct dvfs *dvfs)
+{
+	struct dvfs_usage_profile profile;
+	struct opp *opp;
+	unsigned long freq;
+	int err;
+
+	err = dvfs->profile->get_usage_profile(dvfs->dev, &profile);
+	if (err) {
+		dev_err(dvfs->dev, "%s: Cannot get usage profile (%d)\n",
+			__func__, err);
+		return err;
+	}
+
+	err = dvfs->governor->get_target_freq(&profile, dvfs->governor, &freq);
+	if (err) {
+		dev_err(dvfs->dev, "%s: Cannot get target frequency (%d)\n",
+			__func__, err);
+		return err;
+	}
+
+	opp = opp_find_freq_ceil(dvfs->dev, &freq);
+	if (opp == ERR_PTR(-ENODEV))
+		opp = opp_find_freq_floor(dvfs->dev, &freq);
+
+	if (IS_ERR(opp)) {
+		dev_err(dvfs->dev, "%s: Cannot find opp with %luHz.\n",
+				__func__, freq);
+		return PTR_ERR(opp);
+	}
+
+	freq = opp_get_freq(opp);
+	if (dvfs->previous_freq != freq) {
+		err = dvfs->profile->target(dvfs->dev, freq,
+					    opp_get_voltage(opp));
+		dvfs->previous_freq = freq;
+	}
+
+	if (err)
+		dev_err(dvfs->dev, "%s: Cannot set %luHz/%luuV\n", __func__,
+			opp_get_freq(opp), opp_get_voltage(opp));
+	return err;
+}
+
+/**
+ * dvfs_monitor() - Regularly run dvfs_do() and support device DVFS tickle.
+ * @work: the work struct used to run dvfs_monitor periodically.
+ */
+static void dvfs_monitor(struct work_struct *work)
+{
+	struct dvfs *dvfs;
+
+	mutex_lock(&dvfs_list_lock);
+
+	list_for_each_entry(dvfs, &dvfs_list, node) {
+		if (dvfs->next_polling == 0)
+			continue;
+		if (dvfs->tickle) {
+			dvfs->tickle--;
+			continue;
+		}
+		if (dvfs->next_polling == 1) {
+			dvfs_do(dvfs);
+			dvfs->next_polling = dvfs_next_polling(
+						dvfs->profile->polling_ms);
+		} else {
+			dvfs->next_polling--;
+		}
+	}
+
+	schedule_delayed_work(&dvfs_work, msecs_to_jiffies(DVFS_INTERVAL));
+	mutex_unlock(&dvfs_list_lock);
+}
+
+/**
+ * dvfs_add_device() - Add dvfs feature to the device
+ * @dev:	the device to add dvfs feature.
+ * @profile:	device-specific profile to run dvfs.
+ * @governor:	the policy to choose frequency.
+ */
+int dvfs_add_device(struct device *dev, struct dvfs_device_profile *profile,
+		    struct dvfs_governor *governor)
+{
+	struct dvfs *new_dvfs, *dvfs;
+	struct cpufreq_frequency_table *table;
+	int err;
+
+	if (!dev || !profile || !governor) {
+		dev_err(dev, "%s: Invalid parameters.\n", __func__);
+		return -EINVAL;
+	}
+
+	err = opp_init_cpufreq_table(dev, &table);
+	if (err) {
+		dev_err(dev, "%s: Device OPP cannot provide DVFS table (%d)\n",
+			__func__, err);
+		return err;
+	}
+
+	new_dvfs = kzalloc(sizeof(struct dvfs), GFP_KERNEL);
+	if (!new_dvfs) {
+		dev_err(dev, "%s: Unable to create DVFS for the device\n",
+			__func__);
+		return -ENOMEM;
+	}
+
+	mutex_lock(&dvfs_list_lock);
+
+	dvfs = find_device_dvfs(dev);
+	if (!IS_ERR(dvfs)) {
+		dev_err(dev, "%s: Unable to create DVFS for the device. "
+			"It already has one.\n", __func__);
+		mutex_unlock(&dvfs_list_lock);
+		kfree(new_dvfs);
+		return -EINVAL;
+	}
+
+	new_dvfs->dev = dev;
+	new_dvfs->profile = profile;
+	new_dvfs->governor = governor;
+	new_dvfs->next_polling = dvfs_next_polling(profile->polling_ms);
+	new_dvfs->previous_freq = 0;
+
+	list_add(&new_dvfs->node, &dvfs_list);
+
+	mutex_unlock(&dvfs_list_lock);
+
+	return 0;
+}
+
+/**
+ * dvfs_remove_device() - Remove DVFS feature from a device.
+ * @device:	the device to remove dvfs feature.
+ */
+int dvfs_remove_device(struct device *dev)
+{
+	struct dvfs *dvfs;
+
+	if (!dev)
+		return -EINVAL;
+
+	mutex_lock(&dvfs_list_lock);
+	dvfs = find_device_dvfs(dev);
+	if (IS_ERR(dvfs)) {
+		dev_err(dev, "%s: Unable to find DVFS entry for the device.\n",
+			__func__);
+		mutex_unlock(&dvfs_list_lock);
+		return -EINVAL;
+	}
+
+	list_del(&dvfs->node);
+
+	kfree(dvfs);
+
+	mutex_unlock(&dvfs_list_lock);
+
+	return 0;
+}
+
+/**
+ * dvfs_update() - Notify that the device OPP has been changed.
+ * @dev:	the device whose OPP has been changed.
+ * @may_not_exist:	do not print error message even if the device
+ *			does not have dvfs entry.
+ */
+int dvfs_update(struct device *dev, bool may_not_exist)
+{
+	struct dvfs *dvfs;
+	int err = 0;
+
+	mutex_lock(&dvfs_list_lock);
+	dvfs = find_device_dvfs(dev);
+	if (!IS_ERR(dvfs)) {
+		if (dvfs->tickle) {
+			unsigned long freq = dvfs->profile->max_freq;
+			struct opp *opp = opp_find_freq_floor(dvfs->dev, &freq);
+
+			freq = opp_get_freq(opp);
+			if (dvfs->previous_freq != freq) {
+				err = dvfs->profile->target(dvfs->dev, freq,
+						opp_get_voltage(opp));
+				dvfs->previous_freq = freq;
+			}
+		} else {
+			err = dvfs_do(dvfs);
+		}
+	}
+	mutex_unlock(&dvfs_list_lock);
+
+	if (IS_ERR(dvfs)) {
+		if (may_not_exist && PTR_ERR(dvfs) == -EINVAL)
+			return 0;
+
+		dev_err(dev, "%s: Device DVFS not found (%ld)\n", __func__,
+			PTR_ERR(dvfs));
+		return PTR_ERR(dvfs);
+	}
+
+	return err;
+}
+
+/**
+ * dvfs_tickle_device() - Guarantee maximum operation speed for a while
+ *			instaneously.
+ * @dev:	the device to be tickled.
+ * @duration_ms:	the duration of tickle effect.
+ *
+ * Tickle sets the device at the maximum frequency instaneously and
+ * the maximul frequency is guaranteed to be used for the given duration.
+ * For faster user reponse time, an input event may tickle a related device
+ * so that the input event does not need to wait for the DVFS to react with
+ * normal interval.
+ */
+int dvfs_tickle_device(struct device *dev, unsigned long duration_ms)
+{
+	struct dvfs *dvfs;
+	struct opp *opp;
+	unsigned long freq;
+	int err = 0;
+
+	mutex_lock(&dvfs_list_lock);
+	dvfs = find_device_dvfs(dev);
+	if (!IS_ERR(dvfs)) {
+		freq = dvfs->profile->max_freq;
+		opp = opp_find_freq_floor(dvfs->dev, &freq);
+		freq = opp_get_freq(opp);
+		if (dvfs->previous_freq != freq) {
+			err = dvfs->profile->target(dvfs->dev, freq,
+						    opp_get_voltage(opp));
+			dvfs->previous_freq = freq;
+		}
+		if (err)
+			dev_err(dev, "%s: Cannot set frequency.\n", __func__);
+		else
+			dvfs->tickle = dvfs_next_polling(duration_ms);
+	}
+	mutex_unlock(&dvfs_list_lock);
+
+	if (IS_ERR(dvfs)) {
+		dev_err(dev, "%s: Cannot find dvfs.\n", __func__);
+		err = PTR_ERR(dvfs);
+	}
+
+	return err;
+}
+
+#ifdef CONFIG_PM_SLEEP
+static int dvfs_pause(struct device *dev)
+{
+	cancel_delayed_work_sync(&dvfs_work);
+	return 0;
+}
+
+static void dvfs_continue(struct device *dev)
+{
+	schedule_delayed_work(&dvfs_work, msecs_to_jiffies(DVFS_INTERVAL));
+}
+#else
+#define dvfs_pause	NULL
+#define dvfs_continue	NULL
+#endif
+static const struct dev_pm_ops dvfs_pm = {
+	.prepare	= dvfs_pause,
+	.complete	= dvfs_continue,
+};
+static struct platform_driver dvfs_pm_drv = {
+	.driver = {
+		.name	= "generic_dvfs_pm",
+		.pm	= &dvfs_pm,
+	},
+};
+static struct platform_device dvfs_pm_dev = {
+	.name = "generic_dvfs_pm",
+};
+
+static int __init dvfs_init(void)
+{
+	platform_driver_register(&dvfs_pm_drv);
+	platform_device_register(&dvfs_pm_dev);
+
+	INIT_DELAYED_WORK_DEFERRABLE(&dvfs_work, dvfs_monitor);
+	schedule_delayed_work(&dvfs_work, msecs_to_jiffies(DVFS_INTERVAL));
+
+	return 0;
+}
+late_initcall(dvfs_init);
diff --git a/drivers/base/power/opp.c b/drivers/base/power/opp.c
index 56a6899..df74655 100644
--- a/drivers/base/power/opp.c
+++ b/drivers/base/power/opp.c
@@ -21,6 +21,7 @@ 
 #include <linux/rculist.h>
 #include <linux/rcupdate.h>
 #include <linux/opp.h>
+#include <linux/dvfs.h>
 
 /*
  * Internal data structure organization with the OPP layer library is as
@@ -428,6 +429,9 @@  int opp_add(struct device *dev, unsigned long freq, unsigned long u_volt)
 	list_add_rcu(&new_opp->node, head);
 	mutex_unlock(&dev_opp_list_lock);
 
+	/* Notify generic dvfs for the change */
+	dvfs_update(dev, true);
+
 	return 0;
 }
 
@@ -512,6 +516,9 @@  unlock:
 	mutex_unlock(&dev_opp_list_lock);
 out:
 	kfree(new_opp);
+
+	/* Notify generic dvfs for the change */
+	dvfs_update(dev, true);
 	return r;
 }
 
diff --git a/include/linux/dvfs.h b/include/linux/dvfs.h
new file mode 100644
index 0000000..de3e53d
--- /dev/null
+++ b/include/linux/dvfs.h
@@ -0,0 +1,69 @@ 
+/*
+ * Generic DVFS Interface for Non-CPU Devices
+ *
+ * Copyright (C) 2011 Samsung Electronics
+ *	MyungJoo Ham <myungjoo.ham@samsung.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef __LINUX_DVFS_H__
+#define __LINUX_DVFS_H__
+
+struct dvfs;
+
+struct dvfs_usage_profile {
+	/* both since the last measure */
+	unsigned long total_time;
+	unsigned long busy_time;
+};
+
+struct dvfs_device_profile {
+	unsigned long max_freq; /* may be larger than the actual value */
+	int (*target)(struct device *dev, unsigned long freq,
+		      unsigned long u_volt);
+	int (*get_usage_profile)(struct device *dev,
+				 struct dvfs_usage_profile *profile);
+	int polling_ms;	/* 0 for at opp change only */
+};
+
+struct dvfs_governor {
+	void *data; /* private data for get_target_freq */
+	int (*get_target_freq)(struct dvfs_usage_profile *profile,
+			       struct dvfs_governor *this, unsigned long *freq);
+};
+
+#if defined(CONFIG_PM_GENERIC_DVFS)
+extern int dvfs_add_device(struct device *ev,
+			   struct dvfs_device_profile *profile,
+			   struct dvfs_governor *governor);
+extern int dvfs_remove_device(struct device *dev);
+extern int dvfs_update(struct device *dev, bool may_not_exist);
+extern int dvfs_tickle_device(struct device *dev, unsigned long duration_ms);
+#else /* !CONFIG_PM_GENERIC_DVFS */
+static int dvfs_add_device(struct device *ev,
+			   struct dvfs_device_profile *profile,
+			   struct dvfs_governor *governor)
+{
+	return 0;
+}
+
+static int dvfs_remove_device(struct device *dev)
+{
+	return 0;
+}
+
+static int dvfs_update(struct device *dev, bool may_not_exist)
+{
+	return 0;
+}
+
+static int dvfs_tickle_device(struct device *dev, unsigned long duration_ms)
+{
+	return 0;
+}
+#endif /* CONFIG_PM_GENERIC_DVFS */
+
+#endif /* __LINUX_DVFS_H__ */
diff --git a/kernel/power/Kconfig b/kernel/power/Kconfig
index 4603f08..7d191ec 100644
--- a/kernel/power/Kconfig
+++ b/kernel/power/Kconfig
@@ -225,3 +225,12 @@  config PM_OPP
 	  representing individual voltage domains and provides SOC
 	  implementations a ready to use framework to manage OPPs.
 	  For more information, read <file:Documentation/power/opp.txt>
+
+config PM_GENERIC_DVFS
+	bool "Generic DVFS framework"
+	depends on PM_OPP
+	help
+	  With OPP support, a device may have a list of frequencies and
+	  voltages available. Generic DVFS framework provides a method to
+	  control frequency/voltage of a device with OPP with given profile
+	  and governor per device.