[RFC] PCI: Allow sysfs control over totalvfs

Message ID 1474386588-16337-1-git-send-email-Yuval.Mintz@qlogic.com (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas

Commit Message

Yuval Mintz Sept. 20, 2016, 3:49 p.m. UTC
[Sorry in advance if this was already discussed in the past]

Some of the HW capable of SRIOV has resource limitations, where the
PF and VF resources are drawn from a common pool.
In some cases, these limitations have to be considered early during
chip initialization and can only be changed by tearing down the
configuration and re-initializing.
As a result, drivers for such HW sometimes have to make unfavorable
compromises where they reserve sufficient resources to accommodate
the maximal number of VFs that can be created - at the expense of
resources that could have been used by the PF.

If users were able to provide 'hints' regarding the required number
of VFs *prior* to driver attachment, then such compromises could be
avoided. As we already have sysfs nodes that can be queried for the
number of totalvfs, it makes sense to let the user reduce the number
of said totalvfs using the same infrastructure.
Then, we can have drivers supporting SRIOV take that value into account
when deciding how many resources to reserve, allowing the PF to benefit
from the difference between the configuration space value and the actual
number needed by the user.

Signed-off-by: Yuval Mintz <Yuval.Mintz@caviumnetworks.com>
---
 drivers/pci/pci-sysfs.c | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)
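
As a rough illustration of the trade-off described in the commit message
(not part of the patch), the sketch below models a common pool split
between one PF and its VFs. The struct, the function name and all numbers
are hypothetical; vf_limit merely stands in for whatever a driver would
read back from pci_sriov_get_totalvfs().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: a pool of identical resources (say, queues)
 * shared by one PF and its VFs. None of these names come from the patch. */
struct pool_split {
	uint32_t vf_reserved;	/* set aside for all VFs together */
	uint32_t pf_available;	/* left over for the PF */
};

/* Reserve per_vf resources for each of vf_limit VFs; the PF keeps the rest. */
static struct pool_split split_pool(uint32_t pool_size, uint32_t per_vf,
				    uint16_t vf_limit)
{
	struct pool_split s;
	uint64_t want = (uint64_t)per_vf * vf_limit;

	/* Never reserve more than the pool holds. */
	s.vf_reserved = want > pool_size ? pool_size : (uint32_t)want;
	s.pf_available = pool_size - s.vf_reserved;
	assert(s.vf_reserved + s.pf_available == pool_size);
	return s;
}
```

With a made-up pool of 256 queues and 2 queues per VF, reserving for 64 VFs
leaves the PF 128 queues, while a user-supplied limit of 8 VFs leaves it 240.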

Comments

Jiri Pirko Sept. 20, 2016, 7:36 p.m. UTC | #1
Tue, Sep 20, 2016 at 05:49:48PM CEST, Yuval.Mintz@qlogic.com wrote:
>[Sorry in advance if this was already discussed in the past]
>
>Some of the HW capable of SRIOV has resource limitations, where the
>PF and VFs resources are drawn from a common pool.
>In some cases, these limitations have to be considered early during
>chip initialization and can only be changed by tearing down the
>configuration and re-initializing.
>As a result, drivers for such HWs sometimes have to make unfavorable
>compromises where they reserve sufficient resources to accommodate
>the maximal number of VFs that can be created - at the expense of
>resources that could have been used by the PF.
>
>If users were able to provide 'hints' regarding the required number
>of VFs *prior* to driver attachment, then such compromises could be
>avoided. As we already have sysfs nodes that can be queried for the
>number of totalvfs, it makes sense to let the user reduce the number
>of said totalvfs using the same infrastructure.
>Then, we can have drivers supporting SRIOV take that value into account
>when deciding how much resources to reserve, allowing the PF to benefit
>from the difference between the configuration space value and the actual
>number needed by user.

One of the motivations for introducing the devlink interface was to allow
the user to pass well-defined option parameters, or as you call them,
hints, to the driver module. That would make it possible to replace module
options and to pre-configure hardware at probe time.
We plan to use devlink to allow the user to change resource allocation for
mlxsw devices.

The plan is to allow pre-creating a devlink instance before the driver
module is loaded. The user will then use this placeholder to set the
options. Once the driver module is loaded, it will fetch the options
from the devlink core and process them accordingly.

I believe this is exactly what you need.
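
The placeholder idea above can be sketched in plain userspace C. This is
only a toy model under stated assumptions: it is not the devlink API, and
preconfig_set(), preconfig_fetch(), the table layout and the "max_vfs" key
used in the note below are all hypothetical names invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_OPTS 16

/* One pre-configured option, keyed by device name and option name. */
struct preconfig_opt {
	char dev[32];
	char key[32];
	unsigned int value;
};

static struct preconfig_opt opts[MAX_OPTS];
static size_t nr_opts;

/* Called "by the user" before the driver is loaded. Returns 0 or -1. */
static int preconfig_set(const char *dev, const char *key, unsigned int value)
{
	size_t i;

	assert(dev && key);
	if (strlen(dev) >= sizeof(opts[0].dev) ||
	    strlen(key) >= sizeof(opts[0].key))
		return -1;
	/* Overwrite an existing entry for the same device and key. */
	for (i = 0; i < nr_opts; i++) {
		if (!strcmp(opts[i].dev, dev) && !strcmp(opts[i].key, key)) {
			opts[i].value = value;
			return 0;
		}
	}
	if (nr_opts == MAX_OPTS)
		return -1;
	strcpy(opts[nr_opts].dev, dev);
	strcpy(opts[nr_opts].key, key);
	opts[nr_opts].value = value;
	nr_opts++;
	return 0;
}

/* Called "by the driver" at probe time; falls back to dflt if unset. */
static unsigned int preconfig_fetch(const char *dev, const char *key,
				    unsigned int dflt)
{
	size_t i;

	assert(dev && key);
	for (i = 0; i < nr_opts; i++)
		if (!strcmp(opts[i].dev, dev) && !strcmp(opts[i].key, key))
			return opts[i].value;
	return dflt;
}
```

In this model a driver probing "pci/0000:03:00.0" would call
preconfig_fetch() with its hardware default and get back either that
default or whatever the user stored earlier with preconfig_set().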


>
>Signed-off-by: Yuval Mintz <Yuval.Mintz@caviumnetworks.com>
>---
> drivers/pci/pci-sysfs.c | 28 +++++++++++++++++++++++++++-
> 1 file changed, 27 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
>index bcd10c7..c1546f8 100644
>--- a/drivers/pci/pci-sysfs.c
>+++ b/drivers/pci/pci-sysfs.c
>@@ -449,6 +449,30 @@ static ssize_t sriov_totalvfs_show(struct device *dev,
> 	return sprintf(buf, "%u\n", pci_sriov_get_totalvfs(pdev));
> }
> 
>+static ssize_t sriov_totalvfs_store(struct device *dev,
>+				    struct device_attribute *attr,
>+				    const char *buf, size_t count)
>+{
>+	struct pci_dev *pdev = to_pci_dev(dev);
>+	u16 max_vfs;
>+	int ret;
>+
>+	ret = kstrtou16(buf, 0, &max_vfs);
>+	if (ret < 0)
>+		return ret;
>+
>+	if (pdev->driver) {
>+		dev_info(&pdev->dev,
>+			 "Can't change totalvfs while driver is attached\n");
>+		return -EUSERS;
>+	}
>+
>+	ret = pci_sriov_set_totalvfs(pdev, max_vfs);
>+	if (ret)
>+		return ret;
>+
>+	return count;
>+}
> 
> static ssize_t sriov_numvfs_show(struct device *dev,
> 				 struct device_attribute *attr,
>@@ -516,7 +540,9 @@ static ssize_t sriov_numvfs_store(struct device *dev,
> 	return count;
> }
> 
>-static struct device_attribute sriov_totalvfs_attr = __ATTR_RO(sriov_totalvfs);
>+static struct device_attribute sriov_totalvfs_attr =
>+		__ATTR(sriov_totalvfs, (S_IRUGO|S_IWUSR|S_IWGRP),
>+		       sriov_totalvfs_show, sriov_totalvfs_store);
> static struct device_attribute sriov_numvfs_attr =
> 		__ATTR(sriov_numvfs, (S_IRUGO|S_IWUSR|S_IWGRP),
> 		       sriov_numvfs_show, sriov_numvfs_store);
>-- 
>1.9.3
>
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mintz, Yuval Sept. 20, 2016, 8:27 p.m. UTC | #2
> >Some of the HW capable of SRIOV has resource limitations, where the
> >PF and VFs resources are drawn from a common pool.
> >In some cases, these limitations have to be considered early during
> >chip initialization and can only be changed by tearing down the
> >configuration and re-initializing.
> >As a result, drivers for such HWs sometimes have to make unfavorable
> >compromises where they reserve sufficient resources to accommodate
> >the maximal number of VFs that can be created - at the expense of
> >resources that could have been used by the PF.
> >
> >If users were able to provide 'hints' regarding the required number
> >of VFs *prior* to driver attachment, then such compromises could be
> >avoided. As we already have sysfs nodes that can be queried for the
> >number of totalvfs, it makes sense to let the user reduce the number
> >of said totalvfs using the same infrastructure.
> >Then, we can have drivers supporting SRIOV take that value into account
> >when deciding how much resources to reserve, allowing the PF to benefit
> >from the difference between the configuration space value and the actual
> >number needed by user.

> One of the motivations for introducing devlink interface was to allow
> user to pass some kind of well defined option parameters or as you call
> it hints to driver module. That would allow to replace module options
> and introduce similar possibility to pre-configure hardware on probe time.
> We plan to use devlink to allow user to change resource allocation for
> mlxsw devices.

Is IOV configuration something you're going to explore in the near
future for mlxsw devices? Or are you merely pointing out that
devlink could provide a superior configuration infrastructure and
should be investigated as a better alternative?

> The plan is to allow to pre-create devlink instance before driver module
> is loaded. Then the user will use this placeholder to do the options
> setting. Once the driver module is loaded, it will fetch the options
> from devlink core and process it accordingly.

> I believe this is exactly what you need.

While this sounds far superior to anything we can do via pci sysfs,
the question is whether adding devlink support for a device is
a reasonable cost for adding this specific configuration [given
the existing sysfs nodes we already have].
I'm not sufficiently familiar with the infrastructure there, and I
wonder whether it will set the bar too high for this sort of
configuration to be used.
Alexander Duyck Sept. 20, 2016, 9:36 p.m. UTC | #3
On Tue, Sep 20, 2016 at 8:49 AM, Yuval Mintz <Yuval.Mintz@qlogic.com> wrote:
> [Sorry in advance if this was already discussed in the past]
>
> Some of the HW capable of SRIOV has resource limitations, where the
> PF and VFs resources are drawn from a common pool.
> In some cases, these limitations have to be considered early during
> chip initialization and can only be changed by tearing down the
> configuration and re-initializing.
> As a result, drivers for such HWs sometimes have to make unfavorable
> compromises where they reserve sufficient resources to accommodate
> the maximal number of VFs that can be created - at the expense of
> resources that could have been used by the PF.
>
> If users were able to provide 'hints' regarding the required number
> of VFs *prior* to driver attachment, then such compromises could be
> avoided. As we already have sysfs nodes that can be queried for the
> number of totalvfs, it makes sense to let the user reduce the number
> of said totalvfs using the same infrastructure.
> Then, we can have drivers supporting SRIOV take that value into account
> when deciding how much resources to reserve, allowing the PF to benefit
> from the difference between the configuration space value and the actual
> number needed by user.
>
> Signed-off-by: Yuval Mintz <Yuval.Mintz@caviumnetworks.com>
> ---
>  drivers/pci/pci-sysfs.c | 28 +++++++++++++++++++++++++++-
>  1 file changed, 27 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index bcd10c7..c1546f8 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -449,6 +449,30 @@ static ssize_t sriov_totalvfs_show(struct device *dev,
>         return sprintf(buf, "%u\n", pci_sriov_get_totalvfs(pdev));
>  }
>
> +static ssize_t sriov_totalvfs_store(struct device *dev,
> +                                   struct device_attribute *attr,
> +                                   const char *buf, size_t count)
> +{
> +       struct pci_dev *pdev = to_pci_dev(dev);
> +       u16 max_vfs;
> +       int ret;
> +
> +       ret = kstrtou16(buf, 0, &max_vfs);
> +       if (ret < 0)
> +               return ret;
> +
> +       if (pdev->driver) {
> +               dev_info(&pdev->dev,
> +                        "Can't change totalvfs while driver is attached\n");
> +               return -EUSERS;
> +       }
> +
> +       ret = pci_sriov_set_totalvfs(pdev, max_vfs);
> +       if (ret)
> +               return ret;
> +
> +       return count;
> +}
>
>  static ssize_t sriov_numvfs_show(struct device *dev,
>                                  struct device_attribute *attr,
> @@ -516,7 +540,9 @@ static ssize_t sriov_numvfs_store(struct device *dev,
>         return count;
>  }
>
> -static struct device_attribute sriov_totalvfs_attr = __ATTR_RO(sriov_totalvfs);
> +static struct device_attribute sriov_totalvfs_attr =
> +               __ATTR(sriov_totalvfs, (S_IRUGO|S_IWUSR|S_IWGRP),
> +                      sriov_totalvfs_show, sriov_totalvfs_store);
>  static struct device_attribute sriov_numvfs_attr =
>                 __ATTR(sriov_numvfs, (S_IRUGO|S_IWUSR|S_IWGRP),
>                        sriov_numvfs_show, sriov_numvfs_store);

It would be useful to have an interface where you could increase the
number after you have decreased it.  With the interface as you have it
written that isn't an option since pci_sriov_set_totalvfs is really
only meant to strip VFs that cannot be supported because of something
such as a bus limitation, e.g. ARI not being supported.
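
To make the requested behaviour concrete, here is a small userspace model
of a limit that can be lowered and later raised again, but never above the
value the hardware advertises. It is a sketch of the semantics asked for
above, not a description of what pci_sriov_set_totalvfs() does; struct
vf_limit and vf_limit_set() are hypothetical names, and the sketch returns
-EBUSY where the patch returns -EUSERS.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

struct vf_limit {
	uint16_t hw_total;	/* TotalVFs from configuration space; fixed */
	uint16_t user_limit;	/* current cap, always <= hw_total */
	bool driver_bound;	/* mirrors the pdev->driver check in the patch */
};

/* Set a new cap. Lowering and raising are both allowed while no driver
 * is bound, as long as the cap stays within what the hardware offers. */
static int vf_limit_set(struct vf_limit *l, uint16_t requested)
{
	assert(l->user_limit <= l->hw_total);
	if (l->driver_bound)
		return -EBUSY;	/* the patch returns -EUSERS here */
	if (requested > l->hw_total)
		return -EINVAL;
	l->user_limit = requested;
	return 0;
}
```

The key design point is that hw_total is kept separate from user_limit, so
a decrease never loses the information needed to undo it.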

I really think that if you need something like this you might be
better off using something like devlink or just figuring out a way
to make your driver flexible enough to allow you to move resources
into and/or out of your PF interface if VFs are added or removed.  I
know in the case of the Intel parts we have to bounce the link when
SR-IOV is enabled because we actually go through and tear out the
queues and interrupts from the PF and then reassign all of them
between the PF and VFs before we bring the PF back up.

- Alex
Jiri Pirko Sept. 21, 2016, 5:55 a.m. UTC | #4
Tue, Sep 20, 2016 at 10:27:24PM CEST, Yuval.Mintz@cavium.com wrote:
>> >Some of the HW capable of SRIOV has resource limitations, where the
>> >PF and VFs resources are drawn from a common pool.
>> >In some cases, these limitations have to be considered early during
>> >chip initialization and can only be changed by tearing down the
>> >configuration and re-initializing.
>> >As a result, drivers for such HWs sometimes have to make unfavorable
>> >compromises where they reserve sufficient resources to accommodate
>> >the maximal number of VFs that can be created - at the expense of
>> >resources that could have been used by the PF.
>> >
>> >If users were able to provide 'hints' regarding the required number
>> >of VFs *prior* to driver attachment, then such compromises could be
>> >avoided. As we already have sysfs nodes that can be queried for the
>> >number of totalvfs, it makes sense to let the user reduce the number
>> >of said totalvfs using the same infrastructure.
>> >Then, we can have drivers supporting SRIOV take that value into account
>> >when deciding how much resources to reserve, allowing the PF to benefit
>> >from the difference between the configuration space value and the actual
>> >number needed by user.
>
>> One of the motivations for introducing devlink interface was to allow
>> user to pass some kind of well defined option parameters or as you call
>> it hints to driver module. That would allow to replace module options
>> and introduce similar possibility to pre-configure hardware on probe time.
>> We plan to use devlink to allow user to change resource allocation for
>> mlxsw devices.
>
>Is IOV configuration something you're going to explore in the near
>future for mlxsw devices? Or are you merely pointing out that

No, not sriov related directly.


>devlink could provide a superior configuration infrastructure and
>should be investigated as a better alternative?

Exactly. It is a general problem of how to pre-configure driver modules.


>
>> The plan is to allow to pre-create devlink instance before driver module
>> is loaded. Then the user will use this placeholder to do the options
>> setting. Once the driver module is loaded, it will fetch the options
>> from devlink core and process it accordingly.
>
>> I believe this is exactly what you need.
>
>While this sounds far-superior to anything we can do via pci sysfs,
>question is whether adding a devlink support for a device is 
>a reasonable cost for adding this specific configuration [given
>the existing sysfs nodes we already have].

Adding devlink support is trivial in most cases, I bet you can do it in
a couple of minutes for your driver.


>I'm not sufficiently familiar with the infrastructure there, and I
>wonder whether it will set the bar too high for this sort of
>configuration to be used.
Mintz, Yuval Sept. 21, 2016, 6 a.m. UTC | #5
> >> One of the motivations for introducing devlink interface was to allow
> >> user to pass some kind of well defined option parameters or as you call
> >> it hints to driver module. That would allow to replace module options
> >> and introduce similar possibility to pre-configure hardware on probe time.
> >> We plan to use devlink to allow user to change resource allocation for
> >> mlxsw devices.
> > >
> >Is IOV configuration something you're going to explore in the near
> >future for mlxsw devices? Or are you merely pointing out that

> No, not sriov related directly.

> >devlink could provide a superior configuration infrastructure and
> >should be investigated as a better alternative?

> Exactly. It is a general problem of how to pre-configure driver modules.

> >> The plan is to allow to pre-create devlink instance before driver module
> >> is loaded. Then the user will use this placeholder to do the options
> >> setting. Once the driver module is loaded, it will fetch the options
> >> from devlink core and process it accordingly.
> >
> >> I believe this is exactly what you need.

> Adding devlink support is trivial in most cases, I bet you can do it in
> couple of minutes for your driver.

I'll go and educate myself, then.
Thanks.

Patch

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index bcd10c7..c1546f8 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -449,6 +449,30 @@  static ssize_t sriov_totalvfs_show(struct device *dev,
 	return sprintf(buf, "%u\n", pci_sriov_get_totalvfs(pdev));
 }
 
+static ssize_t sriov_totalvfs_store(struct device *dev,
+				    struct device_attribute *attr,
+				    const char *buf, size_t count)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	u16 max_vfs;
+	int ret;
+
+	ret = kstrtou16(buf, 0, &max_vfs);
+	if (ret < 0)
+		return ret;
+
+	if (pdev->driver) {
+		dev_info(&pdev->dev,
+			 "Can't change totalvfs while driver is attached\n");
+		return -EUSERS;
+	}
+
+	ret = pci_sriov_set_totalvfs(pdev, max_vfs);
+	if (ret)
+		return ret;
+
+	return count;
+}
 
 static ssize_t sriov_numvfs_show(struct device *dev,
 				 struct device_attribute *attr,
@@ -516,7 +540,9 @@  static ssize_t sriov_numvfs_store(struct device *dev,
 	return count;
 }
 
-static struct device_attribute sriov_totalvfs_attr = __ATTR_RO(sriov_totalvfs);
+static struct device_attribute sriov_totalvfs_attr =
+		__ATTR(sriov_totalvfs, (S_IRUGO|S_IWUSR|S_IWGRP),
+		       sriov_totalvfs_show, sriov_totalvfs_store);
 static struct device_attribute sriov_numvfs_attr =
 		__ATTR(sriov_numvfs, (S_IRUGO|S_IWUSR|S_IWGRP),
 		       sriov_numvfs_show, sriov_numvfs_store);
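
For context, a userspace tool could exercise the proposed writable
attribute roughly as sketched below. This assumes the RFC is applied (the
attribute is read-only otherwise) and that the device has no driver bound
yet; write_vf_limit() is a hypothetical helper and the device address in
the note that follows is made up.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Write value as decimal text to the file at path.
 * Returns 0 on success or a negative errno-style code. */
static int write_vf_limit(const char *path, unsigned int value)
{
	FILE *f;
	int err = 0;

	assert(path);
	f = fopen(path, "w");
	if (!f)
		return -errno;
	if (fprintf(f, "%u\n", value) < 0)
		err = -EIO;
	/* The data is flushed on close, so a rejected write to a sysfs
	 * attribute may only be reported here. */
	if (fclose(f) != 0 && !err)
		err = -errno;
	return err;
}
```

A caller would pass something like
"/sys/bus/pci/devices/0000:03:00.0/sriov_totalvfs" as the path, before
binding the driver to the device, and treat any negative return value as
the store handler having refused the request.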