[net-next,00/12] Add support for PSE port priority

Message ID 20241002-feature_poe_port_prio-v1-0-787054f74ed5@bootlin.com (mailing list archive)

Kory Maincent Oct. 2, 2024, 4:27 p.m. UTC
From: Kory Maincent (Dent Project) <kory.maincent@bootlin.com>

This series brings support for port priority in the PSE subsystem.
PSE controllers can set priorities to decide which ports should be
turned off in case of special events like over-current.

This series also adds support for the devm_pse_irq_helper() helper,
similarly to devm_regulator_irq_helper(), to report events and errors.
Wrappers are used to avoid regulator naming in PSE drivers to prevent
confusion.

Patches 1-3: Cosmetics.
Patch 4: Adds support for power limit and measurement features in the TPS23881 driver.
Patches 5-7: Add support for port priority in PSE core and ethtool.
Patches 8-9: Add support for port priority in PD692x0 and TPS23881 drivers.
Patches 10-11: Add support for devm_pse_irq_helper() helper in PSE core and
               ethtool.
Patch 12: Adds support for interrupt and event report in TPS23881 driver.

This patch series is based on the fix sent recently:
https://lore.kernel.org/netdev/20241002121706.246143-1-kory.maincent@bootlin.com/T/#u

Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
---
Kory Maincent (12):
      net: pse-pd: Remove unused pse_ethtool_get_pw_limit function declaration
      net: pse-pd: tps23881: Correct boolean evaluation for bitmask checks
      net: pse-pd: tps23881: Simplify function returns by removing redundant checks
      net: pse-pd: tps23881: Add support for power limit and measurement features
      net: pse-pd: Add support for getting and setting port priority
      net: ethtool: Add PSE new port priority support feature
      netlink: specs: Expand the PSE netlink command with C33 prio attributes
      net: pse-pd: pd692x0: Add support for PSE PI priority feature
      net: pse-pd: tps23881: Add support for PSE PI priority feature
      net: pse-pd: Register regulator even for undescribed PSE PIs
      net: pse-pd: Add support for event reporting using devm_regulator_irq_helper
      net: pse-pd: tps23881: Add support for PSE events and interrupts

 Documentation/netlink/specs/ethtool.yaml     |  11 +
 Documentation/networking/ethtool-netlink.rst |  16 +
 drivers/net/pse-pd/pd692x0.c                 |  23 ++
 drivers/net/pse-pd/pse_core.c                |  66 +++-
 drivers/net/pse-pd/tps23881.c                | 532 +++++++++++++++++++++++++--
 include/linux/pse-pd/pse.h                   |  43 ++-
 include/uapi/linux/ethtool_netlink.h         |   2 +
 net/ethtool/pse-pd.c                         |  18 +
 8 files changed, 674 insertions(+), 37 deletions(-)
---
base-commit: 8052e7ff851b33e77f23800f8d15bafae9f97d17
change-id: 20240913-feature_poe_port_prio-a51aed7332ec

Best regards,

Comments

Kyle Swenson Oct. 9, 2024, 1:54 p.m. UTC | #1
Hello Kory,

On Wed, Oct 02, 2024 at 06:27:56PM +0200, Kory Maincent wrote:
> From: Kory Maincent (Dent Project) <kory.maincent@bootlin.com>
> 
> This series brings support for port priority in the PSE subsystem.
> PSE controllers can set priorities to decide which ports should be
> turned off in case of special events like over-current.

First off, great work here.  I've read through the patches in the series and
have a pretty good idea of what you're trying to achieve: use the PSE
controller's idea of "port priority" and expose this to userspace via ethtool.

I think this is probably sufficient but I wanted to share my experience
supporting a system level PSE power budget with PSE port priorities across
different PSE controllers through the same userspace interface such that
userspace doesn't know or care about the underlying PSE controller.

Out of the three PSE controllers I'm aware of (Microchip's PD692x0, TI's
TPS2388x, and LTC's LTC4266), the PD692x0 definitely has the most advanced
configuration, supporting concepts like a system (well, manager) level budget
and powering off lower priority ports in the event that the port power
consumption is greater than the system budget.

When we experimented with this feature in our routers, we found it to be using
the dynamic power consumed by a particular port, literally the summation of
port current * port voltage across all the ports.  While this behavior
technically saves the system from resetting or worse, it causes a bit of a
problem with lower priority ports getting powered off depending on the behavior
(power consumption) of unrelated devices.  

As an example, let's say we've got 4 devices, all powered, and we're close to
the power budget.  One of the devices starts consuming more power (perhaps its
modem just powered on), but not more than its class limit.  Say this device
consumes enough power to exceed the configured power budget, causing the lowest
priority device to be powered off.  This is the documented and intended
behavior of the PD692x0 chipset, but causes an unpleasant user experience
because it's not really clear why some device was powered down all of a sudden.
Was it because someone unplugged it? Or because the modem on the high priority
device turned on?  Or maybe that device had an overcurrent?  It'd be impossible
to tell, and even worse, by the time someone is able to physically look at the
switch, the low priority device might be back online (perhaps the modem on
the high priority device powered off).
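
The shedding behavior in this example can be modelled roughly as follows. This is a minimal sketch, not the PD692x0 firmware algorithm: the function name, the port tuples, the milliwatt figures, and the convention that a lower number means a more important port are all assumptions made for illustration.

```python
# Illustrative model only; not the PD692x0 firmware algorithm.
def shed_by_measured_power(ports, budget_mw):
    """Return the names of ports left powered after shedding.

    ports: list of (name, priority, measured_mw). A lower priority number
    means a more important port (an assumed convention).
    """
    powered = sorted(ports, key=lambda p: p[1])  # most important first
    total = sum(p[2] for p in powered)
    while powered and total > budget_mw:
        victim = powered.pop()  # least important port still powered
        total -= victim[2]
    return [p[0] for p in powered]


budget = 40000  # hypothetical system budget in mW
before = [("cam", 0, 12000), ("ap", 1, 12000),
          ("phone", 2, 5000), ("sensor", 3, 10000)]
# "cam" turns on its modem: 12.0 W -> 14.5 W, still under a 15.4 W class limit.
after = [("cam", 0, 14500), ("ap", 1, 12000),
         ("phone", 2, 5000), ("sensor", 3, 10000)]

print(shed_by_measured_power(before, budget))  # ['cam', 'ap', 'phone', 'sensor']
print(shed_by_measured_power(after, budget))   # ['cam', 'ap', 'phone']
```

In this model "sensor" loses power although nothing about "sensor" itself changed, which is the confusing behavior described above.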

This behavior is unique to the PD692x0. I'm much less familiar with the
TPS2388x's idea of port priority but it is very different from the PD692x0.
Frankly the behavior of the OSS pin is confusing and since we don't use the PSE
controllers' idea of port priority, it was safe to ignore it. Finally, the
LTC4266 has a "masked shutdown" ability where a predetermined set of ports are
shut down when a specific pin (MSD) is driven low.  Like the TPS2388x's OSS pin,
we ignore this feature on the LTC4266.

If the end-goal here is to have a device-independent idea of "port priority" I
think we need to add a level of indirection between the port priority concept and the
actual PSE hardware.  The indirection would enable a system with multiple
(possibly heterogeneous even) PSE chips to have a unified idea of port
priority.  The way we've implemented this in our routers is by putting the PSE
controllers in "semi-auto" mode, where they continually detect and classify PDs
(powered devices), but do not power them until instructed to do so.  The
mechanism that decides to power a particular port or not (for lack of a better
term, "budgeting logic") uses the available system power budget (configured
from userspace), the relative port priorities (also configured from userspace)
and the class of a detected PD.  The classification result is used to determine
the _maximum_ power a particular PD might draw, and that is the value that is
subtracted from the power budget.

Using the PD's classification and then allocating it the maximum power for that
class enables a non-technical installer to plug in all the PDs at the switch,
and observe if all the PDs are powered (or not).  But the important part is
(unless the port priorities or power budget are changed from userspace) the
devices that are powered won't change due to dynamic power consumption of the
other devices.
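
The class-based budgeting logic described above could look roughly like this. It is a sketch under stated assumptions rather than the implementation used in the routers: the function and variable names are hypothetical, a lower priority number is assumed to mean a more important port, and the per-class values are the PSE output power figures from IEEE 802.3af/at/bt. The sketch keeps trying lower-priority ports that still fit after a port is skipped; stopping at the first port that does not fit would be an equally plausible policy.

```python
# PSE output power per PD class in milliwatts (IEEE 802.3af/at/bt).
CLASS_MAX_MW = {0: 15400, 1: 4000, 2: 7000, 3: 15400, 4: 30000,
                5: 45000, 6: 60000, 7: 75000, 8: 90000}


def allocate_by_class(ports, budget_mw):
    """Decide which detected PDs to power, most important first.

    ports: list of (name, priority, pd_class). A lower priority number
    means a more important port (an assumed convention). Each powered
    port reserves the maximum power of its class, whatever it draws later.
    """
    remaining = budget_mw
    powered = []
    for name, _prio, pd_class in sorted(ports, key=lambda p: p[1]):
        need = CLASS_MAX_MW[pd_class]
        if need <= remaining:
            remaining -= need
            powered.append(name)
    return powered, remaining


detected = [("cam", 0, 3), ("ap", 1, 4), ("phone", 2, 2), ("sensor", 3, 3)]
print(allocate_by_class(detected, 55000))  # (['cam', 'ap', 'phone'], 2600)
```

Because the reservation depends only on class, priority, and budget, the result does not change when a powered device later draws more or less power.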

I'm not sure what the right path is for the kernel, and I'm not sure how this
would look with the regulator integration, nor am I sure what the userspace API
should look like (we used sysfs, but that's probably not ideal for upstream).
It's also not clear how much of the budgeting logic should be in the kernel, if
any. Despite that, hopefully sharing our experience is insightful and/or
helpful.  If not, feel free to ignore it.  In any case, you've got my

Reviewed-by: Kyle Swenson <kyle.swenson@est.tech>

for all the patches in the series.

Thanks,
Kyle Swenson

Kory Maincent Oct. 9, 2024, 3:04 p.m. UTC | #2
Hello Kyle,

Thanks for your review and for sharing your PSE experience.
It indeed is insightful for further development and update of this series.

So you are saying that from a user experience standpoint the port priority
feature is not user-friendly, as we don't know why a port has been shut down.
Even if we can report the over-current event and which port caused it, you still
think it is not useful?

We could have several cases for an over-power-budget event:
- The power limit exceeded is the one configured for the port.
  We should shut down only that port, regardless of priority.
  The TPS23881 has this behavior when power exceeds Pcut.
  I think the PD692x0 does the same. Need to verify.
- The power limit exceeded is the global (or manager PD69208M) power budget.
  Here port priority is interesting.
  Is there a way to know which port caused this global power limit excess?
  Should we turn off this port even if it doesn't exceed its own power limit, or
  should we turn off low priority ports?
  I can't find a global power budget concept for the TPS23881.
  I couldn't test this case because I don't have enough load. In fact, maybe
  setting the PD692x0 power bank limit low could work.

Regards,

Kyle Swenson Oct. 9, 2024, 5:42 p.m. UTC | #3
Hello Kory,

On Wed, Oct 09, 2024 at 05:04:00PM +0200, Kory Maincent wrote:
> Thanks for your review and for sharing your PSE experience.
> It indeed is insightful for further development and update of this series.

Excellent, glad to hear it.

> So you are saying that from a user experience standpoint the port priority
> feature is not user-friendly, as we don't know why a port has been shut down.
> Even if we can report the over-current event and which port caused it, you still
> think it is not useful?

Well, not quite.  I think the concept of a "port priority" is useful,
but I don't know that the PD692xx's concept of "port priority" is what
we want.  The issue is the PD692xx's budgeting algorithm is based on
dynamic power used (i.e. the total power used at any given time).  Since
this is, well, dynamic, it makes it confusing when a lower priority port
is powered off due to the runtime behavior of higher-priority ports.
It's even more confusing if the implicit or default port priorities are
used.

Instead, we found that using the maximum power that is allowed to be drawn
by a particular PD's class (set by the IEEE standard) is more user
friendly, because the set of devices that are powered won't change
(unless priorities are changed, or the system budget is changed).
For example, if we've got 4 devices plugged in, and the three highest
priority devices consume all the power budget, the lowest priority
device won't ever be powered.  There isn't a case where the lowest
priority device will be shut down because a higher priority device
starts consuming more power at some point in the future.

> We could have several cases for an over-power-budget event:
> - The power limit exceeded is the one configured for the port.
>   We should shut down only that port, regardless of priority.
>   The TPS23881 has this behavior when power exceeds Pcut.
>   I think the PD692x0 does the same. Need to verify.

These conditions I'd not call "over power budget events".  I'd call them
"port overcurrent events" and I agree, those only affect the specific
problem port.

> - The power limit exceeded is the global (or manager PD69208M) power budget.
>   Here port priority is interesting.
>   Is there a way to know which port caused this global power limit excess?
>   Should we turn off this port even if it doesn't exceed its own power limit, or
>   should we turn off low priority ports?

I think it's important to make a distinction between an "overcurrent"
condition and the condition where we've exceeded the system power
budget.  An "overcurrent" is port-specific, and can happen if the PD
consumes more power than the classification of the device allows.  For
example, if a Class 3 PD (i.e. 802.3af, also referred to as a Type 1
PD) consumes more than 15.4 W at the PSE, it will be shut down
immediately.  This support is required by all the IEEE 802.3 standards
around PoE (.af, .at, and .bt) and is a safety thing.  The TPS2388x
implements this with Pcut, the LTC4266 implements this with the Icut
register, and the PD692xx implements it with the port power limit
registers.

The condition where we've exceeded our system-level power
budget is a little different, in that it causes a port to be shut down
despite that port not exceeding its class power limit.  This condition
is the case I'm concerned we're solving in this series, and solving it
for the PD692xx case only, and it's based off dynamic power consumption.

So I guess I'm suggesting that we take the power budgeting concept out
of the PSE drivers, and put it into software (either kernel or userspace)
instead of the PSE hardware.
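
To make the two conditions concrete, here is a small sketch that reports them separately. The function name, event labels, and port tuples are hypothetical and chosen only for illustration; real PSE controllers detect the per-port case in hardware (Pcut, Icut, or port power limit registers, as noted above).

```python
def check_power_events(ports, budget_mw):
    """Classify power events; all names here are illustrative.

    ports: list of (name, limit_mw, measured_mw), where limit_mw is the
    per-port (class) limit. Returns a list of (event, port_name) tuples.
    """
    # Port-specific condition: a PD draws more than its own limit.
    events = [("port-overcurrent", name)
              for name, limit, measured in ports if measured > limit]
    # System-level condition: every remaining port is within its own
    # limit, yet the sum still exceeds the configured budget.
    within_limit = [p for p in ports if p[2] <= p[1]]
    if sum(p[2] for p in within_limit) > budget_mw:
        events.append(("system-budget-exceeded", None))
    return events


print(check_power_events([("a", 15400, 16000), ("b", 30000, 20000)], 50000))
# [('port-overcurrent', 'a')]
print(check_power_events([("a", 15400, 14000), ("b", 30000, 28000)], 40000))
# [('system-budget-exceeded', None)]
```

Only the second event involves port priority: no single port misbehaved, so something has to decide which port gives up its power.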

>   I can't find global power budget concept for the TPS23881. 

This is because this idea doesn't exist on the TPS2388x.  

>   I couldn't test this case because I don't have enough load. In fact, maybe
>   setting the PD692x0 power bank limit low could work.

Hopefully this helps clarify.

Thanks,
Kyle

Oleksij Rempel Oct. 10, 2024, 5:42 a.m. UTC | #4
Hello Kyle,

On Wed, Oct 09, 2024 at 05:42:30PM +0000, Kyle Swenson wrote:
> Hello Kory,
> 
> On Wed, Oct 09, 2024 at 05:04:00PM +0200, Kory Maincent wrote:
> > Hello Kyle,
> > 
> > On Wed, 9 Oct 2024 13:54:51 +0000
> > Kyle Swenson <kyle.swenson@est.tech> wrote:
> > 
> > > Hello Kory,
> > > 
> > > On Wed, Oct 02, 2024 at 06:27:56PM +0200, Kory Maincent wrote:
> > > > From: Kory Maincent (Dent Project) <kory.maincent@bootlin.com>
> > > > 
> > > > This series brings support for port priority in the PSE subsystem.
> > > > PSE controllers can set priorities to decide which ports should be
> > > > turned off in case of special events like over-current.  
> > > 
> > > First off, great work here.  I've read through the patches in the series and
> > > have a pretty good idea of what you're trying to achieve- use the PSE
> > > controller's idea of "port priority" and expose this to userspace via ethtool.
> > > 
> > > I think this is probably sufficient but I wanted to share my experience
> > > supporting a system level PSE power budget with PSE port priorities across
> > > different PSE controllers through the same userspace interface such that
> > > userspace doesn't know or care about the underlying PSE controller.
> > > 
> > > Out of the three PSE controllers I'm aware of (Microchip's PD692x0, TI's
> > > TPS2388x, and LTC's LT4266), the PD692x0 definitely has the most advanced
> > > configuration, supporting concepts like a system (well, manager) level budget
> > > and powering off lower priority ports in the event that the port power
> > > consumption is greater than the system budget.
> > > 
> > > When we experimented with this feature in our routers, we found it to be using
> > > the dynamic power consumed by a particular port- literally, the summation of
> > > port current * port voltage across all the ports.  While this behavior
> > > technically saves the system from resetting or worse, it causes a bit of a
> > > problem with lower priority ports getting powered off depending on the
> > > behavior (power consumption) of unrelated devices.  
> > > 
> > > As an example, let's say we've got 4 devices, all powered, and we're close to
> > > the power budget.  One of the devices starts consuming more power (perhaps
> > > it's modem just powered on), but not more than it's class limit.  Say this
> > > device consumes enough power to exceed the configured power budget, causing
> > > the lowest priority device to be powered off.  This is the documented and
> > > intended behavior of the PD692x0 chipset, but causes an unpleasant user
> > > experience because it's not really clear why some device was powered down all
> > > the sudden. Was it because someone unplugged it? Or because the modem on the
> > > high priority device turned on?  Or maybe that device had an overcurrent?
> > > It'd be impossible to tell, and even worse, by the time someone is able to
> > > physically look at the switch, the low priority device might be back online
> > > (perhaps the modem on the high priority device powered off).
> > > 
> > > This behavior is unique to the PD692x0- I'm much less familiar with the
> > > TPS2388x's idea of port priority but it is very different from the PD692x0.
> > > Frankly the behavior of the OSS pin is confusing and since we don't use the
> > > PSE controllers' idea of port priority, it was safe to ignore it. Finally, the
> > > LTC4266 has a "masked shutdown" ability where a predetermined set of ports are
> > > shutdown when a specific pin (MSD) is driven low.  Like the TPS2388x's OSS
> > > pin, We ignore this feature on the LTC4266.
> > > 
> > > If the end-goal here is to have a device-independent idea of "port priority" I
> > > think we need to add a level of indirection between the port priority concept
> > > and the actual PSE hardware.  The indirection would enable a system with
> > > multiple (possibly heterogeneous even) PSE chips to have a unified idea of
> > > port priority.  The way we've implemented this in our routers is by putting
> > > the PSE controllers in "semi-auto" mode, where they continually detect and
> > > classify PDs (powered device), but do not power them until instructed to do
> > > so.  The mechanism that decides to power a particular port or not (for lack
> > > of a better term, "budgeting logic") uses the available system power budget
> > > (configured from userspace), the relative port priorities (also configured
> > > from userspace) and the class of a detected PD.  The classification result is
> > > used to determine the _maximum_ power a particular PD might draw, and that is
> > > the value that is subtracted from the power budget.
> > > 
> > > Using the PD's classification and then allocating it the maximum power for
> > > that class enables a non-technical installer to plug in all the PDs at the
> > > switch, and observe if all the PDs are powered (or not).  But the important
> > > part is (unless the port priorities or power budget are changed from
> > > userspace) the devices that are powered won't change due to dynamic power
> > > consumption of the other devices.
> > > 
> > > I'm not sure what the right path is for the kernel, and I'm not sure how this
> > > would look with the regulator integration, nor am I sure what the userspace
> > > API should look like (we used sysfs, but that's probably not ideal for
> > > upstream). It's also not clear how much of the budgeting logic should be in
> > > the kernel, if any. Despite that, hopefully sharing our experience is
> > > insightful and/or helpful.  If not, feel free to ignore it.  In any case,
> > > you've got my
> > 
> > Thanks for your review and for sharing your PSE experience.
> > It indeed is insightful for further development and update of this series.
> 
> Excellent, glad to hear it.
> 
> > So you are saying that from a use experience the port priority feature is not
> > user-friendly as we don't know why a port has been shutdown.
> > Even if we can report the over-current event of which port caused it, you still
> > thinks it is not useful?
> 
> Well, not quite.  I think the concept of a "port priority" is useful,
> but I don't know that the PD692xx's concept of "port priority" is what
> we want.  The issue is the PD692xx's budgeting algorithm is based on
> dynamic power used (i.e. the total power used at any given time).  Since
> this is, well, dynamic, it makes it confusing when a lower priority port
> is powered off due to the runtime behavior of higher-priority ports.
> It's even more confusing if the implicit or default port priorities are
> used.
> 
> Instead, we found that using the maximum power that is allowed be drawn
> by a particular PD's class (set by the IEEE standard) is more user
> friendly, because the set of devices that are powered won't change
> (unless priorities are changed, or the system budget is changed).
> For example, if we've got 4 devices plugged in, and the three highest
> priority devices consume all the power budget, the lowest priority
> device won't ever be powered.  There isn't a case where the lowest
> priority device will be shut down because a higher priority device
> starts consuming more power at some point in the future.
> 
> > We could have several cases for over power budget event:
> > - The power limit exceeded is the one configured for the ports.
> >   We should shutdown only that port without taking care about priority.
> >   TPS23881 has this behavior when power exceed Pcut.
> >   I think the PD692x0 does the same. Need to verify.
> 
> These conditions I'd not call "over power budget events".  I'd call them
> "port overcurrent events" and I agree, those only affect the specific
> problem port.
> 
> > - The power limit exceeded is the global (or manager PD69208M) power budget.
> >   Here port priority is interesting.
> >   Is there a way to know which port create this global power limit excess?
> >   Should we turn off this port even if he don't exceed his own power limit or
> >   should we turn off low priority ports?
> 
> I think it's important to make a distinction between an "overcurrent"
> condition and the condition where we've exceeded the system power
> budget.  An "overcurrent" is port-specific, and can happen if the PD
> consumes more power than the classification of the device allows.  For
> example, if a Class 3 PD (i.e. 802.3af, also referred to as a Type 1
> PD) consumes more than 15.4 W at the PSE, it will be shut down
> immediately.  This support is required by all the IEEE 802.3 standards
> around PoE (.af, .at, and .bt) and is a safety thing.  The TPS2388x
> implements this with Pcut, the LTC4266 implements this with the Icut
> register, and the PD692xx implements it with the port power limit
> registers.
> 
> The condition where we've exceeded our system-level power
> budget is a little different, in that it causes a port to be shut down
> despite that port not exceeding its class power limit.  This condition
> is the case I'm concerned we're solving in this series, and solving it
> for the PD692xx case only, and it's based off dynamic power consumption.
> 
> So I guess I'm suggesting that we take the power budgeting concept out
> of the PSE drivers, and put it into software (either kernel, userspace)
> instead of the PSE hardware.  
> 
> >   I can't find global power budget concept for the TPS23881. 
> 
> This is because this idea doesn't exist on the TPS2388x.  
> 
> >   I couldn't test this case because I don't have enough load. In fact, maybe by
> >   setting the PD692x0 power bank limit low it could work.
> 
> Hopefully this helps clarify.


Thank you for your detailed insights. Before we dive deeper into policies and
implementations, I’d like to clarify an important point to avoid confusion
later. When comparing different PSE components, it's crucial to note that the
Microchip PD692x0 solution is split into two distinct components:
1. PoE controller (PD692x0)
2. PoE manager (PD6920x)

Comparing the PoE controller (PD692x0) with TPS2388x or LTC4266 isn't entirely
fair, as TPS2388x and LTC4266 are more comparable to the PoE manager (PD6920x).
The functionalities provided by the PoE controller (PD692x0) are things we
would need to implement ourselves on the software stack (kernel or userspace).
The budget heuristic that is implemented in the PD692x0's firmware is absent in
TPS2388x and LTC4266.

Policy Variants and Implementation

In cases where we are discussing prioritization, we are fundamentally talking
about over-provisioning. This typically means that while a device advertises a
certain maximum per-port power capacity (e.g., 95W), the total system power
budget (e.g., 300W) is insufficient to supply maximum power to all ports
simultaneously. This is often due to various system limitations, and if there
were no power limits, prioritization wouldn't be necessary.

The challenge then becomes how to squeeze more Powered Devices (PDs) onto one
PSE system. Here are two methods for over-provisioning:

1. Static Method:
 
   This method involves distributing power based on PD classification. It’s
   straightforward and stable, with the software (probably within the PSE
   framework) keeping track of the budget and subtracting the power requested by
   each PD’s class. 
 
   Advantages: Every PD gets its promised power at any time, which guarantees
   reliability. 

   Disadvantages: PD classification steps are large, meaning devices request
   much more power than they actually need. As a result, the power supply may
   only operate at, say, 50% capacity, which is inefficient and wastes money.

2. Dynamic Method:  

   To address the inefficiencies of the static method, vendors like Microchip
   have introduced dynamic power budgeting, as seen in the PD692x0 firmware.
   This method monitors the current consumption per port and subtracts it from
   the available power budget. When the budget is exceeded, lower-priority
   ports are shut down.  

   Advantages: This method optimizes resource utilization, saving costs.

   Disadvantages: Low-priority devices may experience instability. A possible
   improvement could involve using LLDP protocols to dynamically configure
   power limits per port, thus allowing us to reduce power on over-consuming
   ports rather than shutting them down entirely.
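
As a rough illustration of the static method, here is a minimal Python sketch
of class-based budgeting. The per-class wattages follow the IEEE 802.3 PSE
allocations; the function name, the tuple layout, and the convention that a
lower priority number means a more important port are assumptions made for
this example, not part of the PSE framework.

```python
# PSE-side power allocation per PD class, in milliwatts (IEEE 802.3 values,
# used here for illustration only).
CLASS_POWER_MW = {
    0: 15400, 1: 4000, 2: 7000, 3: 15400, 4: 30000,
    5: 45000, 6: 60000, 7: 75000, 8: 90000,
}


def static_allocate(ports, budget_mw):
    """Decide which ports get power under a static, class-based budget.

    ports: list of (name, priority, pd_class) tuples. A lower priority
    number is assumed to mean a more important port.
    Returns (list of powered port names, remaining budget in mW).
    """
    powered = []
    remaining = budget_mw
    # Walk ports from most to least important and reserve the full class power.
    for name, _prio, pd_class in sorted(ports, key=lambda p: p[1]):
        need = CLASS_POWER_MW[pd_class]
        if need <= remaining:
            powered.append(name)
            remaining -= need
    return powered, remaining
```

Because the reservation depends only on class and priority, the set of powered
ports changes only when a PD is plugged or unplugged, or when the priorities or
the budget change. It does not depend on what each PD actually draws.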

Recommendations for Software Handling

Both methods have their pros and cons. Since the dynamic method is not always
desirable, and if there's no way to disable it in the PD692x0's firmware, one
potential workaround could be handling the budget in software and dynamically
setting per-port limits. For instance, with a total budget of 300W and unused
ports, we could initially set 95W limits per port. As high-priority PDs (e.g.,
three 95W devices) are powered, we could dynamically reduce the power limit on
the remaining ports to 15W, ensuring that no device exceeds that classification
threshold.
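
The arithmetic behind that example can be sketched as follows. This is only an
illustration of the idea; the function name and parameters are hypothetical
and do not correspond to any existing kernel or PD692x0 interface.

```python
def unpowered_port_limit_w(total_budget_w, per_port_max_w, allocated_w):
    """Power limit to program on each still-unpowered port.

    total_budget_w: system power budget in watts.
    per_port_max_w: maximum a single port may ever be given.
    allocated_w: list of watts already reserved for powered ports.
    """
    remaining = total_budget_w - sum(allocated_w)
    # Never exceed the per-port maximum, never go below zero.
    return max(0, min(per_port_max_w, remaining))


# With a 300W budget and nothing powered, every port may take 95W.
print(unpowered_port_limit_w(300, 95, []))
# After three 95W devices are powered, only 15W is left for the others.
print(unpowered_port_limit_w(300, 95, [95, 95, 95]))
```

Note that the limit is computed per port, so it would have to be recomputed
and reprogrammed each time another port is powered.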

This is just one idea, and there are likely other policy variants we could
explore. Importantly, I believe these heuristics don’t belong in the kernel
itself. Instead, the kernel should simply provide the necessary interfaces,
leaving the policy implementation to userspace management software. At least
this is a lesson learned from the Thermal Management talk at LPC :D

Best regards,  
Oleksij
Kory Maincent Oct. 15, 2024, 9:43 a.m. UTC | #5
Hello,

On Thu, 10 Oct 2024 07:42:25 +0200
Oleksij Rempel <o.rempel@pengutronix.de> wrote:

> > The condition where we've exceeded our system-level power
> > budget is a little different, in that it causes a port to be shut down
> > despite that port not exceeding its class power limit.  This condition
> > is the case I'm concerned we're solving in this series, and solving it
> > for the PD692xx case only, and it's based off dynamic power consumption.
> > 
> > So I guess I'm suggesting that we take the power budgeting concept out
> > of the PSE drivers, and put it into software (either kernel, userspace)
> > instead of the PSE hardware.  
> >   
> > >   I can't find global power budget concept for the TPS23881.   
> > 
> > This is because this idea doesn't exist on the TPS2388x.  
> >   
> > >   I couldn't test this case because I don't have enough load. In fact,
> > > maybe by setting the PD692x0 power bank limit low it could work.  
> > 
> > Hopefully this helps clarify.  
> 
> 
> Thank you for your detailed insights. Before we dive deeper into policies and
> implementations, I’d like to clarify an important point to avoid confusion
> later. When comparing different PSE components, it's crucial to note that the
> Microchip PD692x0 operates in two distinct categories:
> 1. PoE controller (PD692x0)
> 2. PoE manager (PD6920x)
> 
> Comparing the PoE controller (PD692x0) with TPS2388x or LTC4266 isn't entirely
> fair, as TPS2388x and LTC4266 are more comparable to the PoE manager
> (PD6920x). The functionalities provided by the PoE controller (PD692x0) are
> things we would need to implement ourselves on the software stack (kernel or
> userspace). The budget heuristic that is implemented in the PD692x0's
> firmware is absent in TPS2388x and LTC4266.
> 
> Policy Variants and Implementation
> 
> In cases where we are discussing prioritization, we are fundamentally talking
> about over-provisioning. This typically means that while a device advertises a
> certain maximum per-port power capacity (e.g., 95W), the total system power
> budget (e.g., 300W) is insufficient to supply maximum power to all ports
> simultaneously. This is often due to various system limitations, and if there
> were no power limits, prioritization wouldn't be necessary.
> 
> The challenge then becomes how to squeeze more Powered Devices (PDs) onto one
> PSE system. Here are two methods for over-provisioning:
> 
> 1. Static Method:
>  
>    This method involves distributing power based on PD classification. It’s
>    straightforward and stable, with the software (probably within the PSE
>    framework) keeping track of the budget and subtracting the power requested
>    by each PD’s class.
>  
>    Advantages: Every PD gets its promised power at any time, which guarantees
>    reliability. 
> 
>    Disadvantages: PD classification steps are large, meaning devices request
>    much more power than they actually need. As a result, the power supply may
>    only operate at, say, 50% capacity, which is inefficient and wastes money.
> 
> 2. Dynamic Method:  
> 
>    To address the inefficiencies of the static method, vendors like Microchip
>    have introduced dynamic power budgeting, as seen in the PD692x0 firmware.
>    This method monitors the current consumption per port and subtracts it from
>    the available power budget. When the budget is exceeded, lower-priority
>    ports are shut down.  
> 
>    Advantages: This method optimizes resource utilization, saving costs.
> 
>    Disadvantages: Low-priority devices may experience instability. A possible
>    improvement could involve using LLDP protocols to dynamically configure
>    power limits per port, thus allowing us to reduce power on over-consuming
>    ports rather than shutting them down entirely.

Indeed, only the static method will be available for PSE controllers that do
not support system power budget management, like the TPS2388x or LTC4266.
Both methods could be supported for "smart" PSE controllers like the PD692x0.

Let's begin with the static method implementation in the PSE framework for now.
It will need the power domain notion you have talked about.

> Recommendations for Software Handling
> 
> Both methods have their pros and cons. Since the dynamic method is not always
> desirable, and if there's no way to disable it in the PD692x0's firmware, one
> potential workaround could be handling the budget in software and dynamically
> setting per-port limits. For instance, with a total budget of 300W and unused
> ports, we could initially set 95W limits per port. As high-priority PDs (e.g.,
> three 95W devices) are powered, we could dynamically reduce the power limit on
> the remaining ports to 15W, ensuring that no device exceeds that
> classification threshold.
> 
> This is just one idea, and there are likely other policy variants we could
> explore. Importantly, I believe these heuristics don’t belong in the kernel
> itself. Instead, the kernel should simply provide the necessary interfaces,
> leaving the policy implementation to userspace management software. At least
> this is a lesson learned from Thermal Management talk at LPC :D

I think the kernel is only missing the PSE notification events to be ready to
leave the port priority policy to the userspace.

Regards,
Kory Maincent Oct. 17, 2024, 10:35 a.m. UTC | #6
On Tue, 15 Oct 2024 11:43:52 +0200
Kory Maincent <kory.maincent@bootlin.com> wrote:

> > Policy Variants and Implementation
> > 
> > In cases where we are discussing prioritization, we are fundamentally
> > talking about over-provisioning. This typically means that while a device
> > advertises a certain maximum per-port power capacity (e.g., 95W), the total
> > system power budget (e.g., 300W) is insufficient to supply maximum power to
> > all ports simultaneously. This is often due to various system limitations,
> > and if there were no power limits, prioritization wouldn't be necessary.
> > 
> > The challenge then becomes how to squeeze more Powered Devices (PDs) onto
> > one PSE system. Here are two methods for over-provisioning:
> > 
> > 1. Static Method:
> >  
> >    This method involves distributing power based on PD classification. It’s
> >    straightforward and stable, with the software (probably within the PSE
> >    framework) keeping track of the budget and subtracting the power
> >    requested by each PD’s class.
> >  
> >    Advantages: Every PD gets its promised power at any time, which
> >    guarantees reliability.
> > 
> >    Disadvantages: PD classification steps are large, meaning devices request
> >    much more power than they actually need. As a result, the power supply
> >    may only operate at, say, 50% capacity, which is inefficient and wastes
> >    money.
> > 
> > 2. Dynamic Method:  
> > 
> >    To address the inefficiencies of the static method, vendors like
> >    Microchip have introduced dynamic power budgeting, as seen in the PD692x0
> >    firmware. This method monitors the current consumption per port and
> >    subtracts it from the available power budget. When the budget is exceeded,
> >    lower-priority ports are shut down.
> > 
> >    Advantages: This method optimizes resource utilization, saving costs.
> > 
> >    Disadvantages: Low-priority devices may experience instability. A
> >    possible improvement could involve using LLDP protocols to dynamically
> >    configure power limits per port, thus allowing us to reduce power on
> >    over-consuming ports rather than shutting them down entirely.
> 
> Indeed we will have only static method for PSE controllers not supporting
> system power budget management like the TPS2388x or LTC4266.
> Both methods could be supported for "smart" PSE controllers like PD692x0.
> 
> Let's begin with the static method implementation in the PSE framework for
> now. It will need the power domain notion you have talked about.

While developing the software support for port priority with the static
method, I ran into an issue.

Suppose plugging in a new PD would exceed the power budget.
The port power should not be enabled directly or magic smoke will appear.
So we have to separate the detection part, which tells us the needs of the PD,
from the power enable part.

Currently the hardware enables the port power automatically after the
detection process. There is no way to separate the power-on process from the
detection process with the PD692x0 controller. It could be done on the
TPS23881 by configuring it to manual mode, but: "The use of this mode is
intended for system diagnostic purposes only in the event that ports cannot be
powered in accordance with the IEEE 802.3bt standard from semiauto or auto
modes."
Not sure we want that.

So in fact the workaround you talked about above will be needed for both PSE
controllers.
 
> Both methods have their pros and cons. Since the dynamic method is not always
> desirable, and if there's no way to disable it in the PD692x0's firmware, one
> potential workaround could be handling the budget in software and dynamically
> setting per-port limits. For instance, with a total budget of 300W and unused
> ports, we could initially set 95W limits per port. As high-priority PDs (e.g.,
> three 95W devices) are powered, we could dynamically reduce the power limit on
> the remaining ports to 15W, ensuring that no device exceeds that
> classification threshold.

We would set the port overcurrent limit on all unpowered ports when the
available power budget is less than the max PI power (100W), as you described.
If a newly plugged PD exceeds the overcurrent limit, it will raise an interrupt
and we could deal with the power budget to turn off low priority ports at that
time.

Hmm, in fact I cannot know whether the overcurrent event interrupt comes from a
newly plugged PD or not.

An option: when we get a new PD plug interrupt event, we wait for the end of
the classification time (Tpon 400ms) and read the interrupt states again to
know whether there is an overcurrent on the port.
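
A minimal Python sketch of that option, only to make the sequence explicit.
The read_overcurrent callback is hypothetical and stands in for a driver
register read; it is not an existing kernel or library API.

```python
import time

TPON_S = 0.4  # Tpon from the discussion above (400 ms)


def overcurrent_after_plug(port, read_overcurrent, sleep=time.sleep):
    """Wait out Tpon after a plug event, then re-read the overcurrent state.

    port: port index that raised the plug event.
    read_overcurrent: hypothetical callback returning a truthy value when
    the port currently reports an overcurrent condition.
    sleep: injectable so the delay can be replaced in tests.
    """
    sleep(TPON_S)
    return bool(read_overcurrent(port))
```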

What do you think?

Regards,
Oleksij Rempel Oct. 18, 2024, 6:14 a.m. UTC | #7
On Thu, Oct 17, 2024 at 12:35:57PM +0200, Kory Maincent wrote:
> On Tue, 15 Oct 2024 11:43:52 +0200
> Kory Maincent <kory.maincent@bootlin.com> wrote:
> 
> > > Policy Variants and Implementation
> > > 
> > > In cases where we are discussing prioritization, we are fundamentally
> > > talking about over-provisioning. This typically means that while a device
> > > advertises a certain maximum per-port power capacity (e.g., 95W), the total
> > > system power budget (e.g., 300W) is insufficient to supply maximum power to
> > > all ports simultaneously. This is often due to various system limitations,
> > > and if there were no power limits, prioritization wouldn't be necessary.
> > > 
> > > The challenge then becomes how to squeeze more Powered Devices (PDs) onto
> > > one PSE system. Here are two methods for over-provisioning:
> > > 
> > > 1. Static Method:
> > >  
> > >    This method involves distributing power based on PD classification. It’s
> > >    straightforward and stable, with the software (probably within the PSE
> > >    framework) keeping track of the budget and subtracting the power
> > >    requested by each PD’s class.
> > >  
> > >    Advantages: Every PD gets its promised power at any time, which
> > >    guarantees reliability.
> > > 
> > >    Disadvantages: PD classification steps are large, meaning devices request
> > >    much more power than they actually need. As a result, the power supply
> > >    may only operate at, say, 50% capacity, which is inefficient and wastes
> > >    money.
> > > 
> > > 2. Dynamic Method:  
> > > 
> > >    To address the inefficiencies of the static method, vendors like
> > >    Microchip have introduced dynamic power budgeting, as seen in the PD692x0
> > >    firmware. This method monitors the current consumption per port and
> > >    subtracts it from the available power budget. When the budget is exceeded,
> > >    lower-priority ports are shut down.
> > > 
> > >    Advantages: This method optimizes resource utilization, saving costs.
> > > 
> > >    Disadvantages: Low-priority devices may experience instability. A
> > >    possible improvement could involve using LLDP protocols to dynamically
> > >    configure power limits per port, thus allowing us to reduce power on
> > >    over-consuming ports rather than shutting them down entirely.
> > 
> > Indeed we will have only static method for PSE controllers not supporting
> > > system power budget management like the TPS2388x or LTC4266.
> > > Both methods could be supported for "smart" PSE controllers like PD692x0.
> > 
> > Let's begin with the static method implementation in the PSE framework for
> > now. It will need the power domain notion you have talked about.
> 
> While developing the software support for port priority in static method, I
> faced an issue.
> 
> Supposing we are exceeding the power budget when we plug a new PD.
> The port power should not be enabled directly or magic smoke will appear.
> So we have to separate the detection part to know the needs of the PD from the
> power enable part.
> 
> Currently the port power is enabled on the hardware automatically after the
> detection process. There is no way to separate power port process and detection
> process with the PD692x0 controller and it could be done on the TPS23881 by
> configuring it to manual mode but: "The use of this mode is intended for system
> diagnostic purposes only in the event that ports cannot be powered in
> accordance with the IEEE 802.3bt standard from semiauto or auto modes."
> Not sure we want that.
> 
> So in fact the workaround you talked about above will be needed for the two PSE
> controllers.

For the TPS23881, "9.1.1.2 Semiauto" seems to be exactly what we want:
"The port performs detection and classification (if valid detection
occurs) continuously. Registers are updated each time a detection or
classification occurs. The port power is not automatically turned on. A
Power Enable command is required to turn on the port"

For the PD692x0 controller, I'm not 100% sure. There is the "4.3.5 Set
Enable/Disable Channels" command: "Sets individual port Enable (Delivering
power enable) or Disable (Delivering power disable)."

As I understand it, "Delivering power" is the state after
classification. So it is what we want too.

If it works in both cases, it would be a more elegant way to go. The
controller does auto-detection and classification; what we should do in
software is decide whether the PD can be enabled based on
classification results, priority and available budget.

> > Both methods have their pros and cons. Since the dynamic method is not always
> > desirable, and if there's no way to disable it in the PD692x0's firmware, one
> > potential workaround could be handling the budget in software and dynamically
> > setting per-port limits. For instance, with a total budget of 300W and unused
> > ports, we could initially set 95W limits per port. As high-priority PDs (e.g.,
> > three 95W devices) are powered, we could dynamically reduce the power limit on
> > the remaining ports to 15W, ensuring that no device exceeds that
> > classification threshold.
> 
> We would set port overcurrent limit for all unpowered ports when the power
> budget available is less than max PI power 100W as you described.
> If a new PD plugged exceed the overcurrent limit then it will raise an interrupt
> and we could deal with the power budget to turn off low priority ports at that
> time. 

> Mmh in fact I could not know if the overcurrent event interrupt comes from a
> newly plugged PD or not.

Hm... in the case of the PD692x0, maybe using event counters?

> An option: When we get new PD device plug interrupt event, we wait the end of
> classification time (Tpon 400ms) and read the interrupt states again to know if
> there is an overcurrent or not on the port.

Let's try Semiauto mode for TPS23881 first, I assume it is designed
exactly for this use case.

And then, test if PD692x0 supports a way to disable auto power delivery
in the 4.3.5 command.

Regards,
Oleksij
Kory Maincent Oct. 18, 2024, 12:37 p.m. UTC | #8
On Fri, 18 Oct 2024 08:14:26 +0200
Oleksij Rempel <o.rempel@pengutronix.de> wrote:

> On Thu, Oct 17, 2024 at 12:35:57PM +0200, Kory Maincent wrote:
> > On Tue, 15 Oct 2024 11:43:52 +0200
> > Kory Maincent <kory.maincent@bootlin.com> wrote:
> >   
>  [...]  
> > > 
> > > Indeed we will have only static method for PSE controllers not supporting
> > > system power budget management like the TPS2388x or LTC4266.
> > > Both methods could be supported for "smart" PSE controllers like PD692x0.
> > > 
> > > Let's begin with the static method implementation in the PSE framework for
> > > now. It will need the power domain notion you have talked about.  
> > 
> > While developing the software support for port priority in static method, I
> > faced an issue.
> > 
> > Supposing we are exceeding the power budget when we plug a new PD.
> > The port power should not be enabled directly or magic smoke will appear.
> > So we have to separate the detection part to know the needs of the PD from
> > the power enable part.
> > 
> > Currently the port power is enabled on the hardware automatically after the
> > detection process. There is no way to separate power port process and
> > detection process with the PD692x0 controller and it could be done on the
> > TPS23881 by configuring it to manual mode but: "The use of this mode is
> > intended for system diagnostic purposes only in the event that ports cannot
> > be powered in accordance with the IEEE 802.3bt standard from semiauto or
> > auto modes." Not sure we want that.
> > 
> > So in fact the workaround you talked about above will be needed for the two
> > PSE controllers.  
> 
> For the TPS23881, "9.1.1.2 Semiauto", seems to be exactly what we want:
> "The port performs detection and classification (if valid detection
> occurs) continuously. Registers are updated each time a detection or
> classification occurs. The port power is not automatically turned on. A
> Power Enable command is required to turn on the port"

I had been reading the assigned class register rather than the requested class
register, so I thought it was not working, but it does indeed detect the class
even when the port power is off. That's what I was looking for, nice!
I also just figured out that calling pwoff resets detection, classification,
power policy... So the port needs to be set up again after a pwoff.
 
> For PD692x0 controller, i'm not 100% sure. There is "4.3.5 Set Enable/Disable
> Channels" command, "Sets individual port Enable (Delivering power
> enable) or Disable (Delivering power disable)." 
> 
> For my understanding, "Delivering power" is the state after
> classification. So, it is what we want too.

On the PD692x0 there is also a requested class and power value, but it stays at
the "no class detected" value (0xc) if the port is not enabled.
I did not find a way to detect the class and keep the port power off.
 
> If it works in both cases, it would be a more elegant way to go. The
> controller does auto-detection and classification; what we should do in
> software is decide whether the PD can be enabled based on
> classification results, priority and available budget.
> 
> > > Both methods have their pros and cons. Since the dynamic method is not
> > > always desirable, and if there's no way to disable it in the PD692x0's
> > > firmware, one potential workaround could be handling the budget in
> > > software and dynamically setting per-port limits. For instance, with a
> > > total budget of 300W and unused ports, we could initially set 95W limits
> > > per port. As high-priority PDs (e.g., three 95W devices) are powered, we
> > > could dynamically reduce the power limit on the remaining ports to 15W,
> > > ensuring that no device exceeds that classification threshold.  
> > 
> > We would set port overcurrent limit for all unpowered ports when the power
> > budget available is less than max PI power 100W as you described.
> > If a new PD plugged exceed the overcurrent limit then it will raise an
> > interrupt and we could deal with the power budget to turn off low priority
> > ports at that time.   
> 
> > Mmh in fact I could not know if the overcurrent event interrupt comes from a
> > newly plugged PD or not.  
> 
> Hm... in the case of the PD692x0, maybe using event counters?

Counters? I don't see how.

> > An option: When we get new PD device plug interrupt event, we wait the end
> > of classification time (Tpon 400ms) and read the interrupt states again to
> > know if there is an overcurrent or not on the port.  
> 
> Let's try Semiauto mode for TPS23881 first, I assume it is designed
> exactly for this use case.

Yes,

> And then, test if PD692x0 supports a way to disable auto power delivery
> in the 4.3.5 command.

I don't have this 4.3.5 command. Are you referring to a different document from
the communication protocol version 3.55 document?

Regards,