[RFC,net-next,2/2] ipv6: ioam: Support for Buffer occupancy data field

Message ID: 20211206211758.19057-3-justin.iurman@uliege.be
State: New
Series: IOAM queue depth and buffer occupancy

Commit Message

Justin Iurman Dec. 6, 2021, 9:17 p.m. UTC
This patch is an attempt to support the buffer occupancy in IOAM trace
data fields. Any feedback is appreciated, as is any other idea if this
approach is not correct.

The draft [1] says the following:

   The "buffer occupancy" field is a 4-octet unsigned integer field.
   This field indicates the current status of the occupancy of the
   common buffer pool used by a set of queues.  The units of this field
   are implementation specific.  Hence, the units are interpreted within
   the context of an IOAM-Namespace and/or node-id if used.  The authors
   acknowledge that in some operational cases there is a need for the
   units to be consistent across a packet path through the network,
   hence it is recommended for implementations to use standard units
   such as Bytes.

An existing function (i.e., get_slabinfo) is used to retrieve info about
skbuff_head_cache. For that, both the prototype of get_slabinfo and the
struct definition of slabinfo were moved from mm/slab.h to
include/linux/slab.h. Any objection to this?

The function kmem_cache_size is used to retrieve the size of a slab
object. Note that it returns the "object_size" field, not the "size"
field. If needed, a new function (e.g., kmem_cache_full_size) could be
added to return the "size" field. To match the definition from the
draft, the number of bytes is computed as follows:

slabinfo.active_objs * size

Thoughts?

  [1] https://datatracker.ietf.org/doc/html/draft-ietf-ippm-ioam-data#section-5.4.2.12

Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
---
 include/linux/slab.h | 15 +++++++++++++++
 mm/slab.h            | 14 --------------
 net/ipv6/ioam6.c     | 13 ++++++++++++-
 3 files changed, 27 insertions(+), 15 deletions(-)
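
For illustration only: the kmem_cache_full_size() helper suggested above
does not exist in the kernel. A minimal sketch, assuming it would sit next
to kmem_cache_size() in mm/slab_common.c and that struct kmem_cache exposes
both "object_size" and "size" fields, could look like this:

/* Hypothetical counterpart to kmem_cache_size(): return the full
 * per-object footprint (the "size" field, including metadata and
 * padding) rather than the usable payload ("object_size").
 */
unsigned int kmem_cache_full_size(struct kmem_cache *s)
{
	return s->size;
}
EXPORT_SYMBOL(kmem_cache_full_size);

With such a helper, the byte count above would become
slabinfo.active_objs * kmem_cache_full_size(skbuff_head_cache).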

Comments

Jakub Kicinski Dec. 7, 2021, 12:16 a.m. UTC | #1
On Mon,  6 Dec 2021 22:17:58 +0100 Justin Iurman wrote:
> This patch is an attempt to support the buffer occupancy in IOAM trace
> data fields. Any feedback is appreciated, as is any other idea if this
> approach is not correct.
> 
> The draft [1] says the following:
> 
>    The "buffer occupancy" field is a 4-octet unsigned integer field.
>    This field indicates the current status of the occupancy of the
>    common buffer pool used by a set of queues.  The units of this field
>    are implementation specific.  Hence, the units are interpreted within
>    the context of an IOAM-Namespace and/or node-id if used.  The authors
>    acknowledge that in some operational cases there is a need for the
>    units to be consistent across a packet path through the network,
>    hence it is recommended for implementations to use standard units
>    such as Bytes.
> 
> An existing function (i.e., get_slabinfo) is used to retrieve info about
> skbuff_head_cache. For that, both the prototype of get_slabinfo and the
> struct definition of slabinfo were moved from mm/slab.h to
> include/linux/slab.h. Any objection to this?
> 
> The function kmem_cache_size is used to retrieve the size of a slab
> object. Note that it returns the "object_size" field, not the "size"
> field. If needed, a new function (e.g., kmem_cache_full_size) could be
> added to return the "size" field. To match the definition from the
> draft, the number of bytes is computed as follows:
> 
> slabinfo.active_objs * size
> 
> Thoughts?

Implementing the standard is one thing but how useful is this 
in practice?
Justin Iurman Dec. 7, 2021, 11:54 a.m. UTC | #2
Hi Jakub,

>> [...]
>>
>> The function kmem_cache_size is used to retrieve the size of a slab
>> object. Note that it returns the "object_size" field, not the "size"
>> field. If needed, a new function (e.g., kmem_cache_full_size) could be
>> added to return the "size" field. To match the definition from the
>> draft, the number of bytes is computed as follows:
>> 
>> slabinfo.active_objs * size
>> 
>> Thoughts?
> 
> Implementing the standard is one thing but how useful is this
> in practice?

IMHO, very useful. To be honest, if I were to implement only a few data
fields, these two would both be included. Take the example of CLT [1]
where the queue length data field is used to detect low-level issues
from inside a L5-7 distributed tracing tool. And this is just one
example among many others. The queue length data field is very specific
to TX queues, but we could also use the buffer occupancy data field to
detect more global load on a node. Actually, the goal for operators
running their IOAM domain is to quickly detect a problem along a path
and react accordingly (human or automatic action). For example, if you
monitor TX queues along a path and detect an increasing queue on a
router, you could choose to, e.g., rebalance its queues. With the
buffer occupancy, you could detect highly loaded nodes in general and,
e.g., rebalance traffic to another branch. Again, this is just one
example among others. Apart from more accurate ECMPs, you could for
instance deploy a smart (micro)service selection based on different
metrics, etc.

  [1] https://github.com/Advanced-Observability/cross-layer-telemetry
Jakub Kicinski Dec. 7, 2021, 3:50 p.m. UTC | #3
On Tue, 7 Dec 2021 12:54:04 +0100 (CET) Justin Iurman wrote:
> >> The function kmem_cache_size is used to retrieve the size of a slab
> >> object. Note that it returns the "object_size" field, not the "size"
> >> field. If needed, a new function (e.g., kmem_cache_full_size) could be
> >> added to return the "size" field. To match the definition from the
> >> draft, the number of bytes is computed as follows:
> >> 
> >> slabinfo.active_objs * size
> > 
> > Implementing the standard is one thing but how useful is this
> > in practice?  
> 
> IMHO, very useful. To be honest, if I were to implement only a few data
> fields, these two would both be included. Take the example of CLT [1]
> where the queue length data field is used to detect low-level issues
> from inside a L5-7 distributed tracing tool. And this is just one
> example among many others. The queue length data field is very specific
> to TX queues, but we could also use the buffer occupancy data field to
> detect more global load on a node. Actually, the goal for operators
> running their IOAM domain is to quickly detect a problem along a path
> and react accordingly (human or automatic action). For example, if you
> monitor TX queues along a path and detect an increasing queue on a
> router, you could choose to, e.g., rebalance its queues. With the
> buffer occupancy, you could detect highly loaded nodes in general and,
> e.g., rebalance traffic to another branch. Again, this is just one
> example among others. Apart from more accurate ECMPs, you could for
> instance deploy a smart (micro)service selection based on different
> metrics, etc.
> 
>   [1] https://github.com/Advanced-Observability/cross-layer-telemetry

Ack, my question was more about whether the metric as implemented
provides the best signal. Since the slab cache scales dynamically
(AFAIU) it's not really a big deal if it's full as long as there's
memory available on the system.
Justin Iurman Dec. 7, 2021, 4:35 p.m. UTC | #4
On Dec 7, 2021, at 4:50 PM, Jakub Kicinski kuba@kernel.org wrote:
> On Tue, 7 Dec 2021 12:54:04 +0100 (CET) Justin Iurman wrote:
>> >> The function kmem_cache_size is used to retrieve the size of a slab
>> >> object. Note that it returns the "object_size" field, not the "size"
>> >> field. If needed, a new function (e.g., kmem_cache_full_size) could be
>> >> added to return the "size" field. To match the definition from the
>> >> draft, the number of bytes is computed as follows:
>> >> 
>> >> slabinfo.active_objs * size
>> > 
>> > Implementing the standard is one thing but how useful is this
>> > in practice?
>> 
>> IMHO, very useful. To be honest, if I were to implement only a few data
>> fields, these two would both be included. Take the example of CLT [1]
>> where the queue length data field is used to detect low-level issues
>> from inside a L5-7 distributed tracing tool. And this is just one
>> example among many others. The queue length data field is very specific
>> to TX queues, but we could also use the buffer occupancy data field to
>> detect more global load on a node. Actually, the goal for operators
>> running their IOAM domain is to quickly detect a problem along a path
>> and react accordingly (human or automatic action). For example, if you
>> monitor TX queues along a path and detect an increasing queue on a
>> router, you could choose to, e.g., rebalance its queues. With the
>> buffer occupancy, you could detect highly loaded nodes in general and,
>> e.g., rebalance traffic to another branch. Again, this is just one
>> example among others. Apart from more accurate ECMPs, you could for
>> instance deploy a smart (micro)service selection based on different
>> metrics, etc.
>> 
>>   [1] https://github.com/Advanced-Observability/cross-layer-telemetry
> 
> Ack, my question was more about whether the metric as implemented

Oh, sorry about that.

> provides the best signal. Since the slab cache scales dynamically
> (AFAIU) it's not really a big deal if it's full as long as there's
> memory available on the system.

Well, I got the same understanding as you. However, we do not provide a
value meaning "X percent used" just because it wouldn't make much sense,
as you pointed out. So I think it is sound to have the current value,
even if it's quite a dynamic one. Indeed, what's important here is to
know how many bytes are used, and this is exactly what it does. If a node
is under heavy load, the value would be extremely high. The operator could
define a per-node threshold and detect abnormal values.

We probably want the metadata included for accuracy as well (e.g.,
kmem_cache_size vs new function kmem_cache_full_size).
David Ahern Dec. 7, 2021, 4:37 p.m. UTC | #5
On 12/6/21 2:17 PM, Justin Iurman wrote:
> This patch is an attempt to support the buffer occupancy in IOAM trace
> data fields. Any feedback is appreciated, as is any other idea if this
> approach is not correct.
> 
> The draft [1] says the following:
> 
>    The "buffer occupancy" field is a 4-octet unsigned integer field.
>    This field indicates the current status of the occupancy of the
>    common buffer pool used by a set of queues.  The units of this field
>    are implementation specific.  Hence, the units are interpreted within
>    the context of an IOAM-Namespace and/or node-id if used.  The authors
>    acknowledge that in some operational cases there is a need for the
>    units to be consistent across a packet path through the network,
>    hence it is recommended for implementations to use standard units
>    such as Bytes.
> 
> An existing function (i.e., get_slabinfo) is used to retrieve info about
> skbuff_head_cache. For that, both the prototype of get_slabinfo and the
> struct definition of slabinfo were moved from mm/slab.h to
> include/linux/slab.h. Any objection to this?
> 
> The function kmem_cache_size is used to retrieve the size of a slab
> object. Note that it returns the "object_size" field, not the "size"
> field. If needed, a new function (e.g., kmem_cache_full_size) could be
> added to return the "size" field. To match the definition from the
> draft, the number of bytes is computed as follows:
> 
> slabinfo.active_objs * size
> 
> Thoughts?
> 
>   [1] https://datatracker.ietf.org/doc/html/draft-ietf-ippm-ioam-data#section-5.4.2.12
> 
> Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
> ---
>  include/linux/slab.h | 15 +++++++++++++++
>  mm/slab.h            | 14 --------------
>  net/ipv6/ioam6.c     | 13 ++++++++++++-
>  3 files changed, 27 insertions(+), 15 deletions(-)
> 

This should be 2 patches - one that moves the slabinfo struct and
function across header files, and then the ioam6 change.

[ I agree with Jakub's line of questioning - how useful is this across
nodes with different OS'es and s/w versions. ]
Justin Iurman Dec. 7, 2021, 4:54 p.m. UTC | #6
Hi David,

>> [...]
>>
>> Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
>> ---
>>  include/linux/slab.h | 15 +++++++++++++++
>>  mm/slab.h            | 14 --------------
>>  net/ipv6/ioam6.c     | 13 ++++++++++++-
>>  3 files changed, 27 insertions(+), 15 deletions(-)
>> 
> 
> This should be 2 patches - one that moves the slabinfo struct and
> function across header files, and then the ioam6 change.

Agree. I didn't do it since it's "just" an RFC for now, though.

> [ I agree with Jakub's line of questioning - how useful is this across
> nodes with different OS'es and s/w versions. ]

See my answer to Jakub. Also, as the draft says:

   "The units of this field
   are implementation specific.  Hence, the units are interpreted within
   the context of an IOAM-Namespace and/or node-id if used.  The authors
   acknowledge that in some operational cases there is a need for the
   units to be consistent across a packet path through the network,
   hence it is recommended for implementations to use standard units
   such as Bytes."

Therefore, what we define here is the behavior of the Linux kernel
regarding the way it handles the buffer occupancy for IOAM, and the
meaning/semantics of its value (in bytes). Having different OS'es and
s/w versions across nodes is not a problem; it's up to the operators
to bring more context to their own domains.
Jakub Kicinski Dec. 7, 2021, 5:07 p.m. UTC | #7
On Tue, 7 Dec 2021 17:35:49 +0100 (CET) Justin Iurman wrote:
> > provides the best signal. Since the slab cache scales dynamically
> > (AFAIU) it's not really a big deal if it's full as long as there's
> > memory available on the system.  
> 
> Well, I got the same understanding as you. However, we do not provide a
> value meaning "X percent used" just because it wouldn't make much sense,
> as you pointed out. So I think it is sound to have the current value,
> even if it's quite a dynamic one. Indeed, what's important here is to
> know how many bytes are used, and this is exactly what it does. If a node
> is under heavy load, the value would be extremely high. The operator could
> define a per-node threshold and detect abnormal values.

Hm, reading thru the quoted portion of the standard from the commit
message the semantics of the field are indeed pretty disappointing.
What's the value of defining a field in a standard if it's entirely
implementation specific? Eh.

> We probably want the metadata included for accuracy as well (e.g.,
> kmem_cache_size vs new function kmem_cache_full_size).

Does the standard support carrying arbitrary metadata?

Anyway, in general I personally don't have a good feeling about
implementing this field. Would be good to have a clear user who 
can justify the choice of slab vs something else. Wouldn't modern
deployments use some form of streaming telemetry for nodes within 
the same domain of control? I'm not sure I understand the value
of limited slab info in OAM when there's probably a more powerful
metric collection going on.

Patch 1 makes perfect sense, FWIW.
Justin Iurman Dec. 7, 2021, 6:05 p.m. UTC | #8
On Dec 7, 2021, at 6:07 PM, Jakub Kicinski kuba@kernel.org wrote:
> Hm, reading thru the quoted portion of the standard from the commit
> message the semantics of the field are indeed pretty disappointing.
> What's the value of defining a field in a standard if it's entirely
> implementation specific? Eh.

True. But also keep in mind the scope of IOAM, which is not meant to be
deployed widely on the Internet. It is deployed on limited (aka private)
domains where each node is therefore managed by the operator. So, I'm
not really sure why you think that the implementation-specific thing is
a problem here. Context of "unit" is provided by the IOAM Namespace-ID
attached to the trace, as well as each Node-ID if included. Again, it's
up to the operator to interpret values accordingly, depending on each
node (i.e., the operator has a large and detailed view of his domain; he
knows if the buffer occupancy value "X" is abnormal or not for a
specific node, he knows which unit is used for a specific node, etc.).

>> We probably want the metadata included for accuracy as well (e.g.,
>> kmem_cache_size vs new function kmem_cache_full_size).
> 
> Does the standard support carrying arbitrary metadata?

It says:

  "This field indicates the current status of the occupancy of the
   common buffer pool used by a set of queues."

So, as long as metadata are part of it, I'd say yes it does, since bytes
are allocated for that too. Does it make sense?

> Anyway, in general I personally don't have a good feeling about
> implementing this field. Would be good to have a clear user who
> can justify the choice of slab vs something else. Wouldn't modern
> deployments use some form of streaming telemetry for nodes within
> the same domain of control? I'm not sure I understand the value
> of limited slab info in OAM when there's probably a more powerful
> metric collection going on.

Do you believe this patch does not provide what is defined in the spec?
If so, I'm open to any suggestions.

> Patch 1 makes perfect sense, FWIW.

Thanks for (all) the feedback, Jakub, I appreciate it.
Jakub Kicinski Dec. 8, 2021, 10:18 p.m. UTC | #9
On Tue, 7 Dec 2021 19:05:13 +0100 (CET) Justin Iurman wrote:
> On Dec 7, 2021, at 6:07 PM, Jakub Kicinski kuba@kernel.org wrote:
> > Hm, reading thru the quoted portion of the standard from the commit
> > message the semantics of the field are indeed pretty disappointing.
> > What's the value of defining a field in a standard if it's entirely
> > implementation specific? Eh.  
> 
> True. But also keep in mind the scope of IOAM, which is not meant to be
> deployed widely on the Internet. It is deployed on limited (aka private)
> domains where each node is therefore managed by the operator. So, I'm
> not really sure why you think that the implementation-specific thing is
> a problem here. Context of "unit" is provided by the IOAM Namespace-ID
> attached to the trace, as well as each Node-ID if included. Again, it's
> up to the operator to interpret values accordingly, depending on each
> node (i.e., the operator has a large and detailed view of his domain; he
> knows if the buffer occupancy value "X" is abnormal or not for a
> specific node, he knows which unit is used for a specific node, etc.).

It's quite likely I'm missing the point.

> >> We probably want the metadata included for accuracy as well (e.g.,
> >> kmem_cache_size vs new function kmem_cache_full_size).  
> > 
> > Does the standard support carrying arbitrary metadata?  
> 
> It says:
> 
>   "This field indicates the current status of the occupancy of the
>    common buffer pool used by a set of queues."
> 
> So, as long as metadata are part of it, I'd say yes it does, since bytes
> are allocated for that too. Does it make sense?

Indeed, but see below.

> > Anyway, in general I personally don't have a good feeling about
> > implementing this field. Would be good to have a clear user who
> > can justify the choice of slab vs something else. Wouldn't modern
> > deployments use some form of streaming telemetry for nodes within
> > the same domain of control? I'm not sure I understand the value
> > of limited slab info in OAM when there's probably a more powerful
> > metric collection going on.  
> 
> Do you believe this patch does not provide what is defined in the spec?
> If so, I'm open to any suggestions.

The opposite, in a sense. I think the patch does implement behavior
within a reasonable interpretation of the standard. But the feature
itself seems more useful for forwarding ASICs than Linux routers,
because Linux routers can run a full telemetry stack and all sorts
of advanced SW instrumentation. The use case for reporting kernel
memory use via IOAM's constrained interface does not seem particularly
practical since it's not providing a very strong signal on what's 
going on.

For switches running Linux, the switch ASIC buffer occupancy can be read
via devlink-sb, which would seem like a better fit to me, but unfortunately
the devlink calls can sleep so we can't read such device info from the
datapath.

> > Patch 1 makes perfect sense, FWIW.  
> 
> Thanks for (all) the feedback, Jakub, I appreciate it.
Justin Iurman Dec. 9, 2021, 2:10 p.m. UTC | #10
>> True. But also keep in mind the scope of IOAM, which is not meant to be
>> deployed widely on the Internet. It is deployed on limited (aka private)
>> domains where each node is therefore managed by the operator. So, I'm
>> not really sure why you think that the implementation-specific thing is
>> a problem here. Context of "unit" is provided by the IOAM Namespace-ID
>> attached to the trace, as well as each Node-ID if included. Again, it's
>> up to the operator to interpret values accordingly, depending on each
>> node (i.e., the operator has a large and detailed view of his domain; he
>> knows if the buffer occupancy value "X" is abnormal or not for a
>> specific node, he knows which unit is used for a specific node, etc.).
> 
> It's quite likely I'm missing the point.

Let me try again to make it all clear in your mind. Here are some quoted
paragraphs from the spec:

  "Generic data: Format-free information where syntax and semantic of
   the information is defined by the operator in a specific
   deployment.  For a specific IOAM-Namespace, all IOAM nodes have to
   interpret the generic data the same way.  Examples for generic
   IOAM data include geo-location information (location of the node
   at the time the packet was processed), buffer queue fill level or
   cache fill level at the time the packet was processed, or even a
   battery charge level."

This one basically says that the IOAM Namespace-ID (in the IOAM Trace
Option-Type header) is responsible for providing context to data fields
(i.e., for "units" too, in case of generic fields such as queue depth or
buffer occupancy). So it's up to the operator to gather similar nodes
within the same IOAM Namespace. And, even if "different" kinds of nodes are
within an IOAM Namespace, you still have a fallback solution if Node IDs
are part of the trace (the "hop-lim & node-id" data field, bit 0 in the
trace type). Indeed, the operator (or the collector/interpreter) knows if
node A uses "bytes" or any other unit for a generic data field.

  "It should be noted that the semantics of some of the node data fields
   that are defined below, such as the queue depth and buffer occupancy,
   are implementation specific.  This approach is intended to allow IOAM
   nodes with various different architectures."

The last sentence is important here and is, in fact, related to what you
describe below. Keeping the units generic for such data fields allows
for supporting multiple architectures. Same idea for the following
paragraph:

  "Data fields and associated data types for each of the IOAM-Data-
   Fields are specified in the following sections.  The definition of
   IOAM-Data-Fields focuses on the syntax of the data-fields and avoids
   specifying the semantics where feasible.  This is why no units are
   defined for data-fields like e.g., "buffer occupancy" or "queue
   depth".  With this approach, nodes can supply the information in
   their native format and are not required to perform unit or format
   conversions.  Systems that further process IOAM information, like
   e.g., a network management system are assumed to also handle unit
   conversions as part of their IOAM data-fields processing.  The
   combination of a particular data-field and the namespace-id provides
   for the context to interpret the provided data appropriately."

Does it make more sense now why it's not really a problem to have
implementation-specific units for such data fields?

>> [...]
>>
>> Do you believe this patch does not provide what is defined in the spec?
>> If so, I'm open to any suggestions.
> 
> The opposite, in a sense. I think the patch does implement behavior
> within a reasonable interpretation of the standard. But the feature
> itself seems more useful for forwarding ASICs than Linux routers,

Good point. Actually, that's probably why such a data field was defined
the way it is.

> because Linux routers can run a full telemetry stack and all sorts
> of advanced SW instrumentation. The use case for reporting kernel
> memory use via IOAM's constrained interface does not seem particularly
> practical since it's not providing a very strong signal on what's
> going on.

I agree and disagree. I disagree because this value definitely tells you
that something (potentially bad) is going on, when it increases
significantly enough to reach a critical threshold. Basically, we need
more skb's, but oh, the pool is exhausted. OK, not a problem, expand the
pool. Oh wait, no memory left. Why? Is it only due to too much
(temporary?) load? Should I put the blame on the NIC? Is it a memory
issue? Is it something else? Or maybe several issues combined? Well, you
might not know exactly why (though you know there is a problem), which is
also why I agree with you. But, this is also why you have other data
fields available (i.e., detecting a problem might require 2+ symptoms
instead of just one).

> For switches running Linux, the switch ASIC buffer occupancy can be read
> via devlink-sb, which would seem like a better fit to me, but unfortunately
> the devlink calls can sleep so we can't read such device info from the
> datapath.

Indeed, would be a better fit. I didn't know about this one, thanks for
that. It's a shame it can't be used in this context, though. But, at the
end of the day, we're left with nothing regarding buffer occupancy. So
I'm wondering whether "something" is better than "nothing" in this case.
And, for that, we're back to my previous answer on why I agree and
disagree with what you said about its utility.
Jakub Kicinski Dec. 10, 2021, 12:38 a.m. UTC | #11
On Thu, 9 Dec 2021 15:10:24 +0100 (CET) Justin Iurman wrote:
> > because Linux routers can run a full telemetry stack and all sorts
> > of advanced SW instrumentation. The use case for reporting kernel
> > memory use via IOAM's constrained interface does not seem particularly
> > practical since it's not providing a very strong signal on what's
> > going on.  
> 
> I agree and disagree. I disagree because this value definitely tells you
> that something (potentially bad) is going on, when it increases
> significantly enough to reach a critical threshold. Basically, we need
> more skb's, but oh, the pool is exhausted. OK, not a problem, expand the
> pool. Oh wait, no memory left. Why? Is it only due to too much
> (temporary?) load? Should I put the blame on the NIC? Is it a memory
> issue? Is it something else? Or maybe several issues combined? Well, you
> might not know exactly why (though you know there is a problem), which is
> also why I agree with you. But, this is also why you have other data
> fields available (i.e., detecting a problem might require 2+ symptoms
> instead of just one).
> 
> > For switches running Linux, the switch ASIC buffer occupancy can be read
> > via devlink-sb, which would seem like a better fit to me, but unfortunately
> > the devlink calls can sleep so we can't read such device info from the
> > datapath.
> 
> Indeed, would be a better fit. I didn't know about this one, thanks for
> that. It's a shame it can't be used in this context, though. But, at the
> end of the day, we're left with nothing regarding buffer occupancy. So
> I'm wondering whether "something" is better than "nothing" in this case.
> And, for that, we're back to my previous answer on why I agree and
> disagree with what you said about its utility.

I think we're on the same page, the main problem is I've not seen
anyone use the skbuff_head_cache occupancy as a signal in practice.

I'm adding a bunch of people to the CC list, hopefully someone has
an opinion one way or the other.

Lore link to the full thread, FWIW:

https://lore.kernel.org/all/20211206211758.19057-1-justin.iurman@uliege.be/
Justin Iurman Dec. 10, 2021, 12:57 p.m. UTC | #12
>> Indeed, would be a better fit. I didn't know about this one, thanks for
>> that. It's a shame it can't be used in this context, though. But, at the
>> end of the day, we're left with nothing regarding buffer occupancy. So
>> I'm wondering whether "something" is better than "nothing" in this case.
>> And, for that, we're back to my previous answer on why I agree and
>> disagree with what you said about its utility.
> 
> I think we're on the same page, the main problem is I've not seen
> anyone use the skbuff_head_cache occupancy as a signal in practice.

Indeed.

> I'm adding a bunch of people to the CC list, hopefully someone has
> an opinion one way or the other.

+1, thanks Jakub.

> Lore link to the full thread, FWIW:
> 
> https://lore.kernel.org/all/20211206211758.19057-1-justin.iurman@uliege.be/
Justin Iurman Dec. 21, 2021, 5:06 p.m. UTC | #13
On Dec 10, 2021, at 1:38 AM, Jakub Kicinski kuba@kernel.org wrote:
> [...]
> I think we're on the same page, the main problem is I've not seen
> anyone use the skbuff_head_cache occupancy as a signal in practice.
> 
> I'm adding a bunch of people to the CC list, hopefully someone has
> an opinion one way or the other.

It looks like we won't have more opinions on that, unfortunately.

@Jakub - Should I submit it as a PATCH and see if we receive more
feedback there?
Vladimir Oltean Dec. 21, 2021, 5:23 p.m. UTC | #14
On Tue, Dec 21, 2021 at 06:06:39PM +0100, Justin Iurman wrote:
> On Dec 10, 2021, at 1:38 AM, Jakub Kicinski kuba@kernel.org wrote:
> > [...]
> > I think we're on the same page, the main problem is I've not seen
> > anyone use the skbuff_head_cache occupancy as a signal in practice.
> > 
> > I'm adding a bunch of people to the CC list, hopefully someone has
> > an opinion one way or the other.
> 
> It looks like we won't have more opinions on that, unfortunately.
> 
> @Jakub - Should I submit it as a PATCH and see if we receive more
> feedback there?

I know nothing about OAM and therefore did not want to comment, but I
think the point raised about the metric you propose being irrelevant in
the context of offloaded data paths is quite important. The "devlink-sb"
proposal was dismissed very quickly on grounds of requiring sleepable
context, is that a deal breaker, and if it is, why? Not only offloaded
interfaces like switches/routers can report buffer occupancy. Plain NICs
also have buffer pools, DMA RX/TX rings, MAC FIFOs, etc, that could
indicate congestion or otherwise high load. Maybe slab information could
be relevant, for lack of a better option, on virtual interfaces, but if
they're physical, why limit ourselves to reporting that? The IETF draft
you present says "This field indicates the current status of the
occupancy of the common buffer pool used by a set of queues." It appears
to me that we could try to get a reporting that has better granularity
(per interface, per queue) than just something based on
skbuff_head_cache. What if someone needs that finer granularity in
the future?
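
For comparison, the kind of per-queue signal suggested here is close to
what the queue depth field (patch 1 of this series) is meant to expose
from the egress qdisc. A simplified sketch of that style of per-queue
lookup, assuming RCU read-side context as in the IOAM datapath, with
error handling omitted:

/* Sketch: backlog (in bytes) of the egress qdisc for the TX queue
 * this skb maps to on @dev. Assumes an RCU read-side critical section.
 */
static u32 tx_queue_backlog_bytes(struct sk_buff *skb, struct net_device *dev)
{
	struct netdev_queue *queue = skb_get_tx_queue(dev, skb);
	struct Qdisc *qdisc = rcu_dereference(queue->qdisc);
	__u32 qlen, backlog;

	qdisc_qstats_qlen_backlog(qdisc, &qlen, &backlog);
	return backlog;
}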
Jakub Kicinski Dec. 21, 2021, 8:13 p.m. UTC | #15
On Tue, 21 Dec 2021 19:23:37 +0200 Vladimir Oltean wrote:
> On Tue, Dec 21, 2021 at 06:06:39PM +0100, Justin Iurman wrote:
> > On Dec 10, 2021, at 1:38 AM, Jakub Kicinski kuba@kernel.org wrote:  
> > > I think we're on the same page, the main problem is I've not seen
> > > anyone use the skbuff_head_cache occupancy as a signal in practice.
> > > 
> > > I'm adding a bunch of people to the CC list, hopefully someone has
> > > an opinion one way or the other.  
> > 
> > It looks like we won't have more opinions on that, unfortunately.
> > 
> > @Jakub - Should I submit it as a PATCH and see if we receive more
> > feedback there?  
> 
> I know nothing about OAM and therefore did not want to comment, but I
> think the point raised about the metric you propose being irrelevant in
> the context of offloaded data paths is quite important. The "devlink-sb"
> proposal was dismissed very quickly on grounds of requiring sleepable
> context, is that a deal breaker, and if it is, why? Not only offloaded
> interfaces like switches/routers can report buffer occupancy. Plain NICs
> also have buffer pools, DMA RX/TX rings, MAC FIFOs, etc, that could
> indicate congestion or otherwise high load. Maybe slab information could
> be relevant, for lack of a better option, on virtual interfaces, but if
> they're physical, why limit ourselves to reporting that? The IETF draft
> you present says "This field indicates the current status of the
> occupancy of the common buffer pool used by a set of queues." It appears
> to me that we could try to get a reporting that has better granularity
> (per interface, per queue) than just something based on
> skbuff_head_cache. What if someone needs that finer granularity in
> the future?

Indeed.

In my experience finding meaningful metrics is hard; the chances that
something that seems useful on the surface actually provides meaningful
signal in deployments is a lot lower than one may expect. And the
commit message reads as if the objective was checking a box in the
implemented IOAM metrics, rather than exporting relevant information.

We can do a roll call on people CCed but I read their silence as nobody
thinks this metric is useful. Is there any experimental data you can
point to which proves the signal strength?
Justin Iurman Dec. 22, 2021, 3:49 p.m. UTC | #16
On Dec 21, 2021, at 6:23 PM, Vladimir Oltean olteanv@gmail.com wrote:
> I know nothing about OAM and therefore did not want to comment, but I

NP, all opinions are more than welcome.

> think the point raised about the metric you propose being irrelevant in
> the context of offloaded data paths is quite important. The "devlink-sb"
> proposal was dismissed very quickly on grounds of requiring sleepable
> context, is that a deal breaker, and if it is, why? Not only offloaded

Can't sleep in the datapath.

> interfaces like switches/routers can report buffer occupancy. Plain NICs
> also have buffer pools, DMA RX/TX rings, MAC FIFOs, etc, that could
> indicate congestion or otherwise high load. Maybe slab information could

Indeed. Is there any API to retrieve such a metric? Anyway, that would
probably involve (again) a sleepable context.

> be relevant, for lack of a better option, on virtual interfaces, but if
> they're physical, why limit ourselves on reporting that? The IETF draft
> you present says "This field indicates the current status of the
> occupancy of the common buffer pool used by a set of queues." It appears
> to me that we could try to get a reporting that has better granularity
> (per interface, per queue) than just something based on
> skbuff_head_cache. What if someone will need that finer granularity in
> the future.

I think we all agree (Jakub, you, and I) on this point. The thing is,
what would be a better, generic solution that still makes sense, instead
of just nothing? Is it actually feasible at all?
Justin Iurman Dec. 22, 2021, 4:13 p.m. UTC | #17
On Dec 21, 2021, at 9:13 PM, Jakub Kicinski kuba@kernel.org wrote:
> On Tue, 21 Dec 2021 19:23:37 +0200 Vladimir Oltean wrote:
>> On Tue, Dec 21, 2021 at 06:06:39PM +0100, Justin Iurman wrote:
>> > On Dec 10, 2021, at 1:38 AM, Jakub Kicinski kuba@kernel.org wrote:
>> > > I think we're on the same page, the main problem is I've not seen
>> > > anyone use the skbuff_head_cache occupancy as a signal in practice.
>> > > 
>> > > I'm adding a bunch of people to the CC list, hopefully someone has
>> > > an opinion one way or the other.
>> > 
>> > It looks like we won't have more opinions on that, unfortunately.
>> > 
>> > @Jakub - Should I submit it as a PATCH and see if we receive more
>> > feedback there?
>> 
>> I know nothing about OAM and therefore did not want to comment, but I
>> think the point raised about the metric you propose being irrelevant in
>> the context of offloaded data paths is quite important. The "devlink-sb"
>> proposal was dismissed very quickly on grounds of requiring sleepable
>> context, is that a deal breaker, and if it is, why? Not only offloaded
>> interfaces like switches/routers can report buffer occupancy. Plain NICs
>> also have buffer pools, DMA RX/TX rings, MAC FIFOs, etc, that could
>> indicate congestion or otherwise high load. Maybe slab information could
>> be relevant, for lack of a better option, on virtual interfaces, but if
>> they're physical, why limit ourselves to reporting that? The IETF draft
>> you present says "This field indicates the current status of the
>> occupancy of the common buffer pool used by a set of queues." It appears
>> to me that we could try to get a reporting that has better granularity
>> (per interface, per queue) than just something based on
>> skbuff_head_cache. What if someone needs that finer granularity in
>> the future?
> 
> Indeed.
> 
> In my experience finding meaningful metrics is hard; the chances that
> something that seems useful on the surface actually provides meaningful
> signal in deployments is a lot lower than one may expect. And the

True.

> commit message reads as if the objective was checking a box in the
> implemented IOAM metrics, rather than exporting relevant information.

Indeed, but not only. I sent this patchset as a Request for Comments to
see if it was correct and relevant. I mean, if there is no consensus on
this, I'll keep this data field as not supported, not a big deal. But
it would obviously be good to have it at some point (as long as what we
retrieve makes sense enough, and for all cases).

> We can do a roll call on people CCed but I read their silence as nobody

I thought that silence meant consent. That's why more opinions would be
welcome, even if we seem to be converging; not only opinions, but also
any idea or guidance toward a better solution, if any.

> thinks this metric is useful. Is there any experimental data you can
> point to which proves the signal strength?

Apart from the fact that I monitored the metric during normal and
high-load situations, honestly no. Values made sense during
those tests, though.

Patch

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 181045148b06..745790dbcbcb 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -19,6 +19,21 @@ 
 #include <linux/percpu-refcount.h>
 
 
+struct slabinfo {
+	unsigned long active_objs;
+	unsigned long num_objs;
+	unsigned long active_slabs;
+	unsigned long num_slabs;
+	unsigned long shared_avail;
+	unsigned int limit;
+	unsigned int batchcount;
+	unsigned int shared;
+	unsigned int objects_per_slab;
+	unsigned int cache_order;
+};
+
+void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo);
+
 /*
  * Flags to pass to kmem_cache_create().
  * The ones marked DEBUG are only valid if CONFIG_DEBUG_SLAB is set.
diff --git a/mm/slab.h b/mm/slab.h
index 56ad7eea3ddf..cd6a8a2768e3 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -175,20 +175,6 @@  void slab_kmem_cache_release(struct kmem_cache *);
 struct seq_file;
 struct file;
 
-struct slabinfo {
-	unsigned long active_objs;
-	unsigned long num_objs;
-	unsigned long active_slabs;
-	unsigned long num_slabs;
-	unsigned long shared_avail;
-	unsigned int limit;
-	unsigned int batchcount;
-	unsigned int shared;
-	unsigned int objects_per_slab;
-	unsigned int cache_order;
-};
-
-void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo);
 void slabinfo_show_stats(struct seq_file *m, struct kmem_cache *s);
 ssize_t slabinfo_write(struct file *file, const char __user *buffer,
 		       size_t count, loff_t *ppos);
diff --git a/net/ipv6/ioam6.c b/net/ipv6/ioam6.c
index 088eb2f877bc..f0a44dc2a0df 100644
--- a/net/ipv6/ioam6.c
+++ b/net/ipv6/ioam6.c
@@ -14,6 +14,8 @@ 
 #include <linux/ioam6_genl.h>
 #include <linux/rhashtable.h>
 #include <linux/netdevice.h>
+#include <linux/slab.h>
+#include <linux/skbuff.h>
 
 #include <net/addrconf.h>
 #include <net/genetlink.h>
@@ -778,7 +780,16 @@  static void __ioam6_fill_trace_data(struct sk_buff *skb,
 
 	/* buffer occupancy */
 	if (trace->type.bit11) {
-		*(__be32 *)data = cpu_to_be32(IOAM6_U32_UNAVAILABLE);
+		struct slabinfo sinfo;
+		u32 size, bytes;
+
+		sinfo.active_objs = 0;
+		get_slabinfo(skbuff_head_cache, &sinfo);
+		size = kmem_cache_size(skbuff_head_cache);
+
+		bytes = min(sinfo.active_objs * size, (unsigned long)(U32_MAX-1));
+
+		*(__be32 *)data = cpu_to_be32(bytes);
 		data += sizeof(__be32);
 	}