Message ID | 20211206211758.19057-3-justin.iurman@uliege.be (mailing list archive)
---|---
State | RFC
Delegated to: | Netdev Maintainers
Series | IOAM queue depth and buffer occupancy
On Mon, 6 Dec 2021 22:17:58 +0100 Justin Iurman wrote: > This patch is an attempt to support the buffer occupancy in IOAM trace > data fields. Any feedback is appreciated, or any other idea if this one > is not correct. > > The draft [1] says the following: > > The "buffer occupancy" field is a 4-octet unsigned integer field. > This field indicates the current status of the occupancy of the > common buffer pool used by a set of queues. The units of this field > are implementation specific. Hence, the units are interpreted within > the context of an IOAM-Namespace and/or node-id if used. The authors > acknowledge that in some operational cases there is a need for the > units to be consistent across a packet path through the network, > hence it is recommended for implementations to use standard units > such as Bytes. > > An existing function (i.e., get_slabinfo) is used to retrieve info about > skbuff_head_cache. For that, both the prototype of get_slabinfo and > struct definition of slabinfo were moved from mm/slab.h to > include/linux/slab.h. Any objection on this? > > The function kmem_cache_size is used to retrieve the size of a slab > object. Note that it returns the "object_size" field, not the "size" > field. If needed, a new function (e.g., kmem_cache_full_size) could be > added to return the "size" field. To match the definition from the > draft, the number of bytes is computed as follows: > > slabinfo.active_objs * size > > Thoughts? Implementing the standard is one thing but how useful is this in practice?
Hi Jakub, >> [...] >> >> The function kmem_cache_size is used to retrieve the size of a slab >> object. Note that it returns the "object_size" field, not the "size" >> field. If needed, a new function (e.g., kmem_cache_full_size) could be >> added to return the "size" field. To match the definition from the >> draft, the number of bytes is computed as follows: >> >> slabinfo.active_objs * size >> >> Thoughts? > > Implementing the standard is one thing but how useful is this > in practice? IMHO, very useful. To be honest, if I were to implement only a few data fields, these two would both be included. Take the example of CLT [1] where the queue length data field is used to detect low-level issues from inside an L5-7 distributed tracing tool. And this is just one example among many others. The queue length data field is very specific to TX queues, but we could also use the buffer occupancy data field to detect more global load on a node. Actually, the goal for operators running their IOAM domain is to quickly detect a problem along a path and react accordingly (human or automatic action). For example, if you monitor TX queues along a path and detect an increasing queue on a router, you could choose to, e.g., rebalance its queues. With the buffer occupancy, you could detect highly loaded nodes in general and, e.g., rebalance traffic to another branch. Again, this is just one example among others. Apart from more accurate ECMPs, you could for instance deploy a smart (micro)service selection based on different metrics, etc. [1] https://github.com/Advanced-Observability/cross-layer-telemetry
On Tue, 7 Dec 2021 12:54:04 +0100 (CET) Justin Iurman wrote: > >> The function kmem_cache_size is used to retrieve the size of a slab > >> object. Note that it returns the "object_size" field, not the "size" > >> field. If needed, a new function (e.g., kmem_cache_full_size) could be > >> added to return the "size" field. To match the definition from the > >> draft, the number of bytes is computed as follows: > >> > >> slabinfo.active_objs * size > > > > Implementing the standard is one thing but how useful is this > > in practice? > > IMHO, very useful. To be honest, if I were to implement only a few data > fields, these two would be both included. Take the example of CLT [1] > where the queue length data field is used to detect low-level issues > from inside a L5-7 distributed tracing tool. And this is just one > example among many others. The queue length data field is very specific > to TX queues, but we could also use the buffer occupancy data field to > detect more global loads on a node. Actually, the goal for operators > running their IOAM domain is to quickly detect a problem along a path > and react accordingly (human or automatic action). For example, if you > monitor TX queues along a path and detect an increasing queue on a > router, you could choose to, e.g., rebalance its queues. With the > buffer occupancy, you could detect high-loaded nodes in general and, > e.g., rebalance traffic to another branch. Again, this is just one > example among others. Apart from more accurate ECMPs, you could for > instance deploy a smart (micro)service selection based on different > metrics, etc. > > [1] https://github.com/Advanced-Observability/cross-layer-telemetry Ack, my question was more about whether the metric as implemented provides the best signal. Since the slab cache scales dynamically (AFAIU) it's not really a big deal if it's full as long as there's memory available on the system.
On Dec 7, 2021, at 4:50 PM, Jakub Kicinski kuba@kernel.org wrote: > On Tue, 7 Dec 2021 12:54:04 +0100 (CET) Justin Iurman wrote: >> >> The function kmem_cache_size is used to retrieve the size of a slab >> >> object. Note that it returns the "object_size" field, not the "size" >> >> field. If needed, a new function (e.g., kmem_cache_full_size) could be >> >> added to return the "size" field. To match the definition from the >> >> draft, the number of bytes is computed as follows: >> >> >> >> slabinfo.active_objs * size >> > >> > Implementing the standard is one thing but how useful is this >> > in practice? >> >> IMHO, very useful. To be honest, if I were to implement only a few data >> fields, these two would both be included. Take the example of CLT [1] >> where the queue length data field is used to detect low-level issues >> from inside an L5-7 distributed tracing tool. And this is just one >> example among many others. The queue length data field is very specific >> to TX queues, but we could also use the buffer occupancy data field to >> detect more global load on a node. Actually, the goal for operators >> running their IOAM domain is to quickly detect a problem along a path >> and react accordingly (human or automatic action). For example, if you >> monitor TX queues along a path and detect an increasing queue on a >> router, you could choose to, e.g., rebalance its queues. With the >> buffer occupancy, you could detect highly loaded nodes in general and, >> e.g., rebalance traffic to another branch. Again, this is just one >> example among others. Apart from more accurate ECMPs, you could for >> instance deploy a smart (micro)service selection based on different >> metrics, etc. >> >> [1] https://github.com/Advanced-Observability/cross-layer-telemetry > > Ack, my question was more about whether the metric as implemented Oh, sorry about that. > provides the best signal. Since the slab cache scales dynamically > (AFAIU) it's not really a big deal if it's full as long as there's > memory available on the system. Well, I got the same understanding as you. However, we do not provide a value meaning "X percent used" just because it wouldn't make much sense, as you pointed out. So I think it is sound to have the current value, even if it's quite a dynamic one. Indeed, what's important here is to know how many bytes are used and this is exactly what it does. If a node is under heavy load, the value would be extremely high. The operator could define a per-node threshold and detect abnormal values. We probably want the metadata included for accuracy as well (e.g., kmem_cache_size vs. a new function kmem_cache_full_size).
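To make the per-node threshold idea above concrete, here is a minimal collector-side sketch in C. It is illustrative only: the node IDs, threshold values, and helper names are hypothetical, not part of the patch or of any IOAM tooling.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-node thresholds (in bytes), configured by an operator
 * who knows what "normal" buffer occupancy looks like for each node. */
struct node_threshold {
	uint32_t node_id;
	uint32_t max_buffer_bytes;
};

static const struct node_threshold thresholds[] = {
	{ .node_id = 1, .max_buffer_bytes = 8u * 1024 * 1024 },
	{ .node_id = 2, .max_buffer_bytes = 2u * 1024 * 1024 },
};

/* Compare a buffer occupancy value from an IOAM trace entry against the
 * configured threshold for that node, and report an anomaly if exceeded. */
static void check_buffer_occupancy(uint32_t node_id, uint32_t buffer_bytes)
{
	size_t i;

	for (i = 0; i < sizeof(thresholds) / sizeof(thresholds[0]); i++) {
		if (thresholds[i].node_id == node_id &&
		    buffer_bytes > thresholds[i].max_buffer_bytes)
			printf("node %u: abnormal buffer occupancy (%u bytes)\n",
			       (unsigned)node_id, (unsigned)buffer_bytes);
	}
}

int main(void)
{
	/* Example: node 2 reports 3 MiB of skb memory in use. */
	check_buffer_occupancy(2, 3u * 1024 * 1024);
	return 0;
}
```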
On 12/6/21 2:17 PM, Justin Iurman wrote: > This patch is an attempt to support the buffer occupancy in IOAM trace > data fields. Any feedback is appreciated, or any other idea if this one > is not correct. > > The draft [1] says the following: > > The "buffer occupancy" field is a 4-octet unsigned integer field. > This field indicates the current status of the occupancy of the > common buffer pool used by a set of queues. The units of this field > are implementation specific. Hence, the units are interpreted within > the context of an IOAM-Namespace and/or node-id if used. The authors > acknowledge that in some operational cases there is a need for the > units to be consistent across a packet path through the network, > hence it is recommended for implementations to use standard units > such as Bytes. > > An existing function (i.e., get_slabinfo) is used to retrieve info about > skbuff_head_cache. For that, both the prototype of get_slabinfo and > struct definition of slabinfo were moved from mm/slab.h to > include/linux/slab.h. Any objection on this? > > The function kmem_cache_size is used to retrieve the size of a slab > object. Note that it returns the "object_size" field, not the "size" > field. If needed, a new function (e.g., kmem_cache_full_size) could be > added to return the "size" field. To match the definition from the > draft, the number of bytes is computed as follows: > > slabinfo.active_objs * size > > Thoughts? > > [1] https://datatracker.ietf.org/doc/html/draft-ietf-ippm-ioam-data#section-5.4.2.12 > > Signed-off-by: Justin Iurman <justin.iurman@uliege.be> > --- > include/linux/slab.h | 15 +++++++++++++++ > mm/slab.h | 14 -------------- > net/ipv6/ioam6.c | 13 ++++++++++++- > 3 files changed, 27 insertions(+), 15 deletions(-) > this should be 2 patches - one that moves the slabinfo struct and function across header files and then the ioam6 change. [ I agree with Jakub's line of questioning - how useful is this across nodes with different OS'es and s/w versions. ]
Hi David, >> [...] >> >> Signed-off-by: Justin Iurman <justin.iurman@uliege.be> >> --- >> include/linux/slab.h | 15 +++++++++++++++ >> mm/slab.h | 14 -------------- >> net/ipv6/ioam6.c | 13 ++++++++++++- >> 3 files changed, 27 insertions(+), 15 deletions(-) >> > > this should be 2 patches - one that moves the slabinfo struct and > function across header files and then the ioam6 change. Agree. I didn't do it since it's "just" an RFC for now though. > [ I agree with Jakub's line of questioning - how useful is this across > nodes with different OS'es and s/w versions. ] See my answer to Jakub. Also, as the draft says: "The units of this field are implementation specific. Hence, the units are interpreted within the context of an IOAM-Namespace and/or node-id if used. The authors acknowledge that in some operational cases there is a need for the units to be consistent across a packet path through the network, hence it is recommended for implementations to use standard units such as Bytes." Therefore, what we define here is the behavior of the Linux kernel regarding the way it handles the buffer occupancy for IOAM, and the meaning/semantic of its value (in bytes). Having different OS'es and s/w versions across nodes is not a problem; it's up to the operators to bring more context to their own domains.
On Tue, 7 Dec 2021 17:35:49 +0100 (CET) Justin Iurman wrote: > > provides the best signal. Since the slab cache scales dynamically > > (AFAIU) it's not really a big deal if it's full as long as there's > > memory available on the system. > > Well, I got the same understanding as you. However, we do not provide a > value meaning "X percent used" just because it wouldn't make much sense, > as you pointed out. So I think it is sound to have the current value, > even if it's quite a dynamic one. Indeed, what's important here is to > know how many bytes are used and this is exactly what it does. If a node > is under heavy load, the value would be extremely high. The operator could > define a per-node threshold and detect abnormal values. Hm, reading thru the quoted portion of the standard from the commit message, the semantics of the field are indeed pretty disappointing. What's the value of defining a field in a standard if it's entirely implementation specific? Eh. > We probably want the metadata included for accuracy as well (e.g., > kmem_cache_size vs. a new function kmem_cache_full_size). Does the standard support carrying arbitrary metadata? Anyway, in general I personally don't have a good feeling about implementing this field. Would be good to have a clear user who can justify the choice of slab vs something else. Wouldn't modern deployments use some form of streaming telemetry for nodes within the same domain of control? I'm not sure I understand the value of limited slab info in OAM when there's probably a more powerful metric collection going on. Patch 1 makes perfect sense, FWIW.
On Dec 7, 2021, at 6:07 PM, Jakub Kicinski kuba@kernel.org wrote: > Hm, reading thru the quoted portion of the standard from the commit > message, the semantics of the field are indeed pretty disappointing. > What's the value of defining a field in a standard if it's entirely > implementation specific? Eh. True. But also keep in mind the scope of IOAM, which is not to be deployed widely on the Internet. It is deployed on limited (aka private) domains where each node is therefore managed by the operator. So, I'm not really sure why you think the implementation-specific aspect is a problem here. Context of "unit" is provided by the IOAM Namespace-ID attached to the trace, as well as each Node-ID if included. Again, it's up to the operator to interpret values accordingly, depending on each node (i.e., the operator has a large and detailed view of his domain; he knows if the buffer occupancy value "X" is abnormal or not for a specific node, he knows which unit is used for a specific node, etc). >> We probably want the metadata included for accuracy as well (e.g., >> kmem_cache_size vs. a new function kmem_cache_full_size). > > Does the standard support carrying arbitrary metadata? It says: "This field indicates the current status of the occupancy of the common buffer pool used by a set of queues." So, as long as metadata are part of it, I'd say yes it does, since bytes are allocated for that too. Does it make sense? > Anyway, in general I personally don't have a good feeling about > implementing this field. Would be good to have a clear user who > can justify the choice of slab vs something else. Wouldn't modern > deployments use some form of streaming telemetry for nodes within > the same domain of control? I'm not sure I understand the value > of limited slab info in OAM when there's probably a more powerful > metric collection going on. Do you believe this patch does not provide what is defined in the spec? If so, I'm open to any suggestions. > Patch 1 makes perfect sense, FWIW. Thanks for (all) the feedback, Jakub, I appreciate it.
On Tue, 7 Dec 2021 19:05:13 +0100 (CET) Justin Iurman wrote: > On Dec 7, 2021, at 6:07 PM, Jakub Kicinski kuba@kernel.org wrote: > > Hm, reading thru the quoted portion of the standard from the commit > > message, the semantics of the field are indeed pretty disappointing. > > What's the value of defining a field in a standard if it's entirely > > implementation specific? Eh. > > True. But also keep in mind the scope of IOAM, which is not to be > deployed widely on the Internet. It is deployed on limited (aka private) > domains where each node is therefore managed by the operator. So, I'm > not really sure why you think the implementation-specific aspect is > a problem here. Context of "unit" is provided by the IOAM Namespace-ID > attached to the trace, as well as each Node-ID if included. Again, it's > up to the operator to interpret values accordingly, depending on each > node (i.e., the operator has a large and detailed view of his domain; he > knows if the buffer occupancy value "X" is abnormal or not for a > specific node, he knows which unit is used for a specific node, etc). It's quite likely I'm missing the point. > >> We probably want the metadata included for accuracy as well (e.g., > >> kmem_cache_size vs. a new function kmem_cache_full_size). > > > > Does the standard support carrying arbitrary metadata? > > It says: > > "This field indicates the current status of the occupancy of the > common buffer pool used by a set of queues." > > So, as long as metadata are part of it, I'd say yes it does, since bytes > are allocated for that too. Does it make sense? Indeed, but see below. > > Anyway, in general I personally don't have a good feeling about > > implementing this field. Would be good to have a clear user who > > can justify the choice of slab vs something else. Wouldn't modern > > deployments use some form of streaming telemetry for nodes within > > the same domain of control? I'm not sure I understand the value > > of limited slab info in OAM when there's probably a more powerful > > metric collection going on. > > Do you believe this patch does not provide what is defined in the spec? > If so, I'm open to any suggestions. The opposite, in a sense. I think the patch does implement behavior within a reasonable interpretation of the standard. But the feature itself seems more useful for forwarding ASICs than Linux routers, because Linux routers can run a full telemetry stack and all sorts of advanced SW instrumentation. The use case for reporting kernel memory use via IOAM's constrained interface does not seem particularly practical since it's not providing a very strong signal on what's going on. For switches running Linux, the switch ASIC buffer occupancy can be read via devlink-sb, which would seem like a better fit to me, but unfortunately the devlink calls can sleep so we can't read such device info from the datapath. > > Patch 1 makes perfect sense, FWIW. > > Thanks for (all) the feedback, Jakub, I appreciate it.
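For context on the devlink-sb point: the shared-buffer occupancy hooks live in struct devlink_ops and are driver callbacks invoked from sleepable netlink request context, typically issuing firmware or register reads. The excerpt below is approximate, paraphrased from include/net/devlink.h of that era; exact signatures vary across kernel versions.

```c
/* Approximate excerpt of the shared-buffer occupancy callbacks in
 * struct devlink_ops (include/net/devlink.h). Drivers usually talk to
 * firmware or device registers here, so these callbacks may sleep,
 * which is why they cannot be invoked from the packet datapath. */
struct devlink_ops {
	/* ... */
	int (*sb_occ_snapshot)(struct devlink *devlink,
			       unsigned int sb_index);
	int (*sb_occ_max_clear)(struct devlink *devlink,
				unsigned int sb_index);
	int (*sb_occ_port_pool_get)(struct devlink_port *devlink_port,
				    unsigned int sb_index, u16 pool_index,
				    u32 *p_cur, u32 *p_max);
	/* ... */
};
```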
>> True. But also keep in mind the scope of IOAM, which is not to be >> deployed widely on the Internet. It is deployed on limited (aka private) >> domains where each node is therefore managed by the operator. So, I'm >> not really sure why you think the implementation-specific aspect is >> a problem here. Context of "unit" is provided by the IOAM Namespace-ID >> attached to the trace, as well as each Node-ID if included. Again, it's >> up to the operator to interpret values accordingly, depending on each >> node (i.e., the operator has a large and detailed view of his domain; he >> knows if the buffer occupancy value "X" is abnormal or not for a >> specific node, he knows which unit is used for a specific node, etc). > > It's quite likely I'm missing the point. Let me try again to make it clear. Here are some quoted paragraphs from the spec: "Generic data: Format-free information where syntax and semantic of the information is defined by the operator in a specific deployment. For a specific IOAM-Namespace, all IOAM nodes have to interpret the generic data the same way. Examples for generic IOAM data include geo-location information (location of the node at the time the packet was processed), buffer queue fill level or cache fill level at the time the packet was processed, or even a battery charge level." This one basically says that the IOAM Namespace-ID (in the IOAM Trace Option-Type header) is responsible for providing context to data fields (i.e., for "units" too, in case of generic fields such as queue depth or buffer occupancy). So it's up to the operator to gather similar nodes within the same IOAM Namespace. And, even if "different" kinds of nodes are within an IOAM Namespace, you still have a fallback solution if Node IDs are part of the trace (the "hop-lim & node-id" data field, bit 0 in the trace type). Indeed, the operator (or the collector/interpreter) knows if node A uses "bytes" or any other unit for a generic data field. "It should be noted that the semantics of some of the node data fields that are defined below, such as the queue depth and buffer occupancy, are implementation specific. This approach is intended to allow IOAM nodes with various different architectures." The last sentence is important here and is, in fact, related to what you describe below. Having genericity on units for such data fields allows for supporting multiple architectures. Same idea for the following paragraph: "Data fields and associated data types for each of the IOAM-Data-Fields are specified in the following sections. The definition of IOAM-Data-Fields focuses on the syntax of the data-fields and avoids specifying the semantics where feasible. This is why no units are defined for data-fields like e.g., "buffer occupancy" or "queue depth". With this approach, nodes can supply the information in their native format and are not required to perform unit or format conversions. Systems that further process IOAM information, like e.g., a network management system are assumed to also handle unit conversions as part of their IOAM data-fields processing. The combination of a particular data-field and the namespace-id provides for the context to interpret the provided data appropriately." Does it make more sense now why it's not really a problem to have implementation-specific units for such data fields? >> [...] >> >> Do you believe this patch does not provide what is defined in the spec? >> If so, I'm open to any suggestions. > > The opposite, in a sense.
> I think the patch does implement behavior within a reasonable > interpretation of the standard. But the feature itself seems more useful > for forwarding ASICs than Linux routers, Good point. Actually, that's probably why such a data field was defined as it is. > because Linux routers can run a full telemetry stack and all sorts > of advanced SW instrumentation. The use case for reporting kernel > memory use via IOAM's constrained interface does not seem particularly > practical since it's not providing a very strong signal on what's > going on. I agree and disagree. I disagree because this value definitely tells you that something (potentially bad) is going on, when it increases significantly enough to reach a critical threshold. Basically, we need more skb's, but oh, the pool is exhausted. OK, not a problem, expand the pool. Oh wait, no memory left. Why? Is it only due to too much (temporary?) load? Should I put the blame on the NIC? Is it a memory issue? Is it something else? Or maybe several issues combined? Well, you might not know exactly why (though you know there is a problem), which is also why I agree with you. But, this is also why you have other data fields available (i.e., detecting a problem might require 2+ symptoms instead of just one). > For switches running Linux, the switch ASIC buffer occupancy can be read > via devlink-sb, which would seem like a better fit to me, but unfortunately > the devlink calls can sleep so we can't read such device info from the > datapath. Indeed, it would be a better fit. I didn't know about this one, thanks for that. It's a shame it can't be used in this context, though. But, at the end of the day, we're left with nothing regarding buffer occupancy. So I'm wondering if "something" is not better than "nothing" in this case. And, for that, we're back to my previous answer on why I agree and disagree with what you said about its utility.
On Thu, 9 Dec 2021 15:10:24 +0100 (CET) Justin Iurman wrote: > > because Linux routers can run a full telemetry stack and all sort > > of advanced SW instrumentation. The use case for reporting kernel > > memory use via IOAM's constrained interface does not seem particularly > > practical since it's not providing a very strong signal on what's > > going on. > > I agree and disagree. I disagree because this value definitely tells you > that something (potentially bad) is going on, when it increases > significantly enough to reach a critical threshold. Basically, we need > more skb's, but oh, the pool is exhausted. OK, not a problem, expand the > pool. Oh wait, no memory left. Why? Is it only due to too much > (temporary?) load? Should I put the blame on the NIC? Is it a memory > issue? Is it something else? Or maybe several issues combined? Well, you > might not know exactly why (though you know there is a problem), which is > also why I agree with you. But, this is also why you have other data > fields available (i.e., detecting a problem might require 2+ symptoms > instead of just one). > > > For switches running Linux the switch ASIC buffer occupancy can be read > > via devlink-sb that'd seem like a better fit for me, but unfortunately > > the devlink calls can sleep so we can't read such device info from the > > datapath. > > Indeed, would be a better fit. I didn't know about this one, thanks for > that. It's a shame it can't be used in this context, though. But, at the > end of the day, we're left with nothing regarding buffer occupancy. So > I'm wondering if "something" is not better than "nothing" in this case. > And, for that, we're back to my previous answer on why I agree and > disagree with what you said about its utility. I think we're on the same page, the main problem is I've not seen anyone use the skbuff_head_cache occupancy as a signal in practice. I'm adding a bunch of people to the CC list, hopefully someone has an opinion one way or the other. Lore link to the full thread, FWIW: https://lore.kernel.org/all/20211206211758.19057-1-justin.iurman@uliege.be/
>> Indeed, would be a better fit. I didn't know about this one, thanks for >> that. It's a shame it can't be used in this context, though. But, at the >> end of the day, we're left with nothing regarding buffer occupancy. So >> I'm wondering if "something" is not better than "nothing" in this case. >> And, for that, we're back to my previous answer on why I agree and >> disagree with what you said about its utility. > > I think we're on the same page, the main problem is I've not seen > anyone use the skbuff_head_cache occupancy as a signal in practice. Indeed. > I'm adding a bunch of people to the CC list, hopefully someone has > an opinion one way or the other. +1, thanks Jakub. > Lore link to the full thread, FWIW: > > https://lore.kernel.org/all/20211206211758.19057-1-justin.iurman@uliege.be/
On Dec 10, 2021, at 1:38 AM, Jakub Kicinski kuba@kernel.org wrote: > [...] > I think we're on the same page, the main problem is I've not seen > anyone use the skbuff_head_cache occupancy as a signal in practice. > > I'm adding a bunch of people to the CC list, hopefully someone has > an opinion one way or the other. It looks like we won't have more opinions on that, unfortunately. @Jakub - Should I submit it as a PATCH and see if we receive more feedback there?
On Tue, Dec 21, 2021 at 06:06:39PM +0100, Justin Iurman wrote: > On Dec 10, 2021, at 1:38 AM, Jakub Kicinski kuba@kernel.org wrote: > > [...] > > I think we're on the same page, the main problem is I've not seen > > anyone use the skbuff_head_cache occupancy as a signal in practice. > > > > I'm adding a bunch of people to the CC list, hopefully someone has > > an opinion one way or the other. > > It looks like we won't have more opinions on that, unfortunately. > > @Jakub - Should I submit it as a PATCH and see if we receive more > feedback there? I know nothing about OAM and therefore did not want to comment, but I think the point raised about the metric you propose being irrelevant in the context of offloaded data paths is quite important. The "devlink-sb" proposal was dismissed very quickly on grounds of requiring sleepable context; is that a deal breaker, and if it is, why? Offloaded interfaces like switches/routers are not the only ones that can report buffer occupancy. Plain NICs also have buffer pools, DMA RX/TX rings, MAC FIFOs, etc., that could indicate congestion or otherwise high load. Maybe slab information could be relevant, for lack of a better option, on virtual interfaces, but if they're physical, why limit ourselves to reporting that? The IETF draft you present says "This field indicates the current status of the occupancy of the common buffer pool used by a set of queues." It appears to me that we could try to get reporting with better granularity (per interface, per queue) than just something based on skbuff_head_cache. What if someone needs that finer granularity in the future?
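To make the granularity point concrete, here is a rough kernel-side sketch of a per-interface view: summing the qdisc backlog across a device's TX queues. The helper name is hypothetical and the code is illustrative only, not part of the series; it assumes a context where RCU read-side access is legal.

```c
#include <linux/netdevice.h>
#include <net/sch_generic.h>

/* Illustrative only: sum the qdisc backlog (in bytes) across all TX
 * queues of one device, giving a per-interface rather than system-wide
 * view of queued data. */
static u32 dev_tx_backlog_bytes(struct net_device *dev)
{
	u32 total = 0;
	unsigned int i;

	rcu_read_lock();
	for (i = 0; i < dev->num_tx_queues; i++) {
		struct netdev_queue *txq = netdev_get_tx_queue(dev, i);
		struct Qdisc *qdisc = rcu_dereference(txq->qdisc);

		if (qdisc)
			total += qdisc->qstats.backlog;
	}
	rcu_read_unlock();

	return total;
}
```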
On Tue, 21 Dec 2021 19:23:37 +0200 Vladimir Oltean wrote: > On Tue, Dec 21, 2021 at 06:06:39PM +0100, Justin Iurman wrote: > > On Dec 10, 2021, at 1:38 AM, Jakub Kicinski kuba@kernel.org wrote: > > > I think we're on the same page, the main problem is I've not seen > > > anyone use the skbuff_head_cache occupancy as a signal in practice. > > > > > > I'm adding a bunch of people to the CC list, hopefully someone has > > > an opinion one way or the other. > > > > It looks like we won't have more opinions on that, unfortunately. > > > > @Jakub - Should I submit it as a PATCH and see if we receive more > > feedback there? > > I know nothing about OAM and therefore did not want to comment, but I > think the point raised about the metric you propose being irrelevant in > the context of offloaded data paths is quite important. The "devlink-sb" > proposal was dismissed very quickly on grounds of requiring sleepable > context; is that a deal breaker, and if it is, why? Offloaded interfaces > like switches/routers are not the only ones that can report buffer > occupancy. Plain NICs also have buffer pools, DMA RX/TX rings, MAC FIFOs, > etc., that could indicate congestion or otherwise high load. Maybe slab > information could be relevant, for lack of a better option, on virtual > interfaces, but if they're physical, why limit ourselves to reporting > that? The IETF draft you present says "This field indicates the current > status of the occupancy of the common buffer pool used by a set of > queues." It appears to me that we could try to get reporting with better > granularity (per interface, per queue) than just something based on > skbuff_head_cache. What if someone needs that finer granularity in the > future? Indeed. In my experience finding meaningful metrics is hard; the chances that something that seems useful on the surface actually provides meaningful signal in deployments are a lot lower than one may expect. And the commit message reads as if the objective was checking a box in the implemented IOAM metrics, rather than exporting relevant information. We can do a roll call on people CCed but I read their silence as meaning that nobody thinks this metric is useful. Is there any experimental data you can point to which proves the signal strength?
On Dec 21, 2021, at 6:23 PM, Vladimir Oltean olteanv@gmail.com wrote: > I know nothing about OAM and therefore did not want to comment, but I > think the point raised about the metric you propose being irrelevant in > the context of offloaded data paths is quite important. NP, all opinions are more than welcome. > The "devlink-sb" proposal was dismissed very quickly on grounds of > requiring sleepable context; is that a deal breaker, and if it is, why? Can't sleep in the datapath. > Offloaded interfaces like switches/routers are not the only ones that can > report buffer occupancy. Plain NICs also have buffer pools, DMA RX/TX > rings, MAC FIFOs, etc., that could indicate congestion or otherwise high > load. Indeed. Is there any API to retrieve such a metric? Anyway, that would probably involve (again) sleepable context. > Maybe slab information could be relevant, for lack of a better option, on > virtual interfaces, but if they're physical, why limit ourselves to > reporting that? The IETF draft you present says "This field indicates the > current status of the occupancy of the common buffer pool used by a set of > queues." It appears to me that we could try to get reporting with better > granularity (per interface, per queue) than just something based on > skbuff_head_cache. What if someone needs that finer granularity in the > future? I think we all agree (Jakub, you, and I) on this point. The thing is, what would a better solution look like, something generic that still makes sense, instead of just nothing? Is it actually feasible at all?
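On the question of an API for NIC ring state: the closest generic interface, ethtool's ETHTOOL_GRINGPARAM, reports configured and maximum ring sizes (capacity), not the current fill level, so ring and FIFO occupancy remains device-specific. For reference, the UAPI structure from include/uapi/linux/ethtool.h looks roughly like this (comments are illustrative):

```c
/* ETHTOOL_GRINGPARAM result: ring *sizes*, not occupancy. There is no
 * generic counterpart that reports how full the rings currently are. */
struct ethtool_ringparam {
	__u32	cmd;			/* ETHTOOL_GRINGPARAM */
	__u32	rx_max_pending;		/* max RX ring size supported */
	__u32	rx_mini_max_pending;
	__u32	rx_jumbo_max_pending;
	__u32	tx_max_pending;		/* max TX ring size supported */
	__u32	rx_pending;		/* currently configured RX ring size */
	__u32	rx_mini_pending;
	__u32	rx_jumbo_pending;
	__u32	tx_pending;		/* currently configured TX ring size */
};
```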
On Dec 21, 2021, at 9:13 PM, Jakub Kicinski kuba@kernel.org wrote: > On Tue, 21 Dec 2021 19:23:37 +0200 Vladimir Oltean wrote: >> On Tue, Dec 21, 2021 at 06:06:39PM +0100, Justin Iurman wrote: >> > On Dec 10, 2021, at 1:38 AM, Jakub Kicinski kuba@kernel.org wrote: >> > > I think we're on the same page, the main problem is I've not seen >> > > anyone use the skbuff_head_cache occupancy as a signal in practice. >> > > >> > > I'm adding a bunch of people to the CC list, hopefully someone has >> > > an opinion one way or the other. >> > >> > It looks like we won't have more opinions on that, unfortunately. >> > >> > @Jakub - Should I submit it as a PATCH and see if we receive more >> > feedback there? >> >> I know nothing about OAM and therefore did not want to comment, but I >> think the point raised about the metric you propose being irrelevant in >> the context of offloaded data paths is quite important. The "devlink-sb" >> proposal was dismissed very quickly on grounds of requiring sleepable >> context; is that a deal breaker, and if it is, why? Offloaded interfaces >> like switches/routers are not the only ones that can report buffer >> occupancy. Plain NICs also have buffer pools, DMA RX/TX rings, MAC FIFOs, >> etc., that could indicate congestion or otherwise high load. Maybe slab >> information could be relevant, for lack of a better option, on virtual >> interfaces, but if they're physical, why limit ourselves to reporting >> that? The IETF draft you present says "This field indicates the current >> status of the occupancy of the common buffer pool used by a set of >> queues." It appears to me that we could try to get reporting with better >> granularity (per interface, per queue) than just something based on >> skbuff_head_cache. What if someone needs that finer granularity in the >> future? > > Indeed. > > In my experience finding meaningful metrics is hard; the chances that > something that seems useful on the surface actually provides meaningful > signal in deployments are a lot lower than one may expect. True. > And the commit message reads as if the objective was checking a box in > the implemented IOAM metrics, rather than exporting relevant information. Indeed, but not only. I sent this patchset as a Request for Comments to see if it was correct and relevant. I mean, if there is no consensus on this, I'll keep this data field as not supported, not a big deal. But it would obviously be good to have it at some point (as long as what we retrieve makes sense enough, and for all cases). > We can do a roll call on people CCed but I read their silence as meaning > that nobody thinks this metric is useful. I thought silence meant consent. That's why more opinions would be welcome, even if we seem to converge. Not only for opinions, but also for any idea or guidance for a better solution, if any. > Is there any experimental data you can point to which proves the signal > strength? Apart from the fact that I monitored the metric during normal and high-load situations, honestly no. Values made sense during those tests, though.
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 181045148b06..745790dbcbcb 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -19,6 +19,21 @@
 #include <linux/percpu-refcount.h>
 
+struct slabinfo {
+	unsigned long active_objs;
+	unsigned long num_objs;
+	unsigned long active_slabs;
+	unsigned long num_slabs;
+	unsigned long shared_avail;
+	unsigned int limit;
+	unsigned int batchcount;
+	unsigned int shared;
+	unsigned int objects_per_slab;
+	unsigned int cache_order;
+};
+
+void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo);
+
 /*
  * Flags to pass to kmem_cache_create().
  * The ones marked DEBUG are only valid if CONFIG_DEBUG_SLAB is set.
diff --git a/mm/slab.h b/mm/slab.h
index 56ad7eea3ddf..cd6a8a2768e3 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -175,20 +175,6 @@ void slab_kmem_cache_release(struct kmem_cache *);
 struct seq_file;
 struct file;
 
-struct slabinfo {
-	unsigned long active_objs;
-	unsigned long num_objs;
-	unsigned long active_slabs;
-	unsigned long num_slabs;
-	unsigned long shared_avail;
-	unsigned int limit;
-	unsigned int batchcount;
-	unsigned int shared;
-	unsigned int objects_per_slab;
-	unsigned int cache_order;
-};
-
-void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo);
 void slabinfo_show_stats(struct seq_file *m, struct kmem_cache *s);
 ssize_t slabinfo_write(struct file *file, const char __user *buffer,
 		       size_t count, loff_t *ppos);
diff --git a/net/ipv6/ioam6.c b/net/ipv6/ioam6.c
index 088eb2f877bc..f0a44dc2a0df 100644
--- a/net/ipv6/ioam6.c
+++ b/net/ipv6/ioam6.c
@@ -14,6 +14,8 @@
 #include <linux/ioam6_genl.h>
 #include <linux/rhashtable.h>
 #include <linux/netdevice.h>
+#include <linux/slab.h>
+#include <linux/skbuff.h>
 
 #include <net/addrconf.h>
 #include <net/genetlink.h>
@@ -778,7 +780,16 @@ static void __ioam6_fill_trace_data(struct sk_buff *skb,
 
 	/* buffer occupancy */
 	if (trace->type.bit11) {
-		*(__be32 *)data = cpu_to_be32(IOAM6_U32_UNAVAILABLE);
+		struct slabinfo sinfo;
+		u32 size, bytes;
+
+		sinfo.active_objs = 0;
+		get_slabinfo(skbuff_head_cache, &sinfo);
+		size = kmem_cache_size(skbuff_head_cache);
+
+		bytes = min(sinfo.active_objs * size, (unsigned long)(U32_MAX-1));
+
+		*(__be32 *)data = cpu_to_be32(bytes);
 		data += sizeof(__be32);
 	}
This patch is an attempt to support the buffer occupancy in IOAM trace data fields. Any feedback is appreciated, or any other idea if this one is not correct.

The draft [1] says the following:

  The "buffer occupancy" field is a 4-octet unsigned integer field. This field indicates the current status of the occupancy of the common buffer pool used by a set of queues. The units of this field are implementation specific. Hence, the units are interpreted within the context of an IOAM-Namespace and/or node-id if used. The authors acknowledge that in some operational cases there is a need for the units to be consistent across a packet path through the network, hence it is recommended for implementations to use standard units such as Bytes.

An existing function (i.e., get_slabinfo) is used to retrieve info about skbuff_head_cache. For that, both the prototype of get_slabinfo and the struct definition of slabinfo were moved from mm/slab.h to include/linux/slab.h. Any objection to this?

The function kmem_cache_size is used to retrieve the size of a slab object. Note that it returns the "object_size" field, not the "size" field. If needed, a new function (e.g., kmem_cache_full_size) could be added to return the "size" field. To match the definition from the draft, the number of bytes is computed as follows:

  slabinfo.active_objs * size

Thoughts?

[1] https://datatracker.ietf.org/doc/html/draft-ietf-ippm-ioam-data#section-5.4.2.12

Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
---
 include/linux/slab.h | 15 +++++++++++++++
 mm/slab.h            | 14 --------------
 net/ipv6/ioam6.c     | 13 ++++++++++++-
 3 files changed, 27 insertions(+), 15 deletions(-)
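For anyone wanting to sanity-check the value this patch would export, the same computation can be approximated from userspace by parsing /proc/slabinfo (root required). A rough sketch follows; note that the objsize column corresponds to the object size (matching kmem_cache_size rather than the full per-object footprint), and that on SLUB kernels the cache may be merged into another and absent under this name.

```c
#include <stdio.h>
#include <string.h>

/* Approximate the IOAM buffer occupancy value from userspace:
 * active_objs * objsize for the skbuff_head_cache slab cache.
 * /proc/slabinfo columns: name active_objs num_objs objsize ... */
int main(void)
{
	char name[64];
	unsigned long active_objs, num_objs, objsize;
	char line[512];
	FILE *f = fopen("/proc/slabinfo", "r");

	if (!f) {
		perror("fopen");
		return 1;
	}

	while (fgets(line, sizeof(line), f)) {
		/* Header lines fail the 4-field parse and are skipped. */
		if (sscanf(line, "%63s %lu %lu %lu",
			   name, &active_objs, &num_objs, &objsize) == 4 &&
		    !strcmp(name, "skbuff_head_cache")) {
			printf("buffer occupancy ~ %lu bytes\n",
			       active_objs * objsize);
			break;
		}
	}

	fclose(f);
	return 0;
}
```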