diff mbox series

[RFC,net] Fixes: b63c5478e9cb ("ipv6: ioam: Support for Queue depth data field")

Message ID 20221205153557.28549-1-justin.iurman@uliege.be (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Series [RFC,net] Fixes: b63c5478e9cb ("ipv6: ioam: Support for Queue depth data field")

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net
netdev/fixes_present fail Series targets non-next tree, but doesn't contain any Fixes tags
netdev/subject_prefix success
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers success CCed 7 of 7 maintainers
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch warning WARNING: line length of 82 exceeds 80 columns; WARNING: line length of 85 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Justin Iurman Dec. 5, 2022, 3:35 p.m. UTC
This patch fixes a NULL pointer dereference (queue->qdisc) when retrieving
the TX queue depth for IOAM.

IMPORTANT: I suspect this fix is local only and the bug goes deeper (see
reasoning below).

Kernel panic:
[...]
RIP: 0010:ioam6_fill_trace_data+0x54f/0x5b0
[...]

...which basically points to the call to qdisc_qstats_qlen_backlog
inside net/ipv6/ioam6.c.

From there, I immediately suspected a NULL pointer (queue->qdisc). To make
sure, I added some printk's to see exactly *why* and *when* it happens.
Here is the output (summarized by queue):

skb for TX queue 1, qdisc is ffff8b375eee9800, qdisc_sleeping is ffff8b375eee9800
skb for TX queue 2, qdisc is ffff8b375eeefc00, qdisc_sleeping is ffff8b375eeefc00
skb for TX queue 3, qdisc is ffff8b375eeef800, qdisc_sleeping is ffff8b375eeef800
skb for TX queue 4, qdisc is ffff8b375eeec800, qdisc_sleeping is ffff8b375eeec800
skb for TX queue 5, qdisc is ffff8b375eeea400, qdisc_sleeping is ffff8b375eeea400
skb for TX queue 6, qdisc is ffff8b375eeee000, qdisc_sleeping is ffff8b375eeee000
skb for TX queue 7, qdisc is ffff8b375eee8800, qdisc_sleeping is ffff8b375eee8800
skb for TX queue 8, qdisc is ffff8b375eeedc00, qdisc_sleeping is ffff8b375eeedc00
skb for TX queue 9, qdisc is ffff8b375eee9400, qdisc_sleeping is ffff8b375eee9400
skb for TX queue 10, qdisc is ffff8b375eee8000, qdisc_sleeping is ffff8b375eee8000
skb for TX queue 11, qdisc is ffff8b375eeed400, qdisc_sleeping is ffff8b375eeed400
skb for TX queue 12, qdisc is ffff8b375eeea800, qdisc_sleeping is ffff8b375eeea800
skb for TX queue 13, qdisc is ffff8b375eee8c00, qdisc_sleeping is ffff8b375eee8c00
skb for TX queue 14, qdisc is ffff8b375eeea000, qdisc_sleeping is ffff8b375eeea000
skb for TX queue 15, qdisc is ffff8b375eeeb800, qdisc_sleeping is ffff8b375eeeb800
skb for TX queue 16, qdisc is NULL, qdisc_sleeping is NULL
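
For reference, a debug print along these lines could produce a dump like the
one above (a minimal sketch only; the exact instrumentation is not shown in
the thread, and the format string, %px usage and 1-based queue numbering are
assumptions):

	/* Hypothetical debug print, placed where __ioam6_fill_trace_data()
	 * looks up the TX queue: show which queue the skb maps to and
	 * whether a qdisc is attached to it.  %px prints raw pointers.
	 */
	queue = skb_get_tx_queue(skb_dst(skb)->dev, skb);
	pr_info("skb for TX queue %u, qdisc is %px, qdisc_sleeping is %px\n",
		skb_get_queue_mapping(skb) + 1,
		rcu_access_pointer(queue->qdisc),
		queue->qdisc_sleeping);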

What the hell? So, I'm not sure why queue #16 would *never* have a qdisc
attached. Is this expected behavior that I'm not aware of? As an FYI, here is
the output of "tc qdisc list dev xxx":

qdisc mq 0: root
qdisc fq_codel 0: parent :10 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
qdisc fq_codel 0: parent :f limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
qdisc fq_codel 0: parent :e limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
qdisc fq_codel 0: parent :d limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
qdisc fq_codel 0: parent :c limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
qdisc fq_codel 0: parent :b limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
qdisc fq_codel 0: parent :a limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
qdisc fq_codel 0: parent :9 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
qdisc fq_codel 0: parent :8 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
qdisc fq_codel 0: parent :7 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
qdisc fq_codel 0: parent :6 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
qdisc fq_codel 0: parent :5 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
qdisc fq_codel 0: parent :4 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
qdisc fq_codel 0: parent :3 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
qdisc fq_codel 0: parent :2 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
qdisc fq_codel 0: parent :1 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64

By the way, the NIC is an Intel XL710 40GbE QSFP+ (i40e driver, firmware
version 8.50 0x8000b6c7 1.3082.0) and it was tested on latest "net"
version (6.1.0-rc7+). Is this a bug in the i40e driver?
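
One way to narrow this down would be to compare how many TX queues the device
allocates with how many it actually uses, and to report any queue without a
qdisc (a sketch only, not part of the posted patch; num_tx_queues and
real_num_tx_queues are fields of struct net_device):

	/* Hypothetical probe: does the device expose more TX queues than
	 * have a qdisc attached?
	 */
	unsigned int i;

	pr_info("%s: num_tx_queues=%u real_num_tx_queues=%u\n",
		dev->name, dev->num_tx_queues, dev->real_num_tx_queues);
	for (i = 0; i < dev->num_tx_queues; i++) {
		struct netdev_queue *txq = netdev_get_tx_queue(dev, i);

		if (!rcu_access_pointer(txq->qdisc))
			pr_info("%s: TX queue %u has no qdisc\n", dev->name, i);
	}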

Cc: stable@vger.kernel.org
Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
---
 net/ipv6/ioam6.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

Comments

Eric Dumazet Dec. 5, 2022, 4:30 p.m. UTC | #1
Patch title seems

On Mon, Dec 5, 2022 at 4:36 PM Justin Iurman <justin.iurman@uliege.be> wrote:
>
> This patch fixes a NULL pointer dereference (queue->qdisc) when retrieving
> the TX queue depth for IOAM.
>
> IMPORTANT: I suspect this fix is local only and the bug goes deeper (see
> reasoning below).
>
> [...]
>

> Cc: stable@vger.kernel.org

Patch title is mangled. The Fixes: tag should appear here, not in the title.


Fixes: b63c5478e9cb ("ipv6: ioam: Support for Queue depth data field")

> Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
> ---
>  net/ipv6/ioam6.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/net/ipv6/ioam6.c b/net/ipv6/ioam6.c
> index 571f0e4d9cf3..2472a8a043c4 100644
> --- a/net/ipv6/ioam6.c
> +++ b/net/ipv6/ioam6.c
> @@ -727,10 +727,13 @@ static void __ioam6_fill_trace_data(struct sk_buff *skb,
>                         *(__be32 *)data = cpu_to_be32(IOAM6_U32_UNAVAILABLE);
>                 } else {
>                         queue = skb_get_tx_queue(skb_dst(skb)->dev, skb);

Are you sure skb_dst(skb)->dev is correct at this stage, what about
stacked devices ?

> -                       qdisc = rcu_dereference(queue->qdisc);
> -                       qdisc_qstats_qlen_backlog(qdisc, &qlen, &backlog);
> -
> -                       *(__be32 *)data = cpu_to_be32(backlog);
> +                       if (!queue->qdisc) {

This is racy.

> +                               *(__be32 *)data = cpu_to_be32(IOAM6_U32_UNAVAILABLE);
> +                       } else {
> +                               qdisc = rcu_dereference(queue->qdisc);
> +                               qdisc_qstats_qlen_backlog(qdisc, &qlen, &backlog);
> +                               *(__be32 *)data = cpu_to_be32(backlog);
> +                       }
>                 }
>                 data += sizeof(__be32);
>         }
> --
> 2.25.1
>
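
Eric's "this is racy" remark above presumably refers to queue->qdisc being
read twice: once for the NULL test and once via rcu_dereference(), so the
pointer can change in between. A minimal non-racy sketch, still ignoring the
locking question discussed further down, would dereference once and test the
result:

	qdisc = rcu_dereference(queue->qdisc);
	if (!qdisc) {
		/* No qdisc attached to this TX queue: report "unavailable". */
		*(__be32 *)data = cpu_to_be32(IOAM6_U32_UNAVAILABLE);
	} else {
		qdisc_qstats_qlen_backlog(qdisc, &qlen, &backlog);
		*(__be32 *)data = cpu_to_be32(backlog);
	}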

Quite frankly I suggest to revert b63c5478e9cb completely.

The notion of Queue depth can not be properly gathered in Linux with a
multi queue model,
so why trying to get a wrong value ?
Eric Dumazet Dec. 5, 2022, 4:50 p.m. UTC | #2
On Mon, Dec 5, 2022 at 5:30 PM Eric Dumazet <edumazet@google.com> wrote:
>
> Patch title seems
>
> [...]
>
> Quite frankly I suggest to revert b63c5478e9cb completely.
>
> The notion of Queue depth can not be properly gathered in Linux with a
> multi queue model,
> so why trying to get a wrong value ?

Additional reason for a revert is that qdisc_qstats_qlen_backlog() is
reserved for net/sched
code, I think it needs the qdisc lock to be held.
Justin Iurman Dec. 5, 2022, 6:14 p.m. UTC | #3
On 12/5/22 17:30, Eric Dumazet wrote:
> Patch title seems
> 
> On Mon, Dec 5, 2022 at 4:36 PM Justin Iurman <justin.iurman@uliege.be> wrote:
>>
>> [...]
>>
> 
>> Cc: stable@vger.kernel.org
> 
> Patch title is mangled. The Fixes: tag should appear here, not in the title.
> 
> 
> Fixes: b63c5478e9cb ("ipv6: ioam: Support for Queue depth data field")

Correct, sorry about that. Fortunately, this is only an RFC.

>> Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
>> [...]
>>                          queue = skb_get_tx_queue(skb_dst(skb)->dev, skb);
> 
> Are you sure skb_dst(skb)->dev is correct at this stage, what about
> stacked devices ?

It is. I don't see a problem here.

>> -                       qdisc = rcu_dereference(queue->qdisc);
>> -                       qdisc_qstats_qlen_backlog(qdisc, &qlen, &backlog);
>> -
>> -                       *(__be32 *)data = cpu_to_be32(backlog);
>> +                       if (!queue->qdisc) {
> 
> This is racy.

True. The intent was to start a discussion on the aforementioned bug. I 
can still fix that later in the PATCH version.

>> [...]
> 
> Quite frankly I suggest to revert b63c5478e9cb completely.

IMHO, I'd definitely not go for a revert. This is an important IOAM 
metric. And, besides, this is not really the topic here. The issue can 
be fixed easily for IOAM, but this thread is more about discussing the 
strange behavior I observed with qdisc's. Is it a bug? If so, let's find 
and fix the cause before fixing the consequences. Is it only for i40e or 
some others too? It's probably worth investigating, regardless of this 
patch.

> The notion of Queue depth can not be properly gathered in Linux with a
> multi queue model,
> so why trying to get a wrong value ?

I don't think it's true. The queue depth is a per queue metric, and this 
is exactly what qdisc_qstats_qlen_backlog provides in this case. I can't 
find any issue regarding a wrong value. If what you say is true, then 
let's remove it from everywhere (tc reports the queue depth). Not convinced.
Justin Iurman Dec. 5, 2022, 6:24 p.m. UTC | #4
On 12/5/22 17:50, Eric Dumazet wrote:
> On Mon, Dec 5, 2022 at 5:30 PM Eric Dumazet <edumazet@google.com> wrote:
>>
>> Patch title seems
>>
>> On Mon, Dec 5, 2022 at 4:36 PM Justin Iurman <justin.iurman@uliege.be> wrote:
>>
>> [...]
>>
>> Quite frankly I suggest to revert b63c5478e9cb completely.
>>
>> The notion of Queue depth can not be properly gathered in Linux with a
>> multi queue model,
>> so why trying to get a wrong value ?
> 
> Additional reason for a revert is that qdisc_qstats_qlen_backlog() is
> reserved for net/sched

If by "reserved" you mean "only used by at the moment", then yes (with 
the only exception being IOAM). But some other functions are defined as 
well, and some are used elsewhere than the "net/sched" context. So I 
don't think it's really an issue to use this function "from somewhere else".

> code, I think it needs the qdisc lock to be held.

Good point. But is it really needed when called with rcu_read_lock?
Eric Dumazet Dec. 5, 2022, 6:34 p.m. UTC | #5
On Mon, Dec 5, 2022 at 7:24 PM Justin Iurman <justin.iurman@uliege.be> wrote:
>
> [...]
>
> >>
> >> Quite frankly I suggest to revert b63c5478e9cb completely.
> >>
> >> The notion of Queue depth can not be properly gathered in Linux with a
> >> multi queue model,
> >> so why trying to get a wrong value ?
> >
> > Additional reason for a revert is that qdisc_qstats_qlen_backlog() is
> > reserved for net/sched
>
> If by "reserved" you mean "only used by at the moment", then yes (with
> the only exception being IOAM). But some other functions are defined as
> well, and some are used elsewhere than the "net/sched" context. So I
> don't think it's really an issue to use this function "from somewhere else".
>
> > code, I think it needs the qdisc lock to be held.
>
> Good point. But is it really needed when called with rcu_read_lock?

It is needed.

Please revert this patch.

Many people use FQ qdisc, where packets are waiting for their Earliest
Departure Time to be released.

Also, the draft says:

5.4.2.7.  queue depth

   The "queue depth" field is a 4-octet unsigned integer field.  This
   field indicates the current length of the egress interface queue of
   the interface from where the packet is forwarded out.  The queue
   depth is expressed as the current amount of memory buffers used by
   the queue (a packet could consume one or more memory buffers,
   depending on its size).

    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       queue depth                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+


It is relatively clear that the egress interface is the aggregate
egress interface,
not a subset of the interface.

If you have 32 TX queues on a NIC, all of them being backlogged (line rate),
sensing the queue length of one of the queues would give a 97% error
on the measure.

To properly implement that 'Queue depth' metric, you would need to use
the backlog of the MQ qdisc.
That would be very expensive.
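
The 97% figure presumably comes from one of 32 equally backlogged queues
holding roughly 1/32 of the aggregate backlog, so reporting it as the egress
queue depth under-reports by about 31/32. Aggregating across queues, as
suggested above, might look roughly like the hypothetical helper below; it is
not existing kernel code, it ignores the locking and per-CPU stats questions
raised in this thread, and it shows why walking every queue for each marked
packet would be expensive:

	/* Hypothetical: sum the backlog of every per-queue child qdisc of a
	 * multiqueue device.  Caller must hold rcu_read_lock().
	 */
	static u32 dev_total_backlog(const struct net_device *dev)
	{
		u32 total = 0;
		unsigned int i;

		for (i = 0; i < dev->num_tx_queues; i++) {
			struct netdev_queue *txq = netdev_get_tx_queue(dev, i);
			struct Qdisc *qdisc = rcu_dereference(txq->qdisc);
			u32 qlen = 0, backlog = 0;

			if (!qdisc)
				continue;
			qdisc_qstats_qlen_backlog(qdisc, &qlen, &backlog);
			total += backlog;
		}
		return total;
	}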
Justin Iurman Dec. 5, 2022, 8:44 p.m. UTC | #6
On 12/5/22 19:34, Eric Dumazet wrote:
> On Mon, Dec 5, 2022 at 7:24 PM Justin Iurman <justin.iurman@uliege.be> wrote:
>>
>> [...]
>>
>>> Additional reason for a revert is that qdisc_qstats_qlen_backlog() is
>>> reserved for net/sched
>>
>> If by "reserved" you mean "only used by at the moment", then yes (with
>> the only exception being IOAM). But some other functions are defined as
>> well, and some are used elsewhere than the "net/sched" context. So I
>> don't think it's really an issue to use this function "from somewhere else".
>>
>>> code, I think it needs the qdisc lock to be held.
>>
>> Good point. But is it really needed when called with rcu_read_lock?
> 
> It is needed.

So I guess I could just submit a patch to wrap that part with:

spin_lock(qdisc_lock(qdisc));
[...]
spin_unlock(qdisc_lock(qdisc));

Anyway, there'd still be that queue-without-qdisc bug out there that I'd 
like to discuss but which seems to be avoided in the conversation somehow.
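
For concreteness, combining the single rcu_dereference() with the locking
suggested above might look roughly like the sketch below. This is an
illustration only, not a submitted patch; whether qdisc_lock() is the
appropriate lock here, e.g. for NOLOCK qdiscs, is part of what is being
debated in this thread.

	queue = skb_get_tx_queue(skb_dst(skb)->dev, skb);
	qdisc = rcu_dereference(queue->qdisc);
	if (!qdisc) {
		*(__be32 *)data = cpu_to_be32(IOAM6_U32_UNAVAILABLE);
	} else {
		/* qdisc_lock() returns the per-qdisc spinlock (&qdisc->q.lock). */
		spin_lock(qdisc_lock(qdisc));
		qdisc_qstats_qlen_backlog(qdisc, &qlen, &backlog);
		spin_unlock(qdisc_lock(qdisc));
		*(__be32 *)data = cpu_to_be32(backlog);
	}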

> Please revert this patch.
> 
> Many people use FQ qdisc, where packets are waiting for their Earliest
> Departure Time to be released.

The IOAM queue depth is a very important value and is already used. Just 
reverting the patch is not really an option. Besides, if IOAM is not 
used, then what you described above would not be a problem (probably 
most of the time as IOAM is enabled on limited domains). Not to mention 
that you probably don't need the queue depth in every packet.

> Also, the draft says:
> 
> 5.4.2.7.  queue depth
> [...]
> 
> It is relatively clear that the egress interface is the aggregate
> egress interface,
> not a subset of the interface.

Correct, even though the definition of an interface in RFC 9197 is quite 
abstract (see the end of section 4.4.2.2: "[...] could represent a 
physical interface, a virtual or logical interface, or even a queue").

> If you have 32 TX queues on a NIC, all of them being backlogged (line rate),
> sensing the queue length of one of the queues would give a 97% error
> on the measure.

Why would it? Not sure I get your idea based on that example.

> To properly implement that 'Queue depth' metric, you would need to use
> the backlog of the MQ qdisc.
> That would be very expensive.

Could be a solution, thanks for the suggestion. But if it's too 
expensive, I'd rather have the current solution (with locks) than 
nothing at all. At least it has provided an acceptable solution so far.
Eric Dumazet Dec. 5, 2022, 8:45 p.m. UTC | #7
On Mon, Dec 5, 2022 at 9:44 PM Justin Iurman <justin.iurman@uliege.be> wrote:
>
> On 12/5/22 19:34, Eric Dumazet wrote:
> > On Mon, Dec 5, 2022 at 7:24 PM Justin Iurman <justin.iurman@uliege.be> wrote:
> >>
> >> On 12/5/22 17:50, Eric Dumazet wrote:
> >>> On Mon, Dec 5, 2022 at 5:30 PM Eric Dumazet <edumazet@google.com> wrote:
> >>>>
> >>>> Patch title seems
> >>>>
> >>>> On Mon, Dec 5, 2022 at 4:36 PM Justin Iurman <justin.iurman@uliege.be> wrote:
> >>>>>
> >>>>> This patch fixes a NULL qdisc pointer when retrieving the TX queue depth
> >>>>> for IOAM.
> >>>>>
> >>>>> IMPORTANT: I suspect this fix is local only and the bug goes deeper (see
> >>>>> reasoning below).
> >>>>>
> >>>>> Kernel panic:
> >>>>> [...]
> >>>>> RIP: 0010:ioam6_fill_trace_data+0x54f/0x5b0
> >>>>> [...]
> >>>>>
> >>>>> ...which basically points to the call to qdisc_qstats_qlen_backlog
> >>>>> inside net/ipv6/ioam6.c.
> >>>>>
> >>>>>   From there, I directly thought of a NULL pointer (queue->qdisc). To make
> >>>>> sure, I added some printk's to know exactly *why* and *when* it happens.
> >>>>> Here is the (summarized by queue) output:
> >>>>>
> >>>>> skb for TX queue 1, qdisc is ffff8b375eee9800, qdisc_sleeping is ffff8b375eee9800
> >>>>> skb for TX queue 2, qdisc is ffff8b375eeefc00, qdisc_sleeping is ffff8b375eeefc00
> >>>>> skb for TX queue 3, qdisc is ffff8b375eeef800, qdisc_sleeping is ffff8b375eeef800
> >>>>> skb for TX queue 4, qdisc is ffff8b375eeec800, qdisc_sleeping is ffff8b375eeec800
> >>>>> skb for TX queue 5, qdisc is ffff8b375eeea400, qdisc_sleeping is ffff8b375eeea400
> >>>>> skb for TX queue 6, qdisc is ffff8b375eeee000, qdisc_sleeping is ffff8b375eeee000
> >>>>> skb for TX queue 7, qdisc is ffff8b375eee8800, qdisc_sleeping is ffff8b375eee8800
> >>>>> skb for TX queue 8, qdisc is ffff8b375eeedc00, qdisc_sleeping is ffff8b375eeedc00
> >>>>> skb for TX queue 9, qdisc is ffff8b375eee9400, qdisc_sleeping is ffff8b375eee9400
> >>>>> skb for TX queue 10, qdisc is ffff8b375eee8000, qdisc_sleeping is ffff8b375eee8000
> >>>>> skb for TX queue 11, qdisc is ffff8b375eeed400, qdisc_sleeping is ffff8b375eeed400
> >>>>> skb for TX queue 12, qdisc is ffff8b375eeea800, qdisc_sleeping is ffff8b375eeea800
> >>>>> skb for TX queue 13, qdisc is ffff8b375eee8c00, qdisc_sleeping is ffff8b375eee8c00
> >>>>> skb for TX queue 14, qdisc is ffff8b375eeea000, qdisc_sleeping is ffff8b375eeea000
> >>>>> skb for TX queue 15, qdisc is ffff8b375eeeb800, qdisc_sleeping is ffff8b375eeeb800
> >>>>> skb for TX queue 16, qdisc is NULL, qdisc_sleeping is NULL
> >>>>>
> >>>>> What the hell? So, not sure why queue #16 would *never* have a qdisc
> >>>>> attached. Is it something expected I'm not aware of? As an FYI, here is
> >>>>> the output of "tc qdisc list dev xxx":
> >>>>>
> >>>>> qdisc mq 0: root
> >>>>> qdisc fq_codel 0: parent :10 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
> >>>>> qdisc fq_codel 0: parent :f limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
> >>>>> qdisc fq_codel 0: parent :e limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
> >>>>> qdisc fq_codel 0: parent :d limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
> >>>>> qdisc fq_codel 0: parent :c limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
> >>>>> qdisc fq_codel 0: parent :b limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
> >>>>> qdisc fq_codel 0: parent :a limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
> >>>>> qdisc fq_codel 0: parent :9 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
> >>>>> qdisc fq_codel 0: parent :8 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
> >>>>> qdisc fq_codel 0: parent :7 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
> >>>>> qdisc fq_codel 0: parent :6 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
> >>>>> qdisc fq_codel 0: parent :5 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
> >>>>> qdisc fq_codel 0: parent :4 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
> >>>>> qdisc fq_codel 0: parent :3 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
> >>>>> qdisc fq_codel 0: parent :2 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
> >>>>> qdisc fq_codel 0: parent :1 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
> >>>>>
> >>>>> By the way, the NIC is an Intel XL710 40GbE QSFP+ (i40e driver, firmware
> >>>>> version 8.50 0x8000b6c7 1.3082.0) and it was tested on latest "net"
> >>>>> version (6.1.0-rc7+). Is this a bug in the i40e driver?
> >>>>>
> >>>>
> >>>>> Cc: stable@vger.kernel.org
> >>>>
> >>>> Patch title is mangled. The Fixes: tag should appear here, not in the title.
> >>>>
> >>>>
> >>>> Fixes: b63c5478e9cb ("ipv6: ioam: Support for Queue depth data field")
> >>>>
> >>>>> Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
> >>>>> ---
> >>>>>    net/ipv6/ioam6.c | 11 +++++++----
> >>>>>    1 file changed, 7 insertions(+), 4 deletions(-)
> >>>>>
> >>>>> diff --git a/net/ipv6/ioam6.c b/net/ipv6/ioam6.c
> >>>>> index 571f0e4d9cf3..2472a8a043c4 100644
> >>>>> --- a/net/ipv6/ioam6.c
> >>>>> +++ b/net/ipv6/ioam6.c
> >>>>> @@ -727,10 +727,13 @@ static void __ioam6_fill_trace_data(struct sk_buff *skb,
> >>>>>                           *(__be32 *)data = cpu_to_be32(IOAM6_U32_UNAVAILABLE);
> >>>>>                   } else {
> >>>>>                           queue = skb_get_tx_queue(skb_dst(skb)->dev, skb);
> >>>>
> >>>> Are you sure skb_dst(skb)->dev is correct at this stage? What about
> >>>> stacked devices?
> >>>>
> >>>>> -                       qdisc = rcu_dereference(queue->qdisc);
> >>>>> -                       qdisc_qstats_qlen_backlog(qdisc, &qlen, &backlog);
> >>>>> -
> >>>>> -                       *(__be32 *)data = cpu_to_be32(backlog);
> >>>>> +                       if (!queue->qdisc) {
> >>>>
> >>>> This is racy.
> >>>>
> >>>>> +                               *(__be32 *)data = cpu_to_be32(IOAM6_U32_UNAVAILABLE);
> >>>>> +                       } else {
> >>>>> +                               qdisc = rcu_dereference(queue->qdisc);
> >>>>> +                               qdisc_qstats_qlen_backlog(qdisc, &qlen, &backlog);
> >>>>> +                               *(__be32 *)data = cpu_to_be32(backlog);
> >>>>> +                       }
> >>>>>                   }
> >>>>>                   data += sizeof(__be32);
> >>>>>           }
> >>>>> --
> >>>>> 2.25.1
> >>>>>
> >>>>
> >>>> Quite frankly I suggest reverting b63c5478e9cb completely.
> >>>>
> >>>> The notion of queue depth cannot be properly gathered in Linux with a
> >>>> multi-queue model, so why try to get a wrong value?
> >>>
> >>> Additional reason for a revert is that qdisc_qstats_qlen_backlog() is
> >>> reserved for net/sched
> >>
> >> If by "reserved" you mean "only used by at the moment", then yes (with
> >> the only exception being IOAM). But some other functions are defined as
> >> well, and some are used elsewhere than the "net/sched" context. So I
> >> don't think it's really an issue to use this function "from somewhere else".
> >>
> >>> code, I think it needs the qdisc lock to be held.
> >>
> >> Good point. But is it really needed when called with rcu_read_lock?
> >
> > It is needed.
>
> So I guess I could just submit a patch to wrap that part with:
>
> spin_lock(qdisc_lock(qdisc));
> [...]
> spin_unlock(qdisc_lock(qdisc));
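For concreteness, a minimal sketch of that locked read inside
__ioam6_fill_trace_data() could look as follows (untested; it assumes the
existing qdisc/qlen/backlog variables are in scope and that taking
qdisc_lock() from this path is acceptable, which is exactly what is being
debated here):

	qdisc = rcu_dereference(queue->qdisc);

	/* Sketch only, not a posted patch: hold the qdisc lock while the
	 * queue length and backlog counters are read. The NULL-qdisc case
	 * discussed in this thread would still need to be handled.
	 */
	spin_lock(qdisc_lock(qdisc));
	qdisc_qstats_qlen_backlog(qdisc, &qlen, &backlog);
	spin_unlock(qdisc_lock(qdisc));

	*(__be32 *)data = cpu_to_be32(backlog);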
>
> Anyway, there'd still be that queue-without-qdisc bug out there that I'd
> like to discuss, but which somehow keeps getting sidestepped in this
> conversation.
>
> > Please revert this patch.
> >
> > Many people use FQ qdisc, where packets are waiting for their Earliest
> > Departure Time to be released.
>
> The IOAM queue depth is a very important value and is already used. Just
> reverting the patch is not really an option. Besides, if IOAM is not
> used, then what you described above would not be a problem (which is
> probably the common case, since IOAM is only enabled inside limited
> domains). Not to mention that you probably don't need the queue depth in
> every packet.
>
> > Also, the draft says:
> >
> > 5.4.2.7.  queue depth
> >
> >     The "queue depth" field is a 4-octet unsigned integer field.  This
> >     field indicates the current length of the egress interface queue of
> >     the interface from where the packet is forwarded out.  The queue
> >     depth is expressed as the current amount of memory buffers used by
> >     the queue (a packet could consume one or more memory buffers,
> >     depending on its size).
> >
> >      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
> >     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> >     |                       queue depth                             |
> >     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> >
> >
> > It is relatively clear that the egress interface is the aggregate
> > egress interface,
> > not a subset of the interface.
>
> Correct, even though the definition of an interface in RFC 9197 is quite
> abstract (see the end of section 4.4.2.2: "[...] could represent a
> physical interface, a virtual or logical interface, or even a queue").
>
> > If you have 32 TX queues on a NIC, all of them being backlogged (line rate),
> > sensing the queue length of one of the queues would give a 97% error
> > on the measure.
>
> Why would it? Not sure I get your idea based on that example.
>
> > To properly implement that 'Queue depth' metric, you would need to use
> > the backlog of the MQ qdisc.
> > That would be very expensive.
>
> Could be a solution, thanks for the suggestion. But if it's too
> expensive, I'd rather have the current solution (with locks) than
> nothing at all. At least it has provided an acceptable solution so far.

I will let other maintainers decide.

I gave my feedback, I won't change it.
Jakub Kicinski Dec. 6, 2022, 8:43 p.m. UTC | #8
On Mon, 5 Dec 2022 21:44:09 +0100 Justin Iurman wrote:
> > Please revert this patch.
> > 
> > Many people use FQ qdisc, where packets are waiting for their Earliest
> > Departure Time to be released.  
> 
> The IOAM queue depth is a very important value and is already used.

Can you say more about the use? What signal do you derive from it?
I do track qlen on Meta's servers but haven't found a strong use 
for it yet (I did for backlog drops but not the qlen itself).

> > Also, the draft says:
> > 
> > 5.4.2.7.  queue depth
> > 
> >     The "queue depth" field is a 4-octet unsigned integer field.  This
> >     field indicates the current length of the egress interface queue of
> >     the interface from where the packet is forwarded out.  The queue
> >     depth is expressed as the current amount of memory buffers used by
> >     the queue (a packet could consume one or more memory buffers,
> >     depending on its size).
> > 
> >      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
> >     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> >     |                       queue depth                             |
> >     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> > 
> > 
> > It is relatively clear that the egress interface is the aggregate
> > egress interface,
> > not a subset of the interface.  
> 
> Correct, even though the definition of an interface in RFC 9197 is quite 
> abstract (see the end of section 4.4.2.2: "[...] could represent a 
> physical interface, a virtual or logical interface, or even a queue").
> 
> > If you have 32 TX queues on a NIC, all of them being backlogged (line rate),
> > sensing the queue length of one of the queues would give a 97% error
> > on the measure.  
> 
> Why would it? Not sure I get your idea based on that example.

Because it measures the length of a single queue not the device.
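
(For the arithmetic behind that figure: with 32 TX queues all equally
backlogged, any single queue holds roughly 1/32, i.e. about 3%, of the
device's total backlog, so reporting only that one queue understates the
device-level backlog by about 97%.)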
Justin Iurman Dec. 7, 2022, 12:07 p.m. UTC | #9
On 12/6/22 21:43, Jakub Kicinski wrote:
> On Mon, 5 Dec 2022 21:44:09 +0100 Justin Iurman wrote:
>>> Please revert this patch.
>>>
>>> Many people use FQ qdisc, where packets are waiting for their Earliest
>>> Departure Time to be released.
>>
>> The IOAM queue depth is a very important value and is already used.
> 
> Can you say more about the use? What signal do you derive from it?
> I do track qlen on Meta's servers but haven't found a strong use
> for it yet (I did for backlog drops but not the qlen itself).

The specification goal of the queue depth was initially to be able to 
track the entire path with a detailed view for packets or flows (kind of 
a zoom on the interface to have details about its queues). With the 
current definition/implementation of the queue depth, if only one queue 
is congested, you're able to know it. Which doesn't necessarily mean 
that all queues are full, but this one is and there might be something 
going on. And this is something operators might want to be able to 
detect precisely, for a lot of use cases depending on the situation. On 
the contrary, if all queues are full, then you could deduce that as well 
for each queue separately, as soon as a packet is assigned to it. So I 
think that with "queue depth = sum(queues)", you don't have details and 
you're not able to detect a single queue congestion, while with "queue 
depth = queue" you could detect both. One might argue that it's fine to 
only have the aggregation in some situation. I'd say that we might need 
both, actually. Which is technically possible (even though expensive, as 
Eric mentioned) thanks to the way it is specified by the RFC, where some 
freedom was intentionally given. I could come up with a solution for that.

>>> Also, the draft says:
>>>
>>> 5.4.2.7.  queue depth
>>>
>>>      The "queue depth" field is a 4-octet unsigned integer field.  This
>>>      field indicates the current length of the egress interface queue of
>>>      the interface from where the packet is forwarded out.  The queue
>>>      depth is expressed as the current amount of memory buffers used by
>>>      the queue (a packet could consume one or more memory buffers,
>>>      depending on its size).
>>>
>>>       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
>>>      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>>>      |                       queue depth                             |
>>>      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>>>
>>>
>>> It is relatively clear that the egress interface is the aggregate
>>> egress interface,
>>> not a subset of the interface.
>>
>> Correct, even though the definition of an interface in RFC 9197 is quite
>> abstract (see the end of section 4.4.2.2: "[...] could represent a
>> physical interface, a virtual or logical interface, or even a queue").
>>
>>> If you have 32 TX queues on a NIC, all of them being backlogged (line rate),
>>> sensing the queue length of one of the queues would give a 97% error
>>> on the measure.
>>
>> Why would it? Not sure I get your idea based on that example.
> 
> Because it measures the length of a single queue not the device.

Yep, I figured that out after the off-list discussion we've had with Eric.

So my plan would be, if you all agree, to correct and repost this patch
to fix the NULL qdisc issue. Then, I'd come up with a solution to allow
both (with and without aggregation of queues) and post it on net-next.
But again, if the consensus is to revert this patch (which would bring
no benefit IMHO), then so be it. Thoughts?
Justin Iurman Dec. 7, 2022, 12:10 p.m. UTC | #10
On 12/7/22 13:07, Justin Iurman wrote:
> On 12/6/22 21:43, Jakub Kicinski wrote:
>> On Mon, 5 Dec 2022 21:44:09 +0100 Justin Iurman wrote:
>>>> Please revert this patch.
>>>>
>>>> Many people use FQ qdisc, where packets are waiting for their Earliest
>>>> Departure Time to be released.
>>>
>>> The IOAM queue depth is a very important value and is already used.
>>
>> Can you say more about the use? What signal do you derive from it?
>> I do track qlen on Meta's servers but haven't found a strong use
>> for it yet (I did for backlog drops but not the qlen itself).
> 
> The specification goal of the queue depth was initially to be able to 
> track the entire path with a detailed view for packets or flows (kind of 
> a zoom on the interface to have details about its queues). With the 
> current definition/implementation of the queue depth, if only one queue 
> is congested, you're able to know it. Which doesn't necessarily mean 
> that all queues are full, but this one is and there might be something 
> going on. And this is something operators might want to be able to 
> detect precisely, for a lot of use cases depending on the situation. On 
> the contrary, if all queues are full, then you could deduce that as well 
> for each queue separately, as soon as a packet is assigned to it. So I 
> think that with "queue depth = sum(queues)", you don't have details and 
> you're not able to detect a single queue congestion, while with "queue 
> depth = queue" you could detect both. One might argue that it's fine to 
> only have the aggregation in some situation. I'd say that we might need 
> both, actually. Which is technically possible (even though expensive, as 
> Eric mentioned) thanks to the way it is specified by the RFC, where some 
> freedom was intentionally given. I could come up with a solution for that.
> 
>>>> Also, the draft says:
>>>>
>>>> 5.4.2.7.  queue depth
>>>>
>>>>      The "queue depth" field is a 4-octet unsigned integer field.  This
>>>>      field indicates the current length of the egress interface 
>>>> queue of
>>>>      the interface from where the packet is forwarded out.  The queue
>>>>      depth is expressed as the current amount of memory buffers used by
>>>>      the queue (a packet could consume one or more memory buffers,
>>>>      depending on its size).
>>>>
>>>>       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
>>>>      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>>>>      |                       queue depth                             |
>>>>      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>>>>
>>>>
>>>> It is relatively clear that the egress interface is the aggregate
>>>> egress interface,
>>>> not a subset of the interface.
>>>
>>> Correct, even though the definition of an interface in RFC 9197 is quite
>>> abstract (see the end of section 4.4.2.2: "[...] could represent a
>>> physical interface, a virtual or logical interface, or even a queue").
>>>
>>>> If you have 32 TX queues on a NIC, all of them being backlogged 
>>>> (line rate),
>>>> sensing the queue length of one of the queues would give a 97% error
>>>> on the measure.
>>>
>>> Why would it? Not sure I get your idea based on that example.
>>
>> Because it measures the length of a single queue not the device.
> 
> Yep, I figured that out after the off-list discussion we've had with Eric.
> 
> So my plan would be, if you all agree, to correct and repost this patch
> to fix the NULL qdisc issue. Then, I'd come up with a solution to allow
> both (with and without aggregation of queues) and post it on net-next.
> But again, if the consensus is to revert this patch (which would bring
> no benefit IMHO), then so be it. Thoughts?

And by "revert this patch", I meant revert b63c5478e9cb, sorry.
Jakub Kicinski Dec. 8, 2022, 12:04 a.m. UTC | #11
On Wed, 7 Dec 2022 13:07:18 +0100 Justin Iurman wrote:
> > Can you say more about the use? What signal do you derive from it?
> > I do track qlen on Meta's servers but haven't found a strong use
> > for it yet (I did for backlog drops but not the qlen itself).  
> 
> The specification goal of the queue depth was initially to be able to 
> track the entire path with a detailed view for packets or flows (kind of 
> a zoom on the interface to have details about its queues). With the 
> current definition/implementation of the queue depth, if only one queue 
> is congested, you're able to know it. Which doesn't necessarily mean 
> that all queues are full, but this one is and there might be something 
> going on. And this is something operators might want to be able to 
> detect precisely, for a lot of use cases depending on the situation. On 
> the contrary, if all queues are full, then you could deduce that as well 
> for each queue separately, as soon as a packet is assigned to it. So I 
> think that with "queue depth = sum(queues)", you don't have details and 
> you're not able to detect a single queue congestion, while with "queue 
> depth = queue" you could detect both. One might argue that it's fine to 
> only have the aggregation in some situation. I'd say that we might need 
> both, actually. Which is technically possible (even though expensive, as 
> Eric mentioned) thanks to the way it is specified by the RFC, where some 
> freedom was intentionally given. I could come up with a solution for that.

Understood. My hope was that by now there was some in-field experience
which could help us judge how much signal one can derive from a single
queue, or a user who could attest to it.

> > Because it measures the length of a single queue not the device.  
> 
> Yep, I figured that out after the off-list discussion we've had with Eric.
> 
> So my plan would be, if you all agree with, to correct and repost this 
> patch to fix the NULL qdisc issue. Then, I'd come with a solution to 
> allow both (with and without aggregation of queues) and post it on 
> net-next. But again, if the consensus is to revert this patch (which I 
> think would bring no benefit IMHO), then so be it. Thoughts?

To summarize - we have reservations about correctness and about 
breaking layering (ip6 calling down to net/sched).

You can stick to your approach, repost, and see if any of the other
maintainers is willing to pick this up (i.e., one who missed this nack).
If you ask for my opinion, I'll side with Eric.
Justin Iurman Dec. 8, 2022, 12:47 p.m. UTC | #12
On 12/8/22 01:04, Jakub Kicinski wrote:
> On Wed, 7 Dec 2022 13:07:18 +0100 Justin Iurman wrote:
>>> Can you say more about the use? What signal do you derive from it?
>>> I do track qlen on Meta's servers but haven't found a strong use
>>> for it yet (I did for backlog drops but not the qlen itself).
>>
>> The specification goal of the queue depth was initially to be able to
>> track the entire path with a detailed view for packets or flows (kind of
>> a zoom on the interface to have details about its queues). With the
>> current definition/implementation of the queue depth, if only one queue
>> is congested, you're able to know it. Which doesn't necessarily mean
>> that all queues are full, but this one is and there might be something
>> going on. And this is something operators might want to be able to
>> detect precisely, for a lot of use cases depending on the situation. On
>> the contrary, if all queues are full, then you could deduce that as well
>> for each queue separately, as soon as a packet is assigned to it. So I
>> think that with "queue depth = sum(queues)", you don't have details and
>> you're not able to detect a single queue congestion, while with "queue
>> depth = queue" you could detect both. One might argue that it's fine to
>> only have the aggregation in some situation. I'd say that we might need
>> both, actually. Which is technically possible (even though expensive, as
>> Eric mentioned) thanks to the way it is specified by the RFC, where some
>> freedom was intentionally given. I could come up with a solution for that.
> 
> Understood. My hope was that by now there was some in-field experience
> which could help us judge how much signal one can derive from a single
> queue, or a user who could attest to it.

I asked the people concerned. I'll get back to you.

>>> Because it measures the length of a single queue not the device.
>>
>> Yep, I figured that out after the off-list discussion we've had with Eric.
>>
>> So my plan would be, if you all agree, to correct and repost this patch
>> to fix the NULL qdisc issue. Then, I'd come up with a solution to allow
>> both (with and without aggregation of queues) and post it on net-next.
>> But again, if the consensus is to revert this patch (which would bring
>> no benefit IMHO), then so be it. Thoughts?
> 
> To summarize - we have reservations about correctness and about

Ack. And this is exactly why I proposed the alternative of having both 
solutions implemented (i.e., a specific queue depth or an aggregation of 
queues per interface).
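
As a rough illustration of that second mode, an aggregated per-interface
backlog could be computed along these lines (hypothetical helper, untested,
assumed to run under rcu_read_lock(); it deliberately ignores the locking
and layering objections raised in this thread):

/* Hypothetical sketch, not from any posted patch: sum the backlog of
 * every TX queue of the egress device instead of sampling one queue.
 */
static u32 ioam6_dev_backlog(const struct net_device *dev)
{
	u32 qlen, backlog, total = 0;
	struct Qdisc *qdisc;
	unsigned int i;

	for (i = 0; i < dev->real_num_tx_queues; i++) {
		qdisc = rcu_dereference(netdev_get_tx_queue(dev, i)->qdisc);
		if (!qdisc)
			continue;

		qdisc_qstats_qlen_backlog(qdisc, &qlen, &backlog);
		total += backlog;
	}

	return total;
}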

> breaking layering (ip6 calling down to net/sched).

Indeed. Funny enough, this one's going to be hard to beat. How
frustrating when the value to be retrieved is there, but cannot be used.
Especially when it's probably one of the most important metrics in
IOAM. I guess it's the price to pay when dealing with telemetry across
multiple layers. Anyway...

> You can stick to your approach, repost, and see if any of the other
> maintainers is willing to pick this up (i.e., one who missed this nack).
> If you ask for my opinion, I'll side with Eric.

Got it, thanks.
diff mbox series

Patch

diff --git a/net/ipv6/ioam6.c b/net/ipv6/ioam6.c
index 571f0e4d9cf3..2472a8a043c4 100644
--- a/net/ipv6/ioam6.c
+++ b/net/ipv6/ioam6.c
@@ -727,10 +727,13 @@  static void __ioam6_fill_trace_data(struct sk_buff *skb,
 			*(__be32 *)data = cpu_to_be32(IOAM6_U32_UNAVAILABLE);
 		} else {
 			queue = skb_get_tx_queue(skb_dst(skb)->dev, skb);
-			qdisc = rcu_dereference(queue->qdisc);
-			qdisc_qstats_qlen_backlog(qdisc, &qlen, &backlog);
-
-			*(__be32 *)data = cpu_to_be32(backlog);
+			if (!queue->qdisc) {
+				*(__be32 *)data = cpu_to_be32(IOAM6_U32_UNAVAILABLE);
+			} else {
+				qdisc = rcu_dereference(queue->qdisc);
+				qdisc_qstats_qlen_backlog(qdisc, &qlen, &backlog);
+				*(__be32 *)data = cpu_to_be32(backlog);
+			}
 		}
 		data += sizeof(__be32);
 	}
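
For completeness, a possible non-racy variant of the hunk above (a sketch
only, not what was actually posted) would dereference queue->qdisc once
under RCU and test the result, instead of testing queue->qdisc and then
dereferencing it again:

	queue = skb_get_tx_queue(skb_dst(skb)->dev, skb);
	qdisc = rcu_dereference(queue->qdisc);
	/* Sketch: test the dereferenced pointer rather than queue->qdisc,
	 * so the check and the use refer to the same qdisc.
	 */
	if (!qdisc) {
		*(__be32 *)data = cpu_to_be32(IOAM6_U32_UNAVAILABLE);
	} else {
		qdisc_qstats_qlen_backlog(qdisc, &qlen, &backlog);
		*(__be32 *)data = cpu_to_be32(backlog);
	}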