diff mbox series

[net-next,v2,4/5] octeontx2-pf: Add devlink support to configure TL1 RR_PRIO

Message ID 20230118105107.9516-5-hkelam@marvell.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series octeontx2-pf: HTB offload support | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers success CCed 11 of 11 maintainers
netdev/build_clang success Errors and warnings before: 3 this patch: 3
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 23 this patch: 23
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 281 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Hariprasad Kelam Jan. 18, 2023, 10:51 a.m. UTC
All VFs and PF netdev shares same TL1 schedular, each interface
PF or VF will have different TL2 schedulars having same parent
TL1. The TL1 RR_PRIO value is static and PF/VFs use the same value
to configure its TL2 node priority in case of DWRR children.

This patch adds support to configure TL1 RR_PRIO value using devlink.
The TL1 RR_PRIO can be configured for each PF. The VFs are not allowed
to configure TL1 RR_PRIO value. The VFs can get the RR_PRIO value from
the mailbox NIX_TXSCH_ALLOC response parameter aggr_lvl_rr_prio.

Signed-off-by: Hariprasad Kelam <hkelam@marvell.com>
---
 .../ethernet/marvell/octeontx2/af/common.h    |  2 +-
 .../net/ethernet/marvell/octeontx2/af/mbox.h  |  9 ++-
 .../net/ethernet/marvell/octeontx2/af/rvu.c   | 15 ++++
 .../net/ethernet/marvell/octeontx2/af/rvu.h   |  1 +
 .../ethernet/marvell/octeontx2/af/rvu_nix.c   | 39 +++++++++-
 .../marvell/octeontx2/nic/otx2_common.c       |  9 ++-
 .../marvell/octeontx2/nic/otx2_common.h       |  1 +
 .../marvell/octeontx2/nic/otx2_devlink.c      | 78 +++++++++++++++++++
 8 files changed, 147 insertions(+), 7 deletions(-)

Comments

Maxim Mikityanskiy Jan. 18, 2023, 8:37 p.m. UTC | #1
On Wed, Jan 18, 2023 at 04:21:06PM +0530, Hariprasad Kelam wrote:
> All VFs and PF netdev shares same TL1 schedular, each interface
> PF or VF will have different TL2 schedulars having same parent
> TL1. The TL1 RR_PRIO value is static and PF/VFs use the same value
> to configure its TL2 node priority in case of DWRR children.
> 
> This patch adds support to configure TL1 RR_PRIO value using devlink.
> The TL1 RR_PRIO can be configured for each PF. The VFs are not allowed
> to configure TL1 RR_PRIO value. The VFs can get the RR_PRIO value from
> the mailbox NIX_TXSCH_ALLOC response parameter aggr_lvl_rr_prio.

I asked this question under v1, but didn't get an answer, could you shed
some light?

"Could you please elaborate how these priorities of Transmit Levels are
related to HTB priorities? I don't seem to understand why something has
to be configured with devlink in addition to HTB.

Also, what do MDQ and SMQ abbreviations mean?"

BTW, why did you remove the paragraphs with an example and a limitation?
I think they are pretty useful.

Another question unanswered under v1 was:

"Is there any technical difficulty or hardware limitation preventing from
implementing modifications?" (TC_HTB_NODE_MODIFY)
Hariprasad Kelam Jan. 20, 2023, 8:50 a.m. UTC | #2
On Wed, Jan 18, 2023 at 04:21:06PM +0530, Hariprasad Kelam wrote:
> All VFs and PF netdev shares same TL1 schedular, each interface PF or 
> VF will have different TL2 schedulars having same parent TL1. The TL1 
> RR_PRIO value is static and PF/VFs use the same value to configure its 
> TL2 node priority in case of DWRR children.
> 
> This patch adds support to configure TL1 RR_PRIO value using devlink.
> The TL1 RR_PRIO can be configured for each PF. The VFs are not allowed 
> to configure TL1 RR_PRIO value. The VFs can get the RR_PRIO value from 
> the mailbox NIX_TXSCH_ALLOC response parameter aggr_lvl_rr_prio.

I asked this question under v1, but didn't get an answer, could you shed some light?

"Could you please elaborate how these priorities of Transmit Levels are related to HTB priorities? I don't seem to understand why something has to be configured with devlink in addition to HTB.

SMQ (send meta-descriptor queue) and MDQ (meta-descriptor queue) are the first transmit levels.
Each send queue is mapped with SMQ.
 
As mentioned in cover letter, each egress packet needs to traverse all transmit levels starting from TL5 to TL1.
This applies to non-QOS Send queues as well.
 
                       SMQ/MDQ --> TL4 -->TL3 -->TL2 -->TL1

By default non QOS queues use a default hierarchy  with round robin priority. 
To avoid conflict with QOS tree priorities, with devlink user can choose round-robin priority before Qos tree formation.


BTW, why did you remove the paragraphs with an example and a limitation?
I think they are pretty useful.

Ok , removed them accidentally will correct in the next version.

Another question unanswered under v1 was:

"Is there any technical difficulty or hardware limitation preventing from implementing modifications?" (TC_HTB_NODE_MODIFY)

There is no hardware limitation, we are currently implementing it.  once it's implemented we will submit for review.
Maxim Mikityanskiy Jan. 20, 2023, 1:37 p.m. UTC | #3
On Fri, Jan 20, 2023 at 08:50:16AM +0000, Hariprasad Kelam wrote:
> 
> On Wed, Jan 18, 2023 at 04:21:06PM +0530, Hariprasad Kelam wrote:
> > All VFs and PF netdev shares same TL1 schedular, each interface PF or 
> > VF will have different TL2 schedulars having same parent TL1. The TL1 
> > RR_PRIO value is static and PF/VFs use the same value to configure its 
> > TL2 node priority in case of DWRR children.
> > 
> > This patch adds support to configure TL1 RR_PRIO value using devlink.
> > The TL1 RR_PRIO can be configured for each PF. The VFs are not allowed 
> > to configure TL1 RR_PRIO value. The VFs can get the RR_PRIO value from 
> > the mailbox NIX_TXSCH_ALLOC response parameter aggr_lvl_rr_prio.
> 
> I asked this question under v1, but didn't get an answer, could you shed some light?
> 
> "Could you please elaborate how these priorities of Transmit Levels are related to HTB priorities? I don't seem to understand why something has to be configured with devlink in addition to HTB.
> 
> SMQ (send meta-descriptor queue) and MDQ (meta-descriptor queue) are the first transmit levels.
> Each send queue is mapped with SMQ.
>  
> As mentioned in cover letter, each egress packet needs to traverse all transmit levels starting from TL5 to TL1.

Yeah, I saw that, just some details about your hardware which might be
obvious to you aren't so clear to me...

Do these transmit levels map to "layers" of HTB hierarchy? Does it look
like this, or is my understanding completely wrong?

TL1                 [HTB root node]
                   /               \
TL2          [HTB node]         [HTB node]
            /          \             |
TL3    [HTB node]  [HTB node]   [HTB node]
...                       ...


> This applies to non-QOS Send queues as well.
>  
>                        SMQ/MDQ --> TL4 -->TL3 -->TL2 -->TL1
> 
> By default non QOS queues use a default hierarchy  with round robin priority. 
> To avoid conflict with QOS tree priorities, with devlink user can choose round-robin priority before Qos tree formation.

So, this priority that you set with devlink is basically a weight of
unclassified (default) traffic for round robin between unclassified and
classified traffic, right? I.e. you have two hierarchies (one for HTB,
another for non-QoS queue), and you do DWRR between them, according to
this priority?

> BTW, why did you remove the paragraphs with an example and a limitation?
> I think they are pretty useful.
> 
> Ok , removed them accidentally will correct in the next version.
> 
> Another question unanswered under v1 was:
> 
> "Is there any technical difficulty or hardware limitation preventing from implementing modifications?" (TC_HTB_NODE_MODIFY)
> 
> There is no hardware limitation, we are currently implementing it.  once it's implemented we will submit for review.

Great, that's nice to hear, looking forward to it.
Hariprasad Kelam Jan. 20, 2023, 5:03 p.m. UTC | #4
On Fri, Jan 20, 2023 at 08:50:16AM +0000, Hariprasad Kelam wrote:
> 
> On Wed, Jan 18, 2023 at 04:21:06PM +0530, Hariprasad Kelam wrote:
> > All VFs and PF netdev shares same TL1 schedular, each interface PF 
> > or VF will have different TL2 schedulars having same parent TL1. The 
> > TL1 RR_PRIO value is static and PF/VFs use the same value to 
> > configure its
> > TL2 node priority in case of DWRR children.
> > 
> > This patch adds support to configure TL1 RR_PRIO value using devlink.
> > The TL1 RR_PRIO can be configured for each PF. The VFs are not 
> > allowed to configure TL1 RR_PRIO value. The VFs can get the RR_PRIO 
> > value from the mailbox NIX_TXSCH_ALLOC response parameter aggr_lvl_rr_prio.
> 
> I asked this question under v1, but didn't get an answer, could you shed some light?
> 
> "Could you please elaborate how these priorities of Transmit Levels are related to HTB priorities? I don't seem to understand why something has to be configured with devlink in addition to HTB.
> 
> SMQ (send meta-descriptor queue) and MDQ (meta-descriptor queue) are the first transmit levels.
> Each send queue is mapped with SMQ.
>  
> As mentioned in cover letter, each egress packet needs to traverse all transmit levels starting from TL5 to TL1.

Yeah, I saw that, just some details about your hardware which might be obvious to you aren't so clear to me...

Do these transmit levels map to "layers" of HTB hierarchy? Does it look like this, or is my understanding completely wrong?

TL1                 [HTB root node]
                   /               \
TL2          [HTB node]         [HTB node]
            /          \             |
TL3    [HTB node]  [HTB node]   [HTB node]
...                       ...

Transmit levels to HTB mapping is correct.



> This applies to non-QOS Send queues as well.
>  
>                        SMQ/MDQ --> TL4 -->TL3 -->TL2 -->TL1
> 
> By default non QOS queues use a default hierarchy  with round robin priority. 
> To avoid conflict with QOS tree priorities, with devlink user can choose round-robin priority before Qos tree formation.

So, this priority that you set with devlink is basically a weight of unclassified (default) traffic for round robin between unclassified and classified traffic, right? I.e. you have two hierarchies (one for HTB, another for non-QoS queue), and you do DWRR between them, according to this priority?


 Not exactly, In the given scenario where  multiple vfs are attached to PF netdev.
 each VF unclassified traffic forms a hierarchy and PF also forms a hierarchy for unclassified traffic.
 
Now traffic from these all tress(multiple vfs and PFs) are aggregated at TL1. HW performs DWRR among them since these TL2 queues (belonging to each pf and vf netdevs) will be configured with the same priority by the driver.

Currently, this priority is hard coded. Now we are providing this as a configurable value to the user.

Now if a user adds a HTB node, this will have strict priority at TL2 level since DWRR priority is different this traffic won't be affected by DWRR unclassified traffic.

Thanks,
Hariprasad k







> BTW, why did you remove the paragraphs with an example and a limitation?
> I think they are pretty useful.
> 
> Ok , removed them accidentally will correct in the next version.
> 
> Another question unanswered under v1 was:
> 
> "Is there any technical difficulty or hardware limitation preventing 
> from implementing modifications?" (TC_HTB_NODE_MODIFY)
> 
> There is no hardware limitation, we are currently implementing it.  once it's implemented we will submit for review.

Great, that's nice to hear, looking forward to it.
Maxim Mikityanskiy Jan. 22, 2023, 1:18 p.m. UTC | #5
On Fri, Jan 20, 2023 at 05:03:20PM +0000, Hariprasad Kelam wrote:
> 
> 
> On Fri, Jan 20, 2023 at 08:50:16AM +0000, Hariprasad Kelam wrote:
> > 
> > On Wed, Jan 18, 2023 at 04:21:06PM +0530, Hariprasad Kelam wrote:
> > > All VFs and PF netdev shares same TL1 schedular, each interface PF 
> > > or VF will have different TL2 schedulars having same parent TL1. The 
> > > TL1 RR_PRIO value is static and PF/VFs use the same value to 
> > > configure its
> > > TL2 node priority in case of DWRR children.
> > > 
> > > This patch adds support to configure TL1 RR_PRIO value using devlink.
> > > The TL1 RR_PRIO can be configured for each PF. The VFs are not 
> > > allowed to configure TL1 RR_PRIO value. The VFs can get the RR_PRIO 
> > > value from the mailbox NIX_TXSCH_ALLOC response parameter aggr_lvl_rr_prio.
> > 
> > I asked this question under v1, but didn't get an answer, could you shed some light?
> > 
> > "Could you please elaborate how these priorities of Transmit Levels are related to HTB priorities? I don't seem to understand why something has to be configured with devlink in addition to HTB.
> > 
> > SMQ (send meta-descriptor queue) and MDQ (meta-descriptor queue) are the first transmit levels.
> > Each send queue is mapped with SMQ.
> >  
> > As mentioned in cover letter, each egress packet needs to traverse all transmit levels starting from TL5 to TL1.
> 
> Yeah, I saw that, just some details about your hardware which might be obvious to you aren't so clear to me...
> 
> Do these transmit levels map to "layers" of HTB hierarchy? Does it look like this, or is my understanding completely wrong?
> 
> TL1                 [HTB root node]
>                    /               \
> TL2          [HTB node]         [HTB node]
>             /          \             |
> TL3    [HTB node]  [HTB node]   [HTB node]
> ...                       ...
> 
> Transmit levels to HTB mapping is correct.
> 
> 
> 
> > This applies to non-QOS Send queues as well.
> >  
> >                        SMQ/MDQ --> TL4 -->TL3 -->TL2 -->TL1
> > 
> > By default non QOS queues use a default hierarchy  with round robin priority. 
> > To avoid conflict with QOS tree priorities, with devlink user can choose round-robin priority before Qos tree formation.
> 
> So, this priority that you set with devlink is basically a weight of unclassified (default) traffic for round robin between unclassified and classified traffic, right? I.e. you have two hierarchies (one for HTB, another for non-QoS queue), and you do DWRR between them, according to this priority?
> 
> 
>  Not exactly, In the given scenario where  multiple vfs are attached to PF netdev.
>  each VF unclassified traffic forms a hierarchy and PF also forms a hierarchy for unclassified traffic.
>  
> Now traffic from these all tress(multiple vfs and PFs) are aggregated at TL1. HW performs DWRR among them since these TL2 queues (belonging to each pf and vf netdevs) will be configured with the same priority by the driver.
> 
> Currently, this priority is hard coded. Now we are providing this as a configurable value to the user.
> 
> Now if a user adds a HTB node, this will have strict priority at TL2 level since DWRR priority is different this traffic won't be affected by DWRR unclassified traffic.

So, did I get it right now?

                                     [strict priority**]
                           /---------/                 \-----\
                           |                                 |
                        [DWRR*]                              |
        /---------------/  |   \---------------\             |
        |                  |                   |             |
[ Hierarchy for ]  [ Hierarchy for  ]  [ Hierarchy for  ]    |
[ unclassified  ]  [  unclassified  ]  [  unclassified  ]    |
[traffic from PF]  [traffic from VF1]  [traffic from VF2]    |
[      ***      ]  [      ***       ]  [      ***       ]    |
                                                             |
                                                [HTB hierarchy using]
                                                [  strict priority  ]
                                                [   between nodes   ]

As far as I understand, you set priorities at ***, which affect DWRR
balancing at *, but it's not clear to me how the selection at ** works.
Does the HTB hierarchy have some fixed priority, i.e. the user can set
priority for unclassified traffic to be higher or lower than HTB
traffic?

Please also point me at any inaccuracies in my picture, I really want to
understand the algorithm here, because configuring additional priorities
outside of HTB looks unusual to me.

> 
> Thanks,
> Hariprasad k
Hariprasad Kelam Jan. 23, 2023, 5:03 p.m. UTC | #6
> 
> On Fri, Jan 20, 2023 at 08:50:16AM +0000, Hariprasad Kelam wrote:
> > 
> > On Wed, Jan 18, 2023 at 04:21:06PM +0530, Hariprasad Kelam wrote:
> > > All VFs and PF netdev shares same TL1 schedular, each interface PF 
> > > or VF will have different TL2 schedulars having same parent TL1. 
> > > The
> > > TL1 RR_PRIO value is static and PF/VFs use the same value to 
> > > configure its
> > > TL2 node priority in case of DWRR children.
> > > 
> > > This patch adds support to configure TL1 RR_PRIO value using devlink.
> > > The TL1 RR_PRIO can be configured for each PF. The VFs are not 
> > > allowed to configure TL1 RR_PRIO value. The VFs can get the 
> > > RR_PRIO value from the mailbox NIX_TXSCH_ALLOC response parameter aggr_lvl_rr_prio.
> > 
> > I asked this question under v1, but didn't get an answer, could you shed some light?
> > 
> > "Could you please elaborate how these priorities of Transmit Levels are related to HTB priorities? I don't seem to understand why something has to be configured with devlink in addition to HTB.
> > 
> > SMQ (send meta-descriptor queue) and MDQ (meta-descriptor queue) are the first transmit levels.
> > Each send queue is mapped with SMQ.
> >  
> > As mentioned in cover letter, each egress packet needs to traverse all transmit levels starting from TL5 to TL1.
> 
> Yeah, I saw that, just some details about your hardware which might be obvious to you aren't so clear to me...
> 
> Do these transmit levels map to "layers" of HTB hierarchy? Does it look like this, or is my understanding completely wrong?
> 
> TL1                 [HTB root node]
>                    /               \
> TL2          [HTB node]         [HTB node]
>             /          \             |
> TL3    [HTB node]  [HTB node]   [HTB node]
> ...                       ...
> 
> Transmit levels to HTB mapping is correct.
> 
> 
> 
> > This applies to non-QOS Send queues as well.
> >  
> >                        SMQ/MDQ --> TL4 -->TL3 -->TL2 -->TL1
> > 
> > By default non QOS queues use a default hierarchy  with round robin priority. 
> > To avoid conflict with QOS tree priorities, with devlink user can choose round-robin priority before Qos tree formation.
> 
> So, this priority that you set with devlink is basically a weight of unclassified (default) traffic for round robin between unclassified and classified traffic, right? I.e. you have two hierarchies (one for HTB, another for non-QoS queue), and you do DWRR between them, according to this priority?
> 
> 
>  Not exactly, In the given scenario where  multiple vfs are attached to PF netdev.
>  each VF unclassified traffic forms a hierarchy and PF also forms a hierarchy for unclassified traffic.
>  
> Now traffic from these all tress(multiple vfs and PFs) are aggregated at TL1. HW performs DWRR among them since these TL2 queues (belonging to each pf and vf netdevs) will be configured with the same priority by the driver.
> 
> Currently, this priority is hard coded. Now we are providing this as a configurable value to the user.
> 
> Now if a user adds a HTB node, this will have strict priority at TL2 level since DWRR priority is different this traffic won't be affected by DWRR unclassified traffic.

So, did I get it right now?

                                     [strict priority**]
                           /---------/                 \-----\
                           |                                 |
                        [DWRR*]                              |
        /---------------/  |   \---------------\             |
        |                  |                   |             |
[ Hierarchy for ]  [ Hierarchy for  ]  [ Hierarchy for  ]    |
[ unclassified  ]  [  unclassified  ]  [  unclassified  ]    |
[traffic from PF]  [traffic from VF1]  [traffic from VF2]    |
[      ***      ]  [      ***       ]  [      ***       ]    |
                                                             |
                                                [HTB hierarchy using]
                                                [  strict priority  ]
                                                [   between nodes   ]



        Adjusted picture

                        /--------------------------------------------------------------------------------/       Transmit level 1
                            |                                                                                      | 
                        [DWRR*] [ priority 6 ]                                      [strict priority** ]      [ priority 0 ]  Transmit level 2
        /---------------/  |   \-----------------------------------\                           |
             |                   |                                   |                                           |
[ Hierarchy for ]  [ Hierarchy for  ]  [ Hierarchy for  ]                 [ Hierarchy for  ] 
[ unclassified  ]  [  unclassified  ]  [  unclassified  ]                      [ strict priority  ]
[traffic from PF]  [traffic from VF1]  [traffic from VF2]   
[      ***      ]  [      ***       ]  [      ***       ]    
                                                


As far as I understand, you set priorities at ***, which affect DWRR balancing at *, but it's not clear to me how the selection at ** works.
Does the HTB hierarchy have some fixed priority,  ?

Hardware supports priorities from 0 to 7. lower value has high priority.
nodes having the same priority are treated as DWRR childs.

i.e. the user can set priority for unclassified traffic to be higher or lower than HTB traffic?

Yes its user configurable, unclassified traffic priority can be higher or lower than HTB traffic if a user wishes to configure it.

Please also point me at any inaccuracies in my picture, I really want to understand the algorithm here, because configuring additional priorities outside of HTB looks unusual to me.

  Please check the adjusted picture. Let us assume a user has set the priority as 6 for DWRR (unclassified traffic)  and  HTB strict priority as 0.
Once all traffic reaches  TL2,  Now hardware algorithm first pics HTB strict priority and processes DWRR later according to their priorities.

> 
> Thanks,
> Hariprasad k
Maxim Mikityanskiy Jan. 23, 2023, 8:05 p.m. UTC | #7
On Mon, Jan 23, 2023 at 05:03:01PM +0000, Hariprasad Kelam wrote:
> 
> 
> > 
> > On Fri, Jan 20, 2023 at 08:50:16AM +0000, Hariprasad Kelam wrote:
> > > 
> > > On Wed, Jan 18, 2023 at 04:21:06PM +0530, Hariprasad Kelam wrote:
> > > > All VFs and PF netdev shares same TL1 schedular, each interface PF 
> > > > or VF will have different TL2 schedulars having same parent TL1. 
> > > > The
> > > > TL1 RR_PRIO value is static and PF/VFs use the same value to 
> > > > configure its
> > > > TL2 node priority in case of DWRR children.
> > > > 
> > > > This patch adds support to configure TL1 RR_PRIO value using devlink.
> > > > The TL1 RR_PRIO can be configured for each PF. The VFs are not 
> > > > allowed to configure TL1 RR_PRIO value. The VFs can get the 
> > > > RR_PRIO value from the mailbox NIX_TXSCH_ALLOC response parameter aggr_lvl_rr_prio.
> > > 
> > > I asked this question under v1, but didn't get an answer, could you shed some light?
> > > 
> > > "Could you please elaborate how these priorities of Transmit Levels are related to HTB priorities? I don't seem to understand why something has to be configured with devlink in addition to HTB.
> > > 
> > > SMQ (send meta-descriptor queue) and MDQ (meta-descriptor queue) are the first transmit levels.
> > > Each send queue is mapped with SMQ.
> > >  
> > > As mentioned in cover letter, each egress packet needs to traverse all transmit levels starting from TL5 to TL1.
> > 
> > Yeah, I saw that, just some details about your hardware which might be obvious to you aren't so clear to me...
> > 
> > Do these transmit levels map to "layers" of HTB hierarchy? Does it look like this, or is my understanding completely wrong?
> > 
> > TL1                 [HTB root node]
> >                    /               \
> > TL2          [HTB node]         [HTB node]
> >             /          \             |
> > TL3    [HTB node]  [HTB node]   [HTB node]
> > ...                       ...
> > 
> > Transmit levels to HTB mapping is correct.
> > 
> > 
> > 
> > > This applies to non-QOS Send queues as well.
> > >  
> > >                        SMQ/MDQ --> TL4 -->TL3 -->TL2 -->TL1
> > > 
> > > By default non QOS queues use a default hierarchy  with round robin priority. 
> > > To avoid conflict with QOS tree priorities, with devlink user can choose round-robin priority before Qos tree formation.
> > 
> > So, this priority that you set with devlink is basically a weight of unclassified (default) traffic for round robin between unclassified and classified traffic, right? I.e. you have two hierarchies (one for HTB, another for non-QoS queue), and you do DWRR between them, according to this priority?
> > 
> > 
> >  Not exactly, In the given scenario where  multiple vfs are attached to PF netdev.
> >  each VF unclassified traffic forms a hierarchy and PF also forms a hierarchy for unclassified traffic.
> >  
> > Now traffic from these all tress(multiple vfs and PFs) are aggregated at TL1. HW performs DWRR among them since these TL2 queues (belonging to each pf and vf netdevs) will be configured with the same priority by the driver.
> > 
> > Currently, this priority is hard coded. Now we are providing this as a configurable value to the user.
> > 
> > Now if a user adds a HTB node, this will have strict priority at TL2 level since DWRR priority is different this traffic won't be affected by DWRR unclassified traffic.
> 
> So, did I get it right now?
> 
>                                      [strict priority**]
>                            /---------/                 \-----\
>                            |                                 |
>                         [DWRR*]                              |
>         /---------------/  |   \---------------\             |
>         |                  |                   |             |
> [ Hierarchy for ]  [ Hierarchy for  ]  [ Hierarchy for  ]    |
> [ unclassified  ]  [  unclassified  ]  [  unclassified  ]    |
> [traffic from PF]  [traffic from VF1]  [traffic from VF2]    |
> [      ***      ]  [      ***       ]  [      ***       ]    |
>                                                              |
>                                                 [HTB hierarchy using]
>                                                 [  strict priority  ]
>                                                 [   between nodes   ]
> 
> 
> 
>         Adjusted picture
> 
>                         /--------------------------------------------------------------------------------/       Transmit level 1
>                             |                                                                                      | 
>                         [DWRR*] [ priority 6 ]                                      [strict priority** ]      [ priority 0 ]  Transmit level 2
>         /---------------/  |   \-----------------------------------\                           |
>              |                   |                                   |                                           |
> [ Hierarchy for ]  [ Hierarchy for  ]  [ Hierarchy for  ]                 [ Hierarchy for  ] 
> [ unclassified  ]  [  unclassified  ]  [  unclassified  ]                      [ strict priority  ]
> [traffic from PF]  [traffic from VF1]  [traffic from VF2]   
> [      ***      ]  [      ***       ]  [      ***       ]    
>                                                 
> 
> 
> As far as I understand, you set priorities at ***, which affect DWRR balancing at *, but it's not clear to me how the selection at ** works.
> Does the HTB hierarchy have some fixed priority,  ?
> 
> Hardware supports priorities from 0 to 7. lower value has high priority.
> nodes having the same priority are treated as DWRR childs.
> 
> i.e. the user can set priority for unclassified traffic to be higher or lower than HTB traffic?
> 
> Yes its user configurable, unclassified traffic priority can be higher or lower than HTB traffic if a user wishes to configure it.
> 
> Please also point me at any inaccuracies in my picture, I really want to understand the algorithm here, because configuring additional priorities outside of HTB looks unusual to me.
> 
>   Please check the adjusted picture. Let us assume a user has set the priority as 6 for DWRR (unclassified traffic)  and  HTB strict priority as 0.
> Once all traffic reaches  TL2,  Now hardware algorithm first pics HTB strict priority and processes DWRR later according to their priorities.

OK, I seem to get it now, thanks for the explanation!

How do you set the priority for HTB, though? You mentioned this command
to set priority of unclassified traffic:

devlink -p dev param set pci/0002:04:00.0 name tl1_rr_prio value 6 \
cmode runtime

But what is the command to change priority for HTB?

What bothers me about using devlink to configure HTB priority is:

1. Software HTB implementation doesn't have this functionality, and it
always prioritizes unclassified traffic. As far as I understand, the
rule for tc stuff is that all features must have a reference
implementation in software.

2. Adding a flag (prefer unclassified vs prefer classified) to HTB
itself may be not straightforward, because your devlink command has a
second purpose of setting priorities between PFs/VFs, and it may
conflict with the HTB flag.

> 
> > 
> > Thanks,
> > Hariprasad k
Jakub Kicinski Jan. 23, 2023, 10:44 p.m. UTC | #8
On Mon, 23 Jan 2023 17:03:01 +0000 Hariprasad Kelam wrote:
> So, did I get it right now?

Hariprasad, please fix your email setup.
It's impossible to read your messages :|
There is no marking of quotes and the responses are misaligned.
Jakub Kicinski Jan. 23, 2023, 10:45 p.m. UTC | #9
On Mon, 23 Jan 2023 22:05:58 +0200 Maxim Mikityanskiy wrote:
> OK, I seem to get it now, thanks for the explanation!
> 
> How do you set the priority for HTB, though? You mentioned this command
> to set priority of unclassified traffic:
> 
> devlink -p dev param set pci/0002:04:00.0 name tl1_rr_prio value 6 \
> cmode runtime
> 
> But what is the command to change priority for HTB?
> 
> What bothers me about using devlink to configure HTB priority is:
> 
> 1. Software HTB implementation doesn't have this functionality, and it
> always prioritizes unclassified traffic. As far as I understand, the
> rule for tc stuff is that all features must have a reference
> implementation in software.
> 
> 2. Adding a flag (prefer unclassified vs prefer classified) to HTB
> itself may be not straightforward, because your devlink command has a
> second purpose of setting priorities between PFs/VFs, and it may
> conflict with the HTB flag.

If there is a two-stage hierarchy the lower level should be controlled
by devlink-rate, no?
Maxim Mikityanskiy Jan. 23, 2023, 11:01 p.m. UTC | #10
On Mon, Jan 23, 2023 at 02:45:48PM -0800, Jakub Kicinski wrote:
> On Mon, 23 Jan 2023 22:05:58 +0200 Maxim Mikityanskiy wrote:
> > OK, I seem to get it now, thanks for the explanation!
> > 
> > How do you set the priority for HTB, though? You mentioned this command
> > to set priority of unclassified traffic:
> > 
> > devlink -p dev param set pci/0002:04:00.0 name tl1_rr_prio value 6 \
> > cmode runtime
> > 
> > But what is the command to change priority for HTB?
> > 
> > What bothers me about using devlink to configure HTB priority is:
> > 
> > 1. Software HTB implementation doesn't have this functionality, and it
> > always prioritizes unclassified traffic. As far as I understand, the
> > rule for tc stuff is that all features must have a reference
> > implementation in software.
> > 
> > 2. Adding a flag (prefer unclassified vs prefer classified) to HTB
> > itself may be not straightforward, because your devlink command has a
> > second purpose of setting priorities between PFs/VFs, and it may
> > conflict with the HTB flag.
> 
> If there is a two-stage hierarchy the lower level should be controlled
> by devlink-rate, no?

From the last picture by Hariprasad, I understood that the user sets all
priorities (for unclassified traffic per PF and VF, and for HTB traffic)
on the same TL2 level, i.e. it's not two-stage. (Maybe I got it all
wrong again?)

I asked about the command to change the HTB priority, cause the
parameters aren't easily guessed, but I assume it's also devlink (i.e.
driver-specific).

If there were two levels (the upper level chooses who goes first: HTB or
unclassified, and the lower level sets priorities per PF and VF for
unclassified traffic), that would be more straightforward to solve: the
upper level should be controlled by a new HTB parameter, and the lower
level is device-specific, thus it goes to devlink.
Hariprasad Kelam Jan. 24, 2023, 11:49 a.m. UTC | #11
> On Mon, Jan 23, 2023 at 02:45:48PM -0800, Jakub Kicinski wrote:
> > On Mon, 23 Jan 2023 22:05:58 +0200 Maxim Mikityanskiy wrote:
> > > OK, I seem to get it now, thanks for the explanation!
> > >
> > > How do you set the priority for HTB, though? You mentioned this
> > > command to set priority of unclassified traffic:
> > >
> > > devlink -p dev param set pci/0002:04:00.0 name tl1_rr_prio value 6 \
> > > cmode runtime
> > >
> > > But what is the command to change priority for HTB?
> > >
> > > What bothers me about using devlink to configure HTB priority is:
> > >
> > > 1. Software HTB implementation doesn't have this functionality, and
> > > it always prioritizes unclassified traffic. As far as I understand,
> > > the rule for tc stuff is that all features must have a reference
> > > implementation in software.
> > >
> > > 2. Adding a flag (prefer unclassified vs prefer classified) to HTB
> > > itself may be not straightforward, because your devlink command has
> > > a second purpose of setting priorities between PFs/VFs, and it may
> > > conflict with the HTB flag.
> >
> > If there is a two-stage hierarchy the lower level should be controlled
> > by devlink-rate, no?
> 
> From the last picture by Hariprasad, I understood that the user sets all
> priorities (for unclassified traffic per PF and VF, and for HTB traffic) on the
> same TL2 level, i.e. it's not two-stage. (Maybe I got it all wrong again?)
> 
> I asked about the command to change the HTB priority, cause the
> parameters aren't easily guessed, but I assume it's also devlink (i.e.
> driver-specific).
> 
Currently, we don't support changing HTB priority since TC_HTB_MODIFY is not yet supported.
The driver implementation is inline with htb tc framework, below are commands we use for setting htb priority

ethtool -K eth0 hw-tc-offload on
tc qdisc replace dev eth0 root handle 1: htb offload
tc class add dev eth0 parent 1: classid 1:1 htb rate 10Gbit prio 2
tc class add dev eth0 parent 1: classid 1:1 htb rate 10Gbit prio 3


> If there were two levels (the upper level chooses who goes first: HTB or
> unclassified, and the lower level sets priorities per PF and VF for unclassified
> traffic), that would be more straightforward to solve: the upper level should
> be controlled by a new HTB parameter, and the lower level is device-specific,
> thus it goes to devlink.

The PF netdev and VFs share the same physical link and share the same TL1 node.
Hardware supports one DWRR group and the rest are strict priority nodes. Driver configures
the priority set by devlink to PF and VF traffic TL2 nodes such that traffic is forwarded
to TL1 using DWRR algorithm.

Now that if we add htb command for unclassified traffic, at any given point in time HTB
rule only applies to a single interface, since we require to set DWRR priority at TL1, 
which applies to both PF/VFs, we feel it's a different use case altogether.
Maxim Mikityanskiy Jan. 24, 2023, 1:02 p.m. UTC | #12
On Tue, Jan 24, 2023 at 11:49:48AM +0000, Hariprasad Kelam wrote:
> 
> 
> > On Mon, Jan 23, 2023 at 02:45:48PM -0800, Jakub Kicinski wrote:
> > > On Mon, 23 Jan 2023 22:05:58 +0200 Maxim Mikityanskiy wrote:
> > > > OK, I seem to get it now, thanks for the explanation!
> > > >
> > > > How do you set the priority for HTB, though? You mentioned this
> > > > command to set priority of unclassified traffic:
> > > >
> > > > devlink -p dev param set pci/0002:04:00.0 name tl1_rr_prio value 6 \
> > > > cmode runtime
> > > >
> > > > But what is the command to change priority for HTB?
> > > >
> > > > What bothers me about using devlink to configure HTB priority is:
> > > >
> > > > 1. Software HTB implementation doesn't have this functionality, and
> > > > it always prioritizes unclassified traffic. As far as I understand,
> > > > the rule for tc stuff is that all features must have a reference
> > > > implementation in software.
> > > >
> > > > 2. Adding a flag (prefer unclassified vs prefer classified) to HTB
> > > > itself may be not straightforward, because your devlink command has
> > > > a second purpose of setting priorities between PFs/VFs, and it may
> > > > conflict with the HTB flag.
> > >
> > > If there is a two-stage hierarchy the lower level should be controlled
> > > by devlink-rate, no?
> > 
> > From the last picture by Hariprasad, I understood that the user sets all
> > priorities (for unclassified traffic per PF and VF, and for HTB traffic) on the
> > same TL2 level, i.e. it's not two-stage. (Maybe I got it all wrong again?)
> > 
> > I asked about the command to change the HTB priority, cause the
> > parameters aren't easily guessed, but I assume it's also devlink (i.e.
> > driver-specific).
> > 
> Currently, we don't support changing HTB priority since TC_HTB_MODIFY is not yet supported.
> The driver implementation is inline with htb tc framework, below are commands we use for setting htb priority
> 
> ethtool -K eth0 hw-tc-offload on
> tc qdisc replace dev eth0 root handle 1: htb offload
> tc class add dev eth0 parent 1: classid 1:1 htb rate 10Gbit prio 2
> tc class add dev eth0 parent 1: classid 1:1 htb rate 10Gbit prio 3

I thought there was a concept of a priority of the whole HTB tree in
your implementation...

So, if I run these commands:

devlink -p dev param set pci/0002:04:00.0 name tl1_rr_prio value 2 cmode runtime
tc class add dev eth0 parent 1: classid 1:1 htb rate 10Gbit prio 1
tc class add dev eth0 parent 1: classid 1:2 htb rate 10Gbit prio 3

Will it prioritize class 1:1 over unclassified traffic, and unclassified
traffic over class 1:2?

> > If there were two levels (the upper level chooses who goes first: HTB or
> > unclassified, and the lower level sets priorities per PF and VF for unclassified
> > traffic), that would be more straightforward to solve: the upper level should
> > be controlled by a new HTB parameter, and the lower level is device-specific,
> > thus it goes to devlink.
> 
> The PF netdev and VFs share the same physical link and share the same TL1 node.
> Hardware supports one DWRR group and the rest are strict priority nodes. Driver configures
> the priority set by devlink to PF and VF traffic TL2 nodes such that traffic is forwarded
> to TL1 using DWRR algorithm.
> 
> Now that if we add htb command for unclassified traffic, at any given point in time HTB
> rule only applies to a single interface, since we require to set DWRR priority at TL1, 
> which applies to both PF/VFs, we feel it's a different use case altogether.

Thanks, with the example above your explanation makes sense to me now.

So, basically, in your implementation, entities prioritized by hardware
are: each HTB class, each VF and PF; you want to keep user-assigned
priorities for HTB classes, you want to let the user assign a priority
for unclassified traffic, but the latter must be equal for all VFs and
PF (for DWRR to work), correct? And that devlink command is only useful
in the HTB scenario, i.e. it doesn't matter what tl1_rr_prio you set if
HTB is not used, right?

What I don't like in the current implementation is that it adds a
feature to HTB, bypassing HTB (also not providing a software equivalent
of the feature). I would expect the priority of unclassified traffic to
be configured with tc commands that set up HTB (by the way, HTB has a
"default" option to choose a class for unclassified traffic, and a
priority can be set for this class - this functionality can be leveraged
for this purpose, or a new option can be added, whatever looks more
appropriate). On the other hand, I understand your hardware limitation
requiring to have the same priority for all VFs and PF on the same port.

It's hard to suggest something good here, actually... An obvious thought
is to let the first netdev that configures HTB choose the priority for
unclassified traffic, and block attempts from other netdevs to set a
different one, but this approach also has obvious drawbacks: PF has no
special powers here, and it can't set the desired priority if PF itself
doesn't use HTB (or doesn't configure it first, before VFs).

Another idea of mine is to keep the devlink command for enforcement
purpose, and make the behavior as follows:

1. The user will pick a priority for unclassified traffic when attaching
HTB.

2. If tl1_rr_prio was set with devlink, block attempts to attach HTB
with a different default priority.

3. If tl1_rr_prio wasn't set or was reset, allow attaching HTB to PF
with any default priority. 

This way, VFs can't attach HTB with arbitrary priorities, only with the
one that the admin has enforced using devlink. At the same time, if VFs
aren't used, no extra steps are needed to just use HTB on a PF. On the
other hand, it adds some complexity and may sound controversial to
someone. Thoughts?
Jakub Kicinski Jan. 25, 2023, 1:26 a.m. UTC | #13
On Tue, 24 Jan 2023 15:02:29 +0200 Maxim Mikityanskiy wrote:
> So, basically, in your implementation, entities prioritized by hardware
> are: each HTB class, each VF and PF; you want to keep user-assigned
> priorities for HTB classes, you want to let the user assign a priority
> for unclassified traffic, but the latter must be equal for all VFs and
> PF (for DWRR to work), correct? And that devlink command is only useful
> in the HTB scenario, i.e. it doesn't matter what tl1_rr_prio you set if
> HTB is not used, right?

To me HW this limited does not sound like a match for HTB offload.
Hariprasad Kelam Jan. 26, 2023, 5:16 p.m. UTC | #14
> On Tue, Jan 24, 2023 at 11:49:48AM +0000, Hariprasad Kelam wrote:
> >
> >
> > > On Mon, Jan 23, 2023 at 02:45:48PM -0800, Jakub Kicinski wrote:
> > > > On Mon, 23 Jan 2023 22:05:58 +0200 Maxim Mikityanskiy wrote:
> > > > > OK, I seem to get it now, thanks for the explanation!
> > > > >
> > > > > How do you set the priority for HTB, though? You mentioned this
> > > > > command to set priority of unclassified traffic:
> > > > >
> > > > > devlink -p dev param set pci/0002:04:00.0 name tl1_rr_prio value
> > > > > 6 \ cmode runtime
> > > > >
> > > > > But what is the command to change priority for HTB?
> > > > >
> > > > > What bothers me about using devlink to configure HTB priority is:
> > > > >
> > > > > 1. Software HTB implementation doesn't have this functionality,
> > > > > and it always prioritizes unclassified traffic. As far as I
> > > > > understand, the rule for tc stuff is that all features must have
> > > > > a reference implementation in software.
> > > > >
> > > > > 2. Adding a flag (prefer unclassified vs prefer classified) to
> > > > > HTB itself may be not straightforward, because your devlink
> > > > > command has a second purpose of setting priorities between
> > > > > PFs/VFs, and it may conflict with the HTB flag.
> > > >
> > > > If there is a two-stage hierarchy the lower level should be
> > > > controlled by devlink-rate, no?
> > >
> > > From the last picture by Hariprasad, I understood that the user sets
> > > all priorities (for unclassified traffic per PF and VF, and for HTB
> > > traffic) on the same TL2 level, i.e. it's not two-stage. (Maybe I
> > > got it all wrong again?)
> > >
> > > I asked about the command to change the HTB priority, cause the
> > > parameters aren't easily guessed, but I assume it's also devlink (i.e.
> > > driver-specific).
> > >
> > Currently, we don't support changing HTB priority since TC_HTB_MODIFY is
> not yet supported.
> > The driver implementation is inline with htb tc framework, below are
> > commands we use for setting htb priority
> >
> > ethtool -K eth0 hw-tc-offload on
> > tc qdisc replace dev eth0 root handle 1: htb offload tc class add dev
> > eth0 parent 1: classid 1:1 htb rate 10Gbit prio 2 tc class add dev
> > eth0 parent 1: classid 1:1 htb rate 10Gbit prio 3
> 
> I thought there was a concept of a priority of the whole HTB tree in your
> implementation...
> 
> So, if I run these commands:
> 
> devlink -p dev param set pci/0002:04:00.0 name tl1_rr_prio value 2 cmode
> runtime tc class add dev eth0 parent 1: classid 1:1 htb rate 10Gbit prio 1 tc
> class add dev eth0 parent 1: classid 1:2 htb rate 10Gbit prio 3
> 
> Will it prioritize class 1:1 over unclassified traffic, and unclassified traffic over
> class 1:2?
> 
> > > If there were two levels (the upper level chooses who goes first:
> > > HTB or unclassified, and the lower level sets priorities per PF and
> > > VF for unclassified traffic), that would be more straightforward to
> > > solve: the upper level should be controlled by a new HTB parameter,
> > > and the lower level is device-specific, thus it goes to devlink.
> >
> > The PF netdev and VFs share the same physical link and share the same TL1
> node.
> > Hardware supports one DWRR group and the rest are strict priority
> > nodes. Driver configures the priority set by devlink to PF and VF
> > traffic TL2 nodes such that traffic is forwarded to TL1 using DWRR algorithm.
> >
> > Now that if we add htb command for unclassified traffic, at any given
> > point in time HTB rule only applies to a single interface, since we
> > require to set DWRR priority at TL1, which applies to both PF/VFs, we feel
> it's a different use case altogether.
> 
> Thanks, with the example above your explanation makes sense to me now.
> 
> So, basically, in your implementation, entities prioritized by hardware
> are: each HTB class, each VF and PF; you want to keep user-assigned
> priorities for HTB classes, you want to let the user assign a priority for
> unclassified traffic, but the latter must be equal for all VFs and PF (for DWRR
> to work), correct? And that devlink command is only useful in the HTB
> scenario, i.e. it doesn't matter what tl1_rr_prio you set if HTB is not used,
> right?
> 
> What I don't like in the current implementation is that it adds a feature to
> HTB, bypassing HTB (also not providing a software equivalent of the feature).
> I would expect the priority of unclassified traffic to be configured with tc
> commands that set up HTB (by the way, HTB has a "default" option to choose
> a class for unclassified traffic, and a priority can be set for this class - this
> functionality can be leveraged for this purpose, or a new option can be
> added, whatever looks more appropriate). On the other hand, I understand
> your hardware limitation requiring to have the same priority for all VFs and PF
> on the same port.

Thanks for pointing out the "default" option. we will explore this option if we can fit this into our current implementation or not and get back.
Meanwhile,   there are comments on other patches in this series and as we are in line with htb offload without the devlink option we will submit
 a new version of patches addressing the comments excluding this patch.

Thanks,
Hariprasad k




> 
> It's hard to suggest something good here, actually... An obvious thought is to
> let the first netdev that configures HTB choose the priority for unclassified
> traffic, and block attempts from other netdevs to set a different one, but this
> approach also has obvious drawbacks: PF has no special powers here, and it
> can't set the desired priority if PF itself doesn't use HTB (or doesn't configure
> it first, before VFs).
> 
> Another idea of mine is to keep the devlink command for enforcement
> purpose, and make the behavior as follows:
> 
> 1. The user will pick a priority for unclassified traffic when attaching HTB.
> 
> 2. If tl1_rr_prio was set with devlink, block attempts to attach HTB with a
> different default priority.
> 
> 3. If tl1_rr_prio wasn't set or was reset, allow attaching HTB to PF with any
> default priority.
> 
> This way, VFs can't attach HTB with arbitrary priorities, only with the one that
> the admin has enforced using devlink. At the same time, if VFs aren't used,
> no extra steps are needed to just use HTB on a PF. On the other hand, it adds
> some complexity and may sound controversial to someone. Thoughts?
diff mbox series

Patch

diff --git a/drivers/net/ethernet/marvell/octeontx2/af/common.h b/drivers/net/ethernet/marvell/octeontx2/af/common.h
index 8931864ee110..f5bf719a6ccf 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/common.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/common.h
@@ -142,7 +142,7 @@  enum nix_scheduler {
 
 #define TXSCH_RR_QTM_MAX		((1 << 24) - 1)
 #define TXSCH_TL1_DFLT_RR_QTM		TXSCH_RR_QTM_MAX
-#define TXSCH_TL1_DFLT_RR_PRIO		(0x1ull)
+#define TXSCH_TL1_DFLT_RR_PRIO		(0x7ull)
 #define CN10K_MAX_DWRR_WEIGHT          16384 /* Weight is 14bit on CN10K */
 
 /* Min/Max packet sizes, excluding FCS */
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/mbox.h b/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
index 94e743b5b545..6f0199766d5a 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/mbox.h
@@ -84,7 +84,7 @@  struct mbox_msghdr {
 #define OTX2_MBOX_REQ_SIG (0xdead)
 #define OTX2_MBOX_RSP_SIG (0xbeef)
 	u16 sig;         /* Signature, for validating corrupted msgs */
-#define OTX2_MBOX_VERSION (0x000a)
+#define OTX2_MBOX_VERSION (0x000b)
 	u16 ver;         /* Version of msg's structure for this ID */
 	u16 next_msgoff; /* Offset of next msg within mailbox region */
 	int rc;          /* Msg process'ed response code */
@@ -299,6 +299,7 @@  M(NIX_BANDPROF_GET_HWINFO, 0x801f, nix_bandprof_get_hwinfo, msg_req,		\
 				nix_bandprof_get_hwinfo_rsp)		    \
 M(NIX_READ_INLINE_IPSEC_CFG, 0x8023, nix_read_inline_ipsec_cfg,		\
 				msg_req, nix_inline_ipsec_cfg)		\
+M(NIX_TL1_RR_PRIO,	0x8025, nix_tl1_rr_prio, nix_tl1_rr_prio_req, msg_rsp) \
 /* MCS mbox IDs (range 0xA000 - 0xBFFF) */					\
 M(MCS_ALLOC_RESOURCES,	0xa000, mcs_alloc_resources, mcs_alloc_rsrc_req,	\
 				mcs_alloc_rsrc_rsp)				\
@@ -825,6 +826,7 @@  enum nix_af_status {
 	NIX_AF_ERR_CQ_CTX_WRITE_ERR  = -429,
 	NIX_AF_ERR_AQ_CTX_RETRY_WRITE  = -430,
 	NIX_AF_ERR_LINK_CREDITS  = -431,
+	NIX_AF_ERR_TL1_RR_PRIO_PERM_DENIED  = -433,
 };
 
 /* For NIX RX vtag action  */
@@ -1269,6 +1271,11 @@  struct nix_bandprof_get_hwinfo_rsp {
 	u32 policer_timeunit;
 };
 
+struct nix_tl1_rr_prio_req {
+	struct mbox_msghdr hdr;
+	u8 tl1_rr_prio;
+};
+
 /* NPC mbox message structs */
 
 #define NPC_MCAM_ENTRY_INVALID	0xFFFF
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu.c b/drivers/net/ethernet/marvell/octeontx2/af/rvu.c
index 3f5e09b77d4b..7d2230359bf0 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu.c
@@ -808,6 +808,18 @@  static void rvu_setup_pfvf_macaddress(struct rvu *rvu)
 	}
 }
 
+static void rvu_setup_pfvf_aggr_lvl_rr_prio(struct rvu *rvu)
+{
+	struct rvu_hwinfo *hw = rvu->hw;
+	struct rvu_pfvf *pfvf;
+	int pf;
+
+	for (pf = 0; pf < hw->total_pfs; pf++) {
+		pfvf = &rvu->pf[pf];
+		pfvf->tl1_rr_prio = TXSCH_TL1_DFLT_RR_PRIO;
+	}
+}
+
 static int rvu_fwdata_init(struct rvu *rvu)
 {
 	u64 fwdbase;
@@ -1136,6 +1148,9 @@  static int rvu_setup_hw_resources(struct rvu *rvu)
 	/* Assign MACs for CGX mapped functions */
 	rvu_setup_pfvf_macaddress(rvu);
 
+	/* Assign aggr level RR Priority */
+	rvu_setup_pfvf_aggr_lvl_rr_prio(rvu);
+
 	err = rvu_npa_init(rvu);
 	if (err) {
 		dev_err(rvu->dev, "%s: Failed to initialize npa\n", __func__);
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu.h b/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
index 7f0a64731c67..eec890f98803 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
@@ -254,6 +254,7 @@  struct rvu_pfvf {
 	u64     lmt_map_ent_w1; /* Preseving the word1 of lmtst map table entry*/
 	unsigned long flags;
 	struct  sdp_node_info *sdp_info;
+	u8	tl1_rr_prio; /* RR PRIORITY set by PF */
 };
 
 enum rvu_pfvf_flags {
diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c b/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c
index c11859999074..ef9f62aa6e2b 100644
--- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c
+++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_nix.c
@@ -2001,6 +2001,7 @@  int rvu_mbox_handler_nix_txsch_alloc(struct rvu *rvu,
 {
 	struct rvu_hwinfo *hw = rvu->hw;
 	u16 pcifunc = req->hdr.pcifunc;
+	struct rvu_pfvf *parent_pf;
 	int link, blkaddr, rc = 0;
 	int lvl, idx, start, end;
 	struct nix_txsch *txsch;
@@ -2017,6 +2018,8 @@  int rvu_mbox_handler_nix_txsch_alloc(struct rvu *rvu,
 	if (!nix_hw)
 		return NIX_AF_ERR_INVALID_NIXBLK;
 
+	parent_pf = &rvu->pf[rvu_get_pf(pcifunc)];
+
 	mutex_lock(&rvu->rsrc_lock);
 
 	/* Check if request is valid as per HW capabilities
@@ -2076,7 +2079,7 @@  int rvu_mbox_handler_nix_txsch_alloc(struct rvu *rvu,
 	}
 
 	rsp->aggr_level = hw->cap.nix_tx_aggr_lvl;
-	rsp->aggr_lvl_rr_prio = TXSCH_TL1_DFLT_RR_PRIO;
+	rsp->aggr_lvl_rr_prio = parent_pf->tl1_rr_prio;
 	rsp->link_cfg_lvl = rvu_read64(rvu, blkaddr,
 				       NIX_AF_PSE_CHANNEL_LEVEL) & 0x01 ?
 				       NIX_TXSCH_LVL_TL3 : NIX_TXSCH_LVL_TL2;
@@ -2375,7 +2378,9 @@  static bool is_txschq_shaping_valid(struct rvu_hwinfo *hw, int lvl, u64 reg)
 static void nix_tl1_default_cfg(struct rvu *rvu, struct nix_hw *nix_hw,
 				u16 pcifunc, int blkaddr)
 {
+	struct rvu_pfvf *parent_pf = &rvu->pf[rvu_get_pf(pcifunc)];
 	u32 *pfvf_map;
+
 	int schq;
 
 	schq = nix_get_tx_link(rvu, pcifunc);
@@ -2384,7 +2389,7 @@  static void nix_tl1_default_cfg(struct rvu *rvu, struct nix_hw *nix_hw,
 	if (TXSCH_MAP_FLAGS(pfvf_map[schq]) & NIX_TXSCHQ_CFG_DONE)
 		return;
 	rvu_write64(rvu, blkaddr, NIX_AF_TL1X_TOPOLOGY(schq),
-		    (TXSCH_TL1_DFLT_RR_PRIO << 1));
+		    (parent_pf->tl1_rr_prio << 1));
 
 	/* On OcteonTx2 the config was in bytes and newer silcons
 	 * it's changed to weight.
@@ -5548,3 +5553,33 @@  int rvu_mbox_handler_nix_bandprof_get_hwinfo(struct rvu *rvu, struct msg_req *re
 
 	return 0;
 }
+
+int rvu_mbox_handler_nix_tl1_rr_prio(struct rvu *rvu,
+				     struct nix_tl1_rr_prio_req *req,
+				     struct msg_rsp *rsp)
+{
+	u16 pcifunc = req->hdr.pcifunc;
+	int blkaddr, nixlf, schq, err;
+	struct rvu_pfvf *pfvf;
+	u16 regval;
+
+	err = nix_get_nixlf(rvu, pcifunc, &nixlf, &blkaddr);
+	if (err)
+		return err;
+
+	pfvf = rvu_get_pfvf(rvu, pcifunc);
+	/* Only PF is allowed */
+	if (is_vf(pcifunc))
+		return NIX_AF_ERR_TL1_RR_PRIO_PERM_DENIED;
+
+	pfvf->tl1_rr_prio = req->tl1_rr_prio;
+
+	/* update TL1 topology */
+	schq = nix_get_tx_link(rvu, pcifunc);
+	regval = rvu_read64(rvu, blkaddr, NIX_AF_TL1X_TOPOLOGY(schq));
+	regval &= ~GENMASK_ULL(4, 1);
+	regval |= pfvf->tl1_rr_prio << 1;
+	rvu_write64(rvu, blkaddr, NIX_AF_TL1X_TOPOLOGY(schq), regval);
+
+	return 0;
+}
diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
index a4c18565f9f4..96132cfe2141 100644
--- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
+++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
@@ -591,7 +591,7 @@  int otx2_txschq_config(struct otx2_nic *pfvf, int lvl, int prio, bool txschq_for
 	u16 (*schq_list)[MAX_TXSCHQ_PER_FUNC];
 	struct otx2_hw *hw = &pfvf->hw;
 	struct nix_txschq_config *req;
-	u64 schq, parent;
+	u16 schq, parent;
 	u64 dwrr_val;
 
 	dwrr_val = mtu_to_dwrr_weight(pfvf, pfvf->tx_max_pktlen);
@@ -654,7 +654,7 @@  int otx2_txschq_config(struct otx2_nic *pfvf, int lvl, int prio, bool txschq_for
 
 		req->num_regs++;
 		req->reg[1] = NIX_AF_TL2X_SCHEDULE(schq);
-		req->regval[1] = TXSCH_TL1_DFLT_RR_PRIO << 24 | dwrr_val;
+		req->regval[1] = hw->txschq_aggr_lvl_rr_prio << 24 | dwrr_val;
 
 		if (lvl == hw->txschq_link_cfg_lvl) {
 			req->num_regs++;
@@ -678,7 +678,7 @@  int otx2_txschq_config(struct otx2_nic *pfvf, int lvl, int prio, bool txschq_for
 
 		req->num_regs++;
 		req->reg[1] = NIX_AF_TL1X_TOPOLOGY(schq);
-		req->regval[1] = (TXSCH_TL1_DFLT_RR_PRIO << 1);
+		req->regval[1] = hw->txschq_aggr_lvl_rr_prio << 1;
 
 		req->num_regs++;
 		req->reg[2] = NIX_AF_TL1X_CIR(schq);
@@ -742,6 +742,9 @@  int otx2_txsch_alloc(struct otx2_nic *pfvf)
 			pfvf->hw.txschq_list[lvl][schq] =
 				rsp->schq_list[lvl][schq];
 
+	pfvf->hw.txschq_link_cfg_lvl     = rsp->link_cfg_lvl;
+	pfvf->hw.txschq_aggr_lvl_rr_prio = rsp->aggr_lvl_rr_prio;
+
 	return 0;
 }
 
diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.h b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.h
index b142cc685a28..3e09f52f2dc6 100644
--- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.h
+++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.h
@@ -208,6 +208,7 @@  struct otx2_hw {
 
 	/* NIX */
 	u8			txschq_link_cfg_lvl;
+	u8			txschq_aggr_lvl_rr_prio;
 	u16			txschq_list[NIX_TXSCH_LVL_CNT][MAX_TXSCHQ_PER_FUNC];
 	u16			matchall_ipolicer;
 	u32			dwrr_mtu;
diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_devlink.c b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_devlink.c
index 63ef7c41d18d..b6bb6945f08a 100644
--- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_devlink.c
+++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_devlink.c
@@ -64,9 +64,82 @@  static int otx2_dl_mcam_count_get(struct devlink *devlink, u32 id,
 	return 0;
 }
 
+static int otx2_dl_tl1_rr_prio_validate(struct devlink *devlink, u32 id,
+					union devlink_param_value val,
+					struct netlink_ext_ack *extack)
+{
+	struct otx2_devlink *otx2_dl = devlink_priv(devlink);
+	struct otx2_nic *pf = otx2_dl->pfvf;
+
+	if (is_otx2_vf(pf->pcifunc)) {
+		NL_SET_ERR_MSG_MOD(extack,
+				   "TL1RR PRIORITY setting not allowed on VFs");
+		return -EOPNOTSUPP;
+	}
+
+	if (pci_num_vf(pf->pdev)) {
+		NL_SET_ERR_MSG_MOD(extack,
+				   "TL1RR PRIORITY setting not allowed as VFs are already attached");
+		return -EOPNOTSUPP;
+	}
+
+	if (val.vu8 > 7) {
+		NL_SET_ERR_MSG_MOD(extack,
+				   "Valid priority range 0 - 7");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int otx2_dl_tl1_rr_prio_get(struct devlink *devlink, u32 id,
+				   struct devlink_param_gset_ctx *ctx)
+{
+	struct otx2_devlink *otx2_dl = devlink_priv(devlink);
+	struct otx2_nic *pfvf = otx2_dl->pfvf;
+
+	ctx->val.vu8 = pfvf->hw.txschq_aggr_lvl_rr_prio;
+
+	return 0;
+}
+
+static int otx2_dl_tl1_rr_prio_set(struct devlink *devlink, u32 id,
+				   struct devlink_param_gset_ctx *ctx)
+{
+	struct otx2_devlink *otx2_dl = devlink_priv(devlink);
+	struct otx2_nic *pfvf = otx2_dl->pfvf;
+	struct nix_tl1_rr_prio_req *req;
+	bool if_up;
+	int err;
+
+	if_up = netif_running(pfvf->netdev);
+
+	/* send mailbox to AF */
+	mutex_lock(&pfvf->mbox.lock);
+
+	req = otx2_mbox_alloc_msg_nix_tl1_rr_prio(&pfvf->mbox);
+	if (!req) {
+		mutex_unlock(&pfvf->mbox.lock);
+		return -ENOMEM;
+	}
+
+	req->tl1_rr_prio = ctx->val.vu8;
+	err = otx2_sync_mbox_msg(&pfvf->mbox);
+	mutex_unlock(&pfvf->mbox.lock);
+
+	/* Reconfigure TL1/TL2 DWRR PRIORITY */
+	if (!err && if_up) {
+		otx2_stop(pfvf->netdev);
+		otx2_open(pfvf->netdev);
+	}
+
+	return err;
+}
+
 enum otx2_dl_param_id {
 	OTX2_DEVLINK_PARAM_ID_BASE = DEVLINK_PARAM_GENERIC_ID_MAX,
 	OTX2_DEVLINK_PARAM_ID_MCAM_COUNT,
+	OTX2_DEVLINK_PARAM_ID_TL1_RR_PRIO,
 };
 
 static const struct devlink_param otx2_dl_params[] = {
@@ -75,6 +148,11 @@  static const struct devlink_param otx2_dl_params[] = {
 			     BIT(DEVLINK_PARAM_CMODE_RUNTIME),
 			     otx2_dl_mcam_count_get, otx2_dl_mcam_count_set,
 			     otx2_dl_mcam_count_validate),
+	DEVLINK_PARAM_DRIVER(OTX2_DEVLINK_PARAM_ID_TL1_RR_PRIO,
+			     "tl1_rr_prio", DEVLINK_PARAM_TYPE_U8,
+			     BIT(DEVLINK_PARAM_CMODE_RUNTIME),
+			     otx2_dl_tl1_rr_prio_get, otx2_dl_tl1_rr_prio_set,
+			     otx2_dl_tl1_rr_prio_validate),
 };
 
 static const struct devlink_ops otx2_devlink_ops = {