Message ID | 20230724151849.323497-1-jhs@mojatatu.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net,1/1] net: sched: cls_u32: Fix match key mis-addressing | expand |
On Mon, 24 Jul 2023 at 08:19, Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > + /* The tableid from handle and tableid from htid must match */ > if (TC_U32_HTID(handle) && TC_U32_HTID(handle ^ htid)) { Well, I guess the comment at least talks about matching. I still do think that most people aren't going to read "Oh, TC_U32_HTID(handle ^ htid) being non-zero means that they they differ in the hash table bits". because the whole trick depends on bit op details, and depends on the fact that TC_U32_HTID() is purely a bit masking operation. But whatever. I would also like to see some explanation of this pattern handle = htid | TC_U32_NODE(handle); and why that "binary or" makes sense. Are the node bits in 'htid' guaranteed to be zero? Because if 'htid' can have node bits set, what's the logical reason for or'ing things together? And why is that variable called 'htid', when clearly it also has the bucket ID, and the comment even says we have a valid bucket id? So I do still find this code to be explicitly written to be confusing. It's literally using variable names *designed* to not make sense, and have other meanings. Hmm? I'm not hating the patch, but when I look at that code I just go "this is confusing". And I don't think it's because I'm confused. I think it's the code. Linus
On Mon, Jul 24, 2023 at 1:49 PM Linus Torvalds <torvalds@linuxfoundation.org> wrote: > > On Mon, 24 Jul 2023 at 08:19, Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > + /* The tableid from handle and tableid from htid must match */ > > if (TC_U32_HTID(handle) && TC_U32_HTID(handle ^ htid)) { > > Well, I guess the comment at least talks about matching. > > I still do think that most people aren't going to read > > "Oh, TC_U32_HTID(handle ^ htid) being non-zero means that they > they differ in the hash table bits". > > because the whole trick depends on bit op details, and depends on the > fact that TC_U32_HTID() is purely a bit masking operation. > > But whatever. I will improve the text leading to this to state that the goal of building the new handle is to come up with an "address" or "path" for the table entry. Then the table id mismatch is understood from that context as part of the rules for building the table entry "address". > I would also like to see some explanation of this pattern > > handle = htid | TC_U32_NODE(handle); > > and why that "binary or" makes sense. Are the node bits in 'htid' > guaranteed to be zero? Per existing user space tools, yes - they are guaranteed to be zero (per my gitchelogy of both kernel + iproute2 since inception this has been the behavior); > Because if 'htid' can have node bits set, what's the logical reason > for or'ing things together? > Hrm. I am not sure if this is what you are getting to: but you caught a different bug there. While none of existing user space tooling(I know of) sends htid with nodeid set (i.e it is always zero) syzkaller or someone else's poc could send a non-zero nodeid in the htid. It is not catastrophic but user expectation will be skewed - both the handle nodeid + htid node id would be used to derive the nodeid. It will fail from the idr allocation if the nodeid is in use. It feels like a second patch though - rejecting htid with a specified nodeid (that way expected user space behavior is maintained). > And why is that variable called 'htid', when clearly it also has the > bucket ID, and the comment even says we have a valid bucket id? > The name could be changed. I am not sure what a good name is but the semantics are: it carries a minimum of hash table id and at most hash table id + bucket id. This sounds lame: ht_maybe_plus_bktid ? or make them two separate variables htid and bktid. > So I do still find this code to be explicitly written to be confusing. > It's literally using variable names *designed* to not make sense, and > have other meanings. > > Hmm? > > I'm not hating the patch, but when I look at that code I just go "this > is confusing". And I don't think it's because I'm confused. I think > it's the code. I dont think it was intended that way - or maybe i have been taking it for granted given on/off staring at the code for different reasons over the years. If it was to be written today we should certainly have made the 3 ids independent and really should have allowed only one (instead of two ways) to specify the different object ids from user space. It is at least 2 decades old, in those days maybe saving a bit of space was a good thing. cheers, jamal > Linus
On Tue, 25 Jul 2023 at 10:04, Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > I would also like to see some explanation of this pattern > > > > handle = htid | TC_U32_NODE(handle); > > > > and why that "binary or" makes sense. Are the node bits in 'htid' > > guaranteed to be zero? > > Per existing user space tools, yes - they are guaranteed to be zero > (per my gitchelogy of both kernel + iproute2 since inception this has > been the behavior); Ok, if the htid bits are zero, it's all good. It's fine to use a binary 'or' as a way to 'insert bits in the word', but only if the old bits were zero, and that wasn't obvious to me. The *normal* pattern would be to explicitly mask off the bits you want to use, so that you don't get some random mixing of bits of the fields. Of course, that's a bit inconvenient here, since you don't have the obvious accessors. And while using bitfields would make the source code look fine, bitfields are *horrible* for any ABI, since they have very weakly defined semantics (ie litte-endian vs big-endian, and the *bit* order is not at all guaranteed to match the *byte* order). > > Because if 'htid' can have node bits set, what's the logical reason > > for or'ing things together? > > Hrm. I am not sure if this is what you are getting to: but you caught > a different bug there. So what I'm getting at was just that *if* you can have mixing of bits of that NODE part of the variable, I think the end result doesn't end up being very sensible. It's not that 'binary or' isn't a valid operation. But it normally isn't all that sane to randomly just mix bits in these kinds of things. What would it mean to mix node bits from an old value with the user-supplied one? So that pattern just looked odd to me. Linus
diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c index 5abf31e432ca..e0eabbcce9d4 100644 --- a/net/sched/cls_u32.c +++ b/net/sched/cls_u32.c @@ -1024,18 +1024,54 @@ static int u32_change(struct net *net, struct sk_buff *in_skb, return -EINVAL; } + /* If we got this far _we have the table(ht)_ with its matching htid. + * Note that earlier the ht selection is a result of either a) the user + * providing the htid specified via TCA_U32_HASH attribute or b) when + * no such attribute is passed then a default ht, typically root at + * 0x80000000, is chosen. + * The passed htid will also hold the bucketid. 0 is fine. For example + * of the root ht, 0x80000000 is indicating bucketid 0, whereas a user + * passed htid may have 0x60001000 indicating hash bucketid 1. + * + * We may also have a handle, if the user passed one. The handle carries + * annotation of htid(12b):bucketid(8b):node/entryid(12b). + * The value of bucketid on the handle is ignored even if one was passed; + * rather the value on the htid is always assumed to be the bucketid. + */ if (handle) { + /* The tableid from handle and tableid from htid must match */ if (TC_U32_HTID(handle) && TC_U32_HTID(handle ^ htid)) { NL_SET_ERR_MSG_MOD(extack, "Handle specified hash table address mismatch"); return -EINVAL; } - handle = htid | TC_U32_NODE(handle); - err = idr_alloc_u32(&ht->handle_idr, NULL, &handle, handle, - GFP_KERNEL); - if (err) - return err; - } else + /* Ok, so far we have a valid htid(12b):bucketid(8b) but we + * need to finalize our handle to point to the entry as well + * (with a proper node/entryid(12b)). Nodeid _cannot be 0_ for + * entries since it is reserved only for tables(see earlier + * code which processes TC_U32_DIVISOR attribute). + * if the handle did not specify a non-zero nodeid (example + * passed 0x60000000) then pick a new one from the pool of IDs + * this hash table has been allocating from. + * If OTOH it is specified (i.e for example the user passed a + * handle such as 0x60000123), then we use it generate our final + * handle which is used to uniquely identify the match entry. + */ + if (!TC_U32_NODE(handle)) { + handle = gen_new_kid(ht, htid); + } else { + handle = htid | TC_U32_NODE(handle); + err = idr_alloc_u32(&ht->handle_idr, NULL, &handle, + handle, GFP_KERNEL); + if (err) + return err; + } + } else { + /* we dont have a handle lets just generate one based on htid + * recall that htid has both the table and bucket ids already + * encoded, only missing piece is the nodeid. + */ handle = gen_new_kid(ht, htid); + } if (tb[TCA_U32_SEL] == NULL) { NL_SET_ERR_MSG_MOD(extack, "Selector not specified");
A match entry is uniquely identified with an "address" or "path" in the form of: hashtable ID(12b):bucketid(8b):nodeid(12b). A table on which the entry is inserted assumes the address with nodeid of zero. When creating table match entries all of hash table id, bucket id and node (match entry id) are needed to be either specified by the user or via reasonable in-kernel defaults. The in-kernel default for a table id is 0x800(omnipresent root table); for bucketid it is 0x0. Prior to this fix there was none for a nodeid i.e. the code assumed that the user passed the correct nodeid and if the user passes a nodeid of 0 (as Mingi Cho did) then that is what was assumed. But nodeid of 0 is reserved for identifying the table. This is not a problem until we dump. The dump code notices that the nodeid is zero and assumes it is referencing a table and therefore table struct tc_u_hnode instead of what was created i.e match entry struct tc_u_knode. Ming does an equivalent of: tc filter add dev dummy0 parent 10: prio 1 handle 0x1000 \ protocol ip u32 match ip src 10.0.0.1/32 classid 10:1 action ok Essentially specifying a table id 0, bucketid 1 and nodeid of zero Tableid 0 is remapped to the default of 0x800. Bucketid 1 is ignored and defaults to 0x00. Nodeid was assumed to be what Ming passed - 0x000 dumping before fix shows: ~$ tc filter ls dev dummy0 parent 10: filter protocol ip pref 1 u32 chain 0 filter protocol ip pref 1 u32 chain 0 fh 800: ht divisor 1 filter protocol ip pref 1 u32 chain 0 fh 800: ht divisor -30591 Note that the last line reports a table instead of a match entry (you can tell this because it says "ht divisor..."). As a result of reporting the wrong data type (misinterpretting of struct tc_u_knode as being struct tc_u_hnode) the divisor is reported with value of -30591. Ming identified this as part of the heap address (physmap_base is 0xffff8880 (-30591 - 1)). The fix is to ensure that when table entry matches are added and no nodeid is specified (i.e nodeid == 0) then we get the next available nodeid from the table's pool. After the fix, this is what the dump shows: $ tc filter ls dev dummy0 parent 10: filter protocol ip pref 1 u32 chain 0 filter protocol ip pref 1 u32 chain 0 fh 800: ht divisor 1 filter protocol ip pref 1 u32 chain 0 fh 800::800 order 2048 key ht 800 bkt 0 flowid 10:1 not_in_hw match 0a000001/ffffffff at 12 action order 1: gact action pass random type none pass val 0 index 1 ref 1 bind 1 Reported-by: Mingi Cho <mgcho.minic@gmail.com> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> --- net/sched/cls_u32.c | 48 +++++++++++++++++++++++++++++++++++++++------ 1 file changed, 42 insertions(+), 6 deletions(-)